acl acl2013 acl2013-80 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
Reference: text
sentIndex sentText sentNum sentScore
1 We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. [sent-10, score-0.681]
2 We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. [sent-11, score-0.567]
3 For example, Figure 1(b) shows the structure of the word “建筑业 (construction and building industry)”, where the characters “建 (construction)” and “筑 (building)” form a coordination, and modify the character “业 (industry)”. [sent-16, score-0.424]
4 Figure 1: Word-based and character-level phrase-structure trees for the sentence “中国建筑业呈现新格局 (China’s architecture industry shows new patterns)”, where “l”, “r”, “c” denote the directions of head characters (see section 2). [sent-23, score-0.483]
5 (constituent) trees, adding recursive structures of characters for words. [sent-28, score-0.403]
6 Using these annotations, we transform CTB-style constituent trees into character-level trees (Figure 1(b)). [sent-30, score-0.49]
7 We build a character-based Chinese parsing model to parse the character-level syntax trees. [sent-34, score-0.349]
8 With regard to the task of parsing itself, an important advantage of the character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm. [sent-37, score-0.953]
9 Luo (2003) exploited this advantage by adding flat word structures without manual annotation to CTB trees, and building a generative character-based parser. [sent-38, score-0.477]
10 Compared to a pipeline system, the advantages of a joint system include reduction of error propagation, and the integration of segmentation, POS tagging and syntax features. [sent-39, score-0.398]
11 With hierarchical structures and head character information, our annotated words are more informative than flat word structures, and hence can bring further improvements to phrase-structure parsing. [sent-40, score-0.717]
12 To analyze word structures in addition to phrase structures, our character-based parser naturally performs word segmentation, POS tagging and parsing jointly. [sent-41, score-1.027]
13 We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information. [sent-43, score-0.711]
14 Our word annotations lead to further improvements to the joint system, especially for phrase-structure parsing accuracy. [sent-50, score-0.46]
15 Our work falls in line with recent work on joint segmentation, POS tagging and parsing (Hatori et al. [sent-51, score-0.675]
16 Compared with related work, our model gives the best published results for joint segmentation and POS tagging, as well as joint phrase-structure parsing on standard CTB5 evaluations. [sent-53, score-0.868]
17 Figure 2 shows the structures of the four words, including “库存 (repertory)” and “考古 (archaeology)”. Figure 3: Character-level word structure of “卧虎藏龙 (crouching tiger hidden dragon)”. [sent-65, score-0.369]
18 They made use of this information to help joint word segmentation and POS tagging. [sent-71, score-0.552]
19 Figure 4(a) illustrates the morphological structures of the words “朋友们 (friends)” and “教育界 (educational world)”, in which the characters “们 (plural)” and “界 (field)” can be treated as suffix morphemes. [sent-73, score-0.44]
20 For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use “b” and “i” to indicate the beginning and non-beginning characters of a word, respectively. [sent-86, score-0.581]
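As a minimal illustration of this labeling scheme (the helper names are ours, not the paper's), the sketch below converts a segmented sentence into per-character “b”/“i” labels and back:

```python
def words_to_bi(words):
    """Label each character "b" (word-initial) or "i" (non-initial)."""
    labels = []
    for word in words:
        labels.append("b")
        labels.extend("i" * (len(word) - 1))
    return labels

def bi_to_words(chars, labels):
    """Recover the segmentation from characters and their b/i labels."""
    words = []
    for ch, lab in zip(chars, labels):
        if lab == "b":
            words.append(ch)       # a "b" label starts a new word
        else:
            words[-1] += ch        # an "i" label extends the last word
    return words

words = ["中国", "建筑业", "呈现", "新", "格局"]
chars = list("".join(words))
assert bi_to_words(chars, words_to_bi(words)) == words
```

The round trip shows that the “b”/“i” labels alone carry the full segmentation.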
21 For example, “制服” means “dominate” when it is tagged as a verb, in which case the head is the left character; the same word means “uniform dress” when tagged as a noun, in which case the head is the right character. [sent-89, score-0.377]
22 Using our annotations, we can extend CTB-style syntax trees (Figure 1(a)) into character-level trees (Figure 1(b)). [sent-92, score-0.41]
23 In particular, we mark the original nodes that represent POS tags in CTB-style trees with “-t”, and insert our word structures as unary subnodes of the “-t” nodes. [sent-93, score-0.601]
24 For the rest of the paper, we refer to the “-t” nodes as full-word nodes, all nodes above full-word nodes as phrase nodes, and all nodes below full-word nodes as subword nodes. [sent-94, score-0.506]
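As a sketch of this transformation, assuming a simple (label, children) tuple encoding of trees (the encoding and function name are ours; the node naming follows the description above):

```python
def extend_to_character_level(tree, word_structures):
    """Replace each POS node over a word leaf with a "-t" full-word node
    whose single child is the annotated character structure of the word."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]                       # a (POS, word) preterminal
        return (label + "-t", [word_structures[word]])
    return (label,
            [extend_to_character_level(c, word_structures) for c in children])

# A two-character word with a right head ("-r"); leaves carry "-b"/"-i":
structures = {"格局": ("NN-r", [("NN-b", ["格"]), ("NN-i", ["局"])])}
print(extend_to_character_level(("NP", [("NN", ["格局"])]), structures))
# ('NP', [('NN-t', [('NN-r', [('NN-b', ['格']), ('NN-i', ['局'])])])])
```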
25 For example, the head characters of words can be populated up to phrase-level nodes, and serve as an additional source of information that is less sparse than head words. [sent-96, score-0.369]
26 In this paper, we build a parser that yields character-level trees from raw character sequences. [sent-97, score-0.474]
27 In addition, we use this parser to study the effects of our annotations on character-based statistical Chinese parsing, showing that they are useful in improving parsing accuracies. [sent-98, score-0.38]
28 3 Character-based Chinese Parsing To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model, which can jointly perform word segmentation, POS tagging and phrase-structure parsing. [sent-99, score-0.723]
29 Trained using annotated word structures, our parser also analyzes the internal structures of Chinese words. [sent-101, score-0.506]
30 Our character-based Chinese parsing model is based on the work of Zhang and Clark (2009), which is a transition-based model for lexicalized constituent parsing. [sent-102, score-0.472]
31 The system can provide binarized CFG trees in Chomsky Normal Form, and they present a reversible conversion procedure to map arbitrary CFG trees into binarized trees. [sent-106, score-0.343]
32 We make two extensions to their work to enable joint segmentation, POS tagging and phrase-structure parsing from the character level. [sent-109, score-0.805]
33 First, we modify the actions of the transition system. (Footnote 1: we use a left-binarization process for flat word structures that contain more than two characters.) [sent-110, score-0.515]
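The left-binarization mentioned in the footnote can be sketched as follows; the (label, children) tuple encoding and the reuse of the parent label on intermediate nodes are our simplifications:

```python
def left_binarize(label, leaves):
    """Turn a flat structure over n > 2 leaves into a left-branching
    binary tree, so every internal node has at most two children."""
    node = (label, [leaves[0], leaves[1]])
    for leaf in leaves[2:]:
        node = (label, [node, leaf])
    return node

# A flat three-character word becomes ((c1 c2) c3):
print(left_binarize("NN", ["卧", "虎", "藏"]))
# ('NN', [('NN', ['卧', '虎']), '藏'])
```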
34 The candidate transition action A at each step is defined as follows: • SHIFT-SEPARATE(t): remove the head character cj from Q, pushing a subword node onto S, assigning S0. [sent-126, score-0.403]
35 Note that the parse tree S0 must correspond to a full-word or a phrase node, and the character cj is the first character of the next word. [sent-128, score-0.444]
36 • SHIFT-APPEND: remove the head character cj from Q, pushing a subword node onto S. [sent-130, score-0.368]
37 cj will eventually be combined with all the subword nodes on top of S to form a word, and thus we must have S0. [sent-131, score-0.324]
38 • REDUCE-SUBWORD(d): pop the top two nodes S0 and S1 off S, pushing a new subword node onto S. [sent-134, score-0.347]
39 (Footnote 3: for the head direction “coordination”, we extract the head character from the left node.) [sent-141, score-0.45]
40 • REDUCE-WORD: pop the top node S0 off S, pushing a full-word node onto S. [sent-143, score-0.344]
41 The argument l denotes the constituent label of S0, and the argument d specifies the lexical head direction of S0, which can be either “left” or “right”. [sent-146, score-0.382]
42 • REDUCE-UNARY(l): pop the top node S0 off S, pushing a unary phrase node onto S. [sent-148, score-0.343]
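The head-character percolation behind the “l”, “r” and “c” directions (with “c” taking the head from the left node, per the footnote above) can be sketched as follows; the function is ours, for illustration only:

```python
def head_char(direction, left_head, right_head):
    """Pick the head character of a combined node from its children's
    heads: "l" = left child, "r" = right child, and "c" (coordination)
    also takes the head from the left node."""
    if direction in ("l", "c"):
        return left_head
    if direction == "r":
        return right_head
    raise ValueError("unknown head direction: " + direction)

# "建筑" coordinates "建" and "筑", so its head is "建";
# "建筑业" is right-headed, so its head is "业".
assert head_char("c", "建", "筑") == "建"
assert head_char("r", head_char("c", "建", "筑"), "业") == "业"
```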
43 First, we split the original SHIFT action into SHIFT-SEPARATE(t) and SHIFT-APPEND, which jointly perform the word segmentation and POS tagging tasks. [sent-152, score-0.74]
44 Second, we add an extra REDUCE-SUBWORD(d) operation, which is used for parsing the inner structures of words. [sent-153, score-0.483]
45 Third, we add REDUCE-WORD, which applies a unary rule to mark a completed subword node as a full-word node. [sent-154, score-0.379]
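To make this action inventory concrete, here is a schematic, unscored executor of the actions above. This is a sketch only: the node encoding, label bookkeeping and method names are ours, and the real system additionally includes binary phrase reduces, legality checks and beam-scored decoding.

```python
class State:
    """Parser state: a queue Q of remaining characters and a stack S of
    nodes, where a node is a (label, children) tuple and leaves are
    single characters."""
    def __init__(self, chars):
        self.Q = list(chars)
        self.S = []

    def shift_separate(self, t):
        # Start a new word with POS tag t: move the next character from
        # Q onto S as a word-initial ("-b") subword node.
        self.S.append((t + "-b", [self.Q.pop(0)]))

    def shift_append(self):
        # The next character continues the current word: push it as a
        # non-initial ("-i") subword node, combined later by reduce_subword.
        t = self.S[-1][0].split("-")[0]
        self.S.append((t + "-i", [self.Q.pop(0)]))

    def reduce_subword(self, d):
        # Pop the top two nodes S0 and S1 off S and push a combined
        # subword node; d in {"l", "r", "c"} records the head direction.
        s0 = self.S.pop(); s1 = self.S.pop()
        self.S.append((s1[0].split("-")[0] + "-" + d, [s1, s0]))

    def reduce_word(self):
        # The unary rule marking a completed subword node as a full word.
        s0 = self.S.pop()
        self.S.append((s0[0].split("-")[0] + "-t", [s0]))

    def reduce_unary(self, l):
        # A unary phrase rule over a full-word or phrase node.
        self.S.append((l, [self.S.pop()]))

s = State("格局")
s.shift_separate("NN"); s.shift_append()
s.reduce_subword("r"); s.reduce_word(); s.reduce_unary("NP")
# S now holds:
# ('NP', [('NN-t', [('NN-r', [('NN-b', ['格']), ('NN-i', ['局'])])])])
```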
46 The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010). [sent-165, score-1.165]
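A few illustrative templates in this spirit, heavily simplified from the actual feature set of Zhang and Clark (2010) (the function and template names below are placeholders of ours):

```python
def seg_tag_features(chars, i, tags, words):
    """Example string features for choosing the action at character
    position i, given the words and tags built so far."""
    c0 = chars[i] if i < len(chars) else "</s>"        # current character
    c_1 = chars[i - 1] if i > 0 else "<s>"             # previous character
    w_1 = words[-1] if words else "<s>"                # last complete word
    t_1 = tags[-1] if tags else "<s>"                  # its POS tag
    return [
        "c0=" + c0,
        "c-1c0=" + c_1 + c0,          # character bigram across the boundary
        "w-1=" + w_1,
        "t-1c0=" + t_1 + "/" + c0,    # previous tag with current character
    ]

print(seg_tag_features(list("中国建筑业"), 2, ["NR"], ["中国"]))
# ['c0=建', 'c-1c0=国建', 'w-1=中国', 't-1c0=NR/建']
```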
47 Since our model jointly processes word segmentation, POS tagging and phrase-structure parsing, we evaluate it on all three tasks. [sent-170, score-0.404]
48 For word segmentation and POS tagging, standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging. [sent-171, score-1.265]
49 As our constituent trees are based on characters, we follow previous work and redefine the boundary of a constituent span by its start and end characters. [sent-173, score-0.515]
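Both the word-level metrics and the character-span constituent evaluation reduce to comparing sets of character spans; a minimal sketch (the helper names are ours) for the word-level case, which extends directly to tagged words and to constituent spans keyed by start/end characters:

```python
def spans(words):
    """Map a segmented sentence to the set of (start, end) character
    spans, one per word."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def prf(gold_words, pred_words):
    """Word precision, recall and F-score over character spans."""
    g, p = spans(gold_words), spans(pred_words)
    correct = len(g & p)
    prec, rec = correct / len(p), correct / len(g)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

print(prf(["中国", "建筑业"], ["中国", "建筑", "业"]))
# (0.3333..., 0.5, 0.4): only the span of "中国" matches the gold words
```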
50 In addition, we evaluate the performance of word structure parsing. [Figure panel (a): Joint segmentation and POS tagging F-scores.] [sent-174, score-0.647]
51 Figure 6: Accuracies against the training epoch for joint segmentation and tagging as well as joint phrase-structure parsing using beam sizes 1, 4, 16 and 64, respectively. [sent-176, score-1.132]
52 The character-level parsing model has the advantage that deep character information can be extracted as features for parsing. [sent-192, score-0.43]
53 For example, the head character of a word is exploited in our model. [sent-193, score-0.38]
54 3 Final Results In this section, we present the final results of our model, and compare it to two baseline systems: a pipelined system and a joint system that is trained with automatically generated flat word structures. [sent-198, score-0.343]
55 The baseline pipelined system consists of the joint segmentation and tagging model proposed by [Table: Task / P / R / F results for the Pipeline system (Seg, Tag, Parse)]. [sent-199, score-0.838]
56 (Footnote 4: the model for joint segmentation and POS tagging is trained with a beam size of 16, since it achieves the best performance.) [sent-210, score-0.735]
57 The joint system trained with flat word structures serves to test the effectiveness of our joint parsing system over the pipelined baseline, since flat word structures do not contain additional sources of information over the baseline. [sent-214, score-1.383]
58 We can see that both character-level joint models outperform the pipelined system; our model with annotated word structures gives an improvement of 0. [sent-217, score-0.577]
59 The results also demonstrate that the annotated word structures are highly effective for syntactic parsing, giving an absolute improvement of 0.82% in phrase-structure parsing accuracy over the joint model with flat word structures. [sent-220, score-0.349] [sent-221, score-0.566]
61 In particular, the performance of parsing OOV word structures is an important metric for our parser. [sent-227, score-0.331]
62 43%, while if we do not consider the influences of segmentation and tagging errors, counting only the correctly segmented and tagged words, the recall is 87. [sent-229, score-0.611]
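Factoring out upstream segmentation and tagging errors in this way can be sketched as follows; the flag counts below are hypothetical, purely for illustration:

```python
def word_structure_recall(flags, condition_on_seg_tag):
    """flags: one (seg_tag_correct, structure_correct) pair per gold
    word. Unconditioned recall divides by all gold words; conditioned
    recall counts only words whose segmentation and tag were correct."""
    if condition_on_seg_tag:
        flags = [f for f in flags if f[0]]
    return sum(1 for f in flags if f[1]) / len(flags)

# Hypothetical counts, not the paper's numbers:
flags = [(True, True)] * 70 + [(True, False)] * 10 + [(False, False)] * 20
print(word_structure_recall(flags, False))  # 0.7   over all gold words
print(word_structure_recall(flags, True))   # 0.875 over seg+tag-correct words
```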
63 4 Comparison with Previous Work In this section, we compare our model to previous systems on the performance of joint word segmentation and POS tagging, and the performance of joint phrase-structure parsing. [sent-232, score-0.706]
64 (2009), which is a lattice-based joint word segmentation and POS tagging model; Sun ’11 denotes a subword-based stacking model for joint segmentation and POS tagging (Sun, 2011), which uses a dictionary of idioms; Wang+ ’11 denotes a semi-supervised model proposed by Wang et al. [sent-235, score-1.85]
65 Our model achieved the best performance on both joint segmentation and tagging and on joint phrase-structure parsing. [sent-238, score-0.857]
66 Our final performance on constituent parsing is by far the best that we are aware of for the Chinese data, and even better than some state-of-the-art models with gold segmentation. [sent-239, score-0.408]
67 45% in parsing accuracy on the test corpus, and our pipeline constituent parsing model achieves 83. [sent-248, score-0.668]
68 The main difference between word-based and character-level parsing models is that the character-level model can exploit character features. [sent-252, score-0.43]
69 Zhao (2009) studied character-level dependencies for Chinese word segmentation by formalizing the segmentation task in a dependency parsing framework. [sent-255, score-0.698]
70 Li and Zhou (2012) also exploited morphological-level word structures for Chinese dependency parsing. [sent-259, score-0.361]
71 They proposed a transition-based model that parses the morphological and dependency structures of a Chinese sentence in a unified framework. [sent-260, score-0.481]
72 According to their results, the final performance of their model on word segmentation and POS tagging is below that of the state-of-the-art joint segmentation and POS tagging models. [sent-265, score-1.412]
73 Compared to their work, we consider the character-level word structures for Chinese parsing, presenting a unified framework for segmentation, POS tagging and phrase-structure parsing. [sent-266, score-0.63]
74 Our character-level parsing model is inspired by the work of Zhang and Clark (2009), which is a transition-based model with a beam-search decoder for word-based constituent parsing. [sent-268, score-0.505]
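A schematic of such a beam-search decoder is given below. The callback interfaces (init_state, legal_actions, apply_action, score, is_final) are assumptions of ours, not the paper's API; score would typically sum (averaged) perceptron feature weights for taking an action:

```python
import heapq

def beam_decode(init_state, legal_actions, apply_action, score, is_final,
                beam_size=16):
    """Keep the beam_size highest-scoring states after every step and
    return the best final state."""
    beam = [(0.0, init_state)]
    while not all(is_final(st) for _, st in beam):
        cands = []
        for sc, st in beam:
            if is_final(st):
                cands.append((sc, st))     # finished states compete as-is
                continue
            for a in legal_actions(st):
                cands.append((sc + score(st, a), apply_action(st, a)))
        beam = heapq.nlargest(beam_size, cands, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

With beam_size=1 this degenerates to greedy decoding, matching the beam sizes 1, 4, 16 and 64 compared in the experiments above.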
75 Our work is based on the shift-reduce operations of their work, while we introduce additional operations for segmentation and POS tagging. [sent-269, score-0.364]
76 With this extension, our model can include all the features of their work, together with the features for segmentation and POS tagging. [sent-270, score-0.396]
77 In addition, we propose novel features related to word structures and interactions between word segmentation, POS tagging and word-based constituent parsing. [sent-271, score-0.749]
78 They use it as a joint framework to perform Chinese word segmentation, POS tagging and syntax parsing. [sent-273, score-0.464]
79 In addition, instead of using flat structures, we manually annotate hierarchical tree structures of Chinese words for converting word-based constituent trees into character-based constituent trees. [sent-276, score-0.853]
80 (2012) proposed the first joint work for word segmentation, POS tagging and dependency parsing. [sent-278, score-0.445]
81 Their work demonstrates that a joint model can improve the performance of the three tasks, particularly for POS tagging and dependency parsing. [sent-280, score-0.411]
82 Qian and Liu (2012) proposed a joint decoder for word segmentation, POS tagging and word-based constituent parsing, although they trained models for the three tasks separately. [sent-281, score-0.618]
83 In our work, we employ a single character-based discriminative model to perform segmentation, POS tagging and phrase-structure parsing jointly, and study the influence of annotated word structures. [sent-283, score-0.577]
84 6 Conclusions and Future Work We studied the internal structures of more than 37,382 Chinese words, analyzing their structures as recursive combinations of characters. [sent-284, score-0.55]
85 Using these word structures, we extended the CTB into character-level trees, and developed a character-based parser that builds such trees from raw character sequences. [sent-285, score-0.329]
86 Our character-based parser performs segmentation, POS tagging and parsing simultaneously, and significantly outperforms a pipelined baseline. [sent-286, score-0.656]
87 In summary, our contributions include: • We annotated the internal structures of Chinese words, which are potentially useful to character-based studies of Chinese NLP. [sent-288, score-0.332]
88 We extend CTB-style constituent trees into character-level trees using our annotations. [sent-289, score-0.49]
89 • We developed a character-based parsing model that can produce our character-level constituent trees. [sent-290, score-0.408]
90 Our parser jointly performs word segmentation, POS tagging and syntactic parsing. [sent-291, score-0.477]
91 We investigated the effectiveness of our joint parser over the pipelined baseline, and the effectiveness of our annotated word structures in improving parsing accuracies. [sent-292, score-0.881]
92 Incremental joint approach to word segmentation, POS tagging, and dependency parsing in Chinese. [sent-305, score-0.688]
93 An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. [sent-310, score-1.173]
94 Unified dependency parsing of Chinese morphological and syntactic structures. [sent-315, score-0.695]
95 Parsing the internal structure of words: A new paradigm for Chinese word segmentation. [sent-320, score-0.538]
96 A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging. [sent-350, score-0.941]
97 Improving Chinese word segmentation and POS tagging with semi-supervised methods using large auto-analyzed data. [sent-360, score-1.236]
98 The Penn Chinese Treebank: Phrase structure annotation of a large corpus. [sent-365, score-0.432]
99 Transition-based parsing of the Chinese Treebank using a global discriminative model. [sent-384, score-0.585]
100 A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. [sent-389, score-0.585]
wordName wordTfidf (topN-words)
[('segmentation', 0.364), ('chinese', 0.357), ('pos', 0.232), ('parsing', 0.228), ('structures', 0.22), ('tagging', 0.217), ('subword', 0.186), ('constituent', 0.18), ('character', 0.17), ('trees', 0.155), ('characters', 0.151), ('joint', 0.122), ('flat', 0.118), ('head', 0.109), ('parser', 0.108), ('pipelined', 0.103), ('coordination', 0.1), ('node', 0.097), ('unary', 0.096), ('zhang', 0.092), ('hatori', 0.09), ('pushing', 0.089), ('dragon', 0.083), ('clark', 0.081), ('beam', 0.079), ('qian', 0.079), ('internal', 0.078), ('cj', 0.074), ('np', 0.07), ('subwords', 0.07), ('transition', 0.069), ('phrasestructure', 0.068), ('word', 0.066), ('yue', 0.065), ('denotes', 0.064), ('nodes', 0.064), ('crouching', 0.062), ('pop', 0.061), ('syntax', 0.059), ('unified', 0.059), ('jointly', 0.057), ('industry', 0.056), ('ctb', 0.055), ('kruengkrai', 0.054), ('li', 0.053), ('branching', 0.051), ('ctbstyle', 0.047), ('nroeddeus', 0.047), ('repertory', 0.047), ('tiger', 0.046), ('friend', 0.044), ('annotations', 0.044), ('actions', 0.042), ('morphological', 0.041), ('characterlevel', 0.041), ('logy', 0.041), ('zhongguo', 0.041), ('xue', 0.041), ('dependency', 0.04), ('nianwen', 0.04), ('perceptron', 0.039), ('annotation', 0.038), ('plural', 0.038), ('degenerate', 0.038), ('sti', 0.038), ('structure', 0.037), ('vadas', 0.036), ('action', 0.036), ('exploited', 0.035), ('inner', 0.035), ('annotated', 0.034), ('jun', 0.033), ('ichi', 0.033), ('child', 0.033), ('decoder', 0.033), ('binarized', 0.033), ('left', 0.033), ('china', 0.033), ('zhou', 0.033), ('recursive', 0.032), ('model', 0.032), ('yiou', 0.031), ('queue', 0.031), ('performances', 0.03), ('nn', 0.03), ('tagged', 0.03), ('encode', 0.03), ('usefulness', 0.03), ('parse', 0.03), ('harper', 0.03), ('association', 0.029), ('syntactic', 0.029), ('direction', 0.029), ('treated', 0.028), ('korea', 0.028), ('incremental', 0.027), ('cfg', 0.027), ('luo', 0.027), ('onto', 0.026), ('ip', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
2 0.43490466 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue
Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.
3 0.42827246 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu
Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features.
4 0.35178223 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun
Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which prove to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV-recall.
5 0.33524767 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
Author: Xipeng Qiu ; Qi Zhang ; Xuanjing Huang
Abstract: The growing need for Chinese natural language processing (NLP) arises in a wide range of research and commercial applications. However, most current Chinese NLP tools or components still have a wide range of issues that need to be further improved and developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based methods to deal with Chinese NLP tasks, such as word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, time phrase recognition, anaphora resolution and so on.
6 0.33339605 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
7 0.32590884 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
8 0.29145125 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
9 0.26565415 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
10 0.25175387 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
11 0.21458024 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
12 0.21002883 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing
13 0.202888 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
14 0.16110559 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
15 0.15992656 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
16 0.15681204 288 acl-2013-Punctuation Prediction with Transition-based Parsing
17 0.14982568 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
18 0.14776008 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
19 0.14093295 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
20 0.13689299 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
topicId topicWeight
[(0, 0.313), (1, -0.249), (2, -0.478), (3, 0.122), (4, 0.257), (5, -0.081), (6, -0.101), (7, 0.019), (8, -0.002), (9, 0.134), (10, -0.044), (11, 0.115), (12, 0.07), (13, -0.015), (14, 0.066), (15, 0.002), (16, 0.075), (17, -0.026), (18, -0.022), (19, 0.071), (20, -0.048), (21, -0.026), (22, -0.006), (23, -0.044), (24, 0.001), (25, -0.022), (26, -0.007), (27, -0.029), (28, 0.071), (29, 0.016), (30, 0.001), (31, -0.018), (32, 0.03), (33, -0.087), (34, -0.048), (35, 0.005), (36, 0.025), (37, 0.023), (38, -0.01), (39, -0.008), (40, 0.003), (41, 0.016), (42, 0.002), (43, -0.025), (44, -0.015), (45, -0.016), (46, 0.002), (47, 0.011), (48, 0.003), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.97917283 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
2 0.9312278 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue
Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.
3 0.88899249 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
Author: Wenbin Jiang ; Meng Sun ; Yajuan Lu ; Yating Yang ; Qun Liu
Abstract: Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. In this paper we propose a discriminative learning algorithm to take advantage of the linguistic knowledge in large amounts of natural annotations on the Internet. It utilizes the Internet as an external corpus with massive (although slight and sparse) natural annotations, and enables a classifier to evolve on the large-scaled and real-time updated web text. With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features.
4 0.83693612 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Author: Longkai Zhang ; Li Li ; Zhengyan He ; Houfeng Wang ; Ni Sun
Abstract: Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctuation information of unlabeled micro-blog data by introducing characters behind or ahead of punctuations, for they indicate the beginning or end of words. Meanwhile a self-training framework to incorporate confident instances is also used, which prove to be helpful. Ex- periments on micro-blog data show that our approach improves performance, especially in OOV-recall. 1 INTRODUCTION Micro-blog (also known as tweets in English) is a new kind of broadcast medium in the form of blogging. A micro-blog differs from a traditional blog in that it is typically smaller in size. Furthermore, texts in micro-blogs tend to be informal and new words occur more frequently. These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. For example, the most widely used Chinese segmenter ”ICTCLAS” yields 0.95 f-score in news corpus, only gets 0.82 f-score on micro-blog data. The poor segmentation results will hurt subsequent analysis on micro-blog text. ∗Corresponding author Manually labeling the texts of micro-blog is time consuming. Luckily, punctuations provide useful information because they are used as indicators of the end of previous sentence and the beginning of the next one, which also indicate the start and the end of a word. These ”natural boundaries” appear so frequently in micro-blog texts that we can easily make good use of them. TABLE 1 shows some statistics of the news corpus vs. the micro-blogs. Besides, English letters and digits are also more than those in news corpus. They all are natural delimiters of Chinese characters and we treat them just the same as punctuations. We propose a method to enlarge the training corpus by using punctuation information. We build a semi-supervised learning (SSL) framework which can iteratively incorporate newly labeled instances from unlabeled micro-blog data during the training process. We test our method on microblog texts and experiments show good results. This paper is organized as follows. In section 1 we introduce the problem. Section 2 gives detailed description of our approach. We show the experi- ment and analyze the results in section 3. Section 4 gives the related works and in section 5 we conclude the whole work. 2 Our method 2.1 Punctuations Chinese word segmentation problem might be treated as a character labeling problem which gives each character a label indicating its position in one word. To be simple, one can use label ’B’ to indicate a character is the beginning of a word, and use ’N’ to indicate a character is not the beginning of a word. We also use the 2-tag in our work. Other tag sets like the ’BIES’ tag set are not suiteable because the puctuation information cannot decide whether a character after punctuation should be labeled as ’B’ or ’S’(word with Single 177 ProceedingSsof oifa, th Beu 5l1gsarti Aan,An uuaglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioinngauli Lsitnicgsu,i psatgicess 177–182, micNreow-bslogC68h56i. 
n73e%%seE10n1.g6.8l%i%shN20u. m76%%berPu1n13c9.t u03a%%tion Table 1: Percentage of Chinese, English, number, punctuation in the news corpus vs. the micro-blogs. character). Punctuations can serve as implicit labels for the characters before and after them. The character right after punctuations must be the first character of a word, meanwhile the character right before punctuations must be the last character of a word. An example is given in TABLE 2. 2.2 Algorithm Our algorithm “ADD-N” is shown in TABLE 3. The initially selected character instances are those right after punctuations. By definition they are all labeled with ’B’ . In this case, the number of training instances with label ’B’ is increased while the number with label ’N’ remains unchanged. Because of this, the model trained on this unbalanced corpus tends to be biased. This problem can become even worse when there is inexhaustible supply of texts from the target domain. We assume that labeled corpus of the source domain can be treated as a balanced reflection of different labels. Therefore we choose to estimate the balanced point by counting characters labeling ’B’ and ’N’ and calculate the ratio which we denote as η . We assume the enlarged corpus is also balanced if and only if the ratio of ’B’ to ’N’ is just the same to η of the source domain. Our algorithm uses data from source domain to make the labels balanced. When enlarging corpus using characters behind punctuations from texts in target domain, only characters labeling ’B’ are added. We randomly reuse some characters labeling ’N’ from labeled data until ratio η is reached. We do not use characters ahead of punctuations, because the single-character words ahead of punctuations take the label of ’B’ instead of ’N’ . In summary our algorithm tackles the problem by duplicating labeled data in source domain. We denote our algorithm as ”ADD-N”. We also use baseline feature templates include the features described in previous works (Sun and Xu, 2011; Sun et al., 2012). Our algorithm is not necessarily limited to a specific tagger. For simplicity and reliability, we use a simple MaximumEntropy tagger. 3 Experiment 3.1 Data set We evaluate our method using the data from weibo.com, which is the biggest micro-blog service in China. We use the API provided by weibo.com1 to crawl 500,000 micro-blog texts of weibo.com, which contains 24,243,772 characters. To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data, which contain 2,420,037 characters. We manually segment 2038 randomly selected microblogs.We follow the segmentation standard as the PKU corpus. In micro-blog texts, the user names and URLs have fixed format. User names start with ’ @ ’, followed by Chinese characters, English letters, numbers and ’ ’, and terminated when meeting punctuations or blanks. URLs also match fixed patterns, which are shortened using ”http : / /t . cn /” plus six random English letters or numbers. Thus user names and URLs can be pre-processed separately. We follow this principle in following experiments. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff2 as the labeled data. We choose the PKU data in our experiment because our baseline methods use the same segmentation standard. We compare our method with three baseline methods. The first two are both famous Chinese word segmentation tools: ICTCLAS3 and Stanford Chinese word segmenter4, which are widely used in NLP related to word segmentation. 
Stanford Chinese word segmenter is a CRF-based segmentation tool and its segmentation standard is chosen as the PKU standard, which is the same to ours. ICTCLAS, on the other hand, is a HMMbased Chinese word segmenter. Another baseline is Li and Sun (2009), which also uses punctuation in their semi-supervised framework. F-score 1http : / / open . we ibo .com/wiki 2http : / /www . s ighan .org/bakeo f f2 0 0 5 / 3http : / / i c l .org/ ct as 4http : / / nlp . st an ford . edu /pro j ect s / chine s e-nlp . shtml \ # cws 178 评B论-是-风-格-,-评B论-是-能-力-。- BNBBNBBNBBNB Table 2: The first line represents the original text. The second line indicates whether each character is the Beginning of sentence. The third line is the tag sequence using ”BN” tag set. is used as the accuracy measure. The recall of out-of-vocabulary is also taken into consideration, which measures the ability of the model to correctly segment out of vocabulary words. 3.2 Main results methods on the development data. TABLE 4 summarizes the segmentation results. In TABLE 4, Li-Sun is the method in Li and Sun (2009). Maxent only uses the PKU data for training, with neither punctuation information nor self-training framework incorporated. The next 4 methods all require a 100 iteration of self-training. No-punc is the method that only uses self-training while no punctuation information is added. Nobalance is similar to ADD N. The only difference between No-balance and ADD-N is that the former does not balance label ’B’ and label ’N’ . The comparison of Maxent and No-punctuation shows that naively adding confident unlabeled instances does not guarantee to improve performance. The writing style and word formation of the source domain is different from target domain. When segmenting texts of the target domain using models trained on source domain, the performance will be hurt with more false segmented instances added into the training set. The comparison of Maxent, No-balance and ADD-N shows that considering punctuation as well as self-training does improve performance. Both the f-score and OOV-recall increase. By comparing No-balance and ADD-N alone we can find that we achieve relatively high f-score if we ignore tag balance issue, while slightly hurt the OOV-Recall. However, considering it will improve OOV-Recall by about +1.6% and the fscore +0.2%. We also experimented on different size of unlabeled data to evaluate the performance when adding unlabeled target domain data. TABLE 5 shows different f-scores and OOV-Recalls on different unlabeled data set. We note that when the number of texts changes from 0 to 50,000, the f-score and OOV both are improved. However, when unlabeled data changes to 200,000, the performance is a bit decreased, while still better than not using unlabeled data. This result comes from the fact that the method ’ADD-N’ only uses characters behind punctua179 Tabl152S0eiz 0:Segm0.8nP67ta245ion0p.8Rer6745f9om0a.8nF57c6e1witOh0 .d7Vi65f-2394Rernt size of unlabeled data tions from target domain. Taking more texts into consideration means selecting more characters labeling ’N’ from source domain to simulate those in target domain. If too many ’N’s are introduced, the training data will be biased against the true distribution of target domain. 
3.3 Characters ahead of punctuations

In the "BN" tagging method above, we use the characters after punctuation in micro-blog texts to enlarge the training set. We also try the opposite approach, an "EN" tag set in which 'E' represents "end of word" and 'N' represents "not the end of word"; in this contrasting method, we use only the characters immediately before punctuation. The two methods show similar results. Experimental results with ADD-N are shown in TABLE 6.

Tag set   F       OOV-Recall
BN        0.875   0.773
EN        0.870   0.763

Table 6: Comparison of the "BN" and "EN" tag sets with 50,000 unlabeled texts.

4 Related Work

Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). These supervised methods give good results but cannot incorporate information from a new domain, where the OOV problem is a major challenge for the research community. Unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 2006; Jin and Tanaka-Ishii, 2006; Feng et al., 2004; Maosong et al., 1998), on the other hand, takes advantage of the huge amount of raw text available, but such methods are usually less accurate and more complicated than supervised ones.

Meanwhile, semi-supervised methods have been applied to NLP tasks. Bickel et al. (2007) learn a scaling factor from source-domain data and use it to make the training distribution resemble the target-domain distribution. Wu et al. (2009) use a domain adaptive bootstrapping (DAB) framework, which shows good results on named entity recognition. Similar semi-supervised applications include Shen et al. (2004), Daumé III and Marcu (2006), Jiang and Zhai (2007) and Weinberger et al. (2006). In addition, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics serve as discrete features, which proves effective for Chinese word segmentation.

There is also previous work using punctuation as implicit annotation. Riley (1989) uses it for sentence boundary detection. Li and Sun (2009) propose a compromise solution that uses a classifier to select the most confident characters; we do not follow this approach, because its initial errors would dramatically harm performance. Instead, we add to our training set only the characters after punctuation, which are certain to be word beginnings (i.e., labeled 'B'). Sun and Xu (2011) use punctuation information as a discrete feature in a sequence labeling framework, which improves over the pure sequence labeling approach; our method differs in that we use the characters after punctuation directly.

5 Conclusion

In this paper we have presented a simple yet effective approach to Chinese word segmentation on micro-blog texts. Our approach exploits the punctuation information in unlabeled micro-blog data together with a self-training framework that incorporates confident instances. Experiments show that the approach improves performance, especially OOV-Recall, and that both the punctuation information and the self-training phase contribute to the improvement.

Acknowledgments

This research was partly supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA011101), the National Natural Science Foundation of China (No. 91024009) and the Major National Social Science Fund of China (No. 12&ZD227).

References
Bickel, S., Brückner, M., and Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pages 81–88. ACM.

Chen, W., Zhang, Y., and Isahara, H. (2006). Chinese named entity recognition with conditional random fields. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Australia.

Daumé III, H. and Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101–126.

Feng, H., Chen, K., Deng, X., and Zheng, W. (2004). Accessor variety criteria for Chinese word extraction. Computational Linguistics, 30(1):75–93.

Goldwater, S., Griffiths, T., and Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 673–680. Association for Computational Linguistics.

Jiang, J. and Zhai, C. (2007). Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, page 264.

Jin, Z. and Tanaka-Ishii, K. (2006). Unsupervised segmentation of Chinese text by use of branching entropy. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 428–435. Association for Computational Linguistics.

Li, Z. and Sun, M. (2009). Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, 35(4):505–512.

Low, J., Ng, H., and Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea.

Maosong, S., Dayang, S., and Tsou, B. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, pages 1265–1271. Association for Computational Linguistics.

Pan, S. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Peng, F. and Schuurmans, D. (2001). Self-supervised Chinese word segmentation. In Advances in Intelligent Data Analysis, pages 238–247.

Riley, M. (1989). Some applications of tree-based modelling to speech and language. In Proceedings of the Workshop on Speech and Natural Language, pages 339–352. Association for Computational Linguistics.

Shen, D., Zhang, J., Su, J., Zhou, G., and Tan, C. (2004). Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 589. Association for Computational Linguistics.

Sun, W. and Xu, J. (2011). Enhancing Chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979. Association for Computational Linguistics.

Sun, X., Wang, H., and Li, W. (2012). Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 253–262, Jeju Island, Korea. Association for Computational Linguistics.

Weinberger, K., Blitzer, J., and Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. In NIPS.

Wu, D., Lee, W., Ye, N., and Chieu, H. (2009). Domain adaptive bootstrapping for named entity recognition.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1523–1532. Association for Computational Linguistics.

Xue, N. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.

Zhao, H., Huang, C., and Li, M. (2006a). An improved Chinese word segmentation system with conditional random field. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney.

Zhao, H., Huang, C., Li, M., and Lu, B. (2006b). Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of PACLIC, volume 20, pages 87–94.
5 0.83075982 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
Author: Jonathan K. Kummerfeld ; Daniel Tse ; James R. Curran ; Dan Klein
Abstract: Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.
6 0.82581294 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
7 0.81545913 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
8 0.77988046 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
9 0.71071994 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
10 0.68022984 288 acl-2013-Punctuation Prediction with Transition-based Parsing
11 0.67081285 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
12 0.65675712 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing
13 0.60876125 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
14 0.5921362 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
15 0.56903678 128 acl-2013-Does Korean defeat phonotactic word segmentation?
16 0.5669201 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance
17 0.53802675 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
18 0.52045 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
19 0.51474375 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
20 0.48747715 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
topicId topicWeight
[(0, 0.046), (6, 0.036), (11, 0.098), (14, 0.028), (15, 0.018), (16, 0.026), (24, 0.062), (26, 0.079), (35, 0.052), (42, 0.084), (48, 0.061), (53, 0.104), (70, 0.108), (88, 0.029), (90, 0.025), (95, 0.069)]
simIndex simValue paperId paperTitle
1 0.90963513 370 acl-2013-Unsupervised Transcription of Historical Documents
Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein
Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 3 1% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.
same-paper 2 0.90844119 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
3 0.90084732 270 acl-2013-ParGramBank: The ParGram Parallel Treebank
Author: Sebastian Sulger ; Miriam Butt ; Tracy Holloway King ; Paul Meurer ; Tibor Laczko ; Gyorgy Rakosi ; Cheikh Bamba Dione ; Helge Dyvik ; Victoria Rosen ; Koenraad De Smedt ; Agnieszka Patejuk ; Ozlem Cetinoglu ; I Wayan Arka ; Meladel Mistica
Abstract: This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (LexicalFunctional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are avail- able in other treebanks, that represents me ladel .mi st ica@ gmai l com . deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.
4 0.85902166 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
Author: Ji Ma ; Jingbo Zhu ; Tong Xiao ; Nan Yang
Abstract: In this paper, we combine easy-first dependency parsing and POS tagging algorithms with beam search and structured perceptron. We propose a simple variant of “early-update” to ensure valid update in the training process. The proposed solution can also be applied to combine beam search and structured perceptron with other systems that exhibit spurious ambiguity. On CTB, we achieve 94.01% tagging accuracy and 86.33% unlabeled attachment score with a relatively small beam width. On PTB, we also achieve state-of-the-art performance. 1
5 0.85651535 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
6 0.85459697 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
7 0.84523404 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
8 0.84216696 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
9 0.83515048 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars
10 0.83301264 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
11 0.83082294 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
12 0.83073092 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
13 0.8290754 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
14 0.82872409 318 acl-2013-Sentiment Relevance
15 0.82867515 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
16 0.82597721 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation
17 0.82486862 275 acl-2013-Parsing with Compositional Vector Grammars
18 0.82406104 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
19 0.8216697 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
20 0.81905109 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching