acl acl2011 acl2011-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
Reference: text
sentIndex sentText sentNum sentScore
1 Chinese sentence segmentation as comma classification Nianwen Xue and Yaqin Yang Brandeis University, Computer Science Department, Waltham, MA 02453 {xuen,yaqin}@brandeis.edu [sent-1, score-0.66]
2 Abstract We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. [sent-2, score-0.658]
3 Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. [sent-3, score-0.411]
4 Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries. [sent-4, score-0.631]
5 1 Introduction Sentence segmentation, or the detection of sentence boundaries, is very much a solved problem for English. [sent-5, score-0.137]
6 Sentence boundaries can be determined by looking for periods, exclamation marks and question marks. [sent-6, score-0.285]
7 Although the symbol (dot) that is used to represent the period is ambiguous because it is also used as the decimal point or in abbreviations, its resolution only requires local context. [sent-7, score-0.081]
8 Chinese also uses periods (albeit with a different symbol), question marks, and exclamation marks to indicate sentence boundaries. [sent-9, score-0.36]
9 Where these punctuation marks exist, sentence boundaries can be unambiguously detected. [sent-10, score-0.422]
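Concretely, a first pass that trusts only these unambiguous terminators might look as follows; a minimal sketch, not taken from the paper, with an illustrative function name:

```python
import re

# Minimal sketch (not from the paper): a first-pass splitter that trusts only
# the unambiguous Chinese terminators -- period (。), question mark (？) and
# exclamation mark (！). The comma (，) is deliberately not split on, since
# disambiguating it is exactly the problem this paper addresses.
UNAMBIGUOUS_EOS = "。？！"

def split_on_terminators(text: str) -> list[str]:
    """Return chunks of `text`, keeping each terminator with its sentence."""
    parts = re.split("([" + UNAMBIGUOUS_EOS + "])", text)
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1].strip():          # trailing material with no terminator
        sentences.append(parts[-1])
    return [s for s in sentences if s.strip()]

print(split_on_terminators("我去了商店。他问：为什么？"))
# ['我去了商店。', '他问：为什么？']
```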
10 The difference is that the Chinese comma also functions similarly to the English period in some contexts and signals the boundary of a sentence. [sent-11, score-0.655]
11 As a result, if the commas are not disambiguated, Chinese would have these “run-on” sentences that can only be plausibly translated into multiple English sentences. [sent-12, score-0.601]
12 An example is given in (1), where one Chinese sentence is plausibly translated into three English sentences. [sent-13, score-0.126]
13 “I have been paying attention to this Nano 3 recently, [1] and I even visited a few computer stores in person. [sent-15, score-0.031]
14 [2] Comparatively speaking, [3] Zhuoyue’s prices are relatively low, [4] and they can also guarantee that their products are genuine. [sent-16, score-0.057]
15 ” In this paper, we formulate Chinese sentence segmentation as a comma disambiguation problem. [sent-18, score-0.699]
16 The problem is basically one of separating commas that mark sentence boundaries (such as [2] and [5] in (1)) from those that do not (such as [1], [3] and [4]). [sent-19, score-0.8]
17 Sentences that can be split on commas are generally loosely coordinated structures that are syntactically and semantically complete on their own, and they do not have a close syntactic relation with one another. [sent-20, score-0.785]
18 We believe that a sentence boundary detection task that disambiguates commas, if successfully solved, simplifies downstream tasks such as parsing and Machine Translation. [sent-21, score-0.199] [sent-23, score-0.03]
20 2 Obtaining data To our knowledge, there is no data in the public domain with commas explicitly annotated based on whether they mark sentence boundaries. [sent-31, score-0.669]
21 One could imagine using parallel data where a Chinese sentence is word-aligned with multiple English sentences, but such data is generally noisy and commas are not disambiguated based on a uniform standard. [sent-32, score-0.688]
22 The CTB does not disambiguate commas explicitly, and just like the Penn English Treebank (Marcus et al., 1993), the sentence boundaries in the CTB are identified by periods, exclamation and question marks. [sent-34, score-0.583] [sent-35, score-0.277]
24 However, there are clear syntactic patterns that can be used to disambiguate the two types of commas. [sent-36, score-0.07]
25 Commas that mark sentence boundaries delimit loosely coordinated top-level IPs, as illustrated in Figure 1, and commas that do not mark sentence boundaries cover all other cases. [sent-37, score-0.973]
26 One such example is Figure 2, where a PP is separated from the rest of the sentence with a comma. [sent-38, score-0.108]
27 We devised a heuristic algorithm to detect loosely coordinated structures in the Chinese Treebank, and labeled each comma with either EOS (end of a sentence) or Non-EOS (not the end of a sentence). [sent-39, score-0.765]
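The text does not give the heuristic itself, so the following is only a hedged approximation of the idea on nltk-style trees: a top-level comma between two coordinated IPs is EOS, everything else Non-EOS:

```python
from nltk.tree import Tree

def label_top_level_commas(tree: Tree) -> list[tuple[int, str]]:
    """Label each comma that is an immediate child of the root: EOS when it
    sits between two coordinated top-level IPs, otherwise Non-EOS. Deeper
    commas (e.g. one setting off a fronted PP, as in Figure 2) never reach
    this loop and so default to Non-EOS."""
    kids = list(tree)

    def is_ip(node):
        return isinstance(node, Tree) and node.label() == "IP"

    out = []
    for i, k in enumerate(kids):
        if isinstance(k, Tree) and k.label() == "PU" and k.leaves() == ["，"]:
            eos = (i > 0 and is_ip(kids[i - 1])
                   and i + 1 < len(kids) and is_ip(kids[i + 1]))
            out.append((i, "EOS" if eos else "Non-EOS"))
    return out

# Toy CTB-style bracketing with one loosely coordinated comma:
t = Tree.fromstring("(IP (IP (NP 我) (VP 去了)) (PU ，) (IP (NP 他) (VP 来了)))")
print(label_top_level_commas(t))   # [(1, 'EOS')]
```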
28 3 Learning After the commas are labeled, we have basically turned comma disambiguation into a binary classification problem. [sent-40, score-1.143]
29 The syntactic structures are an obvious source of information for this classification task, so we parsed the entire CTB 6.0 into 10 portions, and parsed each portion with a model trained on the other portions, using the Berkeley parser (Petrov and Klein, 2007). [sent-41, score-0.074] [sent-44, score-0.062]
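This setup is a standard 10-way jackknife; a sketch under stated assumptions, where train_parser and parse_with_model are placeholders for the Berkeley-parser invocations the summary does not detail:

```python
# Hedged sketch of the 10-way jackknife described above: each tenth of the
# treebank is parsed by a model trained on the other nine tenths, so no
# sentence is ever parsed by a model that saw its gold tree in training.
def jackknife_parse(trees, train_parser, parse_with_model, k=10):
    folds = [trees[i::k] for i in range(k)]
    auto_parses = []
    for i, held_out in enumerate(folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_parser(train)
        auto_parses.extend(parse_with_model(model, held_out))
    return auto_parses
```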
31 We first established a baseline by applying the same heuristic algorithm to the automatic parses. [sent-46, score-0.071]
32 This will give us a sense of how accurately commas can be disambiguated given imperfect parses. [sent-47, score-0.61]
33 The research question we are trying to address is basically this: can we improve on the baseline accuracy with a machine learning model? [sent-48, score-0.075]
34 All features are described relative to the comma being classified and the context is the sentence that the comma is in. [sent-51, score-1.082]
35 The actual feature values for the first comma in Figure 1 are given as examples: 1. [sent-52, score-0.502]
36 The string representation of the following word if it occurs more than 12,000 times in sentence-initial positions in a large corpus external to our training and test data. [sent-60, score-0.03]
37 phrase label of their right sibling in the syntactic parse tree, as well as their conjunction, e.g., f6=IP, f7=IP, f8=IP+IP. [sent-68, score-0.162]
38 The conjunction of the ancestors, the phrase label of the left sibling, and the phrase label of the right sibling. [sent-69, score-0.107]
39 The ancestor is defined as the path from the parent of the comma to the root node of the parse tree, e.g. [sent-70, score-0.567]
40 The search starts at the comma and stops at the previous punctuation mark or the beginning of the sentence, e.g. [sent-76, score-0.691]
41 Whether the parent of the comma is a coordinating IP construction. [sent-79, score-0.608]
42 A coordinating IP construction is an IP that dominates a list of coordinated IPs, e.g. [sent-80, score-0.182]
43 Whether the comma is a top-level child, defined as the child of the root node of the syntactic tree, e.g. [sent-83, score-0.571]
44 Whether the parent of the comma is a top-level coordinating IP construction, e.g. [sent-86, score-0.608]
45 The punctuation mark template for this sentence, e.g. [sent-89, score-0.161]
46 Whether the length difference between the left and right segments of the comma is smaller than 7. [sent-92, score-0.605]
47 The left (right) segment spans from the previous (next) punctuation mark or the beginning (end) of the sentence to the comma, e.g., f15=>7. [sent-93, score-0.269]
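Several of the listed features can be made concrete on an nltk tree; in this hedged sketch the ids f6/f7/f8/f15 follow the numbering above, while the helper logic (for instance, measuring segments from the sentence edges rather than from the nearest punctuation mark) is a simplification:

```python
from nltk.tree import Tree

def comma_features(tree: Tree, comma_pos: tuple) -> dict:
    """Hedged sketch of a few of the features above for one comma occurrence
    (given as an nltk tree position)."""
    parent_pos, idx = comma_pos[:-1], comma_pos[-1]
    parent = tree[parent_pos]
    feats = {}
    # f6/f7: phrase labels of the left and right siblings; f8: their conjunction
    left = parent[idx - 1].label() if idx > 0 else "NONE"
    right = parent[idx + 1].label() if idx + 1 < len(parent) else "NONE"
    feats.update(f6=left, f7=right, f8=left + "+" + right)
    # ancestor path from the comma's parent up to the root of the tree
    feats["ancestors"] = "-".join(
        tree[parent_pos[:i]].label() for i in range(len(parent_pos), -1, -1))
    # whether the comma is an immediate child of the root node
    feats["top_level_child"] = len(comma_pos) == 1
    # f15: is the length difference between left and right segments < 7?
    # (simplified: segments run to the sentence edges here)
    before = sum(1 for p in tree.treepositions("leaves") if p < comma_pos)
    after = len(tree.leaves()) - before - 1
    feats["f15_small_diff"] = abs(before - after) < 7
    return feats

t = Tree.fromstring("(IP (IP (NP 我) (VP 去了)) (PU ，) (IP (NP 他) (VP 来了)))")
print(comma_features(t, (1,)))
# {'f6': 'IP', 'f7': 'IP', 'f8': 'IP+IP', 'ancestors': 'IP',
#  'top_level_child': True, 'f15_small_diff': True}
```

Note that on this toy tree the sibling features come out as f6=IP, f7=IP, f8=IP+IP, matching the example values quoted above.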
48 4 Results and discussion Our comma disambiguation models are trained and evaluated on a subset of the Chinese TreeBank (CTB) 6.0. [sent-95, score-0.541]
49 The automatic parses in each test set are produced by retraining the Berkeley parser on its corresponding training set, plus the unused portion of the CTB 6.0. [sent-103, score-0.148]
50 Evaluated with the metric of Black et al. (1991), the parsing accuracy on the CTB test set stands at 83. [sent-106, score-0.03]
51 There are 1,510 commas in the test set, and our heuristic baseline algorithm is able to correctly label 1,321 or 87.5% of them. [sent-111, score-0.624]
52 16.6% of them are EOS commas that mark sentence boundaries and 1,260 of them are Non-EOS commas. [sent-114, score-0.751]
53 The baseline precision and recall for the EOS commas are 59. [sent-116, score-0.607]
54 For Non-EOS commas, the baseline precision and recall are 95. [sent-120, score-0.054]
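These per-class scores are ordinary precision, recall and F1 over the EOS label; a minimal sketch (the truncated baseline figures above are left as they are, not reconstructed):

```python
def prf(gold: list[str], pred: list[str], cls: str = "EOS"):
    """Per-class precision/recall/F1, as used for the EOS scores above."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```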
55 The learned maximum entropy classifier achieved a modest improvement over the baseline. [sent-124, score-0.027]
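A maximum entropy model over feature dictionaries like those in the earlier sketch can be stood up with scikit-learn's LogisticRegression; using scikit-learn here is a substitution for illustration, not the authors' actual toolkit:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_comma_classifier(feature_dicts, labels):
    """Fit a maxent-style binary classifier (EOS vs Non-EOS) on per-comma
    feature dicts. scikit-learn stands in for the paper's actual toolkit."""
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(feature_dicts, labels)
    return model

# Usage: clf = train_comma_classifier(dicts, ["EOS", "Non-EOS", ...])
#        predictions = clf.predict(new_dicts)
```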
56 For Non-EOS commas, the precision and recall are 95. [sent-131, score-0.054]
57 Other than a list of most frequent words that start a sentence, all the features are extracted from the sentence the comma occurs in. [sent-135, score-0.58]
58 Given that the heuristic algorithm and the learned model use essentially the same source of information, we attribute the improvement to the use of lexical features that the heuristic algorithm cannot easily take advantage of. [sent-136, score-0.169]
59 Although the features improved accuracy on the development set, some of the features (3 and 8) actually hurt the overall performance slightly on the test set. [sent-145, score-0.063]
60 What is interesting is that while the heuristic algorithm, which is based entirely on syntactic structure, produced a strong baseline, the same syntactic patterns are not at all effective when formulated as features. [sent-146, score-0.142]
61 In particular, feature groups 7, 8, 9 are explicit reformulations of the heuristic algorithm, but they all contributed very little to or even slightly hurt the overall performance. [sent-147, score-0.107]
62 What this suggests is that we can get reasonable sentence segmentation accuracy without having to parse the sentence (or rather, the multi-sentence group) first. [sent-149, score-0.262]
63 The sentence segmentation can thus come before parsing in the processing pipeline even in a language like Chinese where sentences are not unambiguously marked. [sent-150, score-0.241]
64 5 Related work There has been a fair amount of research on punctuation prediction or generation in the context of spoken language processing (Lu and Ng, 2010; Guo et al. [sent-151, score-0.123]
65 The task presented here is different in that the punctuation marks are already present in the text and we are only concerned with punctuation marks that are semantically ambiguous. [sent-153, score-0.418]
66 Our specific focus is on the Chinese comma, which sometimes signals a sentence boundary and sometimes doesn’t. [sent-154, score-0.178]
67 The Chinese comma has also been studied in the context of syntactic parsing for long sentences (Jin et al., 2005). [sent-155, score-0.601]
68 There, the study of the comma is seen as part of a “divide-and-conquer” strategy for syntactic parsing. [sent-157, score-0.542]
69 Long sentences are split into shorter sentence segments on commas before they are parsed, and the syntactic parses for the shorter sentence segments are then assembled into the syntactic parse for the original sentence. [sent-158, score-0.987]
70 We study comma disambiguation in its own right, aiming to help a wide range of NLP applications that include parsing and Machine Translation. [sent-159, score-0.602]
71 6 Conclusion The main goal of this short paper is to bring to the attention of the field a problem that has largely been taken for granted. [sent-160, score-0.031]
72 We show that while sentence boundary detection in Chinese is a relatively easy task if formulated based on purely orthographic grounds, the problem becomes much more challenging if we delve deeper and consider the semantic and possibly the discourse basis on which sentences are segmented. [sent-161, score-0.234]
73 Seen in this light, the central problem in Chinese sentence segmentation is comma disambiguation. [sent-162, score-0.687]
74 Roukos, A procedure for quantitatively comparing the syntactic coverage of English grammars. [sent-185, score-0.04]
wordName wordTfidf (topN-words)
[('commas', 0.553), ('comma', 0.502), ('ctb', 0.237), ('chinese', 0.221), ('ip', 0.202), ('punctuation', 0.123), ('coordinated', 0.115), ('eos', 0.112), ('nano', 0.104), ('exclamation', 0.091), ('marks', 0.086), ('boundaries', 0.082), ('treebank', 0.081), ('segmentation', 0.08), ('periods', 0.079), ('sentence', 0.078), ('loosely', 0.077), ('heuristic', 0.071), ('zhuoyue', 0.069), ('coordinating', 0.067), ('sibling', 0.065), ('boundary', 0.064), ('unused', 0.061), ('disambiguated', 0.057), ('mallet', 0.056), ('reynar', 0.056), ('xue', 0.053), ('period', 0.053), ('unambiguously', 0.053), ('basically', 0.049), ('plausibly', 0.048), ('comparatively', 0.048), ('conjunction', 0.046), ('nianwen', 0.044), ('vv', 0.044), ('jin', 0.042), ('guo', 0.042), ('segments', 0.042), ('ips', 0.041), ('syntactic', 0.04), ('parent', 0.039), ('disambiguation', 0.039), ('marcus', 0.039), ('mark', 0.038), ('hurt', 0.036), ('signals', 0.036), ('portions', 0.034), ('parses', 0.034), ('parsed', 0.034), ('denoting', 0.033), ('speaking', 0.032), ('santorini', 0.031), ('detection', 0.031), ('right', 0.031), ('attention', 0.031), ('formulated', 0.031), ('pnp', 0.03), ('chengqing', 0.03), ('amounting', 0.03), ('deg', 0.03), ('delve', 0.03), ('inferencing', 0.03), ('kachites', 0.03), ('sentenceinitial', 0.03), ('toplevel', 0.03), ('yaqin', 0.03), ('penn', 0.03), ('parsing', 0.03), ('separated', 0.03), ('disambiguate', 0.03), ('black', 0.03), ('left', 0.03), ('guarantee', 0.029), ('child', 0.029), ('long', 0.029), ('petrov', 0.029), ('portion', 0.028), ('solved', 0.028), ('decimal', 0.028), ('prices', 0.028), ('zong', 0.028), ('anlp', 0.028), ('gdaniec', 0.028), ('ingria', 0.028), ('stops', 0.028), ('subordinating', 0.028), ('precision', 0.028), ('central', 0.027), ('shorter', 0.027), ('learned', 0.027), ('kim', 0.027), ('recall', 0.026), ('question', 0.026), ('disambiguates', 0.026), ('hindle', 0.026), ('parse', 0.026), ('berkeley', 0.026), ('lu', 0.025), ('visit', 0.025), ('retraining', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 66 acl-2011-Chinese sentence segmentation as comma classification
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
2 0.15231335 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1 (Figure 1: Example of a word with internal structure). Without a doubt, there is complete agreement on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more consistent and there could be less duplication of efforts in developing the expensive annotated corpus. (Footnote 1: Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, the context will always make it clear what is being referred to with the term “word”.) The second reason is that applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example.
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the latter finer-grained output is preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed of three coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’. Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’. With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2: Example of telescopic compound (a) and separable word (b)). The last reason why we should care about word structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-of-vocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in the Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
3 0.13296498 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding.
4 0.10810352 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu
Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.
5 0.097624213 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Author: Weiwei Sun
Abstract: The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based segmenter, one character-based segmenter and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is efficient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.
6 0.092884913 28 acl-2011-A Statistical Tree Annotator and Its Applications
7 0.090566032 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
8 0.089804374 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
9 0.08920148 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation
10 0.088861451 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results
11 0.083411425 192 acl-2011-Language-Independent Parsing with Empty Elements
12 0.073841378 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
13 0.072530292 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
14 0.067108653 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
15 0.064844258 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
16 0.062443633 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
17 0.060283769 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
18 0.059042726 238 acl-2011-P11-2093 k2opt.pdf
19 0.058166303 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
20 0.056194734 313 acl-2011-Two Easy Improvements to Lexical Weighting
topicId topicWeight
[(0, 0.142), (1, -0.039), (2, -0.006), (3, -0.072), (4, -0.042), (5, 0.007), (6, 0.015), (7, 0.019), (8, 0.04), (9, 0.028), (10, -0.034), (11, 0.02), (12, -0.066), (13, -0.053), (14, 0.009), (15, 0.001), (16, 0.033), (17, -0.044), (18, 0.189), (19, 0.189), (20, 0.067), (21, -0.008), (22, -0.095), (23, 0.071), (24, 0.036), (25, 0.002), (26, -0.014), (27, 0.052), (28, 0.092), (29, 0.01), (30, 0.006), (31, 0.028), (32, 0.021), (33, -0.026), (34, 0.018), (35, -0.065), (36, 0.002), (37, -0.113), (38, 0.026), (39, -0.026), (40, 0.026), (41, 0.027), (42, 0.058), (43, -0.008), (44, -0.039), (45, -0.017), (46, -0.049), (47, 0.092), (48, -0.022), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.93891013 66 acl-2011-Chinese sentence segmentation as comma classification
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
2 0.83189315 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1 (Figure 1: Example of a word with internal structure). Without a doubt, there is complete agreement on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more consistent and there could be less duplication of efforts in developing the expensive annotated corpus. (Footnote 1: Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, the context will always make it clear what is being referred to with the term “word”.) The second reason is that applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example.
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the latter finer-grained output is preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed of three coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’. Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’. With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2: Example of telescopic compound (a) and separable word (b)). The last reason why we should care about word structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-of-vocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in the Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
3 0.75236386 192 acl-2011-Language-Independent Parsing with Empty Elements
Author: Shu Cai ; David Chiang ; Yoav Goldberg
Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.
4 0.71349007 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Author: Weiwei Sun
Abstract: The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based segmenter, one character-based segmenter and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is efficient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.
5 0.68808603 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru
Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54,309,334 error-correction pairs from a real-world data set that contains 2,277,786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors.
6 0.65232891 28 acl-2011-A Statistical Tree Annotator and Its Applications
7 0.62378174 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
8 0.5986191 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
9 0.58867377 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
10 0.58126092 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
11 0.57216364 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
12 0.55996269 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
13 0.48578766 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
14 0.46296063 238 acl-2011-P11-2093 k2opt.pdf
15 0.42129636 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
16 0.40907186 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
17 0.39085394 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
19 0.36342239 330 acl-2011-Using Derivation Trees for Treebank Error Detection
20 0.3553496 115 acl-2011-Engkoo: Mining the Web for Language Learning
topicId topicWeight
[(5, 0.031), (17, 0.058), (37, 0.097), (39, 0.104), (41, 0.039), (53, 0.286), (55, 0.02), (59, 0.034), (72, 0.044), (91, 0.036), (96, 0.145), (97, 0.014)]
simIndex simValue paperId paperTitle
1 0.84202731 159 acl-2011-Identifying Noun Product Features that Imply Opinions
Author: Lei Zhang ; Bing Liu
Abstract: Identifying domain-dependent opinion words is a key problem in opinion mining and has been studied by several researchers. However, existing work has been focused on adjectives and to some extent verbs. Limited work has been done on nouns and noun phrases. In our work, we used the feature-based opinion mining model, and we found that in some domains nouns and noun phrases that indicate product features may also imply opinions. In many such cases, these nouns are not subjective but objective. Their involved sentences are also objective sentences and imply positive or negative opinions. Identifying such nouns and noun phrases and their polarities is very challenging but critical for effective opinion mining in these domains. To the best of our knowledge, this problem has not been studied in the literature. This paper proposes a method to deal with the problem. Experimental results based on real-life datasets show promising results.
2 0.8284235 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
3 0.80815744 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Author: Dipanjan Das ; Slav Petrov
Abstract: We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resource-poor languages. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (Berg-Kirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4% over a state-of-the-art baseline, and 16.7% over vanilla hidden Markov models induced with the Expectation Maximization algorithm.
same-paper 4 0.7987116 66 acl-2011-Chinese sentence segmentation as comma classification
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
5 0.71414602 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
Author: Qin Gao ; Stephan Vogel
Abstract: We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. Experimental results on Chinese-English machine translation tasks show an average improvement of 0.45 BLEU and 1.22 TER points across 5 different NIST test sets.
6 0.69215083 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
7 0.64488435 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
8 0.64044261 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models
9 0.63555831 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
10 0.63232684 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
11 0.6258527 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
12 0.62301219 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
13 0.61915886 136 acl-2011-Finding Deceptive Opinion Spam by Any Stretch of the Imagination
14 0.61850929 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates
15 0.61094099 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
16 0.60538876 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
17 0.60528398 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
18 0.6011644 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
19 0.59830654 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
20 0.59631586 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation