acl acl2011 acl2011-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Word is usually adopted as the smallest unit in most tasks of Chinese language processing. [sent-5, score-0.137]
2 However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. [sent-6, score-0.414]
3 So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. [sent-7, score-0.088]
4 In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. [sent-8, score-0.954]
5 Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. [sent-9, score-0.587]
6 Our analysis suggests several key reasons behind this finding. [sent-10, score-0.029]
7 1 Introduction White space serves as the word delimiter in Latin alphabet-based languages. [sent-11, score-0.054]
8 However, in written Chinese text, there is no word delimiter. [sent-12, score-0.028]
9 Thus, in almost all tasks of Chinese natural language processing (NLP), the first step is to segment a Chinese sentence into a sequence of words. [sent-13, score-0.07]
10 This is the task of Chinese word segmentation (CWS), an important and challenging task in Chinese NLP. [sent-14, score-0.194]
11 Some linguists believe that the word (containing at least one character) is the appropriate unit for Chinese language processing. [sent-15, score-0.073]
12 When treating CWS as a standalone NLP task, the goal is to segment a sentence into words so that the segmentation matches the human gold-standard segmentation with the highest F-measure, but without considering the performance of the end-to-end NLP application that uses the segmentation output. [sent-16, score-0.628]
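As a hedged illustration of matching a segmentation against the human gold standard, the sketch below scores a hypothesis by word-span F-measure; the helper names and the toy example are assumptions for illustration, not from the paper.

```python
# Minimal sketch of segmentation F-measure against a gold standard:
# each segmentation is converted to (start, end) character spans and
# the span sets are compared.

def spans(tokens):
    """Map a token sequence to the set of (start, end) spans it induces."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def segmentation_f1(pred_tokens, gold_tokens):
    pred, gold = spans(pred_tokens), spans(gold_tokens)
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# The 2-word hypothesis matches 1 of the 3 gold words exactly: F1 = 0.4.
print(segmentation_f1(["这些", "雨伞"], ["这些", "雨", "伞"]))
```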
13 However, in an end-to-end NLP application such as statistical machine translation (SMT), it can happen that the most accurate word segmentation as judged by the human gold-standard segmentation may not produce the best translation output (Zhang et al., 2008). [sent-20, score-0.949]
14 While state-of-the-art Chinese word segmenters achieve high accuracy, some errors still remain. [sent-22, score-0.028]
15 An earlier study (2004) found that an SMT system that relied on characters (without CWS) performed slightly worse than when it used segmented words. [sent-25, score-0.246]
16 It has been recognized that varying segmentation granularities are needed for SMT (Chang et al., 2008). [sent-26, score-0.166]
17 To evaluate the quality of Chinese translation output, the International Workshop on Spoken Language Translation in 2005 (IWSLT'2005) used the word-level BLEU metric (Papineni et al., 2002). [sent-28, score-0.309]
18 However, IWSLT'08 and NIST'08 adopted character-level evaluation metrics to rank the submitted systems. [sent-30, score-0.421]
19 Although there is much work on automatic evaluation of machine translation (MT), whether word or character is more suitable for automatic evaluation of Chinese translation output has not been systematically investigated. [sent-31, score-0.799]
20 In this paper, we utilize various machine translation evaluation metrics to evaluate the quality of Chinese translation output, and compare their correlation with human assessment when the Chinese translation output is segmented into words versus characters. [sent-32, score-1.605]
21 Since there are several CWS tools that can segment Chinese sentences into words and their segmentation results are different, we use four representative CWS tools in our experiments. [sent-33, score-0.334]
22 Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. [sent-34, score-0.028]
24 That is, CWS is not essential for automatic evaluation of Chinese translation output. [sent-37, score-0.327]
25 Our analysis suggests several key reasons behind this finding. [sent-38, score-0.029]
26 2 Chinese Translation Evaluation Automatic MT evaluation aims at formulating au- tomatic metrics to measure the quality of MT output. [sent-39, score-0.347]
27 Compared with human assessment, automatic evaluation metrics can assess the quality of MT output quickly and objectively without much human labor. [sent-40, score-0.574]
28 An example showing an MT system translation and multiple reference translations segmented into characters or words (Figure 1). [sent-42, score-0.649]
29 To evaluate English translation output, automatic MT evaluation metrics take an English word as the smallest unit when matching a system translation and a reference translation. [sent-43, score-1.094]
30 On the other hand, to evaluate Chinese translation output, the smallest unit to use in matching can be a Chinese word or a Chinese character. [sent-44, score-0.391]
31 Given the source English sentence, a Chinese system translation (or a reference translation) can be segmented into characters (Figure 1(a)) or words (Figure 1(b)). [sent-46, score-0.582]
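The two granularities can be made concrete with a small sketch: a word-segmented sentence is either kept as word tokens or flattened into character tokens before matching. The sample sentence and function name are illustrative.

```python
# Sketch of the two matching granularities: keep the segmenter's word
# tokens, or flatten them into single-character tokens.

def to_characters(segmented_words):
    """Flatten a word-segmented sentence into character tokens."""
    return [ch for word in segmented_words for ch in word]

words = ["这些", "雨伞", "多少", "钱", "?"]
print(words)                 # word-level units: 5 tokens
print(to_characters(words))  # character-level units: 8 tokens
```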
32 A variety of automatic MT evaluation metrics have been developed over the years, including BLEU (Papineni et al., 2002), among others. [sent-47, score-0.367]
33 Some automatic MT evaluation metrics perform deeper linguistic analysis, such as part-of-speech tagging, synonym matching, semantic role labeling, etc. [sent-51, score-0.452]
34 Since part-of-speech tags are only defined for Chinese words and not for Chinese characters, we restrict the automatic MT evaluation metrics explored in this paper to those metrics listed above which do not require part-of-speech tagging. [sent-52, score-0.645]
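As a concrete sketch of such a metric, the following add-one-smoothed sentence-level BLEU operates on whatever tokens it is given, so the same code scores either word-level or character-level units. This is a simplified illustrative variant, not the exact BLEU implementation used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, refs, max_n=4):
    """Smoothed sentence-level BLEU over arbitrary token sequences."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its maximum reference count.
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[g]) for g, cnt in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing keeps the score defined for short sentences.
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = min(1.0, math.exp(1 - ref_len / max(len(hyp), 1)))
    return bp * math.exp(log_prec)
```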
35 3 CWS Tools Since there are a number of CWS tools and they give different segmentation results in general, we experimented with four different CWS tools in this paper. [sent-53, score-0.264]
36 The segmentation standard adopted in this paper is CTB (Chinese Treebank). [sent-59, score-0.209]
37 Stanford Chinese word segmenter (STANFORD): The Stanford Chinese word segmenter is another well-known CWS tool (Tseng et al., 2005). [sent-60, score-0.337]
38 The version we used was released on 2008-05-21 and the standard adopted is CTB. [sent-62, score-0.043]
39 Urheen: Urheen is a CWS tool developed by Wang et al. [sent-63, score-0.027]
40 4 Experimental Results 4.1 Data To compare the word-level automatic MT evaluation metrics with the character-level metrics, we conducted experiments on two datasets: one in the spoken language translation domain and one in the newswire translation domain. [sent-69, score-0.843]
41 The IWSLT'08 English-to-Chinese ASR challenge task evaluated the translation quality of 7 machine translation systems (Paul, 2008). [sent-70, score-0.51]
42 The test set contained 300 segments with human assessment of system translation quality. [sent-71, score-0.467]
43 Human assessment of translation quality was carried out on the fluency and adequacy of the translations, as well as by assigning a rank to the output of each system. [sent-73, score-0.694]
44 For the rank judgment, human graders were asked to "rank each whole sentence translation from best to worst relative to the other choices" (Paul, 2008). [sent-74, score-0.391]
45 Due to the high manual cost, the fluency and adequacy assessment was limited to the output of 4 submitted systems, while the human rank assessment was applied to all 7 systems. [sent-75, score-0.694]
46 Tables 1 and 2 show the segment-level consistency or correlation between human judgments and automatic metrics. [sent-77, score-0.36]
47 The “Character” row shows the segment-level consistency or correlation between human judgments and automatic metrics after the system and reference translations are segmented into characters. [sent-78, score-0.942]
48 The “ICTCLAS”, “NUS”, “STANFORD”, and “Urheen” rows show the scores when the system and reference translations are segmented into words by the respective Chinese word segmenters. [sent-79, score-0.332]
49 The character-level metrics outperform the best word-level metrics by 2−5% on the IWSLT’08 CT-EC task, and 4−13% on the NIST’08 EC task. [sent-80, score-0.556]
50 Evaluation results based on fluency and adequacy judgment also agree with the results on human rank assessment, but are not included in this paper. [sent-82, score-0.111]
51 [Residue of a results table (columns: BLEU, NIST, METEOR, GTM, 1−TER; rows: Character, ICTCLAS, NUS, STANFORD, Urheen); the numeric scores were lost in extraction.] [sent-87, score-0.065]
52 We asked native speakers of Chinese to perform fluency and adequacy judgment on a five-point scale. [sent-112, score-0.026]
53 Human assessment was done on the first 30 documents (355 segments). [sent-113, score-0.028]
54 The 11 submitted Chinese system translations were scored manually. [sent-131, score-0.043]
55 The scoring of the translations of each segment is the same as that used in (Callison-Burch et al., 2007). [sent-136, score-0.137]
56 The adequacy score indicates the overlap of the meaning expressed in the reference translations with a system translation, while the fluency score indicates how fluent a system translation is. [sent-138, score-0.538]
57 4.2 Segment-Level Consistency or Correlation For human fluency and adequacy judgments, the Pearson correlation coefficient is used to compute the segment-level correlation between human judgments and automatic metrics. [sent-140, score-0.528]
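A minimal sketch of the Pearson computation follows; all scores in the example are illustrative, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between metric scores and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. segment-level metric scores vs. averaged adequacy judgments
print(pearson([0.31, 0.55, 0.42, 0.70], [2.0, 4.0, 3.0, 5.0]))
```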
58 4.3 System-Level Correlation We measure correlation at the system level using Spearman's rank correlation coefficient. [sent-148, score-0.311]
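A corresponding sketch of Spearman's rank correlation via the closed form 1 − 6Σd²/(n(n²−1)), assuming no tied scores; the seven system-level scores below are illustrative, not the paper's numbers.

```python
def spearman(metric_scores, human_scores):
    """Spearman's rank correlation; assumes no tied values, which holds
    for a handful of distinct system-level scores."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(metric_scores), ranks(human_scores)
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# 7 systems, as in the IWSLT'08 CT-EC task (illustrative scores):
print(spearman([0.21, 0.34, 0.28, 0.40, 0.19, 0.25, 0.31],
               [3.1, 3.9, 3.4, 4.2, 2.8, 3.0, 3.7]))
```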
59 The system-level correlations of word-level metrics and character-level metrics are summarized in Tables 3 and 4. [sent-149, score-0.584]
60 Human rank judgment is not an absolute score and thus the Pearson correlation coefficient cannot be used. [sent-150, score-0.338]
61 We instead calculate segment-level consistency as follows: the fraction of pair-wise system comparisons on which an automatic metric agrees with the human rank judgment. [sent-151, score-0.266]
62 The total number of pair-wise comparisons is taken over the manually assessed translations of the submitted MT systems. [sent-152, score-0.556]
63 Because there are only 7 systems that have human assessment in the IWSLT’08 CT-EC task, the gap between character-level metrics and word-level metrics is very small. [sent-153, score-0.346]
64 However, it still shows that character-level metrics perform no worse than word-level metrics. [sent-154, score-0.102]
65 For the NIST’08 EC task, except for the GTM metric, character-level metrics outperform word-level metrics. [sent-155, score-0.278]
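The consistency measure in 61 can be read as a pairwise-agreement count; a minimal sketch follows, assuming ties in the human ranking are skipped (the exact convention used in the paper is not recoverable here, and the function names are illustrative).

```python
from itertools import combinations

def segment_consistency(metric_scores, human_ranks):
    """Fraction of pairwise system comparisons on which the metric
    agrees with the human ranking (lower human rank = better system)."""
    agree = total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # skip ties in the human judgment
        total += 1
        metric_says_i = metric_scores[i] > metric_scores[j]
        human_says_i = human_ranks[i] < human_ranks[j]
        if metric_says_i == human_says_i:
            agree += 1
    return agree / total if total else 0.0
```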
66 For BLEU and TER, character-level metrics yield up to 6−9% improvement over word-level metrics. [sent-156, score-0.278]
67 It shows that character-level metrics reduce about 2−3 erroneous system rankings. [sent-157, score-0.12]
68 When the number of systems increases, the difference between the character-level metrics and word-level metrics will become larger. [sent-159, score-0.064]
69 The statistics on the output translations show that “exact match” accounts for 71% (29/41) and “no match” only accounts for 7% (3/41). [sent-160, score-0.06]
70 This means that words that share some common characters are usually semantically related. [sent-161, score-0.663]
71 Therefore, character-level metrics do a better job at matching Chinese translations. [sent-163, score-0.309]
72 [Unrecoverable extraction residue: rows of Tables 3 and 4 (ICTCLAS, NUS, STANFORD, Urheen) interleaved with scrambled body text.] [sent-178, score-0.028]
73 5 Analysis We have analyzed the reasons why character-level metrics better correlate with human assessment than word-level metrics. [sent-201, score-0.588]
74 Compared to word-level metrics, character-level metrics can capture more synonym matches. [sent-202, score-0.363]
75 For example, Figure 1 gives the system translation and a reference translation segmented into words (Reference: 这些 雨伞 多少 钱 ?); the word “伞” in the system translation is a synonym for the word “雨伞” in the reference, and both words are translations of the English word “umbrella”. [sent-203, score-0.949]
76 If a word-level metric is used, the word “伞” in the system translation will not match the word “雨伞” in the reference translation. [sent-204, score-0.466]
77 However, if the system and reference translation are segmented into characters, the word “伞” in the system translation shares the same character “伞” with the word “雨伞” in the reference. [sent-205, score-0.833]
78 We can classify the semantic relationships of words that share some common characters into three types: exact match, partial match, and no match. In the example 您_在_京_都_做_什么_?, the word “京都” is the Chinese translation of the English word “Kyoto”, and it is segmented as one word in the system translation. [sent-207, score-0.401]
79 However, it is segmented into two words, “京” and “都”, in the reference translation by the same CWS tool. [sent-208, score-0.475]
80 When this happens, a word-level metric will fail to match them in the system and reference translation. [sent-209, score-0.172]
81 While the accuracy of state-of-the-art CWS tools is high, segmentation errors still exist and can cause such mismatches. [sent-210, score-0.215]
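The three-way classification above can be sketched as a simple character-overlap test; the function name and examples are illustrative assumptions.

```python
def match_type(hyp_word, ref_word):
    """Classify a hypothesis/reference word pair by shared characters:
    'exact' (same word), 'partial' (some characters shared), or 'no'
    (disjoint)."""
    if hyp_word == ref_word:
        return "exact"
    if set(hyp_word) & set(ref_word):
        return "partial"
    return "no"

print(match_type("伞", "雨伞"))    # partial: shares the character 伞
print(match_type("京都", "京都"))  # exact
```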
82 To summarize, character-level metrics can capture more synonym matches and the resulting segmentation into characters is guaranteed to be consistent, which makes character-level metrics more suitable for the automatic evaluation of Chinese translation output. [sent-211, score-1.241]
83 6 Conclusion In this paper, we conducted a detailed study of the relative merits of word-level versus character-level metrics in the automatic evaluation of Chinese translation output. [sent-212, score-0.605]
84 Our experimental results have shown that character-level metrics correlate better with human assessment than word-level metrics. [sent-213, score-0.559]
85 Thus, CWS is not needed for automatic evaluation of Chinese translation output. [sent-214, score-0.327]
86 Our study provides the needed justification for the use of character-level metrics in evaluating SMT systems in which Chinese is the target language. [sent-215, score-0.342]
wordName wordTfidf (topN-words)
[('chinese', 0.407), ('cws', 0.388), ('metrics', 0.278), ('translation', 0.238), ('urheen', 0.191), ('assessment', 0.169), ('ictclas', 0.168), ('segmentation', 0.166), ('nus', 0.161), ('iwslt', 0.149), ('segmented', 0.139), ('segmenter', 0.127), ('correlation', 0.123), ('sighan', 0.115), ('characters', 0.107), ('nist', 0.105), ('gtm', 0.103), ('mt', 0.101), ('reference', 0.098), ('bakeoff', 0.096), ('meteor', 0.089), ('synonym', 0.085), ('chengqing', 0.084), ('kun', 0.078), ('zong', 0.078), ('adequacy', 0.073), ('segment', 0.07), ('bleu', 0.07), ('consistency', 0.068), ('stanford', 0.067), ('translations', 0.067), ('rank', 0.065), ('character', 0.064), ('characterlevel', 0.064), ('trics', 0.064), ('fluency', 0.062), ('human', 0.06), ('makhoul', 0.056), ('tseng', 0.056), ('judgments', 0.055), ('automatic', 0.054), ('output', 0.053), ('correlate', 0.052), ('judgment', 0.051), ('hwee', 0.05), ('tools', 0.049), ('smallest', 0.049), ('kiat', 0.049), ('tou', 0.048), ('jeju', 0.046), ('smt', 0.046), ('snover', 0.046), ('unit', 0.045), ('adopted', 0.043), ('submitted', 0.043), ('melamed', 0.042), ('island', 0.042), ('ec', 0.042), ('doddington', 0.04), ('jin', 0.039), ('ng', 0.038), ('banerjee', 0.038), ('metric', 0.037), ('match', 0.037), ('ceo', 0.037), ('academy', 0.037), ('zhang', 0.036), ('evaluation', 0.035), ('wang', 0.035), ('papineni', 0.034), ('quality', 0.034), ('chang', 0.034), ('beijing', 0.031), ('paul', 0.031), ('matching', 0.031), ('pearson', 0.031), ('accounts', 0.03), ('lavie', 0.03), ('reasons', 0.029), ('correlations', 0.028), ('word', 0.028), ('aann', 0.028), ('adl', 0.028), ('graders', 0.028), ('huihsin', 0.028), ('keiji', 0.028), ('maoxi', 0.028), ('pfa', 0.028), ('pichuan', 0.028), ('yasuda', 0.028), ('reveal', 0.028), ('tool', 0.027), ('coefficient', 0.027), ('barcelona', 0.026), ('ter', 0.026), ('afl', 0.026), ('ruiqiang', 0.026), ('csidm', 0.026), ('dahlmeier', 0.026), ('delimiter', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding.
2 0.25295511 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1
Author: Chi-kiu Lo ; Dekai Wu
Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacyjudgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1
4 0.20546448 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003). Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1. [Figure 1: Example of a word with internal structure; an NN node dominating JJf (䉂) and NNf (擌奒).] Without a doubt, there is complete agreement on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more consistent and there could be less duplication of efforts in developing the expensive annotated corpus. (Footnote 1: Here it is necessary to add a note on terminology used in this paper. Since there is no universally accepted definition of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂 擌奒 ‘vice president’ whose structure is shown as the tree in Figure 1, or we might mean a smaller unit such as 擌奒 ‘president’ which is a substructure of that tree. Hopefully, the context will always make it clear what is being referred to with the term “word”.) The second reason is that applications have different requirements for granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. 
It’s considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the latter finer-grained output is more preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree. A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed of three coordinating elements 㦌惆 ‘university’, 撥 惆 ‘middle school’ and 怂惆 ‘primary school’. Regarding it as one flat word loses this important information. Another example is separable words like 扩 扙 ‘swim’. With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees. [Figure 2: Example of telescopic compound (a) and separable word (b); the tree diagrams are not recoverable from the extraction.] The last reason why we should care about word structures is related to head driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head driven models in which out-of-vocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
5 0.18188195 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
Author: Xiaojun Wan
Abstract: Cross-language document summarization is defined as the task of producing a summary in a target language (e.g. Chinese) for a set of documents in a source language (e.g. English). Existing methods for addressing this task make use of either the information from the original documents in the source language or the information from the translated documents in the target language. In this study, we propose to use the bilingual information from both the source and translated documents for this task. Two summarization methods (SimFusion and CoRank) are proposed to leverage the bilingual information in the graph-based ranking framework for cross-language summary extraction. Experimental results on the DUC2001 dataset with manually translated reference Chinese summaries show the effectiveness of the proposed methods. 1
6 0.17617945 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
7 0.16589615 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
8 0.16517127 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
9 0.15988255 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
10 0.1445553 264 acl-2011-Reordering Metrics for MT
11 0.13717082 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
12 0.13296498 66 acl-2011-Chinese sentence segmentation as comma classification
13 0.13231559 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
14 0.13099717 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
15 0.12746441 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
16 0.12691043 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
17 0.10754807 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
18 0.10748096 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
19 0.10746945 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
20 0.10689507 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
topicId topicWeight
[(0, 0.216), (1, -0.148), (2, 0.109), (3, 0.126), (4, -0.01), (5, 0.048), (6, 0.072), (7, -0.003), (8, 0.138), (9, 0.052), (10, -0.031), (11, -0.081), (12, -0.099), (13, -0.224), (14, -0.075), (15, 0.045), (16, -0.016), (17, -0.136), (18, 0.229), (19, 0.229), (20, 0.118), (21, 0.074), (22, -0.072), (23, 0.043), (24, 0.006), (25, -0.015), (26, -0.089), (27, -0.0), (28, 0.116), (29, 0.072), (30, -0.069), (31, 0.027), (32, -0.036), (33, 0.129), (34, -0.035), (35, 0.007), (36, 0.067), (37, 0.008), (38, 0.057), (39, 0.009), (40, 0.088), (41, -0.027), (42, 0.05), (43, 0.103), (44, -0.001), (45, 0.056), (46, -0.011), (47, 0.073), (48, 0.083), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.97542161 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with characterlevel metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding. 1
2 0.71625203 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1
3 0.65700001 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru
Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54,309,334 error-correction pairs from a real-world data set that contains 2,277,786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors.
Author: Chi-kiu Lo ; Dekai Wu
Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacyjudgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1
5 0.65344304 66 acl-2011-Chinese sentence segmentation as comma classification
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
6 0.62541175 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
7 0.60823441 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
8 0.60547996 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
9 0.60143447 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
10 0.58898038 264 acl-2011-Reordering Metrics for MT
11 0.58565801 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
12 0.55111718 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
13 0.54798317 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
14 0.53971529 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
15 0.53928685 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
16 0.51852155 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
17 0.49235135 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
18 0.48835722 151 acl-2011-Hindi to Punjabi Machine Translation System
19 0.48694122 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
20 0.48055997 313 acl-2011-Two Easy Improvements to Lexical Weighting
topicId topicWeight
[(5, 0.019), (17, 0.014), (26, 0.011), (31, 0.016), (37, 0.052), (39, 0.047), (41, 0.032), (59, 0.014), (72, 0.026), (91, 0.018), (96, 0.671)]
simIndex simValue paperId paperTitle
same-paper 1 0.99889219 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with characterlevel metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding. 1
2 0.99786019 25 acl-2011-A Simple Measure to Assess Non-response
Author: Anselmo Penas ; Alvaro Rodrigo
Abstract: There are several tasks where is preferable not responding than responding incorrectly. This idea is not new, but despite several previous attempts there isn’t a commonly accepted measure to assess non-response. We study here an extension of accuracy measure with this feature and a very easy to understand interpretation. The measure proposed (c@1) has a good balance of discrimination power, stability and sensitivity properties. We show also how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
3 0.99576575 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the gen- eralizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
4 0.99549967 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
Author: Daniel Emilio Beck
Abstract: In this paper I present a Master’s thesis proposal in syntax-based Statistical Machine Translation. I propose to build discriminative SMT models using both tree-to-string and tree-to-tree approaches. Translation and language models will be represented mainly through the use of Tree Automata and Tree Transducers. These formalisms have important representational properties that make them well-suited for syntax modeling. I also present an experiment plan to evaluate these models through the use of a parallel corpus written in English and Brazilian Portuguese.
5 0.99461538 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
Author: Delip Rao ; David Yarowsky
Abstract: This paper presents an original approach to semi-supervised learning of personal name ethnicity from typed graphs of morphophonemic features and first/last-name co-occurrence statistics. We frame this as a general solution to an inference problem over typed graphs where the edges represent labeled relations between features that are parameterized by the edge types. We propose a framework for parameter estimation on different constructions of typed graphs for this problem using a gradient-free optimization method based on grid search. Results on both in-domain and out-of-domain data show significant gains over 30% accuracy improvement using the techniques presented in the paper.
7 0.99297345 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
8 0.9889189 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
9 0.98451591 82 acl-2011-Content Models with Attitude
10 0.97144538 41 acl-2011-An Interactive Machine Translation System with Online Learning
11 0.9690423 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
12 0.95935524 266 acl-2011-Reordering with Source Language Collocations
13 0.95093673 264 acl-2011-Reordering Metrics for MT
14 0.94592834 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
15 0.94501376 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
16 0.94383818 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
17 0.94223911 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
18 0.93921775 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
19 0.93730885 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
20 0.93530005 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages