acl acl2011 acl2011-152 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state-of-the-art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed that the benefit of improved alignment becomes smaller with more training data, implying that the above limit also holds for large training conditions.
Reference: text
sentIndex sentText sentNum sentScore
1 Jinxi Xu and Jinying Chen, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA, {jxu, jchen}@bbn.com [sent-2, score-0.057]
2 Abstract Word alignment is a central problem in statistical machine translation (SMT). [sent-3, score-0.737]
3 In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. [sent-4, score-1.348]
4 The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. [sent-5, score-0.719]
5 Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). [sent-6, score-0.181]
6 We treated the human alignment as the oracle of supervised alignment. [sent-7, score-0.805]
7 The result is surprising: the gain of human alignment over a state-of-the-art unsupervised method (GIZA++) is less than 1 point in BLEU. [sent-8, score-0.921]
8 Furthermore, we showed that the benefit of improved alignment becomes smaller with more training data, implying that the above limit also holds for large training conditions. [sent-9, score-0.614]
9 1 Introduction Word alignment is a central problem in statistical machine translation (SMT). [sent-10, score-0.737]
10 A recent trend in this area of research is to exploit supervised learning to improve alignment accuracy by mimicking human alignment. [sent-11, score-0.836]
11 The objective of this work is to explore the performance limit of supervised word alignment. [sent-15, score-0.202]
12 More specifically, we would like to know what magnitude of gain in MT performance we can expect from supervised alignment over state-of-the-art unsupervised alignment if we have access to a large amount of parallel data. [sent-16, score-1.639]
13 Since alignment errors have been assumed to be a major hindrance to good MT, an answer to such a question might help us find new directions in MT research. [sent-17, score-0.594]
14 Our method is to use human alignment as the oracle of supervised learning and compare its performance against that of GIZA++ (Och and Ney, 2003), a state-of-the-art unsupervised aligner. [sent-18, score-1.066]
15 Our study was based on a manually aligned Chinese-English corpus (Li, 2009) with 280K word tokens. [sent-19, score-0.216]
16 Such a study has been previously impossible due to the lack of a hand-aligned corpus of sufficient size. [sent-20, score-0.05]
17 To our surprise, the gain in MT performance using human alignment is very small, less than 1 point in BLEU. [sent-21, score-0.728]
18 Furthermore, our diagnostic experiments indicate that the result is not an artifact of small training size since alignment errors are less harmful with more data. [sent-22, score-0.922]
19 We would like to stress that our result does not mean we should discontinue research in improving word alignment. [sent-23, score-0.035]
20 Rather it shows that current translation models, of which the string-to-tree model (Shen et al. [sent-24, score-0.148]
21 , 2008) used in this work is an example, cannot fully utilize super-accurate word alignment. [sent-25, score-0.035]
22 In order to significantly improve MT quality we need to improve both word alignment and the translation model. [sent-26, score-0.791]
23 In fact, we found that some of the information in the LDC hand-aligned corpus that might be useful for resolving certain translation ambiguities (e.g. [sent-27, score-0.25]
24 verb tense, pronoun coreferences and modifier-head relations) is even harmful to the system used in this work. [sent-29, score-0.134]
25 2.1 Description of MT System We used a state-of-the-art hierarchical decoder in our experiments. [sent-33, score-0.102]
26 The system exploits a string-to-tree translation model, as described by Shen et al. [sent-34, score-0.148]
27 It uses a small set of linguistic and contextual features, such as word translation probabilities, rule translation probabilities, language model scores, and target side dependency scores, to rank translation hypotheses. [sent-36, score-0.556]
28 In addition, it uses a large number of discriminatively tuned features, which were inspired by Chiang et al. [sent-37, score-0.053]
29 context dependent word translation probabilities and discriminative word pairs, are motivated in part to discount bad translation rules caused by noisy word alignment. [sent-41, score-0.564]
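As a rough illustration of how such features are combined to rank translation hypotheses, the sketch below scores a hypothesis as a weighted sum of its feature values. This is a generic log-linear scoring step, not the BBN decoder's actual interface; the feature names, weights and values are invented for the example.

```python
def score_hypothesis(features, weights):
    """Rank score of one translation hypothesis: a weighted sum of its feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Illustrative (made-up) weights and log-domain feature values.
weights = {"lm": 0.5, "word_trans_prob": 0.3, "rule_trans_prob": 0.3, "target_dep": 0.2}
hypothesis = {"lm": -12.4, "word_trans_prob": -3.1, "rule_trans_prob": -2.7, "target_dep": -1.9}
print(score_hypothesis(hypothesis, weights))  # -8.32
```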
30 Both LMs were trained on about 9 billion words of English text. [sent-43, score-0.051]
31 2.2 Hand Aligned Corpus The hand-aligned corpus we used is LDC2010E63, which has around 280K words (English side). [sent-52, score-0.225]
32 This corpus was annotated with alignment links between Chinese characters and English words. [sent-53, score-0.848]
33 Since the MT system used in this work is word-based, we converted the character-based alignment to word-based alignment. [sent-54, score-0.585]
34 We aligned Chinese word s to English word t if and only if s contains a character c that was aligned to t in the LDC annotation. [sent-55, score-0.332]
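The conversion rule just described is simple enough to sketch directly. The snippet below is a minimal illustration (not the authors' code); it assumes character-level links are given as (Chinese character index, English word index) pairs and that a separate mapping records which Chinese word each character belongs to.

```python
def char_to_word_alignment(char_links, char_to_word):
    """Convert character-based links to word-based links.

    char_links   : set of (chinese_char_idx, english_word_idx) pairs
    char_to_word : list mapping each Chinese character index to the index
                   of the Chinese word containing it
    Returns (chinese_word_idx, english_word_idx) pairs: word s is aligned to
    English word t iff s contains a character aligned to t in the annotation.
    """
    return {(char_to_word[c], t) for c, t in char_links}

# Toy usage: characters 0-1 form Chinese word 0, character 2 forms word 1.
print(char_to_word_alignment({(0, 3), (1, 3), (2, 0)}, [0, 0, 1]))
# {(0, 3), (1, 0)} (set order may vary)
```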
35 A unique feature of the LDC annotation is that it contains information beyond simple word correspondences. [sent-56, score-0.035]
36 Some links, called special links in this work, provide contextual information to resolve ambiguities in tense, pronoun co-reference, modifier-head relation and so forth. [sent-57, score-0.423]
37 The special links are similar to the so-called possible links described in other studies (Och and Ney, 2003; Fraser and Marcu, 2007), but are not identical. [sent-58, score-0.621]
38 While such links are useful for making high-level inferences, they cannot be effectively exploited by the translation model used in this work. [sent-59, score-0.394]
39 Worse, they can hurt its performance by hampering rule extraction. [sent-60, score-0.087]
40 Since the special links were marked with special tags to distinguish them from regular links, we can selectively remove them and check the impact on MT performance. [sent-61, score-0.416]
41 Figure 1 shows an example sentence with human alignment. [sent-62, score-0.071]
42 Solid lines indicate regular word correspondences while dashed lines indicate special links. [sent-63, score-0.198]
43 Tags inside [] indicate additional information about the function of the words connected by special links. [sent-64, score-0.124]
44 Figure 1: An example sentence pair with human alignment. Chinese: gei[OMN] ni ti gong jie shi; English: provide you with[OMN] an[DET] explanation. [sent-65, score-0.623]
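Because special links carry tags such as [OMN] or [DET] while regular links do not, removing them reduces to a simple filter. The link representation below (triples with an optional tag) is assumed for illustration and is not the LDC file format.

```python
def remove_special_links(links):
    """Keep only regular word-correspondence links; drop tagged (special) ones.

    links: iterable of (src_idx, tgt_idx, tag) triples, where tag is None for
    regular links and a string such as "OMN" or "DET" for special links.
    """
    return [(s, t) for s, t, tag in links if tag is None]

# Toy example: the two tagged links are special and are dropped.
toy_links = [(0, 2, "OMN"), (1, 1, None), (2, 0, None), (4, 3, "DET")]
print(remove_special_links(toy_links))  # [(1, 1), (2, 0)]
```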
45 2.3 Parallel Corpora and Alignment Schemes Our experiments used two parallel training corpora, aligned by alternative schemes, from which translation rules were extracted. [sent-66, score-0.318]
46 One of the chunks is the small corpus mentioned above. [sent-68, score-0.171]
47 (Footnote 1: Other data items included are LDC{2002E18, 2002L27, 2005E83, 2005T06, 2005T10, 2005T34, 2006E24, 2006E34, 2006E85, 2006E92, 2006G05, 2007E06, 2007E101, 2007E46, 2007E87, 2008E40, 2009E16, 2008E56}.) • giza-strong: Run GIZA++ on the large corpus in one large chunk. [sent-70, score-0.05]
48 Alignment for the small corpus was extracted for experiments involving the small corpus. [sent-71, score-0.204]
49 • gold-original: human alignment, including special links; • gold-clean: human alignment, excluding special links. Needless to say, gold alignment schemes do not apply to the large corpus. [sent-73, score-1.48]
50 The special links in the human alignment hurt MT (Table 2, gold-original vs. [sent-76, score-1.003]
51 In fact, with such links, human alignment is worse than unsupervised alignment (Table 2, gold-original vs. [sent-78, score-1.266]
52 After removing such links, human alignment is better than unsupervised alignment, but the gain is small, 0. [sent-80, score-0.819]
53 As expected, having access to more training data increases the quality of unsupervised alignment (Table 1) and as a result the MT performance (Table 2, giza-strong vs. [sent-83, score-0.801]
54 Table 1: Precision, recall and F score of different alignment schemes. [sent-88, score-0.552]
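The precision, recall and F scores in Table 1 compare each scheme's links against the gold links. Below is a minimal sketch of these metrics over sets of (source index, target index) link pairs; this is the standard set-based definition, not necessarily the exact evaluation script used in the paper.

```python
def alignment_prf(candidate, gold):
    """Precision, recall and F1 of candidate alignment links against gold links."""
    candidate, gold = set(candidate), set(gold)
    correct = len(candidate & gold)
    p = correct / len(candidate) if candidate else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: two of the three proposed links match the gold alignment.
print(alignment_prf({(0, 0), (1, 2), (2, 1)}, {(0, 0), (1, 1), (2, 1)}))
# (0.666..., 0.666..., 0.666...)
```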
55 Table 2: MT results (lower case) on small corpus. It is interesting to note that from giza-weak to giza-strong, alignment accuracy improves by 15% and the BLEU score improves by 3. [sent-92, score-0.747]
56 In comparison, from giza-strong to gold-clean, alignment accuracy improves by 24% but BLEU score only improves by 0. [sent-94, score-0.62]
57 This anomaly can be partly explained by the inherent ambiguity of word alignment. [sent-96, score-0.035]
58 For example, Melamed (1998) reported inter-annotator agreement for human alignments in the 80% range. [sent-97, score-0.129]
59 The LDC corpus used in this work has a higher agreement, about 90% (Li et al. [sent-98, score-0.05]
60 That means much of the disagreement between giza-strong and gold alignments is probably due to arbitrariness in the gold alignment. [sent-100, score-0.16]
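Link-level agreement between two annotators can be quantified in the same spirit, for example as a symmetric F-style overlap of their link sets. The convention below is only one plausible choice; Melamed (1998) and Li et al. may define agreement differently.

```python
def link_agreement(links_a, links_b):
    """Symmetric (Dice/F1-style) agreement between two annotators' link sets."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Two annotators agree on two of their three links each.
print(link_agreement({(0, 0), (1, 1), (2, 3)}, {(0, 0), (1, 1), (2, 2)}))  # 0.666...
```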
61 3.2 Results on Large Corpus As discussed before, the gain using human alignment over GIZA++ is small on the small corpus. [sent-102, score-0.882]
62 One may wonder whether the small magnitude of the improvement is an artifact of the small size of the training corpus. [sent-103, score-0.371]
63 To dispel the above concern, we ran diagnostic experiments on the large corpus to show that with more training data, the benefit from improved alignment is less critical. [sent-104, score-0.679]
64 On the large corpus, the difference between good and poor unsupervised alignments is 2. [sent-106, score-0.149]
65 In contrast, the difference between the two schemes is larger on the small corpus, 3. [sent-109, score-0.169]
66 Since the quality of alignment of each scheme does not change with corpus size, the results indicate that alignment errors are less harmful with more training data. [sent-112, score-1.343]
67 We can therefore conclude that the small magnitude of the gain from human alignment is not an artifact of the small training size. [sent-113, score-1.063]
68 Comparing giza-strong of Table 3 with giza-strong of Table 2, we can see the difference in MT performance is about 8 points in BLEU (20. [sent-114, score-0.094]
69 This result is reasonable since the small corpus is two orders of magnitude smaller than the large corpus. [sent-118, score-0.225]
70 Table 3: MT results (lower case) on large corpus. [sent-121, score-0.05]
71 , 2009; DeNero and Klein, 2010) reported improvements greater than the limit we established using an oracle aligner. [sent-125, score-0.139]
72 First, we used more data (31M) to train GIZA++, which improved the quality of unsupervised alignment. [sent-127, score-0.147]
73 Second, some of the features in the MT system used in this work, such as context-dependent word translation probabilities and discriminatively trained penalties for certain word pairs, are designed to discount incorrect translation rules caused by alignment errors. [sent-128, score-1.172]
74 Third, the large language model (trained with 9 billion words) in our experiments further alleviated the impact of incorrect translation rules. [sent-129, score-0.237]
75 Had we used a test set with more references, the improvement in BLEU score would probably be higher. [sent-133, score-0.038]
76 An area for future work is to examine the impact of each factor on BLEU score. [sent-134, score-0.04]
77 While these factors can affect the numerical value of our result, they do not affect our main conclusion: Improving word alignment alone will not produce a breakthrough in MT quality. [sent-135, score-0.587]
78 DeNero and Klein (2010) described a technique to exploit possible links, which are similar to special links in the LDC hand-aligned data, to improve rule coverage. [sent-136, score-0.421]
79 They extracted rules with and without possible links and used the union of the extracted rules in decoding. [sent-137, score-0.324]
80 We applied the technique to the LDC hand-aligned data but got no gain in MT performance. [sent-138, score-0.28]
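Schematically, the combination amounts to a union over two rule sets extracted under different alignment variants. The sketch below abstracts rule extraction behind a caller-supplied function; extract_rules is a hypothetical stand-in, and this is not DeNero and Klein's implementation.

```python
def combined_rules(corpus, align_with_possible, align_without, extract_rules):
    """Union of translation rules extracted under two alignment variants."""
    return set(extract_rules(corpus, align_with_possible)) | set(extract_rules(corpus, align_without))

# Toy usage with a dummy extractor that treats each alignment link as a "rule".
dummy_extractor = lambda corpus, alignment: {("rule", link) for link in alignment}
print(len(combined_rules([], {(0, 0), (1, 1)}, {(0, 0)}, dummy_extractor)))  # 2
```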
81 Our work assumes that unsupervised aligners have access to a large amount of training data. [sent-139, score-0.161]
82 For language pairs with limited training, unsupervised methods do not work well. [sent-140, score-0.091]
83 In such cases, supervised methods can make a bigger difference. [sent-141, score-0.105]
84 4 Related Work The study of the relation between alignment quality and MT performance can be traced back at least as far as Och and Ney (2003). [sent-142, score-0.642]
85 A more recent study in this area is Fraser and Marcu (2007). [sent-143, score-0.04]
86 Unlike our work, neither study reported MT results using oracle alignment. [sent-144, score-0.121]
87 Recent work in supervised alignment includes Haghighi et al. [sent-145, score-0.657]
88 (2008) used a heuristic-based method to delete problematic alignment links and improve MT. [sent-149, score-0.798]
89 Li (2009) described the annotation guideline of the hand-aligned corpus (LDC2010E63) used in this work. [sent-150, score-0.263]
90 This corpus is at least an order of magnitude larger than similar corpora. [sent-151, score-0.148]
91 5 Conclusions Our experiments showed that even with human alignment, further improvement in MT quality will be small under the current SMT paradigm. [sent-153, score-0.204]
92 Our experiments also showed that certain alignment information suitable for making complex inferences can even hamper current SMT models. [sent-154, score-0.655]
93 A future direction for SMT is to develop translation models that can effectively employ such information. [sent-155, score-0.148]
94 Better word alignments with supervised ITG models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 923-931. [sent-182, score-0.198]
95 A new string-to-dependency machine translation algorithm with a target dependency language model. [sent-213, score-0.148]
wordName wordTfidf (topN-words)
[('alignment', 0.552), ('mt', 0.25), ('links', 0.246), ('ldc', 0.181), ('bleu', 0.177), ('translation', 0.148), ('denero', 0.138), ('aligned', 0.131), ('setiawan', 0.115), ('giza', 0.113), ('supervised', 0.105), ('gain', 0.105), ('smt', 0.102), ('magnitude', 0.098), ('gizastrong', 0.094), ('xuansong', 0.094), ('chineseenglish', 0.094), ('harmful', 0.094), ('schemes', 0.092), ('unsupervised', 0.091), ('special', 0.085), ('artifact', 0.083), ('gale', 0.083), ('omn', 0.083), ('haghighi', 0.081), ('small', 0.077), ('oracle', 0.077), ('diagnostic', 0.077), ('fraser', 0.074), ('human', 0.071), ('mimicking', 0.068), ('formance', 0.068), ('fossum', 0.068), ('art', 0.066), ('limit', 0.062), ('inferences', 0.061), ('alignments', 0.058), ('xu', 0.057), ('quality', 0.056), ('discount', 0.056), ('jinxi', 0.056), ('discriminatively', 0.053), ('och', 0.053), ('klein', 0.053), ('ambiguities', 0.052), ('bbn', 0.052), ('billion', 0.051), ('corpus', 0.05), ('hurt', 0.049), ('tense', 0.048), ('shen', 0.047), ('snover', 0.045), ('studies', 0.044), ('hand', 0.044), ('chunks', 0.044), ('ney', 0.042), ('moulton', 0.042), ('hamper', 0.042), ('hindrance', 0.042), ('inese', 0.042), ('kayser', 0.042), ('kazuaki', 0.042), ('needless', 0.042), ('pronoun', 0.04), ('area', 0.04), ('li', 0.04), ('rules', 0.039), ('indicate', 0.039), ('probably', 0.038), ('glish', 0.038), ('raytheon', 0.038), ('penalties', 0.038), ('alleviated', 0.038), ('jinying', 0.038), ('gei', 0.038), ('guideline', 0.038), ('hampering', 0.038), ('hendra', 0.038), ('central', 0.037), ('state', 0.036), ('wonder', 0.036), ('approved', 0.036), ('devlin', 0.036), ('aligners', 0.036), ('word', 0.035), ('caused', 0.035), ('nist', 0.035), ('traced', 0.034), ('niyu', 0.034), ('lm', 0.034), ('access', 0.034), ('improves', 0.034), ('papineni', 0.034), ('chinese', 0.033), ('probabilities', 0.033), ('wordbased', 0.033), ('chiang', 0.032), ('john', 0.032), ('marcu', 0.032), ('gold', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state-of-the-art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed that the benefit of improved alignment becomes smaller with more training data, implying that the above limit also holds for large training conditions.
2 0.2635743 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs. .t r
3 0.242347 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao
Abstract: In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentation. Our approach is based on link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations. 1
4 0.24108081 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
Author: John DeNero ; Klaus Macherey
Abstract: Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-final. This paper presents a graphical model that embeds two directional aligners into a single model. Inference can be performed via dual decomposition, which reuses the efficient inference algorithms of the directional models. Our bidirectional model enforces a one-to-one phrase constraint while accounting for the uncertainty in the underlying directional models. The resulting alignments improve upon baseline combination heuristics in word-level and phrase-level evaluations.
5 0.23648979 141 acl-2011-Gappy Phrasal Alignment By Agreement
Author: Mohit Bansal ; Chris Quirk ; Robert Moore
Abstract: We propose a principled and efficient phraseto-phrase alignment model, useful in machine translation as well as other related natural language processing problems. In a hidden semiMarkov model, word-to-phrase and phraseto-word translations are modeled directly by the system. Agreement between two directional models encourages the selection of parsimonious phrasal alignments, avoiding the overfitting commonly encountered in unsupervised training with multi-word units. Expanding the state space to include “gappy phrases” (such as French ne ? pas) makes the alignment space more symmetric; thus, it allows agreement between discontinuous alignments. The resulting system shows substantial improvements in both alignment quality and translation quality over word-based Hidden Markov Models, while maintaining asymptotically equivalent runtime.
6 0.2124085 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
7 0.20832606 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
8 0.2017062 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
9 0.18842532 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
10 0.18611142 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
11 0.18364097 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
12 0.17798842 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
13 0.17771283 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
14 0.16273759 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
16 0.15396075 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
17 0.14692508 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
18 0.1447247 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
19 0.1421897 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
20 0.1403815 340 acl-2011-Word Alignment via Submodular Maximization over Matroids
topicId topicWeight
[(0, 0.298), (1, -0.248), (2, 0.166), (3, 0.195), (4, 0.096), (5, 0.063), (6, 0.093), (7, 0.002), (8, -0.016), (9, 0.111), (10, 0.168), (11, 0.102), (12, 0.014), (13, 0.015), (14, -0.192), (15, 0.064), (16, 0.131), (17, -0.066), (18, -0.125), (19, -0.005), (20, -0.06), (21, -0.016), (22, -0.12), (23, 0.068), (24, 0.041), (25, 0.035), (26, 0.015), (27, 0.055), (28, -0.055), (29, -0.072), (30, -0.012), (31, -0.006), (32, -0.066), (33, 0.029), (34, -0.014), (35, 0.021), (36, 0.014), (37, -0.003), (38, 0.044), (39, -0.013), (40, 0.011), (41, 0.003), (42, 0.04), (43, 0.045), (44, -0.046), (45, 0.092), (46, 0.078), (47, 0.006), (48, 0.006), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.98615664 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state-of-the-art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed that the benefit of improved alignment becomes smaller with more training data, implying that the above limit also holds for large training conditions.
2 0.85609257 141 acl-2011-Gappy Phrasal Alignment By Agreement
Author: Mohit Bansal ; Chris Quirk ; Robert Moore
Abstract: We propose a principled and efficient phraseto-phrase alignment model, useful in machine translation as well as other related natural language processing problems. In a hidden semiMarkov model, word-to-phrase and phraseto-word translations are modeled directly by the system. Agreement between two directional models encourages the selection of parsimonious phrasal alignments, avoiding the overfitting commonly encountered in unsupervised training with multi-word units. Expanding the state space to include “gappy phrases” (such as French ne ? pas) makes the alignment space more symmetric; thus, it allows agreement between discontinuous alignments. The resulting system shows substantial improvements in both alignment quality and translation quality over word-based Hidden Markov Models, while maintaining asymptotically equivalent runtime.
3 0.84309852 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
Author: Chris Dyer ; Jonathan H. Clark ; Alon Lavie ; Noah A. Smith
Abstract: We introduce a discriminatively trained, globally normalized, log-linear variant of the lexical translation models proposed by Brown et al. (1993). In our model, arbitrary, nonindependent features may be freely incorporated, thereby overcoming the inherent limitation of generative models, which require that features be sensitive to the conditional independencies of the generative process. However, unlike previous work on discriminative modeling of word alignment (which also permits the use of arbitrary features), the parameters in our models are learned from unannotated parallel sentences, rather than from supervised word alignments. Using a variety of intrinsic and extrinsic measures, including translation performance, we show our model yields better alignments than generative baselines in a number of language pairs.
4 0.83363563 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
Author: John DeNero ; Klaus Macherey
Abstract: Unsupervised word alignment is most often modeled as a Markov process that generates a sentence f conditioned on its translation e. A similar model generating e from f will make different alignment predictions. Statistical machine translation systems combine the predictions of two directional models, typically using heuristic combination procedures like grow-diag-final. This paper presents a graphical model that embeds two directional aligners into a single model. Inference can be performed via dual decomposition, which reuses the efficient inference algorithms of the directional models. Our bidirectional model enforces a one-to-one phrase constraint while accounting for the uncertainty in the underlying directional models. The resulting alignments improve upon baseline combination heuristics in word-level and phrase-level evaluations.
5 0.83067441 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs. .t r
6 0.8271516 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
7 0.82261699 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
8 0.80737323 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
9 0.79821581 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
10 0.71737891 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
11 0.70555919 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
12 0.69257396 340 acl-2011-Word Alignment via Submodular Maximization over Matroids
13 0.68202502 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
14 0.67505884 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
15 0.65519273 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
16 0.60853273 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
17 0.57270908 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
18 0.57139009 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
19 0.57037276 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
20 0.56522453 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
topicId topicWeight
[(5, 0.02), (17, 0.048), (26, 0.022), (37, 0.095), (39, 0.022), (41, 0.04), (55, 0.014), (59, 0.021), (72, 0.412), (91, 0.025), (96, 0.205)]
simIndex simValue paperId paperTitle
1 0.96495807 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya
Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or for- eign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers 508 to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourc1There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 508–513, ing to both address the lack ofappropriate evaluation metrics and to make system comparison easier. 
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL?R). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each tri- gram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bi509 gram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subse- quent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • • 3 LM: A 4-gram language model trained on tLhMe Google Wme lba1nTg 5-gram Corpus dw oithn SRILM (Stolcke, 2002). 
PERC: An averaged Perceptron (Freund and Schapire, 1999) calgaessdif Pieerr—ce as implemented nind the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008). Crowdsourcing Recently,we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality con- × trol on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions 2Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems. & extraneous prepositions). To impose such quality control, one has to provide “gold” instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers’ judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts. More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences.3 We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87 with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75 respectively). The extraneous preposition annotation cost only $325 (923 judgments 20 Turkers) and was com- pleted 9in2 a single day. T 2h0e only rres)st arnicdtio wna on tmheTurkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower. The 3 expert judgments are not used any further in this analysis. 4 Revamping System Evaluation In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§ 4. 1), and (b) by easily enabling system comparison (§ 4.2). 4.1 Crowd-informed Evaluation Measures When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all “bad” uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or 3We found 2 duplicate sentences and removed them. 510 incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence “In addition classmates play with some game and enjoy” is an example). The regular measures of precision and recall would be fairer if they reflected this reality. 
Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (Hw), misses (Mw) and false positives (FPw): XN Hw = X(csiys ∗ picrowd) (1) Xi XN Mw = X((1 − csiys) ∗ picrowd) (2) Xi XN FPw = X(csiys ∗ (1 − picrowd)) (3) Xi In the above equations, N is the total number of instances, csiys is the class (1 or 0) , and picrowd indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, picrowd would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precisionw = Hw/(Hw Hw/(Hw + FPw) and weighted + Mw). recall as Recallw = agreement Figure 1: Histogram of Turker agreements for all 923 instances on whether a preposition is extraneous. UWnwei gihg tede Pr0 e.c9 i5s0i70onR0 .e3 c78al14l Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM. To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for this system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether 511 [n=93] [n=1 14] Agreement Bin [n=71 6] Figure 2: Unweighted precision/recall by agreement bins for LM & PERC. a case is an error then a system’s output cannot be considered entirely right or entirely wrong.4 As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. However, weighted precision is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem. 
4.2 Enabling System Comparison In this section, we show how to easily compare different systems both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets. Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown 4The difference between unweighted and weighted measures can vary depending on the distribution of agreement. since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily on different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have similar performance as LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier’s decision process. When it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins—as is the norm—this important qualification would have been harder to discover. While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots. However, this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here. 5 Conclusions Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems and we do so by leveraging crowdsourcing. For system development, we rec- ommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall to 512 yield fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots. 
As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the com- munity.5 Acknowledgments We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback. References Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 195–203. Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of EMNLP, pages 286– 295. Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24. 5http : / /bit . ly/ crowdgrammar Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG. Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 53–56. Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176. Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 80–88. Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296. Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP. Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners’ Writing. In Proceedings of NAACL, pages 163–171 . Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1. Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 108–1 13. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool. Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. 
thesis, Department of Computer Science, University of Maryland College Park. Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215. Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of 513 the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September. Alla Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Alla Rozovskaya and D. Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP. Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286. Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872. Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48. Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 163–167. Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.
2 0.9337064 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Author: Seon Yang ; Youngjoong Ko
Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1
3 0.92837554 142 acl-2011-Generalized Interpolation in Decision Tree LM
Author: Denis Filimonov ; Mary Harper
Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.
4 0.92793155 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation
Author: Paul Piwek ; Svetlana Stoyanchev
Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.
same-paper 5 0.90350533 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state-of-the-art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed that the benefit of improved alignment becomes smaller with more training data, implying that the above limit also holds for large training conditions.
6 0.87971407 261 acl-2011-Recognizing Named Entities in Tweets
7 0.87303698 252 acl-2011-Prototyping virtual instructors from human-human corpora
8 0.80506349 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
9 0.74466264 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
10 0.68310946 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
11 0.68161654 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
12 0.67786062 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look
13 0.67398429 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
14 0.67247617 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
15 0.66921556 141 acl-2011-Gappy Phrasal Alignment By Agreement
16 0.66291636 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
17 0.66082495 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
18 0.65738344 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
19 0.65300483 76 acl-2011-Comparative News Summarization Using Linear Programming
20 0.65167999 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output