emnlp emnlp2010 emnlp2010-96 knowledge-graph by maker-knowledge-mining

96 emnlp-2010-Self-Training with Products of Latent Variable Grammars


Source: pdf

Author: Zhongqiang Huang ; Mary Harper ; Slav Petrov

Abstract: We study self-training with products of latent variable grammars in this paper. We show that increasing the quality of the automatically parsed data used for self-training gives higher accuracy self-trained grammars. Our generative self-trained grammars reach F scores of 91.6 on the WSJ test set and surpass even discriminative reranking systems without self-training. Additionally, we show that multiple self-trained grammars can be combined in a product model to achieve even higher accuracy. The product model is most effective when the individual underlying grammars are most diverse. Combining multiple grammars that were self-trained on disjoint sets of unlabeled data results in a final test accuracy of 92.5% on the WSJ test set and 89.6% on our Broadcast News test set.
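
The disjoint-subset self-training setup mentioned in the abstract can be illustrated with a minimal sketch. Everything named here (`disjoint_subsets`, `build_training_sets`, `parse_fn`, the toy sentences) is hypothetical and not taken from the paper; the split is a simple round-robin, and any weighting between gold and automatically parsed data is left out.

```python
def disjoint_subsets(sentences, k):
    """Split sentences into k disjoint subsets by round-robin assignment."""
    return [sentences[i::k] for i in range(k)]


def build_training_sets(gold_trees, unlabeled, parse_fn, k=10):
    """Return k training sets: the gold trees plus automatic parses of one subset each.

    parse_fn stands in for whatever parser labels the unlabeled data
    (for example a product of grammars); it is an assumption of this sketch.
    """
    training_sets = []
    for subset in disjoint_subsets(unlabeled, k):
        auto_trees = [parse_fn(sentence) for sentence in subset]
        training_sets.append(list(gold_trees) + auto_trees)
    return training_sets


# Toy data: two gold trees and twenty unlabeled sentences (placeholders only).
gold = ["gold_1", "gold_2"]
unlabeled = [f"sent_{i}" for i in range(20)]
training_sets = build_training_sets(gold, unlabeled, lambda s: f"parse({s})", k=10)
```

Each resulting set would then be used to train one grammar, so that the grammars differ in the automatically labeled data they have seen.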

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We study self-training with products of latent variable grammars in this paper. [sent-6, score-1.018]

2 Our generative self-trained grammars reach F scores of 91. [sent-8, score-0.705]

3 Additionally, we show that multiple self-trained grammars can be combined in a product model to achieve even higher accuracy. [sent-10, score-0.999]

4 The product model is most effective when the individual underlying grammars are most diverse. [sent-11, score-1.08]

5 Combining multiple grammars that were self-trained on disjoint sets of unlabeled data results in a final test accuracy of 92. [sent-12, score-0.829]

6 (2006) is capable of learning high accuracy context-free grammars directly from a raw treebank. [sent-16, score-0.714]

7 However, because the latent variable grammars are not explicitly regularized, EM keeps fitting the training data and eventually begins overfitting (Liang et al., 2007). Moreover, EM is a local method, making no promises regarding the final point of convergence when initialized from different random seeds. [sent-20, score-0.87]

8 Recently, Petrov (2010) showed that substantial differences between the learned grammars remain, even if the hierarchical splitting reduces the variance across independent runs of EM. [sent-21, score-0.734]

9 (2006) introduced a linear smoothing procedure that allows training grammars for 6 split-merge (SM) rounds without overfitting. [sent-23, score-0.895]

10 They showed that self-training latent variable grammars on their own output can mitigate data sparsity issues and improve parsing accuracy. [sent-27, score-0.916]

11 Because the capacity of the model can grow with the size of the training data, latent variable grammars are able to benefit from the additional training data, even though it is not perfectly labeled. [sent-28, score-0.87]

12 However, variation still remains in their self-trained grammars and they had to use a held-out set for model selection. [sent-30, score-0.704]

13 What is perhaps more surprising is that the different latent variable grammars seem to capture complementary aspects of the data. [sent-36, score-0.894]

14 Quite serendipitously, these grammars can be combined into an unweighted product model that substantially outperforms the individual grammars. [sent-38, score-1.121]

15 The average over 10 SM6 grammars with the transformation is 90. [sent-78, score-0.714]

16 Latent variable grammars augment the observed parse trees in the treebank with a latent variable at each tree node. [sent-130, score-1.064]

17 ... correspond to different high quality latent variable grammars that have captured different types of patterns in the data. [sent-140, score-0.87]

18 Because the individual models’ mistakes are independent to some extent, multiple grammars can be effectively combined into an unweighted product model of much higher accuracy. [sent-141, score-1.121]

19 We build upon this line of work and investigate methods to exploit products of latent variable grammars in the context of self-training. [sent-142, score-1.038]

20 2 Table 2: Performance of the regular grammars and their products on the WSJ development set. [sent-159, score-1.146]

21 parse a single subset of the unlabeled data and train 10 self-trained grammars using this single set. [sent-160, score-0.882]

22 ST-Prod Training Use the product model to parse a single subset of the unlabeled data and train 10 self-trained grammars using this single set. [sent-161, score-1.176]

23 The resulting grammars can be either used individually or combined in a product model. [sent-163, score-1.019]

24 Finally, the third experiment investigates a method for injecting some additional diversity into the individual grammars to determine whether a product model is most successful when there is more variance among the individual models. [sent-167, score-1.388]

25 It is important to construct grammars capable of parsing this type of data accurately and consistently in order to support structured language modeling (e. [sent-170, score-0.73]

26 4 Newswire Experiments In this section, we compare single grammars and their products that are trained in the standard way with gold WSJ training data, as well as the three self-training scenarios discussed in Section 3. [sent-173, score-0.853]

27 4 Table 3: Performance of the ST-Reg grammars and their products on the WSJ development set. [sent-180, score-0.859]

28 report the F scores of both SM6 and SM7 grammars on the development set in order to evaluate the effect of model complexity on the performance of the self-trained and product models. [sent-181, score-1.005]

29 Note that we use 6th round grammars to produce the automatic parse trees for the self-training experiments. [sent-182, score-0.791]

30 Parsing with the product of the 7th round grammars is slow and requires a large amount of memory (32GB). [sent-183, score-1.038]

31 The best F score attained by the individual SM6 grammars on the development set is 90. [sent-188, score-0.853]

32 The product of grammars achieves a significantly improved accuracy at 92. [sent-191, score-1.034]

33 Notice that the individual SM7 grammars perform worse on average (90. [sent-193, score-0.816]

34 5) due to overfitting, but their product achieves higher accuracy than the product of the SM6 grammars (92. [sent-196, score-1.328]

35 2 ST-Reg Training Given the ten SM6 grammars from the previous subsection, we can investigate the three self-training methods. [sent-202, score-0.745]

36 We then train ten grammars from different random seeds, using an equally weighted combination of the WSJ training set with this single set. [sent-205, score-0.787]

37 These self-trained grammars are then combined into a product model. [sent-206, score-0.999]

38 4 Table 4: Performance of the ST-Prod grammars and their products on the WSJ development set. [sent-215, score-0.859]

39 thanks to the use of additional automatically labeled training data, the individual SM6 ST-Reg grammars perform significantly better than the individual SM6 grammars (91. [sent-216, score-1.572]

40 5 on average), and the individual SM7 ST-Reg grammars perform even better, achieving a high F score of 91. [sent-219, score-0.805]

41 The product of ST-Reg grammars achieves significantly better performance than the individual grammars; however, the improvement is much smaller than that obtained by the product of regular grammars. [sent-221, score-1.687]

42 In fact, the product of ST-Reg grammars performs quite similarly to the product of regular grammars despite the higher average accuracy of the individual grammars. [sent-222, score-2.405]

43 We will show in Section 5 that the diversity among the individual grammars is as important as average accuracy for the performance attained by the product model. [sent-224, score-1.317]

44 3 ST-Prod Training Since products of latent variable grammars perform significantly better than individual latent variable grammars, it is natural to try using the product model for parsing the unlabeled data. [sent-226, score-1.734]

45 To investigate whether the higher accuracy of the automatically labeled data translates into a higher accuracy of the self-trained grammars, we used the product of 6th round grammars to parse the same subset of the unlabeled data as in the previous experiment. [sent-227, score-1.233]

46 As can be seen in Table 4, using the product of the regular grammars for labeling the self-training data results in improved individual ST-Prod grammars when compared with the ST-Reg grammars, with 0. [sent-229, score-2.051]

47 8 Table 5: Performance of the ST-Prod-Mult grammars and their products on the WSJ development set. [sent-239, score-0.859]

48 The product of the SM6 ST-Prod grammars also achieves a 0. [sent-243, score-1.004]

49 2 higher F score compared to the product of the SM6 ST-Reg grammars, but the product of the SM7 ST-Prod grammars has the same performance as the product of the SM7 ST-Reg grammars. [sent-244, score-1.585]

50 This could be due to the fact that the ST-Prod grammars are no more diverse than the ST-Reg grammars, as we will show in Section 5. [sent-245, score-0.684]

51 4 ST-Prod-Mult Training When creating a product model of regular grammars, Petrov (2010) used a different random seed for each model and conjectured that the effectiveness of the product grammars stems from the resulting diversity of the individual grammars. [sent-247, score-1.792]

52 Petrov (2010) attempted to use the second method to train individual grammars on either disjoint or overlapping subsets of the treebank, but observed a performance drop in individual grammars resulting from training on less data, as well as in the performance of the product model. [sent-249, score-1.988]

53 Hence, in [Figure 1(a): Difference in F score between the product and the individual SM6 regular grammars.] [sent-252, score-0.702]

54 Figure 1: Difference in F scores between various individual grammars and representative product grammars. [sent-254, score-1.08]

55 the third self-training experiment, we use the product of the regular grammars to parse all ten subsets of the unlabeled data and train ten grammars, which we call ST-Prod-Mult grammars, each using a different subset. [sent-255, score-1.556]

56 As shown in Table 5, the individual ST-Prod-Mult grammars perform similarly to the individual ST-Prod grammars. [sent-256, score-0.888]

57 However, the product of the ST-Prod-Mult grammars achieves significantly higher accuracies than the product of the ST-Prod grammars, with 0. [sent-257, score-1.324]

58 Figure 1(a) depicts the difference between the product and the individual SM6 regular grammars on overall F score, as well as individual constituent F scores. [sent-263, score-1.469]

59 As can be observed, there are significant variations among the individual grammars, and the product of the regular grammars improves almost all categories, with a few exceptions (some individual grammars do better on QP and WHNP constituents). [sent-264, score-2.178]

60 Figure 1(b) shows the difference between the product of the SM6 regular grammars and the individual SM7 ST-Prod-Mult grammars. [sent-265, score-1.367]

61 In most of the categories, some individual ST-Prod-Mult grammars perform comparably or slightly better than the product of SM6 regular grammars used to automatically label the unlabeled training set. [sent-267, score-2.179]

62 As more latent variables are introduced through the iterative SM training algorithm, the modeling capacity of the grammars increases, leading to improved performance. [sent-271, score-0.792]

63 However, the performance of the regular grammars drops after 6 SM rounds, as also previously observed in (Huang and Harper, 2009; Petrov, 2009), suggesting that the regular SM7 grammars have overfit the relatively small-sized gold training data. [sent-272, score-1.984]

64 In contrast, the performance of the self-trained grammars continues to improve in the 7th SM round. [sent-273, score-0.684]

65 Although the performance of the individual grammars, both regular and self-trained, varies significantly and the product model consistently helps, there is a non-negligible difference between the improvement achieved by the two product models over their component grammars. [sent-275, score-0.977]

66 The regular product model improves upon its individual grammars more than the ST-Prod-Mult product does in the later SM rounds, as illustrated by the relative error reduction curves in figures 2(a) and (b). [sent-276, score-1.686]

67 In particular, the product of the SM7 regular grammars gains a remarkable 2. [sent-277, score-1.265]

68 1% absolute improvement over the average performance of the individual regular SM7 grammars and 0. [sent-278, score-1.103]

69 2% absolute over the product of the regular SM6 grammars, despite the fact that the individual regular SM7 grammars perform worse than the SM6 grammars. [sent-279, score-1.654]

70 , 2005), each individual expert learns complementary aspects of the training data and the veto power of product models enforces that the joint prediction of their product has to be licensed by all individual experts. [sent-284, score-0.816]

71 One possible explanation of the observation in the previous subsection is that with the addition of more latent variables, the individual grammars become more deeply specialized on certain aspects of the training data. [sent-285, score-0.894]

72 Petrov (2010) showed that the individually learned grammars are indeed very diverse by looking at the distribution of latent annotations across the treebank categories, as well as the variation in overall and individual category F scores (see Figure 1). [sent-288, score-0.874]

73 The power of the product model comes directly from the diversity in log p(r|s, G) among individual grammars. [sent-291, score-0.552]

74 If there is little diversity, the individual grammars would make similar predictions and there would be little or no benefit from using a product model. [sent-292, score-0.978]

75 This happens for coarser grammars produced in early SM stages when there is more uncertainty about what rules to apply, with the rules remaining in the parsing chart having low probabilities overall. [sent-297, score-0.73]

76 Figure 2: Learning curves of the individual regular (a) and ST-Prod-Mult (b) grammars (average performance, with minimum and maximum values indicated by bars) and their products before and after self-training on the WSJ development set. Axis label: SM Rounds. [sent-302, score-1.246]

77 (c) The measured average empirical variance among the grammars trained on WSJ. [sent-304, score-0.789]

78 among the regular grammars grows at a much faster speed and is consistently greater when compared to the self-trained grammars. [sent-305, score-0.996]

79 This suggests that there is more diversity among the regular grammars than among the self-trained grammars, and explains the greater improvement obtained by the regular product model. [sent-306, score-1.733]

80 Last but not the least, the trend seems to indicate that the variance of the self-trained grammars would continue increasing if EM training was extended by a few more SM rounds, potentially resulting in even better product models. [sent-308, score-1.028]

81 2 F) alone is able to outperform the product of SM7 regular grammars (88. [sent-314, score-1.265]

82 As can be observed, the self-trained grammars have increasing F scores as the split-merge rounds increase, while the regular grammars have a slight decrease in F score after round 6. [sent-326, score-1.935]

83 In contrast to the newswire models, it appears that the individual ST-Prod-Mult grammars trained on broadcast news always perform comparably to the product of the regular grammars at all SM rounds, including the product of SM7 regular grammars. [sent-327, score-2.904]

84 This is noteworthy, given that the ST-Prod-Mult grammars are trained on the output of the worse performing product of the SM6 regular grammars. [sent-328, score-1.265]

85 Figure 3: Learning curves of the individual regular (a) and ST-Prod-Mult (b) grammars (average performance, with minimum and maximum values indicated by bars) and their products before and after self-training on the BN development set. Axis label: SM Rounds. [sent-332, score-1.273]

86 (c) The measured average empirical variance among the grammars trained on BN. [sent-334, score-0.789]

87 possible explanation is that we used more unlabeled data for self-training the broadcast news grammars than for the newswire grammars. [sent-335, score-1.004]

88 The product of the ST-Prod-Mult grammars provides further and significant improvement in F score. [sent-336, score-0.978]

89 6 Final Results We evaluated the best single self-trained grammar (SM7 ST-Prod), as well as the product of the SM7 ST-Prod-Mult grammars on the WSJ test set. [sent-337, score-1.068]

90 Table 7 compares these two grammars to a large body of related work grouped into single parsers (SINGLE), discriminative reranking approaches (RE), self-training (SELF), and system combinations (COMBO). [sent-338, score-0.78]

91 The product of the self-trained ST-Prod-Mult grammars achieves significantly higher accuracies with an F score of 92. [sent-349, score-1.049]

92 Footnote 8: Our ST-Reg grammars are trained in the same way as in [...] [Table header residue: Type, Parser, LP, LR, EX] [sent-352, score-0.684]

93 (2006) with a product of latent variable grammars would give even higher parsing accuracies. [sent-389, score-1.21]

94 On the Broadcast News test set, our best performing single and product grammars (bolded in Table 6) obtained F scores of 88. [sent-390, score-0.999]

95 7 Conclusions and Future Work We evaluated methods for self-training high accuracy products of latent variable grammars with large amounts of genre-matched data. [sent-394, score-1.048]

96 We demonstrated empirically on newswire and broadcast news genres that very high accuracies can be achieved by training grammars on disjoint sets of automatically labeled data. [sent-395, score-0.969]

97 Second, the diversity of the individual grammars controls the gains that can be obtained by combining multiple grammars into a product model. [sent-398, score-1.895]

98 6 on the WSJ test set, rivaling discriminative reranking approaches (Charniak and Johnson, 2005) and products of latent variable grammars (Petrov, 2010), despite being a single generative PCFG. [sent-400, score-1.135]

99 Finally, for this work, we always used products of 10 grammars, but we sometimes observed that subsets of these grammars produce even better results on the development set. [sent-408, score-0.933]

100 Finding a way to select grammars from a grammar pool to achieve high performance products is an interesting area of future study. [sent-409, score-0.901]
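
The unweighted product model described in the sentences above can be illustrated with a small sketch. Note the simplification: the source describes the product in terms of log p(r|s, G) for rules r, whereas this sketch scores whole candidate trees; the function names and the probabilities are hypothetical and chosen only to show the "veto" effect, where one grammar assigning a low probability pulls a candidate down.

```python
import math


def product_score(tree_log_probs):
    """Unweighted product of probabilities, computed as a sum of log-probabilities."""
    return sum(tree_log_probs)


def best_tree_under_product(candidates):
    """candidates maps a tree label to a list of log p(tree | sentence, G_i), one per grammar."""
    return max(candidates, key=lambda tree: product_score(candidates[tree]))


# Hypothetical probabilities from three grammars for two candidate trees.
# Grammars 1 and 3 slightly prefer tree_A, but grammar 2 strongly disfavors it.
candidates = {
    "tree_A": [math.log(0.50), math.log(0.05), math.log(0.45)],
    "tree_B": [math.log(0.40), math.log(0.35), math.log(0.40)],
}
best = best_tree_under_product(candidates)
```

In this toy case the product prefers tree_B, because tree_A is not supported by all three grammars. If all grammars made near-identical predictions, the product would select the same tree as any single grammar, which matches the point made above about diversity.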


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('grammars', 0.684), ('product', 0.294), ('regular', 0.287), ('sm', 0.17), ('rounds', 0.162), ('products', 0.148), ('petrov', 0.143), ('diversity', 0.131), ('harper', 0.13), ('broadcast', 0.123), ('latent', 0.108), ('wsj', 0.102), ('individual', 0.102), ('bllip', 0.09), ('unlabeled', 0.088), ('variable', 0.078), ('charniak', 0.07), ('grammar', 0.069), ('selftraining', 0.064), ('huang', 0.063), ('news', 0.06), ('round', 0.06), ('edited', 0.058), ('subsets', 0.053), ('bn', 0.051), ('variance', 0.05), ('reranking', 0.05), ('newswire', 0.049), ('treebank', 0.048), ('parse', 0.047), ('parsing', 0.046), ('logarithmic', 0.046), ('sparseval', 0.045), ('slav', 0.042), ('ten', 0.041), ('comparably', 0.04), ('parser', 0.04), ('eugene', 0.039), ('mary', 0.039), ('mcclosky', 0.039), ('selftrained', 0.039), ('pools', 0.039), ('overfitting', 0.036), ('parsed', 0.033), ('consortium', 0.033), ('average', 0.03), ('burnham', 0.03), ('contractions', 0.03), ('garofolo', 0.03), ('gp', 0.03), ('gramamrs', 0.03), ('splitmerge', 0.03), ('matsuzaki', 0.03), ('accuracy', 0.03), ('em', 0.03), ('johnson', 0.028), ('disjoint', 0.027), ('development', 0.027), ('accuracies', 0.026), ('achieves', 0.026), ('var', 0.026), ('filimonov', 0.026), ('hale', 0.026), ('reductions', 0.026), ('variances', 0.026), ('gn', 0.026), ('fossum', 0.026), ('maxima', 0.026), ('specialization', 0.026), ('discriminative', 0.025), ('among', 0.025), ('curves', 0.025), ('log', 0.025), ('miles', 0.025), ('complementary', 0.024), ('files', 0.024), ('opinion', 0.023), ('ontonotes', 0.023), ('zhongqiang', 0.023), ('weischedel', 0.023), ('pcfg', 0.022), ('sized', 0.021), ('attained', 0.021), ('overfit', 0.021), ('generative', 0.021), ('train', 0.021), ('observed', 0.021), ('combined', 0.021), ('single', 0.021), ('variation', 0.02), ('equally', 0.02), ('constituents', 0.02), ('roark', 0.02), ('individually', 0.02), ('bars', 0.02), ('baldridge', 0.02), ('unweighted', 0.02), ('investigate', 0.02), ('score', 0.019), ('smoothing', 0.019)]
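
The page does not document how sentence scores are derived from these word weights. As an inference from the numbers listed here (not a statement about the generating tool), several scores are consistent with summing the weights of whitespace-separated tokens that exactly match a listed word. A minimal sketch under that assumption, with the hypothetical helper `sentence_score`:

```python
def sentence_score(sentence, weights):
    """Sum the tf-idf weight of every whitespace-separated token found in weights.

    Tokens are matched exactly, so a token with attached punctuation
    (for example "model.") does not match the entry "model".
    """
    return sum(weights.get(token, 0.0) for token in sentence.split())


# A few weights copied from the list above.
weights = {"grammars": 0.684, "product": 0.294, "combined": 0.021}

# Sentence 37 from the summary on this page.
text = "These self-trained grammars are then combined into a product model."
score = sentence_score(text, weights)
```

With these three weights the sum is approximately 0.999, which agrees with the score listed for that sentence in the summary.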

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999827 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

Author: Zhongqiang Huang ; Mary Harper ; Slav Petrov

Abstract: We study self-training with products of latent variable grammars in this paper. We show that increasing the quality of the automatically parsed data used for self-training gives higher accuracy self-trained grammars. Our generative self-trained grammars reach F scores of 91.6 on the WSJ test set and surpass even discriminative reranking systems without self-training. Additionally, we show that multiple self-trained grammars can be combined in a product model to achieve even higher accuracy. The product model is most effective when the individual underlying grammars are most diverse. Combining multiple grammars that were self-trained on disjoint sets of unlabeled data results in a final test accuracy of 92.5% on the WSJ test set and 89.6% on our Broadcast News test set.

2 0.14952578 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

3 0.14640318 114 emnlp-2010-Unsupervised Parse Selection for HPSG

Author: Rebecca Dridan ; Timothy Baldwin

Abstract: Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as treebanking is generally streamlined with parse selection models, creating the initial treebank without a model requires more resources than subsequent treebanks. In this work, we show that, by taking advantage of the constrained nature of these HPSG grammars, we can learn a discriminative parse selection model from raw text in a purely unsupervised fashion. This allows us to bootstrap the treebanking process and provide better parsers faster, and with less resources.

4 0.14029106 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

Author: Slav Petrov ; Pi-Chuan Chang ; Michael Ringgaard ; Hiyan Alshawi

Abstract: It is well known that parsing accuracies drop significantly on out-of-domain data. What is less known is that some parsers suffer more from domain shifts than others. We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers, which are of highest interest for practical applications because of their linear running time, drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

5 0.10829654 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

Author: Zhongqiang Huang ; Martin Cmejrek ; Bowen Zhou

Abstract: In this paper, we present a novel approach to enhance hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we learn a set of linguistically-guided latent syntactic categories automatically from a source-side parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in a SCFG rule is decorated with a real-valued feature vector computed based on its distribution of latent syntactic categories. These feature vectors are utilized at decoding time to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules that are applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while at the same time naturally incorporates soft syntactic constraints.

6 0.098485813 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

7 0.098141022 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

8 0.093803987 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

9 0.082618125 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

10 0.079411581 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

11 0.076668143 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

12 0.068903096 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

13 0.06323465 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

14 0.060351662 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

15 0.057442475 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

16 0.053839099 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

17 0.053164452 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

18 0.052106272 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

19 0.048463706 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

20 0.047353473 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.061), (2, 0.187), (3, -0.065), (4, 0.084), (5, 0.053), (6, 0.06), (7, -0.01), (8, 0.025), (9, 0.018), (10, 0.154), (11, -0.047), (12, 0.123), (13, 0.138), (14, -0.049), (15, -0.034), (16, -0.034), (17, 0.047), (18, -0.052), (19, 0.09), (20, 0.04), (21, -0.189), (22, -0.005), (23, -0.038), (24, 0.312), (25, -0.106), (26, -0.155), (27, -0.113), (28, -0.121), (29, 0.018), (30, 0.012), (31, 0.008), (32, -0.091), (33, -0.068), (34, 0.021), (35, 0.003), (36, 0.167), (37, 0.04), (38, -0.264), (39, 0.071), (40, -0.049), (41, -0.092), (42, -0.008), (43, 0.044), (44, -0.191), (45, 0.017), (46, 0.092), (47, -0.031), (48, -0.098), (49, -0.006)]
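
The page does not say how the similarity values in the LSI list are computed from these topic weights. Cosine similarity between topic-weight vectors is a common choice for LSI, so the following is a sketch under that assumption only; the helper names and the second paper's weights are hypothetical.

```python
import math


def to_dense(pairs, num_topics):
    """Turn (topicId, topicWeight) pairs into a dense list of length num_topics."""
    vector = [0.0] * num_topics
    for topic_id, weight in pairs:
        vector[topic_id] = weight
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# First four topic weights of this paper, copied from the list above.
this_paper = to_dense([(0, 0.197), (1, 0.061), (2, 0.187), (3, -0.065)], 4)
# Made-up weights for another paper, for illustration only.
other_paper = to_dense([(0, 0.150), (2, 0.120), (3, 0.080)], 4)

similarity = cosine(this_paper, other_paper)
self_similarity = cosine(this_paper, this_paper)
```

Under this measure a paper compared with itself scores 1 up to floating-point error, which is in line with the same-paper entries in the lists on this page scoring close to 1.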

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99132639 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

Author: Zhongqiang Huang ; Mary Harper ; Slav Petrov

Abstract: We study self-training with products of latent variable grammars in this paper. We show that increasing the quality of the automatically parsed data used for self-training gives higher accuracy self-trained grammars. Our generative self-trained grammars reach F scores of 91.6 on the WSJ test set and surpass even discriminative reranking systems without self-training. Additionally, we show that multiple self-trained grammars can be combined in a product model to achieve even higher accuracy. The product model is most effective when the individual underlying grammars are most diverse. Combining multiple grammars that were self-trained on disjoint sets of unlabeled data results in a final test accuracy of 92.5% on the WSJ test set and 89.6% on our Broadcast News test set.

2 0.56800008 114 emnlp-2010-Unsupervised Parse Selection for HPSG

Author: Rebecca Dridan ; Timothy Baldwin

Abstract: Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as treebanking is generally streamlined with parse selection models, creating the initial treebank without a model requires more resources than subsequent treebanks. In this work, we show that, by taking advantage of the constrained nature of these HPSG grammars, we can learn a discriminative parse selection model from raw text in a purely unsupervised fashion. This allows us to bootstrap the treebanking process and provide better parsers faster, and with less resources.

3 0.50661004 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

Author: Eric Hardisty ; Jordan Boyd-Graber ; Philip Resnik

Abstract: Strong indications of perspective can often come from collocations of arbitrary length; for example, someone writing get the government out of my X is typically expressing a conservative rather than progressive viewpoint. However, going beyond unigram or bigram features in perspective classification gives rise to problems of data sparsity. We address this problem using nonparametric Bayesian modeling, specifically adaptor grammars (Johnson et al., 2006). We demonstrate that an adaptive naïve Bayes model captures multiword lexical usages associated with perspective, and establishes a new state-of-the-art for perspective classification results using the Bitter Lemons corpus, a collection of essays about mid-east issues from Israeli and Palestinian points of view.

4 0.44672576 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

Author: Slav Petrov ; Pi-Chuan Chang ; Michael Ringgaard ; Hiyan Alshawi

Abstract: It is well known that parsing accuracies drop significantly on out-of-domain data. What is less known is that some parsers suffer more from domain shifts than others. We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers, which are of highest interest for practical applications because of their linear running time, drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.

5 0.41594431 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

6 0.41520548 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

7 0.38217312 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

8 0.33249554 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

9 0.32641363 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

10 0.31658307 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

11 0.31102991 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

12 0.27094218 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

13 0.24055356 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

14 0.20446347 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

15 0.19148323 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

16 0.18966699 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

17 0.17554383 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

18 0.17420034 102 emnlp-2010-Summarizing Contrastive Viewpoints in Opinionated Text

19 0.17364568 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

20 0.17278852 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.011), (12, 0.027), (29, 0.528), (30, 0.015), (32, 0.015), (52, 0.025), (56, 0.043), (62, 0.01), (66, 0.09), (72, 0.039), (76, 0.027), (77, 0.014), (79, 0.019), (87, 0.021), (89, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9789592 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization

Author: Dominic Espinosa ; Rajakrishnan Rajkumar ; Michael White ; Shoshana Berleant

Abstract: We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations. These outputs were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics, and analyzed correlations obtained between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further result of our research is a corpus of post-edited realizations, which will be made available to the research community.

1 Introduction and Background

In building surface-realization systems for natural language generation, there is a need for reliable automated metrics to evaluate the output. Unlike in parsing, where there is usually a single gold-standard parse for a sentence, in surface realization there are usually many grammatically-acceptable ways to express the same concept. This parallels the task of evaluating machine-translation (MT) systems: for a given segment in the source language, there are usually several acceptable translations into the target language. As human evaluation of translation quality is time-consuming and expensive, a number of automated metrics have been developed to evaluate the quality of MT outputs.
In this study, we investigate whether the metrics developed for MT evaluation tasks can be used to reliably evaluate the outputs of surface realizers, and which of these metrics are best suited to this task. A number of surface realizers have been developed using the Penn Treebank (PTB), and BLEU scores are often reported in the evaluations of these systems. But how useful is BLEU in this context? The original BLEU study (Papineni et al., 2001) scored MT outputs, which are of generally lower quality than grammar-based surface realizations. Furthermore, even for MT systems, the usefulness of BLEU has been called into question (Callison-Burch et al., 2006). BLEU is designed to work with multiple reference sentences, but in treebank realization, there is only a single reference sentence available for comparison. A few other studies have investigated the use of such metrics in evaluating the output of NLG systems, notably (Reiter and Belz, 2009) and (Stent et al., 2005). The former examined the performance of BLEU and ROUGE with computer-generated weather reports, finding a moderate correlation with human fluency judgments. The latter study applied several MT metrics to paraphrase data from Barzilay and Lee’s corpus-based system (Barzilay and Lee, 2003), and found moderate correlations with human adequacy judgments, but little correlation with fluency judgments. Cahill (2009) examined the performance of six MT metrics (including BLEU) in evaluating the output of an LFG-based surface realizer for German, also finding only weak correlations with the human judgments.

(Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 564–574, MIT, Massachusetts, USA, 9-11 October 2010. © 2010 Association for Computational Linguistics.)
To study the usefulness of evaluation metrics such as BLEU on the output of grammar-based surface realizers used with the PTB, we assembled a corpus of surface realizations from three different realizers operating on Section 00 of the PTB. Two human judges evaluated the adequacy and fluency of each of the realizations with respect to the reference sentence. The realizations were then scored with a number of automated evaluation metrics developed for machine translation. In order to investigate the correlation of targeted metrics with human evaluations, and gather other acceptable realizations for future evaluations, the judges manually repaired each unacceptable realization during the rating task. In contrast to previous NLG meta-evaluations, we found that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best. However, when looking at statistically significant system-level differences in human judgments, we found that some of the metrics get some of the rankings correct, but none get them all correct, with different metrics making different ranking errors. This suggests that multiple metrics should be routinely consulted when comparing realizer systems. Overall, our methodology is similar to that of previous MT meta-evaluations, in that we collected human judgments of system outputs, and compared these scores with those assigned by automatic metrics. A recent alternative approach to paraphrase evaluation is ParaMetric (Callison-Burch et al., 2008); however, it requires a corpus of annotated (aligned) paraphrases (which does not yet exist for PTB data), and is arguably focused more on paraphrase analysis than paraphrase generation. The plan of the paper is as follows: Section 2 discusses the preparation of the corpus of surface realizations. Section 3 describes the human evaluation task and the automated metrics applied. 
Sections 4 and 5 present and discuss the results of these evaluations. We conclude with some general observations about automatic evaluation of surface realizers, and some directions for further research.

2 Data Preparation

We collected realizations of the sentences in Section 00 of the WSJ corpus from the following three sources:

1. OpenCCG, a CCG-based chart realizer (White, 2006)
2. The XLE Generator, an LFG-based system developed by Xerox PARC (Crouch et al., 2008)
3. WordNet synonym substitutions, to investigate how differences in lexical choice compare to grammar-based variation.[1]

[1] Not strictly surface realizations, since they do not involve an abstract input specification, but for simplicity we refer to them as realizations throughout.

Although all three systems used Section 00 of the PTB, they were applied with various parameters (e.g., language models, multiple-output versus single-output) and on different input structures. Accordingly, our study does not compare OpenCCG to XLE, or either of these to the WordNet system.

2.1 OpenCCG realizations

OpenCCG is an open source parsing/realization library with multimodal extensions to CCG (Baldridge, 2002). The OpenCCG chart realizer takes logical forms as input and produces strings by combining signs for lexical items. Alternative realizations are scored using integrated n-gram and perceptron models. For robustness, fragments are greedily assembled when necessary. Realizations were generated from 1,895 gold standard logical forms, created by constrained parsing of development-section derivations. The following OpenCCG models (which differ essentially in the way the output is ranked) were used:

1. Baseline 1: Output ranked by a trigram word model
2. Baseline 2: Output ranked using three language models (3-gram words + 3-gram words with named entity class replacement + factored language model of words, POS tags and CCG supertags)
3. Baseline 3: Perceptron with syntax features and the three LMs mentioned above
4. Perceptron full-model: n-best realizations ranked using perceptron with syntax features and the three n-gram models, as well as discriminative n-grams

The perceptron model was trained on sections 02-21 of the CCGbank, while a grammar extracted from section 00-21 was used for realization. In addition, oracle supertags were inserted into the chart during realization. The purpose of such a non-blind testing strategy was to evaluate the quality of the output produced by the statistical ranking models in isolation, rather than focusing on grammar coverage, and avoid the problems associated with lexical smoothing, i.e. lexical categories in the development section not being present in the training section. To enrich the variation in the generated realizations, dative-alternation was enforced during realization by ensuring alternate lexical categories of the verb in question, as in the following example:

(1) the executives gave [the chefs] [a standing ovation]
(2) the executives gave [a standing ovation] [to the chefs]

2.2 XLE realizations

The corpus of realizations generated by the XLE system contained 42,527 surface realizations of approximately 1,421 section 00 sentences (an average of 30 per sentence), initially unranked. The LFG f-structures used as input to the XLE generator were derived from automatic parses, as described in (Riezler et al., 2002). The realizations were first tokenized using Penn Treebank conventions, then ranked using perplexities calculated from the same trigram word model used with OpenCCG. For each sentence, the top 4 realizations were selected. The XLE generator provides an interesting point of comparison to OpenCCG as it uses a manually-developed grammar with inputs that are less abstract but potentially noisier, as they are derived from automatic parses rather than gold-standard ones.
2.3 WordNet synonymizer

To produce an additional source of variation, the nouns and verbs of the sentences in section 00 of the PTB were replaced with all of their WordNet synonyms. Verb forms were generated using verb stems, part-of-speech tags, and the morphg tool.[2] These substituted outputs were then filtered using the n-gram data which Google Inc. has made available.[3] Those without any 5-gram matches centered on the substituted word (or 3-gram matches, in the case of short sentences) were eliminated.

[2] http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
[3] http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt

3 Evaluation

From the data sources described in the previous section, a corpus of realizations to be evaluated by the human judges was constructed by randomly choosing 305 sentences from section 00, then selecting surface realizations of these sentences using the following algorithm:

1. Add OpenCCG’s best-scored realization.
2. Add other OpenCCG realizations until all four models are represented, to a maximum of 4.
3. Add up to 4 realizations from either the XLE system or the WordNet pool, chosen randomly.

The intent was to give reasonable coverage of all realizer systems discussed in Section 2 without overloading the human judges. “System” here means any instantiation that emits surface realizations, including various configurations of OpenCCG (using different language models or ranking systems), and these can be multiple-output, such as an n-best list, or single-output (best-only, worst-only, etc.). Accordingly, more realizations were selected from the OpenCCG realizer because 5 different systems were being represented. Realizations were chosen randomly, rather than according to sentence types or other criteria, in order to produce a representative sample of the corpus. In total, 2,114 realizations were selected for evaluation.

3.1 Human judgments

Two human judges evaluated each surface realization on two criteria: adequacy, which represents the extent to which the output conveys all and only the meaning of the reference sentence; and fluency, the extent to which it is grammatically acceptable. The realizations were presented to the judges in sets containing a reference sentence and the 1-8 outputs selected for that sentence. To aid in the evaluation of adequacy, one sentence each of leading and trailing context were displayed. Judges used the guidelines given in Figure 1, based on the scales developed by the NIST Machine Translation Evaluation Workshop. In addition to rating each realization on the two five-point scales, each judge also repaired each output which he or she did not judge to be fully adequate and fluent. An example is shown in Figure 2. These repairs resulted in new reference sentences for a substantial number of sentences. These repaired realizations were later used to calculate targeted versions of the evaluation metrics, i.e., using the repaired sentence as the reference sentence. Although targeted metrics are not fully automatic, they are of interest because they allow the evaluation algorithm to focus on what is actually wrong with the input, rather than all textual differences. Notably, targeted TER (HTER) has been shown to be more consistent with human judgments than human annotators are with one another (Snover et al., 2006).
3.2 Automatic evaluation

The realizations were also evaluated using seven automatic metrics:

• IBM’s BLEU, which scores a hypothesis by counting n-gram matches with the reference sentence (Papineni et al., 2001), with smoothing as described in (Lin and Och, 2004)
• The NIST n-gram evaluation metric, similar to BLEU, but rewarding rarer n-gram matches, and using a different length penalty
• METEOR, which measures the harmonic mean of unigram precision and recall, with a higher weight for recall (Banerjee and Lavie, 2005)
• TER (Translation Edit Rate), a measure of the number of edits required to transform a hypothesis sentence into the reference sentence (Snover et al., 2006)
• TERP, an augmented version of TER which performs phrasal substitutions, stemming, and checks for synonyms, among other improvements (Snover et al., 2009)
• TERPA, an instantiation of TERP with edit weights optimized for correlation with adequacy in MT evaluations
• GTM (General Text Matcher), a generalization of the F-measure that rewards contiguous matching spans (Turian et al., 2003)

Additionally, targeted versions of BLEU, METEOR, TER, and GTM were computed by using the human-repaired outputs as the reference set. The human repair was different from the reference sentence in 193 cases (about 9% of the total), and we expected this to result in better scores and correlations with the human judgments overall.

4 Results

4.1 Human judgments

Table 1 summarizes the dataset, as well as the mean adequacy and fluency scores garnered from the human evaluation. Overall adequacy and fluency judgments were high (4.16, 3.63) for the realizer systems on average, and the best-rated realizer systems achieved mean fluency scores above 4.

4.2 Inter-annotator agreement

Inter-annotator agreement was measured using the κ-coefficient, which is commonly used to measure the extent to which annotators agree in category judgment tasks. κ is defined as $\frac{P(A) - P(E)}{1 - P(E)}$, where P(A) is the observed agreement between annotators and P(E) is the probability of agreement due to chance (Carletta, 1996). Chance agreement for this data is calculated by the method discussed in Carletta’s squib. However, in previous work in MT meta-evaluation, Callison-Burch et al. (2007) assume the less strict criterion of uniform chance agreement, i.e. 1/5 for a five-point scale. They also introduce the notion of “relative” κ, which measures how often two or more judges agreed that A > B, A = B, or A < B for two outputs A and B, irrespective of the specific values given on the five-point scale; here, uniform chance agreement is taken to be 1/3. We report both absolute and relative κ in Table 2, using actual chance agreement rather than uniform chance agreement.

Score | Adequacy | Fluency
5 | All the meaning of the reference | Perfectly grammatical
4 | Most of the meaning | Awkward or non-native; punctuation errors
3 | Much of the meaning | Agreement errors or minor syntactic problems
2 | Meaning substantially different | Major syntactic problems, such as missing words
1 | Meaning completely different | Completely ungrammatical

Figure 1: Rating scale and guidelines

Ref.: It wasn’t clear how NL and Mr. Simmons would respond if Georgia Gulf spurns them again
Realiz.: It weren’t clear how NL and Mr. Simmons would respond if Georgia Gulf again spurns them
Repair: It wasn’t clear how NL and Mr. Simmons would respond if Georgia Gulf again spurns them

Figure 2: Example of repair

The κ scores of 0.60 for adequacy and 0.63 for fluency across the entire dataset represent “substantial” agreement, according to the guidelines discussed in (Landis and Koch, 1977), better than is typically reported for machine translation evaluation tasks; for example, Callison-Burch et al. (2007) reported “fair” agreement, with κ = 0.281 for fluency and κ = 0.307 for adequacy (relative).
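The κ definition in Section 4.2 can be illustrated with a small sketch. It is not the authors' code: the function names and the toy ratings are made up for illustration, and estimating chance agreement from the pooled label distribution of both judges is only one common reading of "actual chance agreement", so the paper's exact computation may differ.

```python
from collections import Counter


def kappa(p_a, p_e):
    """kappa = (P(A) - P(E)) / (1 - P(E))."""
    return (p_a - p_e) / (1.0 - p_e)


def observed_agreement(ratings_1, ratings_2):
    """P(A): fraction of items on which the two judges gave the same label."""
    matches = sum(x == y for x, y in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)


def pooled_chance_agreement(ratings_1, ratings_2):
    """P(E) estimated from the pooled label distribution of both judges.

    Assumption: this is one standard estimator, not necessarily the paper's.
    """
    counts = Counter(ratings_1) + Counter(ratings_2)
    total = len(ratings_1) + len(ratings_2)
    return sum((c / total) ** 2 for c in counts.values())


def uniform_chance_agreement(num_categories):
    """P(E) under uniform chance: 1/5 for a five-point scale, 1/3 for relative kappa."""
    return 1.0 / num_categories


# Hypothetical ratings from two judges on a five-point scale (toy values).
judge_1 = [5, 4, 4, 3, 5, 2, 4, 5]
judge_2 = [5, 4, 3, 3, 5, 2, 4, 4]

p_a = observed_agreement(judge_1, judge_2)
kappa_pooled = kappa(p_a, pooled_chance_agreement(judge_1, judge_2))
kappa_uniform = kappa(p_a, uniform_chance_agreement(5))
print(round(kappa_pooled, 3), round(kappa_uniform, 3))
```

Because uniform chance agreement is usually smaller than chance agreement estimated from skewed rating distributions, the uniform variant tends to give the higher κ on the same data.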
Assuming the uniform chance agreement that the previously cited work adopts, our inter-annotator agreements (both absolute and relative) are still higher. This is likely due to the generally high quality of the realizations evaluated, leading to easier judgments.

4.3 Correlation with automatic evaluation

To determine how well the automatic evaluation methods described in Section 3 correlate with the human judgments, we averaged the human judgments for adequacy and fluency, respectively, for each of the rated realizations, and then computed both Pearson’s correlation coefficient and Spearman’s rank correlation coefficient between these scores and each of the metrics. Spearman’s correlation makes fewer assumptions about the distribution of the data, but may not reflect a linear relationship that is actually present. Both are frequently reported in the literature. Due to space constraints, we show only Spearman’s correlation, although the TER family scored slightly better on Pearson’s coefficient, relatively. The results for Spearman’s correlation are given in Table 3. Additionally, the average scores for adequacy and fluency were themselves averaged into a single score, following (Snover et al., 2009), and the Spearman’s correlations of each of the automatic metrics with these scores are given in Table 4. All reported correlations are significant at p < 0.001.

4.4 Bootstrap sampling of correlations

For each of the sub-corpora shown in Table 1, we computed confidence intervals for the correlations between adequacy and fluency human scores with selected automatic metrics (BLEU, HBLEU, TER, TERP, and HTER) as described in (Koehn, 2004). We sampled each sub-corpus 1000 times with replacement, and calculated correlations between the rankings induced by the human scores and those induced by the metrics for each reference sentence.
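A rough sketch of the rank correlation and the resampling step follows, together with the trimming of the 25 highest and 25 lowest coefficients that the text turns to next. It is a simplified illustration under stated assumptions, not the authors' procedure: it resamples individual (human score, metric score) pairs rather than per-reference-sentence rankings, it treats a resample with no variance as correlation 0, and all names and scores are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant (a simplification)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_interval(human, metric, samples=1000, trim=25, seed=0):
    """Resample with replacement, then drop the `trim` lowest and highest coefficients."""
    rng = random.Random(seed)
    n = len(human)
    coeffs = []
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        coeffs.append(spearman([human[i] for i in idx], [metric[i] for i in idx]))
    coeffs.sort()
    kept = coeffs[trim:samples - trim]
    return kept[0], kept[-1]


# Hypothetical averaged human scores and metric scores (toy values only).
human_scores = [4.5, 3.0, 5.0, 2.5, 4.0, 3.5, 5.0, 1.5, 4.0, 3.0]
metric_scores = [0.71, 0.40, 0.88, 0.35, 0.52, 0.60, 0.79, 0.20, 0.66, 0.45]
low, high = bootstrap_interval(human_scores, metric_scores)
print(low, high)
```

With 1000 resamples, discarding 25 coefficients at each end leaves the central 95%, which is why the trimmed range serves as a 95% confidence interval.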
We then used these coefficients to estimate the confidence interval, after excluding the top 25 and bottom 25 coefficients, following (Lin and Och, 2004). The results of this for the BLEU metric are shown in Table 5. We determined which correlations lay within the 95% confidence interval of the best performing metric in each row of Table 3; these figures are italicized.

5 Discussion

5.1 Human judgments of systems

The results for the four OpenCCG perceptron models mostly confirm those reported in (White and Rajkumar, 2009), with one exception: the B-3 model was below B-2, though the P-B (perceptron-best) model still scored highest. This may have been due to differences in the testing scenario. None of the differences in adequacy scores among the individual systems are significant, with the exception of the WordNet system. In this case, the lack of word-sense disambiguation for the substituted words results in a poor overall adequacy score (e.g., wage floor → wage story). Conversely, it scores highest for fluency, as substituting a noun or verb with a synonym does not usually introduce ungrammaticality.

5.2 Correlations of human judgments with MT metrics

Of the non-human-targeted metrics evaluated, BLEU and TER/TERP demonstrate the highest correlations with the human judgments of fluency (r = 0.62, 0.64). The TER family of evaluation metrics have been observed to perform very well in MT-evaluation tasks, and although the data evaluated here differs from typical MT data in some important ways, the correlation of TERP with the human judgments is substantial. In contrast with previous MT evaluations where TERP performs considerably better than TER, these scored close to equal on our data, possibly because TERP’s stem, synonym, and paraphrase matching are less useful when most of the variation is syntactic.
The correlations with BLEU and METEOR are lower than those reported in (Callison-Burch et al., 2007); in that study, BLEU achieved adequacy and fluency correlations of 0.690 and 0.722, respectively, and METEOR achieved 0.701 and 0.719. The correlations for these metrics might be expected to be lower for our data, since overall quality is higher, making the metrics’ task more difficult as the outputs involve subtler differences between acceptable and unacceptable variation. The human-targeted metrics (represented by the prefixed H in the data tables) correlated even more strongly with the human judgments, compared to the non-targeted versions. HTER demonstrated the best correlation with realizer fluency (r = 0.75). For several kinds of acceptable variation involving the rearrangement of constituents (such as dative shift), TERP gives a more reasonable score than BLEU, due to its ability to directly evaluate phrasal shifts. The following realization was rated 4.5 for fluency, and was more correctly ranked by TERP than BLEU:

(3) Ref: The deal also gave Mitsui access to a high-tech medical product.
(4) Realiz.: The deal also gave access to a high-tech medical product to Mitsui.

For each reference sentence, we compared the ranking of its realizations induced from the human scores to the ranking induced from the TERP score, and counted the rank errors by the latter, informally categorizing them by error type (see Table 7). In the 50 sentences with the highest numbers of rank errors, 17 were affected by punctuation differences, typically involving variation in comma placement. Human fluency judgments of outputs with only punctuation problems were generally high, and many realizations with commas inserted or removed were rated fully fluent by the annotators. However, TERP penalizes such insertions or deletions. Agreement errors are another frequent source of ranking errors for TERP.
The human judges tended to harshly penalize sentences with number-agreement or tense errors, whereas TERP applies only a single substitution penalty for each such error. We expect that with suitable optimization of edit weights to avoid over-penalizing punctuation shifts and under-penalizing agreement errors, TERP would exhibit an even stronger correlation with human fluency judgments. None of the evaluation metrics can distinguish an acceptable movement of a word or constituent from an unacceptable movement, with only one reference sentence. A substantial source of error for both TERP and BLEU is variation in adverbial placement, as shown in (7). Similar errors are seen with prepositional phrases and some commonly-occurring temporal adverbs, which typically admit a number of variations in placement. Another important example of acceptable variation which these metrics do not generally rank correctly is dative alternation:

(5) Ref. When test booklets were passed out 48 hours ahead of time, she says she copied questions in the social studies section and gave the answers to students.
(6) Realiz. When test booklets were passed out 48 hours ahead of time , she says she copied questions in the social studies section and gave students the answers.

(7) Ref. We need to clarify what exactly is wrong with it.

Realiz. | Flu. | TERP | BLEU
We need to clarify exactly what is wrong with it. | 5 | 0.1 | 0.5555
We need to clarify exactly what ’s wrong with it. | 5 | 0.2 | 0.4046
We need to clarify what , exactly , is wrong with it. | 5 | 0.2 | 0.5452
We need to clarify what is wrong with it exactly. | 4.5 | 0.1 | 0.6756
We need to clarify what exactly , is wrong with it. | 4 | 0.1 | 0.7017
We need to clarify what , exactly is wrong with it. | 4 | 0.1 | 0.7017
We needs to clarify exactly what is wrong with it. | 3 | 0.103 | 0.346

The correlations of each of the metrics with the human judgments of fluency for the realizer systems indicate at least a moderate relationship, in contrast with the results reported in (Stent et al., 2005) for paraphrase data, which found an inverse correlation for fluency, and (Cahill, 2009) for the output of a surface realizer for German, which found only a weak correlation. However, the former study employed a corpus-based paraphrase generation system rather than grammar-driven surface realizers, and the resulting paraphrases exhibited much broader variation. In Cahill’s study, the outputs of the realizer were almost always grammatically correct, and the automated evaluation metrics were ranking markedness instead of grammatical acceptability.

5.3 System-level comparisons

In order to investigate the efficacy of the metrics in ranking different realizer systems, or competing realizations from the same system generated using different ranking models, we considered seven different “systems” from the whole dataset of realizations. These consisted of five OpenCCG-based realizations (the best realization from three baseline models, and the best and the worst realization from the full perceptron model), and two XLE-based systems (the best and the worst realization, after ranking the outputs of the XLE realizer with an n-gram model). The mean of the combined adequacy and fluency scores of each of these seven systems was compared with that of every other system, resulting in 21 pairwise comparisons. Then Tukey’s HSD test was performed to determine the systems which differed significantly in terms of the average adequacy and fluency rating they received.[4] The test revealed five pairwise comparisons where the scores were significantly different. Subsequently, for each of these systems, an overall system-level score for each of the MT metrics was calculated.
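The pairwise set-up of Section 5.3 can be sketched as follows. This is an illustrative sketch only: the system labels and all scores are invented toy values, the set of significantly different pairs is taken as given rather than computed with Tukey's HSD, and `count_correct` is a hypothetical helper that checks whether a metric orders each such pair the same way as the human scores do.

```python
from itertools import combinations


def all_pairs(systems):
    """Every unordered pair of systems; seven systems give 21 pairs."""
    return list(combinations(systems, 2))


def count_correct(pairs, human, metric, higher_is_better=True):
    """Count pairs that the metric orders the same way as the human scores.

    For error-rate style metrics such as TER, lower is better, so the sign
    of the metric difference is flipped before comparing.
    """
    correct = 0
    for a, b in pairs:
        human_diff = human[a] - human[b]
        metric_diff = metric[a] - metric[b]
        if not higher_is_better:
            metric_diff = -metric_diff
        if human_diff * metric_diff > 0:
            correct += 1
    return correct


# Hypothetical system labels standing in for the seven "systems".
systems = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
pairs = all_pairs(systems)

# Toy scores for three of the systems (not taken from the paper).
human = {"S1": 4.5, "S2": 3.0, "S3": 4.0}
error_rate = {"S1": 0.10, "S2": 0.30, "S3": 0.05}

# Pairs assumed to differ significantly in the human ratings.
significant = [("S1", "S2"), ("S1", "S3")]
n_correct = count_correct(significant, human, error_rate, higher_is_better=False)
print(len(pairs), n_correct)
```

Restricting the check to pairs that differ significantly in the human ratings avoids penalizing a metric for disagreeing on differences that may be noise.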
For the five pairwise comparisons where the adequacy-fluency group means differed significantly, we checked whether the metric ranked the systems correctly. Table 8 shows the results of a pairwise comparison between the ranking induced by each evaluation metric, and the ranking induced by the human judgments. Five of the seven non-targeted metrics correctly rank more than half of the systems. NIST, METEOR, and GTM get the most comparisons right, but neither NIST nor GTM correctly rank the OpenCCG-baseline model 1 with respect to the XLE-best model. TER and TERP get two of the five comparisons correct, and they incorrectly rank two of the five OpenCCG model comparisons, as well as the comparison between the XLE-worst and OpenCCG-best systems. For the targeted metrics, HNIST is correct for all five comparisons, while neither HBLEU nor HMETEOR correctly rank all the OpenCCG models. On the other hand, HTER and HGTM incorrectly rank the XLE-best system versus OpenCCG-based models. In summary, some of the metrics get some of the rankings correct, but none of the non-targeted metrics get all of them correct. Moreover, different metrics make different ranking errors. This argues for the use of multiple metrics in comparing realizer systems.

[4] This particular test was chosen since it corrects for multiple post-hoc analyses conducted on the same data-set.

6 Conclusion

Our study suggests that although the task of evaluating the output from realizer systems differs from the task of evaluating machine translations, the automatic metrics used to evaluate MT outputs deliver moderate correlations with combined human fluency and adequacy scores when used on surface realizations. We also found that the MT-evaluation metrics are useful in evaluating different versions of the same realizer system (e.g., the various OpenCCG realization ranking models), and finding cases where a system is performing poorly.
As in MT-evaluation tasks, human-targeted metrics have the highest correlations with human judgments overall. These results suggest that the MT-evaluation metrics are useful for developing surface realizers. However, the correlations are lower than those reported for MT data, suggesting that they should be used with caution, especially for cross-system evaluation, where consulting multiple metrics may yield more reliable comparisons. In our study, the targeted version of TERP correlated most strongly with human judgments of fluency. In future work, the performance of the TER family of metrics on this data might be improved by opti- mizing the edit weights used in computing its scores, so as to avoid over-penalizing punctuation movements or under-penalizing agreement errors, both of which were significant sources of ranking errors. Multiple reference sentences may also help mitigate these problems, and the corpus of human-repaired realizations that has resulted from our study is a step in this direction, as it provides multiple references for some cases. We expect the corpus to also prove useful for feature engineering and error analysis in developing better realization models.5 Acknowledgements We thank Aoife Cahill and Tracy King for providing us with the output of the XLE generator. We also thank Chris Callison-Burch and the anonymous reviewers for their helpful comments and suggestions. 5The corpus can be downloaded from http : / /www . l ing .ohio-st ate . edu / ˜mwhite / dat a / emnlp 10 / . 571 This material is based upon work supported by the National Science Foundation under Grant No. 0812297. References Jason Baldridge. 2002. Lexically Specified Derivational Control in Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh. S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. 
In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
R. Barzilay and L. Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT-NAACL, volume 2003, pages 16–23.
Aoife Cahill. 2009. Correlating human and automatic evaluation of a German surface realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 97–100, Suntec, Singapore, August. Association for Computational Linguistics.
C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Reevaluating the role of BLEU in machine translation research. In Proceedings of EACL, volume 2006, pages 249–256.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In StatMT ’07: Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Morristown, NJ, USA. Association for Computational Linguistics.
C. Callison-Burch, T. Cohn, and M. Lapata. 2008. Parametric: An automatic evaluation metric for paraphrasing. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 97–104. Association for Computational Linguistics.
J. Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.
Dick Crouch, Mary Dalrymple, Ron Kaplan, Tracy King, John Maxwell, and Paula Newman. 2008. XLE documentation. Technical report, Palo Alto Research Center.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.
Chin-Yew Lin and Franz Josef Och. 2004. Orange: a method for evaluating automatic evaluation metrics for machine translation.
In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 501, Morristown, NJ, USA. Association for Computational Linguistics.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. Technical report, IBM Research.
E. Reiter and A. Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. III Maxwell, and Mark Johnson. 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 271–278, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.
M. Snover, N. Madnani, B.J. Dorr, and R. Schwartz. 2009. Fluency, adequacy, or HTER?: exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268. Association for Computational Linguistics.
Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Proceedings of CICLing.
J.P. Turian, L. Shen, and I.D. Melamed. 2003. Evaluation of machine translation and its evaluation. recall (C— R), 100:2.
Michael White and Rajakrishnan Rajkumar. 2009. Perceptron reranking for CCG realization. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 410–419, Singapore, August. Association for Computational Linguistics.
Michael White. 2006.
Efficient Realization of Coordinate Structures in Combinatory Categorial Grammar. Research on Language and Computation, 4(1):39–75.

Table 1: Descriptive statistics

Table 2: Corpora-wise inter-annotator agreement (absolute and relative κ values shown)

Table 3: Spearman’s correlations among NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), TER (T), human variants (HN, HB, HM, HT, HG) and human judgments (-Adq: adequacy and -Flu: fluency); scores which fall within the 95% CI of the best are italicized.

Table 4: Spearman’s correlations among NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), TER (T), human variants (HN, HB, HM, HT, HG) and human judgments (combined adequacy and fluency scores)

Table 5: Spearman’s correlation analysis (bootstrap sampling) of the BLEU scores of various systems with human adequacy and fluency scores

Table 6: Spearman’s correlations of NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), human variants (HT, HN, HB, HM, HG), and individual human judgments (combined adq. and flu. scores)

Table 7: Factors influencing TERP ranking errors for 50 worst-ranked realization groups

Factor                     Count
Punctuation                17
Adverbial shift            16
Agreement                  14
Other shifts               8
Conjunct rearrangement     8
Complementizer ins/del     5
PP shift                   4

Table 8: Metric-wise ranking performance in terms of agreement with a ranking induced by combined adequacy and fluency scores; each metric gets a score out of 5 (i.e., the number of system-level comparisons that emerged significant as per Tukey’s HSD test). Legend: Perceptron Best (PB); Perceptron Worst (PW); XLE Best (XB); XLE Worst (XW); OpenCCG baseline models 1 to 3 (C1 ... C3)
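
The score out of 5 described for Table 8 can be illustrated with a minimal sketch: for each system pair whose human scores differ significantly, check whether the metric orders the pair the same way as the human judgments. The function name, the `higher_is_better` flag, and all numbers below are hypothetical and are not taken from the paper's data.

```python
def ranking_agreement(metric_scores, human_scores, significant_pairs,
                      higher_is_better=True):
    """Count significant system pairs that the metric orders like the humans do."""
    correct = 0
    for a, b in significant_pairs:
        human_diff = human_scores[a] - human_scores[b]
        metric_diff = metric_scores[a] - metric_scores[b]
        if not higher_is_better:
            # Error-rate style metrics (e.g. TER): a lower score is better.
            metric_diff = -metric_diff
        # Same sign means the metric agrees with the human ordering.
        if human_diff * metric_diff > 0:
            correct += 1
    return correct


# Hypothetical system-level means (assumed values for illustration only).
human = {"A": 4.1, "B": 3.2, "C": 2.5}
metric = {"A": 0.71, "B": 0.74, "C": 0.40}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]

score = ranking_agreement(metric, human, pairs)
print(f"{score} of {len(pairs)} significant comparisons ranked correctly")
```

Which pairs count as significant would come from a post-hoc test such as Tukey's HSD, which this sketch takes as given input.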

same-paper 2 0.96509254 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

Author: Zhongqiang Huang ; Mary Harper ; Slav Petrov

Abstract: We study self-training with products of latent variable grammars in this paper. We show that increasing the quality of the automatically parsed data used for self-training gives higher-accuracy self-trained grammars. Our generative self-trained grammars reach F scores of 91.6 on the WSJ test set and surpass even discriminative reranking systems without self-training. Additionally, we show that multiple self-trained grammars can be combined in a product model to achieve even higher accuracy. The product model is most effective when the individual underlying grammars are most diverse. Combining multiple grammars that were self-trained on disjoint sets of unlabeled data results in a final test accuracy of 92.5% on the WSJ test set and 89.6% on our Broadcast News test set.
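
The idea of combining several grammars in a product model can be sketched in simplified form. The abstract does not say how the product is computed during parsing, so the sketch below assumes each grammar simply assigns a log-probability to each whole candidate parse; the function name and all probabilities are hypothetical.

```python
import math


def product_model_best(candidates, grammar_log_probs):
    """Pick the candidate with the highest product of per-grammar probabilities.

    grammar_log_probs holds one dict per grammar, mapping candidate -> log P.
    Multiplying probabilities is the same as summing log-probabilities.
    """
    def combined_log_prob(candidate):
        return sum(log_probs[candidate] for log_probs in grammar_log_probs)

    return max(candidates, key=combined_log_prob)


# Hypothetical probabilities for two candidate parses under three grammars.
grammars = [
    {"tree_a": math.log(0.6), "tree_b": math.log(0.4)},
    {"tree_a": math.log(0.3), "tree_b": math.log(0.7)},
    {"tree_a": math.log(0.2), "tree_b": math.log(0.8)},
]
best = product_model_best(["tree_a", "tree_b"], grammars)
print(best)
```

In this toy case the first grammar prefers tree_a, but the product favours tree_b because the other two grammars agree on it. Diverse grammars tend to make different errors, which is one way to read the abstract's remark about diversity.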

3 0.9567399 77 emnlp-2010-Measuring Distributional Similarity in Context

Author: Georgiana Dinu ; Mirella Lapata

Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In this paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.
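
The intuition of a sense distribution modulated by context can be illustrated with a toy sketch. The Bayes-style update P(z | w, c) ∝ P(z | w) · P(c | z) and the cosine comparison are assumptions made for this illustration, not the model proposed in the paper, and all numbers are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sense_distribution_in_context(p_sense_given_word, p_context_given_sense):
    """Assumed update: P(z | w, c) is proportional to P(z | w) * P(c | z)."""
    return normalize([pz * pc for pz, pc
                      in zip(p_sense_given_word, p_context_given_sense)])


def cosine(p, q):
    """Cosine similarity between two sense distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Two hypothetical latent senses: [waterside, finance].
bank = [0.5, 0.5]            # "bank" out of context
river_context = [0.9, 0.1]   # P(context = "river" | sense)
shore = [0.8, 0.2]           # "shore" out of context

bank_in_context = sense_distribution_in_context(bank, river_context)
print(cosine(bank, shore), cosine(bank_in_context, shore))
```

With these toy numbers, "bank" moves toward the waterside sense once the context is taken into account, so its similarity to "shore" increases.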

4 0.95150203 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

Author: Yoong Keok Lee ; Aria Haghighi ; Regina Barzilay

Abstract: Part-of-speech (POS) tag distributions are known to exhibit sparsity: a word is likely to take a single predominant tag in a corpus. Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy. However, in existing systems, this expansion comes with a steep increase in model complexity. This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments. In addition, this formulation results in a dramatic reduction in the number of model parameters, thereby enabling unusually rapid training. Our experiments consistently demonstrate that this model architecture yields substantial performance gains over more complex tagging counterparts. On several languages, we report performance exceeding that of more complex state-of-the-art systems.

5 0.94610274 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

Author: Libin Shen ; Bing Zhang ; Spyros Matsoukas ; Jinxi Xu ; Ralph Weischedel

Abstract: In modern machine translation practice, a statistical phrasal or hierarchical translation system usually relies on a huge set of translation rules extracted from bi-lingual training data. This approach not only results in space and efficiency issues, but also suffers from the sparse data problem. In this paper, we propose to use factorized grammars, an idea widely accepted in the field of linguistic grammar construction, to generalize translation rules, so as to solve these two problems. We designed a method to take advantage of the XTAG English Grammar to facilitate the extraction of factorized rules. We experimented on various setups of low-resource language translation, and showed consistent significant improvement in BLEU over state-of-the-art string-to-dependency baseline systems with 200K words of bi-lingual training data.

6 0.82213426 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

7 0.80352932 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

8 0.79885542 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

9 0.79191297 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

10 0.78480643 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

11 0.78199798 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.77954525 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

13 0.77885783 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

14 0.74959648 94 emnlp-2010-SCFG Decoding Without Binarization

15 0.74663305 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

16 0.74074107 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

17 0.73623818 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

18 0.72732723 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

19 0.72514123 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

20 0.7159338 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation