emnlp emnlp2012 emnlp2012-18 knowledge-graph by maker-knowledge-mining

18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP


Source: pdf

Author: Taylor Berg-Kirkpatrick ; David Burkett ; Dan Klein

Abstract: We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. [sent-3, score-0.394]

2 First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? [sent-4, score-0.411]

3 1 Introduction. It is, or at least should be, nearly universal that NLP evaluations include statistical significance tests to validate metric gains. [sent-12, score-0.542]

4 As important as significance testing is, relatively few papers have empirically investigated its practical properties. [sent-13, score-0.332]

5 In this paper, we investigate two aspects of the empirical behavior of paired significance tests for NLP systems. [sent-16, score-0.394]

6 What should be made of the conventional wisdom, which often springs up, that a certain metric gain is roughly the point of significance for a given task? [sent-19, score-0.496]

7 Here, we show that more similar systems tend to achieve significance with smaller metric gains, reflecting the fact that their outputs are more correlated. [sent-27, score-0.63]
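
A standard variance identity (added here for intuition; it is not quoted from the paper) makes the correlation effect concrete: if s_A and s_B are the per-sentence scores of the two systems on an i.i.d. test set of size n, the paired gain δ has

$$\mathrm{Var}[\delta] \;=\; \frac{1}{n}\left(\mathrm{Var}[s_A] + \mathrm{Var}[s_B] - 2\,\mathrm{Cov}[s_A, s_B]\right),$$

so the more positively correlated the two systems' outputs, the smaller the variance of the observed gain, and the smaller the gain that suffices for a given significance level.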

8 For example, in designing a shared task it is important to know how large the test set must be in order for significance tests to be sensitive to small gains in the performance metric. [sent-29, score-0.516]

9 For example, doubling the test size will not obviate the need for significance testing. [sent-31, score-0.363]

10 Thus, in the course of our investigations, we propose a very simple method for automatically generating arbitrary numbers of comparable system outputs and we then validate the trends revealed by our synthetic method against data from public competitions. [sent-35, score-0.344]

11 Finally, we consider a related and perhaps even more important question that can only be answered empirically: to what extent is statistical significance on a test corpus predictive of performance on other test corpora, in-domain or otherwise? [sent-37, score-0.374]

12 Focusing on constituency parsing, we investigate the relationship between significance levels and actual performance. [sent-38, score-0.441]

13 However, as the domain of the new data diverges from that of the test set, the predictive ability of significance level drops off dramatically. [sent-45, score-0.316]

14 2 Statistical Significance Testing in NLP. First, we review notation and standard practice in significance testing to set up our empirical investigation. [sent-46, score-0.332]

15 Hypothesis testing consists of attempting to estimate this likelihood, written p(δ(X) > δ(x) | H0), where X is a random variable over possible test sets of size n that we could have drawn, and δ(x) is a constant, the metric gain we actually observed. [sent-55, score-0.451]

16 Our null hypothesis does not condition on the test data, and therefore the bootstrap is a better choice. [sent-66, score-0.33]

17 The bootstrap therefore draws each simulated test set from x itself, sampling n items from x with replacement; these new test sets are called bootstrap samples. [sent-82, score-0.432]
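
A minimal Python sketch of this paired bootstrap test is given below. It is an illustration rather than the authors' implementation: the function and variable names are ours, per-sentence scores are simply averaged (corpus-level metrics such as BLEU, F1, or AER would instead be recomputed from resampled sufficient statistics), and the re-centering by a factor of two encodes the common assumption, noted a few sentences below, that the mean bootstrap gain equals the observed test gain.

    import numpy as np

    def paired_bootstrap_pvalue(scores_a, scores_b, num_samples=10000, seed=0):
        # Per-sentence scores for systems A and B on the same test set x.
        scores_a = np.asarray(scores_a, dtype=float)
        scores_b = np.asarray(scores_b, dtype=float)
        n = len(scores_a)
        observed_gain = scores_a.mean() - scores_b.mean()  # delta(x)

        rng = np.random.default_rng(seed)
        exceed = 0
        for _ in range(num_samples):
            # Draw a bootstrap sample: n items from x, with replacement.
            idx = rng.integers(0, n, size=n)
            boot_gain = scores_a[idx].mean() - scores_b[idx].mean()
            # Count samples whose gain exceeds twice the observed gain,
            # i.e. center the bootstrap distribution at delta(x).
            if boot_gain > 2 * observed_gain:
                exceed += 1
        # Estimate of p(delta(X) > delta(x) | H0).
        return exceed / num_samples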

18 We run the bootstrap using several metrics: F1-measure for constituency parsing, unlabeled dependency accuracy for dependency parsing, alignment error rate (AER) for word alignment, ROUGE score (Lin, 2004) for summarization, and BLEU score for machine translation. [sent-93, score-0.549]

19 While obtaining such data is not generally feasible, for several tasks there are public competitions to which systems are submitted by many researchers. [sent-100, score-0.31]

20 We obtained system outputs from the TAC 2008 workshop on automatic summarization (Dang and Owczarzak, 2008), the CoNLL 2007 shared task on dependency parsing (Nivre et al. [sent-102, score-0.413]

21 Note that the bootstrap procedure given here only approximates the true significance level, with multiple sources of approximation error. [sent-107, score-0.428]

22 A third is the assumption that the mean bootstrap gain is the test gain (which could be further corrected for if the metric is sufficiently ill-behaved). [sent-110, score-0.571]

23 ROUGE improvement on the TAC 2008 test set for comparisons between all pairs of the 58 participating systems at TAC 2008. [sent-117, score-0.849]

24 Comparisons between systems entered by the same research group and comparisons between systems entered by different research groups are shown separately. [sent-118, score-1.236]

25 It suggests that relatively quickly we reach ROUGE gains for which, in practice, significance tests will most likely be positive. [sent-129, score-0.454]

26 We might expect that systems whose outputs are highly correlated will achieve higher confidence at lower metric gains. [sent-130, score-0.559]

27 In order to run bootstraps between all pairs of systems quickly, we reuse a random sample counts matrix between bootstrap runs. [sent-132, score-0.406]

28 unlabeled dependency accuracy improvement on the Chinese CoNLL 2007 test set for comparisons between all pairs of the 21 participating systems in the CoNLL 2007 shared task. [sent-138, score-0.992]

29 Comparisons between systems entered by the same research group and comparisons between systems entered by different research groups are shown separately. [sent-139, score-1.236]

30 We separately show the comparisons between systems entered by the same research group and comparisons between systems entered by different research groups, with the expectation that systems entered by the same group are likely to have more correlated outputs. [sent-140, score-2.173]

31 Many of the comparisons between systems submitted by the same group are offset from the main curve. [sent-141, score-0.802]

32 For example, let’s say we take all the comparisons with p-value between 0. [sent-145, score-0.591]

33 Each of these comparisons has an associated metric gain, and by taking, say, the 95th percentile of these metric gains, we get a potentially useful threshold. [sent-148, score-0.826]
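
The threshold construction just described can be sketched as follows (a hedged illustration; the p-value band, the percentile, and all names are parameters of the heuristic rather than values fixed by the extract above):

    import numpy as np

    def gain_threshold(comparisons, p_low, p_high, percentile=95):
        # comparisons: (metric_gain, p_value) pairs from many pairwise
        # system comparisons on the same test set.
        gains = [gain for gain, p in comparisons if p_low <= p <= p_high]
        if not gains:
            raise ValueError("no comparisons fall in the requested p-value band")
        # The chosen percentile of these gains is the "gain that roughly
        # implies significance" for this task and test set.
        return float(np.percentile(gains, percentile))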

34 10 on the exact same test set, there is a pretty good chance that a statistical significance test would show significance at the p-value(x) < 0. [sent-153, score-0.632]

35 We have already seen that pairs of systems submitted by the same research group and by different research groups follow different trends, and we will soon see more evidence demonstrating the importance of system correlation in determining the relationship between metric gain and confidence. [sent-160, score-0.647]

36 There are many factors at work, and so, of course, metric gain alone will not fully determine the outcome of a paired significance test. [sent-162, score-0.536]

37 We use the outputs of the 21 systems that participated in the CoNLL 2007 shared task on dependency parsing. [sent-166, score-0.314]

38 In Figure 3, we plot, for all pairs, the gain in unlabeled dependency accuracy against confidence on the CoNLL 2007 Chinese test set, which consists of 690 sentences and parses. [sent-167, score-0.377]

39 We again separate comparisons between systems submitted by the same research group and those submitted by different groups, although for this task there were fewer cases of multiple submission. [sent-168, score-0.894]

40 The results resemble the plot for summarization; we again see a curve-shaped trend, and comparisons between systems from the same group (few as they are) achieve higher confidences at lower metric gains. [sent-169, score-0.924]

41 We run an experiment using the outputs of the 31 systems participating in WMT 2010 on the system combination portion of the German-English WMT 2010 news test set, which consists of 2,034 German sentences and English translations. [sent-173, score-0.524]

42 We again run comparisons for pairs of participating systems. [sent-174, score-0.702]

43 We plot gain in test BLEU score against confidence in Figure 4. [sent-175, score-0.32]

44 BLEU improvement on the system combination portion of the German-English WMT 2010 news test set for comparisons between pairs of the 31 participating systems at WMT 2010. [sent-177, score-0.972]

45 Comparisons between systems entered by the same research group, comparisons between systems entered by different research groups, and comparisons between system combination entries are shown separately. [sent-178, score-1.742]

46 This task also yields comparisons that are likely to have specially correlated systems: 13 of the submitted systems are system combinations, and each takes into account the same set of proposed translations. [sent-179, score-0.361]

47 We separate comparisons into three sets: comparisons between non-combined systems entered by different research groups, comparisons between non-combined systems entered by the same research group, and comparisons between system-combinations. [sent-180, score-2.814]

48 Different group comparisons, same group comparisons, and system combination comparisons form distinct curves. [sent-182, score-0.724]

49 This indicates, again, that comparisons between systems that are expected to be specially correlated achieve high confidence at lower metric gain levels. [sent-183, score-1.13]

50 The first class can be used to approximate comparisons of systems that are expected to be specially correlated, and the second for comparisons of systems that are not. [sent-200, score-1.357]

51 Together, this yields a total of 75 system outputs on the CoNLL 2007 Chinese test set, 25 systems for each base model type. [sent-208, score-0.429]

52 This ensures that for each pair of model types we will be able to see comparisons where the metric gains are small. [sent-210, score-0.757]
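
The synthetic outputs used throughout this section come from retraining each base model on resampled training sets; a minimal sketch of the resampling step (our naming, assuming sentence-level resampling with replacement) is:

    import numpy as np

    def resampled_training_sets(train_sentences, num_sets=25, seed=0):
        # Produce num_sets training sets of the same size as the original,
        # each drawn from the original training data with replacement.
        # Retraining a base model on each set yields a family of
        # comparable system variants.
        rng = np.random.default_rng(seed)
        n = len(train_sentences)
        return [[train_sentences[i] for i in rng.integers(0, n, size=n)]
                for _ in range(num_sets)]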

53 The results of the pairwise comparisons of all 75 system outputs are shown in Figure 5, along with the results of the CoNLL 2007 shared task system comparisons from Figure 3. [sent-211, score-1.403]

54 unlabeled dependency accuracy improvement on the Chinese CoNLL 2007 test set for comparisons between all pairs of systems generated by using resampled training sets to train either the MST parser, Maltparser, or the ensemble parser. [sent-214, score-1.102]

55 Comparisons between systems generated using the same base model type and comparisons between systems generated using different base model types are shown separately. [sent-215, score-1.015]

56 The CoNLL 2007 shared task comparisons from Figure 3 are also shown. [sent-216, score-0.6]

57 The overlay of the natural comparisons suggests that the synthetic approach reasonably models the relationship between metric gain and confidence. [sent-217, score-0.894]

58 Additionally, the different model type and same model type comparisons exhibit the behavior we would expect, matching the curves corresponding to comparisons between specially correlated systems and standard comparisons respectively. [sent-218, score-1.967]

59 1 to compute the threshold above which the metric gain is probably significant. [sent-221, score-0.317]

60 For comparisons between systems of the same model type, the threshold is 1. [sent-222, score-0.731]

61 For comparisons between systems of different model types, the threshold is 1. [sent-224, score-0.731]

62 For example, during development most comparisons are made between incremental variants of the same system. [sent-228, score-0.56]

63 BLEU improvement on the system combination portion of the German-English WMT 2010 news test set for comparisons between all pairs of systems generated by using resampled training sets to train either Moses or Joshua. [sent-230, score-1.09]

64 Comparisons between systems generated using the same base model type and comparisons between systems generated using different base model types are shown separately. [sent-231, score-1.015]

65 The WMT 2010 workshop comparisons from Figure 4 are also shown. [sent-232, score-0.56]

66 This yields a total of 150 system outputs on the system combination portion of the German-English WMT 2010 news test set. [sent-246, score-0.376]

67 The results of the pairwise comparisons of all 150 system outputs are shown in Figure 6, along with the results of the WMT 2010 workshop system comparisons from Figure 4. [sent-247, score-1.363]

68 The natural comparisons from the WMT 2010 workshop align well with the comparisons between synthetically varied models. [sent-248, score-1.12]

69 Again, the different model type and same model type comparisons form distinct curves. [sent-249, score-0.626]

70 Figure 7: Word alignment: Confidence vs. [sent-250, score-0.652]

71 AER improvement on the Hansard test set for comparisons between all pairs of systems generated by using resampled training sets to train either the ITG aligner, the joint HMM aligner, or GIZA++. [sent-251, score-0.967]

72 Comparisons between systems generated using the same base model type and comparisons between systems generated using different base model types are shown separately. [sent-252, score-1.015]

73 For comparisons between systems of different model types the threshold is 0. [sent-256, score-0.731]

74 (2009), we train the supervised ITG aligner using the first 337 sentence pairs of the hand-aligned Hansard test set; again, we resample 20 training sets of the same size as the original data. [sent-268, score-0.368]

75 Unlike previous plots, the points corresponding to comparisons between systems with different base model types do not all follow a single curve (GIZA++ failed to produce reasonable output when trained with some of these training sets, so there are fewer than 20 GIZA++ systems in our comparisons). [sent-270, score-0.828]

76 F1 improvement on section 23 of the WSJ corpus for comparisons between all pairs of systems generated by using resampled training sets to train either the Berkeley parser, the Stanford parser, or the Collins parser. [sent-272, score-0.909]

77 Comparisons between systems generated using the same base model type and comparisons between systems generated using different base model types are shown separately. [sent-273, score-1.015]

78 It turns out that the upper curve consists only of comparisons between ITG and HMM aligners. [sent-275, score-0.56]

79 For comparisons between systems of the same model type the p-value < 0. [sent-279, score-0.685]

80 For comparisons between systems of different model types the threshold is 1. [sent-282, score-0.731]

81 For comparisons between systems of the same model type, the p-value < 0. [sent-294, score-0.652]

82 For comparisons between systems of different model types the threshold is 0. [sent-297, score-0.731]

83 4 Properties of the Test Corpus. For five tasks, we have seen a trend relating metric gain and confidence, and we have seen that the level of correlation between the systems being compared affects the location of the curve. [sent-299, score-0.36]

84 Next, we look at how the size and domain of the test set play a role, and, finally, how significance level predicts performance on held-out data. [sent-300, score-0.395]

85 In this section, we carry out experiments for both machine translation and constituency parsing, but mainly focus on the latter because of the availability of large test corpora that span more than one domain: the Brown corpus and the held-out portions of the WSJ corpus. [sent-301, score-0.312]

86 1 Varying the Size. Figure 9 plots comparisons for machine translation on variously sized initial segments of the WMT 2010 news test set. [sent-303, score-0.809]

87 Similarly, Figure 10 plots comparisons for constituency parsing on initial segments of the Brown corpus. [sent-304, score-0.867]

88 This phenomenon follows the general shape of the central limit theorem, which predicts that the spread of observed metric gains will shrink with the square root of the test size. [sent-312, score-0.348]
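
In the standard i.i.d. setting this shrinkage can be written explicitly (a textbook approximation added for reference, not quoted from the extract): if σ_δ is the per-sentence standard deviation of the gain, then

$$\mathrm{SD}\big[\delta(X)\big] \;\approx\; \frac{\sigma_{\delta}}{\sqrt{n}},$$

so doubling the test size n narrows the spread of the observed gain only by a factor of √2, which is why doubling the test size does not remove the need for significance testing.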

89 Even using the entire Brown corpus as a test set there is a small range where the result of a paired significance test was not completely determined by metric gain. [sent-313, score-0.547]

90 Figure 11 plots comparisons for a fixed test size, but with various test corpora. [sent-315, score-0.77]

91 2 Empirical Calibration across Domains. Now that we have a way of generating outputs for thousands of pairs of systems, we can empirically check the practical reliability of significance testing. [sent-323, score-0.443]

92 Thus, 92% of the significance tests with p-value in a tight range around 0. [sent-337, score-0.354]

93 It suggests that statistical significance computed using the bootstrap is reasonably well calibrated. [sent-340, score-0.428]

94 This time, for each pair of generated systems we run a bootstrap on section 23. [sent-348, score-0.335]

95 We then compute the fraction of comparisons where system A beats system B on section 22, section 24, and the Brown corpus. [sent-353, score-0.696]
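
The calibration check described here can be sketched as follows (the names and p-value bins are illustrative, not taken from the paper):

    def empirical_calibration(p_values, held_out_gains,
                              bins=((0.0, 0.01), (0.01, 0.05), (0.05, 0.1), (0.1, 1.0))):
        # p_values: dict mapping a system-pair id to the bootstrap p-value
        # computed on the tuning test set (here, WSJ section 23).
        # held_out_gains: dict mapping the same ids to the metric gain of
        # system A over system B on a held-out corpus (section 22,
        # section 24, or Brown).
        results = {}
        for lo, hi in bins:
            in_bin = [pid for pid, p in p_values.items() if lo <= p < hi]
            if in_bin:
                wins = sum(held_out_gains[pid] > 0 for pid in in_bin)
                results[(lo, hi)] = wins / len(in_bin)
        # For each p-value bin, the fraction of pairs where A still beats B
        # on the held-out data.
        return results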

96 It should be considered a good practice to include statistical significance testing results with empirical evaluations. [sent-365, score-0.332]

97 However, we have demonstrated some limitations of statistical significance testing for NLP. [sent-367, score-0.332]

98 Our results reveal that the relationship between metric gain and statistical significance is complex, and therefore simple thresholds are not a replacement for significance tests. [sent-370, score-0.865]

99 On some pitfalls in automatic evaluation and significance testing for mt. [sent-588, score-0.332]

100 More accurate tests for the statistical significance of result differences. [sent-602, score-0.354]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('comparisons', 0.56), ('significance', 0.258), ('entered', 0.195), ('wmt', 0.177), ('bootstrap', 0.17), ('outputs', 0.147), ('constituency', 0.146), ('rouge', 0.145), ('metric', 0.133), ('resampled', 0.115), ('confidence', 0.111), ('itg', 0.106), ('gain', 0.105), ('aligner', 0.102), ('brown', 0.1), ('tac', 0.097), ('tests', 0.096), ('plots', 0.094), ('conll', 0.094), ('submitted', 0.092), ('systems', 0.092), ('competitions', 0.091), ('resample', 0.089), ('bleu', 0.087), ('base', 0.084), ('threshold', 0.079), ('hansard', 0.078), ('correlated', 0.076), ('testing', 0.074), ('bisani', 0.068), ('bootstraps', 0.068), ('maxwell', 0.068), ('victory', 0.068), ('parsing', 0.067), ('participating', 0.066), ('wsj', 0.066), ('gains', 0.064), ('translation', 0.062), ('shape', 0.061), ('summarization', 0.06), ('synthetic', 0.059), ('population', 0.058), ('test', 0.058), ('group', 0.058), ('hypothesis', 0.057), ('validate', 0.055), ('aer', 0.053), ('koehn', 0.053), ('specially', 0.053), ('unlabeled', 0.052), ('sign', 0.051), ('dependency', 0.051), ('maltparser', 0.049), ('gillick', 0.049), ('riezler', 0.049), ('system', 0.048), ('size', 0.047), ('portions', 0.046), ('plot', 0.046), ('calibration', 0.046), ('cox', 0.046), ('nlp', 0.045), ('null', 0.045), ('groups', 0.044), ('degrades', 0.044), ('replacement', 0.041), ('alignment', 0.041), ('paired', 0.04), ('portion', 0.04), ('shared', 0.04), ('beats', 0.04), ('diminishing', 0.039), ('efron', 0.039), ('pairs', 0.038), ('run', 0.038), ('relationship', 0.037), ('quickly', 0.036), ('improvement', 0.035), ('confidences', 0.035), ('participated', 0.035), ('public', 0.035), ('news', 0.035), ('generated', 0.035), ('sets', 0.034), ('type', 0.033), ('hmm', 0.033), ('parser', 0.033), ('thresholds', 0.033), ('yeh', 0.033), ('nivre', 0.032), ('predicts', 0.032), ('giza', 0.032), ('moses', 0.032), ('ensemble', 0.032), ('say', 0.031), ('berkeley', 0.031), ('extreme', 0.031), ('aligners', 0.031), ('bikel', 0.031), ('trend', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

Author: Taylor Berg-Kirkpatrick ; David Burkett ; Dan Klein

Abstract: We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.

2 0.095232755 65 emnlp-2012-Improving NLP through Marginalization of Hidden Syntactic Structure

Author: Jason Naradowsky ; Sebastian Riedel ; David Smith

Abstract: Many NLP tasks make predictions that are inherently coupled to syntactic relations, but for many languages the resources required to provide such syntactic annotations are unavailable. For others it is unclear exactly how much of the syntactic annotations can be effectively leveraged with current models, and what structures in the syntactic trees are most relevant to the current task. We propose a novel method which avoids the need for any syntactically annotated data when predicting a related NLP task. Our method couples latent syntactic representations, constrained to form valid dependency graphs or constituency parses, with the prediction task via specialized factors in a Markov random field. At both training and test time we marginalize over this hidden structure, learning the optimal latent representations for the problem. Results show that this approach provides significant gains over a syntactically uninformed baseline, outperforming models that observe syntax on an English relation extraction task, and performing comparably to them in semantic role labeling.

3 0.084142968 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

Author: David Burkett ; Dan Klein

Abstract: We describe a transformation-based learning method for learning a sequence of monolingual tree transformations that improve the agreement between constituent trees and word alignments in bilingual corpora. Using the manually annotated English Chinese Translation Treebank, we show how our method automatically discovers transformations that accommodate differences in English and Chinese syntax. Furthermore, when transformations are learned on automatically generated trees and alignments from the same domain as the training data for a syntactic MT system, the transformed trees achieve a 0.9 BLEU improvement over baseline trees.

4 0.080627404 86 emnlp-2012-Locally Training the Log-Linear Model for SMT

Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu

Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We propose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.

5 0.080150634 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

Author: Kristian Woodsend ; Mirella Lapata

Abstract: Multi-document summarization involves many aspects of content selection and surface realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

6 0.077471726 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

7 0.07712879 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing

8 0.072602183 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

9 0.071849912 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

10 0.071487248 67 emnlp-2012-Inducing a Discriminative Parser to Optimize Machine Translation Reordering

11 0.070784025 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules

12 0.068217807 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

13 0.068053178 57 emnlp-2012-Generalized Higher-Order Dependency Parsing with Cube Pruning

14 0.067056052 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques

15 0.064203367 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

16 0.061452854 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

17 0.059972208 88 emnlp-2012-Minimal Dependency Length in Realization Ranking

18 0.059260715 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

19 0.058179565 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage

20 0.05748263 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.221), (1, -0.086), (2, -0.038), (3, -0.014), (4, -0.015), (5, -0.005), (6, -0.066), (7, 0.016), (8, 0.048), (9, 0.034), (10, 0.039), (11, 0.063), (12, -0.049), (13, 0.012), (14, -0.015), (15, 0.058), (16, -0.009), (17, -0.012), (18, -0.03), (19, -0.028), (20, 0.085), (21, -0.044), (22, 0.055), (23, 0.006), (24, -0.003), (25, 0.06), (26, -0.022), (27, -0.275), (28, 0.113), (29, 0.145), (30, 0.088), (31, -0.036), (32, 0.011), (33, 0.017), (34, 0.068), (35, -0.066), (36, -0.141), (37, -0.026), (38, 0.13), (39, 0.086), (40, 0.073), (41, 0.21), (42, -0.114), (43, -0.072), (44, -0.109), (45, 0.081), (46, 0.023), (47, 0.088), (48, -0.052), (49, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97502899 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

Author: Taylor Berg-Kirkpatrick ; David Burkett ; Dan Klein

Abstract: We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.

2 0.5376128 88 emnlp-2012-Minimal Dependency Length in Realization Ranking

Author: Michael White ; Rajakrishnan Rajkumar

Abstract: Comprehension and corpus studies have found that the tendency to minimize dependency length has a strong influence on constituent ordering choices. In this paper, we investigate dependency length minimization in the context of discriminative realization ranking, focusing on its potential to eliminate egregious ordering errors as well as better match the distributional characteristics of sentence orderings in news text. We find that with a stateof-the-art, comprehensive realization ranking model, dependency length minimization yields statistically significant improvements in BLEU scores and significantly reduces the number of heavy/light ordering errors. Through distributional analyses, we also show that with simpler ranking models, dependency length minimization can go overboard, too often sacrificing canonical word order to shorten dependencies, while richer models manage to better counterbalance the dependency length minimization preference against (sometimes) competing canonical word order preferences.

3 0.5355438 65 emnlp-2012-Improving NLP through Marginalization of Hidden Syntactic Structure

Author: Jason Naradowsky ; Sebastian Riedel ; David Smith

Abstract: Many NLP tasks make predictions that are inherently coupled to syntactic relations, but for many languages the resources required to provide such syntactic annotations are unavailable. For others it is unclear exactly how much of the syntactic annotations can be effectively leveraged with current models, and what structures in the syntactic trees are most relevant to the current task. We propose a novel method which avoids the need for any syntactically annotated data when predicting a related NLP task. Our method couples latent syntactic representations, constrained to form valid dependency graphs or constituency parses, with the prediction task via specialized factors in a Markov random field. At both training and test time we marginalize over this hidden structure, learning the optimal latent representations for the problem. Results show that this approach provides significant gains over a syntactically uninformed baseline, outperforming models that observe syntax on an English relation extraction task, and performing comparably to them in semantic role labeling.

4 0.38982937 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou

Abstract: The training of most syntactic SMT approaches involves two essential components, word alignment and monolingual parser. In the current state of the art these two components are mutually independent, thus causing problems like lack of rule generalization, and violation of syntactic correspondence in translation rules. In this paper, we propose two ways of re-training monolingual parser with the target of maximizing the consistency between parse trees and alignment matrices. One is targeted self-training with a simple evaluation function; the other is based on training data selection from forced alignment of bilingual data. We also propose an auxiliary method for boosting alignment quality, by symmetrizing alignment matrices with respect to parse trees. The best combination of these novel methods achieves 3 Bleu point gain in an IWSLT task and more than 1 Bleu point gain in NIST tasks.

5 0.38828269 86 emnlp-2012-Locally Training the Log-Linear Model for SMT

Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu

Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We propose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.

6 0.38682145 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

7 0.37954015 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

8 0.37452516 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

9 0.37151164 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

10 0.35496515 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

11 0.3486312 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

12 0.34168717 67 emnlp-2012-Inducing a Discriminative Parser to Optimize Machine Translation Reordering

13 0.33705384 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage

14 0.32095963 57 emnlp-2012-Generalized Higher-Order Dependency Parsing with Cube Pruning

15 0.31961781 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

16 0.31271738 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules

17 0.31079236 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

18 0.30751204 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation

19 0.30515811 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

20 0.30243424 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.034), (11, 0.015), (14, 0.025), (16, 0.042), (25, 0.027), (32, 0.207), (34, 0.108), (60, 0.09), (63, 0.078), (64, 0.016), (65, 0.032), (70, 0.018), (74, 0.073), (76, 0.047), (80, 0.037), (86, 0.034), (95, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79289883 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

Author: Taylor Berg-Kirkpatrick ; David Burkett ; Dan Klein

Abstract: We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.

2 0.6520223 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

3 0.64705563 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

Author: Wang Ling ; Joao Graca ; Isabel Trancoso ; Alan Black

Abstract: Phrase-based machine translation models have shown to yield better translations than Word-based models, since phrase pairs encode the contextual information that is needed for a more accurate translation. However, many phrase pairs do not encode any relevant context, which means that the translation event encoded in that phrase pair is led by smaller translation events that are independent from each other, and can be found on smaller phrase pairs, with little or no loss in translation accuracy. In this work, we propose a relative entropy model for translation models, that measures how likely a phrase pair encodes a translation event that is derivable using smaller translation events with similar probabilities. This model is then applied to phrase table pruning. Tests show that considerable amounts of phrase pairs can be excluded, without much impact on the translation quality. In fact, we show that better translations can be obtained using our pruned models, due to the compression of the search space during decoding.

4 0.64138448 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

Author: Shujie Liu ; Chi-Ho Li ; Mu Li ; Ming Zhou

Abstract: The training of most syntactic SMT approaches involves two essential components, word alignment and monolingual parser. In the current state of the art these two components are mutually independent, thus causing problems like lack of rule generalization, and violation of syntactic correspondence in translation rules. In this paper, we propose two ways of re-training monolingual parser with the target of maximizing the consistency between parse trees and alignment matrices. One is targeted self-training with a simple evaluation function; the other is based on training data selection from forced alignment of bilingual data. We also propose an auxiliary method for boosting alignment quality, by symmetrizing alignment matrices with respect to parse trees. The best combination of these novel methods achieves 3 Bleu point gain in an IWSLT task and more than 1 Bleu point gain in NIST tasks.

5 0.64038968 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.63620532 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

7 0.63464606 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

8 0.63419676 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

9 0.63354617 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints

10 0.63214558 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

11 0.6311689 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

12 0.63078982 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

13 0.6299724 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing

14 0.62772584 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

15 0.62641513 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM

16 0.62497067 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques

17 0.62468189 45 emnlp-2012-Exploiting Chunk-level Features to Improve Phrase Chunking

18 0.62409443 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

19 0.62362689 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

20 0.62210399 120 emnlp-2012-Streaming Analysis of Discourse Participants