
An Empirical Investigation of Statistical Significance in NLP (EMNLP 2012)

Source: pdf

Authors: Taylor Berg-Kirkpatrick; David Burkett; Dan Klein

Abstract: We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.
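As a concrete illustration of what a paired significance test computes, here is a minimal sketch of a paired bootstrap over per-example scores. The abstract does not say which test is used, so the choice of test, the function name `paired_bootstrap_p_value`, the per-example score lists, and the default number of resamples are all assumptions made for illustration. The sketch also assumes the metric is a simple average of per-example scores, which does not hold for corpus-level metrics such as BLEU.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, num_resamples=10000, seed=0):
    """Estimate a one-sided p-value for the claim 'system A beats system B'.

    scores_a[i] and scores_b[i] are the two systems' scores on the same
    test item i (hypothetical inputs, e.g. 1 for correct and 0 for wrong).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    n = len(scores_a)
    # Working with per-example differences preserves the pairing.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed_gain = sum(diffs) / n

    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_resamples):
        # Resample test items with replacement; both systems share each draw.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        # The bootstrap distribution is centred on observed_gain rather than
        # on zero, so shifting it to the null hypothesis gives the threshold
        # 2 * observed_gain.
        if resampled_gain >= 2 * observed_gain:
            extreme += 1
    return extreme / num_resamples


# Usage on made-up per-example correctness indicators.
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
p_value = paired_bootstrap_p_value(system_a, system_b, num_resamples=2000)
```

Because the resampling draws whole test items, the estimate reflects how the gain would vary across test sets drawn from the same distribution, which is the i.i.d. setting the abstract contrasts with cross-domain evaluation.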

reference text

D.M. Bikel. 2004. Intricacies of Collins’ parsing model. Computational Linguistics.
M. Bisani and H. Ney. 2004. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proc. of ICASSP.
C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proc. of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR.
M. Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.
H.T. Dang and K. Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proc. of Text Analysis Conference.
B. Efron and R. Tibshirani. 1993. An introduction to the bootstrap. Chapman & Hall/CRC.
L. Gillick and S.J. Cox. 1989. Some statistical issues in the comparison of speech recognition algorithms. In Proc. of ICASSP.
A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. 2009. Better word alignments with supervised ITG models. In Proc. of ACL.
D. Klein and C.D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL.
P. Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. of EMNLP.
P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W.N.G. Thornton, J. Weese, and O.F. Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proc. of the Fourth Workshop on Statistical Machine Translation.
P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In Proc. of NAACL.
C.Y. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proc. of the Workshop on Text Summarization.
M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.
R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. of EMNLP.
J. Nivre, J. Hall, and J. Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC.
J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of the CoNLL Shared Task Session of EMNLP-CoNLL 2007.
E.W. Noreen. 1989. Computer Intensive Methods for Hypothesis Testing: An Introduction. Wiley, New York.
F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
F.J. Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL.
K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL.
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proc. of ACL.
S. Riezler and J.T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proc. of ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
M. Surdeanu and C.D. Manning. 2010. Ensemble models for dependency parsing: cheap and good? In Proc. of NAACL.
A. Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proc. of ACL.
Y. Zhang, S. Vogel, and A. Waibel. 2004. Interpreting BLEU/NIST scores: How much improvement do we need to have a better system. In Proc. of LREC.