acl acl2011 acl2011-25 acl2011-25-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anselmo Peñas; Álvaro Rodrigo
Abstract: There are several tasks where it is preferable not to respond than to respond incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. Here we study an extension of the accuracy measure that has this feature and a very easy-to-understand interpretation. The proposed measure (c@1) has a good balance of discrimination power, stability and sensitivity properties. We also show how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
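The abstract does not state the formula for c@1. As a minimal sketch, assuming the definition given in the body of the paper, c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correctly answered questions, n_U the number of unanswered questions and n the total number of questions, the behaviour described above can be illustrated as follows. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 under the assumed definition (n_R + n_U * n_R / n) / n.

    Each unanswered question is credited with the system's overall
    accuracy n_R / n, rather than with zero as in plain accuracy.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0 or n_correct + n_unanswered > n_total:
        raise ValueError("counts are inconsistent with n_total")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Illustrative counts (not taken from the paper): 100 questions, 50 correct.
answers_everything = c_at_1(n_correct=50, n_unanswered=0, n_total=100)
# Same 50 correct answers, but 20 formerly incorrect answers are withheld.
withholds_some = c_at_1(n_correct=50, n_unanswered=20, n_total=100)

print(answers_everything, withholds_some)
```

When no question is left unanswered, the value coincides with plain accuracy. A system that keeps the same number of correct answers while withholding some incorrect ones scores higher, and a system that answers nothing scores zero.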