89 emnlp-2011-Linguistic Redundancy in Twitter

Source: pdf

Author: Fabio Massimo Zanzotto ; Marco Pennacchiotti ; Kostas Tsioutsiouliklis

Abstract: In the last few years, the research community's interest in microblogs and social media services, such as Twitter, has grown exponentially. Yet, so far, little attention has been paid to a key characteristic of microblogs: the high level of information redundancy. The aim of this paper is to approach this problem systematically by providing an operational definition of redundancy. We cast redundancy in the framework of Textual Entailment Recognition. We also provide quantitative evidence of the pervasiveness of redundancy in Twitter, and describe a dataset of redundancy-annotated tweets. Finally, we present a general-purpose system for identifying redundant tweets. An extensive quantitative evaluation shows that our system successfully solves the redundancy detection task, improving over baseline systems with statistical significance.

reference text

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on twitter from biased and noisy data. In Posters Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 36–44, Beijing, China.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 132–139, Seattle, Washington.
Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 263–270.
Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18, Ann Arbor, Michigan.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Quiñonero-Candela et al., editor, LNAI 3944: MLCW 2005, pages 177–190, Milan, Italy. Springer-Verlag.
Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using twitter hashtags and smileys. In Posters Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 241–249, Beijing, China.
Marie-Catherine de Marneffe, Bill MacCartney, Trond Grenager, Daniel Cer, Anna Rafferty, and Christopher D. Manning. 2006. Learning to distinguish valid textual entailments. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (Coling 2004), pages 350–356, Geneva, Switzerland.
Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, and Heung-Yeung Shum. 2010. An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 295–303, Beijing, China.
Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370, Boulder, Colorado.
Aria Haghighi, Andrew Ng, and Christopher Manning. 2005. Robust textual inference via graph matching. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 387–394, Vancouver, British Columbia, Canada.
Andrew Hickl, John Williams, Jeremy Bensley, Kirk Roberts, Bryan Rink, and Ying Shi. 2006. Recognizing textual entailment with LCC’s groundhog system. In Bernardo Magnini and Ido Dagan, editors, Proceedings of the 2nd PASCAL RTE Challenge, Venice, Italy.
Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. 2007. Why we Twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007.
Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics ROCLING, pages 132–139, Taipei, Taiwan.
Thorsten Joachims. 1999. Making large-scale svm learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.
Balachander Krishnamurthy, Phillipa Gill, and Martin Arlitt. 2008. A few chirps about twitter. In Proceedings of the first workshop on Online social networks, pages 19–24, Seattle, WA, USA.
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is twitter, a social network or a news media? In Proceedings of WWW ’10: Proceedings of the 19th international conference on World wide web, pages 591–600, Raleigh, North Carolina, USA.
Cindy-Xide Lin, Bo Zhao, Qiaozhu Mei, and Jiawei Han. 2010. Pet: a statistical model for popular events tracking in social communities. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 929–938, Washington, DC, USA.
Xiaohua Liu, Kuan Li, Bo Han, Ming Zhou, Long Jiang, Zhongyang Xiong, and Changning Huang. 2010. Semantic role labeling for news tweets. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 698–706, Beijing, China, August.
Bill MacCartney, Trond Grenager, Marie-Catherine de Marneffe, Daniel Cer, and Christopher D. Manning. 2006. Learning to recognize features of valid textual entailments. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 41–48, New York City, USA.
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November.
Alessandro Moschitti and Fabio Massimo Zanzotto. 2007. Fast and effective kernels for relational learning from texts. In Proceedings of the International Conference of Machine Learning (ICML), Corvallis, Oregon.
Eamonn Newman, Nicola Stokes, John Dunnion, and Joe Carthy. 2005. Textual entailment recognition using a linguistically-motivated decision tree classifier. In Joaquin Quiñonero Candela, Ido Dagan, Bernardo Magnini, and Florence d’Alché Buc, editors, MLCW, volume 3944 of Lecture Notes in Computer Science, pages 372–384. Springer.
Sebastian Padó. 2006. User’s guide to sigf: Significance testing by approximate randomisation.
Pear-Analytics. 2009. Twitter study - august 2009.
Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. Wordnet::similarity - measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41, Boston, MA.
Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to twitter. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189, Los Angeles, California.
Ana-Maria Popescu and Marco Pennacchiotti. 2010. Detecting controversial events from twitter. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1873–1876.
Daniel Ramage, Susan Dumais, and Dan Liebling. 2010. Characterizing microblogs with topic models. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 130–137.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180, Los Angeles, California.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th conference on Computational linguistics, pages 947–953, Morristown, NJ, USA.
Fabio Massimo Zanzotto and Lorenzo Dell’Arciprete. 2009. Efficient kernels for sentence pair classification. In Conference on Empirical Methods on Natural Language Processing, pages 91–100, 6-7 August.
Fabio Massimo Zanzotto and Alessandro Moschitti. 2006. Automatic learning of textual entailments with cross-pair similarities. In Proceedings of the 21st Coling and 44th ACL, pages 401–408, Sydney, Australia, July.
Fabio Massimo Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2009. A machine learning approach to textual entailment recognition. Natural Language Engineering, 15-04:551–582.
Q. Zhao, P. Mitra, and B. Chen. 2007. Temporal and information flow based event detection from social text streams. In Proceedings of the 22nd national conference on Artificial intelligence, pages 1501–1506, Vancouver, British Columbia, Canada.