61 emnlp-2013-Detecting Promotional Content in Wikipedia

Source: pdf

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

reference text

Maik Anderka, Benno Stein, and Nedim Lipka. 2012. Predicting quality flaws in user-generated content: the case of Wikipedia. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pages 981–990, New York, NY, USA. ACM.
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, May.
Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paass. 2003. Authorship attribution with support vector machines. Applied Intelligence, 19(1–2):109–123.
Hugo J Escalante, Thamar Solorio, and M Montes-y-Gómez. 2011. Local histograms of character n-grams for authorship attribution. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 288–298.
Rudolf Flesch. 1948. A new readability yardstick. The Journal of Applied Psychology, 32(3):221.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2000. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407.
Michael Gamon. 2004. Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of the 20th International Conference on Computational Linguistics, page 611. Association for Computational Linguistics.
Manoj Harpalani, Michael Hart, Sandesh Singh, Rob Johnson, and Yejin Choi. 2011. Language of vandalism: Improving Wikipedia vandalism detection via stylometric analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, volume 2, pages 83–88.
Daniel Hasan Dalip, Marcos André Gonçalves, Marco Cristo, and Pável Calado. 2009. Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’09, pages 295–304, New York, NY, USA. ACM.
Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. 2008. An analysis of statistical models and features for reading difficulty prediction. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 71–79. Association for Computational Linguistics.
Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. 2003. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, volume 3, pages 255–264.
Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for Computational Linguistics.
Emily Pitler, Annie Louis, and Ani Nenkova. 2010. Automatic evaluation of linguistic quality in multidocument summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 544–554. Association for Computational Linguistics.
Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 38–42, Stroudsburg, PA, USA. Association for Computational Linguistics.
Congzhou He Ramyaa and Khaled Rasheed. 2004. Using machine learning techniques for stylometry. In Proceedings of International Conference on Machine Learning.
Paul Rayson, Andrew Wilson, and Geoffrey Leech. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers, 36(1):295–306.
Klaus Stein and Claudia Hess. 2007. Does it matter who contributes: a study on featured articles in the German Wikipedia. In Proceedings of the Eighteenth Conference on Hypertext and Hypermedia, pages 171–174. ACM.
Kristina Toutanova and Christopher D Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, pages 63–70. Association for Computational Linguistics.
Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.
William Yang Wang and Kathleen R. McKeown. 2010. “Got you!”: Automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 1146–1154, Stroudsburg, PA, USA. Association for Computational Linguistics.
Dennis M Wilkinson and Bernardo A Huberman. 2007. Assessing the value of cooperation in Wikipedia. arXiv preprint cs/0702140.