
24 emnlp-2013-Application of Localized Similarity for Web Documents

Source: pdf

Author: Peter Rebersek ; Mateja Verlic

Abstract: In this paper we present a novel approach to the automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on their similarity to a presumably related document. The ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
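
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "parts of a document" are sentences, that similarity is plain bag-of-words cosine similarity, and that the top-ranked sentence serves as the anchor-text candidate. The function names `tokenize`, `cosine`, and `rank_sentences` are hypothetical and chosen only for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(source_text, target_text):
    """Return (score, sentence) pairs from source_text, best match first.

    Assumption: sentences are the document parts being ranked, and the
    whole target document is represented as one term-frequency vector.
    """
    target_vec = Counter(tokenize(target_text))
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", source_text)
        if s.strip()
    ]
    scored = [(cosine(Counter(tokenize(s)), target_vec), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    source = (
        "The city council met on Tuesday. "
        "A new study on solar panel efficiency was published. "
        "Traffic was heavy downtown."
    )
    target = "Researchers report that solar panel efficiency improved in a new study."
    for score, sentence in rank_sentences(source, target):
        print(f"{score:.3f}  {sentence}")
```

In this sketch the highest-scoring sentence is the one sharing the most terms with the linked document. The paper adapts several information retrieval and natural language processing methods for this step, so the cosine measure here stands in for whichever similarity function is actually used.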

reference text

S.M. Alzahrani, N. Salim, and A. Abraham. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42(2):133–149.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media.
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist., 32(1):13–47, March.
Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pages 9–16, Trento, Italy.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.
Qi He, Daniel Kifer, Jian Pei, Prasenjit Mitra, and C. Lee Giles. 2011. Citation recommendation without author supervision. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 755–764, New York, NY, USA. ACM.
Jan Kasprzak and Michal Brandejs. 2010. Improving the reliability of the plagiarism detection system lab report for pan at clef 2010.
Jonathan Koberstein and Yiu-Kai Ng. 2006. Using word clusters to detect similar web documents. In Proceedings of the First international conference on Knowledge Science, Engineering and Management, KSEM’06, pages 215–228, Berlin, Heidelberg. Springer-Verlag.
Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 457–466, New York, NY, USA. ACM.
Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. on Knowl. and Data Eng., 18(8):1138–1150, August.
Mihai Lintean, Cristian Moldovan, Vasile Rus, and Danielle McNamara. 2010. The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. In Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, Daytona Beach, FL.
Sean M. McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K. Lam, Al Mamunur Rashid, Joseph A. Konstan, and John Riedl. 2002. On the recommending of citations for research papers. In Proceedings of the 2002 ACM conference on Computer supported cooperative work, CSCW ’02, pages 116–125, New York, NY, USA. ACM.
David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 509–518, New York, NY, USA. ACM.
Krisztián Monostori, Raphael Finkel, Arkady Zaslavsky, Gábor Hodász, and Máté Pataki. 2002. Comparison of overlap detection techniques. In Peter M.A. Sloot, Alfons G. Hoekstra, C.J. Kenneth Tan, and Jack J. Dongarra, editors, Computational Science ICCS 2002, volume 2329 of Lecture Notes in Computer Science, pages 51–60. Springer Berlin Heidelberg.
Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1375–1384, Stroudsburg, PA, USA. Association for Computational Linguistics.
Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.
Trevor Strohman, W. Bruce Croft, and David Jensen. 2007. Recommending citations for academic papers. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’07), pages 705–706.
Jie Tang and Jing Zhang. 2009. A discriminative approach to topic-based citation recommendation. In Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone, and Tu-Bao Ho, editors, Advances in Knowledge Discovery and Data Mining, volume 5476 of Lecture Notes in Computer Science, pages 572–579. Springer Berlin Heidelberg.