emnlp emnlp2013 emnlp2013-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
Reference: text
sentIndex sentText sentNum sentScore
1 Application of Localized Similarity for Web Documents Peter Reberšek Zemanta Celovška cesta 32 Ljubljana, Slovenia [sent-1, score-0.078]
2 peter.rebersek@zemanta.com Abstract In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. [sent-3, score-0.898]
3 Methods used in this approach rank parts of a document based on the similarity to a presumably related document. [sent-4, score-0.267]
4 Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. [sent-5, score-0.733]
5 A number of different methods from information retrieval and natural language processing are adapted for this task. [sent-6, score-0.047]
6 Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. [sent-7, score-1.108]
7 Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. [sent-8, score-0.199]
8 Results show that our best adapted methods rival the precision of the baseline method. [sent-9, score-0.047]
9 1 Introduction One of the features of hypertext documents is hyperlinks that point to other resources: pictures, videos, tweets, or other hypertext documents. [sent-10, score-0.296]
10 A fairly familiar category of the latter is related articles; these usually appear at the end of a news article or a blog post with the title of the target document as anchor text. [sent-11, score-0.701]
11 The target document is similar in content to the original document; it may tell the story from another point of view, it may be a more detailed version of a part of the events in the original document, etc. [sent-12, score-0.164]
12 Another category is the in-text links; these appear inside the main body of text and use some of the body text as the anchor. [sent-13, score-0.357]
13 Ideally the anchor text is selected in such a way that it conveys some information about the target document; in reality sometimes just an adverb (e.g. [sent-16, score-0.447]
14 here, there) is used, or even the destination URL may serve as anchor. [sent-18, score-0.174]
15 The proposed system, for a query document, finds a target document and an appropriate part of the text of the query document that serves as the anchor text for the hyperlink. [sent-21, score-1.146]
16 We want the target document to be similar in content to the query document and the anchor text to indicate that content. [sent-22, score-0.856]
17 There are many potential uses for such a system, especially for simplifying and streamlining document creation. [sent-23, score-0.164]
18 It may also be used when writing a scientific paper, automatically adding citations to other relevant papers inside the main body. [sent-25, score-0.259]
19 A citation can be considered an in-text link without a defined starting point. [sent-27, score-0.279]
20 We have addressed the problem in two steps, separately finding a similar document, and finding the anchor text for it. [sent-28, score-0.447]
21 Since the retrieval of similar documents has been a research focus for many years and is thus better researched, we have decided in this paper to focus on the placement of the anchor text for a link to a preselected document. [sent-29, score-0.765]
22 2 Related Work Semantic similarity of textual documents offers a way to organize the increasing number of available documents. [sent-33, score-0.185]
23 It can be used in many applications such as summarization, educational systems, finding duplicated bug reports in software testing (Lintean et al. [sent-34, score-0.043]
24 , 2010), plagiarism detection (Kasprzak and Brandejs, 2010), and research of a scientific field (Koberstein and Ng, 2006). [sent-35, score-0.281]
25 , 2006; Koberstein and Ng, 2006) to paragraphs (Lintean et al. [sent-37, score-0.054]
26 There is also commercial software such as nRelate, Zemanta and OpenCalais, with functionality that ranges from named entity recognition (NER) and event detection to related content. [sent-39, score-0.14]
27 Most of the methods for comparing documents focus on the query document as a whole. [sent-41, score-0.363]
28 The calculated score therefore belongs to the whole document and nothing can be said about more or less similar parts of the document. [sent-42, score-0.2]
29 Our goal is to localize the similarity to a part of the query document, a paragraph, sentence, or even a part of the sentence that is most similar to another document. [sent-43, score-0.21]
30 This part of the query document can then serve as anchor text for a hyperlink connection to the similar document. [sent-44, score-0.793]
31 Extrinsic plagiarism detection methods compare two documents to determine if some of the material in one is plagiarised from the other. [sent-48, score-0.352]
32 These methods have localization of similarity already built-in as they are searching for parts of the text that seem to be plagiarised. [sent-51, score-0.148]
33 This method uses shared n-grams from the two documents in order to determine if one of them is plagiarised. [sent-59, score-0.156]
34 Another similar line of research is automatic citation placement for scientific papers. [sent-60, score-0.306]
35 , 2002) is concerned with putting citations at the end of the paper (non-localized), which is a task similar to inserting related articles for a news article at the end of the text. [sent-63, score-0.232]
36 There have been some attempts to place the citations in the main body of text (Tang and Zhang, 2009; He et al. [sent-64, score-0.295]
37 Tang and Zhang (2009) used a placeholder constraint: the query document must contain placeholders for citations, i.e. [sent-66, score-0.302]
38 the places in the text where a citation might be inserted. [sent-68, score-0.242]
39 Their method then just ranks all possible documents for a particular placeholder and chooses the best ranked document as a result. [sent-69, score-0.418]
40 Documents are ranked on the basis of a learned topic model, obtained by a two-layer Restricted Boltzmann Machine. [sent-70, score-0.042]
41 He et al. (2011) made a step further towards generalizing the citation location; they divide the text into overlapping windows and then decide which windows are viable citation contexts. [sent-72, score-0.58]
42 The best method for deciding which citation context to use was a dependency feature model, an ensemble method using 17 different features and decision trees. [sent-73, score-0.197]
43 Named entity recognition (NER) also offers a useful insight into document similarity. [sent-74, score-0.164]
44 If two documents share a named entity (NE), it is more likely they are similar. [sent-75, score-0.118]
45 Detected NEs may also serve as anchor text for the link. [sent-76, score-0.486]
46 , 2009; Milne and Witten, 2008) and is also used in several commercial applications such as Zemanta, OpenCalais and AlchemyAPI, which are able to automatically insert links for an NE pointing to a knowledge base such as Wikipedia or IMDB. [sent-80, score-0.235]
47 However, at this point they are unable to link to arbitrary documents, but may be useful in conjunction with other methods. [sent-81, score-0.082]
48 1 Corpus We have chosen 100 web articles (posts) at random from the end of January 2012. [sent-85, score-0.06]
49 We extracted the body and title of each document. [sent-86, score-0.078]
50 All the present in-text links were also extracted and filtered. [sent-87, score-0.119]
51 First, automatic filtering was applied to remove unwanted categories of links (videos, definition pages on Wikipedia and IMDB, etc. [sent-88, score-0.155]
52 ), and articles that were deemed too short for similarity comparison. [sent-89, score-0.127]
53 The threshold was set at 200 words of automatically scraped body text of a linked document. [sent-90, score-0.171]
54 All the remaining links were manually checked to ensure the integrity of link targets. [sent-91, score-0.201]
55 This way we collected 265 articles (hereinafter related articles, RA). [sent-92, score-0.12]
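A minimal sketch of the automatic link filtering described above, in Python; the list of unwanted domains is a placeholder assumption, while the 200-word minimum mirrors the threshold stated earlier.

# Illustrative link filter; UNWANTED_DOMAINS is an assumed example list, the
# 200-word minimum is the threshold described above.
UNWANTED_DOMAINS = ("youtube.com", "en.wikipedia.org", "imdb.com")

def keep_link(url, scraped_body_text):
    # Drop links in unwanted categories and targets too short for similarity comparison.
    if any(domain in url for domain in UNWANTED_DOMAINS):
        return False
    return len(scraped_body_text.split()) >= 200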
56 A number of different methods were then used to calculate similarity rank and select the best part of the post text to be used as anchor text for a hyperlink pointing to the originally linked RA. [sent-93, score-0.849]
57 We have used CrowdFlower (http://crowdflower.com/), a crowdsourcing platform, to evaluate how many of the 265 post–RA pairs were really related; the final corpus thus consisted of 236 pairs. [sent-94, score-0.064]
58 3 to automatically construct anchor text for each of the 236 pairs of documents in the final corpus. [sent-97, score-0.565]
59 If a method could not find a suitable anchor, no result was returned; on average there were 147 anchors per method. [sent-98, score-0.135]
60 All the automatically created links were then manually scored by the authors with an in-house evaluation tool using scores and guidelines summarized in Table 1. [sent-99, score-0.156]
61 We provided simplified guidelines for assigning scores to automatically created anchors and set a confidence threshold of 0. [sent-103, score-0.172]
62 It is important to mention that the use of crowdsourcing for such tasks has to be carefully [sent-105, score-0.257]
63 planned, because many issues related to monetary incentives, which are out of the scope of this paper, may arise. [sent-106, score-0.213]
64 3 Methods for constructing anchor texts We have adapted a number of methods from a variety of sources to test how they perform for our exact purpose. [sent-108, score-0.49]
65 , 2009); the text is first tokenized with the default NLTK tokenizer, and then POS tagged with one of the included POS taggers. [sent-113, score-0.082]
66 After much testing, we have decided on a combination of Brill, Trigram, Bigram, Unigram, Affix, and Regex backoff taggers, with noun as the default tag. [sent-114, score-0.056]
67 The trainable parts of the tagger were trained on the included CoNLL 2000 tagged corpus. [sent-115, score-0.036]
68 We then used a regex chunker to find a sequence of a proper noun and a verb separated by zero or more other tokens. [sent-117, score-0.105]
69 We have also tested a proper noun - verb - proper noun combination, but there were even fewer results, so this direction was abandoned. [sent-118, score-0.074]
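A minimal sketch of such a tagging and chunking pipeline, assuming NLTK with its bundled CoNLL 2000 corpus; the backoff chain is abbreviated (the Brill and Regex stages are omitted here) and the chunk grammar is illustrative rather than the authors' exact configuration.

import nltk
from nltk.corpus import conll2000  # assumes the conll2000 corpus is downloaded

train_sents = conll2000.tagged_sents()

# Backoff chain with noun as the default tag; the Brill and Regex stages of the
# chain described above are left out to keep the sketch short.
t0 = nltk.DefaultTagger("NN")
t1 = nltk.AffixTagger(train_sents, backoff=t0)
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
t3 = nltk.BigramTagger(train_sents, backoff=t2)
tagger = nltk.TrigramTagger(train_sents, backoff=t3)

# Regex chunker: a proper noun followed by a verb, with zero or more tokens between.
chunker = nltk.RegexpParser(r"ANCHOR: {<NNP.*><.*>*<VB.*>}")

def candidate_anchors(text):
    tagged = tagger.tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(tok for tok, _ in st.leaves())
            for st in tree.subtrees() if st.label() == "ANCHOR"]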
70 In order to localize the similarity and place an anchor, we split the source document into paragraphs and compute similarity scores between the target document and each paragraph of the source document. [sent-125, score-0.635]
71 We then split the paragraph with the highest score into sentences and again obtain scores for each. [sent-126, score-0.057]
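A rough sketch of this two-stage localization; plain cosine similarity over raw term counts stands in for whatever document-similarity score is actually plugged in, and the paragraph and sentence splitting rules are simplified assumptions.

import math
import re
from collections import Counter

def terms(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def localize_anchor(source_text, target_text):
    target = terms(target_text)
    # Stage 1: pick the source paragraph most similar to the target document.
    paragraphs = [p for p in source_text.split("\n\n") if p.strip()]
    best_par = max(paragraphs, key=lambda p: cosine(terms(p), target))
    # Stage 2: pick the most similar sentence inside that paragraph.
    sentences = re.split(r"(?<=[.!?])\s+", best_par)
    return max(sentences, key=lambda s: cosine(terms(s), target))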
72 3 Sorted n-grams Drawing on plagiarism detection, the winning method from the PAN 2010 (Kasprzak and Brandejs, 2010) seemed a viable choice. [sent-130, score-0.219]
73 The basis of the method is comparing n-grams of the source and the destination documents. [sent-131, score-0.135]
74 First, the text was again tokenized with NLTK; stopwords and tokens with two or fewer characters were removed. [sent-132, score-0.195]
75 We have deviated from Kasprzak’s merging policy and decided to merge two results if they are less than 20 tokens apart. [sent-134, score-0.118]
76 We also required only one shared n-gram to consider the documents similar. [sent-135, score-0.156]
77 Results were ranked based on the number of shared tokens within each. [sent-136, score-0.142]
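A sketch of the shared sorted-n-gram step: tokens are cleaned, each n-gram is sorted so word order inside it is ignored, and source positions sharing an n-gram with the destination document are merged when they lie less than 20 tokens apart, as described above. The n-gram length and the stopword list are assumptions for illustration, not the values of Kasprzak and Brandejs (2010).

import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}  # placeholder list

def clean_tokens(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS and len(t) > 2]

def sorted_ngrams(tokens, n=5):
    # Map each sorted n-gram to the positions where it starts.
    grams = {}
    for i in range(len(tokens) - n + 1):
        grams.setdefault(tuple(sorted(tokens[i:i + n])), []).append(i)
    return grams

def shared_spans(source_text, dest_text, n=5, gap=20):
    src, dst = clean_tokens(source_text), clean_tokens(dest_text)
    dst_grams = set(sorted_ngrams(dst, n))
    hits = sorted(i for g, pos in sorted_ngrams(src, n).items()
                  if g in dst_grams for i in pos)
    spans = []
    for i in hits:  # merge hits that are less than `gap` tokens apart
        if spans and i - spans[-1][1] < gap:
            spans[-1][1] = i + n
        else:
            spans.append([i, i + n])
    return spans  # later ranked by the number of shared tokens in each span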
78 Since we had a closed system, we used corpus-wide frequencies; stopwords were also removed. [sent-140, score-0.051]
79 We have scored tokens in the source document with the tf*idf summary of the destination document; tokens not in the summary are given a zero weight. [sent-141, score-0.423]
80 We have experimentally determined that a summary of just the top 150 tokens improves results. [sent-142, score-0.062]
81 Sentences were ranked based on the sum of their tokens' weights. [sent-143, score-0.104]
82 We also included NEs from the Zemanta API response for both the source and the destination document. [sent-144, score-0.135]
83 Sentences containing shared NEs get their score multiplied by the sum of shared NE tf*idf weights. [sent-145, score-0.076]
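A simplified sketch of this scoring step; the idf table is assumed to come from corpus-wide statistics, the shared named entities from an external NER step (the Zemanta API in the paper), and multi-word NEs are treated as single tokens here for brevity.

from collections import Counter

def tfidf_summary(dest_tokens, idf, top=150):
    # Top-weighted tokens of the destination document; all other tokens weigh zero.
    tf = Counter(dest_tokens)
    weights = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top])

def score_sentence(sent_tokens, summary, shared_nes):
    # Sum of summary weights over the sentence's tokens; sentences containing a
    # shared NE have their score multiplied by the summed weight of the shared NEs.
    score = sum(summary.get(t, 0.0) for t in sent_tokens)
    sent_set = set(sent_tokens)
    if any(ne in sent_set for ne in shared_nes):
        score *= sum(summary.get(ne, 0.0) for ne in shared_nes)
    return score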
84 5 Baseline Our baseline was a method that inserted links that were originally present in the source documents. [sent-149, score-0.216]
85 As a contrast, almost half of CrowdFlower workers stated they don’t blog, and of the rest, more than a third of them don’t link out, i.e. [sent-156, score-0.128]
86 We also have only 74% median inter-annotator agreement, leading us to believe that some of the annotators answered without being familiar with the question (a monetary incentive issue). [sent-159, score-0.041]
87 Furthermore, CrowdFlower results for original links (our baseline) indicate that almost all of them were recognized as relevant, while our evaluators discarded 30% of them. [sent-160, score-0.17]
88 Understanding plagiarism linguistic patterns, textual features, and detection methods. [sent-170, score-0.234]
89 In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pages 9–16, Trento, Italy. [sent-185, score-0.036]
90 In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Stroudsburg, PA, USA. [sent-194, score-0.036]
91 In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’ 11, pages 755–764, New York, NY, USA. [sent-200, score-0.036]
92 Improving the reliability of the plagiarism detection system lab report for pan at clef 2010. [sent-204, score-0.302]
93 In Proceedings of the First international conference on Knowledge Science, Engineering and Management, KSEM’06, pages 215–228, Berlin, Heidelberg. [sent-208, score-0.036]
94 In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 457–466, New York, NY, USA. [sent-213, score-0.036]
95 Sentence similarity based on semantic nets and corpus statistics. [sent-219, score-0.067]
96 The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis. [sent-226, score-0.108]
97 In Proceedings of the 2002 ACM conference on Computer supported cooperative work, CSCW ’02, pages 116– 125, New York, NY, USA. [sent-234, score-0.036]
98 In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 509–518, New York, NY, USA. [sent-240, score-0.036]
99 Dongarra, editors, Computational Science ICCS 2002, volume 2329 of Lecture Notes in Computer Science, pages 51–60. [sent-251, score-0.036]
100 In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’ 11, pages 1375–1384, Stroudsburg, PA, USA. [sent-256, score-0.036]
wordName wordTfidf (topN-words)
[('anchor', 0.402), ('citation', 0.197), ('kasprzak', 0.194), ('citations', 0.172), ('plagiarism', 0.172), ('document', 0.164), ('crowdflower', 0.155), ('zemanta', 0.155), ('anchors', 0.135), ('destination', 0.135), ('links', 0.119), ('documents', 0.118), ('alzahrani', 0.116), ('brandejs', 0.116), ('emant', 0.116), ('koberstein', 0.116), ('lintean', 0.116), ('nltk', 0.086), ('link', 0.082), ('lsi', 0.081), ('nes', 0.081), ('pointing', 0.081), ('query', 0.081), ('idf', 0.08), ('tf', 0.078), ('body', 0.078), ('celov', 0.078), ('cesta', 0.078), ('ljubljana', 0.078), ('monetary', 0.078), ('monostori', 0.078), ('researched', 0.078), ('slovenia', 0.078), ('tang', 0.071), ('pan', 0.068), ('budanitsky', 0.068), ('mcnee', 0.068), ('regex', 0.068), ('rek', 0.068), ('strohman', 0.068), ('videos', 0.068), ('similarity', 0.067), ('berlin', 0.066), ('crowdsourcing', 0.064), ('detection', 0.062), ('tokens', 0.062), ('hyperlink', 0.062), ('hypertext', 0.062), ('localize', 0.062), ('placement', 0.062), ('articles', 0.06), ('placeholder', 0.057), ('paragraph', 0.057), ('decided', 0.056), ('recommending', 0.054), ('deerwester', 0.054), ('hyperlinks', 0.054), ('paragraphs', 0.054), ('finkel', 0.053), ('evaluators', 0.051), ('stopwords', 0.051), ('originally', 0.051), ('milne', 0.049), ('post', 0.048), ('linked', 0.048), ('kulkarni', 0.047), ('bunescu', 0.047), ('viable', 0.047), ('windows', 0.047), ('scientific', 0.047), ('adapted', 0.047), ('blog', 0.046), ('springer', 0.046), ('inserted', 0.046), ('ka', 0.046), ('text', 0.045), ('bird', 0.044), ('ne', 0.043), ('software', 0.043), ('ner', 0.043), ('eh', 0.043), ('ny', 0.042), ('ranked', 0.042), ('familiar', 0.041), ('texts', 0.041), ('ek', 0.04), ('inside', 0.04), ('lecture', 0.039), ('york', 0.039), ('serve', 0.039), ('ratinov', 0.038), ('shared', 0.038), ('http', 0.038), ('proper', 0.037), ('ranks', 0.037), ('guidelines', 0.037), ('tokenized', 0.037), ('parts', 0.036), ('pages', 0.036), ('commercial', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
2 0.15784347 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
Author: Hendra Setiawan ; Bowen Zhou ; Bing Xiang
Abstract: Reordering poses one of the greatest challenges in Statistical Machine Translation research as the key contextual information may well be beyond the confine of translation units. We present the “Anchor Graph” (AG) model where we use a graph structure to model global contextual information that is crucial for reordering. The key ingredient of our AG model is the edges that capture the relationship between the reordering around a set of selected translation units, which we refer to as anchors. As the edges link anchors that may span multiple translation units at decoding time, our AG model effectively encodes global contextual information that is previously absent. We integrate our proposed model into a state-of-the-art translation system and demonstrate the efficacy of our proposal in a large-scale Chinese-to-English translation task.
3 0.15059802 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
Author: James Foulds ; Padhraic Smyth
Abstract: When reviewing scientific literature, it would be useful to have automatic tools that identify the most influential scientific articles as well as how ideas propagate between articles. In this context, this paper introduces topical influence, a quantitative measure of the extent to which an article tends to spread its topics to the articles that cite it. Given the text of the articles and their citation graph, we show how to learn a probabilistic model to recover both the degree of topical influence of each article and the influence relationships between articles. Experimental results on corpora from two well-known computer science conferences are used to illustrate and validate the proposed approach.
4 0.1266298 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
5 0.10664255 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang
Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context document, performance is further improved.
6 0.092548527 160 emnlp-2013-Relational Inference for Wikification
7 0.090928517 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
8 0.08157336 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
9 0.080077924 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
10 0.075126566 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
11 0.073486537 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers
12 0.072723567 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
13 0.071408629 41 emnlp-2013-Building Event Threads out of Multiple News Articles
14 0.067564934 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves
15 0.062546179 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
16 0.062471706 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
17 0.062169239 95 emnlp-2013-Identifying Multiple Userids of the Same Author
18 0.061536204 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles
19 0.060826894 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
20 0.060506541 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach
topicId topicWeight
[(0, -0.213), (1, 0.087), (2, -0.028), (3, 0.032), (4, 0.02), (5, -0.009), (6, 0.092), (7, 0.106), (8, 0.059), (9, -0.173), (10, -0.038), (11, 0.075), (12, -0.115), (13, 0.014), (14, -0.034), (15, 0.069), (16, -0.093), (17, -0.054), (18, -0.114), (19, 0.019), (20, -0.129), (21, -0.019), (22, -0.064), (23, -0.019), (24, 0.073), (25, -0.084), (26, 0.106), (27, 0.035), (28, 0.197), (29, -0.015), (30, 0.059), (31, 0.061), (32, -0.18), (33, 0.027), (34, 0.038), (35, -0.02), (36, -0.032), (37, 0.055), (38, -0.066), (39, -0.083), (40, 0.082), (41, -0.036), (42, 0.129), (43, -0.001), (44, -0.062), (45, 0.034), (46, -0.137), (47, 0.153), (48, 0.103), (49, -0.125)]
simIndex simValue paperId paperTitle
same-paper 1 0.93541431 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
2 0.6906054 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
Author: James Foulds ; Padhraic Smyth
Abstract: When reviewing scientific literature, it would be useful to have automatic tools that identify the most influential scientific articles as well as how ideas propagate between articles. In this context, this paper introduces topical influence, a quantitative measure of the extent to which an article tends to spread its topics to the articles that cite it. Given the text of the articles and their citation graph, we show how to learn a probabilistic model to recover both the degree of topical influence of each article and the influence relationships between articles. Experimental results on corpora from two well-known computer science conferences are used to illustrate and validate the proposed approach.
3 0.49760243 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
4 0.49248302 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
Author: Moshe Koppel ; Shachar Seidman
Abstract: The identification of pseudepigraphic texts (texts not written by the authors to which they are attributed) has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.
5 0.4799616 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
Author: Jerome White ; Douglas W. Oard ; Nitendra Rajput ; Marion Zalk
Abstract: Building search engines that can respond to spoken queries with spoken content requires that the system not just be able to find useful responses, but also that it know when it has heard enough about what the user wants to be able to do so. This paper describes a simulation study with queries spoken by non-native speakers that suggests that finding relevant content is often possible within a half minute, and that combining features based on automatically recognized words with features designed for automated prediction of query difficulty can serve as a useful basis for predicting when that useful content has been found.
6 0.4661116 61 emnlp-2013-Detecting Promotional Content in Wikipedia
8 0.43213519 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
9 0.40791577 34 emnlp-2013-Automatically Classifying Edit Categories in Wikipedia Revisions
10 0.40776813 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
11 0.40035439 95 emnlp-2013-Identifying Multiple Userids of the Same Author
12 0.39038947 138 emnlp-2013-Naive Bayes Word Sense Induction
13 0.37198395 23 emnlp-2013-Animacy Detection with Voting Models
14 0.36580154 41 emnlp-2013-Building Event Threads out of Multiple News Articles
15 0.36331683 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri
16 0.36016899 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
17 0.35536379 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
18 0.35121515 160 emnlp-2013-Relational Inference for Wikification
19 0.34997174 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity
topicId topicWeight
[(3, 0.026), (22, 0.022), (30, 0.036), (50, 0.014), (51, 0.774), (66, 0.01), (71, 0.012), (75, 0.016), (96, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99897313 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside the original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to a baseline consisting of originally inserted anchor texts. Additionally, we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
2 0.99537206 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game
Author: Anais Cadilhac ; Nicholas Asher ; Farah Benamara ; Alex Lascarides
Abstract: This paper describes a method that predicts which trades players execute during a winlose game. Our method uses data collected from chat negotiations of the game The Settlers of Catan and exploits the conversation to construct dynamically a partial model of each player’s preferences. This in turn yields equilibrium trading moves via principles from game theory. We compare our method against four baselines and show that tracking how preferences evolve through the dialogue and reasoning about equilibrium moves are both crucial to success.
3 0.99507254 32 emnlp-2013-Automatic Idiom Identification in Wiktionary
Author: Grace Muzny ; Luke Zettlemoyer
Abstract: Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
4 0.99318916 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi
Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.
5 0.99156928 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
Author: Karl Pichotta ; John DeNero
Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.
6 0.99075198 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
7 0.98927128 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
8 0.98759615 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
9 0.93395203 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution
10 0.93337554 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
11 0.92505628 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
13 0.92005783 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
14 0.91993833 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
15 0.9190644 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology
16 0.91899294 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
17 0.91430503 27 emnlp-2013-Authorship Attribution of Micro-Messages
18 0.90975386 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations
19 0.90973198 26 emnlp-2013-Assembling the Kazakh Language Corpus
20 0.90810734 69 emnlp-2013-Efficient Collective Entity Linking with Stacking