emnlp emnlp2013 emnlp2013-148 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. [sent-3, score-0.241]
2 Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. [sent-4, score-0.939]
3 In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. [sent-5, score-0.605]
4 Thus, the method combines the benefits of both explicit and latent topic modelling approaches. [sent-6, score-0.506]
5 We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language. [sent-7, score-0.276]
6 1 Introduction Cross-lingual document matching is the task of, given a query document in some source language, estimating the similarity to a document in some target language. [sent-8, score-0.559]
7 An approach that has become quite popular in recent years for cross-lingual document matching is Explicit Semantic Analysis (ESA, Gabrilovich and Markovitch (2007)) and its cross-lingual extension CL-ESA (Sorg and Cimiano, 2008). [sent-14, score-0.211]
8 ESA indexes documents by mapping them into a topic space defined by their similarity to predefined explicit topics (generally articles from an encyclopaedia), in such a way that there is a one-to-one correspondence between topics and encyclopedic entries. [sent-15, score-0.797]
9 CL-ESA extends this to the multilingual case by exploiting a background document collection that is aligned across languages, such as Wikipedia. [sent-16, score-0.359]
10 (1990)) or generative topic models (such as LDA, Blei et al. (2003)). [sent-20, score-0.174]
11 A key choice in Explicit Semantic Analysis is the document space that will act as the topic space. [sent-23, score-0.38]
12 The standard choice is to regard all articles from a background document collection (Wikipedia articles are a typical choice) as the topic space. [sent-24, score-0.491]
13 However, it is crucial to ensure that these topics cover the semantic space evenly and completely. [sent-25, score-0.204]
14 In this paper, we present an alternative approach where we remap the semantic space defined by the topics in such a manner that it is orthonormal. [sent-26, score-0.204]
15 In this way, each document is mapped to a topic that is distinct from all other topics. [sent-27, score-0.348]
16 Such a mapping can be considered as equivalent to a variant of Latent Semantic Indexing (LSI) with the main difference that our model exploits the matrix that maps topic vectors back into document space, which is normally discarded in LSI-based approaches. [sent-28, score-0.574]
17 In particular, we quantify the effect of different approximation techniques for computing the orthonormal basis and investigate the effect of various methods for the normalization of frequency vectors. [sent-32, score-0.508]
18 The structure of the paper is as follows: we situate our work in the general context of related work on topic models for cross-lingual document matching in Section 2. [sent-33, score-0.385]
19 2 Related Work The idea of applying topic models that map documents into an interlingual topic space seems a quite natural and principled approach to tackle several tasks including the cross-lingual document retrieval problem. [sent-35, score-0.784]
20 Three main variants of document models have been considered for cross-lingual document matching: Latent methods such as Latent Semantic Indexing (LSI, Deerwester et al. [sent-37, score-0.348]
21 (1990)) induce a decomposition of the term-document matrix in a way that reduces the dimensionality of the documents, while minimizing the error in reconstructing the training data. [sent-38, score-0.196]
22 For example, in Latent Semantic Indexing, a term-document matrix is approximated by a partial singular value decomposition, or in Non-Negative Matrix Factorization (NMF, Lee and Seung (1999)) by two smaller non-negative matrices. [sent-39, score-0.172]
23 If we append comparable or equivalent documents in multiple languages together before computing the decomposition as proposed by Dumais et al. [sent-40, score-0.198]
24 (1997) then the topic model is essentially cross-lingual, making it possible to compare documents in different languages once they have been mapped into the topic space. [sent-41, score-0.496]
25 Probabilistic or generative methods instead attempt to induce a (topic) model that has the highest likelihood of generating the documents actually observed during training. [sent-42, score-0.148]
26 As with latent methods, these topics are thus interlingual and can generate words/terms in different languages. [sent-43, score-0.264]
27 Explicit topic models make the assumption that topics are explicitly given instead of being induced from training data. [sent-47, score-0.292]
28 Typically, a background document collection is assumed to be given whereby each document in this corpus corresponds to one topic. [sent-48, score-0.491]
29 A mapping from document to topic space is calculated by computing the similarity of the document to every document in the topic space. [sent-49, score-0.943]
30 A prominent example for this kind of topic modelling approach is Explicit Semantic Analysis (ESA, Gabrilovich and Markovitch (2007)). [sent-50, score-0.241]
31 Both latent and generative topic models attempt to find topics from the data and it has been found that in some cases they are equivalent (Ding et al. [sent-51, score-0.391]
32 However, this approach suffers from the problem that the topics might be artifacts of the training data rather than coherent semantic topics. [sent-53, score-0.172]
33 In contrast, explicit topic methods can use a set of topics that are chosen to be well-suited to the domain. [sent-54, score-0.458]
34 The principal drawback of this is that the method for choosing such explicit topics by selecting documents is comparatively crude. [sent-55, score-0.432]
35 In general, these topics may be overlapping and poorly distributed over the semantic topic space. [sent-56, score-0.346]
36 By comparison, our method takes advantage of the pre-specified topics of explicit topic models, but incorporates a training step to learn latent relations between these topics. [sent-57, score-0.391]
37 3 Orthonormal explicit topic analysis Our approach follows Explicit Semantic Analysis in the sense that it assumes the availability of a background document collection B = {b1, b2, ..., bN}. [sent-58, score-0.657]
38 The mapping into the explicit topic space is defined by a language-specific function Φ that maps documents into R^N, such that the jth value in the vector is given by some association measure φ_j(d) for each background document b_j. [sent-62, score-0.975]
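As a concrete (and deliberately minimal) illustration of such an explicit mapping, the following numpy sketch uses the simplest possible association measure, the inner product with each background document; the function name esa_project and the matrix layout (X as a W × N term-document matrix) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def esa_project(X, d):
    """Explicit (ESA-style) mapping: the j-th coordinate of the topic
    vector is an association measure phi_j(d) between document d and
    background document b_j; here simply the inner product of the
    term-frequency vectors, so the whole mapping is X^T d.

    X : (W, N) term-document matrix of the background collection B
    d : (W,)   term-frequency vector of the input document
    """
    return X.T @ d
```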
39 We can see that maximizing this margin for all i, j is the same as minimizing the semantic overlap of the background documents, which is given as follows: overlap(B) = Σ_{i=1}^{N} Σ_{j≠i} (X^T b_i)^T (X^T b_j). [sent-74, score-0.184]
40 This is the case when the topics are orthonormal: (X^T b_i)^T (X^T b_j) = 0 if i ≠ j, and (X^T b_i)^T (X^T b_i) = 1. Unfortunately, this is not typically the case, as the documents have significant word overlap as well as semantic overlap. [sent-81, score-0.36]
41 Assuming that this transformation of X is done by multiplication with some other matrix A, we can define the learning problem as finding that matrix A such that: (A X^T X)^T (A X^T X) = I, where ||A||_p = Σ_{i,j} |a_ij|^p is the p-norm. [sent-83, score-0.318]
42 We define the projection function of a document d, represented as a normalized term frequency vector, as follows: Φ_ONETA(d) = (X^T X)^{-1} X^T d. For the cross-lingual case we assume that we have two sets of background documents of equal size, B1 = {b_1^1, ..., b_N^1} [sent-86, score-0.513]
43 and B2 = {b_1^2, ..., b_N^2}, in languages l1 and l2 respectively, and that these documents are aligned such that for every index i, b_i^1 and b_i^2 are documents on the same topic in each language. [sent-92, score-0.364]
44 Using this we can construct a projection function for each language which maps into the same topic space. [sent-93, score-0.213]
45 Thus, as in CL-ESA, we obtain the cross-lingual similarity between a document d_i in language l1 and a document d_j in language l2 as follows: sim(d_i, d_j) = cos(Φ^{l1}_ONETA(d_i), Φ^{l2}_ONETA(d_j)). We note here that we assume that Φ could be represented as a symmetric inner product of two vectors. [sent-94, score-0.485]
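A minimal numpy sketch of the projection and of this cross-lingual similarity might look as follows; the function names, the choice to solve the linear system rather than form the inverse, and the assumption that X1 and X2 are column-aligned term-document matrices are ours, not the paper's.

```python
import numpy as np

def oneta_project(X, d):
    """ONETA projection (X^T X)^{-1} X^T d of a normalized
    term-frequency vector d; solving the system avoids forming
    the N x N inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ d)

def cross_lingual_sim(X1, X2, d_i, d_j):
    """Cosine similarity of document d_i in language l1 and d_j in
    language l2, each projected through its language-specific
    background matrix; columns of X1 and X2 are aligned by topic."""
    phi_i, phi_j = oneta_project(X1, d_i), oneta_project(X2, d_j)
    denom = np.linalg.norm(phi_i) * np.linalg.norm(phi_j)
    return float(phi_i @ phi_j / denom) if denom > 0 else 0.0
```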
46 In this case the expression X^T X can be replaced with a kernel matrix specifying the association of each background document to each other background document. [sent-96, score-0.5]
47 Latent Semantic Indexing defines a mapping from a document represented as a term frequency vector to a vector in R^K. [sent-99, score-0.316]
48 X = UΣV^T, where Σ is diagonal and U and V are the eigenvectors of XX^T and X^T X, respectively. [sent-102, score-0.128]
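For comparison, a common formulation of the LSI fold-in, sketched under the same numpy conventions; the specific form Sigma_K^{-1} U_K^T d is the standard one from the LSI literature, not necessarily the exact variant used in the paper.

```python
import numpy as np

def lsi_project(X, d, K):
    """Fold a term-frequency vector d into the K-dimensional latent
    space via a truncated SVD of X: Sigma_K^{-1} U_K^T d."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :K].T @ d) / s[:K]
```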
49 3.2 Approximations The computation of the inverse has a complexity that, using current practical algorithms, is approximately cubic, and as such the time spent calculating the inverse can grow very quickly. [sent-115, score-0.176]
50 As X^T X is symmetric positive definite, it holds that: X^T X = UΣU^T, where U are the eigenvectors of X^T X and Σ is a diagonal matrix of the eigenvalues. [sent-118, score-0.274]
51 (with exponent a = 2.807 or a = 3 for current practical algorithms; Coppersmith and Winograd, 1990). Taking Σ_K and U_K as the first K eigenvalues and eigenvectors, respectively, we have: (X^T X)^{-1} ≈ U_K Σ_K^{-1} U_K^T (1). We call this the orthonormal eigenapproximation, or ON-Eigen. [sent-121, score-0.188]
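Equation 1 is straightforward to realize with a sparse eigensolver; the sketch below is an assumed implementation (scipy's eigsh applied to the symmetric matrix X^T X), not the authors' code.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def on_eigen_inverse(XtX, K):
    """ON-Eigen (Equation 1): approximate (X^T X)^{-1} by
    U_K Sigma_K^{-1} U_K^T, using the K largest eigenpairs of the
    symmetric positive definite matrix X^T X."""
    vals, vecs = eigsh(XtX, k=K, which='LM')  # top-K eigenvalues/vectors
    return (vecs / vals) @ vecs.T             # U_K diag(1/lambda) U_K^T
```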
52 Similarly, using the formula derived in the previous section we can derive an approximation of the full model as follows: (X^T X)^{-1} X^T ≈ U_K Σ_K^{-1} V_K^T (2). We call this approximation Explicit LSI, as it first maps into the latent topic space and then into the explicit topic space. [sent-123, score-0.896]
53 We can consider another approximation by noticing that X is typically very sparse and moreover some rows of X have significantly fewer non-zeroes than others (these rows are for terms with low frequency). [sent-124, score-0.217]
54 Thus, if we take the first N1 columns (documents) in X, it is possible to rearrange the rows of X with the result that there is some W1 such that rows with index greater than W1 have only zeroes in the columns up to N1. [sent-125, score-0.14]
55 In other words, we take a subset of N1 documents and enumerate the words in such a way that the terms occurring in the first N1 documents are enumerated 1, ..., W1. [sent-126, score-0.296]
56 The result of this row permutation does not affect the value of X^T X, and we can write the matrix X in block form as: X = [[A, B], [0, C]]. [sent-131, score-0.146]
Applying a standard block formula for matrix inversion yields the following easily verifiable matrix identity, given that we can find C′ such that C′C = I. [sent-135, score-0.33]
[[A^{-1}, -A^{-1} B C′], [0, C′]] X = I (3). We denote the above equation, using a matrix L, as L^T X = I. [sent-140, score-0.192]
We note that L^T ≠ (X^T X)^{-1} X^T, but for any document vector that is representable as a linear combination of the background document set (i.e., one lying in the span of X), the two mappings give the same result. [sent-141, score-0.264]
C contains only terms not contained in the first N1 documents, and we notice that very sparse matrices tend to be approximately orthogonal, hence suggesting that it should be very easy to find a left-inverse of C. [sent-146, score-0.177]
61 On real data this might be violated if we do not have linear independence of the rows of C, for example if W2 < N2 or if we have even one document which has only words that are also contained in the first N1 documents and hence there is a row in C that consists of zeros only. [sent-165, score-0.363]
This can be solved by removing documents from the collection until C is row-wise linearly independent. [sent-166, score-0.201]
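A sketch of the block construction follows, under the simplifying assumptions that X has already been permuted into the form [[A, B], [0, C]], that A is square (W1 = N1) and invertible, and that C has full column rank so that a left-inverse exists; the identity L^T X = I is easy to check by block multiplication.

```python
import numpy as np

def l_solve_matrix(X, N1, W1):
    """Build L^T = [[A^{-1}, -A^{-1} B C'], [0, C']] with C' C = I,
    so that L^T X = I for X = [[A, B], [0, C]]."""
    A, B, C = X[:W1, :N1], X[:W1, N1:], X[W1:, N1:]
    A_inv = np.linalg.inv(A)          # assumes A square and invertible
    C_li = np.linalg.pinv(C)          # left-inverse when C has full column rank
    top = np.hstack([A_inv, -A_inv @ B @ C_li])
    bottom = np.hstack([np.zeros((C_li.shape[0], W1)), C_li])
    return np.vstack([top, bottom])   # this is L^T
```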
3.3 Normalization A key factor in the effectiveness of topic-based methods is the appropriate normalization of the elements of the document matrix X. [sent-168, score-0.491]
64 This is even more relevant for orthonormal topics as the matrix inversion procedure can be very sensitive to small changes in the matrix. [sent-169, score-0.448]
65 In this context, we consider two forms of normalization, term and document normalization, which can also be considered as row/column normalizations of X. [sent-170, score-0.25]
A straightforward approach to normalization is to normalize each column of X to obtain a matrix as follows: X′ = (x_1/||x_1||, ..., x_N/||x_N||), where x_n is the nth column of X. [sent-171, score-0.317]
Formally, we can use this matrix Y to state a bound of the form ||· − I||_F, but in practice it means that the orthonormalizing matrix has more small or zero values. [sent-178, score-0.146]
68 A further option for normalization is to consider some form of term frequency normalization. [sent-179, score-0.272]
One option is the relative frequency tf_wn / F_w. Here, tf_wn is the term frequency of word w in document n, F_w is the total frequency of word w in the corpus, and df_w is the number of documents containing the word w. [sent-181, score-0.544]
70 The SQRT normalization has been shown to be effective for explicit topic methods in previous experiments not reported here. [sent-183, score-0.511]
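The exact functional forms are partially garbled in this extraction; the following sketch assumes the standard definitions (relative frequency, its square root, and TF-IDF), together with the column normalization described above.

```python
import numpy as np

def sqrt_norm(tf, F):
    """SQRT normalization, assumed here to be sqrt(tf_wn / F_w)."""
    return np.sqrt(tf / F)

def tfidf_norm(tf, df, n_docs):
    """Standard TF-IDF weighting: tf_wn * log(N / df_w)."""
    return tf * np.log(n_docs / df)

def column_normalize(X):
    """Document (column) normalization: scale each column of X to
    unit Euclidean length."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)
```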
4 Experiments and Results For evaluation, we consider a cross-lingual mate retrieval task for English/Spanish, using Wikipedia as the aligned corpus. [sent-184, score-0.171]
72 The goal is to, for each document of a test set, retrieve the aligned document or mate. [sent-185, score-0.39]
(Residue of a results table on large-scale mate-finding studies for English to Spanish matching omitted.) Having computed the similarity of the query document to all indexed documents, we compute the value rank_i, indicating at which position the mate of the ith document occurs. [sent-257, score-0.452]
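The evaluation itself reduces to a ranking computation; a minimal sketch (assuming the mate of query i sits at index i of the indexed collection):

```python
import numpy as np

def mate_ranks(sim):
    """sim[i, j] = similarity of query i to indexed document j;
    returns rank_i for each query (1 = mate retrieved first)."""
    order = np.argsort(-sim, axis=1)  # best-first document order per query
    return [int(np.where(order[i] == i)[0][0]) + 1 for i in range(len(sim))]

def top1_precision(sim):
    """Fraction of queries whose mate is ranked first (Top-1 Precision)."""
    ranks = mate_ranks(sim)
    return sum(r == 1 for r in ranks) / len(ranks)
```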
74 This gives us 10,369 aligned documents in total, which form the background document collection B. [sent-259, score-0.507]
75 Normalization Methods: In order to investigate the impact of different normalization methods, we ran small-scale experiments using the first 500 documents from our dataset to train ONETA and then evaluate the resulting models on the mate-finding task on 100 unseen documents. [sent-265, score-0.319]
The results are presented in Table 1, which shows the Top-1 Precision for the different normalization methods. [sent-266, score-0.171]
We see that applying document normalization improves the quality of the overall result in all cases. [sent-267, score-0.345]
Surprisingly, we do not see the same effect for frequency normalization: the best result is obtained in the case where we apply no term frequency normalization at all. [sent-268, score-0.4]
79 In the remaining experiments we thus employ document normalization and no term frequency normalization. [sent-269, score-0.446]
Approximation Methods: We experimentally compare four approximation methods on the same small-scale corpus: standard LSI, ON-Eigen (Equation 1), Explicit LSI (Equation 2), and L-Solve (Equation 3). [sent-270, score-0.212]
For convenience we plot an approximation rate, which is either K or N1 depending on the method; at K = 500 and N1 = 500, these approximations become exact. [sent-271, score-0.106]
82 We also observe the effects of approximation and see that the performance increases steadily as we increase the computational factor. [sent-273, score-0.106]
We see that the orthonormal eigenvector (Equation 1) method and the L-Solve (Equation 3) method are clearly similar in approximation quality. [sent-274, score-0.252]
84 Explicit LSI is worse than the other approximations as it first maps the test documents into a K-dimensional LSI topic space, before mapping back into the N-dimensional explicit space. [sent-276, score-0.606]
We also see that the (CL-)ESA baseline, which is very low due to the small number of documents, is improved upon by even the least approximation of orthonormalization. [sent-278, score-0.106]
Evaluation and Comparison: We compare ONETA using the L-Solve method, with N1 values from 1000 to 9000, against (CL-)ESA (using SQRT normalization), LDA (using 1000 topics) and LSI (using 4000 topics). [sent-280, score-0.118]
We choose the largest topic count for LSI and LDA that we can, to provide the best possible comparison. [sent-281, score-0.174]
88 We also stress that for L-Solve ONETA, N1 is not the topic count but an approximation rate of the mapping. [sent-283, score-0.28]
In all settings we use N topics, as in standard ESA, so N1 should not be considered directly comparable to the K values of these methods. [sent-284, score-0.118]
Interestingly, even for a small number of documents (e.g., [sent-288, score-0.148]
N1 = 6000), our results improve upon both the word-translation baseline and all other topic models: ESA, LDA and LSI in particular. [sent-290, score-0.174]
Further, we consider a straightforward combination of our method with the translation system, consisting of appending the topic vectors and the translation frequency vectors, weighted by the relative average norms of the vectors. [sent-293, score-0.348]
This is in line with the theoretical calculations presented earlier, where we argued that inverting the N × N dense matrix X^T X remains feasible when W ≫ N. [sent-300, score-0.146]
In addition, as we do not multiply (X^T X)^{-1} and X^T, we do not need to allocate a large W × K matrix in memory, as with LSI and LDA. [sent-302, score-0.183]
95 5 Conclusion We have presented a novel method for cross-lingual topic modelling, which combines the strengths of explicit and latent topic models and have demonstrated its application to cross-lingual document matching. [sent-305, score-0.787]
96 We have in particular shown that the method outperforms widely used topic models such as Explicit Semantic Analysis (ESA), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). [sent-306, score-0.174]
97 Further, we have shown that it outperforms a simple baseline relying on word-by-word translation of the query document into the target language, while the induction of the model takes less time than training the machine translation system from a parallel corpus. [sent-307, score-0.29]
We also introduced an approximation method, L-Solve, which significantly reduces the computational cost associated with computing the topic models. [sent-310, score-0.174]
99 Learning the parts of objects by non-negative matrix factorization. [sent-369, score-0.146]
100 An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. [sent-392, score-0.22]
wordName wordTfidf (topN-words)
[('oneta', 0.42), ('lsi', 0.396), ('xtx', 0.267), ('esa', 0.233), ('topic', 0.174), ('document', 0.174), ('normalization', 0.171), ('explicit', 0.166), ('documents', 0.148), ('orthonormal', 0.146), ('matrix', 0.146), ('bielefeld', 0.126), ('topics', 0.118), ('lda', 0.114), ('bj', 0.111), ('approximation', 0.106), ('sqrt', 0.105), ('cimiano', 0.1), ('latent', 0.099), ('dj', 0.093), ('sorg', 0.091), ('background', 0.09), ('indexing', 0.077), ('sim', 0.075), ('philipp', 0.072), ('modelling', 0.067), ('mate', 0.067), ('diagonal', 0.065), ('eigenvectors', 0.063), ('tfwn', 0.063), ('xtbi', 0.063), ('frequency', 0.058), ('translation', 0.058), ('calculating', 0.058), ('plsi', 0.055), ('semantic', 0.054), ('collection', 0.053), ('decomposition', 0.05), ('interlingual', 0.047), ('equation', 0.046), ('inverse', 0.046), ('deerwester', 0.044), ('gabrilovich', 0.044), ('di', 0.044), ('term', 0.043), ('axtx', 0.042), ('coppersmith', 0.042), ('dctc', 0.042), ('eigenvalues', 0.042), ('jmccrae', 0.042), ('klinger', 0.042), ('spiliopoulos', 0.042), ('xtxxtx', 0.042), ('svd', 0.042), ('aligned', 0.042), ('mapping', 0.041), ('rows', 0.041), ('overlap', 0.04), ('maps', 0.039), ('approximations', 0.038), ('inversion', 0.038), ('inspiration', 0.038), ('matching', 0.037), ('memory', 0.037), ('cos', 0.037), ('eigendecomposition', 0.037), ('bytes', 0.037), ('yij', 0.037), ('df', 0.036), ('retrieval', 0.035), ('dumais', 0.035), ('normalizations', 0.033), ('tam', 0.033), ('nmf', 0.033), ('space', 0.032), ('tf', 0.032), ('markovitch', 0.031), ('tfidf', 0.031), ('roman', 0.031), ('ding', 0.031), ('oxf', 0.031), ('germany', 0.031), ('bn', 0.029), ('sparse', 0.029), ('columns', 0.029), ('mimno', 0.028), ('bi', 0.028), ('cb', 0.028), ('element', 0.028), ('allocation', 0.028), ('basis', 0.027), ('moses', 0.027), ('spanish', 0.027), ('europarl', 0.027), ('inequality', 0.027), ('dirichlet', 0.027), ('singular', 0.026), ('blei', 0.026), ('transformation', 0.026), ('spent', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
2 0.17455536 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of specific domain bilingual lexicon that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state of the art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
3 0.12968758 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
Author: Min Xiao ; Yuhong Guo
Abstract: Cross-lingual adaptation aims to learn a prediction model in a label-scarce target language by exploiting labeled data from a label-rich source language. An effective cross-lingual adaptation system can substantially reduce the manual annotation effort required in many natural language processing tasks. In this paper, we propose a new cross-lingual adaptation approach for document classification based on learning cross-lingual discriminative distributed representations of words. Specifically, we propose to maximize the log-likelihood of the documents from both language domains under a cross-lingual log-bilinear document model, while minimizing the prediction log-losses of labeled documents. We conduct extensive experiments on cross-lingual sentiment classification tasks of Amazon product reviews. Our experimental results demonstrate the efficacy of the proposed cross-lingual adaptation approach.
4 0.1266298 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
5 0.12251944 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
Author: Wei Wang ; Hua Xu ; Xiaoqiu Huang
Abstract: Implicit feature detection, also known as implicit feature identification, is an essential aspect of feature-specific opinion mining but previous works have often ignored it. We think, based on the explicit sentences, several Support Vector Machine (SVM) classifiers can be established to do this task. Nevertheless, we believe it is possible to do better by using a constrained topic model instead of traditional attribute selection methods. Experiments show that this method outperforms the traditional attribute selection methods by a large margin and the detection task can be completed better.
6 0.1112202 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
7 0.10861824 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
8 0.098414987 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
9 0.09304662 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
10 0.091793582 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
11 0.084691979 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
12 0.075495355 121 emnlp-2013-Learning Topics and Positions from Debatepedia
13 0.074443236 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
14 0.073941119 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
15 0.071156487 137 emnlp-2013-Multi-Relational Latent Semantic Analysis
16 0.068019569 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
17 0.067525819 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
18 0.067150593 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
19 0.065445744 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
20 0.064914174 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
topicId topicWeight
[(0, -0.214), (1, 0.006), (2, -0.087), (3, 0.019), (4, 0.066), (5, 0.001), (6, 0.065), (7, 0.062), (8, -0.045), (9, -0.237), (10, -0.008), (11, -0.138), (12, -0.069), (13, 0.086), (14, 0.055), (15, 0.042), (16, 0.067), (17, -0.029), (18, -0.108), (19, -0.011), (20, -0.123), (21, -0.065), (22, -0.016), (23, 0.035), (24, -0.058), (25, -0.165), (26, -0.01), (27, 0.119), (28, 0.2), (29, 0.174), (30, 0.028), (31, -0.016), (32, -0.074), (33, 0.086), (34, 0.021), (35, -0.023), (36, -0.085), (37, 0.071), (38, 0.086), (39, 0.104), (40, 0.061), (41, 0.017), (42, -0.021), (43, -0.069), (44, -0.051), (45, 0.001), (46, 0.087), (47, 0.122), (48, -0.125), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95458108 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
2 0.53555137 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of specific domain bilingual lexicon that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state of the art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
3 0.52380478 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
4 0.51186705 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MCLDA outperforms the existing state-of-the-art models markedly.
5 0.49963108 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao
Abstract: One of the language phenomena that n-gram language model fails to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all ngrams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumption, we found it is effective to assign a topic to only some parts of a document.
6 0.49372622 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
7 0.48725358 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
8 0.48404542 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
9 0.45824155 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
10 0.45098409 121 emnlp-2013-Learning Topics and Positions from Debatepedia
11 0.42740551 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
12 0.41975278 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
13 0.41952157 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
14 0.41658726 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
15 0.40364331 137 emnlp-2013-Multi-Relational Latent Semantic Analysis
16 0.40269306 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
17 0.39125305 138 emnlp-2013-Naive Bayes Word Sense Induction
18 0.37595105 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
19 0.3679719 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
20 0.36286688 95 emnlp-2013-Identifying Multiple Userids of the Same Author
topicId topicWeight
[(3, 0.017), (9, 0.013), (18, 0.033), (22, 0.034), (30, 0.056), (50, 0.012), (51, 0.662), (66, 0.025), (71, 0.014), (75, 0.012), (77, 0.02), (96, 0.019)]
simIndex simValue paperId paperTitle
1 0.998519 32 emnlp-2013-Automatic Idiom Identification in Wiktionary
Author: Grace Muzny ; Luke Zettlemoyer
Abstract: Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
2 0.998362 24 emnlp-2013-Application of Localized Similarity for Web Documents
Author: Peter Rebersek ; Mateja Verlic
Abstract: In this paper we present a novel approach to automatic creation of anchor texts for hyperlinks in a document pointing to similar documents. Methods used in this approach rank parts of a document based on the similarity to a presumably related document. Ranks are then used to automatically construct the best anchor text for a link inside original document to the compared document. A number of different methods from information retrieval and natural language processing are adapted for this task. Automatically constructed anchor texts are manually evaluated in terms of relatedness to linked documents and compared to baseline consisting of originally inserted anchor texts. Additionally we use crowdsourcing for evaluation of original anchors and automatically constructed anchors. Results show that our best adapted methods rival the precision of the baseline method.
3 0.99764425 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
Author: Karl Pichotta ; John DeNero
Abstract: We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.
4 0.99761188 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels
Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi
Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.
same-paper 5 0.9970004 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
6 0.99662566 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game
7 0.99563545 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
8 0.99412698 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
9 0.94668812 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
10 0.94574964 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution
11 0.94255149 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
12 0.94086707 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology
13 0.93633538 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
14 0.93469077 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
15 0.93349111 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
16 0.93286121 27 emnlp-2013-Authorship Attribution of Micro-Messages
17 0.93009174 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
18 0.92927128 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations
19 0.92916381 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
20 0.92636418 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation