emnlp emnlp2013 emnlp2013-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Do Kook Choe ; Eugene Charniak
Abstract: We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. The proposed model is very simple yet effective when evaluated on SemEval-2010 WSI data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. [sent-3, score-0.474]
2 The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. [sent-4, score-0.109]
3 1 Introduction The task of word sense induction (WSI) is to find, in an unlabeled corpus, clusters of tokens of an ambiguous word that share the same sense. [sent-6, score-0.553]
4 For instance, given a target word “crane,” a good WSI system should find a cluster of tokens referring to avian cranes and another referring to mechanical cranes. [sent-7, score-0.261]
5 We believe that neighboring words contain enough information that these clusters can be found from plain texts. [sent-8, score-0.217]
6 WSI is related to word sense disambiguation (WSD) . [sent-9, score-0.21]
7 In a WSD task, a system learns a sense classifier in a supervised manner from a sense-labeled corpus. [sent-10, score-0.321]
8 The performance of the learned classifier is measured on some unseen data. [sent-11, score-0.08]
9 In addition, WSD systems are not suitable for newly created words, new senses of existing words, or domainspecific words. [sent-13, score-0.155]
10 On the other hand, WSI systems can learn new senses of words directly from texts because these programs do not rely on a predefined set of senses . [sent-14, score-0.31]
11 In Sections 3 and 4 we introduce the naive Bayes model for WSI and inference schemes for the model. [sent-16, score-0.187]
12 2 Related Work Yarowsky (1995) introduces a semi-supervised bootstrapping algorithm that rivals supervised algorithms, built on two assumptions: one-sense-per-collocation and one-sense-per-discourse. [sent-21, score-0.159]
13 But this algorithm cannot easily be scaled up because for any new ambiguous word humans need to pick a few seed words, which initialize the algorithm. [sent-22, score-0.08]
14 In order to automate the semi-supervised system, Eisner and Karakos (2005) propose an unsupervised bootstrapping algorithm. [sent-23, score-0.094]
15 Their system tries many different seeds for bootstrapping and chooses the “best” classifier at the end. [sent-24, score-0.105]
16 They run a topic modeling algorithm on texts with some fixed number of topics that correspond to senses and induce a cluster by finding target words assigned to the same topic. [sent-29, score-0.278]
17 3 Model Following Yarowsky (1995) , we assume that a word in a document has one sense. [sent-34, score-0.09]
18 Multiple occurrences of a word in a document refer to the same object or concept. [sent-35, score-0.09]
19 The naive Bayes model is well suited for this one-sense-per-document assumption. [sent-36, score-0.15]
20 Each document has one topic corresponding to the sense of the target word that needs disambiguation. [sent-37, score-0.343]
21 Context words in a document are drawn from the conditional distribution of words given the sense. [sent-38, score-0.057]
22 3.1 Naive Bayes The naive Bayes model assumes that every word in a document is generated independently from the conditional distribution of words given a sense, p(w|s). [sent-43, score-0.24]
23 With the model, a new document can be easily labeled using the following classifier: s0 = argmax_s p(s) ∏_w p(w|s), (2) where s0 is the label of the new document. [sent-45, score-0.057]
24 In contrast to LDA-like models, it is easy to construct the closed-form classifier from the model. [sent-46, score-0.045]
25 The parameters of the model, p(s) and p(w|s), can be learned by maximizing the probability of the corpus, p(D) = ∏_d p(d) with p(d) = Σ_s p(s) ∏_w p(w|s) (1), where D is the vector of documents and each document d is a bag of context words. [sent-47, score-0.05]
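To make the generative story concrete, here is a minimal Python sketch (not the authors' code) of the classifier in Eq. 2, evaluated in log space; the dictionary layout and names such as log_word_given_sense, plus the floor probability for unseen words, are illustrative assumptions.

    import math

    def classify(doc_words, log_prior, log_word_given_sense, senses):
        # Eq. 2: s0 = argmax_s p(s) * prod_w p(w|s), computed in log space
        # to avoid numerical underflow on long documents.
        best_sense, best_score = None, float("-inf")
        for s in senses:
            score = log_prior[s] + sum(
                log_word_given_sense[s].get(w, math.log(1e-10))  # illustrative floor
                for w in doc_words)
            if score > best_score:
                best_sense, best_score = s, score
        return best_sense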
26 3.2 Distance Incorporated Naive Bayes Intuitively, context words near a target word are more indicative of its sense than ones that are farther away. [sent-49, score-0.347]
27 To account for this intuition, we propose a more sophisticated model that uses the distance between a context word and a target word. [sent-50, score-0.157]
28 Before introducing the new model, we define a probability distribution, f(w|s), that incorporates distances as follows: f(w|s) = p(w|s) l(w) / Σ_{w'} p(w'|s) l(w'), (3) where l(w) = 1/dist(w)^x. [sent-51, score-0.04]
29 x is a tunable parameter that takes nonnegative real values. [sent-53, score-0.047]
30 With the new probability distribution, the model and the classifier become: p(d) = Σ_s p(s) ∏_w f(w|s) (4) and s0 = argmax_s p(s) ∏_w f(w|s), (5) where f(w|s) replaces p(w|s). [sent-54, score-0.045]
31 The naive Bayes model is a special case; set x = 0. [sent-55, score-0.189]
32 The new model puts more weight on context words that are close to the target word. [sent-57, score-0.076]
33 The distribution of words that are farther away approaches the uniform distribution. [sent-58, score-0.121]
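The weighting in Eq. 3 is mechanical enough to sketch directly; the following is a hedged Python reading of it, assuming dist(w) >= 1 has already been computed for each context word (the function name is hypothetical). Note that x = 0 gives l(w) = 1 for every word and recovers the plain naive Bayes distribution.

    def distance_weighted_dist(p_w_given_s, dist, x):
        # f(w|s) = p(w|s) * l(w) / sum_w' p(w'|s) * l(w'), with l(w) = 1 / dist(w)**x (Eq. 3).
        unnorm = {w: p / (dist[w] ** x) for w, p in p_w_given_s.items()}
        z = sum(unnorm.values())  # normalizing constant over the context vocabulary
        return {w: v / z for w, v in unnorm.items()}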
34 4 Inference Given the generative model, we employ two inference algorithms to learn the sense distribution and word distributions given a sense. [sent-60, score-0.21]
35 Expectation Maximization (EM) is a natural choice for the naive Bayes model (Dempster et al., 1977). [sent-61, score-0.15]
36 To avoid local maxima, we use a Gibbs sampler for the plain naive Bayes to learn parameters that initialize EM. [sent-64, score-0.409]
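For the shape of the EM updates, here is a hedged sketch of one iteration for the plain naive Bayes mixture (a sketch under stated assumptions, not the authors' implementation; the add-alpha smoothing constant is an assumption). In the setup above, the parameters produced by the Gibbs sampler would supply log_prior and log_w_s for the first call.

    import math
    from collections import Counter

    def em_step(docs, log_prior, log_w_s, senses, vocab, alpha=0.01):
        # E-step: posterior p(s|d) per document; M-step: re-estimate p(s), p(w|s).
        sense_mass, word_mass = Counter(), {s: Counter() for s in senses}
        for words in docs:
            log_post = {s: log_prior[s] + sum(log_w_s[s][w] for w in words)
                        for s in senses}
            m = max(log_post.values())  # log-sum-exp stabilization
            post = {s: math.exp(lp - m) for s, lp in log_post.items()}
            z = sum(post.values())
            for s in senses:
                gamma = post[s] / z  # responsibility of sense s for this document
                sense_mass[s] += gamma
                for w in words:
                    word_mass[s][w] += gamma
        new_prior = {s: math.log((sense_mass[s] + 1e-10) / len(docs)) for s in senses}
        new_w_s = {}
        for s in senses:
            total = sum(word_mass[s].values()) + alpha * len(vocab)
            new_w_s[s] = {w: math.log((word_mass[s][w] + alpha) / total) for w in vocab}
        return new_prior, new_w_s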
37 The task has 100 target words, 50 nouns and 50 verbs. [sent-68, score-0.112]
38 For each target word, there are training and test documents. [sent-69, score-0.076]
39 The training and test data are plain texts without sense tags. [sent-71, score-0.233]
40 For evaluation, the inferred sense labels are compared with human annotations. [sent-72, score-0.177]
41 To tune some parameters we use the trial data of SemEval-2010. [Table 1: training and testing data of SemEval-2010, broken down by nouns, verbs, and all.] [sent-73, score-0.168]
42 The trial data consists of training and test portions of 4 verbs. [sent-75, score-0.118]
43 On average there are 137 documents for each target word in the training part of the trial data. [sent-76, score-0.227]
44 2 Task Participants induce clusters from the training data and use them to label the test data. [sent-78, score-0.161]
45 Tuning parameters and inducing clusters are only allowed during the training phase. [sent-80, score-0.211]
46 Note however that LDA requires learning the mixture weights of topics for each individual document p(topic | document) . [sent-84, score-0.088]
47 The documents in the testing corpus have never been seen before, so clearly their topic mixture weights are not learned during training, and thus not learned at all. [sent-87, score-0.101]
48 Context words within a window of 50 about a target word are used to construct a bag-of-words. [sent-93, score-0.109]
49 When a target word appears more than once in a document, the distance between that target word and a context word is ambiguous. [sent-94, score-0.299]
50 We define this distance to be the minimum distance between a context word and an instance of the target word. [sent-95, score-0.205]
51 For example, with the target word “chip” occurring three times in a document, the context word “shining” has three possible distances: 8 away from the first “chip,” 4 away from the second “chip,” and 11 away from the last “chip.” [sent-100, score-0.18]
52 We set the distance of “shining” from the target to 4. [sent-101, score-0.124]
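A small sketch of that minimum-distance rule (names illustrative):

    def min_distance(pos, target_positions):
        # Distance of a context word = minimum distance to any occurrence
        # of the target word in the document.
        return min(abs(pos - t) for t in target_positions)

    # A context word 8, 4 and 11 tokens from the three target occurrences,
    # e.g. min_distance(20, [12, 16, 31]), gets distance 4, as in the example.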
53 We set the hyperparameters for the Gibbs sampler as in Brody and Lapata (2009). [sent-105, score-0.106]
54 We initialize EM with parameters learned from the sampler. [sent-106, score-0.132]
55 We run the sampler for 2000 iterations, including 1000 iterations of burn-in; 10 samples taken at an interval of 100 are averaged. [sent-108, score-0.143]
56 All reported results are averaged over ten different runs of the program. [sent-111, score-0.071]
57 1 Tuning Parameters Two parameters, the number of senses and x of the function l(w) , need to be determined before running the program. [sent-114, score-0.155]
58 To find a good setting we do grid search on the trial data over the number of senses and x (sketched below). (Footnote 1: Code used for experiments is available for download at http://cs ...) [sent-115, score-0.273]
59 Due to the small size of the training portion of the trial data, words that occur only once are discarded from it. [sent-121, score-0.118]
60 All the other parameters are as described in Section 5. [sent-122, score-0.05]
61 With a fixed value of x, a column is nearly unimodal in the number of senses and vice versa. [sent-127, score-0.155]
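A hedged sketch of that grid search follows; the candidate grids are illustrative placeholders rather than the paper's actual search ranges, and trial_eval stands in for whatever score is computed on the trial data.

    import itertools

    def grid_search(train_data, trial_eval,
                    sense_grid=(2, 3, 4, 5, 7), x_grid=(0.0, 0.5, 1.0, 2.0)):
        # Return the (number of senses, x) pair scoring best on the trial data.
        return max(itertools.product(sense_grid, x_grid),
                   key=lambda cfg: trial_eval(train_data, *cfg))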
62 4 Evaluation We compare our system to other WSI systems and discuss two metrics for unsupervised evaluation (V-Measure, paired F-Score) and one metric for supervised evaluation (supervised recall). [sent-130, score-0.36]
63 We refer to the true group of tokens as a gold class and to an induced group of tokens as a cluster. [sent-131, score-0.175]
64 We refer to the model learned with the sampler and EM as NB, and to the model learned with EM only as NB0. [sent-132, score-0.176]
65 1 Short Descriptions of Other WSI Systems Evaluated on SemEval-2010 The baseline assigns every instance of a target word the most frequent sense (MFS). [sent-135, score-0.286]
66 UoY runs a clustering algorithm on a graph with words as nodes and co-occurrences between words as edges (Korkontzelos and Manandhar, 2010) . [sent-136, score-0.072]
67 NMFlib factors a matrix using nonnegative matrix factorization and runs a clustering algorithm on test instances represented by factors (Van de Cruys et al.). [sent-138, score-0.119]
68 2 V-Measure V-Measure computes the quality of induced clusters as the harmonic mean of two values, homogeneity and completeness. [sent-142, score-0.344]
69 Homogeneity measures whether instances of a cluster belong to a single gold class. [sent-143, score-0.139]
70 Completeness measures whether instances of a gold class belong to a single cluster. [sent-144, score-0.092]
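For readers reproducing these numbers, scikit-learn implements both quantities and their harmonic mean directly; the toy labels below are purely illustrative.

    from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

    gold = [0, 0, 1, 1, 2, 2]   # gold sense classes
    pred = [1, 1, 0, 0, 0, 2]   # induced cluster ids
    print(homogeneity_score(gold, pred),   # do clusters contain a single class?
          completeness_score(gold, pred),  # do classes stay within one cluster?
          v_measure_score(gold, pred))     # harmonic mean of the two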
71 See Table 3 for details of V-Measure evaluation (#cl is the number of induced clusters) . [sent-146, score-0.103]
72 This holds for paired F-Score and supervised recall evaluations. [sent-148, score-0.417]
73 The sampler improves the log-likelihood of NB. [sent-149, score-0.106]
74 But increasing the number of clusters harms paired F-Score, which results in poor supervised recall. [sent-191, score-0.521]
75 NB attains a very high V-Measure with few induced clusters, which indicates that those clusters are of high quality. [sent-192, score-0.334]
76 Other systems use more induced clusters but fail to attain the V-Measure of NB. [sent-193, score-0.314]
77 3 Paired F-Score Paired F-Score is the harmonic mean of paired recall and paired precision. [sent-196, score-0.623]
78 Paired recall is the fraction of pairs belonging to the same gold class that belong to the same cluster. [sent-197, score-0.185]
79 Paired precision is the fraction of pairs belonging to the same cluster that belong to the same gold class. [sent-198, score-0.142]
80 See Table 4 for details of paired F-Score evaluation. [sent-199, score-0.292]
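A self-contained sketch of the metric over instance pairs (one plausible implementation, not the official SemEval scorer):

    from itertools import combinations

    def paired_f_score(gold, pred):
        # gold and pred are parallel lists of class / cluster ids per instance.
        idx = range(len(gold))
        gold_pairs = {(i, j) for i, j in combinations(idx, 2) if gold[i] == gold[j]}
        pred_pairs = {(i, j) for i, j in combinations(idx, 2) if pred[i] == pred[j]}
        tp = len(gold_pairs & pred_pairs)
        if tp == 0:
            return 0.0
        recall = tp / len(gold_pairs)     # same-class pairs kept in one cluster
        precision = tp / len(pred_pairs)  # same-cluster pairs sharing a class
        return 2 * precision * recall / (precision + recall)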
81 As with V-Measure, it is possible to attain a high paired F-Score by producing only one cluster. [sent-200, score-0.342]
82 The baseline, MFS, attains 100% paired recall, which together with the poor performance of WSI systems makes its paired F-Score difficult to beat. [sent-201, score-0.623]
83 V-Measure and paired F-Score are meaningful when systems produce about the same numbers of clusters as the numbers of classes and attain high scores on these metrics. [sent-202, score-0.534]
84 [Table 4: Unsupervised evaluation, paired F-Score.] [sent-227, score-0.261]
85 4 Supervised Recall For the supervised task, the test data is split into two groups: one for mapping clusters to classes and the other for standard WSD evaluation. [sent-230, score-0.371]
86 Two different split schemes (80% mapping, 20% evaluation and 60% mapping, 40% evaluation) are evaluated. [sent-231, score-0.07]
87 Five random splits are averaged for each split scheme. [sent-232, score-0.067]
88 Mapping is induced automatically by the program provided by the organizers. [sent-233, score-0.072]
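One plausible reading of that mapping step, sketched below (the organizers' program may differ in detail): map each cluster to its majority gold class on the mapping split, then score ordinary WSD recall on the held-out split.

    from collections import Counter

    def supervised_recall(map_gold, map_pred, eval_gold, eval_pred):
        # Learn cluster -> class on the mapping split by majority vote.
        votes = {}
        for g, p in zip(map_gold, map_pred):
            votes.setdefault(p, Counter())[g] += 1
        cluster_to_class = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
        # Standard WSD recall on the evaluation split.
        hits = sum(cluster_to_class.get(p) == g for g, p in zip(eval_gold, eval_pred))
        return hits / len(eval_gold)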
89 See Table 5 for details of supervised recall evaluation (#s is the average number of classes mapped from clusters) . [sent-234, score-0.218]
90 [Table 5: Supervised evaluation, supervised recall with 80% mapping and 20% evaluation.] [sent-249, score-0.047]
91 Overall our system performs better than other systems with respect to supervised recall. [sent-259, score-0.198]
92 When a system has higher V-Measure and paired F-Score on nouns than another system, it achieves a higher supervised recall on nouns too. [sent-260, score-0.521]
93 This does not hold for verbs: NB, for example, has higher V-Measure and paired F-Score on verbs than NMFlib but attains a lower supervised recall on verbs than NMFlib. [sent-262, score-0.598]
94 It is difficult to see which verb clusters are better than others. [sent-263, score-0.201]
95 Lau et al. (2012) achieve superior numbers to ours for the two supervised metrics, but at the expense of requiring LDA-type processing on the test data, something that the SemEval organizers ruled out, presumably with the reasonable idea that such processing would not be feasible in the real world. [sent-269, score-0.131]
96 More generally, their system assigns many senses (about 10) to each word, and thus no doubt does poorly on the paired F-Score (they do not report results on V-Measure and paired F-Score). [sent-270, score-0.677]
97 SemEval-2007 Task 02: Evaluating word sense induction and discrimination systems. [sent-273, score-0.324]
98 Maximum likelihood from incomplete data via the EM algorithm. [sent-287, score-0.095]
99 UoY: Graphs of unambiguous vertices for word sense induction and disambiguation. [sent-302, score-0.324]
100 Duluth-WSI: SenseClusters applied to the sense induction task of SemEval-2. [sent-317, score-0.291]
wordName wordTfidf (topN-words)
[('wsi', 0.429), ('paired', 0.261), ('hermit', 0.194), ('nmflib', 0.194), ('uoy', 0.194), ('bayes', 0.178), ('sense', 0.177), ('nb', 0.172), ('clusters', 0.161), ('senses', 0.155), ('chip', 0.155), ('naive', 0.15), ('mfs', 0.135), ('wsd', 0.123), ('trial', 0.118), ('allnounsverbs', 0.116), ('induction', 0.114), ('manandhar', 0.108), ('sampler', 0.106), ('attains', 0.101), ('vmeasure', 0.101), ('supervised', 0.099), ('em', 0.095), ('lau', 0.092), ('attain', 0.081), ('qw', 0.077), ('target', 0.076), ('xp', 0.074), ('induced', 0.072), ('xs', 0.071), ('providence', 0.067), ('homogeneity', 0.067), ('shining', 0.067), ('argmsaxp', 0.067), ('jurgens', 0.067), ('karakos', 0.067), ('brody', 0.062), ('farther', 0.061), ('cruys', 0.061), ('dirichlet', 0.061), ('away', 0.06), ('lda', 0.06), ('bootstrapping', 0.06), ('belong', 0.059), ('korkontzelos', 0.057), ('recall', 0.057), ('document', 0.057), ('plain', 0.056), ('ioannis', 0.054), ('dempster', 0.051), ('parameters', 0.05), ('yw', 0.049), ('suresh', 0.049), ('distance', 0.048), ('eisner', 0.048), ('nonnegative', 0.047), ('competition', 0.047), ('mapping', 0.047), ('initialize', 0.047), ('cl', 0.047), ('cluster', 0.047), ('agirre', 0.045), ('yarowsky', 0.045), ('classifier', 0.045), ('harmonic', 0.044), ('stroudsburg', 0.04), ('distances', 0.04), ('verbs', 0.04), ('fs', 0.039), ('runs', 0.037), ('schemes', 0.037), ('interval', 0.037), ('belonging', 0.036), ('nouns', 0.036), ('participants', 0.035), ('learned', 0.035), ('tokens', 0.035), ('clustering', 0.035), ('referring', 0.035), ('averaged', 0.034), ('ec', 0.034), ('jey', 0.034), ('hdp', 0.034), ('qd', 0.034), ('etw', 0.034), ('anu', 0.034), ('automate', 0.034), ('damianos', 0.034), ('lemmatizer', 0.034), ('loosen', 0.034), ('mala', 0.034), ('word', 0.033), ('gold', 0.033), ('sp', 0.033), ('split', 0.033), ('achieves', 0.032), ('blei', 0.032), ('mixture', 0.031), ('classes', 0.031), ('details', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 138 emnlp-2013-Naive Bayes Word Sense Induction
Author: Do Kook Choe ; Eugene Charniak
Abstract: We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. The proposed model is very simple yet effective when evaluated on SemEval-2010 WSI data.
2 0.09538649 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
Author: Min Xiao ; Feipeng Zhao ; Yuhong Guo
Abstract: Domain adaptation has been popularly studied on exploiting labeled information from a source domain to learn a prediction model in a target domain. In this paper, we develop a novel representation learning approach to address domain adaptation for text classification with automatically induced discriminative latent features, which are generalizable across domains while informative to the prediction task. Specifically, we propose a hierarchical multinomial Naive Bayes model with latent variables to conduct supervised word clustering on labeled documents from both source and target domains, and then use the produced cluster distribution of each word as its latent feature representation for domain adaptation. We train this latent graphical model using a simple expectation-maximization (EM) algorithm. We empirically evaluate the proposed method with both cross-domain document categorization tasks on Reuters-21578 dataset and cross-domain sentiment classification tasks on Amazon product review dataset. The experimental results demonstrate that our proposed approach achieves superior performance compared with alternative methods.
3 0.078513213 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
Author: Dimitri Kartsaklis ; Mehrnoosh Sadrzadeh
Abstract: Recent work has shown that compositional-distributional models using element-wise operations on contextual word vectors benefit from the introduction of a prior disambiguation step. The purpose of this paper is to generalise these ideas to tensor-based models, where relational words such as verbs and adjectives are represented by linear maps (higher order tensors) acting on a number of arguments (vectors). We propose disambiguation algorithms for a number of tensor-based models, which we then test on a variety of tasks. The results show that disambiguation can provide better compositional representation even for the case of tensor-based models. Furthermore, we confirm previous findings regarding the positive effect of disambiguation on vector mixture models, and we compare the effectiveness of the two approaches.
4 0.074738368 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao
Abstract: One of the language phenomena that n-gram language model fails to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all ngrams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumption, we found it is effective to assign a topic to only some parts of a document.
5 0.073785253 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
Author: Oier Lopez de Lacalle ; Mirella Lapata
Abstract: In this paper we present an unsupervised approach to relational information extraction. Our model partitions tuples representing an observed syntactic relationship between two named entities (e.g., “X was born in Y” and “X is from Y”) into clusters corresponding to underlying semantic relation types (e.g., BornIn, Located). Our approach incorporates general domain knowledge which we encode as First Order Logic rules and automatically combine with a topic model developed specifically for the relation extraction task. Evaluation results on the ACE 2007 English Relation Detection and Categorization (RDC) task show that our model outperforms competitive unsupervised approaches by a wide margin and is able to produce clusters shaped by both the data and the rules.
6 0.07280831 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
7 0.072464004 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
8 0.071259573 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
9 0.06271106 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
10 0.061923612 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
11 0.061209302 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
12 0.060154073 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech
13 0.0594921 123 emnlp-2013-Learning to Rank Lexical Substitutions
14 0.057884041 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
15 0.055177793 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
16 0.05454509 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
17 0.052210473 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
18 0.051267777 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
19 0.04942501 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
20 0.049234968 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
topicId topicWeight
[(0, -0.18), (1, 0.015), (2, -0.048), (3, -0.021), (4, -0.016), (5, 0.031), (6, 0.01), (7, -0.0), (8, -0.063), (9, -0.118), (10, -0.0), (11, -0.108), (12, -0.059), (13, 0.068), (14, 0.062), (15, -0.004), (16, 0.023), (17, -0.026), (18, -0.028), (19, -0.03), (20, -0.038), (21, 0.047), (22, 0.067), (23, 0.005), (24, 0.011), (25, -0.133), (26, -0.047), (27, -0.036), (28, 0.023), (29, -0.005), (30, 0.004), (31, 0.026), (32, 0.005), (33, -0.026), (34, 0.041), (35, 0.039), (36, -0.031), (37, 0.085), (38, -0.018), (39, -0.008), (40, -0.123), (41, -0.081), (42, 0.068), (43, 0.219), (44, -0.044), (45, 0.007), (46, -0.091), (47, 0.01), (48, 0.063), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.93888021 138 emnlp-2013-Naive Bayes Word Sense Induction
Author: Do Kook Choe ; Eugene Charniak
Abstract: We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. The proposed model is very simple yet effective when evaluated on SemEval-2010 WSI data.
2 0.64910138 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
Author: Raphael Bailly ; Xavier Carreras ; Franco M. Luque ; Ariadna Quattoni
Abstract: We derive a spectral method for unsupervised learning of Weighted Context Free Grammars. We frame WCFG induction as finding a Hankel matrix that has low rank and is linearly constrained to represent a function computed by inside-outside recursions. The proposed algorithm picks the grammar that agrees with a sample and is the simplest with respect to the nuclear norm of the Hankel matrix.
3 0.56064874 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao
Abstract: One of the language phenomena that n-gram language model fails to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all ngrams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumption, we found it is effective to assign a topic to only some parts of a document.
Author: Valentin I. Spitkovsky ; Hiyan Alshawi ; Daniel Jurafsky
Abstract: Many statistical learning problems in NLP call for local model search methods. But accuracy tends to suffer with current techniques, which often explore either too narrowly or too broadly: hill-climbers can get stuck in local optima, whereas samplers may be inefficient. We propose to arrange individual local optimizers into organized networks. Our building blocks are operators of two types: (i) transform, which suggests new places to search, via non-random restarts from already-found local optima; and (ii) join, which merges candidate solutions to find better optima. Experiments on grammar induction show that pursuing different transforms (e.g., discarding parts of a learned model or ignoring portions of training data) results in improvements. Groups of locally-optimal solutions can be further perturbed jointly, by constructing mixtures. Using these tools, we designed several modular dependency grammar induction networks of increasing complexity. Our complete system achieves 48.6% accuracy (directed dependency macro-average over all 19 languages in the 2006/7 CoNLL data), more than 5% higher than the previous state-of-the-art.
5 0.49782097 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal mod- els to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
6 0.48897648 123 emnlp-2013-Learning to Rank Lexical Substitutions
7 0.46570155 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
8 0.46195063 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
9 0.4594183 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
10 0.45585027 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
11 0.44729742 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
12 0.44696417 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
13 0.4427498 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
14 0.43327624 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
15 0.4329766 24 emnlp-2013-Application of Localized Similarity for Web Documents
16 0.42665505 134 emnlp-2013-Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
17 0.4210192 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
18 0.40786684 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
19 0.40626788 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts
20 0.38551229 182 emnlp-2013-The Topology of Semantic Knowledge
topicId topicWeight
[(3, 0.032), (9, 0.032), (18, 0.038), (22, 0.054), (30, 0.066), (35, 0.291), (45, 0.016), (50, 0.015), (51, 0.209), (66, 0.055), (71, 0.02), (75, 0.037), (77, 0.021), (90, 0.01), (96, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.79437637 138 emnlp-2013-Naive Bayes Word Sense Induction
Author: Do Kook Choe ; Eugene Charniak
Abstract: We introduce an extended naive Bayes model for word sense induction (WSI) and apply it to a WSI task. The extended model incorporates the idea that words closer to the target word are more relevant in predicting its sense. The proposed model is very simple yet effective when evaluated on SemEval-2010 WSI data.
2 0.72678047 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
Author: Min Xiao ; Feipeng Zhao ; Yuhong Guo
Abstract: Domain adaptation has been popularly studied on exploiting labeled information from a source domain to learn a prediction model in a target domain. In this paper, we develop a novel representation learning approach to address domain adaptation for text classification with automatically induced discriminative latent features, which are generalizable across domains while informative to the prediction task. Specifically, we propose a hierarchical multinomial Naive Bayes model with latent variables to conduct supervised word clustering on labeled documents from both source and target domains, and then use the produced cluster distribution of each word as its latent feature representation for domain adaptation. We train this latent graphical model using a simple expectation-maximization (EM) algorithm. We empirically evaluate the proposed method with both cross-domain document categorization tasks on Reuters-21578 dataset and cross-domain sentiment classification tasks on Amazon product review dataset. The experimental results demonstrate that our proposed approach achieves superior performance compared with alternative methods.
3 0.6359477 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
Author: Jason Weston ; Antoine Bordes ; Oksana Yakhnenko ; Nicolas Usunier
Abstract: This paper proposes a novel approach for relation extraction from free text which is trained to jointly use information from the text and from existing knowledge. Our model is based on scoring functions that operate by learning low-dimensional embeddings of words, entities and relationships from a knowledge base. We empirically show on New York Times articles aligned with Freebase relations that our approach is able to efficiently use the extra information provided by a large subset of Freebase data (4M entities, 23k relationships) to improve over methods that rely on text features alone.
4 0.6290561 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Author: Kuzman Ganchev ; Dipanjan Das
Abstract: We present a framework for cross-lingual transfer of sequence information from a resource-rich source language to a resourceimpoverished target language that incorporates soft constraints via posterior regularization. To this end, we use automatically word aligned bitext between the source and target language pair, and learn a discriminative conditional random field model on the target side. Our posterior regularization constraints are derived from simple intuitions about the task at hand and from cross-lingual alignment information. We show improvements over strong baselines for two tasks: part-of-speech tagging and namedentity segmentation.
5 0.62818235 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
Author: Yangfeng Ji ; Jacob Eisenstein
Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.
6 0.62729615 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
7 0.62528706 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
8 0.62473065 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
9 0.62412024 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
10 0.62386936 152 emnlp-2013-Predicting the Presence of Discourse Connectives
11 0.62317473 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology
12 0.622935 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
13 0.62203461 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
14 0.62166685 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
15 0.62144411 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
16 0.62128067 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
17 0.62011349 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
18 0.61913341 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations
19 0.6191237 143 emnlp-2013-Open Domain Targeted Sentiment
20 0.61836398 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution