acl acl2013 acl2013-294 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
Reference: text
sentIndex sentText sentNum sentScore
1 Re-embedding Words Igor Labutov Cornell University [sent-1, score-0.041]
2 Abstract We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. [sent-2, score-0.143]
3 Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. [sent-3, score-0.833]
4 We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. [sent-5, score-0.465]
5 We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small. [sent-6, score-0.218]
6 1 Introduction Incorporating the vector representation of a word as a feature has recently been shown to benefit performance in several standard NLP tasks such as language modeling (Bengio et al. [sent-7, score-0.038]
7 , 2010), as well as in sentiment and subjectivity analysis tasks (Maas et al. [sent-10, score-0.146]
8 Real-valued word vectors mitigate sparsity by “smoothing” relevant semantic insight gained during the unsupervised training over the rare and unseen terms in the training data. [sent-12, score-0.191]
9 We might, for example, consider dramatic (term X) and pleasant (term Y) to correlate with a review of a good movie (task A), while finding them of opposite polarity in the context of another task (task B). [sent-16, score-0.226]
10 Consequently, good vectors for X and Y should yield an inner product close to 1 in the context of task A, and −1 in the context of task B. [sent-19, score-0.114]
11 Moreover, we may already have on our hands embeddings for X and Y obtained from yet another (possibly unsupervised) task (C), in which X and Y are, for example, orthogonal. [sent-20, score-0.735]
12 If the embeddings for task C happen to be learned from a much larger dataset, it would make sense to reuse task C embeddings, but adapt them for task A and/or task B. [sent-21, score-0.849]
13 We will refer to task C and its embeddings as the source task and the source embeddings, and task A/B, and its embeddings as the target task and the target embeddings. [sent-22, score-1.864]
14 Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach. [sent-23, score-0.97]
15 But while joint training is highly effective, a downside is that a large amount of data (and processing time) is required a-priori. [sent-25, score-0.035]
16 In the case of deep neural embeddings, for example, training time can number in days. [sent-26, score-0.177]
17 On the other hand, learned embeddings are becoming more abundant, as much research and computing effort is being invested in learning word representations using large-scale deep architectures trained on web-scale corpora. [sent-27, score-0.886]
18 Many of said embeddings are published and can be harnessed in their raw form as additional features in a number of supervised tasks (Turian et al. [sent-28, score-0.79]
19 It would, thus, be advantageous to learn a task-specific embedding directly from another (source) embedding. [sent-30, score-0.408]
20 In this paper we propose a fast method for reembedding words from a source embedding S to a target embedding T by performing unconstrained optimization of a convex objective. [sent-31, score-1.077]
21 Our objective is a linear combination of the dataset's log-likelihood under the target embedding [sent-32, score-0.05]
22 and the Frobenius norm of the distortion matrix, a matrix of component-wise differences between the target and the source embeddings. [sent-34, score-0.722]
23 The latter acts as a regularizer that penalizes the Euclidean distance between the source and target embeddings. [sent-35, score-0.242]
24 The method is much faster than joint training and yields competitive results with several baselines. [sent-36, score-0.035]
25 Our work is most similar to that of Maas et al. (2011), where word vectors are learned specifically for sentiment classification. [sent-38, score-0.233]
26 Embeddings are learned in a semi-supervised fashion, and the components of the embedding are given an explicit probabilistic interpretation. [sent-39, score-0.409]
27 Their method produces state-of-the-art results; however, the optimization is non-convex and takes approximately 10 hours on 10 machines. [sent-40, score-0.052]
28 Naturally, our method is significantly faster because it operates in the space of an existing embedding, and does not require a large amount of training data a-priori. [sent-41, score-0.035]
29 Collobert and Weston (2008), in their seminal paper on deep architectures for NLP, propose a multilayer neural network for learning word embeddings. [sent-42, score-0.221]
30 While the obtained embeddings can be “fine-tuned” using backpropagation for a supervised task, like all multilayer neural network training, optimization is non-convex and sensitive to the dimensionality of the hidden layers. [sent-44, score-0.961]
31 In the machine learning literature, joint semi-supervised embedding takes the form of methods such as the Laplacian SVM (LapSVM) (Belkin et al. [sent-45, score-0.445]
32 These methods combine a discriminative learner with a non-linear manifold learning technique in a joint objective, and apply it to a combined set of labeled and unlabeled examples to improve performance in a supervised task. [sent-47, score-0.154]
33 Our method is different in that the (potentially) massive amount of unlabeled data is not required a-priori, but only the resultant embedding. [sent-50, score-0.101]
34 , 2011) 3 Approach Let ΦS, ΦT ∈ R^{|V|×K} be the source and target embedding matrices respectively, where K is the dimension of the word vector space, identical in the source and target embeddings, and V is the set of embedded words, given by VS ∩ VT. [sent-53, score-0.786]
35 Following this notation, φi – the ith row in Φ – is the respective vector representation of word wi ∈ V. [sent-54, score-0.088]
36 In what follows, we first introduce our supervised objective, then combine it with the proposed regularizer and learn the target embedding ΦT by optimizing the resulting joint convex objective. [sent-55, score-0.679]
37 3.1 Supervised model We model each document dj ∈ D (a movie review, for example) as a collection of words wij (i. [sent-57, score-0.142]
38 We assign a sentiment label sj ∈ {0, 1} to each document (converting the star rating to a binary label), and seek to optimize the conditional likelihood of the labels (sj)j∈{1,...,|D|}: [sent-60, score-0.322]
39 p(s1, ..., s|D| | D; ΦT) = ∏_{dj∈D} ∏_{wi∈dj} p(sj | wi; ΦT), where p(sj = 1 | wi; ΦT) is the probability of assigning a positive label to document j, given that wi ∈ dj. [sent-66, score-0.119]
40 , 2011), we use logistic regression to model the conditional likelihood: p(sj = 1 | wi; ΦT) = 1 / (1 + exp(−ψ⊤φi)), where ψ ∈ R^{K+1} is a regression parameter vector with an included bias component. [sent-68, score-0.094]
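A minimal sketch of this per-word likelihood; the function and variable names are mine, and appending a constant 1 to the word vector for the bias is an assumption consistent with ψ ∈ R^{K+1}, not a detail spelled out in the paper:

```python
import numpy as np

def p_positive(phi_i, psi):
    """Probability of a positive label given that word i occurs in the document:
    a logistic function of the word's target embedding.
    phi_i : (K,) word vector; a constant 1 is appended here for the bias.
    psi   : (K+1,) logistic-regression parameters, bias included."""
    x = np.append(phi_i, 1.0)              # add bias component
    return 1.0 / (1.0 + np.exp(-psi.dot(x)))
```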
41 Classical regularization will mitigate this effect, but can be improved further by introducing an external embedding in the regularizer. [sent-70, score-0.535]
42 In what follows, we describe re-embedding regularization— employing existing (source) embeddings to bias word vector learning. [sent-71, score-0.744]
43 3.2 Re-embedding regularization To leverage rich semantic word representations, we employ an external source embedding and incorporate it in the regularizer on the supervised objective. [sent-73, score-0.748]
44 We use the Euclidean distance between the source and the target embeddings as the regularization loss. [sent-74, score-0.874]
45 Combined with the supervised objective, the resulting log-likelihood becomes: argmax_{ψ, ΦT} Σ_{dj∈D} Σ_{wi∈dj} log p(sj | wi; ΦT) − λ‖∆Φ‖²_F (Equation 1), where ∆Φ = ΦT − ΦS, ‖·‖_F is the Frobenius norm, and λ is a trade-off parameter. [sent-75, score-0.058]
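For readability, one plausible LaTeX rendering of Equation 1, reconstructed from the surrounding text (the paper's exact typesetting may differ):

```latex
\hat{\psi},\,\hat{\Phi}_T
  = \operatorname*{arg\,max}_{\psi,\,\Phi_T}
    \sum_{d_j \in D} \sum_{w_i \in d_j} \log p(s_j \mid w_i; \Phi_T)
    \;-\; \lambda \,\lVert \Delta\Phi \rVert_F^2,
\qquad \Delta\Phi = \Phi_T - \Phi_S
```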
46 There are almost no restrictions on ΦS, except that it must match the desired target vector space dimension K. [sent-76, score-0.147]
47 The objective is convex in ψ and ΦT, thus, yielding a unique target re-embedding. [sent-77, score-0.189]
48 We employ the L-BFGS algorithm (Liu and Nocedal, 1989) to find the optimal target embedding. [sent-78, score-0.107]
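A rough sketch of maximizing Equation 1 with an off-the-shelf L-BFGS routine, assuming dense NumPy matrices; the parameter layout, helper names, and reliance on finite-difference gradients are choices of this sketch, not details from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def negative_objective(theta, phi_S, docs, labels, lam):
    """Negated Equation 1: log-likelihood of document labels under the target
    embedding minus the Frobenius-norm penalty on Phi_T - Phi_S.
    docs   : list of lists of word indices (one list per document).
    labels : binary label per document.
    theta  : psi first, then the flattened Phi_T (a layout chosen for this sketch)."""
    V, K = phi_S.shape
    psi, phi_T = theta[:K + 1], theta[K + 1:].reshape(V, K)
    loglik = 0.0
    for words, s in zip(docs, labels):
        z = np.c_[phi_T[words], np.ones(len(words))].dot(psi)   # psi^T phi_i (with bias)
        p = 1.0 / (1.0 + np.exp(-z))
        loglik += np.sum(np.log(p if s == 1 else 1.0 - p))
    penalty = lam * np.sum((phi_T - phi_S) ** 2)                 # lambda * ||dPhi||_F^2
    return -(loglik - penalty)

def reembed(phi_S, docs, labels, lam=1.0):
    """Maximize Equation 1 with L-BFGS, initializing Phi_T at Phi_S."""
    V, K = phi_S.shape
    theta0 = np.concatenate([np.zeros(K + 1), phi_S.ravel()])
    res = minimize(negative_objective, theta0,
                   args=(phi_S, docs, labels, lam), method="L-BFGS-B")
    return res.x[K + 1:].reshape(V, K)                           # re-embedded Phi_T
```

In practice an analytic gradient (straightforward for a logistic loss plus a quadratic penalty) would replace the finite differences used here, which are far too slow at this parameter count.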
49 3.3 Classification with word vectors To classify documents, re-embedded word vectors can now be used to construct a document-level feature vector for a supervised learning algorithm of choice. [sent-80, score-0.266]
50 Perhaps the most direct approach is to compute a weighted linear combination of the embeddings for words that appear in the document to be classified, as done in (Maas et al. [sent-81, score-0.75]
51 We use the document’s binary bag-of-words vector vj, and compute the document’s vector space representation through the matrix-vector product ΦTvj. [sent-83, score-0.076]
52 The resulting (K + 1)-dimensional vector is then cosine-normalized and used as a feature vector to represent the document dj. [sent-84, score-0.12]
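A minimal sketch of this document-level feature construction, assuming a dense (num_docs × V) binary bag-of-words matrix; the placement of the bias component is an assumption of the sketch:

```python
import numpy as np

def document_features(phi_T, bow_vectors):
    """Build document features from re-embedded word vectors.
    phi_T       : (V, K) target embedding matrix.
    bow_vectors : (num_docs, V) binary bag-of-words matrix.
    Each document is the sum of the vectors of its words (the matrix-vector
    product described above), with a bias term appended, then cosine-normalized."""
    feats = bow_vectors @ phi_T                                  # (num_docs, K)
    feats = np.hstack([feats, np.ones((feats.shape[0], 1))])     # (K + 1)-dimensional
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / norms                                         # cosine-normalized
```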
53 4 Experiments Data: For our experiments, we employ a large, recently introduced IMDB movie review dataset (Maas et al. [sent-85, score-0.161]
54 , 2011), in place of the smaller dataset introduced in (Pang and Lee, 2004) more commonly used for sentiment analysis. [sent-86, score-0.153]
55 The dataset (50,000 reviews) is split evenly between training and testing sets, each containing a balanced set of highly polar (≥ 7 or ≤ 4 stars out of 10) reviews. [sent-87, score-0.067]
56 Source embeddings: We employ three external embeddings (Turian et al. [sent-88, score-0.054]
57 , 2010) induced using the following models: 1) hierarchical log-bilinear model (HLBL) (Mnih and Hinton, 2009) and two neural network-based models, 2) Collobert and Weston's (C&W) deep-learning architecture, and 3) Huang et al. [sent-90, score-0.124]
58 C&W and HLBL were induced using a 37M-word newswire text (Reuters Corpus 1). [sent-93, score-0.037]
59 We also induce a Latent Semantic Analysis (LSA) based embedding from a subset of the English Project Gutenberg collection of approximately 100M words. [sent-94, score-0.382]
60 No pre-processing (stemming or stopword removal) beyond case-normalization is performed in either the external or the LSA-based embeddings. [sent-95, score-0.06]
61 In total, we obtain seven source embeddings: HLBL-50, HLBL-200, C&W-50, C&W-200, HUANG-50, LSA-50, LSA-200. [sent-97, score-0.089]
62 Baselines: We generate two baseline embeddings NULL and RANDOM. [sent-98, score-0.706]
63 NULL is a set of zero vectors, and RANDOM is a set of uniformly distributed random vectors with a unit L2-norm. [sent-99, score-0.085]
64 NULL and RANDOM are treated as source vectors and re-embedded in the same way. [sent-100, score-0.174]
65 The NULL baseline is equivalent to regularizing on the target embedding without the source embedding. [sent-101, score-0.55]
66 As additional baselines, we use each of the 7 source embeddings directly as a target without re-embedding. [sent-102, score-0.874]
67 Training: For each source embedding matrix ΦS, we compute the optimal target embedding matrix ΦT by maximizing Equation 1 using the L-BFGS algorithm. [sent-103, score-1.02]
68 20 % of the training set (5,000 documents) is withheld for parameter (λ) tuning. [sent-104, score-0.035]
69 We use the LIBLINEAR (Fan et al., 2008) logistic regression module to classify document-level embeddings (computed from the ΦTvj matrix-vector product). [sent-106, score-0.734]
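The paper calls the LIBLINEAR package directly; a minimal stand-in using scikit-learn's liblinear-backed logistic regression (my substitution, with an untuned placeholder C rather than a value from the paper) might look like:

```python
from sklearn.linear_model import LogisticRegression

def classify_documents(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Train a liblinear-backed logistic regression on document-level embedding
    features (e.g. from document_features above) and return test accuracy.
    C=1.0 is a placeholder hyperparameter, not a reported setting."""
    clf = LogisticRegression(solver="liblinear", C=C)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```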
70 Training (re-embedding and document classification) on 20,000 documents and a 16,000-word vocabulary takes approximately 5 seconds on a 3. [sent-107, score-0.069]
71 5 Results and Discussion The main observation from the results is that our method improves performance for smaller training sets (≤ 5000 examples). [sent-109, score-0.035]
72 The reason for the performance boost is as expected: classical regularization of the supervised objective reduces overfitting. [sent-110, score-0.191]
73 (0.1 corresponds to 20 correctly classified reviews); the gain is larger for word vectors that incorporate the source embedding in the regularizer than for those that do not (NULL) and those that are based on the random source embedding (RANDOM). [sent-112, score-1.027]
74 We hypothesize that the external embeddings, generated from a significantly larger dataset, help “smooth” the word vectors learned from a small labeled dataset alone. [sent-113, score-0.151]
75 Further observations include: [sent-114, score-0.035]
76 Table 1: Classification accuracy for the sentiment task (IMDB movie review dataset (Maas et al. [sent-172, score-0.283]
77 Subtable A compares performance of the re-embedded vocabulary, induced from a given source embedding. [sent-174, score-0.126]
78 Subtable B contains a set of baselines: X-w/o re-embedding indicates using a source embedding X directly without re-embedding. [sent-175, score-0.471]
79 (Table 2 cell contents: hate, pressured, unanswered; high-quality, obsession, hate.) [sent-188, score-0.039]
80 Table 2: A representative set of words from the 20 closest-ranked (cosine-distance) words to (boring, bad, depressing, brilliant) extracted from the source and target (C&W-200) embeddings. [sent-194, score-0.168]
81 Source embeddings give higher rank to words that are related, but not necessarily indicative of sentiment, e. [sent-195, score-0.706]
82 Training set size: We note that with a sufficient number of training instances for each word in the test set, additional knowledge from an external embedding does little to improve performance. [sent-199, score-0.477]
83 Source embeddings: We find C&W embeddings to perform best for the task of sentiment classification. [sent-200, score-0.856]
84 These embeddings were found to perform well in other NLP tasks as well (Turian et al. [sent-201, score-0.706]
85 Embedding dimensionality: We observe that for the HLBL, C&W, and LSA source embeddings (for all training set sizes), 200 dimensions outperform 50. [sent-203, score-0.83]
86 , 2010), re-embedding words may benefit from a larger initial dimension of the word vector space. [sent-205, score-0.068]
87 6 Future Work While “semantic smoothing” obtained from introducing an external embedding helps to improve performance in the sentiment classification task, the method does not help to re-embed words that do not appear in the training set to begin with. [sent-208, score-0.631]
88 The objective for this optimization problem can be posed by requiring that the distance between every pair of words in the source and target embeddings is preserved as much as possible, i. [sent-210, score-0.951]
89 min (φ̂i⊤φ̂j − φi⊤φj)² ∀i, j (where, with some abuse of notation, φ and φ̂ are the source and target embeddings respectively). [sent-212, score-0.874]
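A plausible LaTeX rendering of this distance-preserving objective (the exact formulation in the paper may differ):

```latex
\min_{\hat{\Phi}} \;
  \bigl( \hat{\phi}_i^{\top} \hat{\phi}_j - \phi_i^{\top} \phi_j \bigr)^2
  \quad \forall\, i, j
```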
90 However, this objective is no longer convex in the embeddings. [sent-213, score-0.11]
91 Global reembedding constitutes our ongoing work and may pose an interesting challenge to the community. [sent-214, score-0.058]
92 7 Conclusion We presented a novel approach to adapting existing word vectors for improving performance in a text classification task. [sent-215, score-0.118]
93 While we have shown promising results in a single task, we believe that the method is general enough to be applied to a range of supervised tasks and source embeddings. [sent-216, score-0.147]
94 As the sophistication of unsupervised methods grows, scaling to ever more massive datasets, so will the representational power and coverage of induced word vectors. [sent-217, score-0.11]
95 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-224, score-0.054]
96 A unified architecture for natural language processing: deep neural networks with multitask learning. [sent-237, score-0.18]
97 Improving word representations via global context and multiple word prototypes. [sent-250, score-0.038]
98 Advances in neural information processing systems, 21:1081– 1088. [sent-265, score-0.087]
99 Learning continuous phrase representations and syntactic parsing with recursive neural networks. [sent-273, score-0.125]
100 Learning from labeled and unlabeled data with label propagation. [sent-292, score-0.079]
wordName wordTfidf (topN-words)
[('embeddings', 0.706), ('embedding', 0.382), ('maas', 0.149), ('collobert', 0.133), ('sentiment', 0.121), ('hlbl', 0.116), ('weston', 0.11), ('source', 0.089), ('turian', 0.089), ('brilliant', 0.087), ('mnih', 0.087), ('neural', 0.087), ('vectors', 0.085), ('sj', 0.08), ('target', 0.079), ('regularizer', 0.074), ('null', 0.071), ('movie', 0.063), ('ronan', 0.061), ('convex', 0.06), ('external', 0.06), ('depressing', 0.058), ('reembedding', 0.058), ('supervised', 0.058), ('regularization', 0.057), ('deep', 0.055), ('unlabeled', 0.054), ('subtable', 0.051), ('blacoe', 0.051), ('wi', 0.05), ('objective', 0.05), ('socher', 0.048), ('hod', 0.047), ('massive', 0.047), ('multilayer', 0.045), ('pleasant', 0.045), ('belkin', 0.045), ('document', 0.044), ('manifold', 0.042), ('frobenius', 0.042), ('lsa', 0.041), ('corne', 0.041), ('boring', 0.041), ('yessenalina', 0.041), ('imdb', 0.039), ('hate', 0.039), ('representations', 0.038), ('dimensionality', 0.038), ('cornell', 0.038), ('multitask', 0.038), ('review', 0.038), ('vector', 0.038), ('semisupervised', 0.038), ('huang', 0.037), ('induced', 0.037), ('baselines', 0.036), ('mitigate', 0.036), ('yoshua', 0.036), ('training', 0.035), ('hinton', 0.035), ('dj', 0.035), ('architectures', 0.034), ('bengio', 0.034), ('classification', 0.033), ('dramatic', 0.033), ('liblinear', 0.032), ('dataset', 0.032), ('matrix', 0.031), ('norm', 0.031), ('dimension', 0.03), ('euclidean', 0.03), ('task', 0.029), ('regression', 0.028), ('employ', 0.028), ('fan', 0.028), ('jason', 0.027), ('learned', 0.027), ('optimization', 0.027), ('learn', 0.026), ('andrew', 0.026), ('eisx', 0.026), ('ejean', 0.026), ('tahre', 0.026), ('harnessed', 0.026), ('madonna', 0.026), ('pressured', 0.026), ('manedn', 0.026), ('invested', 0.026), ('tvj', 0.026), ('hossein', 0.026), ('othf', 0.026), ('sophistication', 0.026), ('masterpiece', 0.026), ('sponsoring', 0.026), ('maximizing', 0.026), ('takes', 0.025), ('subjectivity', 0.025), ('pang', 0.025), ('label', 0.025), ('nsf', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
2 0.37220952 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
3 0.2174392 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
4 0.17159399 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
Author: Karl Moritz Hermann ; Phil Blunsom
Abstract: Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is a fundamental task of Natural Language Processing. In this paper we draw upon recent advances in the learning of vector space representations of sentential semantics and the transparent interface between syntax and semantics provided by Combinatory Categorial Grammar to introduce Combinatory Categorial Autoencoders. This model leverages the CCG combinatory operators to guide a non-linear transformation of meaning within a sentence. We use this model to learn high dimensional embeddings for sentences and evaluate them in a range of tasks, demonstrating that the incorporation of syntax allows a concise model to learn representations that are both effective and general.
5 0.11948813 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words
Author: Hongliang Yu ; Zhi-Hong Deng ; Shiyingxue Li
Abstract: Sentiment Word Identification (SWI) is a basic technique in many sentiment analysis applications. Most existing researches exploit seed words, and lead to low robustness. In this paper, we propose a novel optimization-based model for SWI. Unlike previous approaches, our model exploits the sentiment labels of documents instead of seed words. Several experiments on real datasets show that WEED is effective and outperforms the state-of-the-art methods with seed words.
6 0.11765342 318 acl-2013-Sentiment Relevance
7 0.11586268 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
8 0.10184218 275 acl-2013-Parsing with Compositional Vector Grammars
9 0.087221183 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
10 0.080657132 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction
11 0.07901299 219 acl-2013-Learning Entity Representation for Entity Disambiguation
12 0.077818207 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
13 0.073438197 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
14 0.071702801 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
15 0.06811057 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
16 0.067114919 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis
17 0.066914812 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions
18 0.065870687 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
19 0.061585836 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
20 0.061163325 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information
topicId topicWeight
[(0, 0.166), (1, 0.062), (2, 0.027), (3, 0.086), (4, -0.041), (5, -0.08), (6, 0.021), (7, 0.008), (8, -0.065), (9, 0.112), (10, 0.083), (11, -0.109), (12, 0.121), (13, -0.165), (14, -0.022), (15, 0.149), (16, -0.098), (17, 0.004), (18, -0.024), (19, -0.147), (20, 0.061), (21, -0.057), (22, -0.162), (23, -0.016), (24, 0.025), (25, -0.093), (26, 0.109), (27, -0.053), (28, 0.044), (29, 0.036), (30, -0.18), (31, 0.027), (32, -0.001), (33, -0.038), (34, 0.038), (35, -0.014), (36, -0.029), (37, -0.126), (38, 0.042), (39, -0.012), (40, 0.02), (41, -0.027), (42, 0.079), (43, 0.062), (44, 0.005), (45, -0.023), (46, 0.053), (47, -0.023), (48, -0.101), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.89596075 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
2 0.73069584 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
3 0.64362341 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada
Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.
4 0.60264373 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Author: Tiberiu Boros ; Radu Ion ; Dan Tufis
Abstract: Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, Tiered Tagging (Tufis, 1999), exploits a reduced set of tags derived by removing several recoverable features from the lexicon morpho-syntactic descriptions. A second phase is aimed at recovering the full set of morpho-syntactic features. In this paper we present an alternative method to Tiered Tagging, based on local optimizations with Neural Networks, and we show how, by properly encoding the input sequence in a general Neural Network architecture, we achieve results similar to the Tiered Tagging methodology, significantly faster and without requiring extensive linguistic knowledge as implied by the previously mentioned method.
5 0.58954841 275 acl-2013-Parsing with Compositional Vector Grammars
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.
6 0.58478433 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
7 0.54130089 219 acl-2013-Learning Entity Representation for Entity Disambiguation
9 0.53040045 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
10 0.50743914 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
11 0.49749395 349 acl-2013-The mathematics of language learning
12 0.41779551 318 acl-2013-Sentiment Relevance
13 0.40528527 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words
14 0.38920361 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
15 0.38777971 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction
16 0.38485652 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting
17 0.37945196 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays
18 0.3781181 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions
19 0.37372556 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
20 0.35936183 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
topicId topicWeight
[(0, 0.04), (2, 0.015), (4, 0.219), (5, 0.011), (6, 0.027), (11, 0.04), (15, 0.026), (24, 0.054), (26, 0.057), (35, 0.062), (42, 0.059), (48, 0.077), (63, 0.018), (67, 0.037), (70, 0.039), (88, 0.04), (90, 0.026), (95, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.80396616 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
2 0.77903384 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur ; Gokhan Tur ; Ruhi Sarikaya
Abstract: Finding concepts in natural language utterances is a challenging task, especially given the scarcity of labeled data for learning semantic ambiguity. Furthermore, data mismatch issues, which arise when the expected test (target) data does not exactly match the training data, aggravate this scarcity problem. To deal with these issues, we describe an efficient semi-supervised learning (SSL) approach which has two components: (i) Markov Topic Regression is a new probabilistic model to cluster words into semantic tags (concepts). It can efficiently handle semantic ambiguity by extending standard topic models with two new features. First, it encodes word n-gram features from labeled source and unlabeled target data. Second, by going beyond a bag-of-words approach, it takes into account the inherent sequential nature of utterances to learn semantic classes based on context. (ii) Retrospective Learner is a new learning technique that adapts to the unlabeled target data. Our new SSL approach improves semantic tagging performance by 3% absolute over the baseline models, and also compares favorably on semi-supervised syntactic tagging.
3 0.77764195 273 acl-2013-Paraphrasing Adaptation for Web Search Ranking
Author: Chenguang Wang ; Nan Duan ; Ming Zhou ; Ming Zhang
Abstract: Mismatch between queries and documents is a key issue for the web search task. In order to narrow down such mismatch, in this paper, we present an in-depth investigation on adapting a paraphrasing technique to web search from three aspects: a search-oriented paraphrasing model; an NDCG-based parameter optimization algorithm; an enhanced ranking model leveraging augmented features computed on paraphrases of original queries. Experiments performed on the large-scale query-document data set show that the search performance can be significantly improved, with +3.28% and +1.14% NDCG gains on dev and test sets respectively.
4 0.77615952 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Author: Xiaodong Zeng ; Derek F. Wong ; Lidia S. Chao ; Isabel Trancoso
Abstract: This paper introduces a graph-based semisupervised joint model of Chinese word segmentation and part-of-speech tagging. The proposed approach is based on a graph-based label propagation technique. One constructs a nearest-neighbor similarity graph over all trigrams of labeled and unlabeled data for propagating syntactic information, i.e., label distributions. The derived label distributions are regarded as virtual evidences to regularize the learning of linear conditional random fields (CRFs) on unlabeled data. An inductive character-based joint model is obtained eventually. Empirical results on Chinese tree bank (CTB-7) and Microsoft Research corpora (MSR) reveal that the proposed model can yield better results than the supervised baselines and other competitive semi-supervised CRFs in this task.
5 0.71738935 121 acl-2013-Discovering User Interactions in Ideological Discussions
Author: Arjun Mukherjee ; Bing Liu
Abstract: Online discussion forums are a popular platform for people to voice their opinions on any subject matter and to discuss or debate any issue of interest. In forums where users discuss social, political, or religious issues, there are often heated debates among users or participants. Existing research has studied mining of user stances or camps on certain issues, opposing perspectives, and contention points. In this paper, we focus on identifying the nature of interactions among user pairs. The central questions are: How does each pair of users interact with each other? Does the pair of users mostly agree or disagree? What is the lexicon that people often use to express agreement and disagreement? We present a topic model based approach to answer these questions. Since agreement and disagreement expressions are usually multiword phrases, we propose to employ a ranking method to identify highly relevant phrases prior to topic modeling. After modeling, we use the modeling results to classify the nature of interaction of each user pair. Our evaluation results using real-life discussion/debate posts demonstrate the effectiveness of the proposed techniques.
6 0.69583833 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
7 0.69036347 287 acl-2013-Public Dialogue: Analysis of Tolerance in Online Discussions
8 0.69034219 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
9 0.61071175 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
10 0.6062271 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction
11 0.60409033 275 acl-2013-Parsing with Compositional Vector Grammars
12 0.60040307 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words
13 0.59700829 252 acl-2013-Multigraph Clustering for Unsupervised Coreference Resolution
14 0.59524685 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
15 0.59308475 62 acl-2013-Automatic Term Ambiguity Detection
16 0.59304714 318 acl-2013-Sentiment Relevance
17 0.59233439 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
18 0.59220076 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
19 0.59034681 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
20 0.58982629 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis