nips nips2013 nips2013-172 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
Reference: text
sentIndex sentText sentNum sentScore
1 Learning word embeddings efficiently with noise-contrastive estimation Koray Kavukcuoglu DeepMind Technologies koray@deepmind.com [sent-1, score-0.704]
2 Abstract Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. [sent-3, score-1.743]
3 The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. [sent-4, score-0.272]
4 We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. [sent-5, score-0.872]
5 We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. [sent-8, score-0.365]
6 1 Introduction Natural language processing and information retrieval systems can often benefit from incorporating accurate word similarity information. [sent-9, score-0.753]
7 Learning word representations from large collections of unstructured text is an effective way of capturing such information. [sent-10, score-0.621]
8 The classic approach to this task is to use the word space model, representing each word with a vector of co-occurrence counts with other words [16]. [sent-11, score-1.109]
9 Representations of this type suffer from data sparsity problems due to the extreme dimensionality of the word count vectors. [sent-12, score-0.51]
10 To address this, Latent Semantic Analysis performs dimensionality reduction on such vectors, producing lower-dimensional real-valued word embeddings. [sent-13, score-0.51]
11 Better real-valued representations, however, are learned by neural language models which are trained to predict the next word in the sentence given the preceding words. [sent-14, score-1.102]
12 Unfortunately, few neural language models scale well to large datasets and vocabularies, due to their use of hidden layers and the cost of computing normalized probabilities. [sent-16, score-0.325]
13 Recently, a scalable method for learning word embeddings using light-weight tree-structured neural language models was proposed in [10]. [sent-17, score-1.016]
14 Although tree-structured models can be trained quickly, they are considerably more complex than the traditional (flat) models and their performance is sensitive to the choice of the tree over words [13]. [sent-18, score-0.373]
15 We compound the speedup obtained by using NCE to eliminate normalization costs during training by also using very simple variants of the log-bilinear model [14], resulting in a parameter update complexity that is linear in the word embedding dimensionality. [sent-20, score-0.544]
16 We evaluate our approach on two analogy-based word similarity tasks [11, 10] and show that, despite the considerably shorter training times, our models outperform the Skip-gram model from [10] trained on the same dataset. [sent-21, score-0.838]
17 Furthermore, we can obtain performance comparable to that of the huge Skip-gram and CBOW models trained on a 125-CPU-core cluster after training for only four days on a single core using four times less training data. [sent-23, score-0.413]
18 Finally, we explore several model architectures and discover that the simplest architectures learn embeddings that are at least as good as those learned by the more complex ones. [sent-24, score-0.236]
19 2 Neural probabilistic language models Neural probabilistic language models (NPLMs) specify the distribution for the target word w, given a sequence of words h, called the context. [sent-25, score-1.188]
20 In statistical language modelling, w is typically the next word in the sentence, while the context h is the sequence of words that precede w. [sent-26, score-0.875]
21 Though some models such as recurrent neural language models [9] can handle arbitrarily long contexts, in this paper, we will restrict our attention to fixed-length contexts. [sent-27, score-0.332]
22 Since we are interested in learning word representations as opposed to assigning probabilities to sentences, we do not need to restrict our models to predicting the next word, and can, for example, predict w from the words surrounding it as was done in [4]. [sent-28, score-0.837]
23 Given a context h, an NPLM defines the distribution for the word to be predicted using the scoring function sθ (w, h) that quantifies the compatibility between the context and the candidate target word. [sent-29, score-0.776]
24 Here θ are model parameters, which include the word embeddings. [sent-30, score-0.51]
25 One approach to making NPLM training scalable involves using a tree-structured vocabulary with words at the leaves, resulting in training time logarithmic in the vocabulary size [15]. [sent-34, score-0.319]
26 3 Scalable log-bilinear models We are interested in highly scalable models that can be trained on billion-word datasets with vocabularies of hundreds of thousands of words within a few days on a single core, which rules out most traditional neural language models such as those from [1] and [4]. [sent-42, score-0.721]
27 We will use the log-bilinear language model (LBL) [12] as our starting point, which, unlike traditional NPLMs, does not have a hidden layer and works by performing linear prediction in the word feature vector space. [sent-43, score-0.732]
28 This model, like all other models we will describe, has two sets of word representations: one for the target words (i.e. the words being predicted) and one for the context words. [sent-45, score-0.699]
29 We denote the target and the context representations for word w with $q_w$ and $r_w$ respectively. [sent-48, score-0.822]
30 Given a context h = w1 , . . . , wn , the model computes the predicted representation for the target word by taking a linear combination of the context word feature vectors: $\hat{q}(h) = \sum_{i=1}^{n} c_i \odot r_{w_i}$ (2), where $c_i$ is the weight vector for the context word in position i and $\odot$ denotes element-wise multiplication. [sent-51, score-1.823]
31 The context can consist of words preceding, following, or surrounding the word being predicted. [sent-52, score-0.721]
32 The scoring function then computes the similarity between the predicted feature vector and one for word w: $s_\theta(w, h) = \hat{q}(h)^\top q_w + b_w$ (3), where $b_w$ is a bias that captures the context-independent frequency of word w. [sent-53, score-1.224]
33 vLBL can be made even simpler by eliminating the position-dependent weights and computing the predicted feature vector simply by averaging the context word feature vectors: $\hat{q}(h) = \frac{1}{n}\sum_{i=1}^{n} r_{w_i}$. [sent-55, score-0.705]
34 The idea of simply averaging context word feature vectors was introduced in [8], where it was used to condition on large contexts such as entire documents. [sent-57, score-0.626]
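To make the preceding definitions concrete, here is a minimal NumPy sketch of the vLBL prediction and scoring steps (Eqs. 2 and 3), covering both the position-dependent and the averaged variant. The array names, sizes, and initialization are illustrative assumptions, not the authors' code.

```python
import numpy as np

V, D, n = 10000, 100, 5                   # assumed vocabulary size, embedding dim, context length
rng = np.random.default_rng(0)
R = rng.normal(scale=0.1, size=(V, D))    # context-word embeddings r_w
Q = rng.normal(scale=0.1, size=(V, D))    # target-word embeddings q_w
C = rng.normal(scale=0.1, size=(n, D))    # position-dependent weight vectors c_i
b = np.zeros(V)                           # per-word biases b_w

def predicted_representation(context_ids, position_weights=True):
    """Eq. 2: q_hat(h) = sum_i c_i * r_{w_i} (element-wise), or the plain average (position-independent vLBL)."""
    r = R[context_ids]                    # (n, D)
    return (C * r).sum(axis=0) if position_weights else r.mean(axis=0)

def vlbl_score(word_id, context_ids):
    """Eq. 3: s_theta(w, h) = q_hat(h) . q_w + b_w."""
    return predicted_representation(context_ids) @ Q[word_id] + b[word_id]

print(vlbl_score(42, [11, 7, 3, 250, 8]))
```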
35 As our primary concern is learning word representations as opposed to creating useful language models, we are free to move away from the paradigm of predicting the target word from its context and, for example, do the reverse. [sent-59, score-1.452]
36 This approach is motivated by the distributional hypothesis, which states that words with similar meanings often occur in the same contexts [7] and thus suggests looking for word representations that capture their context distributions. [sent-60, score-0.852]
37 The inverse language modelling approach of learning to predict the context from the word is a natural way to do that. [sent-61, score-0.878]
38 Some classic word-space models such as HAL and COALS [16] follow this approach by representing the context distribution using a bag-of-words but they do not learn embeddings from this information. [sent-62, score-0.333]
39 Unfortunately, predicting an n-word context requires modelling the joint distribution of n words, which is considerably harder than modelling the distribution of a single word. [sent-63, score-0.233]
40 We make the task tractable by assuming that the words in different context positions are conditionally independent given the current word w: $P^w_\theta(h) = \prod_{i=1}^{n} P^w_{i,\theta}(w_i)$. [sent-64, score-0.706]
41 The context word distributions $P^w_{i,\theta}(w_i)$ are simply vLBL models that condition on the current word and are defined by the scoring function $s_{i,\theta}(w_i, w) = (c_i \odot r_w)^\top q_{w_i} + b_{w_i}$ (5). [sent-66, score-1.324]
42 The resulting model can be seen as a Naive Bayes classifier parameterized in terms of word embeddings. [sent-67, score-0.51]
43 As with our traditional language model, we also consider the simpler version of this model without position-dependent weights, defined by the scoring function $s_{i,\theta}(w_i, w) = r_w^\top q_{w_i} + b_{w_i}$. [sent-69, score-0.396]
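Analogously, the following is a minimal sketch of the ivLBL scoring functions (with and without position-dependent weights); shapes and names are again illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

V, D, n = 10000, 100, 10                  # assumed sizes; n context positions around the current word
rng = np.random.default_rng(0)
R = rng.normal(scale=0.1, size=(V, D))    # embeddings r_w of the conditioning (current) word
Q = rng.normal(scale=0.1, size=(V, D))    # embeddings q_w of the predicted context words
C = rng.normal(scale=0.1, size=(n, D))    # position-dependent weight vectors c_i
b = np.zeros(V)                           # per-word biases b_{w_i}

def ivlbl_score(context_word_id, position, current_word_id, position_weights=True):
    """s_{i,theta}(w_i, w) = (c_i * r_w) . q_{w_i} + b_{w_i}, or r_w . q_{w_i} + b_{w_i}."""
    r_w = R[current_word_id]
    predicted = C[position] * r_w if position_weights else r_w
    return predicted @ Q[context_word_id] + b[context_word_id]

# score each context position independently, as licensed by the conditional-independence assumption
context = [7, 19, 3, 501, 8, 42, 77, 5, 610, 12]
print([ivlbl_score(w_i, i, current_word_id=42) for i, w_i in enumerate(context)])
```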
44 Note that unlike the tree-based models, such as those in the above paper, which only learn conditional embeddings for words, in our models each word has both a conditional and a target embedding which can potentially capture complementary information. [sent-71, score-0.838]
45 Tree-based models replace target embeddings with parameter vectors associated with the tree nodes, as opposed to individual words. [sent-72, score-0.294]
46 3.1 Noise-contrastive estimation We train our models using noise-contrastive estimation (NCE), a method for fitting unnormalized models [6], adapted to neural language modelling in [14]. [sent-74, score-0.427]
47 The main advantage of NCE is that it allows us to fit models that are not explicitly normalized, making the training time effectively independent of the vocabulary size. [sent-77, score-0.234]
48 The perplexity of NPLMs trained using this approach has been shown to be on par with that of models trained by maximum likelihood, but at a fraction of the computational cost. [sent-80, score-0.254]
49 We will use the (global) unigram distribution of the training data as the noise distribution, a choice that is known to work well for training language models. [sent-84, score-0.393]
50 Thus, we estimate the contribution of a word / context pair w, h to the gradient of the NCE objective by sampling noise words. [sent-91, score-0.594]
51 This estimate involves a sum over k noise samples instead of a sum over the entire vocabulary, making the NCE training time linear in the number of noise samples and independent of the vocabulary size. [sent-94, score-0.296]
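As an illustration of the NCE objective described here, the sketch below computes the per-example objective for a generic scoring function, treating the score as an unnormalized log-probability and contrasting the observed word against k samples from the unigram noise distribution. It follows the standard NCE formulation for language modelling; the helper names and the toy usage are my own assumptions.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    return -np.logaddexp(0.0, -x)

def nce_objective(score_fn, target_id, context_ids, noise_probs, k, rng):
    """Per-example NCE objective: a logistic classifier separating the observed word
    from k words drawn from the noise (unigram) distribution."""
    noise_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)
    delta = lambda w: score_fn(w, context_ids) - np.log(k * noise_probs[w])
    obj = log_sigmoid(delta(target_id))                      # observed word labelled "data"
    obj += sum(log_sigmoid(-delta(x)) for x in noise_ids)    # noise words labelled "noise"
    return obj                                               # maximize w.r.t. the model parameters

# toy usage with a stand-in scoring function
rng = np.random.default_rng(0)
unigram = np.full(1000, 1.0 / 1000)
toy_score = lambda w, h: 0.01 * w                            # placeholder for s_theta(w, h)
print(nce_objective(toy_score, 42, [1, 2, 3], unigram, k=5, rng=rng))
```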
52 NCE shares some similarities with a training method for non-probabilistic neural language models that involves optimizing a margin-based ranking objective [4]. [sent-96, score-0.355]
53 As that approach is non-probabilistic, it is outside the scope of this paper, though it would be interesting to see whether it can be used to learn competitive word embeddings. [sent-97, score-0.51]
54 4 Evaluating word embeddings Using word embeddings learned by neural language models outside of the language modelling context is a relatively recent development. [sent-98, score-2.061]
55 An early example of this is the multi-layer neural network of [4], trained to perform several NLP tasks, which represented words exclusively in terms of learned word embeddings. [sent-99, score-0.782]
56 [18] provided the first comparison of several word embeddings learned with different methods and showed that incorporating them into established NLP pipelines can boost their performance. [sent-100, score-0.746]
57 Microsoft Research (MSR) has released two challenge sets: a set of sentences each with a missing word to be filled in [20] and a set of analogy questions [11], designed to evaluate semantic and syntactic content of word representations respectively. [sent-102, score-1.314]
58 The task is to identify the held-out fourth word, with only exact word matches deemed correct. [sent-106, score-0.51]
59 Word embeddings learned by neural language models have been shown to perform very well on these datasets when using the following vector-similarity-based protocol for answering the questions. [sent-107, score-0.513]
60 Suppose $\mathbf{w}$ is the representation vector for word w, normalized to unit norm. [sent-108, score-0.557]
61 An analogy question a : b :: c : ? is then answered by finding the word d∗ with the representation closest to $\mathbf{b} - \mathbf{a} + \mathbf{c}$ according to cosine similarity: $d^{*} = \arg\max_{x} \frac{(\mathbf{b} - \mathbf{a} + \mathbf{c})^\top \mathbf{x}}{\lVert \mathbf{b} - \mathbf{a} + \mathbf{c} \rVert \, \lVert \mathbf{x} \rVert}$ (10). [sent-110, score-0.532]
62 We discovered that reproducing the results reported in [10] and [11] for publicly available word embeddings required excluding b and c from the vocabulary when looking for d∗ using Eq. 10. [sent-111, score-0.803]
63 This equation suggests the following interpretation of d∗ : it is simply the word with the representation most similar to b and c and dissimilar to a, which makes it quite natural to exclude b and c themselves from consideration. [sent-115, score-0.532]
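A small sketch of this evaluation protocol, assuming a matrix of unit-normalized embeddings and a word-to-index vocabulary; the exclusion of b and c mirrors the procedure described above, and all names here are illustrative.

```python
import numpy as np

def answer_analogy(a, b, c, embeddings, vocab):
    """Return d* maximizing cosine similarity to b - a + c, excluding b and c from consideration (Eq. 10)."""
    # embeddings: (V, D) array with unit-norm rows; vocab: dict mapping word -> row index
    target = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    sims = embeddings @ target              # proportional to cosine similarity for unit-norm rows
    for w in (b, c):
        sims[vocab[w]] = -np.inf            # exclude the question words b and c
    index_to_word = {i: w for w, i in vocab.items()}
    return index_to_word[int(np.argmax(sims))]

# toy usage with random unit-norm vectors (real embeddings would come from a trained model)
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)
print(answer_analogy("king", "queen", "man", E, {"king": 0, "queen": 1, "man": 2, "woman": 3}))
```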
64 1 Experimental evaluation Datasets We evaluated our word embeddings on two analogy-based word similarity tasks released recently by Google and Microsoft Research that we described in Section 4. [sent-117, score-1.298]
65 Note that many words used for testing the representations are missing from this dataset, which greatly limits the accuracy achievable when using it. [sent-126, score-0.257]
66 We answered the questions using Eq. 10, excluding the second and the third word in the question from consideration, as explained in Section 4. [sent-128, score-0.51]
67 2 Details of training All models were trained on a single core, using minibatches of size 100 and the initial learning rate of 3 × 10−2 . [sent-130, score-0.244]
68 Initially we used a validation-set-based learning rate adaptation scheme described in [14], which halves the learning rate based on validation-set performance. Table 1: Accuracy in percent on word similarity tasks. [sent-132, score-0.561]
69 The models had 100D word embeddings and were trained to predict 5 words on both sides of the current word on the 1. [sent-133, score-1.526]
70 Table 2: Accuracy in percent on word similarity tasks for large models. [sent-173, score-0.561]
71 vLBL models predict the current word from the 5 preceding and 5 following words. [sent-176, score-0.655]
72 Though AdaGrad has already been used to train neural language models in a distributed setting [10], we found that it helped to learn better word representations even using a single CPU core. [sent-273, score-0.898]
73 We reduced the potentially prohibitive memory requirements of AdaGrad, which requires storing a running sum of squared gradient values for each parameter, by using the same learning rate for all dimensions of a word embedding. [sent-274, score-0.51]
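The following is a minimal sketch of this memory-saving AdaGrad variant: instead of one squared-gradient accumulator per parameter, each embedding row keeps a single accumulated value, so the extra memory is one scalar per word rather than one per dimension. How the per-row statistic is aggregated (the mean of squared gradients below) is my assumption, as are the names and defaults.

```python
import numpy as np

class RowSharedAdaGrad:
    """AdaGrad in which all dimensions of an embedding row share one learning rate."""

    def __init__(self, num_rows, learning_rate=3e-2, eps=1e-8):
        self.lr = learning_rate
        self.eps = eps
        self.accum = np.zeros(num_rows)            # one running sum per word, not per parameter

    def update(self, embeddings, row_ids, row_grads):
        # row_ids: unique indices of the embedding rows touched by this minibatch
        # row_grads: (len(row_ids), D) gradients for those rows
        self.accum[row_ids] += np.mean(row_grads ** 2, axis=1)   # shared statistic (assumed: mean over dims)
        step = self.lr / (np.sqrt(self.accum[row_ids]) + self.eps)
        embeddings[row_ids] -= step[:, None] * row_grads

# toy usage
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(100, 10))
opt = RowSharedAdaGrad(num_rows=100)
opt.update(E, np.array([3, 7]), rng.normal(size=(2, 10)))
```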
74 To speed up training, instead of predicting all context words around the current word, we predict only one context word, sampled at random. Table 3: Results for various models trained for 20 epochs on the 47M-word Gutenberg dataset using NCE5 with AdaGrad. [sent-280, score-0.514]
75 For each task, the left (right) column gives the accuracy obtained using the conditional (target) word embeddings. [sent-282, score-0.546]
76 Table 1 shows the results on the word similarity tasks for the two models trained on the Wikipedia dataset. [sent-359, score-0.727]
77 We ran NCE training several times with different numbers of noise samples to investigate the effect of this parameter on the representation quality and training time. [sent-360, score-0.249]
78 The models were trained for three epochs, which in our experience provided a reasonable compromise between training time and representation quality. [sent-361, score-0.288]
79 We then experimented with training models using AdaGrad and found that it significantly improved the quality of embeddings obtained when training with 10 or 25 noise samples, increasing the semantic score for the NCE25 model by over 10 percentage points. [sent-365, score-0.523]
80 Encouraged by this, we trained two ivLBL models with position-independent weights and different embedding dimensionalities for several days using this approach. [sent-366, score-0.267]
81 As some of the best results in [10] were obtained with the CBOW model, we also trained its non-hierarchical counterpart from Section 3, vLBL with positionindependent weights, using 100/300/600-dimensional embeddings and NCE with 5 noise samples, for shorter training times. [sent-367, score-0.453]
82 The scores for ivLBL and vLBL models were obtained using the conditional word and target word representations respectively, while the scores marked with d × 2 were obtained by concatenating the two word representations, after normalizing them. [sent-369, score-1.93]
83 For example, the 300D ivLBL model trained for just over a day, achieves accuracy scores 3-9 percentage points better than the 300D Skip-gram trained on the same amount of data for almost twice as long. [sent-371, score-0.369]
84 The same model trained for four days achieves accuracy scores that are only 2-4 percentage points lower than those of the 1000D Skip-gram trained on four times as much data using 75 times as many CPU cycles. [sent-372, score-0.415]
85 By computing word similarity scores using the conditional and the target word representations concatenated together, we can bring the accuracy gap down to 2 percentage points at no additional computational cost. [sent-373, score-1.374]
86 We report the accuracy obtained with both conditional and target representations (left and right columns respectively) for each of the models in Table 3. (Footnote 1: We checked this by training the Skip-gram model for 10 epochs, which did not result in a substantial increase in accuracy.) [sent-378, score-0.236]
87 The difference is small for traditional language models (vLBL), but is quite pronounced for the inverse language model (ivLBL). [sent-387, score-0.469]
88 The best-performing representations were learned by the traditional language model with the context surrounding the word and position-independent weights. [sent-388, score-1.007]
89 Sentence completion: We also applied our approach to the MSR Sentence Completion Challenge [19], where the task is to complete each of the 1,040 test sentences by picking the missing word from the list of five candidate words. [sent-389, score-0.563]
90 Using the 47M-word Gutenberg dataset, preprocessed as in [14], as the training set, we trained several ivLBL models with NCE5 to predict 5 words preceding and 5 following the current word. [sent-390, score-0.423]
91 To complete a sentence, we compute the probability of the 10 words around the missing word (using Eq. [sent-391, score-0.62]
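A sketch of how such a completion score can be computed under the factorized ivLBL context model: sum the log-probabilities of the surrounding words for each candidate and keep the best candidate. The function names and the log-probability interface are assumptions for illustration.

```python
import numpy as np

def completion_score(candidate_id, context_ids, log_prob_fn):
    """Sum of log P(w_i | candidate) over the surrounding words, using the
    conditional-independence factorization from Section 3."""
    return sum(log_prob_fn(w_i, i, candidate_id) for i, w_i in enumerate(context_ids))

def complete_sentence(candidate_ids, context_ids, log_prob_fn):
    """Pick the candidate word that makes the 10 surrounding words most probable."""
    scores = [completion_score(c, context_ids, log_prob_fn) for c in candidate_ids]
    return candidate_ids[int(np.argmax(scores))]

# toy usage with a stand-in log-probability function
toy_log_prob = lambda w_i, pos, cand: -abs(w_i - cand) * 0.01
print(complete_sentence([5, 17, 42, 99, 3], list(range(10)), toy_log_prob))
```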
92 6 Discussion We have proposed a new highly scalable approach to learning word embeddings which involves training lightweight log-bilinear language models with noise-contrastive estimation. [sent-400, score-1.064]
93 It is simpler than the tree-based language modelling approach of [10] and produces better-performing embeddings faster. [sent-401, score-0.476]
94 The scores we report in this paper are also easy to compare to, because we trained our models only on publicly available data. [sent-403, score-0.267]
95 [8] have recently proposed a way of learning multiple representations for each word by clustering the contexts the word occurs in and allocating a different representation for each cluster, prior to training the model. [sent-405, score-1.263]
96 As ivLBL predicts the context from the word, it naturally allows using multiple context representations per current word, resulting in a more principled approach to the problem based on mixture modeling. [sent-406, score-0.302]
97 Sharing representations between the context and the target words is also worth investigating as it might result in better-estimated rare word representations. [sent-407, score-0.839]
98 Adaptive importance sampling to accelerate training of a neural probabilistic language model. [sent-416, score-0.325]
99 Improving word representations via global context and multiple word prototypes. [sent-436, score-1.215]
100 A fast and simple algorithm for training neural probabilistic language models. [sent-461, score-0.325]
wordName wordTfidf (topN-words)
[('lbl', 0.535), ('word', 0.51), ('nce', 0.198), ('embeddings', 0.194), ('language', 0.192), ('ivlbl', 0.173), ('vlbl', 0.138), ('iv', 0.137), ('cbow', 0.121), ('msr', 0.121), ('representations', 0.111), ('trained', 0.111), ('sentence', 0.095), ('words', 0.089), ('gutenberg', 0.086), ('kip', 0.086), ('nplms', 0.086), ('context', 0.084), ('training', 0.078), ('scores', 0.078), ('vocabulary', 0.076), ('mnih', 0.063), ('odel', 0.061), ('adagrad', 0.06), ('modelling', 0.058), ('models', 0.055), ('kpn', 0.052), ('similarity', 0.051), ('andriy', 0.05), ('wikipedia', 0.049), ('gram', 0.047), ('oogle', 0.046), ('days', 0.046), ('target', 0.045), ('noise', 0.045), ('ime', 0.042), ('rw', 0.042), ('learned', 0.042), ('semantic', 0.04), ('mikolov', 0.04), ('surrounding', 0.038), ('yoshua', 0.037), ('unnormalized', 0.037), ('completion', 0.037), ('accuracy', 0.036), ('scalable', 0.035), ('nlp', 0.035), ('bw', 0.035), ('bwi', 0.035), ('emantic', 0.035), ('epd', 0.035), ('kepn', 0.035), ('ontext', 0.035), ('qwi', 0.035), ('rwi', 0.035), ('yntactic', 0.035), ('epochs', 0.034), ('embedding', 0.034), ('predict', 0.034), ('released', 0.033), ('considerably', 0.033), ('normalizing', 0.033), ('preceding', 0.033), ('percentage', 0.033), ('sentences', 0.032), ('perplexity', 0.032), ('contexts', 0.032), ('simpler', 0.032), ('deepmind', 0.03), ('qw', 0.03), ('sen', 0.03), ('scoring', 0.03), ('syntactic', 0.03), ('neural', 0.03), ('traditional', 0.03), ('koray', 0.028), ('tomas', 0.028), ('zweig', 0.028), ('google', 0.028), ('geoffrey', 0.027), ('challenge', 0.027), ('microsoft', 0.026), ('distributional', 0.026), ('samples', 0.026), ('counterpart', 0.025), ('probabilistic', 0.025), ('normalized', 0.025), ('publicly', 0.023), ('vocabularies', 0.023), ('current', 0.023), ('predicted', 0.023), ('comparable', 0.023), ('cluster', 0.022), ('compromise', 0.022), ('occurred', 0.022), ('excellent', 0.022), ('unfortunately', 0.022), ('representation', 0.022), ('weights', 0.021), ('missing', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
2 0.34763896 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
3 0.18542719 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
4 0.18183185 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
5 0.14688544 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
Author: Nathaniel J. Smith, Noah Goodman, Michael Frank
Abstract: Language users are remarkably good at making inferences about speakers’ intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings. 1
6 0.13432148 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
7 0.10621507 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.095886484 98 nips-2013-Documents as multiple overlapping windows into grids of counts
9 0.091110565 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
10 0.082623191 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
11 0.081607349 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
12 0.080943786 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
13 0.070608325 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
14 0.07010065 5 nips-2013-A Deep Architecture for Matching Short Texts
15 0.064849034 174 nips-2013-Lexical and Hierarchical Topic Regression
16 0.061651148 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
17 0.060571235 65 nips-2013-Compressive Feature Learning
18 0.052607283 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
19 0.052460842 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
20 0.0511485 75 nips-2013-Convex Two-Layer Modeling
topicId topicWeight
[(0, 0.14), (1, 0.079), (2, -0.102), (3, -0.041), (4, 0.108), (5, -0.113), (6, 0.006), (7, 0.022), (8, 0.026), (9, 0.024), (10, -0.06), (11, 0.006), (12, -0.078), (13, -0.043), (14, 0.017), (15, -0.069), (16, 0.248), (17, 0.143), (18, 0.008), (19, 0.21), (20, -0.087), (21, -0.203), (22, -0.128), (23, -0.011), (24, 0.054), (25, 0.169), (26, -0.003), (27, 0.029), (28, -0.012), (29, 0.108), (30, -0.099), (31, -0.116), (32, 0.096), (33, 0.102), (34, 0.015), (35, -0.015), (36, -0.047), (37, 0.007), (38, 0.029), (39, 0.05), (40, -0.011), (41, 0.013), (42, -0.051), (43, -0.032), (44, -0.041), (45, -0.061), (46, -0.098), (47, -0.061), (48, -0.016), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.96443504 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
2 0.92492706 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
3 0.81562912 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
4 0.73925012 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
5 0.7354843 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
Author: Nathaniel J. Smith, Noah Goodman, Michael Frank
Abstract: Language users are remarkably good at making inferences about speakers’ intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings. 1
6 0.59931672 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
7 0.56185299 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.56021595 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
9 0.51372075 98 nips-2013-Documents as multiple overlapping windows into grids of counts
10 0.45723239 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
11 0.39227703 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
12 0.35346332 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
13 0.3427965 335 nips-2013-Transfer Learning in a Transductive Setting
14 0.32750016 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
15 0.32411322 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.31304035 65 nips-2013-Compressive Feature Learning
17 0.30326098 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
18 0.30045241 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
19 0.29836267 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
20 0.29579481 174 nips-2013-Lexical and Hierarchical Topic Regression
topicId topicWeight
[(16, 0.026), (33, 0.134), (34, 0.072), (36, 0.014), (41, 0.016), (43, 0.023), (49, 0.026), (56, 0.052), (70, 0.015), (85, 0.038), (89, 0.021), (93, 0.484)]
simIndex simValue paperId paperTitle
1 0.94288963 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
2 0.88887256 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
3 0.8870303 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
same-paper 4 0.86796707 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
5 0.85897923 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
6 0.70479351 215 nips-2013-On Decomposing the Proximal Map
7 0.67861629 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.63161701 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
9 0.63082361 99 nips-2013-Dropout Training as Adaptive Regularization
10 0.62666833 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
11 0.59775966 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
12 0.59730363 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
13 0.59510112 30 nips-2013-Adaptive dropout for training deep neural networks
14 0.57306254 251 nips-2013-Predicting Parameters in Deep Learning
15 0.55840218 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.54771364 69 nips-2013-Context-sensitive active sensing in humans
17 0.5455178 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
18 0.54259515 64 nips-2013-Compete to Compute
19 0.54239988 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.53443879 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization