nips nips2013 nips2013-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. [sent-11, score-0.527]
2 In this paper we present several extensions that improve both the quality of the vectors and the training speed. [sent-12, score-0.163]
3 By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. [sent-13, score-0.771]
4 We also describe a simple alternative to the hierarchical softmax called negative sampling. [sent-14, score-0.376]
5 An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. [sent-15, score-0.84]
6 Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible. [sent-17, score-0.865]
7 1 Introduction Distributed representations of words in a vector space help learning algorithms to achieve better performance in natural language processing tasks by grouping similar words. [sent-18, score-0.426]
8 One of the earliest uses of word representations dates back to 1986, due to Rumelhart, Hinton, and Williams [13]. [sent-19, score-0.49]
9 Recently, Mikolov et al. [8] introduced the Skip-gram model, an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. [sent-23, score-0.311]
10 Unlike most of the previously used neural network architectures for learning word vectors, training of the Skip-gram model (see Figure 1) does not involve dense matrix multiplications. [sent-24, score-0.408]
11 This makes the training extremely efficient: an optimized single-machine implementation can train on more than 100 billion words in one day. [sent-25, score-0.307]
12 The word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns. [sent-26, score-0.588]
13 For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector [9, 8]. [sent-28, score-0.307]
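To make this kind of vector arithmetic concrete, the following is a minimal sketch, not from the paper: it answers an analogy query by cosine nearest-neighbour search over row-normalized embeddings, using a hypothetical toy vocabulary and random vectors in place of trained ones.

```python
import numpy as np

def analogy(emb, vocab, a, b, c, topn=1):
    """Return the words closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    # Normalize rows so that dot products are cosine similarities.
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = norm[idx[b]] - norm[idx[a]] + norm[idx[c]]
    query /= np.linalg.norm(query)
    sims = norm @ query
    for w in (a, b, c):
        sims[idx[w]] = -np.inf          # exclude the input words from the ranking
    best = np.argsort(-sims)[:topn]
    return [vocab[i] for i in best]

# Hypothetical 4-word toy vocabulary; real embeddings would come from a trained model.
vocab = ["madrid", "spain", "france", "paris"]
emb = np.random.RandomState(0).randn(4, 8)
print(analogy(emb, vocab, "spain", "madrid", "france"))
```

With real trained vectors, the same lookup returns "paris" for the Madrid/Spain/France query, as the paper reports.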
14 The training objective is to learn word vector representations that are good at predicting the nearby words. [sent-30, score-0.591]
15 We show that subsampling of frequent words during training results in a significant speedup (around 2x - 10x), and improves accuracy of the representations of less frequent words. [sent-32, score-0.919]
16 In addition, we present a simplified variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work [8]. [sent-33, score-0.854]
17 Word representations are limited by their inability to represent idiomatic phrases that are not compositions of the individual words. [sent-34, score-0.567]
18 Therefore, using vectors to represent the whole phrases makes the Skip-gram model considerably more expressive. [sent-36, score-0.403]
19 Other techniques that aim to represent meaning of sentences by composing the word vectors, such as the recursive autoencoders [15], would also benefit from using phrase vectors instead of the word vectors. [sent-37, score-0.868]
20 The extension from word based to phrase based models is relatively simple. [sent-38, score-0.499]
21 First we identify a large number of phrases using a data-driven approach, and then we treat the phrases as individual tokens during the training. [sent-39, score-0.739]
22 To evaluate the quality of the phrase vectors, we developed a test set of analogical reasoning tasks that contains both words and phrases. [sent-40, score-0.579]
23 This compositionality suggests that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations. [sent-46, score-0.503]
24 2 The Skip-gram Model The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. [sent-47, score-0.784]
25 More formally, given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, [sent-48, score-0.229]
26 the objective of the Skip-gram model is to maximize the average log probability $\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$ (1), where c is the size of the training context (which can be a function of the center word $w_t$). [sent-51, score-0.453]
27 Larger c results in more training examples and thus can lead to a higher accuracy, at the expense of the training time. [sent-52, score-0.202]
28 The basic Skip-gram formulation defines $p(w_{t+j} \mid w_t)$ using the softmax function: $p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_w}^{\top} v_{w_I}\right)}$ (2), where $v_w$ and $v'_w$ are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary. [sent-53, score-0.864]
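As an illustration of Equation 2, here is a minimal sketch assuming small NumPy embedding matrices V_in (the "input" vectors $v_w$) and V_out (the "output" vectors $v'_w$), both hypothetical toy arrays; it makes explicit that the normalization runs over the entire vocabulary of W words.

```python
import numpy as np

def skipgram_softmax(V_in, V_out, w_I, w_O):
    """p(w_O | w_I) under the full softmax of Eq. (2).

    V_in[w]  -- "input" vector v_w of word w
    V_out[w] -- "output" vector v'_w of word w
    The denominator sums over the whole vocabulary, which is why the
    full softmax is impractical for large W.
    """
    scores = V_out @ V_in[w_I]            # v'_w . v_{w_I} for every word w
    scores -= scores.max()                # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_O]

# Toy numbers: vocabulary of 10 words, 5-dimensional vectors (hypothetical).
rng = np.random.RandomState(0)
V_in, V_out = rng.randn(10, 5), rng.randn(10, 5)
print(skipgram_softmax(V_in, V_out, w_I=3, w_O=7))
```

The O(W) denominator is exactly the cost that the hierarchical softmax and negative sampling discussed below avoid.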
29 1 Hierarchical Softmax A computationally efficient approximation of the full softmax is the hierarchical softmax. [sent-56, score-0.331]
30 The hierarchical softmax uses a binary tree representation of the output layer with the W words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. [sent-59, score-0.459]
31 More precisely, each word w can be reached by an appropriate path from the root of the tree. [sent-61, score-0.307]
32 Then the hierarchical softmax defines $p(w_O \mid w_I)$ as follows: $p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left([[n(w, j{+}1) = \mathrm{ch}(n(w, j))]] \cdot {v'_{n(w,j)}}^{\top} v_{w_I}\right)$ (3), where $\sigma(x) = 1/(1 + \exp(-x))$. [sent-64, score-0.331]
33 Also, unlike the standard softmax formulation of the Skip-gram, which assigns two representations $v_w$ and $v'_w$ to each word w, the hierarchical softmax formulation has one representation $v_w$ for each word w and one representation $v'_n$ for every inner node n of the binary tree. [sent-67, score-1.681]
34 The structure of the tree used by the hierarchical softmax has a considerable effect on the performance. [sent-68, score-0.331]
35 In our work we use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training. [sent-70, score-0.266]
36 It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models [5, 8]. [sent-71, score-0.243]
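To make Equation 3 concrete, the following is a minimal sketch rather than the authors' implementation: it assumes each word comes with a precomputed root-to-leaf path of inner-node indices and a code of ±1 branch decisions (as a Huffman-coding step would produce), and evaluates p(w | w_I) as a product of sigmoids along that path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(V_in, V_node, path, code, w_I):
    """p(w | w_I) under the hierarchical softmax of Eq. (3).

    V_in[w]   -- input vector v_{w_I}
    V_node[n] -- vector v'_n of inner node n of the binary tree
    path      -- inner-node indices on the root-to-leaf path of w
    code      -- +1 if the path takes the "left" child at that node, else -1
    The cost is O(len(path)) ~ O(log W) instead of O(W).
    """
    p = 1.0
    for n, sign in zip(path, code):
        p *= sigmoid(sign * np.dot(V_node[n], V_in[w_I]))
    return p

# Hypothetical toy tree: 7 inner nodes, a path of length 3 for some word w.
rng = np.random.RandomState(0)
V_in, V_node = rng.randn(8, 5), rng.randn(7, 5)
print(hs_probability(V_in, V_node, path=[0, 2, 5], code=[+1, -1, +1], w_I=4))
```

The cost per word is proportional to the code length, roughly log2(W) for a balanced tree and less on average with Huffman codes, which is why frequent words with short codes speed up training.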
37 2 Negative Sampling An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE), which was introduced by Gutmann and Hyvarinen [4] and applied to language modeling by Mnih and Teh [11]. [sent-73, score-0.446]
38 While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. [sent-76, score-0.183]
39 The figure illustrates the ability of the model to automatically organize concepts and implicitly learn the relationships between them, as during training we did not provide any supervised information about what a capital city means. [sent-86, score-0.203]
40 Thus the task is to distinguish the target word wO from draws from the noise distribution Pn (w) using logistic regression, where there are k negative samples for each data sample. [sent-88, score-0.386]
41 We found that the unigram distribution $U(w)$ raised to the 3/4rd power (i.e., $U(w)^{3/4}/Z$) significantly outperformed the unigram and the uniform distributions, for both NCE and NEG on every task we tried, including language modeling (not reported here). [sent-95, score-0.195]
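A minimal sketch of this setup follows; it is an illustration under stated assumptions, not the released implementation. It assumes a per-pair objective of the form $\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \log \sigma(-{v'_{w_i}}^{\top} v_{w_I})$ with noise words drawn from the 3/4-power unigram distribution; the embedding matrices and word counts are hypothetical toy values.

```python
import numpy as np

def neg_objective(V_in, V_out, w_I, w_O, counts, k=5, rng=None):
    """Negative-sampling objective for one (w_I, w_O) training pair.

    Noise words are drawn from the unigram distribution raised to the
    3/4rd power, which the paper reports works best in practice.
    """
    rng = rng or np.random.RandomState(0)
    noise = counts ** 0.75
    noise /= noise.sum()                       # P_n(w) = U(w)^{3/4} / Z
    negatives = rng.choice(len(counts), size=k, p=noise)

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    obj = np.log(sigmoid(np.dot(V_out[w_O], V_in[w_I])))
    for w in negatives:                        # k negative samples per data sample
        obj += np.log(sigmoid(-np.dot(V_out[w], V_in[w_I])))
    return obj

# Hypothetical toy setup: 10 words with made-up frequencies.
rng = np.random.RandomState(1)
V_in, V_out = rng.randn(10, 5), rng.randn(10, 5)
counts = np.array([500, 300, 120, 80, 40, 20, 10, 5, 3, 1], dtype=float)
print(neg_objective(V_in, V_out, w_I=2, w_O=6, counts=counts))
```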
42 3 Subsampling of Frequent Words In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). [sent-97, score-0.266]
43 Such words usually provide less information value than the rare words. [sent-100, score-0.17]
44 For example, while the Skip-gram model benefits from observing the co-occurrences of “France” and “Paris”, it benefits much less from observing the frequent co-occurrences of “France” and “the”, as nearly every word co-occurs frequently within a sentence with “the”. [sent-101, score-0.51]
45 This idea can also be applied in the opposite direction; the vector representations of frequent words do not change significantly after training on several million examples. [sent-102, score-0.55]
46 Each word $w_i$ in the training set is discarded with probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$ (5), where $f(w_i)$ is the frequency of word $w_i$ and t is a chosen threshold, typically around $10^{-5}$. [sent-105, score-0.409]
47 We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies. [sent-106, score-0.326]
48 Although this subsampling formula was chosen heuristically, we found it to work well in practice. [sent-107, score-0.198]
49 It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words, as will be shown in the following sections. [sent-108, score-0.173]
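A minimal sketch of applying this subsampling while streaming over a corpus is given below, assuming precomputed relative word frequencies; the helper is illustrative, not the released code.

```python
import numpy as np

def subsample(tokens, freqs, t=1e-5, rng=None):
    """Randomly discard frequent words: keep w_i with probability sqrt(t / f(w_i)).

    freqs maps each word to its relative frequency f(w_i) in the corpus;
    words with f(w_i) <= t are always kept.
    """
    rng = rng or np.random.RandomState(0)
    kept = []
    for w in tokens:
        keep_prob = min(1.0, np.sqrt(t / freqs[w]))
        if rng.rand() < keep_prob:
            kept.append(w)
    return kept

# Hypothetical frequencies: "the" is very frequent, "france" is rare.
freqs = {"the": 0.05, "cat": 1e-4, "france": 2e-6}
print(subsample(["the", "cat", "the", "france", "the"], freqs))
```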
50 3 Empirical Results In this section we evaluate the Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words. [sent-109, score-0.299]
51 We used the analogical reasoning task introduced by Mikolov et al. [sent-110, score-0.259]
52 The task has two broad categories: the syntactic analogies (such as “quick” : “quickly” :: “slow” : “slowly”) and the semantic analogies, such as the country to capital city relationship. [sent-115, score-0.283]
53 For training the Skip-gram models, we have used a large dataset consisting of various news articles (an internal Google dataset with one billion words). [sent-116, score-0.226]
54 We discarded from the vocabulary all words that occurred less than 5 times in the training data, which resulted in a vocabulary of size 692K. [sent-117, score-0.303]
55 The performance of various Skip-gram models on the word analogy test set is reported in Table 1. [sent-118, score-0.373]
56 The table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation. [sent-119, score-0.292]
57 The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate. [sent-120, score-1.055]
58 It can be argued that the linearity of the Skip-gram model makes its vectors more suitable for such linear analogical reasoning, but the results of Mikolov et al. [sent-121, score-0.253]
59 4 Learning Phrases As discussed earlier, many phrases have a meaning that is not a simple composition of the meanings of their individual words. [sent-123, score-0.39]
60 For example, “New York Times” and “Toronto Maple Leafs” are replaced by unique tokens in the training data, while a bigram “this is” will remain unchanged. [sent-125, score-0.158]
61 The goal is to compute the fourth phrase using the first three. [sent-131, score-0.192]
62 This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. [sent-133, score-0.341]
63 Many techniques have been previously developed to identify phrases in the text; however, it is out of scope of our work to compare them. [sent-134, score-0.341]
64 We decided to use a simple data-driven approach, where phrases are formed based on the unigram and bigram counts, using $\mathrm{score}(w_i, w_j) = \dfrac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$. (6) [sent-135, score-0.387]
65 The δ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed. [sent-136, score-0.512]
66 Typically, we run 2-4 passes over the training data with a decreasing threshold value, allowing longer phrases that consist of several words to be formed. [sent-138, score-0.57]
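A minimal sketch of one such phrase-building pass follows, assuming simple unigram and bigram count dictionaries; the scoring follows Equation 6, but the helper name, thresholds, and toy corpus are ours.

```python
from collections import Counter

def merge_phrases(tokens, delta=2.5, threshold=0.01):
    """One pass of data-driven phrase detection using the score of Eq. (6).

    Adjacent words (w_i, w_j) whose score exceeds the threshold are replaced
    by a single token "w_i_w_j"; running several passes with a decreasing
    threshold allows longer phrases to form.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            wi, wj = tokens[i], tokens[i + 1]
            score = (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])
            if score > threshold:
                out.append(wi + "_" + wj)     # treat the phrase as one token
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

# Toy corpus where "new york" co-occurs often (counts are made up).
corpus = "new york is big . new york times . this is new york".split()
print(merge_phrases(corpus))
```

Running the pass again on its own output, with a lower threshold, lets longer phrases such as "new_york_times" emerge.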
67 We evaluate the quality of the phrase representations using a new analogical reasoning task that involves phrases. [sent-139, score-0.668]
68 1 Phrase Skip-Gram Results Starting with the same news data as in the previous experiments, we first constructed the phrase based training corpus and then we trained several Skip-gram models using different hyperparameters. [sent-143, score-0.387]
69 This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens. [sent-145, score-0.528]
70 Surprisingly, while we found the Hierarchical Softmax to achieve lower performance when trained without subsampling, it became the best performing method when we downsampled the frequent words. [sent-148, score-0.185]
71 This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases. [sent-149, score-0.299]
72 Table 3: Accuracies of the Skip-gram models on the phrase analogy dataset (all models use 300-dimensional vectors). NEG-5: 24% with no subsampling, 27% with $10^{-5}$ subsampling; NEG-15: 27% and 42%; HS-Huffman: 19% and 47%. [sent-153, score-0.654]
73 The models were trained on approximately one billion words from the news dataset. [sent-154, score-0.3]
74 To maximize the accuracy on the phrase analogy task, we increased the amount of the training data by using a dataset with about 33 billion words. [sent-158, score-0.47]
75 We achieved a lower accuracy of 66% when we reduced the size of the training dataset to 6B words, which suggests that the large amount of training data is crucial. [sent-161, score-0.235]
76 To gain further insight into how different the representations learned by different models are, we manually inspected the nearest neighbours of infrequent phrases using various models. [sent-162, score-0.638]
77 Consistent with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling. [sent-164, score-0.891]
78 5 Additive Compositionality We demonstrated that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. [sent-165, score-0.977]
79 Interestingly, we found that the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. [sent-166, score-0.311]
80 The additive property of the vectors can be explained by inspecting the training objective. [sent-168, score-0.163]
81 The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. [sent-169, score-0.61]
82 As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. [sent-170, score-0.913]
83 These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. [sent-171, score-0.369]
84 The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. [sent-172, score-0.625]
85 Thus, if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors will result in such a feature vector that is close to the vector of “Volga River”. [sent-173, score-0.562]
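The following is a minimal sketch of this additive composition, using a hypothetical toy vocabulary and random vectors; with vectors from a trained model, summing "russian" and "river" would rank a token such as "volga_river" highly.

```python
import numpy as np

def compose_and_lookup(emb, vocab, words, topn=3):
    """Add the vectors of `words` element-wise and return the nearest tokens."""
    idx = {w: i for i, w in enumerate(vocab)}
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = sum(norm[idx[w]] for w in words)
    query /= np.linalg.norm(query)
    sims = norm @ query
    for w in words:
        sims[idx[w]] = -np.inf                 # exclude the summed words themselves
    return [vocab[i] for i in np.argsort(-sims)[:topn]]

# Hypothetical vocabulary and random vectors; a trained model is needed for
# the composition to be semantically meaningful.
vocab = ["russian", "river", "volga_river", "moscow", "germany"]
emb = np.random.RandomState(0).randn(len(vocab), 16)
print(compose_and_lookup(emb, vocab, ["russian", "river"]))
```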
86 6 Comparison to Published Word Representations Many authors who previously worked on the neural network based representations of words have published their resulting models for further use and comparison: amongst the most well known authors are Collobert and Weston [2], Turian et al. [sent-174, score-0.311]
87 Mikolov et al. [8] have already evaluated these word representations on the word analogy task, where the Skip-gram models achieved the best performance by a huge margin. [sent-178, score-0.863]
88 An empty cell means that the word was not in the vocabulary. [sent-182, score-0.307]
89 To give more insight into the difference of the quality of the learned vectors, we provide empirical comparison by showing the nearest neighbours of infrequent words in Table 6. [sent-183, score-0.242]
90 Interestingly, although the training set is much larger, the training time of the Skip-gram model is just a fraction of the time required by the previous model architectures. [sent-186, score-0.202]
91 We show how to train distributed representations of words and phrases with the Skip-gram model and demonstrate that these representations exhibit linear structure that makes precise analogical reasoning possible. [sent-188, score-1.094]
92 This results in a great improvement in the quality of the learned word and phrase representations, especially for the rare entities. [sent-191, score-0.577]
93 We also found that the subsampling of the frequent words results in both faster training and significantly better representations of uncommon words. [sent-192, score-0.748]
94 Another contribution of our paper is the Negative sampling algorithm, which is an extremely simple training method that learns accurate representations especially for frequent words. [sent-193, score-0.422]
95 In our experiments, the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window. [sent-195, score-0.299]
96 A very interesting result of this work is that the word vectors can be somewhat meaningfully combined using just simple vector addition. [sent-196, score-0.369]
97 Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. [sent-197, score-0.865]
98 Our work can thus be seen as complementary to the existing approach that attempts to represent phrases using recursive matrix-vector operations [16]. [sent-199, score-0.341]
99 We made the code for training the word and phrase vectors based on the techniques described in this paper available as an open-source project. [sent-200, score-0.662]
100 A fast and simple algorithm for training neural probabilistic language models. [sent-237, score-0.216]
wordName wordTfidf (topN-words)
[('phrases', 0.341), ('word', 0.307), ('softmax', 0.241), ('vec', 0.211), ('nce', 0.206), ('subsampling', 0.198), ('phrase', 0.192), ('analogical', 0.191), ('representations', 0.183), ('mikolov', 0.162), ('wo', 0.138), ('frequent', 0.138), ('words', 0.128), ('vwi', 0.128), ('language', 0.115), ('river', 0.114), ('tomas', 0.104), ('vw', 0.104), ('wi', 0.102), ('training', 0.101), ('airlines', 0.094), ('mnih', 0.093), ('hierarchical', 0.09), ('mountain', 0.086), ('montreal', 0.085), ('lufthansa', 0.085), ('volga', 0.085), ('compositionality', 0.081), ('billion', 0.078), ('analogies', 0.075), ('toronto', 0.069), ('capital', 0.069), ('reasoning', 0.068), ('analogy', 0.066), ('france', 0.065), ('sentence', 0.065), ('canadiens', 0.064), ('havel', 0.064), ('vectors', 0.062), ('contrastive', 0.061), ('google', 0.06), ('tokens', 0.057), ('maple', 0.056), ('leafs', 0.056), ('collobert', 0.055), ('boston', 0.055), ('neg', 0.052), ('redmond', 0.049), ('meanings', 0.049), ('russia', 0.049), ('trained', 0.047), ('news', 0.047), ('unigram', 0.046), ('yoshua', 0.046), ('negative', 0.045), ('wt', 0.045), ('paris', 0.044), ('infrequent', 0.043), ('chess', 0.043), ('cincinnati', 0.043), ('detroit', 0.043), ('huffman', 0.043), ('idiomatic', 0.043), ('ionian', 0.043), ('lukas', 0.043), ('memphis', 0.043), ('skipgram', 0.043), ('vaclav', 0.043), ('vwo', 0.043), ('rare', 0.042), ('turian', 0.041), ('kai', 0.039), ('germany', 0.038), ('spain', 0.038), ('globe', 0.038), ('phoenix', 0.038), ('burget', 0.038), ('morin', 0.038), ('nashville', 0.038), ('vocabulary', 0.037), ('syntactic', 0.037), ('pn', 0.036), ('learned', 0.036), ('berlin', 0.035), ('moscow', 0.035), ('teams', 0.035), ('gutmann', 0.035), ('baltimore', 0.035), ('country', 0.035), ('greece', 0.035), ('neighbours', 0.035), ('rumelhart', 0.035), ('russian', 0.035), ('task', 0.034), ('speech', 0.034), ('weston', 0.034), ('geoffrey', 0.034), ('table', 0.033), ('accuracy', 0.033), ('city', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
2 0.34763896 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
3 0.22646731 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
4 0.15314306 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
5 0.1494313 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
6 0.11070462 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
7 0.10400105 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
8 0.073612012 251 nips-2013-Predicting Parameters in Deep Learning
9 0.073475145 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
10 0.073360533 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
11 0.071417511 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
12 0.070992395 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
13 0.070335589 174 nips-2013-Lexical and Hierarchical Topic Regression
14 0.069695614 5 nips-2013-A Deep Architecture for Matching Short Texts
15 0.0689556 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
16 0.068759345 98 nips-2013-Documents as multiple overlapping windows into grids of counts
17 0.068091631 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
18 0.066023625 75 nips-2013-Convex Two-Layer Modeling
19 0.06598907 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
20 0.065925866 331 nips-2013-Top-Down Regularization of Deep Belief Networks
topicId topicWeight
[(0, 0.148), (1, 0.078), (2, -0.104), (3, -0.061), (4, 0.119), (5, -0.13), (6, -0.015), (7, 0.038), (8, 0.038), (9, 0.003), (10, -0.043), (11, 0.012), (12, -0.058), (13, -0.044), (14, 0.017), (15, -0.023), (16, 0.24), (17, 0.126), (18, 0.002), (19, 0.206), (20, -0.044), (21, -0.199), (22, -0.097), (23, -0.019), (24, 0.103), (25, 0.169), (26, -0.053), (27, 0.038), (28, -0.02), (29, 0.071), (30, -0.076), (31, -0.1), (32, 0.037), (33, 0.088), (34, 0.031), (35, 0.005), (36, -0.05), (37, -0.044), (38, 0.037), (39, -0.009), (40, -0.015), (41, 0.037), (42, -0.037), (43, -0.033), (44, -0.048), (45, -0.004), (46, -0.1), (47, -0.099), (48, -0.042), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96452856 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
2 0.94172049 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
3 0.79077858 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
4 0.70546031 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
Author: Nathaniel J. Smith, Noah Goodman, Michael Frank
Abstract: Language users are remarkably good at making inferences about speakers’ intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings. 1
5 0.6791926 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
6 0.60599327 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
7 0.60217196 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
8 0.54249275 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
9 0.52301979 98 nips-2013-Documents as multiple overlapping windows into grids of counts
10 0.43424004 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
11 0.43120167 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
12 0.37468246 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
13 0.37466305 5 nips-2013-A Deep Architecture for Matching Short Texts
14 0.36021757 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
15 0.35569087 85 nips-2013-Deep content-based music recommendation
16 0.35390627 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
17 0.34519008 65 nips-2013-Compressive Feature Learning
18 0.34469211 335 nips-2013-Transfer Learning in a Transductive Setting
19 0.3440696 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
20 0.33910128 174 nips-2013-Lexical and Hierarchical Topic Regression
topicId topicWeight
[(16, 0.043), (33, 0.172), (34, 0.068), (36, 0.012), (41, 0.014), (43, 0.31), (49, 0.027), (56, 0.045), (70, 0.021), (85, 0.039), (89, 0.017), (93, 0.145)]
simIndex simValue paperId paperTitle
same-paper 1 0.7868104 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
2 0.73444438 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
Author: Vibhav Vineet, Carsten Rother, Philip Torr
Abstract: Many methods have been proposed to solve the problems of recovering intrinsic scene properties such as shape, reflectance and illumination from a single image, and object class segmentation separately. While these two problems are mutually informative, in the past not many papers have addressed this topic. In this work we explore such joint estimation of intrinsic scene properties recovered from an image, together with the estimation of the objects and attributes present in the scene. In this way, our unified framework is able to capture the correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor), and materials (wooden, plastic) in a given scene. For example, our model is able to enforce the condition that if a set of pixels take same object label, e.g. table, most likely those pixels would receive similar reflectance values. We cast the problem in an energy minimization framework and demonstrate the qualitative and quantitative improvement in the overall accuracy on the NYU and Pascal datasets. 1
3 0.66361457 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
Author: Daniel S. Levine, Jonathan P. How
Abstract: We consider the sensor selection problem on multivariate Gaussian distributions where only a subset of latent variables is of inferential interest. For pairs of vertices connected by a unique path in the graph, we show that there exist decompositions of nonlocal mutual information into local information measures that can be computed efficiently from the output of message passing algorithms. We integrate these decompositions into a computationally efficient greedy selector where the computational expense of quantification can be distributed across nodes in the network. Experimental results demonstrate the comparative efficiency of our algorithms for sensor selection in high-dimensional distributions. We additionally derive an online-computable performance bound based on augmentations of the relevant latent variable set that, when such a valid augmentation exists, is applicable for any distribution with nuisances. 1
4 0.6128906 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
5 0.602714 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
6 0.59224212 65 nips-2013-Compressive Feature Learning
7 0.58487666 211 nips-2013-Non-Linear Domain Adaptation with Boosting
8 0.58119732 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
9 0.57679355 339 nips-2013-Understanding Dropout
10 0.57178098 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
11 0.56898361 30 nips-2013-Adaptive dropout for training deep neural networks
12 0.56422353 99 nips-2013-Dropout Training as Adaptive Regularization
13 0.5630883 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
14 0.56252682 251 nips-2013-Predicting Parameters in Deep Learning
15 0.56238902 215 nips-2013-On Decomposing the Proximal Map
16 0.55987656 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
17 0.55872554 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
18 0.5563255 331 nips-2013-Top-Down Regularization of Deep Belief Networks
19 0.55336422 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.55186862 200 nips-2013-Multi-Prediction Deep Boltzmann Machines