nips nips2013 nips2013-336 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. [sent-3, score-0.694]
2 Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. [sent-5, score-0.443]
3 Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. [sent-7, score-0.279]
4 1 Introduction Multi-relational data refers to directed graphs whose nodes correspond to entities and edges of the form (head, label, tail) (denoted (h, ℓ, t)), each of which indicates that there exists a relationship of name label between the entities head and tail. [sent-8, score-0.971]
5 Modeling multi-relational data In general, the modeling process boils down to extracting local or global connectivity patterns between entities, and prediction is performed by using these patterns to generalize the observed relationship between a specific entity and all others. [sent-12, score-0.328]
6 The notion of locality for a single relationship may be purely structural, such as the friend of my friend is my friend in social networks, [sent-13, score-0.22]
7 but can also depend on the entities, such as those who liked Star Wars IV also liked Star Wars V, but they may or may not like Titanic. [sent-17, score-0.056]
8 Relationships as translations in the embedding space In this paper, we introduce TransE, an energy-based model for learning low-dimensional embeddings of entities. [sent-25, score-0.351]
9 In TransE, relationships are represented as translations in the embedding space: if (h, ℓ, t) holds, then the embedding of the tail entity t should be close to the embedding of the head entity h plus some vector that depends on the relationship ℓ. [sent-26, score-1.221]
10 Our approach relies on a reduced set of parameters as it learns only one low-dimensional vector for each entity and each relationship. [sent-27, score-0.148]
11 The main motivation behind our translation-based parameterization is that hierarchical relationships are extremely common in KBs and translations are the natural transformations for representing them. [sent-28, score-0.259]
12 In a natural tree representation (i.e. embeddings of the nodes in dimension 2), the siblings are close to each other and nodes at a given height are organized on the x-axis, and the parent-child relationship corresponds to a translation on the y-axis. [sent-31, score-0.329]
13 Since a null translation vector corresponds to an equivalence relationship between entities, the model can then represent the sibling relationship as well. [sent-32, score-0.239]
14 Hence, we chose to use our parameter budget per relationship (one low-dimensional vector) to represent what we considered to be the key relationships in KBs. [sent-33, score-0.292]
15 This suggests that there may exist embedding spaces in which 1-to-1 relationships between entities of different types may also be represented by translations. [sent-35, score-0.668]
16 The intention of our model is to enforce such a structure of the embedding space. [sent-36, score-0.106]
17 Besides, its light parameterization allows it to be successfully trained on a large scale split of Freebase containing 1M entities, 25k relationships and more than 17M training samples. [sent-38, score-0.301]
18 [Algorithm 1 (Learning TransE) — input: training set S = {(h, ℓ, t)}, the entity and relationship sets.] [sent-42, score-0.364]
19 2 Translation-based model Given a training set S of triplets (h, ℓ, t) composed of two entities h, t ∈ E (the set of entities) and a relationship ℓ ∈ L (the set of relationships), our model learns vector embeddings of the entities and the relationships. [sent-48, score-1.35]
20 The embeddings take values in Rk (k is a model hyperparameter) and are denoted with the same letters, in boldface characters. [sent-49, score-0.184]
21 The basic idea behind our model is that the functional relation induced by the ℓ-labeled edges corresponds to a translation of the embeddings, i.e., h + ℓ ≈ t when (h, ℓ, t) holds. [sent-50, score-0.051]
22 Following an energy-based framework, the energy of a triplet is equal to d(h + ℓ, t) for some dissimilarity measure d, which we take to be either the L1 or the L2-norm. [sent-53, score-0.116]
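A minimal sketch (in Python, with illustrative names; not the authors' released code) of the triplet energy d(h + ℓ, t) under either norm:

```python
import numpy as np

def transe_energy(h, l, t, dissimilarity="L1"):
    """Energy d(h + l, t) of a triplet: small when the translated head h + l
    lands near the tail embedding t. h, l, t are k-dimensional vectors."""
    diff = h + l - t
    if dissimilarity == "L1":
        return np.abs(diff).sum()
    return np.sqrt(np.square(diff).sum())  # L2 norm
```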
23 To learn such embeddings, we minimize a margin-based ranking criterion over the training set: L = Σ_{(h,ℓ,t) ∈ S} Σ_{(h′,ℓ,t′) ∈ S′(h,ℓ,t)} [γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]+ (1), where [x]+ denotes the positive part of x, γ > 0 is a margin hyperparameter, and S′(h,ℓ,t) = {(h′, ℓ, t) | h′ ∈ E} ∪ {(h, ℓ, t′) | t′ ∈ E} (2). [sent-54, score-0.114]
24 The set of corrupted triplets, constructed according to Equation (2), is composed of training triplets with either the head or tail replaced by a random entity (but not both at the same time). [sent-55, score-0.819]
25 The loss function (1) favors lower values of the energy for training triplets than for corrupted triplets, and is thus a natural implementation of the intended criterion. [sent-56, score-0.398]
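A sketch of the corrupted-triplet construction of Equation (2) and of one summand of the loss (1); `entities` (a list of entity ids), `emb_e` and `emb_l` (dicts of numpy vectors) and the `transe_energy` sketch above are assumed names, not part of the paper:

```python
import random

def corrupt(h, l, t, entities):
    """Build a corrupted triplet as in Equation (2): replace either the head
    or the tail (never both) by an entity drawn at random."""
    if random.random() < 0.5:
        return random.choice(entities), l, t
    return h, l, random.choice(entities)

def margin_ranking_term(pos, neg, emb_e, emb_l, gamma=1.0):
    """One summand of the loss (1): [gamma + d(h + l, t) - d(h' + l, t')]_+ ."""
    h, l, t = pos
    hc, _, tc = neg
    d_pos = transe_energy(emb_e[h], emb_l[l], emb_e[t])
    d_neg = transe_energy(emb_e[hc], emb_l[l], emb_e[tc])
    return max(0.0, gamma + d_pos - d_neg)
```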
26 Note that for a given entity, its embedding vector is the same when the entity appears as the head or as the tail of a triplet. [sent-57, score-0.508]
27 The optimization is carried out by stochastic gradient descent (in minibatch mode), over the possible h, ℓ and t, with the additional constraints that the L2-norm of the embeddings of the entities is 1 (no regularization or norm constraints are given to the label embeddings ℓ). [sent-58, score-0.752]
28 This constraint is important for our model, as it is for previous embedding-based methods [3, 6, 2], because it prevents the training process from trivially minimizing L by artificially increasing entity embedding norms. [sent-59, score-0.382]
29 All embeddings for entities and relationships are first initialized following the random procedure proposed in [4]. [sent-61, score-0.746]
30 At each main iteration of the algorithm, the embedding vectors of the entities are first normalized. [sent-62, score-0.47]
31 Then, a small set of triplets is sampled from the training set, and will serve as the training triplets of the minibatch. [sent-63, score-0.65]
32 For each such triplet, we then sample a single corrupted triplet. [sent-64, score-0.073]
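Putting the pieces together, one possible reading of a single minibatch step (entity renormalization, corruption, and a subgradient update for the L1 dissimilarity) is sketched below; it reuses the `corrupt` helper above and is an illustration under those assumptions, not the paper's implementation:

```python
import numpy as np

def sgd_minibatch_step(triples, emb_e, emb_l, entities, lr=0.01, gamma=1.0):
    """One minibatch step: renormalize entity embeddings to unit L2 norm,
    pair each training triplet with a single corrupted one, and take a
    subgradient step on the hinge term with the L1 dissimilarity."""
    for e in emb_e:                                    # normalization step
        emb_e[e] /= np.linalg.norm(emb_e[e])
    for h, l, t in triples:
        hc, _, tc = corrupt(h, l, t, entities)         # sketch from above
        g_pos = np.sign(emb_e[h] + emb_l[l] - emb_e[t])    # grad of L1 energy
        g_neg = np.sign(emb_e[hc] + emb_l[l] - emb_e[tc])
        d_pos = np.abs(emb_e[h] + emb_l[l] - emb_e[t]).sum()
        d_neg = np.abs(emb_e[hc] + emb_l[l] - emb_e[tc]).sum()
        if gamma + d_pos - d_neg > 0:                  # hinge is active
            emb_e[h] -= lr * g_pos
            emb_e[t] += lr * g_pos
            emb_l[l] -= lr * (g_pos - g_neg)
            emb_e[hc] += lr * g_neg
            emb_e[tc] -= lr * g_neg
```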
33 3 Related work Section 1 described a large body of work on embedding KBs. [sent-67, score-0.106]
34 [Numbers of parameters of the compared models (table flattened by extraction): O(ne k), O(ne k + nr k^2), O(ne k + 2 nr k^2), O(ne k + nr k + 4k^2), O(ne k + nr k + 2k^3), O(ne k + nr k + 10k^2) and O(ne k + nr k), where ne is the number of entities and nr the number of relationships.] Table 2: Statistics of the data sets used in this paper and extracted from the two knowledge bases, WordNet and Freebase. [sent-73, score-0.302]
35 SE [3] embeds entities into Rk, and relationships into two matrices L1 ∈ Rk×k and L2 ∈ Rk×k such that d(L1 h, L2 t) is large for corrupted triplets (h, ℓ, t) (and small otherwise). [sent-86, score-0.91]
36 The basic idea is that when two entities belong to the same triplet, their embeddings should be close to each other in some subspace that depends on the relationship. [sent-87, score-0.548]
37 Using two different projection matrices for the head and for the tail is intended to account for the possible asymmetry of relationship ℓ. [sent-88, score-0.348]
38 SE, with L2 as the identity matrix and L1 taken so as to reproduce a translation is then equivalent to TransE. [sent-92, score-0.051]
39 We believe this is because (1) our model is a more direct way to represent the true properties of the relationship, and (2) optimization is difficult in embedding models. [sent-94, score-0.106]
40 A special case of this model corresponds to learning scores s(h, ℓ, t) (lower scores for corrupted triplets) of the form: s(h, ℓ, t) = h^T Lℓ t + ℓ1^T h + ℓ2^T t (3), where Lℓ ∈ Rk×k, ℓ1 ∈ Rk and ℓ2 ∈ Rk, all of them depending on ℓ. [sent-99, score-0.073]
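An illustrative sketch of the score in Equation (3), assuming numpy arrays of compatible shapes (names are not from the paper):

```python
import numpy as np

def lfm_special_case_score(h, t, L_l, l1, l2):
    """Score s(h, l, t) = h^T L_l t + l1^T h + l2^T t from Equation (3);
    all three relation-specific parameters depend on the label l."""
    return h @ L_l @ t + l1 @ h + l2 @ t
```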
41 We could not run experiments with this model (since it was published simultaneously with ours), but once again TransE has far fewer parameters: this could simplify the training and prevent underfitting, and may compensate for a lower expressiveness. [sent-102, score-0.05]
42 Its entities (termed synsets) correspond to word senses, and relationships define lexical relations between them. [sent-112, score-0.61]
43 Examples of triplets are (_score_NN_1, _hypernym, _evaluation_NN_1) or (_score_NN_2, _has_part, _musical_notation_NN_1). [sent-114, score-0.335]
44 First, to make a small data set to experiment on, we selected the subset of entities that are also present in the Wikilinks database and that also have at least 100 mentions in Freebase (for both entities and relationships). [sent-118, score-0.728]
45 We also removed relationships that simply reverse the head and tail of an existing one, such as the inverse of the relationship '/people/person/nationality'. [sent-120, score-0.348]
46 This resulted in 592,213 triplets with 14,951 entities and 1,345 relationships which were randomly split as shown in Table 2. [sent-121, score-0.859]
47 This led to a split with around 25k relationships and more than 17 million training triplets, which we refer to as FB1M. [sent-125, score-0.27]
48 For each test triplet, the head is removed and replaced by each of the entities of the dictionary in turn. [sent-128, score-0.535]
49 Dissimilarities (or energies) of those corrupted triplets are first computed by the models and then sorted in ascending order; the rank of the correct entity is finally stored. [sent-129, score-0.496]
50 This whole procedure is repeated while removing the tail instead of the head. [sent-130, score-0.105]
51 We report the mean of those predicted ranks and the hits@10, [sent-131, score-0.073]
52 i.e. the proportion of correct entities ranked in the top 10. [sent-133, score-0.364]
53 These metrics are indicative but can be flawed when some corrupted triplets end up being valid ones, from the training set for instance. [sent-134, score-0.398]
54 In this case, those may be ranked above the test triplet, but this should not be counted as an error because both triplets are true. [sent-135, score-0.297]
55 To avoid such a misleading behavior, we propose to remove from the list of corrupted triplets all the triplets that appear either in the training, validation or test set (except the test triplet of interest). [sent-136, score-0.811]
56 This ensures that none of the corrupted triplets belongs to the data set. [sent-137, score-0.348]
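A possible sketch of the ranking protocol just described for tail prediction (the head case is symmetric); `all_true` is assumed to be a set of every triplet from the training, validation and test sets, and passing it switches from the raw to the filtered setting. It reuses the `transe_energy` sketch above:

```python
def rank_of_tail(h, l, t, entities, emb_e, emb_l, all_true=None):
    """Rank of the correct tail t when every entity is tried in turn; in the
    filtered setting, other known true triplets are skipped instead of ranked."""
    target = transe_energy(emb_e[h], emb_l[l], emb_e[t])
    rank = 1
    for e in entities:
        if e == t:
            continue
        if all_true is not None and (h, l, e) in all_true:  # filtered setting
            continue
        if transe_energy(emb_e[h], emb_l[l], emb_e[e]) < target:
            rank += 1
    return rank

def mean_rank_and_hits10(test_triples, entities, emb_e, emb_l, all_true=None):
    """Mean predicted rank and hits@10 (proportion of correct entities in the top 10)."""
    ranks = [rank_of_tail(h, l, t, entities, emb_e, emb_l, all_true)
             for h, l, t in test_triples]
    return sum(ranks) / len(ranks), sum(r <= 10 for r in ranks) / len(ranks)
```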
57 In the following, we report mean ranks and hits@10 according to both settings: the original (possibly flawed) one is termed raw, while we refer to the newer one as filtered (or filt.). [sent-138, score-0.073]
58 We only provide raw results for experiments on FB1M. [sent-140, score-0.073]
59 Baselines The first method is Unstructured, a version of TransE which considers the data as mono-relational and sets all translations to 0 (it was already used as a baseline in [2]). [sent-141, score-0.083]
60 We also compare with RESCAL, the collective matrix factorization model presented in [11, 12], and the energy-based models SE [3], SME(linear)/SME(bilinear) [2] and LFM [6]. [sent-142, score-0.055]
61 RESCAL is trained via an alternating least-squares method, whereas the others are trained by stochastic gradient descent, like TransE. [sent-143, score-0.062]
62 While SME(linear), SME(bilinear), LFM and TransE have about the same number of parameters as Unstructured for low-dimensional embeddings, the other algorithms, SE and RESCAL, which learn at least one k × k matrix for each relationship, rapidly need to learn many parameters. [sent-145, score-0.13]
63 RESCAL needs about 87 times more parameters on FB15k because it requires a much larger embedding space than other models to achieve good performance. [sent-146, score-0.106]
64 We did not experiment on FB1M with RESCAL, SME(bilinear) and LFM for scalability reasons, in terms of number of parameters or training duration. [sent-147, score-0.05]
65 For RESCAL, we had to set the regularization parameter to 0 for scalability reasons, as indicated in [11], and chose the latent dimension k among {50, 250, 500, 1000, 2000} as the one that led to the lowest mean predicted ranks on the validation sets (using the raw setting). [sent-149, score-0.261]
66 For Unstructured, SE, SME(linear) and SME(bilinear), we selected the learning rate among {0.001, 0.01, 0.1}, k among {20, 50}, and selected the best model by early stopping using the mean rank on the validation sets (with a total of at most 1,000 epochs over the training data). [sent-150, score-0.383] [sent-196, score-0.138]
(Footnote 4: WN is composed of senses; its entities are denoted by the concatenation of a word, its part-of-speech tag and a digit indicating which sense it refers to.)
68 For LFM, we also used the mean validation ranks to select the model and to choose the latent dimension among {25, 50, 75}, the number of factors among {50, 100, 200, 500} and the learning rate among {0.001, 0.01, 0.1}. [sent-197, score-0.204]
69 For TransE, we selected the learning rate for stochastic gradient descent among {0.001, 0.01, 0.1}, the margin γ among {1, 2, 10} and the latent dimension k among {20, 50} on the validation set of each data set. [sent-204, score-0.132]
70 The dissimilarity measure d was set either to the L1 or L2 distance according to validation performance as well. [sent-205, score-0.088]
71 For all data sets, training time was limited to at most 1,000 epochs over the training set. [sent-210, score-0.1]
72 The best models were selected by early stopping using the mean predicted ranks on the validation sets (raw setting). [sent-211, score-0.142]
73 As expected, the filtered setting provides lower mean ranks and higher hits@10, which we believe give a clearer evaluation of the performance of the methods in link prediction. [sent-215, score-0.144]
74 However, generally the trends between raw and filtered are the same. [sent-216, score-0.073]
75 5% on a subset of 50k triplets of the training set, whereas TransE reaches 127 and 42. [sent-224, score-0.325]
76 For Unstructured (i.e. TransE without translation), mean ranks appear to be rather good (best runner-up on WN), but hits@10 are very poor. [sent-232, score-0.095]
77 Unstructured simply clusters all co-occurring entities together, independently of the relationships involved, and hence can only guess which entities are related. [sent-233, score-0.926]
78 On FB1M, the mean ranks of TransE and Unstructured are very similar, but TransE places 10 times more predictions in the top 10. [sent-234, score-0.073]
79 Bold indicates the test triplet’s true tail and italics other true tails present in the training set. [sent-291, score-0.228]
80 We categorized the relationships according to the cardinalities of their head and tail arguments into four classes: 1-TO-1, 1-TO-MANY, MANY-TO-1, MANY-TO-MANY. [sent-305, score-0.452]
81 A given relationship is 1-TO-1 if a head can appear with at most one tail, 1-TO-MANY if a head can appear with many tails, MANY-TO-1 if many heads can appear with the same tail, or MANY-TO-MANY if multiple heads can appear with multiple tails. [sent-306, score-0.552]
82 We classified the relationships into these four classes by computing, for each relationship ℓ, the averaged number of heads h (resp. tails t) appearing in the FB15k data set, given a pair (ℓ, t) (resp. a pair (h, ℓ)). [sent-307, score-0.328] [sent-308, score-0.051]
84 For example, a relationship having an average of 1. [sent-312, score-0.094]
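A rough sketch of how such a categorization could be computed from the triplets; the 1.5 cut-off used below is an illustrative assumption, not a value given in the text above:

```python
from collections import defaultdict

def relation_categories(triples, threshold=1.5):
    """Label each relationship as 1-TO-1, 1-TO-MANY, MANY-TO-1 or MANY-TO-MANY
    from the average number of heads per (l, t) pair and tails per (h, l) pair."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, l, t in triples:
        heads[(l, t)].add(h)
        tails[(h, l)].add(t)
    categories = {}
    for l in {l for _, l, _ in triples}:
        hpt = [len(v) for (ll, _), v in heads.items() if ll == l]  # heads per tail
        tph = [len(v) for (_, ll), v in tails.items() if ll == l]  # tails per head
        side_h = "1" if sum(hpt) / len(hpt) <= threshold else "MANY"
        side_t = "1" if sum(tph) / len(tph) <= threshold else "MANY"
        categories[l] = f"{side_h}-TO-{side_t}"
    return categories
```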
85 First, it appears that, as one would expect, it is easier to predict entities on the “side 1” of triplets (i.e. predicting the head in 1-TO-MANY and the tail in MANY-TO-1), that is, when multiple entities point to it. [sent-321, score-0.639] [sent-323, score-0.618]
87 Unstructured performs well on 1-TO-1 relationships: this shows that arguments of such relationships must share common hidden types that Unstructured is able to somewhat uncover by clustering entities linked together in the embedding space. [sent-326, score-0.668]
88 Adding translations (i.e. upgrading Unstructured into TransE) brings the ability to move in the embedding space from one entity cluster to another by following relationships. [sent-330, score-0.332]
89 Illustration Table 5 gives examples of link prediction results of TransE on the FB15k test set (predicting tail). [sent-332, score-0.071]
90 Given a head and a label, the top predicted tails are listed. (Figure 1 caption: Learning new relationships with few examples.) [sent-334, score-0.347]
91 4 Learning to predict new relationships with few examples Using FB15k, we wanted to test how well methods could generalize to new facts by checking how fast they were learning new relationships. [sent-341, score-0.279]
92 To that end, we randomly selected 40 relationships and split the data into two sets: a set (named FB15k-40rel) containing all triplets with these 40 relationships and another set (FB15k-rest) containing the rest. [sent-342, score-0.693]
93 FB15k-rest has then been split into a training set of 353,788 triplets and a validation set of 53,266, and FB15k-40rel into a training set of 40,000 triplets (1,000 for each relationship) and a test set of 45,159. [sent-344, score-0.741]
94 We repeated this procedure while using 0, 10, 100 and 1000 examples of each relationship in phase (2). [sent-346, score-0.094]
95 The performance of Unstructured is the best when no example of the unknown relationship is provided, because it does not use this information to predict. [sent-348, score-0.094]
96 5 Conclusion and future work We proposed a new approach to learn embeddings of KBs, focusing on the minimal parametrization of the model to primarily represent hierarchical relationships. [sent-352, score-0.202]
97 Although it remains unclear to us if all relationship types can be modeled adequately by our approach, breaking down the evaluation into categories (1-to-1, 1-to-Many, etc.) suggests that it performs well across all of them. [sent-354, score-0.116]
98 A three-way model for collective learning on multirelational data. [sent-442, score-0.061]
99 Learning new facts from knowledge bases with neural tensor networks and semantic word vectors. [sent-465, score-0.132]
100 Connecting language and knowledge bases with embedding models for relation extraction. [sent-478, score-0.147]
wordName wordTfidf (topN-words)
[('transe', 0.5), ('entities', 0.364), ('triplets', 0.275), ('sme', 0.252), ('relationships', 0.198), ('embeddings', 0.184), ('freebase', 0.153), ('se', 0.153), ('head', 0.149), ('entity', 0.148), ('county', 0.142), ('rescal', 0.142), ('unstructured', 0.14), ('lfm', 0.125), ('bilinear', 0.112), ('embedding', 0.106), ('hits', 0.105), ('tail', 0.105), ('kbs', 0.095), ('relationship', 0.094), ('wordnet', 0.088), ('triplet', 0.075), ('ranks', 0.073), ('raw', 0.073), ('corrupted', 0.073), ('mtv', 0.069), ('wn', 0.067), ('tbatch', 0.063), ('translations', 0.061), ('nr', 0.056), ('bordes', 0.055), ('translation', 0.051), ('tails', 0.051), ('training', 0.05), ('link', 0.049), ('ank', 0.047), ('ean', 0.047), ('movie', 0.047), ('validation', 0.047), ('award', 0.045), ('friend', 0.042), ('rk', 0.041), ('bases', 0.041), ('relational', 0.041), ('dissimilarity', 0.041), ('facts', 0.039), ('elder', 0.038), ('nn', 0.037), ('heads', 0.036), ('collective', 0.035), ('weston', 0.033), ('comedy', 0.032), ('oksana', 0.032), ('redicting', 0.032), ('sbatch', 0.032), ('wars', 0.032), ('yakhnenko', 0.032), ('ltered', 0.031), ('trained', 0.031), ('kb', 0.03), ('glorot', 0.029), ('senses', 0.028), ('liked', 0.028), ('expressiveness', 0.028), ('nickel', 0.028), ('compi', 0.028), ('gne', 0.028), ('latent', 0.027), ('expressive', 0.027), ('ranking', 0.026), ('tensor', 0.026), ('word', 0.026), ('awed', 0.026), ('tresp', 0.026), ('multirelational', 0.026), ('expressivity', 0.023), ('modeling', 0.023), ('test', 0.022), ('evaluation', 0.022), ('sets', 0.022), ('lexical', 0.022), ('patterns', 0.022), ('split', 0.022), ('appear', 0.022), ('tting', 0.021), ('table', 0.021), ('margin', 0.02), ('lm', 0.02), ('factorization', 0.02), ('ht', 0.02), ('ex', 0.02), ('category', 0.02), ('wanted', 0.02), ('minibatch', 0.02), ('connectivity', 0.019), ('among', 0.019), ('composed', 0.019), ('score', 0.019), ('learn', 0.018), ('besides', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
2 0.38354272 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
3 0.11121421 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
4 0.082623191 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
5 0.079766348 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
6 0.054422431 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
7 0.054190665 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
8 0.048959777 75 nips-2013-Convex Two-Layer Modeling
9 0.048482038 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
10 0.045008879 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
11 0.044401169 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
12 0.044007007 251 nips-2013-Predicting Parameters in Deep Learning
13 0.04358615 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
14 0.042007022 335 nips-2013-Transfer Learning in a Transductive Setting
15 0.040938023 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
16 0.040871117 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
17 0.039835818 98 nips-2013-Documents as multiple overlapping windows into grids of counts
18 0.039053809 5 nips-2013-A Deep Architecture for Matching Short Texts
19 0.037091646 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
20 0.036803957 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
topicId topicWeight
[(0, 0.115), (1, 0.064), (2, -0.039), (3, 0.015), (4, 0.064), (5, -0.106), (6, 0.02), (7, -0.009), (8, 0.037), (9, 0.015), (10, -0.04), (11, -0.019), (12, -0.026), (13, -0.034), (14, 0.041), (15, -0.02), (16, 0.105), (17, 0.109), (18, 0.021), (19, 0.075), (20, 0.006), (21, -0.105), (22, -0.04), (23, 0.073), (24, 0.068), (25, 0.115), (26, -0.024), (27, 0.102), (28, 0.062), (29, 0.056), (30, -0.096), (31, -0.196), (32, 0.178), (33, 0.045), (34, -0.066), (35, -0.036), (36, -0.039), (37, 0.054), (38, 0.057), (39, 0.051), (40, -0.047), (41, -0.087), (42, -0.075), (43, 0.044), (44, -0.078), (45, -0.101), (46, -0.074), (47, 0.098), (48, 0.084), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.9263525 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
2 0.85436374 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
Author: Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng
Abstract: Knowledge bases are an important resource for question answering and other tasks but often suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. In this paper we introduce an expressive neural tensor network suitable for reasoning over relationships between two entities. Previous work represented entities as either discrete atomic units or with a single entity vector representation. We show that performance can be improved when entities are represented as an average of their constituting word vectors. This allows sharing of statistical strength between, for instance, facts involving the “Sumatran tiger” and “Bengal tiger.” Lastly, we demonstrate that all models improve when these word vectors are initialized with vectors learned from unsupervised large corpora. We assess the model by considering the problem of predicting additional true relations between entities given a subset of the knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively. 1
3 0.69489902 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
4 0.62787104 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
5 0.45575225 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
Author: Nathaniel J. Smith, Noah Goodman, Michael Frank
Abstract: Language users are remarkably good at making inferences about speakers’ intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings. 1
6 0.4009532 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
7 0.39232954 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
8 0.36855119 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
9 0.35951236 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
10 0.32789245 294 nips-2013-Similarity Component Analysis
11 0.32704914 5 nips-2013-A Deep Architecture for Matching Short Texts
12 0.3269012 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
13 0.31375122 98 nips-2013-Documents as multiple overlapping windows into grids of counts
14 0.31066823 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
15 0.30315286 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
16 0.30230302 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
17 0.27110121 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
18 0.2652353 210 nips-2013-Noise-Enhanced Associative Memories
19 0.26385704 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
20 0.26372677 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
topicId topicWeight
[(16, 0.037), (33, 0.133), (34, 0.09), (36, 0.012), (41, 0.016), (49, 0.056), (56, 0.063), (70, 0.035), (72, 0.082), (74, 0.25), (85, 0.047), (89, 0.024), (93, 0.05), (95, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.76766104 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
2 0.73057669 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi
Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VBLRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1
3 0.6937353 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack
Abstract: The 1 -regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20, 000 variables. In this paper, we develop an algorithm B IG QUIC, which can solve 1 million dimensional 1 regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that B IG QUIC can achieve super-linear or even quadratic convergence rates. 1
4 0.67928618 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel
Author: Ping Li, Gennady Samorodnitsk, John Hopcroft
Abstract: The method of stable random projections is useful for efficiently approximating the lα distance (0 < α ≤ 2) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams. 1
5 0.66820824 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
6 0.65808594 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
7 0.63920838 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
8 0.62359381 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
9 0.62252271 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
10 0.60886401 244 nips-2013-Parametric Task Learning
11 0.60033357 167 nips-2013-Learning the Local Statistics of Optical Flow
12 0.58889282 64 nips-2013-Compete to Compute
13 0.58572936 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
14 0.58533198 121 nips-2013-Firing rate predictions in optimal balanced networks
15 0.58500224 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
16 0.58348519 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
17 0.58262783 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
18 0.58191478 183 nips-2013-Mapping paradigm ontologies to and from the brain
19 0.5818128 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
20 0.58177835 5 nips-2013-A Deep Architecture for Matching Short Texts