nips nips2013 nips2013-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. [sent-5, score-0.366]
2 This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. [sent-6, score-0.335]
3 One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. [sent-7, score-0.311]
4 In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. [sent-8, score-1.134]
5 Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. [sent-10, score-0.621]
6 1 Introduction The visual world is populated with a vast number of objects, the most appropriate labeling of which is often ambiguous, task specific, or admits multiple equally correct answers. [sent-11, score-0.271]
7 This has led to building labeled image data sets according to these artificial categories and in turn to building visual recognition systems based on N-way discrete classifiers. [sent-13, score-0.507]
8 While growing the number of labels and labeled images has improved the utility of visual recognition systems [7], scaling such systems beyond a limited number of discrete categories remains an unsolved problem. [sent-14, score-0.661]
9 This problem is exacerbated by the fact that N-way discrete classifiers treat all labels as disconnected and unrelated, resulting in visual recognition systems that cannot transfer semantic information about learned labels to unseen words or phrases. [sent-15, score-0.933]
10 One way of dealing with this issue is to respect the natural continuity of visual space instead of artificially partitioning it into disjoint categories [20]. [sent-16, score-0.299]
11 We propose an approach that addresses these shortcomings by training a visual recognition model with both labeled images and a comparatively large and independent dataset – semantic information from unannotated text data. [sent-17, score-0.887]
12 This deep visual-semantic embedding model (DeViSE) leverages textual data to learn semantic relationships between labels, and explicitly maps images into a rich semantic embedding space. [sent-18, score-1.044]
13 We show that this model performs comparably to state-of-the-art visual object classifiers when trained and evaluated on flat 1-of-N metrics, while simultaneously making fewer semantically unreasonable mistakes along the way. [sent-19, score-0.603]
14 The model exploits visual and semantic similarity to correctly predict object category labels for unseen categories, i.e. “zero-shot” classification, even when the number of unseen visual categories is 20,000 for a model trained on just 1,000 categories. [sent-21, score-0.853] [sent-23, score-0.528]
16 2 Previous Work The current state-of-the-art approach to image classification is a deep convolutional neural network trained with a softmax output layer. [sent-24, score-0.812]
17 However, as the number of classes grows, the distinction between classes blurs, and it becomes increasingly difficult to obtain sufficient numbers of training images for rare concepts. [sent-27, score-0.311]
18 One solution to this problem, termed WSABIE [20], is to train a joint embedding model of both images and labels, by employing an online learning-to-rank algorithm. [sent-28, score-0.345]
19 The proposed model contained two sets of parameters: (1) a linear mapping from image features to the joint embedding space, and (2) an embedding vector for each possible label. [sent-29, score-0.58]
20 Compared to the proposed approach, WSABIE only explored linear mappings from image features to the embedding space, and the available labels were only those provided in the image training set. [sent-30, score-0.672]
21 In the zero-shot work of Socher et al., the authors trained a linear mapping between the image representations and the word embeddings representing 8 classes for which they had labeled images, thus linking the image representation space to the embedding space. [sent-33, score-0.855]
22 They also trained a simple model to determine if a given image was from any of the 8 original classes or not (i. [sent-35, score-0.39]
23 When the model determined an image to be in the set of 8 classes, a separately trained softmax model was used to perform the 8-way classification; otherwise the model predicted the nearest class in the embedding space (in their setting, only 2 outlier classes were considered). [sent-38, score-1.145]
24 By contrast, our approach learns its semantic representation directly from unannotated data. [sent-41, score-0.313]
25 3 Proposed Approach Our objective is to leverage semantic knowledge learned in the text domain, and transfer it to a model trained for visual object recognition. [sent-42, score-0.839]
26 In parallel, we pre-train a state-of-the-art deep neural network for visual object recognition [11], complete with a traditional softmax output layer. [sent-44, score-0.843]
27 We then construct a deep visual-semantic model by taking the lower layers of the pre-trained visual object recognition network and re-training them to predict the vector representation of the image label text as learned by the language model. [sent-45, score-1.081]
28 Because synonyms tend to appear in similar contexts, this simple objective function drives the model to learn similar embedding vectors for semantically related words. [sent-53, score-0.36]
29 We trained a skip-gram text model on a corpus of 5. [sent-54, score-0.279]
30 The text of the web pages was tokenized into a lexicon of roughly 155,000 single- and multi-word terms consisting of common English words and phrases as well as terms from commonly used visual object recognition datasets [7]. [sent-58, score-0.449]
31 Our skip-gram model used a hierarchical softmax layer for predicting adjacent terms and was trained using a 20-word window with a single pass through the corpus. [sent-59, score-0.753]
32 We trained skip-gram models of varying hidden dimensions, ranging from 100-D to 2,000-D, and found 500- and 1,000-D embeddings to be a good compromise between training speed, semantic quality, and the ultimate performance of the DeViSE model described below. [sent-61, score-0.549]
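The following is a minimal, hedged sketch of the skip-gram setup just described (skip-gram objective, hierarchical softmax, 20-word window, 500-D vectors, single pass) using gensim; the library choice, the toy corpus, and parameters such as `min_count` are assumptions, not the authors' implementation.

```python
# Sketch only: train a skip-gram embedding model roughly matching the setup
# described above, then unit-norm a label vector for use as a visual target.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical toy corpus; the paper uses a large unannotated web-text corpus.
corpus = [
    ["tiger", "shark", "is", "a", "requiem", "shark"],
    ["the", "bull", "shark", "and", "the", "blue", "shark", "are", "requiem", "sharks"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=500,   # 500-D reported as a good compromise above
    window=20,         # 20-word context window
    sg=1,              # skip-gram objective
    hs=1, negative=0,  # hierarchical softmax instead of negative sampling
    min_count=1,       # assumption for this toy corpus
    epochs=1,          # a single pass through the corpus
)

def label_vector(term):
    """Unit-normed embedding used later as the target for the visual model."""
    v = model.wv[term].astype(np.float64)
    return v / np.linalg.norm(v)
```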
33 The semantic quality of the embedding representations learned by these models is impressive. [sent-62, score-0.506]
34 A visualization of the language embedding space over a subset of ImageNet labels indicates that the language model learned a rich semantic structure that could be exploited in vision tasks (Figure 1b). [sent-63, score-0.952]
35 2 Visual Model Pre-training The visual model architecture we employ is based on the winning model for the 1,000-class ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [11, 6]. [sent-65, score-0.366]
36 The deep neural network model consists of several convolutional filtering, local contrast normalization, and max-pooling layers, followed by several fully connected neural network layers trained using the dropout regularization technique [10]. [sent-66, score-0.3]
37 We trained this model with a softmax output layer, as described in [11], to predict one of 1,000 object categories from the ILSVRC 2012 1K dataset [7], and were able to reproduce their results. [sent-67, score-0.808]
38 3 Deep Visual-Semantic Embedding Model Our deep visual-semantic embedding model (DeViSE) is initialized from these two pre-trained neural network models (Figure 1a). [sent-70, score-0.332]
39 The embedding vectors learned by the language model are unit normed and used to map label terms into target vector representations. [sent-71, score-0.538]
40 The core visual model, with its softmax prediction layer now removed, is trained to predict these vectors for each image, by means of a projection layer and a similarity metric. [sent-72, score-1.1]
41 The projection layer is a linear transformation that maps the 4,096-D representation at the top of our core visual model into the 500- or 1,000-D representation native to our language model. [sent-73, score-0.576]
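As a hedged illustration of the projection layer described above, the sketch below maps the 4,096-D visual representation into the embedding space with a single linear layer and scores it against unit-normed label vectors by dot product; the use of PyTorch and details such as the absence of a bias term are assumptions.

```python
# Sketch only: linear projection from visual features to the embedding space.
import torch
import torch.nn as nn

class VisualSemanticProjection(nn.Module):
    def __init__(self, visual_dim=4096, embed_dim=500):
        super().__init__()
        self.proj = nn.Linear(visual_dim, embed_dim, bias=False)  # the matrix M

    def forward(self, visual_features, label_embeddings):
        # visual_features: (batch, 4096); label_embeddings: (num_labels, embed_dim), unit-normed.
        predicted = self.proj(visual_features)        # (batch, embed_dim)
        return predicted @ label_embeddings.t()       # similarity scores per label
```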
42 (Footnote 1) For example, the 9 nearest terms to tiger shark using cosine distance are bull shark, blacktip shark, shark, oceanic whitetip shark, sandbar shark, dusky shark, blue shark, requiem shark, and great white shark. [sent-74, score-0.399]
43 (Footnote 2) In [13], which introduced the skip-gram model for text, cosine similarity between vectors is used for measuring semantic similarity. [sent-76, score-0.329]
44 We also experimented with an L2 loss between visual and label embeddings, as suggested by Socher et al. [sent-84, score-0.353]
45 As above, the model was presented only with images drawn from the ILSVRC 2012 1K training set, but now trained to predict the term strings as text. [sent-88, score-0.396]
46 The parameters of the projection layer M were first trained while holding both the core visual model and the text representation fixed. [sent-89, score-0.66]
47 In the later stages of training, the derivative of the loss function was backpropagated into the core visual model to fine-tune its output, which typically improved accuracy by 1-3% (absolute). [sent-90, score-0.382]
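The exact loss is not fully spelled out in this extraction, so the sketch below is a hedged reconstruction of a margin-based rank loss on dot-product similarity, together with the two-stage schedule described above (projection trained first with the visual core frozen, then end-to-end fine-tuning); the margin value and the summation over all contrast labels are assumptions.

```python
# Sketch only: margin rank loss over projected image vectors vs. label vectors.
import torch

def hinge_rank_loss(pred, label_vecs, target_idx, margin=0.1):
    # pred: (batch, d) projected image vectors; label_vecs: (num_labels, d), unit-normed.
    scores = pred @ label_vecs.t()                        # (batch, num_labels)
    true = scores.gather(1, target_idx.unsqueeze(1))      # similarity to the true label
    viol = torch.clamp(margin - true + scores, min=0.0)   # hinge over every label
    mask = torch.ones_like(viol)
    mask.scatter_(1, target_idx.unsqueeze(1), 0.0)        # exclude the true label's own term
    return (viol * mask).sum(dim=1).mean()

# Stage 1: train only the projection while the visual core and label vectors stay fixed.
# for p in visual_core.parameters(): p.requires_grad_(False)
# Stage 2: unfreeze the core visual model and fine-tune end-to-end (typically +1-3% absolute).
```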
48 At test time, when a new image arrives, one first computes its vector representation using the visual model and the transformation layer; then one needs to look for the nearest labels in the embedding space. [sent-92, score-0.848]
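A minimal sketch of this test-time procedure follows: push the image through the visual core and the projection, then rank all label embeddings by cosine similarity. The names `visual_core`, `projection`, `label_vecs`, and `label_names` are placeholders for the components discussed in the text.

```python
# Sketch only: nearest-label lookup in the embedding space at test time.
import torch
import torch.nn.functional as F

def predict_labels(image, visual_core, projection, label_vecs, label_names, k=5):
    with torch.no_grad():
        feat = visual_core(image.unsqueeze(0))         # (1, 4096) visual features
        vec = F.normalize(projection(feat), dim=1)     # (1, d), unit-normed prediction
        sims = vec @ F.normalize(label_vecs, dim=1).t()
        topk = sims.squeeze(0).topk(k).indices         # indices of the k nearest labels
    return [label_names[i] for i in topk.tolist()]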
49 4 Results The goals of this work are to develop a vision model that makes semantically relevant predictions even when it makes errors and that generalizes to classes outside of its labeled training set, i.e. zero-shot learning. [sent-95, score-0.471]
50 Both use the trained visual model described in Section 3. [sent-99, score-0.424]
51 In order to demonstrate parity with the softmax baseline on the most commonly-reported metric, we compute “flat” hit@k metrics – the percentage of test images for which the model returns the one true label in its top k predictions. [sent-101, score-0.765]
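A small sketch of this "flat" hit@k computation, directly following the definition above; the list-based data layout is an assumption.

```python
# Sketch only: fraction of test images whose true label is in the top-k predictions.
def flat_hit_at_k(predictions, true_labels, k):
    """predictions: list of ranked label lists; true_labels: list of true labels."""
    hits = sum(1 for preds, t in zip(predictions, true_labels) if t in preds[:k])
    return hits / len(true_labels)
```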
52 To measure the semantic quality of predictions beyond the true label, we employ a hierarchical precision@k metric based on the label hierarchy provided with the ImageNet dataset. (Footnote 3: The margin was chosen to be a fraction of the norm of the vectors, which is 1.) [sent-102, score-0.613]
53 (Footnote 4) ImageNet image labels are synsets, a set of synonymous terms, where each term is a word or phrase. [sent-105, score-0.292]
54 In this case, the language model should continue to train simultaneously in order to maintain the global semantic structure over all terms in the vocabulary. [sent-108, score-0.373]
55 In particular, for each true label and value of k, we generate a ground truth list from the semantic hierarchy, and compute a per-example precision equal to the fraction of the model’s k predictions that overlap with the ground truth list. [sent-161, score-0.48]
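The sketch below is a hedged rendering of that per-example computation; `expand_ground_truth` is a hypothetical helper that returns the true label plus its hierarchy neighbors for a given k, since the exact expansion rule is not reproduced in this extraction.

```python
# Sketch only: hierarchical precision@k averaged over examples.
def hierarchical_precision_at_k(predictions, true_labels, expand_ground_truth, k):
    per_example = []
    for preds, t in zip(predictions, true_labels):
        gt = set(expand_ground_truth(t, k))            # hierarchy-derived ground-truth list
        overlap = sum(1 for p in preds[:k] if p in gt)
        per_example.append(overlap / k)                # fraction of the k predictions in the list
    return sum(per_example) / len(per_example)
```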
56 Table 1 shows results for the DeViSE model for 500- and 1000-dimensional skip-gram models compared to the random embedding and softmax baseline models, on both the flat and hierarchical metrics. [sent-166, score-0.834]
57 On the flat metric, the softmax baseline shows higher accuracy for k = 1, 2. [sent-167, score-0.503]
58 We expected the softmax model to be the best performing model on the flat metric, given that its cross-entropy training objective is most well matched to the evaluation metric, and are surprised that the performance of DeViSE is so close to softmax performance. [sent-169, score-0.979]
59 On the hierarchical metric, the DeViSE models show better semantic generalization than the softmax baseline, especially for larger k. [sent-170, score-0.686]
60 At k = 5, the 500-D DeViSE model shows a 3% relative improvement over the softmax baseline, and at k = 20 almost a 7% relative improvement. [sent-171, score-0.459]
61 This is a surprisingly large gain, considering that the softmax baseline is a reproduction of the best published model on these data. [sent-172, score-0.556]
62 The gap that exists between the DeViSE model and softmax baseline on the hierarchical metric reflects the benefit of semantic information above and beyond visual similarity [8]. [sent-173, score-1.187]
63 The gap between the DeViSE model and the random embeddings model establishes that the source of the gain is the well-structured embeddings learned by the language model not some other property of our architecture. [sent-174, score-0.475]
64 2 Generalization and Zero-Shot Learning A distinct advantage of our model is its ability to make reasonable inferences about candidate labels it has never visually observed. [sent-176, score-0.302]
65 Our softmax baseline is able to reproduce the performance of the model in [11] when evaluated with the same procedure. [sent-180, score-0.556]
66 [Figure 2 content spilled into text: panel labels (E, F) and example prediction lists such as "macaque", "titi monkey", "guenon monkey", and "pineapple, ananas", comparing our model with the softmax over ImageNet 1K; see the caption below.] [sent-196, score-0.501]
67 Figure 2: For each image, the top 5 zero-shot predictions of DeViSE+1K from the 2011 21K label set and the softmax baseline model, both trained on ILSVRC 2012 1K. [sent-214, score-0.869]
68 DeViSE-0 and DeViSE+1K are the same trained model, but DeViSE-0 is restricted to only predict zero-shot classes, whereas DeViSE+1K predicts both the zero-shot and the 1K training labels. [sent-248, score-0.29]
69 These are “zero-shot” data sets in the sense that our model has no visual knowledge of these labels, though embeddings for the labels were learned by the language model. [sent-251, score-0.696]
70 The softmax baseline is only able to predict labels from ILSVRC 2012 1K. [sent-252, score-0.739]
71 The DeViSE model is the same trained model as above, but it is evaluated in two ways: DeViSE-0 only predicts the zero-shot labels, and DeViSE+1K predicts zero-shot labels and the ILSVRC 2012 1K training labels. [sent-254, score-0.304]
72 Figure 2 shows label predictions for a handful of selected examples from this dataset to qualitatively illustrate model behavior. [sent-255, score-0.276]
73 Note that DeViSE successfully predicts a wide range of labels outside its training set, and furthermore, the incorrect predictions are generally semantically “close” to the desired label. [sent-256, score-0.47]
74 For example, in Figure 2 (a), the DeViSE model is able to predict a number of lens-related labels even though it was not trained on images in any of the predicted categories. [sent-258, score-0.516]
75 Figure 2 (d) illustrates a case where the top softmax prediction is quite good, but where it is unable to generalize to new labels and its remaining predictions are off the mark, while our model’s predictions are more plausible. [sent-259, score-0.783]
76 Figure 2 (f) shows a case where the softmax model emits more nearly correct labels than the DeViSE model. [sent-261, score-0.683]
77 To quantify the performance of the model on zero-shot data, we constructed from our ImageNet 2011 21K zero-shot data three test data sets of increasing difficulty based on the image labels’ tree distance from the training ILSVRC 2012 1K labels in the ImageNet label hierarchy [7]. [sent-262, score-0.584]
78 The easiest dataset, “2-hop”, comprises the 1,589 labels that are within two tree hops of the training labels, making them visually and semantically similar to the training set. [sent-263, score-0.468]
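As a hedged sketch of how such hop-based splits could be constructed, the snippet below treats the ImageNet is-a hierarchy as an undirected graph and keeps the zero-shot labels whose shortest-path distance to any training label is within the hop budget; the edge list, label sets, and the use of networkx are assumptions, not the authors' procedure.

```python
# Sketch only: select zero-shot labels within a given tree-hop distance of the training labels.
import networkx as nx

def hop_split(hierarchy_edges, train_labels, zero_shot_labels, max_hops):
    g = nx.Graph(hierarchy_edges)                  # undirected is-a hierarchy over label synsets
    lengths = nx.multi_source_dijkstra_path_length(g, set(train_labels), cutoff=max_hops)
    return [l for l in zero_shot_labels if lengths.get(l, float("inf")) <= max_hops]
```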
79 [Table header spilled into text: results by data set (2-hop, 3-hop, ImageNet 2011 21K) for DeViSE-0, DeViSE+1K, and the softmax baseline; numeric entries not recovered.] [sent-266, score-0.291]
80 Performance of DeViSE compared to the softmax baseline model across the same datasets as in Table 2. [sent-309, score-0.556]
81 Note that the softmax model can never directly predict the correct label so its precision@1 is 0. [sent-310, score-0.717]
82 The model of [12] uses a curated hierarchy over labels for zero-shot classification; without using this information, our model is close in performance on the 200 zero-shot class label task. [sent-320, score-0.511]
83 Since a traditional softmax visual model can never produce the correct label on zero-shot data, its performance would be 0% for all k. [sent-328, score-0.89]
84 To provide a stronger baseline for comparison, we compared the performance of our model and the softmax model on the hierarchical metric we employed above. [sent-330, score-0.736]
85 Although the softmax baseline model can never predict exactly the correct label, the hierarchical metric will give the model credit for predicting labels that are in the neighborhood of the correct label in the ImageNet hierarchy (for k > 1). [sent-331, score-1.271]
86 Visual similarity is strongly correlated with semantic similarity for nearby object categories [8], and the softmax model does leverage visual similarity between zero-shot and training images to make predictions that will be scored favorably. [sent-332, score-1.489]
87 The easiest dataset, “2-hop”, contains object categories that are as visually and semantically similar to the training set as possible. [sent-335, score-0.377]
88 For this dataset the softmax model outperforms the DeViSE model for hierarchical precision@2, demonstrating just how large a role visual similarity plays in predicting semantically “nearby” labels (Table 3). [sent-336, score-1.156]
89 For the two more difficult datasets, where there are more novel categories and the novel categories are less closely related to those in the training data set, DeViSE outperforms the softmax model at all measured hierarchical precisions. [sent-338, score-0.732]
90 The quantitative gains can be quite large, as much as 82% relative improvement over softmax performance, and qualitatively, the softmax model's predictions can be surprisingly unreasonable in some cases. [sent-339, score-0.91]
91 These results indicate that our architecture succeeds in leveraging the semantic knowledge captured by the language model to make reasonable predictions, even as test images become increasingly dissimilar from those used in the training set. [sent-343, score-0.55]
92 These experiments were performed on a particular 800/200 split of the 1,000 classes from ImageNet 2010: training and model tuning are performed using the 800 classes, and test images are drawn from the remaining 200 classes. [sent-345, score-0.281]
93 Taken together, these zero-shot experiments indicate that the DeViSE model can exploit both visual and semantic information to predict novel classes never before observed. [sent-347, score-0.664]
94 We have also shown that this model is able to make correct predictions across thousands of previously unseen classes by leveraging semantic knowledge elicited only from unannotated text. [sent-350, score-0.591]
95 Perhaps more importantly, though here we trained on a curated academic image dataset, our model’s architecture naturally lends itself to being trained on all available images that can be annotated with any text term contained in the (larger) vocabulary. [sent-356, score-0.642]
96 We believe that training on massive "open" image datasets of this form will dramatically improve the quality of visual object categorization systems. [sent-357, score-0.528]
97 Second, we believe that the 1-of-N (and nearly balanced) visual object classification problem is soon to be outmoded by practical visual object categorization systems that can handle very large numbers of labels [5] and the re-definition of valid label sets at test time. [sent-358, score-0.97]
98 For example, our model can be trained once on all available data, and simultaneously used in one application requiring only coarse object categorization. [sent-359, score-0.324]
99 Moreover, because test time computation can be sub-linear in the number of labels contained in the training set, our model can be used in exactly such systems with much larger numbers of labels, including overlapping or never-observed categories. [sent-364, score-0.295]
100 Moving forward, we are experimenting with techniques which more directly leverage the structure inherent in the learned language embedding, greatly reducing training costs of the joint model and allowing even greater scaling [15]. [sent-365, score-0.266]
wordName wordTfidf (topN-words)
[('softmax', 0.406), ('devise', 0.344), ('shark', 0.263), ('visual', 0.228), ('semantic', 0.21), ('embedding', 0.208), ('imagenet', 0.2), ('labels', 0.181), ('ilsvrc', 0.174), ('trained', 0.143), ('label', 0.125), ('image', 0.111), ('language', 0.11), ('semantically', 0.099), ('predictions', 0.098), ('baseline', 0.097), ('pineapple', 0.088), ('images', 0.084), ('text', 0.083), ('classes', 0.083), ('embeddings', 0.082), ('layer', 0.081), ('object', 0.08), ('hit', 0.079), ('unannotated', 0.071), ('categories', 0.071), ('deep', 0.071), ('oceanic', 0.07), ('hierarchical', 0.07), ('car', 0.069), ('mikolov', 0.067), ('similarity', 0.066), ('lens', 0.064), ('flat', 0.062), ('training', 0.061), ('jeffrey', 0.059), ('greg', 0.059), ('monkey', 0.059), ('recognition', 0.058), ('tomas', 0.057), ('metric', 0.057), ('predict', 0.055), ('model', 0.053), ('hierarchy', 0.053), ('ananas', 0.053), ('submarine', 0.053), ('whitecap', 0.053), ('corrado', 0.052), ('categorization', 0.048), ('precision', 0.047), ('curated', 0.046), ('jonathon', 0.046), ('representations', 0.046), ('correct', 0.043), ('samy', 0.043), ('kai', 0.043), ('learned', 0.042), ('socher', 0.041), ('shlens', 0.04), ('ilya', 0.04), ('core', 0.04), ('bengio', 0.04), ('labeled', 0.039), ('vision', 0.038), ('nearest', 0.035), ('never', 0.035), ('anemone', 0.035), ('anglais', 0.035), ('artichoke', 0.035), ('babbler', 0.035), ('bassoon', 0.035), ('bongo', 0.035), ('buggy', 0.035), ('cackler', 0.035), ('dressing', 0.035), ('frome', 0.035), ('guenon', 0.035), ('hussar', 0.035), ('missile', 0.035), ('patas', 0.035), ('scottsdale', 0.035), ('searcher', 0.035), ('telephoto', 0.035), ('titi', 0.035), ('tlabel', 0.035), ('zeroshot', 0.035), ('mark', 0.033), ('layers', 0.033), ('visually', 0.033), ('easiest', 0.033), ('unseen', 0.033), ('dean', 0.033), ('architecture', 0.032), ('representation', 0.032), ('predicts', 0.031), ('sutskever', 0.031), ('jia', 0.031), ('english', 0.031), ('bull', 0.031), ('fruit', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999857 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
2 0.25632194 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
3 0.22646731 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
5 0.21906406 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expertspecified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
6 0.20656371 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
7 0.14639826 251 nips-2013-Predicting Parameters in Deep Learning
8 0.13432148 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
9 0.13151756 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
10 0.12464756 75 nips-2013-Convex Two-Layer Modeling
11 0.11982576 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
12 0.11929997 84 nips-2013-Deep Neural Networks for Object Detection
13 0.11740539 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
14 0.11636496 331 nips-2013-Top-Down Regularization of Deep Belief Networks
16 0.11121452 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
17 0.1089502 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
18 0.10400632 5 nips-2013-A Deep Architecture for Matching Short Texts
19 0.10379659 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
20 0.094575495 226 nips-2013-One-shot learning by inverting a compositional causal process
topicId topicWeight
[(0, 0.193), (1, 0.122), (2, -0.212), (3, -0.142), (4, 0.226), (5, -0.202), (6, -0.072), (7, 0.02), (8, -0.068), (9, 0.026), (10, -0.125), (11, -0.014), (12, -0.032), (13, -0.021), (14, 0.021), (15, -0.051), (16, 0.084), (17, 0.017), (18, 0.014), (19, 0.118), (20, 0.004), (21, -0.073), (22, 0.009), (23, -0.013), (24, 0.029), (25, 0.143), (26, -0.01), (27, 0.001), (28, 0.051), (29, 0.061), (30, -0.068), (31, -0.116), (32, -0.01), (33, -0.038), (34, -0.044), (35, 0.025), (36, 0.054), (37, 0.003), (38, -0.054), (39, -0.002), (40, 0.041), (41, 0.019), (42, 0.038), (43, 0.062), (44, -0.003), (45, 0.054), (46, 0.029), (47, -0.031), (48, -0.069), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.97243816 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
2 0.90796298 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
3 0.8010329 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expertspecified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
4 0.72394991 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
5 0.71155649 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
6 0.69416612 84 nips-2013-Deep Neural Networks for Object Detection
7 0.69272619 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
8 0.678536 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
9 0.67044187 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
10 0.63778704 166 nips-2013-Learning invariant representations and applications to face verification
11 0.62938219 226 nips-2013-One-shot learning by inverting a compositional causal process
12 0.61181802 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
14 0.59297395 119 nips-2013-Fast Template Evaluation with Vector Quantization
15 0.58722132 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
16 0.58561105 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
17 0.57688344 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
18 0.57270634 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
19 0.56313384 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
20 0.54959512 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
topicId topicWeight
[(16, 0.019), (33, 0.602), (34, 0.05), (43, 0.014), (49, 0.018), (56, 0.047), (70, 0.024), (85, 0.026), (89, 0.017), (93, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.99272573 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
2 0.98704225 88 nips-2013-Designed Measurements for Vector Count Data
Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin
Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ Rn , and the + observed data are a vector of counts, Y ∈ Zm . The projection matrix is designed + by maximizing mutual information between Y and X, I(Y ; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y ; C). New analytic expressions for the gradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting). 1
3 0.9811306 217 nips-2013-On Poisson Graphical Models
Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1
4 0.97732353 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh
Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub–sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable. 1
5 0.97119802 160 nips-2013-Learning Stochastic Feedforward Neural Networks
Author: Yichuan Tang, Ruslan Salakhutdinov
Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1
6 0.96113116 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
7 0.95977205 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
8 0.94720644 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
9 0.9420734 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
10 0.93246645 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
11 0.90870726 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
12 0.88969004 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
13 0.8867923 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
14 0.8758716 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
15 0.87439322 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
16 0.87367594 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
17 0.87361372 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
18 0.87273449 331 nips-2013-Top-Down Regularization of Deep Belief Networks
19 0.8725878 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
20 0.87220001 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality