nips nips2013 nips2013-356 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. [sent-8, score-0.802]
2 Our deep learning model does not require any manually defined semantic or visual features for either words or images. [sent-11, score-0.474]
3 Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. [sent-12, score-1.682]
4 We then use novelty detection methods to differentiate unseen classes from seen classes. [sent-13, score-1.163]
5 We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. [sent-14, score-1.195]
6 1 Introduction. The ability to classify instances of an unseen visual class, called zero-shot learning, is useful in several situations. [sent-15, score-0.669]
7 In this work, we show how to make use of the vast amount of knowledge about the visual world available in natural language to classify unseen objects. [sent-17, score-0.693]
8 We attempt to model people’s ability to identify unseen objects even if the only knowledge about that object comes from reading about it. [sent-18, score-0.572]
9 We introduce a zero-shot model that can predict both seen and unseen classes. [sent-20, score-0.704]
10 For instance, without ever seeing a cat image, it can determine whether an image shows a cat or a known category from the training set such as a dog or a horse. [sent-21, score-0.721]
11 First, images are mapped into a semantic space of words that is learned by a neural network model [15]. [sent-25, score-0.65]
12 By learning an image mapping into this space, the word vectors get implicitly grounded by the visual modality, allowing us to give prototypical instances for various words. [sent-27, score-0.549]
13 Second, because classifiers prefer to assign test images into classes for which they have seen training examples, the model incorporates novelty detection, which determines whether a new image is on the manifold of known categories. [sent-28, score-1.056]
14 Otherwise, images are assigned to a class based on the likelihood of being an unseen category. [sent-30, score-0.82]
15 The first strategy prefers high accuracy for unseen classes, the second for seen classes. [sent-32, score-0.744]
16 Unlike previous work on zero-shot learning, which can only predict intermediate features or differentiate between various zero-shot classes [21, 27], our joint model achieves both state-of-the-art accuracy on known classes and reasonable performance on unseen classes. [sent-33, score-1.23]
17 Furthermore, compared to related work on knowledge transfer [21, 28] we do not require manually defined semantic [Figure 1: Overview of our cross-modal zero-shot model. The manifold of known classes (truck, horse, auto, dog) is learned from training images; a new test image from an unknown class (cat) is shown relative to this manifold.] [sent-34, score-1.288]
18 We first map each new testing image into a lower-dimensional semantic word vector space. [sent-35, score-0.679]
19 If the image is ‘novel’, meaning not on the manifold, we classify it with the help of unsupervised semantic word vectors. [sent-37, score-0.79]
20 In this example, the unseen classes are truck and cat. [sent-38, score-0.942]
21 or visual attributes for the zero-shot classes, allowing us to use state-of-the-art unsupervised and unaligned image features instead along with unsupervised and unaligned language corpora. [sent-39, score-0.51]
22 They are able to predict semantic features even for words for which they have not seen scans and experiment with differentiating between several zero-shot classes. [sent-46, score-0.51]
23 However, they do not classify new test instances into both seen and unseen classes. [sent-47, score-0.724]
24 [21] construct a set of binary attributes for the image classes that convey various visual characteristics, such as “furry” and “paws” for bears and “wings” and “flies” for birds. [sent-50, score-0.502]
25 [21, 10] were two of the first to use well-designed visual attributes of unseen classes to classify them. [sent-61, score-0.952]
26 Similar to our work, they use large unsupervised text corpora to learn semantic word representations. [sent-73, score-0.631]
27 Unless otherwise mentioned, all word vectors are initialized with pre-trained d = 50-dimensional word vectors from the unsupervised model of Huang et al. [sent-82, score-0.64]
28 Using free Wikipedia text, their model learns word vectors by predicting how likely it is for each word to occur in its context. [sent-84, score-0.536]
29 Their model uses both local context in the window around each word and global document context, thus capturing distributional syntactic and semantic information. [sent-85, score-0.614]
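As a concrete illustration of how such a semantic word space can be queried, here is a minimal sketch that finds nearest neighbors by cosine similarity. This is a stand-in, not the authors' code: the vocabulary is tiny and the 50-dimensional vectors are random placeholders for the pre-trained Huang et al. [15] embeddings.

```python
# A minimal sketch of querying a semantic word space for nearest neighbors.
# Assumption: the 50-d vectors are random stand-ins for pre-trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "horse", "truck", "automobile", "ship"]
W = rng.normal(size=(len(vocab), 50))           # one 50-d word vector per class name
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit norm, so dot product = cosine

def nearest_neighbors(word, k=3):
    """Words whose vectors are closest (by cosine similarity) to `word`."""
    i = vocab.index(word)
    order = np.argsort(-(W @ W[i]))
    return [vocab[j] for j in order if j != i][:k]

print(nearest_neighbors("cat"))  # with real embeddings: semantically related words
```

With real embeddings, the neighbors of cat would be semantically related words such as dog, which is what makes the zero-shot mapping below possible.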
30 4 Projecting Images into Semantic Word Spaces. In order to learn semantic relationships and class membership of images, we project the image feature vectors into the d-dimensional semantic word space F. [sent-90, score-1.288]
31 Some of the classes y in this set will have available training data, others will be zero-shot classes without any training data. [sent-92, score-0.604]
32 We define the former as the seen classes Ys and the latter as the unseen classes Yu . [sent-93, score-1.175]
33 Let W = Ws ∪ Wu be the set of word vectors in Rd for the seen and unseen visual classes, respectively. [sent-94, score-1.041]
34 All training images x(i) ∈ Xy of a seen class y ∈ Ys are mapped to the word vector wy corresponding to the class name. [sent-95, score-0.97]
35 By projecting images into the word vector space, we implicitly extend the semantics with a visual grounding, allowing us to query the space, for instance for prototypical visual instances of a word. [sent-100, score-0.632]
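A sketch of this projection step follows, assuming a two-layer network trained with an L2 loss that pulls each seen-class image feature toward its class word vector w_y. The stand-in data, dimensions, and hyperparameters are illustrative assumptions, not the paper's settings.

```python
# Sketch: train a two-layer mapping so images land near their class word vector.
import numpy as np

rng = np.random.default_rng(0)
n, d_img, h, d_word = 200, 100, 60, 50
X = rng.normal(size=(n, d_img))             # image feature vectors x(i)
y = rng.integers(0, 8, size=n)              # labels for 8 seen classes
W_words = rng.normal(size=(8, d_word))      # class word vectors w_y

T1 = 0.1 * rng.normal(size=(d_img, h))      # first-layer weights
T2 = 0.1 * rng.normal(size=(h, d_word))     # second-layer weights
lr = 1e-3
for step in range(500):
    H = np.tanh(X @ T1)                     # hidden representation
    F = H @ T2                              # mapped semantic vectors f
    err = F - W_words[y]                    # residual to the target word vector
    # Backpropagation of the mean squared error.
    gT2 = H.T @ err / n
    gH = (err @ T2.T) * (1.0 - H ** 2)
    gT1 = X.T @ gH / n
    T1 -= lr * gT1
    T2 -= lr * gT2
```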
36 Fig. 2 shows a visualization of the 50-dimensional semantic space with word vectors and images of both seen and unseen classes. [sent-102, score-1.522]
37 We can observe that most classes are tightly clustered around their corresponding word vector while the zero-shot classes (cat and truck for this mapping) do not have close-by vectors. [sent-105, score-0.896]
38 However, the images of the two zero-shot classes are close to semantically similar classes (such as in the case of cat, which is close to dog and horse but is far away from car or ship). [sent-106, score-0.935]
39 This observation motivated the idea of first detecting images of unseen classes and then assigning them to the nearest zero-shot word vectors. [sent-107, score-1.271]
40 In general, we want to predict p(y|x), the conditional probability for both seen and unseen classes y ∈ Ys ∪ Yu given an image from the test set x ∈ Xt . [sent-109, score-1.096]
41 To achieve this we employ the semantic vectors f ∈ Ft to which these images have been mapped. [sent-110, score-0.659]
42 Word vector locations are highlighted, and mapped image locations are shown both for images on which this mapping was trained and for unseen images. [sent-112, score-1.053]
43 Let Xs be the set of all feature vectors for training images of seen classes and Fs their corresponding semantic vectors. [sent-115, score-0.997]
44 We predict a class y for a new input image x and its mapped semantic vector f via: p(y|x, Xs, Fs, W, θ) = Σ_{V ∈ {s,u}} P(y|V, x, Xs, Fs, W, θ) P(V|x, Xs, Fs, W, θ). [sent-117, score-0.603]
45 Marginalizing out the novelty variable V allows us to first distinguish between seen and unseen classes. [sent-118, score-0.855]
46 The seen image classifier can be a state-of-the-art softmax classifier, while the unseen classifier can be a simple Gaussian discriminator. [sent-120, score-0.888]
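The sketch below implements the marginalization in the equation above under simple stand-in choices: a softmax distribution over seen classes, isometric Gaussians around unseen-class word vectors, and a given novelty probability P(V = u|x). The function name, sigma2, and all inputs are illustrative assumptions.

```python
# Sketch of the marginalized prediction p(y|x) over seen and unseen classes.
import numpy as np
from scipy.stats import multivariate_normal

def predict_class_probs(f, seen_logits, W_unseen, p_unseen, sigma2=1.0):
    """f: mapped semantic vector of the test image; seen_logits: softmax scores
    for the seen classes; W_unseen: (k, d) word vectors of the unseen classes;
    p_unseen: P(V=u|x) from a novelty detector."""
    # Seen branch: standard softmax.
    e = np.exp(seen_logits - seen_logits.max())
    p_seen_classes = e / e.sum()
    # Unseen branch: isometric Gaussian log-likelihoods, normalized in log space.
    loglik = np.array([multivariate_normal.logpdf(f, mean=w, cov=sigma2)
                       for w in W_unseen])
    p_unseen_classes = np.exp(loglik - loglik.max())
    p_unseen_classes /= p_unseen_classes.sum()
    # Marginalize over V: [seen classes..., unseen classes...].
    return np.concatenate([(1.0 - p_unseen) * p_seen_classes,
                           p_unseen * p_unseen_classes])
```

The argmax over the concatenated seen-then-unseen entries gives the predicted label.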
47 5.1 Strategies for Novelty Detection. We now consider two strategies for predicting whether an image is of a seen or unseen class. [sent-122, score-0.817]
48 The term P (V = u|x, Xs , Fs , W, θ) is the probability of an image being in an unseen class. [sent-123, score-0.685]
49 An image from an unseen class will not be very close to the existing training images but will still be roughly in the same semantic region. [sent-124, score-1.304]
50 For instance, cat images are closest to dogs even though they are not as close to the dog word vector as most dog images are. [sent-125, score-1.124]
51 Hence, at test time, we can use outlier detection methods to determine whether an image is in a seen or unseen class. [sent-126, score-0.996]
52 Both are computed on the manifold of training images that were mapped to the semantic word space. [sent-128, score-0.965]
53 The mapped points of seen classes are used to obtain this marginal. [sent-131, score-0.478]
54 The Gaussian of each class is parameterized by the corresponding semantic word vector wy for its mean and a covariance matrix Σy that is estimated from all the mapped training points with that label. [sent-133, score-0.851]
55 For a new image x, the outlier detector then becomes the indicator function that is 1 if the marginal probability is below a certain threshold Ty for all the classes: P(V = u | f, Xs, W, θ) := 1{∀y ∈ Ys : P(f | Fy, wy) < Ty}. We provide an experimental analysis for various thresholds T below. [sent-135, score-0.457]
56 The thresholds are selected so that at least some fraction of the vectors from training images lie above the threshold, that is, are classified as belonging to a seen class. [sent-136, score-0.485]
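A sketch of this Gaussian threshold strategy follows: one Gaussian per seen class, with mean fixed to the class word vector w_y and covariance estimated from that class's mapped points, and each T_y chosen as a quantile of the class's own training densities. The regularizer, quantile, and synthetic data are assumptions.

```python
# Sketch of the Gaussian threshold novelty detector defined above.
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(F_train, labels, W_seen, reg=1e-3):
    gs = []
    for y, w_y in enumerate(W_seen):
        pts = F_train[labels == y]
        cov = np.cov(pts, rowvar=False) + reg * np.eye(pts.shape[1])  # regularized
        gs.append(multivariate_normal(mean=w_y, cov=cov))
    return gs

def fit_thresholds(gs, F_train, labels, keep=0.95):
    # Pick T_y so that a `keep` fraction of class-y training points scores above it.
    return [np.quantile(g.logpdf(F_train[labels == y]), 1.0 - keep)
            for y, g in enumerate(gs)]

def is_unseen(f, gs, T):
    # Indicator from the text: unseen iff the density is below T_y for every seen y.
    return all(g.logpdf(f) < t for g, t in zip(gs, T))

rng = np.random.default_rng(0)
d = 10                                       # small d keeps covariance estimation sane
W_seen = rng.normal(size=(8, d))
labels = np.repeat(np.arange(8), 100)
F_train = W_seen[labels] + 0.3 * rng.normal(size=(len(labels), d))
gs = fit_class_gaussians(F_train, labels, W_seen)
T = fit_thresholds(gs, F_train, labels)
print(is_unseen(5.0 * rng.normal(size=d), gs, T))  # far-off point: likely True
```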
57 Then, we can obtain the conditional class probability using a weighted combination of classifiers for both seen and unseen classes (described below). [sent-140, score-0.971]
58 Fig. 2 shows that many unseen images are not technically outliers of the complete data manifold. [sent-142, score-0.774]
59 This probability can now be used to weight the seen and unseen classifiers by the appropriate amount, given our belief about the outlierness of a new test image. [sent-155, score-0.675]
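For this second strategy, the sketch below computes LoOP-style outlier scores (Local Outlier Probabilities), which map local density deviations through the Gaussian error function to a value in [0, 1] that can serve directly as P(V = u|x). Assumptions: k and lam are conventional LoOP defaults, and a test image is scored by appending it to the mapped training points rather than by the incremental scheme an efficient system would use.

```python
# Sketch of LoOP (Local Outlier Probabilities) over the mapped semantic vectors.
import numpy as np
from scipy.special import erf
from scipy.spatial.distance import cdist

def loop_probabilities(F, k=20, lam=3.0):
    """LoOP outlier probability for each row of F."""
    D = cdist(F, F)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    nn = np.argsort(D, axis=1)[:, :k]            # k nearest neighbors of each point
    knn_d = np.take_along_axis(D, nn, axis=1)
    sigma = np.sqrt((knn_d ** 2).mean(axis=1))   # probabilistic set distance (up to lam)
    plof = sigma / sigma[nn].mean(axis=1) - 1.0  # deviation from neighbors' density
    nplof = lam * np.sqrt((plof ** 2).mean())    # normalization over the data set
    return np.maximum(0.0, erf(plof / (nplof * np.sqrt(2.0))))

rng = np.random.default_rng(0)
F_train = rng.normal(size=(200, 50))             # mapped training images (seen classes)
f_new = rng.normal(size=(1, 50)) + 4.0           # a mapped test image off the manifold
p_unseen = loop_probabilities(np.vstack([F_train, f_new]))[-1]
print(p_unseen)                                  # high value -> weight unseen classifier
```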
60 For the zero-shot case where V = u we assume an isometric Gaussian distribution around each of the novel class word vectors and assign classes based on their likelihood. [sent-161, score-0.653]
61 For word vectors, we use a set of 50-dimensional word vectors from the Huang dataset [15] that correspond to each CIFAR category. [sent-165, score-0.536]
62 In this section we first analyze the classification performance for seen classes and unseen classes separately. [sent-168, score-1.175]
63 Then, we combine images from the two types of classes, and discuss the trade-offs involved in our two unseen class detection strategies. [sent-169, score-0.864]
64 6.1 Seen and Unseen Classes Separately. First, we evaluate the classification accuracy when presented only with images from classes that have been used in training. [sent-174, score-0.55]
67 [Figure 4: Comparison of accuracies for images from previously seen and unseen categories when unseen images are detected under the (a) Gaussian threshold model and (b) LoOP model, with a comparison in (c); the x-axis is the fraction of points classified as unseen (the unseen/outlier threshold).] [sent-217, score-1.958]
68 Accuracy on the 8 classes excluding cat and truck closely matches the SVM-based classification results in the original Coates and Ng paper [6], which used all 10 classes. [sent-221, score-0.429]
69 In this case, the classification is based on isometric Gaussians which amounts to simply comparing distances between word vectors of unseen classes and an image mapped into semantic space. [sent-223, score-1.678]
70 For instance, when cat and dog are taken out from training, the resulting zero-shot classification does not work well because none of the other 8 categories is similar enough to both classes to learn a good semantic distinction. [sent-225, score-0.892]
71 On the other hand, if cat and truck are taken out, then cat images can be mapped to the word space thanks to their similarity to dogs, and trucks can be distinguished thanks to car, yielding better performance. [sent-226, score-0.92]
72 We compare the performance when each image is passed through either of the two novelty detectors, which decide with a certain probability (in the second scenario) whether an image belongs to a class that was used in training. [sent-245, score-0.514]
73 Depending on this choice, the image is either passed through the softmax classifier for seen category images, or assigned to the class of the nearest semantic word vector for unseen category images. [sent-246, score-1.616]
74 Fig. 4 shows the accuracies for test images under the different choices made by the two novelty detection scenarios. [sent-248, score-0.513]
75 The test set includes an equal number of images from each category, with 8 categories having been seen before, and 2 being new. [sent-249, score-0.451]
76 Firstly, at the left extreme of the curve, the Gaussian unseen image detector treats all of the images as unseen, and the LoOP model takes the probability threshold for an image being unseen to be 0. [sent-251, score-1.633]
77 At this point, with all unseen images in the test set being treated as such, we achieve the highest accuracies, at 90% for this zero-shot pair. [sent-252, score-0.774]
78 Similarly, at the other extreme of the curve, all images are classified as belonging to a seen category, and hence the softmax classifier for seen images gives the best possible accuracy for these images. [sent-253, score-0.841]
79 Between the extremes, the curves for unseen image accuracies and seen image accuracies fall and rise at different rates. [sent-254, score-1.211]
80 Since the Gaussian model is liberal in designating an image as belonging to an unseen category, it treats more of the images as unseen, and hence we continue to get high unseen class accuracies along the curve. [sent-255, score-1.656]
81 The LoOP model, which tries to detect whether an image could be regarded as an outlier for each class, does not assign very high outlier probabilities to zero-shot images, due to a large number of them being spread out inside the manifold of seen images (see Fig. 2). [sent-256, score-1.055]
82 Hence, the LoOP model can be used in scenarios where one does not want to degrade the high performance on classes from the training set but still wants to allow for the possibility of unseen classes. [sent-259, score-0.845]
83 Fig. 4(c) shows that, since most images in the test set belong to previously seen categories, the LoOP model, which is conservative in assigning the unseen label, gives better overall accuracies than the Gaussian model. [sent-261, score-1.058]
84 In general, we can choose an acceptable threshold for seen class accuracy and achieve a corresponding unseen class accuracy. [sent-262, score-0.868]
85 For example, at 70% seen-class accuracy in the Gaussian model, unseen classes can be classified with accuracies between 15% and 30%, depending on the class. [sent-263, score-1.166]
86 6.3 Combining predictions for seen and unseen classes. The final step in our experiments is to perform the full Bayesian pipeline as defined by Equation 2. [sent-266, score-0.976]
87 We train each binary attribute classifier separately, and use the trained classifiers to construct attribute labels for unseen classes. [sent-276, score-0.68]
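A sketch of this attribute-based pipeline, in the style of Lampert et al. [21]: one binary classifier per attribute is trained on the seen classes, and an unseen class is scored by matching predicted attribute probabilities against its manually defined attribute signature. All data here, including the attribute matrix, are random stand-ins.

```python
# Sketch of per-attribute classifiers for the attribute-transfer baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(320, 64))            # image features of seen classes
y_train = np.repeat(np.arange(8), 40)           # 8 seen classes, 40 images each
A_seen = rng.integers(0, 2, size=(8, 10))       # binary attributes per seen class
A_seen[0], A_seen[1] = 0, 1                     # ensure both labels occur per attribute
A_unseen = rng.integers(0, 2, size=(2, 10))     # signatures of 2 unseen classes

# One classifier per attribute, trained on class-level attribute labels.
clfs = [LogisticRegression(max_iter=1000).fit(X_train, A_seen[y_train, a])
        for a in range(A_seen.shape[1])]

def predict_unseen(x):
    p = np.array([c.predict_proba(x[None])[0, 1] for c in clfs])  # P(attribute | x)
    # Log-likelihood of each unseen class's attribute signature under p.
    ll = (A_unseen * np.log(p + 1e-9) + (1 - A_unseen) * np.log(1 - p + 1e-9)).sum(1)
    return int(ll.argmax())

print(predict_unseen(rng.normal(size=64)))
```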
88 In the mapped space, we observe that of the 100 images assigned the highest probability of being an outlier, 12% are false positives. [sent-292, score-0.558]
89 This is intuitively explained by the fact that the mapping function gathers extra semantic information from the word vectors it is trained on, and images are able to cluster better around these assumed Gaussian centroids. [sent-294, score-0.851]
90 In the original space, there is no semantic information, and the Gaussian centroids need to be inferred from among the images themselves, which are not truly representative of the center of the image space for their classes. [sent-295, score-0.663]
91 [Figure 5: Comparison of accuracies for images from previously seen and unseen categories for the modified CIFAR-100 dataset, after training the semantic mapping with a one-layer network and a two-layer network; the x-axis is the fraction of points classified as seen.] [sent-303, score-1.669]
92 When all images are labeled as zero-shot, the peak accuracy for the 6 unseen classes is 52. [sent-312, score-1.093]
93 Because of the large semantic space corresponding to 100 classes, the proximity of an image to its appropriate class vector is dependent on the quality of the mapping into semantic space. [sent-315, score-0.809]
94 [Figure: classification accuracy as a function of the number of distractor words, shown for neighbors of cat and neighbors of truck.] [sent-324, score-0.9]
95 We would like zero-shot images to be classified correctly when there are a large number of unseen categories to choose from. [sent-325, score-0.631]
96 To test whether the model still picks out the correct unseen classes among many alternatives, we create a set of distractor words. [sent-328, score-0.89]
97 For the zero-shot classes cat and truck, the nearest-neighbor distractors include rabbit, kitten and mouse, among others. [sent-337, score-0.445]
98 This is consistent with our expectation that a certain number of closely related semantic neighbors would distract the classifier; however, beyond that limited set, other categories would be further away in semantic space and would not affect classification accuracy. [sent-346, score-0.718]
99 7 Conclusion. We introduced a novel model for jointly performing standard and zero-shot classification based on deep-learned word and image representations. [sent-347, score-0.437]
100 If the task were only to differentiate between various zero-shot classes, we could obtain accuracies of up to 90% with a fully unsupervised model. [sent-349, score-0.476]
wordName wordTfidf (topN-words)
[('unseen', 0.543), ('semantic', 0.29), ('classes', 0.25), ('word', 0.247), ('images', 0.231), ('cat', 0.179), ('novelty', 0.156), ('truck', 0.149), ('image', 0.142), ('outlier', 0.135), ('seen', 0.132), ('accuracies', 0.126), ('wy', 0.12), ('fs', 0.112), ('dog', 0.104), ('distractor', 0.097), ('mapped', 0.096), ('classi', 0.095), ('categories', 0.088), ('loop', 0.083), ('lampert', 0.078), ('visual', 0.077), ('distributional', 0.077), ('lof', 0.075), ('shot', 0.069), ('accuracy', 0.069), ('deer', 0.068), ('isometric', 0.068), ('pdist', 0.068), ('category', 0.065), ('unsupervised', 0.062), ('ys', 0.059), ('er', 0.058), ('attribute', 0.056), ('ship', 0.056), ('xs', 0.053), ('training', 0.052), ('pipeline', 0.051), ('neighbors', 0.05), ('classify', 0.049), ('manifold', 0.049), ('deep', 0.048), ('fy', 0.048), ('softmax', 0.046), ('class', 0.046), ('auto', 0.045), ('detection', 0.044), ('horse', 0.043), ('vectors', 0.042), ('unaligned', 0.042), ('mapping', 0.041), ('socher', 0.04), ('coates', 0.04), ('nearest', 0.04), ('transfer', 0.04), ('differentiate', 0.038), ('layer', 0.037), ('visualization', 0.037), ('baroni', 0.034), ('bruni', 0.034), ('classified', 0.034), ('attributes', 0.033), ('multimodal', 0.033), ('words', 0.033), ('representations', 0.033), ('semantically', 0.032), ('manning', 0.032), ('threshold', 0.032), ('text', 0.032), ('xy', 0.03), ('sentiment', 0.03), ('frog', 0.03), ('airplane', 0.03), ('erf', 0.03), ('cvpr', 0.03), ('acl', 0.03), ('predict', 0.029), ('object', 0.029), ('detectors', 0.028), ('thresholds', 0.028), ('dogs', 0.028), ('palatucci', 0.028), ('automobile', 0.026), ('afrl', 0.026), ('conservative', 0.026), ('gaussian', 0.026), ('domain', 0.026), ('linguistics', 0.026), ('features', 0.026), ('car', 0.025), ('train', 0.025), ('ers', 0.025), ('art', 0.025), ('liberal', 0.025), ('nouns', 0.025), ('grounding', 0.025), ('distinguish', 0.024), ('language', 0.024), ('salakhutdinov', 0.024), ('embeddings', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
2 0.25632194 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
3 0.2428896 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
4 0.19271715 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1
5 0.18542719 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
7 0.16079763 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
8 0.1494313 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
9 0.13896562 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
10 0.10171566 166 nips-2013-Learning invariant representations and applications to face verification
11 0.09962976 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
12 0.09781982 226 nips-2013-One-shot learning by inverting a compositional causal process
13 0.092353679 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
14 0.091922142 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
15 0.091064282 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
16 0.090640977 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
17 0.090638302 5 nips-2013-A Deep Architecture for Matching Short Texts
18 0.090258293 84 nips-2013-Deep Neural Networks for Object Detection
19 0.088709764 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
20 0.087695912 183 nips-2013-Mapping paradigm ontologies to and from the brain
topicId topicWeight
[(0, 0.18), (1, 0.108), (2, -0.185), (3, -0.116), (4, 0.204), (5, -0.157), (6, -0.043), (7, 0.021), (8, -0.081), (9, 0.061), (10, -0.168), (11, 0.023), (12, -0.051), (13, -0.025), (14, -0.035), (15, -0.077), (16, 0.139), (17, 0.013), (18, 0.01), (19, 0.158), (20, -0.014), (21, -0.032), (22, -0.023), (23, -0.024), (24, 0.008), (25, 0.189), (26, 0.044), (27, 0.006), (28, 0.007), (29, 0.039), (30, -0.066), (31, -0.071), (32, -0.022), (33, 0.085), (34, -0.022), (35, 0.035), (36, 0.063), (37, -0.004), (38, -0.057), (39, 0.069), (40, 0.012), (41, 0.023), (42, 0.027), (43, 0.083), (44, 0.036), (45, 0.026), (46, 0.076), (47, 0.031), (48, -0.001), (49, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.9774555 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
2 0.89257246 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov
Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1
3 0.84780973 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
4 0.72299379 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
5 0.70549935 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
6 0.69955647 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
7 0.63323408 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
8 0.63061297 226 nips-2013-One-shot learning by inverting a compositional causal process
9 0.62218183 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
10 0.61885029 84 nips-2013-Deep Neural Networks for Object Detection
11 0.61396748 166 nips-2013-Learning invariant representations and applications to face verification
12 0.61216331 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
13 0.57909912 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
14 0.57057959 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
15 0.56651515 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
16 0.55977589 119 nips-2013-Fast Template Evaluation with Vector Quantization
17 0.55320138 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
18 0.54849839 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
19 0.54686749 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
20 0.54335171 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
topicId topicWeight
[(16, 0.031), (33, 0.255), (34, 0.077), (41, 0.022), (49, 0.033), (56, 0.056), (70, 0.048), (85, 0.04), (89, 0.034), (93, 0.085), (95, 0.01), (98, 0.209)]
simIndex simValue paperId paperTitle
same-paper 1 0.86727124 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
2 0.81561512 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
Author: James L. Sharpnack, Akshay Krishnamurthy, Aarti Singh
Abstract: The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovász extended scan statistic (LESS), that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, k-nearest neighbor graphs, and ǫ-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds. 1
3 0.80011314 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
Author: Yu Zhang
Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1
4 0.79353124 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
5 0.78951234 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
6 0.78767133 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
7 0.78385651 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
8 0.78249794 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
9 0.78130633 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
10 0.78035718 160 nips-2013-Learning Stochastic Feedforward Neural Networks
11 0.77988851 335 nips-2013-Transfer Learning in a Transductive Setting
12 0.77742201 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
13 0.77713317 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
14 0.77708489 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
15 0.77708024 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
16 0.77534831 166 nips-2013-Learning invariant representations and applications to face verification
17 0.77514982 88 nips-2013-Designed Measurements for Vector Count Data
18 0.77411288 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.77382338 30 nips-2013-Adaptive dropout for training deep neural networks
20 0.77359205 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising