nips nips2012 nips2012-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. [sent-3, score-0.533]
2 Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. [sent-4, score-0.366]
3 In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. [sent-5, score-0.48]
4 When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. [sent-8, score-0.571]
5 Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. [sent-9, score-0.296]
6 1 Introduction The development of successful methods for training deep architectures has influenced the development of representation learning algorithms either on top of SIFT descriptors [1, 2] or raw pixel input [3, 4, 5] for feature extraction of full-sized images. [sent-11, score-0.327]
7 Furthermore, self-taught learning [6] can be employed, taking advantage of feature learning from image databases independent of the target dataset. [sent-13, score-0.304]
8 Image auto-annotation is a multi-label classification task of assigning a set of relevant, descriptive tags to an image where tags often come from a vocabulary of hundreds to thousands of words. [sent-14, score-0.705]
9 Tags may describe objects, colors, scenes, local regions of the image (e. [sent-17, score-0.238]
10 Consequently, many of the most successful annotation algorithms in the literature [7, 8, 9, 10, 11] have opted to focus on tag assignment and often fix a large number of hand-crafted features for input to their algorithms. [sent-22, score-0.405]
11 Our main contribution in this paper is to remove the need to compute over a dozen hand-crafted features for annotating images and consequently remove the need for feature selection. [sent-26, score-0.402]
12 We introduce a deep learning algorithm for learning hierarchical representations of full-sized color images from the pixel level, which may be seen as a generalization of the approach by Coates et al. [sent-27, score-0.453]
13 For annotation, we use the TagProp discriminative metric learning algorithm [9] which has enjoyed state-of-the-art performance on popular annotation benchmarks. [sent-31, score-0.278]
14 This gives the advantage of focusing new research on improving tag assignment algorithms without the need to decide which features are best suited for the task. [sent-34, score-0.163]
15 Figure 1: Sample annotation results on IAPRTC-12 (top) and ESP-Game (bottom) using TagProp when each image is represented by a 256-bit code. [sent-35, score-0.445]
16 The first column of tags is the gold standard and the second column contains the predicted tags. [sent-36, score-0.232]
17 Predicted tags shown in italics are those that are also in the gold standard. [sent-37, score-0.232]
18 [10] who construct visual synsets of images and Weston et al. [sent-40, score-0.168]
19 Our second contribution is to represent an image with a 256-bit code for annotation. [sent-42, score-0.203]
20 The authors of [14] performed an extensive analysis of small codes for image retrieval showing that even on databases with millions of images, linear search with Hamming distance can be performed efficiently. [sent-44, score-0.549]
21 We utilize an autoencoder with a single hidden layer on top of our learned hierarchical representations to construct codes. [sent-45, score-0.287]
22 In exchange, 256-bit codes are efficient to store and can be compared quickly with bitwise operations. [sent-47, score-0.166]
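As an illustration of the bitwise comparison meant above, the sketch below (not part of the paper; all names are illustrative) packs a 256-bit code into 32 bytes and computes Hamming distance with XOR and a popcount:

```python
import numpy as np

def pack_code(bits):
    """Pack a length-256 array of {0,1} bits into 32 bytes."""
    return np.packbits(np.asarray(bits, dtype=np.uint8))

def hamming(a, b):
    """Hamming distance between two packed codes via XOR + popcount."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Toy usage: two random 256-bit codes.
rng = np.random.default_rng(0)
c1 = pack_code(rng.integers(0, 2, 256))
c2 = pack_code(rng.integers(0, 2, 256))
print(hamming(c1, c2))  # number of differing bits, out of 256
```

Because each comparison touches only a handful of machine words, linear scans over large databases stay cheap, which is what makes the 256-bit representation attractive for storage and retrieval.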
23 To our knowledge, our approach is the first to learn binary codes from full-sized color images without the use of handcrafted features. [sent-48, score-0.401]
24 2 Hierarchical representation learning In this section we describe our approach for learning a deep feature representation from the pixel level of a color image. [sent-51, score-0.261]
25 Our approach involves aspects of typical pipelines: pre-processing and whitening, dictionary learning, convolutional extraction and pooling. [sent-52, score-0.22]
26 We define a module as a pass through each of the above operations. [sent-53, score-0.46]
27 Extract randomly selected patches from each image and apply pre-processing. [sent-57, score-0.203]
28 Convolve the dictionary with larger tiles extracted across the image with a pre-defined stride length. [sent-61, score-0.447]
29 Pool over the reassembled features with a 2 layer pyramid. [sent-64, score-0.163]
30 Given a receptive field of size r × c, we first extract np patches across all images of size r × c × 3, followed by flattening each patch into a column vector. [sent-77, score-0.401]
31 Next we follow [13] by performing ZCA whitening, which results in patches having zero mean, $\frac{1}{n_p}\sum_{i=1}^{n_p} x^{(i)} = 0$, and identity covariance, $\frac{1}{n_p}\sum_{i=1}^{n_p} x^{(i)} (x^{(i)})^T = I$. [sent-87, score-0.369]
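A minimal numpy sketch of this pre-processing step (mean centering followed by ZCA whitening of the flattened patches); the regularization constant `eps` is an assumption, not a value given in the paper:

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    """X: (n_p, d) matrix of flattened patches (d = r*c*3).
    Returns the whitened patches plus the mean and whitening matrix,
    so the same transform can be reused at convolution time."""
    mean = X.mean(axis=0)                 # per-dimension mean M
    Xc = X - mean                         # zero-mean patches
    cov = Xc.T @ Xc / X.shape[0]          # d x d covariance
    U, S, _ = np.linalg.svd(cov)          # eigendecomposition of the covariance
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening matrix
    return Xc @ W, mean, W
```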
32 K-SVD constructs a dictionary $D \in \mathbb{R}^{n \times k}$ and a sparse representation $\hat{S} \in \mathbb{R}^{k \times n_p}$ by solving the following optimization problem: $\min_{D,\hat{S}} \|S - D\hat{S}\|_F^2$ subject to $\|\hat{s}^{(i)}\|_0 \le q \;\forall i$ (1), where k is the desired number of bases. [sent-98, score-0.156]
33 When D is fixed, the problem of obtaining $\hat{S}$ can be decomposed into $n_p$ subproblems of the form $\min_{\hat{s}^{(i)}} \|s^{(i)} - D\hat{s}^{(i)}\|^2$ subject to $\|\hat{s}^{(i)}\|_0 \le q$, which can be solved approximately using batch orthogonal matching pursuit [15]. [sent-100, score-0.212]
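For concreteness, here is a plain (non-batch) orthogonal matching pursuit sketch for a single patch, assuming a fixed dictionary D with unit-norm columns; the batch variant of [15] adds precomputations for speed but solves the same subproblem:

```python
import numpy as np

def omp(D, x, q):
    """Greedy OMP: approximately minimize ||x - D s||^2 s.t. ||s||_0 <= q.
    D: (n, k) dictionary with unit-norm columns, x: (n,) signal."""
    residual = x.copy()
    support = []
    s = np.zeros(D.shape[1])
    for _ in range(q):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit coefficients on the selected atoms by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    s[support] = coef
    return s
```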
34 3 Convolutional feature extraction Given an image $I^{(i)}$, we first partition the image into a set of tiles $T^{(i)}$ of size $n_t \times n_t$ with a pre-defined stride length s between each tile. [sent-109, score-0.752]
35 Each patch in tile $T_t^{(i)}$ is processed in the same way as before dictionary construction (mean centering, contrast normalization, whitening), for which the mean and whitening matrices M and W are used. [sent-110, score-0.347]
36 Let $T_{tj}^{(i)}$ denote the t-th tile and j-th channel with respect to image $I^{(i)}$, and let $D_j^{(l)} \in \mathbb{R}^{r \times c}$ denote the l-th basis for channel j of D. [sent-111, score-0.312]
37 The encoding $f_{tl}^{(i)}$ for tile t and basis l is given by: $f_{tl}^{(i)} = \max\!\left(\tanh\!\left(\sum_{j=1}^{3} T_{tj}^{(i)} * D_j^{(l)}\right),\, 0\right)$ (3), where $*$ denotes convolution and the max and tanh operations are applied componentwise. [sent-112, score-0.281]
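A sketch of Eq. (3) for one tile and one dictionary element, assuming the tile has already been pre-processed with the stored mean and whitening matrices; scipy's 2-D convolution stands in for the convolution denoted by *:

```python
import numpy as np
from scipy.signal import convolve2d

def encode_tile(tile, basis):
    """tile:  (n_t, n_t, 3) pre-processed tile T_t
       basis: (r, c, 3)     dictionary element D^(l), one filter per channel.
       Returns the (n_t-r+1, n_t-c+1) map f_tl = max(tanh(sum_j T_tj * D_j^(l)), 0)."""
    acc = np.zeros((tile.shape[0] - basis.shape[0] + 1,
                    tile.shape[1] - basis.shape[1] + 1))
    for j in range(3):  # sum the per-channel convolutions
        acc += convolve2d(tile[:, :, j], basis[:, :, j], mode="valid")
    return np.maximum(np.tanh(acc), 0.0)
```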
38 Let ft denote the concatenated encodings over bases, which have a resulting dimension of (nt − r + 1) × (nt − c + 1) × k. [sent-120, score-0.156]
39 Figure 2: Left: D is convolved with each tile (large green square) with receptive field (small blue square) over a given stride. [sent-127, score-0.193]
40 4 Pooling The final step of our pipeline is to perform spatial pooling over the re-assembled regions of the encodings $f_t^{(i)}$. [sent-131, score-0.31]
41 Consider the l-th cross section corresponding to the l-th dictionary element, l ∈ {1, ..., k}. [sent-132, score-0.202]
42 We may then pool over each of the spatial regions of this cross section by summing over the activations of the corresponding spatial regions. [sent-136, score-0.228]
43 This is done in the form of a 2-layer spatial pyramid, where the base of the pyramid consists of 4 blocks of 2×2 tiling and the top of the pyramid consists of a single block across the whole cross section. [sent-137, score-0.317]
44 Once pooling is performed, the re-assembled encodings result in a shape of size 1 × 1 × k and 2 × 2 × k from each layer of the pyramid. [sent-139, score-0.238]
45 To obtain the final feature vector, each layer is flattened into a vector and the resulting vectors are concatenated into a single long feature vector of dimension 5k for each image I (i) . [sent-140, score-0.391]
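A sketch of the 2-layer pyramid pooling for a single cross-section (one dictionary element); sum pooling over a 2×2 grid plus one global block gives 5 values per basis, hence 5k features per image:

```python
import numpy as np

def pyramid_pool(cross_section):
    """cross_section: (H, W) re-assembled activations for one basis l.
    Returns 5 pooled values: 4 from the 2x2 base of the pyramid, 1 global."""
    H, W = cross_section.shape
    feats = []
    # base of the pyramid: 2x2 tiling, sum-pool each block
    for i in range(2):
        for j in range(2):
            block = cross_section[i * H // 2:(i + 1) * H // 2,
                                  j * W // 2:(j + 1) * W // 2]
            feats.append(block.sum())
    # top of the pyramid: a single block over the whole cross-section
    feats.append(cross_section.sum())
    return np.array(feats)

# Full feature: concatenate pyramid_pool over all k cross-sections -> 5k dims.
```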
46 5 Training multiple modules What we have described up until now is how to extract features using a single module corresponding to dictionary learning, extraction and pooling. [sent-143, score-0.772]
47 Once the first module has been trained, we can take the pooled features to be input to a second module. [sent-145, score-0.557]
48 Freezing the learned dictionary from the first module, we can then apply all the same steps a second time to the pooled representations. [sent-146, score-0.228]
49 To be more specific on the input to the second module, we use an additional spatial pooling operation on the re-assembled encodings of the first module, where we extract 256 blocks of 16 × 16 tiling, resulting in a representation of size 16×16×k. [sent-148, score-0.266]
50 As an illustration, the same operations for the second module are used as in figure 2 except the image is replaced with the 16 × 16 × k pooled features. [sent-151, score-0.695]
51 3 Code construction and discriminative metric learning In this section we first show how to learn binary codes from our learned features, followed by a review of the TagProp algorithm [9] used for annotation. [sent-153, score-0.239]
52 1 Learning binary codes for annotation Our codes are learned by adding an autoencoder with a single hidden layer on top of the learned output representations. [sent-155, score-0.802]
53 Let $f^{(i)} \in \mathbb{R}^{d_m}$ denote the learned representation for image $I^{(i)}$ of dimension $d_m$ using either a one- or two-module architecture. [sent-156, score-0.697]
54 Figure 3: Coding layer activation values after training the autoencoder. As is, the optimization does not take into consideration the rounding used in the coding layer and consequently the output is not binary. [sent-160, score-0.313]
55 We follow [17] and use additive ‘deterministic’ Gaussian noise with zero mean in the coding layer that is fixed in advance for each datapoint when performing a bottom-up pass through the network. [sent-163, score-0.201]
56 Figure 3 shows the coding layer activation values after backpropagation when noise has been added. [sent-166, score-0.165]
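The following sketch illustrates the coding-layer computation described above: fixed zero-mean Gaussian noise added before the sigmoid during training, and rounding of the activations to obtain the 256-bit code. The weight shapes and the way the fixed noise is supplied are illustrative assumptions, not details given in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coding_layer(f, W_enc, b_enc, noise=None):
    """One bottom-up pass to the 256-unit coding layer.
    f: (d_m,) learned feature vector, W_enc: (d_m, 256), b_enc: (256,).
    During training, `noise` is a zero-mean Gaussian vector fixed in advance
    for each datapoint and added pre-sigmoid; it pushes activations to 0/1."""
    pre = f @ W_enc + b_enc
    if noise is not None:
        pre = pre + noise
    return sigmoid(pre)

def binary_code(f, W_enc, b_enc):
    """Round the (noise-free) coding-layer activations to get a 256-bit code."""
    return (coding_layer(f, W_enc, b_enc) > 0.5).astype(np.uint8)
```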
57 2 The tag propagation (TagProp) algorithm Let V denote a fixed vocabulary of tags and I denote a list of input images. [sent-168, score-0.368]
58 Our goal at test time, given a new input image i′, is to assign a set of tags v ∈ V that are most relevant to the content of i′. [sent-169, score-0.435]
59 More specifically, let yiw ∈ {1, −1}, i ∈ I, w ∈ V be an indicator for whether tag w is present in image i. [sent-171, score-0.408]
60 In TagProp, the probability that $y_{iw} = 1$ is given by $p(y_{iw} = 1) = \sigma(\alpha_w x_{iw} + \beta_w)$, where $x_{iw} = \sum_j \pi_{ij} y_{jw}$, $\sigma(z) = (1 + \exp(-z))^{-1}$ is the logistic function, $(\alpha_w, \beta_w)$ are word-specific model parameters to be estimated and $\pi_{ij}$ are distance-based weights also to be estimated. [sent-172, score-0.213]
61 More specifically, $\pi_{ij}$ is expressed as $\pi_{ij} = \frac{\exp(-d_h(i,j))}{\sum_{j'} \exp(-d_h(i,j'))}$, with $d_h(i,j) = h\, d_{ij}$, $h \ge 0$ (4), where we shall call $d_{ij}$ the base distance between images i and j. [sent-173, score-0.274]
62 The model is trained to maximize the following quasi-likelihood of the data: $L = \sum_{i,w} c_{iw} \log p(y_{iw})$, where $c_{iw} = 1/n_+$ if $y_{iw} = 1$ and $1/n_-$ otherwise, $n_+$ is the total number of positive labels of w, and likewise for $n_-$ and missing labels. [sent-175, score-0.213]
63 Combined with the logistic word models, it accounts for much higher recall in rare tags which would normally be less likely to be recalled in a basic k-NN setup. [sent-177, score-0.293]
64 The choice of base distance used depends on the image representation. [sent-179, score-0.279]
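A sketch of TagProp's prediction rule as written above: exponential distance-based weights over the K nearest training images (Eq. 4), a weighted neighbour vote x_iw, and the word-specific logistic model. Learning h, alpha and beta by maximizing the quasi-likelihood is omitted here:

```python
import numpy as np

def tagprop_predict(d, Y, h, alpha, beta):
    """d: (K,) base distances from a test image to its K nearest training images
       Y: (K, |V|) tag indicators in {1, -1} for those neighbours
       h >= 0: distance scaling; alpha, beta: (|V|,) word-specific parameters.
       Returns p(y_w = 1) for every tag w in the vocabulary."""
    weights = np.exp(-h * d)
    pi = weights / weights.sum()          # pi_j as in Eq. (4)
    x = pi @ Y                            # weighted vote x_w = sum_j pi_j y_jw
    return 1.0 / (1.0 + np.exp(-(alpha * x + beta)))

# Annotation then keeps the 5 most probable tags per image, as in the experiments.
```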
65 For all our experiments, we use k1 = 512 first module bases, k2 = 1024 second module bases, receptive field sizes of 6 × 6 and 2 × 2 and tile sizes (nt ) of 16 × 16 and 6 × 6. [sent-187, score-1.041]
66 The total number of features for the combined first and second module representation is thus 5(k1 + k2 ) = 7680. [sent-188, score-0.522]
67 The first module stride length is chosen based on the length of the longest side of the image: 4 if the side is less than 128 pixels, 6 if less than 214 pixels and 8 otherwise. [sent-190, score-0.532]
68 We also incorporate the use of self-taught learning [6] in our annotation experiments by utilizing the MIRFlickr dataset for dictionary learning. [sent-194, score-0.408]
69 We randomly sampled 10000 images from this dataset for training K-SVD on both modules. [sent-196, score-0.217]
70 1 STL-10 The STL-10 dataset is a collection of 96 × 96 images of 10 classes with images partitioned into 10 folds of 1000 images each and a test set of size 8000. Table 1: A selection of the best results obtained on the STL-10 dataset. [sent-200, score-0.291]
71 We randomly chose 10000 images from the unlabeled set for training and use a linear L2-SVM for classification with 5-fold cross validation for model selection. [sent-212, score-0.253]
72 Our 2 module architecture outperforms all existing approaches except for the recently proposed hierarchical matching pursuit (HMP). [sent-214, score-0.569]
73 2 Natural scenes The Natural Scenes dataset is a multi-label collection of 2000 images from 5 classes: desert, forest, mountain, ocean and sunset. [sent-220, score-0.269]
74 3 IAPRTC-12 and ESP-Game IAPRTC-12 is a collection of 20000 images with a vocabulary size of |V | = 291 and an average of 5. [sent-318, score-0.162]
75 Using the standard protocol, performance is evaluated using 3 measures: precision (P), recall (R) and the number of recalled tags (N+). [sent-324, score-0.293]
76 N+ indicates the number of tags that were recalled at least once for annotation on the test set. [sent-325, score-0.535]
77 Annotations are made by choosing the 5 most probable tags for each image as is done with previous evaluations. [sent-326, score-0.435]
78 As with the natural scenes dataset, we perform 5-fold cross validation to determine K for training TagProp. [sent-327, score-0.231]
79 Our 256-bit codes suffer a loss of performance on IAPRTC-12 but give near equivalent results on ESP-Game. [sent-332, score-0.166]
80 Figure 4 shows sample unsupervised retrieval results using the learned 256-bit codes on IAPRTC-12 and ESP-Game while figure 5 illustrates sample annotation performance when training on one dataset and annotating the other. [sent-335, score-0.724]
81 These results show that our codes are able to capture high-level semantic concepts that perform well for retrieval and transfer learning across datasets. [sent-336, score-0.205]
82 We note, however, that annotating ESP-Game when training was done on IAPRTC-12 led to more false human annotations (such as the bottom-right example). Figure 4: Sample 256-bit unsupervised retrieval results on ESP-Game (top) and IAPRTC-12 (bottom). [sent-337, score-0.236]
83 A query image from the test set is used to retrieve the four nearest neighbors from the training set. [sent-338, score-0.288]
84 Figure 5: Sample 256-bit annotation results when training on one dataset and annotating the other. [sent-339, score-0.442]
85 5 Conclusion In this paper we introduced a hierarchical model for learning feature representations of standard sized color images for the task of image annotation. [sent-344, score-0.559]
86 Our results compare favorably to existing approaches that use over a dozen handcrafted image descriptors. [sent-345, score-0.329]
87 Our primary goal for future work is to test the effectiveness of this approach on web-scale annotation systems with millions of images. [sent-346, score-0.284]
88 The success of self-taught learning in this setting means only one dictionary per module ever needs to be learned. [sent-347, score-0.547]
89 It is our hope that the successful use of binary codes for annotation will allow further research to bridge the gap between the annotation algorithms used on small scale problems to those required for web scale tasks. [sent-349, score-0.65]
90 We also intend to evaluate the effectiveness of semantic hashing on large databases when much smaller codes are used. [sent-350, score-0.222]
91 Linear spatial pyramid matching using sparse coding for image classification. [sent-357, score-0.425]
92 Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. [sent-373, score-0.2]
93 Learning image representations from the pixel level via hierarchical sparse coding. [sent-379, score-0.338]
94 Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. [sent-405, score-0.274]
95 Large scale image annotation: learning to rank with joint wordimage embeddings. [sent-422, score-0.203]
96 Discovering binary codes for documents by learning deep generative models. [sent-462, score-0.27]
97 Beyond spatial pyramids: Receptive field learning for pooled image features. [sent-496, score-0.328]
98 Multi-label learning by image-to-class distance for scene classification and image annotation. [sent-510, score-0.246]
99 Multiple Bernoulli relevance models for image and video annotation. [sent-522, score-0.203]
100 Using very deep autoencoders for content-based image retrieval. [sent-529, score-0.341]
wordName wordTfidf (topN-words)
[('module', 0.424), ('tagprop', 0.374), ('annotation', 0.242), ('tags', 0.232), ('image', 0.203), ('codes', 0.166), ('images', 0.124), ('dictionary', 0.123), ('np', 0.123), ('tile', 0.109), ('annotating', 0.107), ('yiw', 0.107), ('deep', 0.104), ('scenes', 0.102), ('layer', 0.098), ('tag', 0.098), ('ickr', 0.094), ('receptive', 0.084), ('whitening', 0.081), ('rdm', 0.08), ('cross', 0.079), ('encodings', 0.078), ('ft', 0.078), ('coates', 0.076), ('dh', 0.074), ('stride', 0.074), ('guillaumin', 0.071), ('modules', 0.068), ('pooled', 0.068), ('coding', 0.067), ('handcrafted', 0.065), ('mir', 0.065), ('features', 0.065), ('hamming', 0.064), ('bases', 0.064), ('nt', 0.062), ('pooling', 0.062), ('recalled', 0.061), ('dozen', 0.061), ('rubinstein', 0.058), ('spatial', 0.057), ('hierarchical', 0.056), ('extraction', 0.056), ('databases', 0.056), ('autoencoder', 0.056), ('bo', 0.055), ('cvpr', 0.054), ('ccd', 0.053), ('centering', 0.053), ('ciw', 0.053), ('ftl', 0.053), ('rdb', 0.053), ('ttj', 0.053), ('xiw', 0.053), ('pyramid', 0.052), ('training', 0.05), ('rgb', 0.049), ('alberta', 0.049), ('hmp', 0.047), ('ages', 0.047), ('tiles', 0.047), ('db', 0.046), ('color', 0.046), ('matching', 0.046), ('sized', 0.045), ('feature', 0.045), ('et', 0.044), ('tsai', 0.044), ('krizhevsky', 0.044), ('omp', 0.044), ('tiling', 0.044), ('distance', 0.043), ('dataset', 0.043), ('pursuit', 0.043), ('millions', 0.042), ('convolutional', 0.041), ('representations', 0.04), ('unsupervised', 0.04), ('edmonton', 0.039), ('pixel', 0.039), ('retrieval', 0.039), ('vocabulary', 0.038), ('exchange', 0.037), ('eld', 0.037), ('learned', 0.037), ('extract', 0.036), ('metric', 0.036), ('pass', 0.036), ('oe', 0.036), ('huang', 0.036), ('distances', 0.035), ('nearest', 0.035), ('regions', 0.035), ('autoencoders', 0.034), ('longest', 0.034), ('patch', 0.034), ('representation', 0.033), ('gure', 0.033), ('base', 0.033), ('tanh', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
2 0.18366411 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1
3 0.16545559 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
4 0.15072137 148 nips-2012-Hamming Distance Metric Learning
Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov
Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1
5 0.14833777 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
6 0.13972878 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
7 0.12107044 193 nips-2012-Learning to Align from Scratch
8 0.1186135 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
9 0.11521566 210 nips-2012-Memorability of Image Regions
10 0.11310365 62 nips-2012-Burn-in, bias, and the rationality of anchoring
11 0.11310365 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning
12 0.11028014 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
13 0.10404886 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
14 0.10226095 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
15 0.095000423 202 nips-2012-Locally Uniform Comparison Image Descriptor
16 0.093641721 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
17 0.092596784 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection
18 0.091784045 100 nips-2012-Discriminative Learning of Sum-Product Networks
19 0.091220044 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
20 0.086237065 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search
topicId topicWeight
[(0, 0.203), (1, 0.07), (2, -0.249), (3, -0.016), (4, 0.139), (5, -0.055), (6, -0.017), (7, -0.029), (8, 0.05), (9, -0.014), (10, 0.053), (11, 0.023), (12, -0.037), (13, 0.122), (14, -0.14), (15, -0.01), (16, 0.041), (17, -0.012), (18, 0.022), (19, -0.107), (20, 0.017), (21, -0.046), (22, -0.003), (23, 0.012), (24, 0.013), (25, 0.017), (26, -0.039), (27, 0.028), (28, 0.011), (29, 0.064), (30, -0.035), (31, 0.031), (32, 0.059), (33, -0.122), (34, 0.095), (35, 0.1), (36, 0.014), (37, 0.076), (38, 0.078), (39, -0.061), (40, 0.014), (41, -0.018), (42, -0.046), (43, 0.087), (44, 0.009), (45, -0.051), (46, -0.06), (47, -0.044), (48, -0.007), (49, 0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.95412755 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
2 0.80664337 210 nips-2012-Memorability of Image Regions
Author: Aditya Khosla, Jianxiong Xiao, Antonio Torralba, Aude Oliva
Abstract: While long term human visual memory can store a remarkable amount of visual information, it tends to degrade over time. Recent works have shown that image memorability is an intrinsic property of an image that can be reliably estimated using state-of-the-art image features and machine learning algorithms. However, the class of features and image information that is forgotten has not been explored yet. In this work, we propose a probabilistic framework that models how and which local regions from an image may be forgotten using a data-driven approach that combines local and global images features. The model automatically discovers memorability maps of individual images without any human annotation. We incorporate multiple image region attributes in our algorithm, leading to improved memorability prediction of images as compared to previous works. 1
3 0.80512011 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1
4 0.79987705 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
Author: Ren Xiaofeng, Liefeng Bo
Abstract: Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. 1
5 0.75884449 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1
6 0.74811929 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
7 0.74374282 202 nips-2012-Locally Uniform Comparison Image Descriptor
8 0.73449558 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
9 0.73368394 193 nips-2012-Learning to Align from Scratch
10 0.72857445 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
11 0.66742086 146 nips-2012-Graphical Gaussian Vector for Image Categorization
12 0.64437509 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
13 0.62990379 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
14 0.59264988 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
15 0.57404506 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.57295948 170 nips-2012-Large Scale Distributed Deep Networks
17 0.56410414 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
18 0.55910176 185 nips-2012-Learning about Canonical Views from Internet Image Collections
19 0.52090931 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
20 0.51388347 148 nips-2012-Hamming Distance Metric Learning
topicId topicWeight
[(0, 0.031), (21, 0.026), (38, 0.1), (42, 0.026), (44, 0.012), (47, 0.233), (53, 0.012), (54, 0.018), (55, 0.05), (74, 0.101), (76, 0.126), (80, 0.095), (92, 0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.79467648 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
2 0.68134886 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
3 0.68059379 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
4 0.68036723 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
Author: Ren Xiaofeng, Liefeng Bo
Abstract: Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. 1
5 0.67719764 168 nips-2012-Kernel Latent SVM for Visual Recognition
Author: Weilong Yang, Yang Wang, Arash Vahdat, Greg Mori
Abstract: Latent SVMs (LSVMs) are a class of powerful tools that have been successfully applied to many applications in computer vision. However, a limitation of LSVMs is that they rely on linear models. For many computer vision tasks, linear models are suboptimal and nonlinear models learned with kernels typically perform much better. Therefore it is desirable to develop the kernel version of LSVM. In this paper, we propose kernel latent SVM (KLSVM) – a new learning framework that combines latent SVMs and kernel methods. We develop an iterative training algorithm to learn the model parameters. We demonstrate the effectiveness of KLSVM using three different applications in visual recognition. Our KLSVM formulation is very general and can be applied to solve a wide range of applications in computer vision and machine learning. 1
6 0.67558175 260 nips-2012-Online Sum-Product Computation Over Trees
7 0.67475593 197 nips-2012-Learning with Recursive Perceptual Representations
8 0.67414278 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
9 0.67062086 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
10 0.66916108 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
11 0.66912556 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
12 0.66830999 8 nips-2012-A Generative Model for Parts-based Object Segmentation
13 0.66727912 210 nips-2012-Memorability of Image Regions
14 0.66617173 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
15 0.66273779 148 nips-2012-Hamming Distance Metric Learning
16 0.66093767 188 nips-2012-Learning from Distributions via Support Measure Machines
17 0.66061521 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
18 0.66022849 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
19 0.65816212 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
20 0.6578663 65 nips-2012-Cardinality Restricted Boltzmann Machines