nips nips2012 nips2012-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We apply our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. [sent-8, score-0.34]
2 With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. [sent-9, score-1.451]
3 The network encodes invariances which are increasingly complex with hierarchy. [sent-10, score-0.235]
4 Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. [sent-11, score-0.435]
5 During their development, training stimuli are not incoherent sequences of images, but natural visual streams modulated by fixations [1]. [sent-15, score-0.296]
6 Likewise, we expect a machine vision system to learn from coherent image sequences extracted from the natural environment. [sent-16, score-0.318]
7 Through this learning process, it is desired that features become robust to temporal transformations and perform significantly better in recognition. [sent-17, score-0.355]
8 However, it remains unclear to what extent sparsity and subspace pooling [3, 4] could produce invariance exhibited in higher levels of visual systems. [sent-20, score-0.468]
9 Another approach to learning invariance is temporal slowness [1, 5, 6, 7]. [sent-21, score-1.053]
10 Experimental evidence suggests that high-level visual representations become slow-changing and tolerant towards non-trivial transformations, by associating low-level features which appear in a coherent sequence [5]. [sent-22, score-0.275]
11 To learn features using slowness, a key observation is that during our visual fixations, moving objects remain in visual focus for a sustained amount of time through smooth pursuit eye movements. [sent-23, score-0.463]
12 At these feature locations, we apply local contrast normalization [8] and template matching [9] to find local correspondences between successive video frames. [sent-27, score-0.269]
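As a concrete illustration of this correspondence step (a minimal sketch, not the authors' implementation; the normalization window, epsilon, and exhaustive search below are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contrast_normalize(img, size=9, eps=1e-2):
    # Subtract a local mean and divide by a local standard deviation.
    mean = uniform_filter(img, size)
    var = uniform_filter((img - mean) ** 2, size)
    return (img - mean) / np.sqrt(var + eps)

def match_template(patch, next_frame):
    # Exhaustive normalized cross-correlation; returns the top-left corner of
    # the best-matching window in the next frame. Slow but simple.
    ph, pw = patch.shape
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(next_frame.shape[0] - ph + 1):
        for c in range(next_frame.shape[1] - pw + 1):
            win = next_frame[r:r + ph, c:c + pw]
            w = (win - win.mean()) / (win.std() + 1e-8)
            score = float((p * w).mean())
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```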
13 In prior work [10, 11, 12], a single layer of features learned using temporal slowness results in translation-invariant edge detectors, reminiscent of complex-cells. [sent-30, score-1.348]
14 However, it remains unclear whether higher levels of invariances [1], such as ones exhibited in IT, can be learned using temporal slowness. (Figure 1: Simulating smooth pursuit eye movements.) [sent-31, score-0.481]
15 Using temporal slowness, the first layer units become locally translation invariant, similar to subspace or spatial pooling; the second layer units can then encode more complex invariances such as out-of-plane transformations and non-linear warping. [sent-37, score-0.964]
16 Using this approach, we show a surprising result that despite being trained on videos, our features encode complex invariances which translate to recognition performance on still images. [sent-38, score-0.414]
17 We first learn a set of features using simulated fixations in unlabeled videos, and then apply the learned features to classification tasks. [sent-40, score-0.459]
18 The learned features improve accuracy by a significant 4% to 5% across four still image recognition datasets. [sent-41, score-0.386]
19 Finally, we quantify the invariance learned using temporal slowness and simulated fixations by a set of control experiments. [sent-43, score-1.146]
20 2 Related work Unsupervised learning of image features from pixels is a relatively new approach in computer vision. [sent-44, score-0.259]
21 [20] showed that temporal slowness could improve recognition on a video-like COIL-100 dataset. [sent-53, score-0.922]
22 Despite being one of the first to apply temporal slowness in deep architectures, the authors trained a fully supervised convolutional network and used temporal slowness as a regularizing step in the optimization procedure. [sent-54, score-1.949]
23 The influential work of Slow Feature Analysis (SFA) [7] was an early example of an unsupervised algorithm using temporal slowness. [sent-55, score-0.269]
24 SFA solves a constrained problem and optimizes for temporal slowness by mapping data into a quadratic expansion and performing eigenvector decomposition. [sent-56, score-0.861]
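For reference, linear-quadratic SFA can be sketched as below. This is a generic illustration of the algorithm in [7], assuming a time-ordered data matrix and a well-conditioned expanded covariance; it is not code from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def sfa(X, n_slow=10):
    # X: time-ordered data, shape (T, d).
    T, d = X.shape
    rows, cols = np.triu_indices(d)
    Z = np.hstack([X, X[:, rows] * X[:, cols]])   # quadratic expansion
    Z = Z - Z.mean(axis=0)                        # center the expanded signal
    Zdot = np.diff(Z, axis=0)                     # finite-difference "velocity"
    A = Zdot.T @ Zdot / (T - 1)                   # covariance of the derivative
    B = Z.T @ Z / T                               # covariance of the signal
    # Slow features minimize derivative variance under unit signal variance,
    # i.e. the smallest generalized eigenvalues of (A, B).
    eigvals, W = eigh(A, B)
    return W[:, :n_slow], eigvals[:n_slow]
```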
25 [12] proposed to train deep architectures with temporal slowness and decorrelation, and illustrated training a first layer on MNIST digits. [sent-61, score-1.216]
26 [24] trained a two-layer algorithm to learn visual transformations in videos, with limited emphasis on temporal slowness. [sent-65, score-0.407]
27 The computer vision literature has a number of works which, similar to us, use the idea of video tracking to learn invariant features. [sent-66, score-0.417]
28 [25] show improvement in performance when SIFT/HOG parameters are optimized using tracked image patch sequences in specific application domains. [sent-69, score-0.361]
29 In contrast to these recent examples, our algorithm learns features directly from raw image pixels, and adapts to pixel-level image statistics—in particular, it does not rely on hand-designed preprocessing such as SIFT/HOG. [sent-76, score-0.295]
30 In particular, our learning modules use a combination of temporal slowness and a non-degeneracy principle similar to orthogonality [30, 31]. [sent-79, score-0.934]
31 To learn invariant features with temporal slowness, we use a two-layer network, where the first layer is convolutional and replicates neurons with local receptive fields across dense grid locations, and the second (non-convolutional) layer is fully connected. [sent-82, score-1.335]
32 This pooling mechanism is implemented by a subspace pooling matrix H with a group size of two [30]. [sent-87, score-0.282]
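A minimal sketch of this L2 subspace pooling, assuming the linear filter responses are stacked so that consecutive rows belong to the same group of two:

```python
import numpy as np

def l2_subspace_pool(linear_responses, group_size=2, eps=1e-8):
    # linear_responses: (n_bases, n_examples), n_bases divisible by group_size.
    n_bases, n_examples = linear_responses.shape
    grouped = linear_responses.reshape(n_bases // group_size, group_size, n_examples)
    # Each pooled unit is the L2 norm of its group of linear filter responses.
    return np.sqrt((grouped ** 2).sum(axis=1) + eps)
```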
33 Although the algorithm is driven by temporal slowness, sparsity also helps to obtain good features from natural images. [sent-94, score-0.42]
34 This basic algorithm, trained on Hans van Hateren's natural video repository [24], produced oriented edge filters. [sent-96, score-0.329]
35 The learned features are highly invariant to local translations. [sent-97, score-0.322]
36 The reason for this is that temporal slowness requires hidden features to be slow-changing across time. [sent-98, score-1.054]
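The following sketch shows one plausible form of a reconstruction-plus-slowness objective in the spirit of Equation 1; the exact terms, norms, and weighting in the paper may differ, so the pooling, L1 temporal penalty, and lam here are illustrative assumptions.

```python
import numpy as np

def slowness_objective(W, X_seq, lam=1.0, group_size=2, eps=1e-8):
    # W: (n_bases, d) linear bases; X_seq: (T, d) consecutive tracked patches.
    Z = X_seq @ W.T                                   # linear responses, (T, n_bases)
    recon = np.sum((X_seq - Z @ W) ** 2)              # linear reconstruction error
    T, n_bases = Z.shape
    pooled = np.sqrt((Z.reshape(T, n_bases // group_size, group_size) ** 2)
                     .sum(axis=2) + eps)              # pooled features, (T, n_pooled)
    slowness = np.abs(np.diff(pooled, axis=0)).sum()  # penalize change across time
    return recon + lam * slowness
```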
37 2 Stacked Architecture The first layer modules described in the last section are trained on smaller patches (16x16 pixels) from locally tracked video sequences. [sent-102, score-0.801]
38 To construct the set of inputs to the second stacked layer, first layer features are replicated on a dense grid at a larger scale (32x32 pixels). [sent-103, score-0.455]
39 The input to layer two is extracted after L2 pooling. [sent-104, score-0.283]
40 This architecture produces an over-complete number of local 16x16 features across the larger feature area. [sent-105, score-0.299]
41 Due to the high dimensionality of the first layer outputs, we apply PCA to reduce their dimensions for the second layer algorithm. [sent-107, score-0.504]
42 A fully connected module is then trained with temporal slowness on the output of PCA. (Figure 2: Neural network architecture of the basic learning module. Figure 3: Translational invariance in first layer features; columns correspond to interpolation angle θ at multiples of 45 degrees.) [sent-111, score-1.663]
43 The stacked architecture learns features in a significantly larger 2-D area than the first layer algorithm, and is able to learn invariance to larger-scale transformations seen in videos. [sent-112, score-0.777]
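A hedged sketch of the stacked feed-forward pass described above: first-layer pooled responses are computed at several 16x16 locations of a 32x32 patch, reduced with PCA, and passed through a fully connected second layer with the same pooling. The stride, parameter names (W1, pca_mean, pca_W, W2), and dimensions are placeholders, not the paper's actual settings.

```python
import numpy as np

def stacked_forward(patch32, W1, pca_mean, pca_W, W2, group=2, stride=4, size=16):
    # patch32: 32x32 image patch; W1: (n_bases1, size*size); W2: (n_bases2, k).
    responses = []
    for r in range(0, 32 - size + 1, stride):
        for c in range(0, 32 - size + 1, stride):
            x = patch32[r:r + size, c:c + size].ravel()
            z = W1 @ x                                             # first-layer bases
            responses.append(np.sqrt((z.reshape(-1, group) ** 2).sum(axis=1)))
    h1 = np.concatenate(responses)          # dense first-layer output over the patch
    h1_red = pca_W @ (h1 - pca_mean)        # PCA dimensionality reduction
    z2 = W2 @ h1_red                        # fully connected second layer
    return np.sqrt((z2.reshape(-1, group) ** 2).sum(axis=1))   # second-layer pooling
```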
44 Figure 4: Two-layer architecture of our algorithm used to learn invariance from videos. [sent-113, score-0.283]
45 3 Invariance Visualization After unsupervised training with video sequences, we visualize the features learned by the two-layer network. [sent-115, score-0.796]
46 On the left of Figure 5, we show the optimal stimuli which maximally activate each of the first layer pooling units. [sent-116, score-0.456]
47 The optimal stimuli for units learned without slowness are shown at the top, and appear to be high-frequency grating-like patterns. [sent-118, score-0.831]
48 At the bottom, we show the optimal stimuli for features learned with slowness; here, the optimal stimuli appear much smoother because the pairs of Gabor-like features being pooled over usually form a quadrature pair. [sent-119, score-0.58]
49 The second layer features are learned on top of the pooled first layer features. [sent-121, score-0.791]
50 We visualize the second layer features by plotting linear combinations of the first layer features’ optimal stimuli (as shown on the left of Figure 5), and varying the interpolation angle as in [24]. [sent-122, score-0.796]
51 Each row corresponds to a motion sequence to which we would expect the second layer features to be roughly invariant. [sent-124, score-0.419]
52 A video animation of this visualization is also available online. [sent-126, score-0.259]
53 Figure 5: (Left) Comparison of optimal stimuli of first layer pooling units (patch size 16x16) learned without (top) and with (bottom) temporal slowness. [sent-130, score-0.739]
54 (Right) visualization of second layer features (patch size 32x32), with each row corresponding to one pooling unit. [sent-131, score-0.622]
55 The learned features are then used to classify single images in each of four datasets. [sent-135, score-0.295]
56 1 Training with Tracked Sequences To extract data from the Hans van Hateren natural video repository, we apply a spatial-temporal Difference-of-Gaussians blob detector and select areas of high response to simulate visual fixations. [sent-138, score-0.365]
57 After the initial frame is selected, the image patch is tracked across 20 frames using a tracker we built and customized for this task. [sent-139, score-0.366]
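A minimal sketch of the fixation-selection step, assuming grayscale frame stacks as numpy arrays; the sigmas, the temporal averaging of the DoG response, and the lack of non-maximum suppression are simplifications rather than the authors' detector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_fixations(frames, sigma1=1.0, sigma2=2.0, n_points=10):
    # frames: (T, H, W) stack of consecutive grayscale frames.
    stack = np.asarray(frames, dtype=float)
    dog = gaussian_filter(stack, (0, sigma1, sigma1)) - \
          gaussian_filter(stack, (0, sigma2, sigma2))
    response = np.abs(dog).mean(axis=0)        # (H, W) saliency map over the stack
    top = np.argsort(response.ravel())[::-1][:n_points]
    rows, cols = np.unravel_index(top, response.shape)
    # Candidate fixation centers (no non-maximum suppression in this sketch).
    return list(zip(rows.tolist(), cols.tolist()))
```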
58 The first layer algorithm is learned on 16x16 patches with 128 features (pooled from 256 linear bases). [sent-141, score-0.528]
59 The second layer learns 150 features (pooled from 300 linear bases). [sent-144, score-0.419]
60 The videos we trained on to obtain the temporal slowness features were based on the van Hateren videos, and were thus unrelated to COIL-100. [sent-148, score-1.241]
61 (Table: COIL-100 accuracy with unrelated training video, comparing VTU [32], a ConvNet regularized with video [20], and our results without and with video, including the performance increase from training on video.) [sent-157, score-0.819]
62 (Table: average accuracy, comparing a two-layer ConvNet [36], ScSPM [37], hierarchical sparse coding [38], Macrofeatures [39], and our results without and with video, including the performance increase with video.) [sent-163, score-0.591]
63 (Table: STL-10 average accuracy, comparing Reconstruction ICA [31], Sparse Filtering [40], SC features with K-means encoding [16], SC features with SC encoding [16], local receptive field selection [19], and our results without and with video, including the performance increase with video.) [sent-174, score-0.628]
64 (Table: PubFig faces accuracy, comparing our results without and with video, including the performance increase with video.) [sent-185, score-0.591]
65 3 Test Pipeline On still images, we apply our trained network to extract features at dense grid locations. [sent-196, score-0.28]
66 A linear SVM classifier is trained on features from both first and second layers. [sent-197, score-0.237]
67 For Caltech 101, we use a three-layer spatial pyramid. [sent-201, score-0.285]
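A hedged sketch of this test pipeline: features are extracted on a dense grid, average-pooled into spatial-pyramid cells, concatenated, and fed to a linear SVM. Here encode_patch stands in for the trained network (the paper uses features from both layers), and the stride and pyramid levels are illustrative.

```python
import numpy as np

def spatial_pyramid_vector(image, encode_patch, patch=16, stride=8, levels=(1, 2, 4)):
    H, W = image.shape[:2]
    feats, centers = [], []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            feats.append(encode_patch(image[r:r + patch, c:c + patch]))
            centers.append((r + patch / 2.0, c + patch / 2.0))
    feats, centers = np.array(feats), np.array(centers)
    pooled = []
    for n in levels:                                   # e.g. 1x1, 2x2 and 4x4 grids
        cell_r = np.clip((centers[:, 0] * n / H).astype(int), 0, n - 1)
        cell_c = np.clip((centers[:, 1] * n / W).astype(int), 0, n - 1)
        for i in range(n):
            for j in range(n):
                mask = (cell_r == i) & (cell_c == j)
                pooled.append(feats[mask].mean(axis=0) if mask.any()
                              else np.zeros(feats.shape[1]))
    return np.concatenate(pooled)

# The pyramid vectors of labeled images would then train a linear SVM,
# e.g. sklearn.svm.LinearSVC().fit(train_vectors, train_labels).
```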
68 However, performance is not particularly sensitive to the weighting of the temporal slowness objective relative to the reconstruction objective in Equation 1, as we will illustrate in Section 4. [sent-205, score-0.933]
69 For each dataset, we compare results using features trained with and without the temporal slowness objective term in Equation 1. [sent-208, score-1.098]
70 Despite the features being learned from natural videos and then being transferred to different recognition tasks (i.e., a different domain), classification accuracy consistently improves. [sent-209, score-0.329]
71 The application of temporal slowness increases recognition accuracy consistently by 4% to 5%, making our results competitive with the state-of-the-art. [sent-212, score-0.922]
72 As shown on the left of Figure 6, training on tracked sequences reduces the translation invariance learned in the second layer. [sent-217, score-0.569]
73 In comparison to other forms of invariance, translation is less useful because it is easy to encode with spatial pooling [17]. [sent-218, score-0.259]
74 Instead, the features encode other invariances such as different forms of nonlinear warping. [sent-219, score-0.386]
75 The advantage of using tracked data is reflected in object recognition performance on the STL-10 dataset. [sent-220, score-0.237]
76 As shown on the right of Figure 6, recognition accuracy is increased by a considerable margin by training on tracked sequences. [sent-221, score-0.224]
77 Figure 6: (Left) Comparison of second layer invariance visualization when training data was obtained with tracking and without; (Right) Ave. [sent-222, score-0.601]
78 on STL-10 with features trained on tracked sequences compared to non-tracked; λ in this plot is the slowness weighting parameter from Equation 1. [sent-224, score-1.169]
79 2 Importance of Temporal Slowness to Recognition Performance To understand how much the slowness principle helps to learn good features, we vary the slowness parameter across a range of values to observe its effect on recognition accuracy. [sent-227, score-1.465]
80 Figure 7 shows recognition accuracy on STL-10, plotted as a function of a slowness weighting parameter λ in the first and second layers. [sent-228, score-0.773]
81 Figure 7: Performance on STL-10 versus the amount of temporal slowness, on the first layer (left) and second layer (right); in these plots λ is the slowness weighting parameter from Equation 1; different colored curves are shown for different λ values in the other layer. [sent-231, score-1.404]
82 3 Invariance Tests We quantify the invariance encoded in the features learned by unsupervised training with invariance tests. [sent-234, score-0.619]
83 In this experiment, we take the approach described in [4] and measure the change in features as the input image undergoes transformations. [sent-235, score-0.231]
84 A patch is extracted from a natural image, and transformed through translation, rotation and zoom. [sent-236, score-0.21]
85 We measure the Mean Squared Error (MSE) between the L2-normalized feature vector of the transformed patch and the feature vector of the original patch. [sent-237, score-0.248]
86 Results of invariance tests are shown in Figure 8. (Footnote 3: MSE is normalized against feature dimensions, and averaged across 100 randomly sampled patches.) [sent-239, score-0.301]
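The invariance test itself is simple to sketch: encode the original and transformed patch, L2-normalize both feature vectors, and average the MSE over sampled patches. Here encode stands in for the trained feature extractor and transform for a translation, rotation, or zoom (e.g., via scipy.ndimage); both are placeholders, not the paper's code.

```python
import numpy as np

def invariance_mse(patches, encode, transform, eps=1e-8):
    errs = []
    for p in patches:                       # e.g. 100 randomly sampled patches
        f0, f1 = encode(p), encode(transform(p))
        f0 = f0 / (np.linalg.norm(f0) + eps)   # L2-normalize feature vectors
        f1 = f1 / (np.linalg.norm(f1) + eps)
        errs.append(np.mean((f0 - f1) ** 2))
    return float(np.mean(errs))
```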
87 Our features trained with temporal slowness have better invariance properties compared to features learned using only sparsity, and to SIFT. [sent-243, score-1.525]
88 Specifically, as shown on the left of Figure 8, feature tracking reduces translation invariance in agreement with our analysis in Section 4. [sent-245, score-0.361]
89 At the same time, middle and right plots of Figure 8 show that feature tracking increases the non-trivial rotation and zoom invariance in the second layer of our temporal slowness features. [sent-248, score-1.536]
90 Figure 8: Invariance tests comparing our temporal slowness features using tracked and non-tracked sequences, against SIFT and features trained only with sparsity, shown for different transformations: Translation (left), Rotation (middle) and Zoom (right). [sent-249, score-1.433]
91 5 Conclusion We have described an unsupervised learning algorithm for learning invariant features from video using the temporal slowness principle. [sent-250, score-1.393]
92 The system is improved by using simulated fixations and smooth pursuit to generate the video sequences provided to the learning algorithm. [sent-251, score-0.394]
93 We illustrate, through visualization and invariance tests, that the learned features are invariant to a collection of non-trivial transformations. [sent-252, score-0.576]
94 With concrete recognition experiments, we show that the features learned from natural videos not only apply to still images, but also give competitive results on a number of object recognition benchmarks. [sent-253, score-0.554]
95 Unsupervised natural experience rapidly alters invariant object representation in visual cortex. [sent-259, score-0.245]
96 Unsupervised learning of visual features through spike timing dependent plasticity. [sent-291, score-0.245]
97 (Footnote 4: The translation test is performed with 16x16 patches and first layer features; rotation and zoom tests are performed with 32x32 patches and second layer features.) [sent-315, score-0.742]
98 Temporal coherence, natural image sequences and the visual cortex. [sent-321, score-0.266]
99 An analysis of single layer networks in unsupervised feature learning. [sent-340, score-0.38]
100 Learning optimized features for hierarchical models of invariant object recognition. [sent-471, score-0.298]
wordName wordTfidf (topN-words)
[('slowness', 0.673), ('layer', 0.252), ('video', 0.197), ('invariance', 0.192), ('temporal', 0.188), ('features', 0.167), ('pooling', 0.141), ('tracked', 0.132), ('xations', 0.126), ('videos', 0.117), ('pubfig', 0.102), ('invariances', 0.089), ('sequences', 0.088), ('invariant', 0.087), ('sfa', 0.082), ('unsupervised', 0.081), ('visual', 0.078), ('patch', 0.077), ('modules', 0.073), ('deep', 0.072), ('trained', 0.07), ('learned', 0.068), ('caltech', 0.066), ('rotation', 0.066), ('tracking', 0.064), ('image', 0.064), ('stimuli', 0.063), ('hateren', 0.062), ('module', 0.062), ('visualization', 0.062), ('recognition', 0.061), ('images', 0.06), ('architecture', 0.059), ('translation', 0.058), ('pursuit', 0.054), ('zoom', 0.054), ('pooled', 0.052), ('cadieu', 0.05), ('hyvarinen', 0.05), ('feature', 0.047), ('hans', 0.047), ('coates', 0.046), ('hurri', 0.044), ('object', 0.044), ('network', 0.043), ('mse', 0.042), ('convolutional', 0.042), ('cvpr', 0.041), ('ngiam', 0.041), ('convnet', 0.041), ('mobahi', 0.041), ('patches', 0.041), ('sc', 0.04), ('frames', 0.039), ('weighting', 0.039), ('transformations', 0.039), ('vision', 0.037), ('receptive', 0.037), ('ica', 0.036), ('stacked', 0.036), ('natural', 0.036), ('vlfeat', 0.036), ('tests', 0.036), ('interpolation', 0.035), ('reconstruction', 0.033), ('slow', 0.033), ('spatial', 0.033), ('learn', 0.032), ('bergstra', 0.031), ('koh', 0.031), ('leistner', 0.031), ('training', 0.031), ('extracted', 0.031), ('smooth', 0.03), ('translational', 0.03), ('coherent', 0.03), ('bases', 0.029), ('sparsity', 0.029), ('tracker', 0.028), ('warping', 0.028), ('boureau', 0.028), ('xation', 0.028), ('sift', 0.028), ('levels', 0.028), ('pixels', 0.028), ('simulate', 0.028), ('angle', 0.027), ('bilinear', 0.027), ('topographic', 0.027), ('encode', 0.027), ('units', 0.027), ('le', 0.027), ('van', 0.026), ('across', 0.026), ('kavukcuoglu', 0.026), ('coding', 0.025), ('simulated', 0.025), ('correspondences', 0.025), ('layers', 0.024), ('eye', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We apply our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
2 0.20606898 62 nips-2012-Burn-in, bias, and the rationality of anchoring
Author: Falk Lieder, Thomas Griffiths, Noah Goodman
Abstract: Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale. 1
3 0.20606898 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning
Author: Adam Coates, Andrej Karpathy, Andrew Y. Ng
Abstract: Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale. 1
4 0.19017397 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
5 0.18366411 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
6 0.17258602 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
7 0.15035118 197 nips-2012-Learning with Recursive Perceptual Representations
8 0.14167561 193 nips-2012-Learning to Align from Scratch
9 0.14113933 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
10 0.096239462 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
11 0.095356554 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
12 0.095238827 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
13 0.087848224 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
14 0.086729996 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
15 0.084950276 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
16 0.079062924 195 nips-2012-Learning visual motion in recurrent neural networks
17 0.078476898 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition
18 0.075793378 289 nips-2012-Recognizing Activities by Attribute Dynamics
19 0.072837502 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
20 0.070411958 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
topicId topicWeight
[(0, 0.17), (1, 0.06), (2, -0.3), (3, 0.029), (4, 0.156), (5, 0.009), (6, -0.004), (7, -0.094), (8, 0.038), (9, -0.064), (10, -0.014), (11, 0.055), (12, -0.137), (13, 0.052), (14, -0.05), (15, -0.093), (16, 0.064), (17, -0.015), (18, 0.058), (19, -0.022), (20, -0.003), (21, -0.035), (22, -0.008), (23, -0.024), (24, 0.03), (25, -0.026), (26, -0.014), (27, 0.007), (28, 0.023), (29, 0.003), (30, 0.04), (31, 0.086), (32, 0.037), (33, -0.069), (34, -0.0), (35, 0.015), (36, 0.036), (37, 0.025), (38, 0.028), (39, 0.034), (40, 0.011), (41, -0.027), (42, 0.011), (43, 0.038), (44, -0.048), (45, -0.032), (46, 0.029), (47, -0.007), (48, -0.027), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.96637791 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We apply our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
2 0.85675937 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
3 0.82424188 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1
4 0.77240515 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
5 0.76438653 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
Author: Ryan Kiros, Csaba Szepesvári
Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1
6 0.7119379 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
7 0.68557084 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
8 0.6846388 62 nips-2012-Burn-in, bias, and the rationality of anchoring
9 0.6846388 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning
10 0.65133405 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex
11 0.6474579 170 nips-2012-Large Scale Distributed Deep Networks
12 0.63784438 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
13 0.61184567 210 nips-2012-Memorability of Image Regions
14 0.59616417 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
15 0.56427759 146 nips-2012-Graphical Gaussian Vector for Image Categorization
16 0.55885094 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
17 0.55523896 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
18 0.54503626 197 nips-2012-Learning with Recursive Perceptual Representations
19 0.53663605 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
20 0.53138733 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding
topicId topicWeight
[(0, 0.038), (17, 0.015), (20, 0.184), (21, 0.043), (38, 0.083), (42, 0.045), (54, 0.027), (55, 0.081), (74, 0.082), (76, 0.139), (80, 0.062), (92, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.83139104 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng
Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We apply our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
2 0.80733317 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
Author: Felipe Trevizan, Manuela Veloso
Abstract: Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectorybased short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 1070 states. 1
3 0.72872138 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
Author: Ren Xiaofeng, Liefeng Bo
Abstract: Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. 1
4 0.72528189 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
Author: Thanh Ngo, Yousef Saad
Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convegence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods. 1
5 0.72332251 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
6 0.72312516 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
7 0.72279078 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
8 0.72004575 193 nips-2012-Learning to Align from Scratch
9 0.71772534 210 nips-2012-Memorability of Image Regions
10 0.71395993 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
11 0.7134735 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
12 0.71287233 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
13 0.71211427 215 nips-2012-Minimizing Uncertainty in Pipelines
14 0.71150339 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
15 0.71116555 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
16 0.709571 188 nips-2012-Learning from Distributions via Support Measure Machines
17 0.70898938 329 nips-2012-Super-Bit Locality-Sensitive Hashing
18 0.70833647 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search
19 0.7079106 260 nips-2012-Online Sum-Product Computation Over Trees
20 0.70701158 197 nips-2012-Learning with Recursive Perceptual Representations