nips2012-87 knowledge graph by maker-knowledge-mining

87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification


Source: pdf

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. [sent-7, score-0.501]

2 We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. [sent-9, score-0.363]

3 The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. [sent-10, score-0.255]

4 RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. [sent-11, score-0.128]

5 Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. [sent-13, score-0.167]

6 New sensing technology such as the Kinect, which can record high-quality RGB and depth images (RGB-D), has now become affordable and could be combined with standard vision systems in household robots. [sent-15, score-0.356]

7 The depth modality provides useful extra information to the complex problem of general object detection [1] since depth information is invariant to lighting or color variations, provides geometrical cues and allows better separation from the background. [sent-16, score-0.742]

8 Most recent methods for object recognition with RGB-D images use hand-designed features such as SIFT for 2D images [2], Spin Images [3] for 3D point clouds, or specific color, shape and geometry features [4, 5]. [sent-17, score-0.467]

9 In this paper, we introduce the first convolutional-recursive deep learning model for object recognition that can learn from raw RGB-D images. [sent-18, score-0.25]

10 Compared to other recent 3D feature learning methods [6, 7], our approach is fast, does not need additional input channels such as surface normals and obtains state of the art results on the task of detecting household objects. [sent-19, score-0.171]

11 Our model starts with raw RGB and depth images and first separately extracts features from them. [sent-25, score-0.43]

12 Each modality is first given to a single convolutional neural net layer (CNN, [8]) which provides useful translational invariance of low level features such as edges and allows parts of an object to be deformable to some extent. [sent-26, score-0.558]

13 The pooled filter responses are then given to a recursive neural network (RNN, [9]) which can learn compositional features and part interactions. [sent-27, score-0.3]

14 We also explore new deep learning architectures for computer vision. [sent-29, score-0.111]

15 The concatenation of all the resulting vectors forms the final feature vector for a softmax classifier. [sent-33, score-0.107]

16 In this paper, we expand the space of possible RNN-based architectures in these four dimensions by using fixed tree structures, applying multiple RNNs to the same input, and allowing n-ary trees. [sent-35, score-0.121]

17 The hierarchically composed RNN features of each modality are concatenated and given to a joint softmax classifier. [sent-38, score-0.237]

18 So far random weights have only been shown to work for convolutional neural networks [15, 16]. [sent-40, score-0.21]

19 Because the supervised training reduces to optimizing the weights of the final softmax classifier, a large set of RNN architectures can quickly be explored. [sent-41, score-0.167]
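
Because only the classifier is learned, trying a new random-RNN architecture is roughly as cheap as refitting a linear model. Below is a minimal sketch of that loop; `extract_cnn_rnn_features` is a hypothetical stand-in for the pipeline described in this paper, and scikit-learn's multinomial logistic regression stands in for the softmax classifier.

```python
# Minimal sketch (assumed names, not the authors' code): the random-weight
# CNN-RNN extractor is fixed, so the only learning is a softmax on top.
from sklearn.linear_model import LogisticRegression

def evaluate_architecture(extract_cnn_rnn_features, X_tr, y_tr, X_te, y_te):
    """Score one candidate RNN architecture; only the classifier is trained."""
    F_tr = extract_cnn_rnn_features(X_tr)    # frozen random CNN-RNN features
    F_te = extract_cnn_rnn_features(X_te)
    clf = LogisticRegression(max_iter=1000)  # softmax (multinomial) classifier
    return clf.fit(F_tr, y_tr).score(F_te, y_te)
```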

20 We first briefly describe the unsupervised learning of filter weights and their convolution to obtain low level features. [sent-43, score-0.194]

21 We first learn the CNN filters in an unsupervised way by clustering random patches; the cluster centroids then serve as the filters of a single CNN layer. [sent-49, score-0.192]

22 The resulting low-level, translationally invariant features are given to recursive neural networks. [sent-50, score-0.235]

23 RNNs compose higher order features that can then be used to classify the images. [sent-51, score-0.109]

24 First, random patches are extracted into two sets, one for each modality (RGB and depth). [sent-55, score-0.155]

25 One interesting result when applying this method to the depth channel is that the edges are much sharper. [sent-61, score-0.228]

26 This is due to the large discontinuities between object boundaries and the background. [sent-62, score-0.114]

27 While the depth channel is often quite noisy, most of the features are still smooth. [sent-63, score-0.303]

28 Figure 2 (panels: RGB, Depth, Gray scale): Visualization of the k-means filters used in the CNN layer after unsupervised pre-training: (left) standard RGB filters (best viewed in color) capture edges and colors. [sent-64, score-0.222]

29 When the method is applied to depth images (center) the resulting filters have sharper edges which arise due to the strong discontinuities at object boundaries. [sent-65, score-0.426]

30 The same is true, though to a lesser extent, when compared to filters trained on gray scale versions of the color images (right). [sent-66, score-0.159]

31 (A Single CNN Layer) To generate features for the RNN layer, a CNN architecture is chosen for its translational invariance properties. [sent-68, score-0.114]

32 Our single-layer CNN is similar to the one proposed by Jarrett et al. [sent-70, score-0.146]

33 LCN was inspired by computational neuroscience and is used to contrast features within a feature map, as well as across feature maps at the same spatial location [17, 18, 14]. [sent-72, score-0.159]
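
A rough sketch of such a local contrast normalization is given below, assuming a Gaussian neighborhood; the window width and the variance floor are illustrative choices, not the paper's exact settings.

```python
# Rough numpy/scipy sketch of local contrast normalization (LCN) in the
# spirit of [17, 18, 14]: subtract a Gaussian-weighted local mean computed
# across space and across the K feature maps, then divide by the local
# standard deviation.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(X, sigma=2.0, eps=1e-8):
    """X: feature maps of shape (K, H, W); returns LCN'd maps, same shape."""
    smooth = lambda A: gaussian_filter(A, sigma=(0, sigma, sigma))
    local_mean = smooth(X).mean(axis=0, keepdims=True)   # shared across maps
    centered = X - local_mean
    local_std = np.sqrt(smooth(centered ** 2).mean(axis=0, keepdims=True))
    # floor the divisor so near-constant regions are not blown up
    return centered / np.maximum(local_std, max(local_std.mean(), eps))
```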

34 We then average pool them with square regions of size d and a stride size of s, to obtain a pooled response with width and height equal to r = (d_I − d)/s + 1, where d_I is the width of the convolved response fed to the pooling stage. [sent-74, score-0.139]
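
The pooling arithmetic can be made concrete with a short sketch. The 140 × 140 response size below is an inference, not stated here: it is the size that makes the 27 × 27 pooled maps reported later come out exactly.

```python
# Small sketch of the pooling arithmetic stated above:
# r = (d_I - d)/s + 1, e.g. (140 - 10)/5 + 1 = 27.
import numpy as np

def avg_pool(fmap, d, s):
    """fmap: (H, W) response -> ((H-d)//s + 1, (W-d)//s + 1) average pool."""
    H, W = fmap.shape
    r_h, r_w = (H - d) // s + 1, (W - d) // s + 1
    out = np.empty((r_h, r_w))
    for i in range(r_h):
        for j in range(r_w):
            out[i, j] = fmap[i * s:i * s + d, j * s:j * s + d].mean()
    return out

assert avg_pool(np.zeros((140, 140)), d=10, s=5).shape == (27, 27)
```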

35 So the output X of the CNN layer applied to one image is a K × r × r dimensional 3D matrix. [sent-75, score-0.185]

36 We apply this same procedure to both color and depth images separately. [sent-76, score-0.387]

37 3 Fixed-Tree Recursive Neural Networks The idea of recursive neural networks [19, 9] is to learn hierarchical feature representations by applying the same neural network recursively in a tree structure. [sent-78, score-0.338]

38 In our case, the leaf nodes of the tree are K-dimensional vectors (the result of the CNN pooling over an image patch, repeated for all K filters) and there are r² of them. [sent-79, score-0.166]

39 While this allows for more flexibility, we found that for the task of object classification in conjunction with a CNN layer it was not necessary for obtaining high performance. [sent-81, score-0.26]

40 We generalize our RNN architecture to allow each layer to merge blocks of adjacent vectors instead of only pairs. [sent-86, score-0.234]

41 Figure 3 shows an example of a pooled CNN output of size K × 4 × 4 and an RNN tree structure with blocks of 4 children. [sent-108, score-0.182]
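
A minimal sketch (not the authors' code) of one such fixed-tree RNN with random, tied weights: at every level, each b × b block of adjacent K-dimensional vectors is concatenated and mapped through the same matrix W and a tanh nonlinearity. With b = 2 each parent merges 4 children as in the figure; the experiments later use b = 3 (27 → 9 → 3 → 1).

```python
import numpy as np

def random_fixed_tree_rnn(X, b=2, rng=None):
    """X: pooled CNN output (K, r, r), r a power of b -> (K,) top vector."""
    rng = rng or np.random.default_rng(0)
    K = X.shape[0]
    W = 0.1 * rng.standard_normal((K, b * b * K))  # random weights, never trained
    while X.shape[1] > 1:
        r_new = X.shape[1] // b
        # gather each b x b block of children and concatenate their K-dim vectors
        blocks = X.reshape(K, r_new, b, r_new, b).transpose(1, 3, 2, 4, 0)
        blocks = blocks.reshape(r_new, r_new, b * b * K)
        X = np.tanh(blocks @ W.T).transpose(2, 0, 1)  # parents: (K, r_new, r_new)
    return X[:, 0, 0]                                 # top vector P_top
```

In the full model, many such RNNs with independent random weights are applied to the same pooled input and their top vectors are concatenated, giving N·K features per modality.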

42 However, our original task is to classify each block into one of many object categories. [sent-110, score-0.114]

43 Therefore, we use the top vector Ptop as the feature vector to a softmax classifier. [sent-111, score-0.107]

44 In order to minimize the cross entropy error of the softmax, we could backpropagate through the recursive neural network [12] and convolutional layers [8]. [sent-112, score-0.253]
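
For reference, a small sketch of that objective: the softmax over class weights applied to P_top, with the cross-entropy loss and its gradient in the softmax parameters; in the random-weight setting these are the only parameters trained.

```python
import numpy as np

def softmax_xent(theta, p_top, label):
    """theta: (n_classes, K) weights; p_top: (K,) feature vector; label: int."""
    logits = theta @ p_top
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[label])                    # cross-entropy error
    one_hot = np.eye(len(probs))[label]
    grad_theta = np.outer(probs - one_hot, p_top)   # d loss / d theta
    return loss, grad_theta
```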

45 (Related Work) There has been great interest in object recognition and scene understanding using RGB-D data. [sent-124, score-0.196]

46 The most common approach today for standard object recognition is to use well-designed features based on orientation histograms such as SIFT, SURF [22] or textons and give them as input to a classifier such as a random forest. [sent-128, score-0.224]

47 Despite their success, they have several shortcomings such as being only applicable to one modality (grey scale images in the case of SIFT), not adapting easily to new modalities such as RGB-D or to varying image domains. [sent-129, score-0.22]

48 There have been some attempts to adapt these features to color images via color histograms [23] or by simply extending SIFT to the depth channel [2]. [sent-130, score-0.462]

49 More advanced methods that generalize these ideas and can combine several important RGB-D image characteristics such as size, 3D shape and depth edges are kernel descriptors [5]. [sent-131, score-0.305]

50 Another related line of work concerns spatial pyramids in object classification, in particular the pyramid matching kernel [24]. [sent-132, score-0.114]

51 Another solution to the above-mentioned problems is to employ unsupervised feature learning methods [25, 26, 27] (among many others), which have made large improvements in object recognition. [sent-134, score-0.232]

52 While many deep learning methods exist for learning features from RGB images, few deep learning architectures have yet been investigated for 3D images. [sent-135, score-0.455]

53 [6] introduced convolutional k-means descriptors (CKM) for RGB-D data. [sent-137, score-0.131]

54 Their work is similar to ours in that they also learn features in an unsupervised way. [sent-139, score-0.151]

55 [7] uses unsupervised feature learning based on sparse coding to learn dictionaries from 8 different channels including grayscale intensity, RGB, depth scalars, and surface normals. [sent-141, score-0.393]

56 Each layer has three modules: batch orthogonal matching pursuit, pyramid max pooling, and contrast normalization. [sent-143, score-0.146]

57 Lastly, recursive autoencoders have been introduced by Pollack [19] and Socher et al. [sent-145, score-0.161]

58 Recursive neural networks have been applied to full scene segmentation [9] but they used hand-designed features. [sent-147, score-0.115]

59 [29] also introduce a model for scene segmentation that is based on multi-scale convolutional neural networks and learns feature representations. [sent-149, score-0.25]

60 Each object instance is imaged from 3 different angles resulting in roughly 600 images per instance. [sent-153, score-0.198]

61 We subsample every 5th frame of the 600 images resulting in a total of 120 images per instance. [sent-155, score-0.168]

62 For each split’s test set we sample one object from each class resulting in 51 test objects, each with about 120 independently classified images. [sent-158, score-0.114]

63 Before unsupervised pretraining, the 9 × 9 × 3 patches for RGB and 9 × 9 patches for depth are individually normalized by subtracting the mean and dividing by the standard deviation of their elements. [sent-162, score-0.42]

64 In addition, ZCA whitening is performed to de-correlate pixels and get rid of redundant features in raw images [30]. [sent-163, score-0.202]
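
A hedged sketch of this pre-processing and filter-learning step follows: per-patch normalization, ZCA whitening, then k-means, whose centroids become the K filters. scikit-learn's KMeans and the whitening constant eps are stand-ins for the exact implementation and regularizer actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_filters(patches, K=128, eps=0.1):
    """patches: (N, D) rows, D = 9*9*3 for RGB or 9*9 for depth -> (K, D)."""
    P = patches - patches.mean(axis=1, keepdims=True)   # zero mean per patch
    P = P / (P.std(axis=1, keepdims=True) + 1e-8)       # unit variance per patch
    C = np.cov(P, rowvar=False)                         # D x D covariance
    U, S, _ = np.linalg.svd(C)
    W_zca = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA whitening transform
    P_white = (P - P.mean(axis=0)) @ W_zca
    return KMeans(n_clusters=K, n_init=10).fit(P_white).cluster_centers_
```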

65 A valid convolution is performed with filter bank size K = 128 and filter width and height of 9. [sent-164, score-0.105]

66 Average pooling is then performed with pooling regions of size d = 10 and stride size s = 5 to produce a 3D matrix of size 128 × 27 × 27 for each image. [sent-165, score-0.156]

67 This leads to the following matrices at each depth of the tree: X ∈ R^{128×27×27} to P1 ∈ R^{128×9×9} to P2 ∈ R^{128×3×3} to finally P3 ∈ R^{128}. [sent-167, score-0.228]

68 The combination of RGB and depth is done by concatenating the final features, which have 2 × 128² = 32,768 dimensions. [sent-169, score-0.303]
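
These dimensions can be checked with a few lines of arithmetic. This is a sketch; the 148 × 148 input size is an assumption chosen so that a valid 9 × 9 convolution is consistent with the 27 × 27 pooled maps above.

```python
K, d_img, d_filt, d_pool, stride, n_rnn = 128, 148, 9, 10, 5, 128
d_conv = d_img - d_filt + 1                  # 140 x 140 valid responses
r = (d_conv - d_pool) // stride + 1          # (140 - 10)/5 + 1 = 27
dims, side = [r], r
while side > 1:                              # 3 x 3 tree blocks: 27 -> 9 -> 3 -> 1
    side //= 3
    dims.append(side)
per_modality = n_rnn * K                     # 128 RNNs x 128-dim tops = 16,384
total = 2 * per_modality                     # RGB + depth = 32,768
print(dims, total)                           # [27, 9, 3, 1] 32768
```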

69 [5] investigates multiple kernel descriptors on top of various features, including 3D shape, physical size of the object, depth edges, gradients, kernel PCA, local binary patterns, etc. [sent-174, score-0.266]

70 In contrast, all our features are learned in an unsupervised way from the raw color and depth images. [sent-175, score-0.497]

71 This is in contrast to [7], which uses an extra input modality of surface normals. [sent-217, score-0.144]

72 They make additional use of surface normals and gray scale images on top of RGB and depth channels and also learn features from these inputs with unsupervised methods based on sparse coding. [sent-222, score-0.548]

73 We picked one of the splits as our development fold and focus on RGB images and RNNs with random weights only unless otherwise noted. [sent-226, score-0.133]

74 Figure 4 (left) shows a comparison between our CNN-RNN model and a two-layer CNN. [sent-229, score-0.146]

75 In both settings, the CNN-RNN outperforms the two-layer CNN. [sent-231, score-0.146]

76 Figure 4 (left) also gives results when the weights of the random RNNs are untied across layers in the tree (TNN). [sent-237, score-0.172]

77 In other words, different random weights are used at each depth of the tree. [sent-238, score-0.277]

78 Since weights are still tied inside each layer, this setting can be seen as a convolution where the stride size is equal to the filter size. [sent-239, score-0.302]

79 We call this a tree neural network (TNN) because it is technically not a recursive neural network. [sent-240, score-0.261]

80 We carefully cross-validated the RNN training procedure: objectives (adding reconstruction costs at each layer as in [10], classifying at each layer or only at the top node), regularization, layer size, etc. [sent-245, score-0.438]

81 Figure 4 (right) shows that combining RGB and depth features from RNNs improves performance. [sent-251, score-0.303]

82 In this experiment we investigate whether CNN-RNNs learn better features than simply using a single layer of features on raw pixels. [sent-254, score-0.339]

83 [Figure 4 plot: accuracy (%) vs. number of RNNs (1, 8, 16, 32, 64, 128), with curves for RGB, Depth, and RGB+Depth.] Figure 4: Model analysis on the development split (left and center use RGB only). [sent-266, score-0.211]

84 Left: Comparison of a two-layer CNN with the CNN-RNN under different pre-processing ([17] and [13]). [sent-267, score-0.146]

85 TNN is a tree-structured neural net with untied weights across layers; tRNN is a single RNN trained with backpropagation (see text for details). [sent-268, score-0.205]

86 This shows that even random recursive neural nets can clearly capture more of the underlying class structure in their feature representations than a single layer autoencoder. [sent-273, score-0.348]

87 Most model confusions are very reasonable, showing that recursive deep learning methods on raw pixels and depth can provide high-quality features. [sent-277, score-0.456]

88 Garlic and mushrooms have very similar appearances and colors. [sent-281, score-0.139]

89 (Conclusion) We introduced a new model based on a combination of convolutional and recursive neural networks. [sent-283, score-0.253]

90 Unlike previous RNN models, we fix the tree structure, allow multiple vectors to be combined, use multiple RNN weights and keep parameters randomly initialized. [sent-284, score-0.117]

91 This architecture allows for parallelization and high speeds, outperforms two-layer CNNs and obtains state of the art performance without any external features. [sent-285, score-0.185]

92 We also demonstrate the applicability of convolutional and recursive feature learning to the new domain of depth images. [sent-286, score-0.49]

93 High-accuracy 3D sensing for mobile manipulation: improving object detection and door opening. [sent-302, score-0.114]

94 Many misclassifications are between (a) garlic and mushroom, and (b) food-box and kleenex. [sent-306, score-0.139]

95 Figure 6: Examples of confused classes: Shampoo bottle and water bottle, mushrooms labeled as garlic, pitchers classified as caps due to shape and color similarity, white caps classified as kleenex boxes at certain angles. [sent-307, score-0.36]

96 Learning continuous phrase representations and syntactic parsing with recursive neural networks. [sent-385, score-0.16]

97 What is the best multi-stage architecture for object recognition? [sent-410, score-0.153]

98 Unsupervised learning of invariant feature hierarchies with applications to object recognition. [sent-495, score-0.156]

99 Flexible, high performance convolutional neural networks for image classification. [sent-532, score-0.2]

100 Hardware accelerated convolutional neural networks for synthetic vision systems. [sent-541, score-0.161]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rnns', 0.527), ('cnn', 0.33), ('rnn', 0.321), ('depth', 0.228), ('rgb', 0.211), ('layer', 0.146), ('recursive', 0.127), ('food', 0.118), ('object', 0.114), ('modality', 0.097), ('convolutional', 0.093), ('garlic', 0.091), ('filters', 0.088), ('images', 0.084), ('unsupervised', 0.076), ('features', 0.075), ('color', 0.075), ('shampoo', 0.073), ('tnn', 0.073), ('convolution', 0.069), ('tree', 0.068), ('softmax', 0.065), ('bo', 0.065), ('pooled', 0.065), ('socher', 0.065), ('bottle', 0.064), ('cnns', 0.064), ('filters', 0.063), ('pooling', 0.059), ('deep', 0.058), ('patches', 0.058), ('ablations', 0.055), ('farabet', 0.055), ('kleenex', 0.055), ('mug', 0.055), ('untied', 0.055), ('surf', 0.053), ('architectures', 0.053), ('coates', 0.051), ('sift', 0.05), ('weights', 0.049), ('blocks', 0.049), ('mushrooms', 0.048), ('coffee', 0.048), ('mushroom', 0.048), ('surface', 0.047), ('scene', 0.047), ('water', 0.046), ('household', 0.044), ('raw', 0.043), ('ranzato', 0.043), ('feature', 0.042), ('ng', 0.042), ('jarrett', 0.04), ('parent', 0.04), ('architecture', 0.039), ('image', 0.039), ('descriptors', 0.038), ('filter', 0.038), ('normals', 0.038), ('stride', 0.038), ('icra', 0.037), ('lai', 0.037), ('height', 0.036), ('binder', 0.036), ('calculator', 0.036), ('caps', 0.036), ('cellphone', 0.036), ('cereal', 0.036), ('eraser', 0.036), ('flashlight', 0.036), ('glue', 0.036), ('greens', 0.036), ('jar', 0.036), ('koppula', 0.036), ('lcn', 0.036), ('lemon', 0.036), ('lightbulb', 0.036), ('lime', 0.036), ('noodles', 0.036), ('notebook', 0.036), ('onion', 0.036), ('peach', 0.036), ('pear', 0.036), ('pennington', 0.036), ('pitcher', 0.036), ('potato', 0.036), ('rubber', 0.036), ('sponge', 0.036), ('stapler', 0.036), ('tomato', 0.036), ('toothbrush', 0.036), ('toothpaste', 0.036), ('trnn', 0.036), ('recognition', 0.035), ('networks', 0.035), ('indoor', 0.034), ('compose', 0.034), ('autoencoders', 0.034), ('neural', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

2 0.26346353 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1

3 0.14113933 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng

Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1

4 0.14046772 62 nips-2012-Burn-in, bias, and the rationality of anchoring

Author: Falk Lieder, Thomas Griffiths, Noah Goodman

5 0.14046772 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

Author: Adam Coates, Andrej Karpathy, Andrew Y. Ng

Abstract: Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale. 1

6 0.13972878 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

7 0.13790412 197 nips-2012-Learning with Recursive Perceptual Representations

8 0.13137569 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

9 0.12088549 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

10 0.11353693 193 nips-2012-Learning to Align from Scratch

11 0.11243061 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

12 0.087712206 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

13 0.08024434 8 nips-2012-A Generative Model for Parts-based Object Segmentation

14 0.079210266 344 nips-2012-Timely Object Recognition

15 0.077508539 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

16 0.077240244 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

17 0.075511724 202 nips-2012-Locally Uniform Comparison Image Descriptor

18 0.074688353 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

19 0.06912411 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

20 0.066717707 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.159), (1, 0.062), (2, -0.265), (3, 0.0), (4, 0.119), (5, -0.055), (6, -0.007), (7, -0.089), (8, -0.028), (9, -0.019), (10, 0.003), (11, 0.076), (12, -0.068), (13, 0.074), (14, -0.088), (15, -0.038), (16, 0.012), (17, 0.005), (18, 0.054), (19, -0.046), (20, -0.021), (21, -0.058), (22, 0.074), (23, -0.014), (24, 0.013), (25, -0.024), (26, -0.051), (27, -0.028), (28, 0.033), (29, -0.038), (30, 0.053), (31, -0.005), (32, -0.031), (33, -0.049), (34, 0.102), (35, 0.103), (36, -0.008), (37, 0.112), (38, 0.021), (39, -0.03), (40, -0.007), (41, -0.101), (42, 0.015), (43, 0.051), (44, 0.001), (45, 0.02), (46, 0.011), (47, -0.106), (48, 0.002), (49, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93837488 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

2 0.82840306 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1

3 0.78839374 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1

4 0.7874276 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

Author: Ryan Kiros, Csaba Szepesvári

Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1

5 0.78234464 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng

Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariance which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy 61% on STL-10 dataset. 1

6 0.7248227 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

7 0.7241028 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

8 0.69531018 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

9 0.6387949 170 nips-2012-Large Scale Distributed Deep Networks

10 0.62367564 210 nips-2012-Memorability of Image Regions

11 0.61652684 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

12 0.57850426 202 nips-2012-Locally Uniform Comparison Image Descriptor

13 0.57132256 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

14 0.56737584 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

15 0.5570448 146 nips-2012-Graphical Gaussian Vector for Image Categorization

16 0.53200036 8 nips-2012-A Generative Model for Parts-based Object Segmentation

17 0.52301115 197 nips-2012-Learning with Recursive Perceptual Representations

18 0.50796914 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

19 0.49281499 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

20 0.48712453 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (17, 0.019), (21, 0.016), (38, 0.049), (42, 0.014), (54, 0.021), (55, 0.057), (74, 0.088), (76, 0.074), (80, 0.074), (92, 0.473)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86248779 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1

2 0.80619562 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

3 0.79609436 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy

Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the √ parameter can always be approximated with accuracy ε > 0 by a set of size O(1/ ε). A √ lower bound of size Ω(1/ ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an √ approximate path of size O(1/ ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1

4 0.79281211 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1

5 0.7527132 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

6 0.72388119 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

7 0.58267057 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

8 0.54876065 329 nips-2012-Super-Bit Locality-Sensitive Hashing

9 0.53354234 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

10 0.53156865 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

11 0.51206559 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

12 0.50178778 260 nips-2012-Online Sum-Product Computation Over Trees

13 0.49623013 94 nips-2012-Delay Compensation with Dynamical Synapses

14 0.49426484 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.48837411 65 nips-2012-Cardinality Restricted Boltzmann Machines

16 0.48741785 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

17 0.47831517 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

18 0.47566289 102 nips-2012-Distributed Non-Stochastic Experts

19 0.47440577 163 nips-2012-Isotropic Hashing

20 0.4726913 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines