
193 nips-2012-Learning to Align from Scratch


Source: pdf

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-Miller

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. [sent-7, score-0.669]

2 Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. [sent-8, score-0.555]

3 However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. [sent-9, score-0.739]

4 In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. [sent-10, score-0.574]

5 Specifically, we incorporate deep learning into the congealing alignment framework. [sent-11, score-1.217]

6 Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. [sent-12, score-0.258]

7 In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. [sent-13, score-0.539]

8 Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. [sent-15, score-0.624]

9 This variability can be seen in Figure 1, which shows sample images from Labeled Faces in the Wild (LFW), a data set used for benchmarking unconstrained face verification performance. [sent-19, score-0.29]

10 The task in LFW is, given a pair of face images, determine if both faces are of the same person (matched pair), or if each shows a different person (mismatched pair). [sent-20, score-0.225]

11 Figure 1: Sample images from LFW: matched pairs (top row) and mismatched pairs (bottom row). Recognition performance can be significantly improved by removing undesired intra-class variability, by first aligning the images to some canonical pose or configuration. [sent-21, score-0.392]

12 For instance, face verification accuracy can be dramatically increased through image alignment, by detecting facial feature points on the image and then warping these points to a canonical configuration. [sent-22, score-0.315]

13 This alignment process can lead to significant gains in recognition accuracy on real-world face verification, even for algorithms that were explicitly designed to be robust to some misalignment [1]. [sent-23, score-0.56]
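To make the fiducial-point pipeline above concrete, here is a minimal sketch of the warping step; it is our own illustration (hypothetical function name), not the method of [1] or [2], and it fits a least-squares similarity transform from detected landmarks to a canonical configuration:

    import numpy as np

    def similarity_from_landmarks(src, dst):
        # Least-squares similarity transform (rotation, scale, translation)
        # mapping detected points src (n, 2) onto canonical points dst (n, 2).
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        s, d = src - mu_s, dst - mu_d
        denom = (s ** 2).sum()
        a = (s * d).sum() / denom
        b = (s[:, 0] * d[:, 1] - s[:, 1] * d[:, 0]).sum() / denom
        A = np.array([[a, -b], [b, a]])
        t = mu_d - A @ mu_s
        return A, t  # a landmark x is warped to A @ x + t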

14 Therefore, the majority of face recognition systems evaluated on LFW currently make use of a preprocessed version of the data set known as LFW-a, where the images have been aligned by a commercial fiducial point-based supervised alignment method [2]. [sent-24, score-0.851]

15 Fiducial point (or landmark-based) alignment algorithms [1, 3–5], however, require a large amount of supervision or manual effort. [sent-25, score-0.415]

16 These methods are thus hard to apply to new object classes, since all of this manual collection of data must be re-done, and the alignment results may be sensitive to the choice of fiducial points and quality of training examples. [sent-27, score-0.416]

17-18 An alternative to this supervised approach is to take a set of poorly aligned images (e.g., images drawn from approximately the same distribution as the inputs to the recognition system) and attempt to make the images more similar to each other, using some measure of joint similarity such as entropy. [sent-28, score-0.288] [sent-30, score-0.361]

19 This framework of iteratively transforming images to reduce the entropy of the set is known as congealing [6], and was originally applied to specific types of images such as binary handwritten characters and magnetic resonance image volumes [7–9]. [sent-31, score-1.079]

20 However, this extension required a careful selection of hand-crafted feature representation (SIFT [11]) and soft clustering, and does not achieve as large an improvement in verification accuracy as supervised alignment (LFW-a). [sent-33, score-0.523]

21 In this work, we propose a novel combination of unsupervised alignment and unsupervised feature learning, specifically by incorporating deep learning [12–14] into the congealing framework. [sent-34, score-1.407]

22 Through deep learning, we can obtain a feature representation tuned to the statistics of the specific object class we wish to align, and capture the data at multiple scales by using multiple layers of a deep learning architecture. [sent-35, score-0.507]

23 Further, we incorporate a group sparsity constraint into the deep learning algorithm, leading to a topographic organization on the learned filters, and show that this leads to improved alignment results. [sent-36, score-0.733]

24 We apply our method to unconstrained face images and show that, using the aligned images, we achieve a significantly higher face verification accuracy than obtained both using the original face images and using the images produced by prior work in unsupervised alignment [10]. [sent-37, score-1.421]

25 In addition, the accuracy surpasses that achieved using supervised fiducial points based alignment [3], and matches the accuracy using the LFW-a images produced by commercial supervised alignment. [sent-38, score-0.823]

26 Related Work: We review relevant work in unsupervised joint alignment and deep learning. [sent-39, score-0.657]

27 Cox et al. presented a variation of congealing for unsupervised alignment, where the entropy similarity measure is replaced with a least-squares similarity measure [15, 16]. [sent-42, score-0.811]

28 Other work extended congealing by modifying the objective function to allow for simultaneous alignment and clustering [17]. [sent-44, score-1.023]

29 Later work developed a method for non-rigid alignment using a model parameterized by mesh vertex coordinates in a deformable Lucas-Kanade formulation [19]. [sent-48, score-0.384]

30 In this work, we chose to extend the original congealing method, rather than other alignment frameworks, for several reasons. [sent-52, score-1.023]

31 The algorithm uses entropy as a measure of similarity, rather than variance or least squares, thus allowing for the alignment of data with multiple modes. [sent-53, score-0.425]

32 Unlike other joint alignment procedures [15], the main loop scales linearly with the number of images to be aligned, allowing for a greater number of images to be jointly aligned, smoothing the optimization landscape. [sent-54, score-0.698]

33 Finally, congealing requires only very weak supervision in the form of poorly aligned images. [sent-55, score-0.765]

34 However, our proposed extensions, using features obtained from deep learning, could also be applied to other alignment algorithms that have only been used with a pixel intensity representation, such as [15, 16, 19]. [sent-56, score-0.612]

35 Deep Learning: A deep belief network (DBN) is a generative graphical model consisting of a layer of visible units and multiple layers of hidden units, where each layer encodes statistical dependencies among the units in the layer below. [sent-58, score-0.716]

36 To the best of our knowledge, our proposed method is the first to apply deep learning to the alignment problem. [sent-64, score-0.578]

37 DBNs are generally trained using images drawn from the same distribution as the test images, which in our case corresponds to learning from faces in the LFW training set. [sent-65, score-0.299]

38 Prior work has shown successful applications of self-taught learning, using sparse coding and deep belief networks to learn feature representations from natural images [23]. [sent-71, score-0.306]

39 In this paper, we examine whether self-taught learning can be successful for alignment tasks. [sent-72, score-0.384]

40 Methodology: We begin with a review of the congealing framework. [sent-76, score-0.639]

41 We then show how deep learning can be incorporated into this framework using convolutional DBNs, and how the learning algorithm can be modified through group sparsity regularization to improve congealing performance. [sent-77, score-1.014]

42 For example, letting the feature space be intensity values gives M = 2 for binary images and M = 256 for 8-bit grayscale images, where M is the number of possible feature values. [sent-84, score-0.209]
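Written out in our own rendering (this formula does not appear verbatim in the excerpt), congealing minimizes the summed entropy of the location-wise distribution fields $D_x$, each a distribution over the M feature values, with respect to the per-image transformations:

    \min_{T^1, \ldots, T^N} \; \sum_{x} H(D_x), \qquad
    H(D_x) = -\sum_{m=1}^{M} D_x(m) \log D_x(m),

where $D_x(m)$ is the fraction of the transformed images whose feature at location $x$ takes value $m$.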

43 Figure 2 illustrates congealing on one dimensional binary images. [sent-93, score-0.639]
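As a concrete sketch of the procedure that Figure 2 illustrates, the following toy implementation (our own code with hypothetical names, not the paper's) congeals one-dimensional binary sequences under integer shifts by greedy per-sequence entropy descent:

    import numpy as np

    def field_entropy(seqs):
        # Total entropy of the per-location distribution field (M = 2):
        # p[x] is the fraction of sequences with value 1 at location x.
        p = np.clip(seqs.mean(axis=0), 1e-8, 1 - 1e-8)
        return float(np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p)))

    def congeal_1d(seqs, n_iters=20):
        # seqs: (N, L) binary array; the transformation set is integer shifts.
        # Each sequence greedily accepts the shift (if any) that lowers the
        # total entropy of the set.
        seqs = seqs.astype(float).copy()
        for _ in range(n_iters):
            for n in range(seqs.shape[0]):
                base = seqs[n].copy()
                best, best_h = base, field_entropy(seqs)
                for shift in (-1, 1):
                    seqs[n] = np.roll(base, shift)
                    h = field_entropy(seqs)
                    if h < best_h:
                        best, best_h = seqs[n].copy(), h
                seqs[n] = best
        return seqs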

44 Once congealing has been performed on a set of images (e.g., a training set), the sequence of distribution fields (DFs) from each iteration of congealing can be saved. [sent-95, score-0.796]

45 A new image is then aligned by transforming it iteratively according to the sequence of saved DFs, thereby approximating the results of congealing on the original set of images as well as the new test image. [sent-102, score-0.956]
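A sketch of this test-time alignment step, continuing the toy example above; saving one field per iteration and minimizing a new sequence's negative log-likelihood under each saved field is our assumption about the details, since for a single image this criterion plays the role of the entropy objective:

    import numpy as np

    def align_new(seq, saved_fields):
        # saved_fields: one distribution field per congealing iteration,
        # each a length-L array giving P(value = 1) at every location.
        seq = seq.astype(float).copy()
        for p in saved_fields:
            p = np.clip(p, 1e-8, 1 - 1e-8)
            def nll(s):
                # Negative log-likelihood of sequence s under field p.
                return -float(np.sum(s * np.log(p) + (1 - s) * np.log(1 - p)))
            seq = min((np.roll(seq, sh) for sh in (-1, 0, 1)), key=nll)
        return seq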

46 As mentioned earlier, congealing was extended to work on complex object classes, such as faces, by using soft clustering of SIFT descriptors as the feature representation [10]. [sent-103, score-0.726]

47 We will refer to this congealing algorithm as SIFT congealing. [sent-104, score-0.639]

48 Deep Congealing: To incorporate deep learning within congealing, we use the convolutional restricted Boltzmann machine (CRBM) [23, 35] and convolutional deep belief network (CDBN) [23]. [sent-107, score-0.603]

49 The CRBM is an extension of the restricted Boltzmann machine, which is a Markov random field with a hidden layer and a visible layer (corresponding to image pixels in computer vision problems), where the connection between layers is bipartite. [sent-108, score-0.431]

50 In the CRBM, rather than fully connecting the hidden layer and visible layer, the weights between the hidden units and the visible units are local (i.e., each hidden unit is connected only to a small patch of the visible layer) and shared across all image locations. [sent-109, score-0.441]

51 The CRBM has three sets of parameters: (1) $K$ convolution filter weights $W^k \in \mathbb{R}^{n_w \times n_w}$ ($k = 1, \ldots, K$); (2) hidden biases $b_k \in \mathbb{R}$ that are shared among hidden nodes; and (3) a visible bias $c \in \mathbb{R}$ that is shared among visible nodes. [sent-118, score-0.246]

52 The hidden units of each group $h^k$ are partitioned into small blocks (i.e., pooling regions) that are pooled to a pooling node $p^k$. [sent-130, score-0.426]

53 After training a CRBM, we can use it to compute the posterior of the pooling units given the input data. [sent-133, score-0.224]

54 These pooling unit activations can be used as input to further train the next layer CRBM. [sent-134, score-0.33]
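For reference, a sketch of the probabilistic max-pooling posterior used by CRBMs in [23], in our own code; here I denotes the bottom-up input (filter response plus hidden bias) to the hidden units of one group:

    import numpy as np

    def pooling_posterior(I, pool=3):
        # P(p_alpha = 1 | v) = S / (1 + S), where S is the sum of exp(I)
        # over each non-overlapping pool x pool block; computed stably.
        H, W = I.shape
        Hp, Wp = H // pool, W // pool
        blocks = I[:Hp * pool, :Wp * pool].reshape(Hp, pool, Wp, pool)
        blocks = blocks.transpose(0, 2, 1, 3).reshape(Hp, Wp, pool * pool)
        m = np.maximum(blocks.max(axis=-1), 0.0)           # stabilizer
        S = np.exp(blocks - m[..., None]).sum(axis=-1)     # rescaled sum
        return S / (np.exp(-m) + S)                        # P(p_alpha = 1 | v)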

55 After constructing a convolutional deep belief network, we perform (approximate) inference to compute the pooling unit probabilities. (We use real-valued visible units in the first-layer CRBM; however, we use binary-valued visible units when constructing the second-layer CRBM.) [sent-136, score-0.567]

56 For a CDBN with K pooling layer groups, we now have K location stacks at each image location (after max-pooling), over a binary distribution for each location stack. [sent-140, score-0.501]

57 Given N unaligned face images, let P be the number of pooling units in each group in the top-most layer of the CDBN. [sent-141, score-0.495]

58 We use the pooling unit probabilities, with the interpretation that the pooling unit can be considered as a mixture of sub-units that are on and off [6]. [sent-142, score-0.37]

59 Letting $p^{k,(n)}_\alpha$ be the pooling unit $\alpha$ in group $k$ for image $n$ under some transformation $T^n$, we define $D^k_\alpha(1) = \frac{1}{N} \sum_{n=1}^{N} p^{k,(n)}_\alpha$ and $D^k_\alpha(0) = 1 - D^k_\alpha(1)$. [sent-143, score-0.298]

60 Then, the entropy for a specific pooling unit is $H(D^k_\alpha) = -\sum_{s \in \{0,1\}} D^k_\alpha(s) \log D^k_\alpha(s)$. [sent-144, score-0.226]
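In code, the resulting alignment objective is a direct sum of per-unit entropies; a minimal sketch, where the array shape (N images, K groups, P pooling units per group) is our assumption:

    import numpy as np

    def pooling_entropy(pool_probs):
        # pool_probs: (N, K, P) pooling-unit probabilities p_alpha^{k,(n)}.
        D1 = np.clip(pool_probs.mean(axis=0), 1e-8, 1 - 1e-8)  # D_alpha^k(1)
        D0 = 1.0 - D1                                          # D_alpha^k(0)
        # Sum of H(D_alpha^k) over all groups k and pooling units alpha.
        return float(-(D1 * np.log(D1) + D0 * np.log(D0)).sum())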

61 Note that if K = 1, this reduces to the traditional congealing formulation on the binary output of the single pooling layer. [sent-146, score-0.805]

62 Learning a Topology: As congealing reduces entropy by performing local hill-climbing in the transformation parameters, a key factor in the success of congealing is the smoothness of this optimization landscape. [sent-148, score-1.319]

63 This may lead to plateaus or local minima in the optimization landscape with congealing, for instance, if one filter is a small rotation of another filter, and a rotation of the image causes a section of the face to be between these two filters. [sent-156, score-0.2]

64 For instance, a second-layer CDBN trained on face images would likely learn multiple filters that resemble eye detectors, capturing slightly different types and scales of eyes. [sent-158, score-0.312]

65 If these filters are activating independently, then the resulting entropy of a set of images may not decrease even if eyes in different images are brought into closer alignment. [sent-159, score-0.355]

66 A smooth optimization for congealing requires that, as an image patch is transformed from one such sparse set to another, the change in pooling unit activations is also gradual rather than abrupt. [sent-161, score-0.938]

67 Therefore, we would like to learn filters with a linear topological ordering, such that when a particular pooling unit $p^k_\alpha$ at location $\alpha$, associated with filter $k$, is activated, the pooling units at the same location that are associated with nearby filters (i.e., $p^{k'}_\alpha$ for $k'$ near $k$) are also likely to be activated. [sent-162, score-0.555]

68 To learn a topology on the learned filters, we add the following group sparsity penalty to the learning objective function (i.e., the negative log-likelihood). [sent-165, score-0.23]

69 Let the term array be used to refer to the set of pooling units associated with a particular filter, i.e., the units $p^k$ sharing a common filter index $k$. [sent-168, score-0.251]

70 This regularization penalty is a sum (an L1 norm) of L2 norms; each L2 norm applies a Gaussian weighting, centered on a particular array, to the pooling units across all arrays at a specific location. [sent-171, score-0.279]

71 We define $\alpha(i,j)$ as the pooling location associated with position $(i,j)$, and $J$ as $J^{k,k_1}_{ij} = p^{k_1}_{\alpha(i,j)} \bigl(1 - p^{k_1}_{\alpha(i,j)}\bigr) h^{k}_{ij}$. [sent-182, score-0.32]
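Under our reading of this description (the exact weighting and constants are not given in this excerpt), the penalty can be sketched as follows: for each center array c and each location, take the L2 norm of the Gaussian-weighted pooling activations across arrays, then sum (L1) over all centers and locations:

    import numpy as np

    def group_sparsity_penalty(pool, sigma=1.0):
        # pool: (K, P) pooling activations for one image; K arrays (filters),
        # P pooling locations per array.
        K, _ = pool.shape
        idx = np.arange(K)
        # Gaussian weighting w[c, k] over arrays, centered at each array c.
        w = np.exp(-(idx[None, :] - idx[:, None]) ** 2 / (2.0 * sigma ** 2))
        # sqrt(sum_k w[c, k] * pool[k, alpha]^2) for every (c, alpha),
        # then the L1 sum over all centers c and locations alpha.
        return float(np.sqrt(w @ pool ** 2).sum())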

72 Experiments: We learn three different convolutional DBN models to use as the feature representation for deep congealing. [sent-189, score-0.337]

73 First, we learn a one-layer CRBM from the Kyoto images, a standard natural image data set, to evaluate the performance of congealing with self-taught CRBM features. [sent-190, score-0.703]

74 Next, we learn a one-layer CRBM from LFW face images, to compare performance when learning the features directly on images of the object class to be aligned. [sent-191, score-0.296]

75 For computing the pooling layer representation to use in congealing, we modified the pooling size to 3x3 for the one-layer models and 2x2 for the second layer in the two-layer model, and adjusted the hidden biases to give an expected activation of 0. [sent-198, score-0.625]

76 In Figure 5, we show a selection of images under several alignment methods. [sent-200, score-0.541]

77 We evaluate the effect of alignment on verification accuracy using View 1 of LFW. [sent-202, score-0.432]

78 For the congealing methods, 400 images from the training set were congealed and used to form a funnel to subsequently align all of the images in both the training and test sets. [sent-203, score-0.99]

79 We then normalize each image feature vector, and apply a linear SVM to an image pair by combining the image feature vectors using element-wise multiplication. [sent-206, score-0.256]
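A minimal sketch of this pair-verification pipeline; scikit-learn is our library choice, as the paper's excerpt does not name one:

    import numpy as np
    from sklearn.svm import LinearSVC

    def pair_feature(x1, x2):
        # Normalize each image's feature vector, then combine the pair
        # by element-wise multiplication, as described above.
        x1 = x1 / (np.linalg.norm(x1) + 1e-8)
        x2 = x2 / (np.linalg.norm(x2) + 1e-8)
        return x1 * x2

    # X1, X2: (n_pairs, d) image features; y: 1 = matched, 0 = mismatched.
    # clf = LinearSVC().fit(
    #     np.array([pair_feature(a, b) for a, b in zip(X1, X2)]), y)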

80 Figure 5: Sample images from LFW produced by different alignment algorithms. [sent-211, score-0.043]

81 For each set of five images, the alignments are, from left to right: original images; SIFT Congealing; Deep Congealing, Faces, layer 1, with topology; Deep Congealing, Faces, layer 2, with topology; Supervised (LFW-a). [sent-212, score-0.23]

82 Table 1 gives the verification accuracy for this verification system using images produced by a number of alignment algorithms. [sent-216, score-0.632]

83 Deep congealing gives a significant improvement over SIFT congealing. [sent-217, score-0.639]

84 Using a CDBN representation learned with a group sparsity penalty, which leads to learned filters with a topographic organization, consistently gives an accuracy higher by one to two percentage points. [sent-218, score-0.226]

85 We compare with two supervised alignment systems, the fiducial points based system of [3], and LFW-a. [sent-219, score-0.42]

86 Note that LFW-a was produced by a commercial alignment system, in the spirit of [3], but with important differences that have not been published [2]. [sent-220, score-0.517]

87 Congealing with a one-layer CDBN trained on faces, with topology, gives verification accuracy significantly higher than using images produced by [3], and comparable to the accuracy using LFW-a images. [sent-221, score-0.32]

88 Moreover, we can combine the verification scores using images from the one-layer and two-layer CDBN trained on faces, learning a second SVM on these scores. [sent-222, score-0.206]

89 This suggests that the two-layer CDBN alignment is somewhat complementary to the one-layer alignment. [sent-225, score-0.384]
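The score combination described above might look like the following sketch (hypothetical names; the two inputs are the decision values of the layer-1 and layer-2 pair classifiers on the same pairs):

    import numpy as np
    from sklearn.svm import LinearSVC

    def fuse_scores(s1, s2, y):
        # s1, s2: decision values from the layer-1 and layer-2 pair
        # classifiers on the same training pairs; y: pair labels.
        # A second linear SVM is learned on the stacked scores.
        return LinearSVC().fit(np.column_stack([s1, s2]), y)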

90 As a control, we performed the same score combination using the scores produced from the one-layer CDBN alignment (trained on faces, with topology) and from the original images. [sent-227, score-0.633]

91 Table 1: Unconstrained face verification accuracy on View 1 of LFW using images produced by different alignment algorithms. [sent-238, score-0.739]

92 By combining the classifier scores produced by layer 1 and 2 using a linear SVM, we achieve higher accuracy using unsupervised alignment than obtained using the widely-used LFW-a images, generated using a commercial supervised fiducial-points algorithm. [sent-239, score-0.801]

93 Conclusion: We have shown how to combine unsupervised joint alignment with unsupervised feature learning. [sent-251, score-0.574]

94 By congealing on the pooling layer representation of a CDBN, we are able to achieve significant gains in verification accuracy over existing methods for unsupervised alignment. [sent-252, score-1.07]

95 Using face images aligned by this method, we obtain higher verification accuracy than the supervised fiducial points based method of [3]. [sent-254, score-0.423]

96 Further, despite being unsupervised, our method is still able to achieve comparable accuracy to the widely used LFW-a images, obtained by a commercial fiducial point-based alignment system whose detailed procedure is unpublished. [sent-255, score-0.503]

97 We believe that our proposed method is an important contribution in developing generic alignment systems that do not require domain-specific fiducial points. [sent-256, score-0.384]

98 Unsupervised learning of hierarchical representations with convolutional deep belief networks. [sent-422, score-0.321]

99 Learning hierarchical representations for face verification with convolutional deep belief networks. [sent-444, score-0.428]

100 Unsupervised feature learning for audio classification using convolutional deep belief networks. [sent-452, score-0.353]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('congealing', 0.639), ('alignment', 0.384), ('cdbn', 0.24), ('crbm', 0.2), ('deep', 0.194), ('pooling', 0.166), ('images', 0.157), ('lfw', 0.13), ('lters', 0.123), ('faces', 0.118), ('layer', 0.115), ('veri', 0.111), ('topology', 0.109), ('face', 0.107), ('ducial', 0.106), ('convolutional', 0.088), ('unsupervised', 0.079), ('aligned', 0.075), ('commercial', 0.071), ('sift', 0.068), ('visible', 0.065), ('image', 0.064), ('topographic', 0.062), ('units', 0.058), ('lter', 0.056), ('location', 0.052), ('pk', 0.051), ('group', 0.049), ('accuracy', 0.048), ('nw', 0.046), ('sparsity', 0.044), ('nv', 0.043), ('hk', 0.043), ('produced', 0.043), ('entropy', 0.041), ('crbms', 0.04), ('hidden', 0.04), ('belief', 0.039), ('cvpr', 0.038), ('align', 0.037), ('supervised', 0.036), ('dbns', 0.036), ('csml', 0.035), ('pixel', 0.034), ('lee', 0.034), ('iccv', 0.033), ('feature', 0.032), ('layers', 0.032), ('object', 0.032), ('supervision', 0.031), ('df', 0.03), ('activations', 0.03), ('landscape', 0.029), ('kyoto', 0.029), ('penalty', 0.028), ('array', 0.027), ('accv', 0.027), ('dfs', 0.027), ('fiducial', 0.027), ('lsparsity', 0.027), ('lucey', 0.027), ('mattar', 0.027), ('variability', 0.026), ('similarity', 0.026), ('raina', 0.026), ('unlabeled', 0.026), ('cox', 0.025), ('scores', 0.025), ('trained', 0.024), ('nh', 0.024), ('eye', 0.024), ('ranzato', 0.024), ('representation', 0.023), ('boltzmann', 0.023), ('sohn', 0.022), ('cosine', 0.022), ('topological', 0.022), ('recognition', 0.021), ('ij', 0.021), ('iteratively', 0.021), ('nearby', 0.021), ('coding', 0.021), ('weighting', 0.02), ('letting', 0.02), ('mismatched', 0.02), ('honglak', 0.02), ('aligning', 0.02), ('sparse', 0.02), ('cation', 0.02), ('poorly', 0.02), ('undesired', 0.019), ('battle', 0.019), ('detectors', 0.019), ('published', 0.019), ('unit', 0.019), ('nodes', 0.019), ('pose', 0.019), ('boureau', 0.019), ('wild', 0.019), ('shared', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-Miller


2 0.18255523 62 nips-2012-Burn-in, bias, and the rationality of anchoring

Author: Falk Lieder, Thomas Griffiths, Noah Goodman


3 0.18255523 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

Author: Adam Coates, Andrej Karpathy, Andrew Y. Ng

Abstract: Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale.

4 0.1822565 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

5 0.14167561 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng

Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances that are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy of 61% on the STL-10 dataset.

6 0.13708054 167 nips-2012-Kernel Hyperalignment

7 0.12107044 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

8 0.11353693 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

9 0.11253163 197 nips-2012-Learning with Recursive Perceptual Representations

10 0.10191248 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

11 0.098148517 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

12 0.097727895 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

13 0.090671711 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

14 0.088624172 65 nips-2012-Cardinality Restricted Boltzmann Machines

15 0.079767615 8 nips-2012-A Generative Model for Parts-based Object Segmentation

16 0.077331029 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

17 0.075379297 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

18 0.072912186 185 nips-2012-Learning about Canonical Views from Internet Image Collections

19 0.070052065 170 nips-2012-Large Scale Distributed Deep Networks

20 0.067636773 100 nips-2012-Discriminative Learning of Sum-Product Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.148), (1, 0.067), (2, -0.247), (3, 0.007), (4, 0.127), (5, -0.028), (6, -0.003), (7, -0.116), (8, 0.028), (9, -0.048), (10, 0.012), (11, 0.057), (12, -0.111), (13, 0.067), (14, -0.05), (15, -0.084), (16, 0.042), (17, -0.023), (18, 0.054), (19, -0.041), (20, 0.024), (21, -0.048), (22, 0.028), (23, 0.031), (24, 0.039), (25, 0.024), (26, 0.043), (27, -0.016), (28, -0.031), (29, -0.019), (30, 0.043), (31, -0.025), (32, -0.006), (33, 0.016), (34, 0.049), (35, 0.007), (36, -0.017), (37, 0.054), (38, -0.017), (39, 0.05), (40, -0.103), (41, -0.036), (42, 0.035), (43, -0.022), (44, 0.001), (45, -0.038), (46, 0.029), (47, 0.031), (48, 0.014), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94809377 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-Miller


2 0.81598443 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

Author: Will Zou, Shenghuo Zhu, Kai Yu, Andrew Y. Ng

Abstract: We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances that are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy of 61% on the STL-10 dataset.

3 0.77939999 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

4 0.77550149 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng

Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs.

5 0.69495928 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi

Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. However, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of neural networks $NN^k_{ij}$, where i and j index the spatial coordinates of the contact map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components.

6 0.66841263 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

7 0.66438532 62 nips-2012-Burn-in, bias, and the rationality of anchoring

8 0.66438532 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

9 0.63968098 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

10 0.62844771 8 nips-2012-A Generative Model for Parts-based Object Segmentation

11 0.62371653 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

12 0.6194883 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.6094023 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

14 0.60927427 170 nips-2012-Large Scale Distributed Deep Networks

15 0.57782042 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

16 0.5643816 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

17 0.54252213 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

18 0.53531438 210 nips-2012-Memorability of Image Regions

19 0.52065086 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

20 0.50037926 146 nips-2012-Graphical Gaussian Vector for Image Categorization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (17, 0.032), (21, 0.046), (38, 0.076), (39, 0.012), (42, 0.024), (44, 0.262), (54, 0.029), (55, 0.072), (74, 0.09), (76, 0.081), (77, 0.014), (80, 0.07), (92, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.746975 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-Miller


2 0.73239917 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.

3 0.68617028 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof

Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-ofthe-art methods. 1 Introduction and Related Work In the last years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. They are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Ever since, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. Within this work we are proposing that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80’s with 1 Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training.). 
Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. 
We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. 
This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t:  if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1)  P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). 
Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. 
We evaluated our method on the TUD pedestrian database [2] in two different ways. First, we show our detection results when training according to the standard protocol using 400 training images (where each image contains a single pedestrian annotation) and evaluating on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as the training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation, which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground-truth annotations of [23] for the entire scene, which include all persons and comprise 1215 bounding boxes. Note that this annotation is even more detailed than the one presented in [4], which has 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than the state of the art.

The most closely related work, and therefore also the baseline in our experiments, is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and for our context-sensitive forest: we trained 20 trees, and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30, and we performed 1600 node tests to find the best split-function parameters per node. Tree growing was stopped when fewer than 7 samples were available at a node. As image features, we used the first 16 feature channels provided in the publicly available Hough Forest code of [9]. To obtain the object detection hypotheses from the Hough space, we use the same non-maximum suppression (NMS) technique in all our experiments, as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASCAL-VOC criterion, which requires the mutual overlap between ground-truth and detected bounding boxes to be ≥ 50% (a code sketch of this test is given below). The additional parameter of (7) was fixed to σ = 7.

4.1 Evaluation using the standard-protocol training set

The standard training set contains 400 images, each with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by also including the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also take the evaluation results of the Implicit Shape Model (ISM) [16]. Note, however, that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] obtained with the code and configuration files provided on the first author's homepage; unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians with severe scale differences and partial occlusions.
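As a brief aside on the protocol: the PASCAL-VOC acceptance test referenced in the setup above reduces to a 50% intersection-over-union check between detected and ground-truth boxes. A minimal sketch follows, assuming an (x1, y1, x2, y2) corner convention for boxes; the convention and function names are illustrative, not taken from the evaluation code.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(det, gt, threshold=0.5):
    """PASCAL-VOC acceptance test: mutual overlap >= 50%."""
    return iou(det, gt) >= threshold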
The ground truth we use was released with [4] and contains a total of 314 pedestrians. Figure 3, first row, first plot shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), the re-evaluated results of [4] (cyan, 5 scales), [23] (green), and our Context-Sensitive Forest without and with the priority-queue-based tree construction (red/magenta). When the priority queue is not used, the trees are trained in a breadth-first manner. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both context information and the priority-based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground-truth annotations.

[Figure 3: Precision-recall curves for detections. Top row: standard training (400 images), evaluated on TUD Campus (3 scales) and TUD-Crossing (3 scales); curves compare the baseline Hough Forest, Barinova et al. (CVPR'10), Riemenschneider et al. (ECCV'12), Leibe et al. (IJCV'08, Crossing only, 1 scale), and the proposed Context-Sensitive forest without/with the priority queue. Bottom row: training on the Crossing annotations of [23], evaluated on TUD Campus at 3 and 5 scales. Right images: qualitative examples for the Campus (top 2) and Crossing (bottom 2) scenes — blue: ground truth; green: correctly found by our method; red: wrong association; cyan: missed detection.]

The Crossing data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene, but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower here (≈ 2% gain at a precision of 90%), and the curves develop similarly. However, this is somewhat expected, as the training data does not properly reflect the occlusions we actually want to model.

4.2 Evaluation on Campus scene using Crossing scene as training set

In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precision-recall curve in the second row of Figure 3 we can see that the margin between the baseline and our proposed method is clearly improved (gain of ≈ 9% recall at a precision of 90%) when evaluating on the same 3 scales.
With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in recall, however at the cost of losing 2–3% of precision below a recall of 60%, as illustrated in the second plot of row 2 of Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation drops already at a recall of ≈ 20%.

5 Conclusions

In this work we have presented Context-Sensitive Decision Forests, with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set, and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection, which typically treat training samples independently. Moreover, we have introduced a novel splitting criterion together with a mode-isolation technique, which allows us to (a) grow trees in a priority-driven way and (b) install novel context-based test functions that check for mutual agreement on object centroids. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods, and additionally found that our new algorithm can exploit training data containing multiple training objects significantly better.

Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project 'Fibermorph', number P22261-N22.

References
[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In (CVPR), 2008.
[3] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981.
[4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. In (CVPR), 2010.
[5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In (ICCV), 2007.
[6] L. Breiman. Random forests. Machine Learning, 2001.
[7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012.
[8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010.
[9] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In (CVPR), 2009.
[10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. (PAMI), 2011.
[11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006.
[12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In (ICCV), 2011.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. (PAMI), 5(3):267–287, 1983.
[15] P. Kontschieder, S. Rota Bulò, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In (ICCV), 2011.
[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. (IJCV), 2008.
[17] R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In (CVPR), 2005.
[18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In (IPMI), 2011.
[19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In (NIPS), 2006.
[20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. (PAMI), 16(9):933–945, 1994.
[21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In (ICCV), 2007.
[22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In (CVPR), 2011.
[23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In (ECCV), 2012.
[24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In (CVPR), 2008.
[25] Z. Tu. Auto-context and its application to high-level vision tasks. In (CVPR), 2008.
[26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the Hough transform for 3D shape recognition and registration. In (BMVC), 2011.

4 0.62800211 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

5 0.5728808 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm. 1

6 0.53901714 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

7 0.52996314 210 nips-2012-Memorability of Image Regions

8 0.5286929 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

9 0.52658641 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

10 0.52596563 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

11 0.52465671 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

12 0.52431053 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

13 0.52286381 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

14 0.52233547 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

15 0.52222633 201 nips-2012-Localizing 3D cuboids in single-view images

16 0.52151585 168 nips-2012-Kernel Latent SVM for Visual Recognition

17 0.52073133 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

18 0.52005666 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

19 0.51918113 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

20 0.51866943 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding