nips nips2013 nips2013-166 knowledge-graph by maker-knowledge-mining

166 nips-2013-Learning invariant representations and applications to face verification


Source: pdf

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Learning invariant representations and applications to face verification. Qianli Liao, Joel Z. Leibo, and Tomaso Poggio; Center for Brains, Minds and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139; lql@mit.edu. [sent-1, score-0.355]

2 One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. [sent-5, score-0.485]

3 The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. [sent-10, score-0.215]

4 Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. [sent-13, score-0.72]

5 Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. [sent-14, score-0.655]

6 Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. [sent-15, score-0.435]

7 1 Introduction In the real world, two images of the same object may only be related by a very complicated and highly nonlinear transformation. [sent-16, score-0.304]

8 Two images of the same face could be related by the transformation from frowning to smiling or from youth to old age. [sent-18, score-0.525]

9 The theory is based on the premise that invariance to identity-preserving transformations is the crux of object recognition. [sent-26, score-0.489]

10 We follow [1]’s theory of invariance (which we review in section 2) and show how various pooling methods for convolutional networks can all be understood as building invariance, since they are all equivalent to special cases of the model we study here. [sent-29, score-0.452]

11 Our use of computer-generated image datasets lets us completely control the transformations appearing in each test, thereby allowing us to measure properties of the representation for each transformation independently. [sent-31, score-0.361]

12 We find that the representation performs well even when it is applied to transformations for which there are no theoretical guarantees. [sent-32, score-0.22]

13 One example is the clutter “transformation”, which maps an image of a face on one background to the same face on a different background. [sent-34, score-0.598]

14 We find that, despite the use of a very simple classifier—thresholding the angle between face representations—our approach still achieves results that compare favorably with the current state of the art and even exceed it in some cases. [sent-37, score-0.226]

15 2 Template-based invariant encodings for objects unseen during training We conjecture that achieving invariance to identity-preserving transformations without losing discriminability is the crux of object recognition. [sent-38, score-0.759]

16 Our aim is to compute a unique signature for each image x that is invariant with respect to a group of transformations G. [sent-40, score-0.47]

17 We regard two images as equivalent if they are part of the same orbit, that is, if they are transformed versions of one another (x′ = gx for some g ∈ G). [sent-43, score-0.265]

18 The orbit of an image is itself invariant with respect to the group. [sent-44, score-0.364]

19 For example, the set of images obtained by rotating x is exactly the same as the set of images obtained by rotating gx. [sent-45, score-0.53]

20 The orbit is also unique for each object: the set of images obtained by rotating x only intersects with the set of images obtained by rotating x′ when x′ = gx. [sent-46, score-0.735]

21 Thus, an intuitive method of obtaining an invariant signature for an image, unique to each object, is just to check which orbit it belongs to. [sent-47, score-0.398]

22 We can assume access to a stored set of orbits of template images τk ; these template orbits could have been acquired by unsupervised learning—possibly by observing objects transform and associating temporally adjacent frames. [sent-48, score-0.992]
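
A minimal sketch (not from the paper) of how such template orbits could be assembled from video by temporal association; the frame container and the shot_boundaries indices are illustrative assumptions.

```python
def orbits_from_video(frames, shot_boundaries):
    """Group temporally adjacent video frames into template 'orbits'.

    frames: a sequence of (preprocessed) image feature vectors.
    shot_boundaries: sorted indices at which a new object/shot begins; the
    frames between two boundaries are assumed to show one transforming object.
    """
    orbits, start = [], 0
    for end in list(shot_boundaries) + [len(frames)]:
        if end > start:
            orbits.append(frames[start:end])
        start = end
    return orbits
```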

23 The key fact enabling this approach to object recognition is this: It is not necessary to have all the template orbits beforehand. [sent-51, score-0.521]

24 Even with a small, sampled, set of template orbits, not including the actual orbit of x, we can still compute an invariant signature. [sent-52, score-0.501]

25 That is, the inner product of the transformed image with a template is the same as the inner product of the image with a transformed template. [sent-54, score-0.366]

26 In fact, the test image need not resemble any of the templates (see [11, 12, 13, 1]). [sent-56, score-0.216]

27 Given a finite set {gt τk | t = 1, . . . , T } of images sampled from the orbit of the template τk , the distribution of ⟨x, gt τk⟩ is invariant and unique to each object. [sent-61, score-0.805]
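
A minimal sketch, assuming unit-normalized feature vectors, of how such a distribution-based signature could be computed; the array shapes, bin count, and function names are illustrative, not the authors' implementation.

```python
import numpy as np

def cdf_signature(x, template_orbits, n_bins=20):
    """Empirical-CDF signature of an image feature vector x.

    x: (d,) unit-normalized feature vector.
    template_orbits: list of (T_k, d) arrays; the k-th array holds T_k
    unit-normalized transformed views gt*tau_k of template k.
    Returns the concatenated per-template CDFs, shape (K * n_bins,).
    """
    bins = np.linspace(-1.0, 1.0, n_bins + 1)  # dot products of unit vectors lie in [-1, 1]
    parts = []
    for orbit in template_orbits:
        dots = orbit @ x                       # <x, gt tau_k> for t = 1..T_k
        hist, _ = np.histogram(dots, bins=bins)
        parts.append(np.cumsum(hist) / len(dots))  # empirical distribution function
    return np.concatenate(parts)
```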

28 Since each face has its own characteristic empirical distribution function, it also shows that these signatures could be used to discriminate between them. [sent-70, score-0.369]

29 Table 1 reports the average Kolmogorov-Smirnov (KS) statistics comparing signatures for images of the same face, and for different faces: Mean(KS_same) ≈ 0 =⇒ invariance and Mean(KS_different) > 0 =⇒ discriminability. [sent-71, score-0.497]
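
An illustrative way to compute such KS statistics with SciPy's two-sample test; the function and variable names are assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(x1, x2, orbit):
    """Two-sample KS statistic between the inner-product distributions of two
    images against the same template orbit: close to 0 for two views of the
    same face (invariance), clearly > 0 for different faces (discriminability)."""
    return ks_2samp(orbit @ x1, orbit @ x2).statistic
```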

30 Figure 1: Example signatures (empirical distribution functions—CDFs) of images depicting two different faces under affine transformations: (A) in-plane rotation, (B) translation. [sent-72, score-0.682]

31 Signatures for the upper and lower face are shown in red and purple respectively. [sent-74, score-0.226]

32 the inner product (more generally, we consider the template response function ∆gτk(·) := f(⟨·, gt τk⟩), for a possibly non-linear function f—see [1]) and 3. pooling. [sent-82, score-0.312]

33 The “simple cells” compute normalized dot products or Gaussian radial basis functions of their inputs with stored templates, and the “complex cells” compute, for example, µk(x) = max(·). [sent-94, score-0.29]
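
A rough sketch of the simple-cell/complex-cell computation described here; the parameter names and the RBF bandwidth are illustrative.

```python
import numpy as np

def simple_cells(x, templates, kind="dot", sigma=1.0):
    """'Simple cell' responses: normalized dot products or Gaussian RBFs of the
    input x (shape (d,)) with the stored templates (shape (T, d))."""
    if kind == "dot":
        xn = x / (np.linalg.norm(x) + 1e-12)
        tn = templates / (np.linalg.norm(templates, axis=1, keepdims=True) + 1e-12)
        return tn @ xn
    return np.exp(-np.sum((templates - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def complex_cell(responses):
    """'Complex cell' pooling over one template's orbit, e.g. mu_k(x) = max(.)."""
    return np.max(responses)
```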

34 The templates are normally obtained by translation or scaling of a set of fixed patterns, often Gabor functions at the first layer and patches of natural images in subsequent layers. [sent-95, score-0.435]

35 Invariance to non-affine transformations (section 3). The theory of [1] only guarantees that this approach will achieve invariance (and discriminability) in the case of affine transformations. [sent-96, score-0.359]

36 However, many researchers have shown good performance of related architectures on object recognition tasks that seem to require invariance to non-affine transformations. [sent-97, score-0.624]

37 One possibility is that achieving invariance to affine transformations accounts for this performance. (Footnote: the computation can be made hierarchical by using the signature as the input to a subsequent layer.) [sent-100, score-0.481]

38 While not dismissing that possibility, we emphasize here that approximate invariance to many non-affine transformations can be achieved as long as the system’s operation is restricted to certain nice object classes [20, 21, 22]. [sent-102, score-0.448]

39 For example, the 2D transformation mapping a profile view of one person’s face to its frontal view is similar to the analogous transformation of another person’s face in this sense. [sent-104, score-0.62]

40 Figure 2: Example signatures (empirical distribution functions) of images depicting two different faces under non-affine transformations: (A) rotation in depth, (B) illumination. [sent-107, score-0.682]

41 Figure 2 shows that unlike in the affine case, the signature of a test face with respect to template faces at different orientations (3D rotation in depth) or illumination conditions is not perfectly invariant (KS_same > 0), though it still tolerates substantial transformations. [sent-109, score-1.075]

42 These signatures are also useful for discriminating faces since the empirical distribution functions are considerably more varied between faces than they are across images of the same face (Mean(KS_different) > Mean(KS_same), table 1). [sent-110, score-1.062]

43 Table 1: Average Kolmogorov-Smirnov statistics comparing the distributions of normalized inner products across transformations and across objects (faces). [sent-121, score-0.368]

44 It is interesting to consider the possibility that faces co-evolved along with natural visual systems in order to be highly recognizable. [sent-145, score-0.239]

45 In particular, we investigate the possibility of computing signatures that are invariant to all the task-irrelevant variability in the datasets used for serious computer vision benchmarks. [sent-148, score-0.306]

46 Given two images of new faces, never encountered during training, the task is to decide if they depict the same person or not. [sent-150, score-0.366]

47 We used the following procedure to test the templates-and-signatures approach on face verification problems using a variety of different datasets (see figure). [sent-151, score-0.226]

48 First, all images were preprocessed with low-level features (e.g., histograms of oriented gradients (HOG) [23]). [sent-153, score-0.248]

49 This was followed by PCA using all the images in the training set and z-score normalization. [sent-155, score-0.215]
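
A hedged sketch of this preprocessing chain using scikit-image HOG and scikit-learn PCA/standardization; the specific libraries, HOG parameters, number of principal components, and the assumption of grayscale inputs are illustrative choices, not the authors' exact settings.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def hog_features(img):
    # assumes a grayscale 2-D image; HOG parameters are illustrative
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def fit_preprocessor(train_images, n_components=200):
    """Fit PCA and z-scoring on the HOG features of all training images."""
    feats = np.array([hog_features(im) for im in train_images])
    pca = PCA(n_components=n_components).fit(feats)
    scaler = StandardScaler().fit(pca.transform(feats))
    return pca, scaler

def encode(img, pca, scaler):
    """Low-level features -> PCA projection -> z-score, as described above."""
    return scaler.transform(pca.transform(hog_features(img)[None, :]))[0]
```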

50 At test-time, the k-th element of the signature of an image x is obtained by first computing all the ⟨x, gt τk⟩, where gt τk is the t-th image of the k-th template person—both encoded by their projection onto the training set’s principal components—then pooling the results. [sent-156, score-0.653]
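
A compact sketch of this test-time signature computation, reusing the encoding idea above; pooling with the mean is only one option here (max, or the CDF summary sketched earlier, would also fit the description).

```python
import numpy as np

def pooled_signature(x_enc, template_orbits_enc, pool=np.mean):
    """k-th signature element: pool the inner products <x, gt tau_k> of the
    encoded test image with the encoded images of the k-th template person's
    stored orbit."""
    return np.array([pool(orbit @ x_enc) for orbit in template_orbits_enc])
```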

51 At test time, the classifier receives images of two faces and must classify them as either depicting the same person or not. [sent-158, score-0.627]

52 We used a simple classifier that merely computes the angle between the signatures of the two faces (via a normalized dot product) and responds “same” if this similarity is above a fixed threshold or “different” if it is below. [sent-159, score-0.471]
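
A minimal sketch of this verification rule; the threshold value is an illustrative assumption (in practice it would be swept to trace out the ROC curves reported later).

```python
import numpy as np

def verify_same(sig_a, sig_b, threshold=0.8):
    """'Same person' iff the normalized dot product (cosine of the angle between
    the two signatures) exceeds a fixed threshold; 0.8 is a placeholder value."""
    cos = sig_a @ sig_b / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b) + 1e-12)
    return cos >= threshold
```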

53 The images in the Labeled Faces in the Wild (LFW) dataset vary along so many different dimensions that it is difficult to give an exhaustive list. [sent-163, score-0.249]

54 It contains natural variability in, at least, pose, lighting, facial expression, and background [2] (see example images). [sent-164, score-0.257]

55 First, in unconstrained tasks like LFW, you cannot rely on having seen all the transformations of any template. [sent-167, score-0.334]

56 Recall, the theory of [1] relies on previous experience with all the transformations of template images in order to recognize test images invariantly to the same transformations. [sent-168, score-0.844]

57 Since LFW is totally unconstrained, any subset of it used for training will never contain all the transformations that will be encountered at test time. [sent-169, score-0.278]

58 Continuing to abuse the notation from section 2, we can say that the LFW database only samples a small subset of G, which is now the set of all transformations that occur in LFW. [sent-170, score-0.22]

59 That is, for any two images in LFW, x and x′, only a small (relative to |G|) subset of their orbits are in LFW. [sent-171, score-0.351]

60 It is common to consider clutter to be a separate problem from that of achieving transformation-invariance; indeed, [1] conjectures that the brain employs separate mechanisms, quite different from templates and pooling. [sent-175, score-0.288]

61 A network of neurons with Hebbian synapses (modeled by Oja’s rule)—changing its weights online as images are presented—converges to the network that projects new inputs onto the eigenvectors of its past input’s covariance [24]. [sent-179, score-0.215]
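
For concreteness, a small sketch of an Oja's-rule update and of how repeated application drives the weight vector toward the leading principal component; the learning rate and the synthetic data are illustrative.

```python
import numpy as np

def oja_update(w, x, lr=0.01):
    """One online Oja's-rule step: w <- w + lr * y * (x - y * w), with y = w.x.
    Over many presentations, w converges to the leading eigenvector of the
    inputs' covariance, i.e. the first principal component."""
    y = w @ x
    return w + lr * y * (x - y * w)

# illustrative run on synthetic data with a dominant first direction
rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 10)) * np.linspace(2.0, 0.5, 10)
w = rng.normal(size=10)
w /= np.linalg.norm(w)
for x in data:
    w = oja_update(w, x, lr=0.005)
```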

62 Afraz et al. showed that the perceived gender of a face is strongly biased toward male or female at different locations in the visual field, and that the spatial pattern of these biases was distinctive and stable over time for each individual [25]. [sent-187, score-0.254]

63 That is, “clutter-transformations” map images of an object on one background to images of the same object on different backgrounds. [sent-192, score-0.65]

64 Figure 3B shows the results of the test of robustness to non-uniform transformation-sampling for 3D rotation-in-depth-invariant face verification. [sent-194, score-0.226]

65 It shows that the method tolerates substantial differences between the transformations used to build the feature representation and the transformations on which the system is tested. [sent-195, score-0.501]

66 We tested two different models of natural non-uniform transformation sampling: in one case (blue curve) we sampled the orbits at a fixed rate when preparing templates; in the other case, we removed connected subsets of each orbit. [sent-196, score-0.248]
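
A sketch of the two subsampling schemes described here, applied to a stored template orbit; the parameter names and the removal fraction are illustrative assumptions.

```python
import numpy as np

def subsample_fixed_rate(orbit, keep_every=2):
    """Keep every keep_every-th frame of a template orbit (first condition)."""
    return orbit[::keep_every]

def remove_connected_subset(orbit, frac_removed=0.3, seed=None):
    """Drop a contiguous block of frames from the orbit (second condition)."""
    rng = np.random.default_rng(seed)
    n = len(orbit)
    k = int(frac_removed * n)
    start = int(rng.integers(0, n - k + 1))
    return np.concatenate([orbit[:start], orbit[start + k:]])
```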

67 In both cases the test used the entire orbit and never contained any of the same faces as the training phase. [sent-197, score-0.444]

68 Figure 3C shows that signatures produced by pooling over clutter conditions give good performance on a face-verification task with faces embedded on backgrounds. [sent-199, score-0.505]

69 Using templates with the appropriate background size for each test, we show that our models continue to perform well as we increase the size of the background while the performance of standard HOG features declines. [sent-200, score-0.276]

70 The abscissa is the percentage of frames discarded from each template’s transformation sequence; the ordinate is the accuracy on the face verification task. [sent-208, score-0.401]

71 The abscissa is the background size (10 scales), and the ordinate is the area under the ROC curve (AUC) for the face verification task. [sent-210, score-0.326]

72 Computer vision benchmarks: LFW, PubFig, and SUFR-W (section 5). An implication of the argument in sections 2 and 4 is that there needs to be a reasonable number of images sampled from each template’s orbit. [sent-211, score-0.276]

73 Although any number of samples is going to be small relative to |G|, we found that approximately 15 images gt τk per face are enough for all the face verification tasks we considered. [sent-214, score-0.8]

74 In order to ensure we would have enough images from each template orbit, we gathered a new dataset—SUFR-W—with ∼12,500 images, depicting 450 individuals. [sent-217, score-0.531]

75 The new dataset contains similar variability to LFW and PubFig but tends to have more images per individual than LFW (there are at least 15 images of each individual). [sent-218, score-0.464]

76 We obtained 3D models of faces from FaceGen (Singular Inversions Inc.). [sent-220, score-0.239]

77 (B) ROC curves for the new dataset using templates from the training set. [sent-254, score-0.222]

78 The third (control) model pools over random images in the dataset (as opposed to images depicting the same person). [sent-256, score-0.549]

79 These experiments used non-detected and non-aligned face images as inputs—thus the errors include detection and alignment errors. [sent-307, score-0.513]

80 About 1.5% of faces are not detected and 6-7% of the detected faces are significantly misaligned. [sent-308, score-0.478]

81 In all cases, templates were obtained from our new dataset (excluding 30 images for a testing set). [sent-309, score-0.408]

82 Figure 4B shows ROC curves for face verification with the new dataset. [sent-312, score-0.255]

83 The purple and green curves are control experiments that pool over images depicting different individuals and over random noise templates, respectively. [sent-314, score-0.488]

84 The alignment method we used ([30]) produced images that were somewhat more variable than the method used by the authors of the LFW dataset (LFW-a) —the performance of our simple classifier using raw HOG features on LFW is 73. [sent-320, score-0.327]

85 The strongest result in the literature for face verification with PubFig83 is 70. [sent-326, score-0.226]

86 We argued that when studying invariance, the appropriate mathematical objects to consider are the orbits of images under the action of a transformation and their associated probability distributions. [sent-335, score-0.481]

87 The probability distributions (and hence the orbits) can be characterized by one-dimensional projections—thus justifying the choice of the empirical distribution function of inner products with template images as a representation for recognition. [sent-336, score-0.48]

88 In this paper, we systematically investigated the properties of this representation for two affine and two non-affine transformations (tables 1 and 2). [sent-337, score-0.22]

89 In fact, the pipeline we used for most experiments actually has no operations at all besides normalized dot products and pooling (also PCA when preparing templates). [sent-343, score-0.268]

90 Despite the classifier’s simplicity, our model’s strong performance on face verification benchmark tasks is quite encouraging (see figure). [sent-346, score-0.27]

91 Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on faces in real-life images: Detection, alignment and recognition (ECCV), (Marseille, Fr), 2008. [sent-363, score-1.023]

92 The original PubFig dataset was only provided as a list of URLs from which the images could be downloaded. [sent-390, score-0.249]

93 The authors of that study also made their features available, so we estimated the performance of their features on the available subset of images (using SVM). [sent-394, score-0.281]

94 Földiák, “Learning invariance from transformation sequences,” Neural Computation, vol. [sent-416, score-0.223]

95 Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. [sent-477, score-0.33]

96 DiCarlo, “Comparing state-of-the-art visual features on invariant object recognition tasks,” in Applications of Computer Vision (WACV), 2011 IEEE Workshop on, 2011. [sent-491, score-0.326]

97 Poggio, “View-based models of 3D object recognition: invariance to imaging transformations,” Cerebral Cortex, vol. [sent-495, score-0.228]

98 Cavanagh, “Spatial heterogeneity in the perception of face and form attributes,” Current Biology, vol. [sent-524, score-0.226]

99 Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. [sent-536, score-0.234]

100 Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” in Analysis and Modeling of Faces and Gestures, pp. [sent-542, score-0.405]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lfw', 0.447), ('faces', 0.239), ('face', 0.226), ('transformations', 0.22), ('images', 0.215), ('orbit', 0.205), ('template', 0.194), ('pubfig', 0.186), ('poggio', 0.171), ('templates', 0.159), ('signatures', 0.143), ('invariance', 0.139), ('orbits', 0.136), ('leibo', 0.112), ('hog', 0.103), ('recognition', 0.102), ('invariant', 0.102), ('rotation', 0.099), ('convolutional', 0.098), ('auc', 0.097), ('veri', 0.095), ('discriminability', 0.091), ('signature', 0.091), ('illumination', 0.091), ('object', 0.089), ('gt', 0.089), ('person', 0.088), ('depicting', 0.085), ('transformation', 0.084), ('lbp', 0.082), ('ventral', 0.081), ('pooling', 0.076), ('wild', 0.076), ('kssame', 0.075), ('unconstrained', 0.07), ('af', 0.065), ('vision', 0.061), ('translation', 0.061), ('dot', 0.058), ('image', 0.057), ('ksdifferent', 0.056), ('lpq', 0.056), ('brain', 0.051), ('roc', 0.05), ('gx', 0.05), ('rotating', 0.05), ('mutch', 0.049), ('pinto', 0.049), ('ltp', 0.049), ('moments', 0.047), ('clutter', 0.047), ('objects', 0.046), ('stream', 0.046), ('alignment', 0.045), ('lighting', 0.044), ('tasks', 0.044), ('er', 0.044), ('background', 0.042), ('products', 0.042), ('crux', 0.041), ('unsupervised', 0.038), ('gathered', 0.037), ('afraz', 0.037), ('dali', 0.037), ('expansive', 0.037), ('identitypreserving', 0.037), ('depict', 0.035), ('dataset', 0.034), ('classi', 0.034), ('features', 0.033), ('frames', 0.033), ('pipeline', 0.033), ('texture', 0.033), ('triggs', 0.033), ('dicarlo', 0.033), ('minds', 0.033), ('tolerates', 0.033), ('patterns', 0.031), ('normalized', 0.031), ('achieving', 0.031), ('abscissa', 0.03), ('ullman', 0.03), ('brains', 0.03), ('architectures', 0.03), ('totally', 0.03), ('curves', 0.029), ('cells', 0.029), ('inner', 0.029), ('preparing', 0.028), ('ordinate', 0.028), ('rosasco', 0.028), ('invariances', 0.028), ('oja', 0.028), ('pattern', 0.028), ('encountered', 0.028), ('labeled', 0.028), ('system', 0.028), ('hmax', 0.027), ('detection', 0.027), ('representations', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well.

2 0.20042124 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

3 0.17497848 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

Author: Daniel L. Yamins, Ha Hong, Charles Cadieu, James J. DiCarlo

Abstract: Humans recognize visually-presented objects rapidly and accurately. To understand this ability, we seek to construct models of the ventral stream, the series of cortical areas thought to subserve object recognition. One tool to assess the quality of a model of the ventral stream is the Representational Dissimilarity Matrix (RDM), which uses a set of visual stimuli and measures the distances produced in either the brain (i.e. fMRI voxel responses, neural firing rates) or in models (features). Previous work has shown that all known models of the ventral stream fail to capture the RDM pattern observed in either IT cortex, the highest ventral area, or in the human ventral stream. In this work, we construct models of the ventral stream using a novel optimization procedure for category-level object recognition problems, and produce RDMs resembling both macaque IT and human ventral stream. The model, while novel in the optimization procedure, further develops a long-standing functional hypothesis that the ventral visual stream is a hierarchically arranged series of processing stages optimized for visual object recognition. 1

4 0.11618991 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

Author: Marius Pachitariu, Adam M. Packer, Noah Pettit, Henry Dalgleish, Michael Hausser, Maneesh Sahani

Abstract: Biological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the KSVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes. 1

5 0.11131484 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?

Author: Cheston Tan, Jedediah M. Singer, Thomas Serre, David Sheinberg, Tomaso Poggio

Abstract: The macaque Superior Temporal Sulcus (STS) is a brain area that receives and integrates inputs from both the ventral and dorsal visual processing streams (thought to specialize in form and motion processing respectively). For the processing of articulated actions, prior work has shown that even a small population of STS neurons contains sufficient information for the decoding of actor invariant to action, action invariant to actor, as well as the specific conjunction of actor and action. This paper addresses two questions. First, what are the invariance properties of individual neural representations (rather than the population representation) in STS? Second, what are the neural encoding mechanisms that can produce such individual neural representations from streams of pixel images? We find that a simple model, one that simply computes a linear weighted sum of ventral and dorsal responses to short action “snippets”, produces surprisingly good fits to the neural data. Interestingly, even using inputs from a single stream, both actor-invariance and action-invariance can be accounted for, by having different linear weights. 1

6 0.10554569 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

7 0.10171566 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

8 0.10121352 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

9 0.099291958 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

10 0.097295523 183 nips-2013-Mapping paradigm ontologies to and from the brain

11 0.094209477 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

12 0.092506908 84 nips-2013-Deep Neural Networks for Object Detection

13 0.083590008 251 nips-2013-Predicting Parameters in Deep Learning

14 0.079718225 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

15 0.078696251 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

16 0.078125723 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

17 0.077031679 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

18 0.074662231 226 nips-2013-One-shot learning by inverting a compositional causal process

19 0.072531588 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

20 0.071735404 211 nips-2013-Non-Linear Domain Adaptation with Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.076), (2, -0.114), (3, -0.102), (4, 0.069), (5, -0.119), (6, -0.078), (7, 0.012), (8, -0.079), (9, 0.018), (10, -0.128), (11, 0.021), (12, 0.035), (13, 0.02), (14, -0.044), (15, 0.026), (16, -0.046), (17, -0.195), (18, -0.064), (19, 0.024), (20, 0.031), (21, 0.053), (22, 0.003), (23, 0.039), (24, -0.075), (25, 0.041), (26, 0.075), (27, 0.024), (28, -0.018), (29, 0.009), (30, 0.022), (31, 0.003), (32, -0.016), (33, -0.031), (34, -0.021), (35, -0.035), (36, 0.008), (37, 0.079), (38, 0.093), (39, -0.023), (40, 0.049), (41, -0.01), (42, 0.02), (43, 0.051), (44, -0.038), (45, 0.035), (46, 0.099), (47, -0.04), (48, -0.032), (49, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94941998 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

2 0.79942298 119 nips-2013-Fast Template Evaluation with Vector Quantization

Author: Mohammad Amin Sadeghi, David Forsyth

Abstract: Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1

3 0.76801783 84 nips-2013-Deep Neural Networks for Object Detection

Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan

Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1

4 0.7596705 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

Author: Chen-Ping Yu, Wen-Yu Hua, Dimitris Samaras, Greg Zelinsky

Abstract: Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling mixture of Weibulls on Earth Mover’s Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman’s ρ = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features. 1

5 0.74703914 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

Author: Daniel L. Yamins, Ha Hong, Charles Cadieu, James J. DiCarlo

Abstract: Humans recognize visually-presented objects rapidly and accurately. To understand this ability, we seek to construct models of the ventral stream, the series of cortical areas thought to subserve object recognition. One tool to assess the quality of a model of the ventral stream is the Representational Dissimilarity Matrix (RDM), which uses a set of visual stimuli and measures the distances produced in either the brain (i.e. fMRI voxel responses, neural firing rates) or in models (features). Previous work has shown that all known models of the ventral stream fail to capture the RDM pattern observed in either IT cortex, the highest ventral area, or in the human ventral stream. In this work, we construct models of the ventral stream using a novel optimization procedure for category-level object recognition problems, and produce RDMs resembling both macaque IT and human ventral stream. The model, while novel in the optimization procedure, further develops a long-standing functional hypothesis that the ventral visual stream is a hierarchically arranged series of processing stages optimized for visual object recognition. 1

6 0.7365787 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

7 0.68195075 226 nips-2013-One-shot learning by inverting a compositional causal process

8 0.67723578 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

9 0.67104268 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

10 0.64827055 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

11 0.64444369 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

12 0.63263923 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

13 0.61301887 183 nips-2013-Mapping paradigm ontologies to and from the brain

14 0.59802079 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

15 0.59727383 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

16 0.57748634 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths

17 0.57086074 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

18 0.55955136 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

19 0.54574358 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

20 0.54226542 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.017), (16, 0.047), (19, 0.021), (25, 0.024), (33, 0.16), (34, 0.135), (41, 0.025), (49, 0.032), (56, 0.072), (70, 0.051), (75, 0.231), (85, 0.032), (89, 0.026), (93, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85487342 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths

Author: Stefan Mathe, Cristian Sminchisescu

Abstract: Human eye movements provide a rich source of information into the human visual information processing. The complex interplay between the task and the visual stimulus is believed to determine human eye movements, yet it is not fully understood, making it difficult to develop reliable eye movement prediction systems. Our work makes three contributions towards addressing this problem. First, we complement one of the largest and most challenging static computer vision datasets, VOC 2012 Actions, with human eye movement recordings collected under the primary task constraint of action recognition, as well as, separately, for context recognition, in order to analyze the impact of different tasks. Our dataset is unique among the eyetracking datasets of still images in terms of large scale (over 1 million fixations recorded in 9157 images) and different task controls. Second, we propose Markov models to automatically discover areas of interest (AOI) and introduce novel sequential consistency metrics based on them. Our methods can automatically determine the number, the spatial support and the transitions between AOIs, in addition to their locations. Based on such encodings, we quantitatively show that given unconstrained read-world stimuli, task instructions have significant influence on the human visual search patterns and are stable across subjects. Finally, we leverage powerful machine learning techniques and computer vision features in order to learn task-sensitive reward functions from eye movement data within models that allow to effectively predict the human visual search patterns based on inverse optimal control. The methodology achieves state of the art scanpath modeling results. 1

2 0.82979786 347 nips-2013-Variational Planning for Graph-based MDPs

Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler

Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1

same-paper 3 0.81486493 166 nips-2013-Learning invariant representations and applications to face verification

Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio

Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1

4 0.80269587 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

Author: Benigno Uria, Iain Murray, Hugo Larochelle

Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of onedimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradientbased optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case. 1

5 0.8007974 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration

Author: Tuo Zhao, Han Liu

Abstract: We propose a semiparametric method for estimating sparse precision matrix of high dimensional elliptical distribution. The proposed method calibrates regularizations when estimating each column of the precision matrix. Thus it not only is asymptotically tuning free, but also achieves an improved finite sample performance. Theoretically, we prove that the proposed method achieves the parametric rates of convergence in both parameter estimation and model selection. We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator. 1

6 0.73311311 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

7 0.71407175 201 nips-2013-Multi-Task Bayesian Optimization

8 0.71110487 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

9 0.70704007 173 nips-2013-Least Informative Dimensions

10 0.70583284 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

11 0.70436567 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

12 0.70128894 183 nips-2013-Mapping paradigm ontologies to and from the brain

13 0.70057571 5 nips-2013-A Deep Architecture for Matching Short Texts

14 0.70026273 64 nips-2013-Compete to Compute

15 0.70023799 251 nips-2013-Predicting Parameters in Deep Learning

16 0.69884682 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

17 0.69772238 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields

18 0.69770545 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

19 0.69762528 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

20 0.69738936 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions