nips nips2011 nips2011-304 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joel Z. Leibo, Jim Mutch, Tomaso Poggio
Abstract: Many studies have uncovered evidence that visual cortex contains specialized regions involved in processing faces but not other object classes. Recent electrophysiology studies of cells in several of these specialized regions revealed that at least some of these regions are organized in a hierarchical manner with viewpoint-specific cells projecting to downstream viewpoint-invariant identity-specific cells [1]. A separate computational line of reasoning leads to the claim that some transformations of visual inputs that preserve viewed object identity are class-specific. In particular, the 2D images evoked by a face undergoing a 3D rotation are not produced by the same image transformation (2D) that would produce the images evoked by an object of another class undergoing the same 3D rotation. However, within the class of faces, knowledge of the image transformation evoked by 3D rotation can be reliably transferred from previously viewed faces to help identify a novel face at a new viewpoint. We show, through computational simulations, that an architecture which applies this method of gaining invariance to class-specific transformations is effective when restricted to faces and fails spectacularly when applied to other object classes. We argue here that in order to accomplish viewpoint-invariant face identification from a single example view, visual cortex must separate the circuitry involved in discounting 3D rotations of faces from the generic circuitry involved in processing other objects. The resulting model of the ventral stream of visual cortex is consistent with the recent physiology results showing the hierarchical organization of the face processing network.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Many studies have uncovered evidence that visual cortex contains specialized regions involved in processing faces but not other object classes. [sent-5, score-0.896]
2 Recent electrophysiology studies of cells in several of these specialized regions revealed that at least some of these regions are organized in a hierarchical manner with viewpoint-specific cells projecting to downstream viewpoint-invariant identity-specific cells [1]. [sent-6, score-0.753]
3 A separate computational line of reasoning leads to the claim that some transformations of visual inputs that preserve viewed object identity are class-specific. [sent-7, score-0.652]
4 In particular, the 2D images evoked by a face undergoing a 3D rotation are not produced by the same image transformation (2D) that would produce the images evoked by an object of another class undergoing the same 3D rotation. [sent-8, score-1.129]
5 However, within the class of faces, knowledge of the image transformation evoked by 3D rotation can be reliably transferred from previously viewed faces to help identify a novel face at a new viewpoint. [sent-9, score-0.765]
6 We show, through computational simulations, that an architecture which applies this method of gaining invariance to class-specific transformations is effective when restricted to faces and fails spectacularly when applied to other object classes. [sent-10, score-0.957]
7 We argue here that in order to accomplish viewpoint-invariant face identification from a single example view, visual cortex must separate the circuitry involved in discounting 3D rotations of faces from the generic circuitry involved in processing other objects. [sent-11, score-1.244]
8 The resulting model of the ventral stream of visual cortex is consistent with the recent physiology results showing the hierarchical organization of the face processing network. [sent-12, score-0.993]
9 1 Introduction. There is increasing evidence that visual cortex contains discrete patches involved in processing faces but not other objects [2, 3, 4, 5, 6, 7]. [sent-13, score-0.874]
10 Though progress has been made recently in characterizing the properties of these brain areas, the computational-level reason the brain adopts this modular architecture has remained unknown. [sent-14, score-0.273]
11 In this paper, we propose a new computational-level explanation for why visual cortex separates face processing from object processing. [sent-15, score-0.814]
12 Our argument does not require us to claim that faces are automatically processed in ways that are inapplicable to objects (e.g., gaze detection, gender detection) or that cortical specialization for faces arises due to perceptual expertise [8], though the perspective that emerges from our model is consistent with both of these claims. [sent-16, score-0.377] [sent-18, score-0.249]
14 We show that the task of identifying individual faces in an optimally viewpoint-invariant way from single training examples requires separate neural circuitry specialized for faces. [sent-19, score-0.818]
15 Invariance to generic transformations, e.g., translation, scaling, and 2D in-plane rotation, can be learned from any object class and usefully applied to any other class [9]. [sent-22, score-0.34] [sent-23, score-0.34]
(Figure 1: Layout of face-selective regions in macaque visual cortex, adapted from [1] with permission.)
17 Other transformations are class-specific; these include changes in viewpoint and illumination. [sent-24, score-0.551]
18 In this paper, we describe a method by which invariance to class-specific transformations can be encoded and used for within-class identification. [sent-27, score-0.502]
19 The resulting model of visual cortex must separate the representations of different classes in order to achieve good performance. [sent-28, score-0.428]
20 Section 2 of this paper describes the recently discovered hierarchical organization of the macaque face processing network [1]. [sent-30, score-0.42]
21 Sections 3 and 4 describe an extension to an existing hierarchical model of object recognition to include invariances for class-specific transformations. [sent-31, score-0.391]
22 The final section explains why the brain should have separate modules and relates the proposed computational model to physiology and neuroimaging evidence that the brain does indeed separate face recognition from object recognition. [sent-32, score-0.843]
23 Consistent with a hierarchical organization involving information passing from ML/MF to AM via AL, electrical stimulation of ML elicited a response in AL and stimulation in AL elicited a response in AM [10]. [sent-36, score-0.348]
24 The firing rates of cells in ML/MF are most strongly modulated by face viewpoint. [sent-37, score-0.424]
25 Further along the hierarchy, in patch AM, cells are highly selective for individual faces but tolerate substantial changes in viewpoint [1]. [sent-38, score-0.912]
26 In this paper, we argue that such a system – with view-tuned cells upstream from view-invariant identity-selective cells – is ideally suited to support face identification. [sent-40, score-0.628]
27 In the subsequent sections, we present a model of the ventral stream that is consistent with a large body of experimental results (see footnote 1) and additionally predicts the existence of discrete face-selective patches organized in this manner. [sent-41, score-0.322]
28 3 Hubel-Wiesel inspired hierarchical models of object recognition. At the end of the ventral visual pathway, cells in the most anterior parts of visual cortex respond selectively to highly complex stimuli and also invariantly over several degrees of visual angle. [sent-43, score-1.651]
29 Hierarchical models inspired by Hubel and Wiesel’s work (H-W models) seek to achieve similar selectivity and invariance properties by subjecting visual inputs to successive tuning and pooling operations [12, 13, 14, 15]. [sent-44, score-0.46]
30 A major algorithmic claim made by these H-W models is that repeated application of this AND-like tuning operation is the source of the selective responses of cells at the end of the ventral stream. [sent-45, score-0.461]
31 Hubel and Wiesel described complex cells as pooling the outputs of simple cells with the same optimal stimuli but receptive fields in different locations [16]. [sent-47, score-0.559]
32 Similar pooling operations can also be employed to gain tolerance to other image transformations, including those induced by changes in viewpoint or illumination. [sent-50, score-0.387]
33 A complex-like cell can gain tolerance to one of these transformations, e.g., viewpoint, simply by connecting to (simple-like) cells that are selective for the appearance of the same feature at different viewpoints. [sent-54, score-0.249]
34 Complex (C) cells pool over S cells by computing the max response of all the S cells with which they are connected. [sent-57, score-0.653]
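As a concrete illustration of this tuning/pooling pair, here is a minimal numpy sketch of one S/C stage. The Gaussian tuning function, its width sigma, and the dense-patch input format are illustrative assumptions rather than the paper's exact choices; the simulations in the paper build on the HMAX model [14].

```python
import numpy as np

def s_layer(image_patches, templates, sigma=1.0):
    """Simple-cell ('S') responses: Gaussian tuning of every image patch
    to every stored template (the AND-like operation).
    image_patches: (n_positions, d); templates: (n_templates, d)."""
    diffs = image_patches[:, None, :] - templates[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)        # (n_positions, n_templates)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def c_layer(s_responses):
    """Complex-cell ('C') responses: max-pool each template's S responses
    over all positions (the OR-like operation), discarding position."""
    return s_responses.max(axis=0)               # (n_templates,)
```

Because the C stage keeps only each template's maximum, its output does not change when a preferred stimulus appears at a different position within the pooled range.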
35 There is also psychophysical and physiological evidence that visual cortex employs a temporal association strategy (see footnote 2) [23, 24, 25, 26, 27]. [sent-62, score-0.556]
36 4 Invariance to class-specific transformations. H-W models can gain invariance to some transformations in a generic way. [sent-63, score-0.743]
37 If a transformation is generic, e.g., translation, scaling, and in-plane rotation, then the model’s response to any image undergoing the transformation will remain constant no matter what templates were associated with one another to build the model. [sent-66, score-0.308]
38 For example, a face can be encoded invariantly to translation as a vector of similarities to previously viewed template images of any other objects. [sent-67, score-0.563]
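A hedged sketch of that encoding, using a normalized dot product as a stand-in for the model's actual tuning function: the image becomes a vector of its best match to each template, max-pooled over position, so a translated copy yields (up to border effects) the same signature no matter which object class supplied the templates.

```python
import numpy as np

def translation_invariant_signature(image, templates):
    """Encode `image` as a vector of its best normalized-dot-product match
    to each template patch, max-pooled over all spatial positions.
    image: (H, W); templates: (n_templates, h, w) with h <= H, w <= W."""
    n_t, h, w = templates.shape
    t_flat = templates.reshape(n_t, -1)
    t_flat = t_flat / (np.linalg.norm(t_flat, axis=1, keepdims=True) + 1e-9)
    best = np.full(n_t, -np.inf)
    for i in range(image.shape[0] - h + 1):
        for j in range(image.shape[1] - w + 1):
            p = image[i:i + h, j:j + w].ravel()
            p = p / (np.linalg.norm(p) + 1e-9)
            best = np.maximum(best, t_flat @ p)  # max-pool over position
    return best
```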
39 Other transformations are class-specific, that is, they depend on information about the depicted object that is not available in a single image. [sent-70, score-0.442]
40 For example, the 2D image evoked by an object undergoing a change in viewpoint depends on its 3D structure. [sent-71, score-0.633]
41 Likewise, the images evoked by changes in illumination depend on the object’s material properties. [sent-72, score-0.326]
42 These class-specific properties can be learned from one or more exemplars of the class and applied to other objects in the class (see also [28, 29]). [sent-73, score-0.299]
43 For this to work, the object class needs to consist of objects with similar 3D shape and material properties. [sent-74, score-0.401]
44 (Footnote 2) These temporal association algorithms and the evidence for their employment by visual cortex are interesting in their own right. [sent-77, score-0.494]
45 In this paper we sidestep the issue of how visual cortex associates similar features under different transformations in order to focus on the implications of having the representation that results from applying these learning rules. [sent-78, score-0.624]
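Purely as an illustration of what such a learning rule could look like (the toy (timestamp, feature) input format and the gap threshold below are assumptions, not the paper's method), features observed close together in time can be grouped into the template set of a single C-like cell, in the spirit of the trace-rule literature cited above [23, 24, 25, 26, 27]:

```python
def associate_by_time(observations, gap=1):
    """observations: non-empty list of (t, feature) pairs sorted by time t.
    Features whose timestamps are contiguous (within `gap`) are grouped;
    each group becomes the template set pooled by one C-like cell."""
    groups, current = [], [observations[0][1]]
    for (t0, _), (t1, f) in zip(observations, observations[1:]):
        if t1 - t0 <= gap:
            current.append(f)       # same transformation sequence
        else:
            groups.append(current)  # gap in time: start a new group
            current = [f]
    groups.append(current)
    return groups
```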
46 Figure 2: Illustration of an extension to the HMAX model to incorporate class-specific invariance to face viewpoint changes. (The figure shows layers, bottom to top: Input, S1, C1, S2, C2, S3, C3, built from alternating tuning and pooling operations.) [sent-79, score-0.689]
47 r_w(x) is invariant to viewpoint changes of the input face x, as long as the 3D structure of the face depicted in the template images w_t matches the 3D structure of the face depicted in x. [sent-83, score-1.269]
48 Since all human faces have a relatively similar 3D structure, r_w(x) will tolerate substantial viewpoint changes within the domain of faces. [sent-84, score-0.709]
49 It follows that templates derived from a class of objects with the wrong 3D structure give rise to C cells that do not respond invariantly to 3D rotations. [sent-85, score-0.607]
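A sketch of the S3/C3 computation of r_w(x), under the assumptions that the input x and the stored templates are C2-like feature vectors and that S3 tuning is a normalized dot product (the model's exact tuning function may differ): each C3 unit max-pools one template face's S3 responses over its stored viewpoints.

```python
import numpy as np

def c3_signature(x, template_views):
    """x: (d,) feature vector of the input image (e.g., its C2 encoding).
    template_views: (n_templates, n_views, d) feature vectors of each
    template face rendered at several viewpoints.
    Returns r_w(x): one viewpoint-pooled value per template face."""
    xn = x / (np.linalg.norm(x) + 1e-9)
    tv = template_views / (
        np.linalg.norm(template_views, axis=-1, keepdims=True) + 1e-9)
    s3 = tv @ xn            # S3: similarity to every stored (face, view) pair
    return s3.max(axis=-1)  # C3: max-pool each face's responses over views
```

If the template faces are replaced by objects with a different 3D shape, the per-view similarities no longer rise and fall together as x rotates, and the max-pooled signature loses its viewpoint tolerance; this is the failure shown for mismatched template classes in figures 3 and 4.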
50 In the test task, a single view of a target object is encoded and a simple classifier (nearest neighbors) must rank test images depicting the same object as being more similar to the encoded target than to images of any other objects. [sent-88, score-0.666]
51 This task models the common situation of encountering a new face or object at one viewpoint and then being asked to recognize it again later from a different viewpoint. [sent-90, score-0.677]
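The figure captions below describe scoring this one-shot task with a correlation-based nearest-neighbor ranking summarized as an AUC. The snippet below is a hedged reconstruction of that protocol; the inputs are assumed to be model feature vectors (C2 or C3 signatures), and ties are ignored for simplicity.

```python
import numpy as np

def one_shot_auc(target_view, target_other_views, distractor_views):
    """AUC for ranking images of the target (seen once, as `target_view`)
    above distractors, using Pearson correlation as the similarity."""
    def corr(a, b):
        a, b = a - a.mean(), b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    pos = [corr(target_view, v) for v in target_other_views]
    neg = [corr(target_view, v) for v in distractor_views]
    # Fraction of (positive, negative) pairs ranked correctly.
    return float(np.mean([[p > n for n in neg] for p in pos]))
```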
52 The original HMAX model [14], represented here by the red curves (C2), shows a rapid decline in performance due to changes in viewpoint and illumination. [sent-91, score-0.315]
53 Additionally, the performance of the C3 features is not strongly affected by viewpoint and illumination changes (see the plots along the diagonal in figures 3I and 4I). [sent-93, score-0.395]
54 Class A consists of faces produced using FaceGen (Singular Inversions). [sent-96, score-0.249]
55 Each object in the second class (class B) has a central spike protruding from a sphere and two bumps always in the same location on top of the sphere. [sent-98, score-0.273]
56 Each object in the third class (class C) has a central pyramid on a flat plane and two walls on either side. [sent-101, score-0.273]
57 The S3/C3 templates were obtained from objects in class A in the top row, class B in the middle row and class C in the bottom row. [sent-105, score-0.507]
58 The abscissa of each plot shows the maximum invariance range (maximum deviation from the frontal view in either direction) over which targets and distractors were presented. [sent-106, score-0.31]
59 The ordinate shows the AUC obtained for the task of recognizing an individual novel object despite changes in viewpoint. [sent-107, score-0.31]
60 A simple correlation-based nearest-neighbor classifier must rank all images of the same object at different viewpoints as being more similar to the frontal view than other objects. [sent-109, score-0.325]
61 Simulation details: These simulations used 2000 translation and scaling invariant C2 units tuned to patches of natural images. [sent-111, score-0.326]
62 Each class consists of faces with different light reflectance properties, modeling different materials. [sent-118, score-0.316]
63 S3/C3 templates were obtained from objects in class A (top row), B (middle row), and C (bottom row). [sent-125, score-0.335]
64 As in figure 3, the abscissa of each plot shows the maximum invariance range (maximum distance the light could move in either direction away from a neutral position where the lamp is even with the middle of the head) over which targets and distractors were presented. [sent-127, score-0.348]
65 The ordinate shows the AUC obtained for the task of recognizing an individual novel object despite changes in illumination. [sent-128, score-0.31]
66 A correlation-based nearest-neighbor “classifier” must rank all images of the same object under each illumination condition as being more similar to the neutral view than other objects. [sent-129, score-0.365]
67 Simulation details: These simulations used 80 translation and scaling invariant C2 units tuned to patches of natural images. [sent-131, score-0.326]
68 The above simulations used 1200 S3 units (80 exemplar faces and 15 illumination conditions) and 80 C3 units. [sent-134, score-0.367]
69 5 Conclusion. Everyday visual tasks require reasonably good invariance to non-generic transformations like changes in viewpoint and illumination (see footnote 3). [sent-140, score-0.939]
70 We showed that a broad class of ventral stream models that is well-supported by physiology data (H-W models) requires class-specific modules in order to accomplish these tasks. [sent-141, score-0.388]
71 The responses of cells in an early part of the hierarchy (patches ML and MF) are strongly dependent on viewpoint, while the cells in a downstream area (patch AM) tolerate large changes in viewpoint. [sent-143, score-0.695]
72 Identifying the S3 layer of our extended HMAX model with the ML/MF cells and the C3 layer with the AM cells is an intriguing possibility. [sent-144, score-0.408]
73 Another mapping from the model to the physiology could be to identify the outputs of simple classifiers operating on C2, S3 or C3 layers with the responses of cells in ML/MF and AM. [sent-145, score-0.327]
74 Fundamentally, the 3D rotation of an object class with one 3D structure, e.g., faces, is not the same as the 3D rotation of another class of objects with a different 3D structure. [sent-146, score-0.345] [sent-148, score-0.267]
76 Generic circuitry cannot take into account both transformations at once. [sent-149, score-0.364]
77 Since the brain must take these transformations into account in interpreting the visual world, it follows that visual cortex must have a modular architecture. [sent-151, score-0.929]
78 Object classes that are important enough to require invariance to these transformations of novel exemplars must be encoded by dedicated circuitry. [sent-152, score-0.594]
79 We do not think it is coincidental that, just as for faces, brain areas which are thought to be specialized for visual processing of the human body (the extrastriate body area [32]) and reading (the visual word form area [33, 34]) are consistently found in human fMRI experiments. [sent-155, score-0.737]
80 We have argued in favor of visual cortex implementing a modularity of content rather than process. [sent-156, score-0.388]
81 The only difference across areas is the object class (and the transformations) being encoded. [sent-159, score-0.273]
82 In this view, visual cortex must be modular in order to succeed in the tasks with which it is faced. [sent-160, score-0.433]
83 (Footnote 3) It is sometimes claimed that human vision is not viewpoint invariant [30]. [sent-162, score-0.394]
84 It is certainly true that performance on psychophysical tasks requiring viewpoint invariance is worse than on tasks requiring translation invariance. [sent-163, score-0.613]
85 Many psychophysical experiments on viewpoint invariance were performed with synthetic “paperclip” objects defined entirely by their 3D structure. [sent-167, score-0.659]
86 Chun, “The fusiform face area: a module in human extrastriate cortex specialized for face perception,” The Journal of Neuroscience, vol. [sent-182, score-0.893]
87 Kanwisher, “The fusiform face area subserves face perception, not generic within-category identification,” Nature Neuroscience, vol. [sent-189, score-0.608]
88 Tootell, “An anterior temporal face patch in human cortex, predicted by macaque maps,” Proceedings of the National Academy of Sciences, vol. [sent-213, score-0.592]
89 Gauthier, “FFA: a flexible fusiform area for subordinate-level visual processing automatized by expertise,” Nature Neuroscience, vol. [sent-227, score-0.285]
90 Tsao, “Patches with links: a unified system for processing faces in the macaque temporal lobe,” Science, vol. [sent-240, score-0.413]
91 Poggio, “A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex,” CBCL Paper #259/AI Memo #2005-036, 2005. [sent-250, score-0.589]
92 Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. [sent-258, score-0.293]
93 Földiák, “Learning invariance from transformation sequences,” Neural Computation, vol. [sent-293, score-0.272]
94 Rolls, “Invariant object recognition in the visual system with novel views of 3D objects,” Neural Computation, vol. [sent-299, score-0.463]
95 Spratling, “Learning viewpoint invariant perceptual representations from cluttered images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-315, score-0.348]
96 DiCarlo, “Unsupervised natural experience rapidly alters invariant object representation in visual cortex.” [sent-331, score-0.473]
97 Bülthoff, “Learning illumination- and orientation-invariant representations of objects through temporal association,” Journal of Vision, vol. [sent-356, score-0.291]
98 Poggio, “View-based models of 3D object recognition: invariance to imaging transformations,” Cerebral Cortex, vol. [sent-362, score-0.424]
99 Poggio, “View-dependent object recognition by monkeys,” Current Biology, vol. [sent-382, score-0.293]
100 Jiang, “A cortical area selective for visual processing of the human body,” Science, vol. [sent-388, score-0.308]
wordName wordTfidf (topN-words)
[('viewpoint', 0.251), ('faces', 0.249), ('transformations', 0.236), ('face', 0.22), ('invariance', 0.218), ('cortex', 0.218), ('object', 0.206), ('cells', 0.204), ('visual', 0.17), ('ventral', 0.159), ('hmax', 0.159), ('poggio', 0.159), ('templates', 0.14), ('circuitry', 0.128), ('objects', 0.128), ('anterior', 0.11), ('patches', 0.109), ('evoked', 0.103), ('macaque', 0.098), ('invariant', 0.097), ('freiwald', 0.091), ('lthoff', 0.091), ('tsao', 0.091), ('brain', 0.09), ('recognition', 0.087), ('translation', 0.082), ('illumination', 0.08), ('images', 0.079), ('auc', 0.075), ('undergoing', 0.073), ('rotation', 0.072), ('pooling', 0.072), ('physiology', 0.07), ('extrastriate', 0.068), ('fusiform', 0.068), ('invariantly', 0.068), ('kanwisher', 0.068), ('tootell', 0.068), ('class', 0.067), ('temporal', 0.066), ('template', 0.066), ('changes', 0.064), ('psychophysical', 0.062), ('hubel', 0.06), ('mutch', 0.06), ('serre', 0.06), ('wiesel', 0.055), ('dedicated', 0.055), ('stream', 0.054), ('transformation', 0.054), ('generic', 0.053), ('responses', 0.053), ('specialized', 0.053), ('patch', 0.052), ('dicarlo', 0.052), ('rw', 0.052), ('distractors', 0.052), ('neuroscience', 0.051), ('organization', 0.051), ('hierarchical', 0.051), ('encoded', 0.048), ('architecture', 0.048), ('area', 0.047), ('invariances', 0.047), ('tolerate', 0.047), ('human', 0.046), ('blender', 0.045), ('cbcl', 0.045), ('classspeci', 0.045), ('facegen', 0.045), ('leibo', 0.045), ('macaques', 0.045), ('wallis', 0.045), ('modular', 0.045), ('stimulation', 0.045), ('selective', 0.045), ('receptive', 0.041), ('response', 0.041), ('separate', 0.04), ('mcdermott', 0.04), ('ordinate', 0.04), ('logothetis', 0.04), ('opaque', 0.04), ('abscissa', 0.04), ('viewpoints', 0.04), ('association', 0.04), ('hierarchy', 0.039), ('identi', 0.039), ('academy', 0.038), ('stimuli', 0.038), ('simulations', 0.038), ('accomplish', 0.038), ('middle', 0.038), ('panel', 0.037), ('elicited', 0.037), ('ective', 0.037), ('riesenhuber', 0.037), ('downstream', 0.037), ('exemplars', 0.037), ('ullman', 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
Author: Joel Z. Leibo, Jim Mutch, Tomaso Poggio
Abstract: (same as above)
2 0.14514107 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.13453388 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity
Author: Maneesh Bhand, Ritvik Mudur, Bipin Suresh, Andrew Saxe, Andrew Y. Ng
Abstract: The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organism’s lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organism’s lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle [1], that a qualitatively similar learning algorithm acts throughout primary sensory cortices. 1
4 0.12713814 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
5 0.12534942 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
Author: Xinggang Wang, Xiang Bai, Xingwei Yang, Wenyu Liu, Longin J. Latecki
Abstract: We propose a novel inference framework for finding maximal cliques in a weighted graph that satisfy hard constraints. The constraints specify the graph nodes that must belong to the solution as well as mutual exclusions of graph nodes, i.e., sets of nodes that cannot belong to the same solution. The proposed inference is based on a novel particle filter algorithm with state permeations. We apply the inference framework to a challenging problem of learning part-based, deformable object models. Two core problems in the learning framework, matching of image patches and finding salient parts, are formulated as two instances of the problem of finding maximal cliques with hard constraints. Our learning framework yields discriminative part based object models that achieve very good detection rate, and outperform other methods on object classes with large deformation. 1
6 0.1189393 244 nips-2011-Selecting Receptive Fields in Deep Networks
7 0.10810569 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
8 0.10336888 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
9 0.10166218 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
10 0.099202961 219 nips-2011-Predicting response time and error rates in visual search
11 0.098682277 231 nips-2011-Randomized Algorithms for Comparison-based Search
12 0.094104648 127 nips-2011-Image Parsing with Stochastic Scene Grammar
13 0.090943597 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
14 0.088673376 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.087298289 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
16 0.084137626 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
17 0.083943307 259 nips-2011-Sparse Estimation with Structured Dictionaries
18 0.083559446 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
19 0.082266949 22 nips-2011-Active Ranking using Pairwise Comparisons
20 0.081300408 35 nips-2011-An ideal observer model for identifying the reference frame of objects
topicId topicWeight
[(0, 0.199), (1, 0.161), (2, -0.008), (3, 0.158), (4, 0.086), (5, 0.09), (6, 0.044), (7, 0.065), (8, 0.01), (9, -0.024), (10, 0.012), (11, -0.0), (12, 0.017), (13, 0.051), (14, 0.105), (15, -0.021), (16, 0.04), (17, 0.003), (18, 0.011), (19, 0.06), (20, 0.002), (21, 0.013), (22, -0.084), (23, 0.092), (24, -0.011), (25, -0.008), (26, 0.021), (27, 0.05), (28, 0.041), (29, -0.061), (30, 0.063), (31, 0.133), (32, -0.027), (33, 0.006), (34, 0.01), (35, 0.007), (36, -0.042), (37, 0.012), (38, -0.07), (39, 0.101), (40, -0.024), (41, 0.026), (42, -0.048), (43, -0.103), (44, -0.111), (45, -0.001), (46, 0.053), (47, -0.038), (48, 0.044), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.9688229 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
Author: Joel Z. Leibo, Jim Mutch, Tomaso Poggio
Abstract: (same as above)
2 0.69270945 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: (same as above)
3 0.65434128 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
4 0.64851063 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
5 0.64304549 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
Author: Andreas Geiger, Christian Wojek, Raquel Urtasun
Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1
6 0.64011461 154 nips-2011-Learning person-object interactions for action recognition in still images
7 0.61963755 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
8 0.59902191 293 nips-2011-Understanding the Intrinsic Memorability of Images
10 0.55852461 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
11 0.55528933 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
12 0.55343544 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity
13 0.53261757 193 nips-2011-Object Detection with Grammar Models
14 0.52264005 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
15 0.52136618 127 nips-2011-Image Parsing with Stochastic Scene Grammar
16 0.49756899 244 nips-2011-Selecting Receptive Fields in Deep Networks
17 0.49024999 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
18 0.49022502 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
20 0.45574284 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
topicId topicWeight
[(0, 0.027), (4, 0.037), (20, 0.071), (26, 0.011), (27, 0.276), (31, 0.079), (33, 0.025), (43, 0.042), (45, 0.115), (57, 0.039), (65, 0.052), (74, 0.055), (83, 0.053), (84, 0.026), (99, 0.026)]
simIndex simValue paperId paperTitle
1 0.80268466 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression
Author: Ali Tofigh, Erik Sjölund, Mattias Höglund, Jens Lagergren
Abstract: Cancer has complex patterns of progression that include converging as well as diverging progressional pathways. Vogelstein’s path model of colon cancer was a pioneering contribution to cancer research. Since then, several attempts have been made at obtaining mathematical models of cancer progression, devising learning algorithms, and applying these to cross-sectional data. Beerenwinkel et al. provided, what they coined, EM-like algorithms for Oncogenetic Trees (OTs) and mixtures of such. Given the small size of current and future data sets, it is important to minimize the number of parameters of a model. For this reason, we too focus on tree-based models and introduce Hidden-variable Oncogenetic Trees (HOTs). In contrast to OTs, HOTs allow for errors in the data and thereby provide more realistic modeling. We also design global structural EM algorithms for learning HOTs and mixtures of HOTs (HOT-mixtures). The algorithms are global in the sense that, during the M-step, they find a structure that yields a global maximum of the expected complete log-likelihood rather than merely one that improves it. The algorithm for single HOTs performs very well on reasonable-sized data sets, while that for HOT-mixtures requires data sets of sizes obtainable only with tomorrow’s more cost-efficient technologies. 1
same-paper 2 0.79976928 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
Author: Joel Z. Leibo, Jim Mutch, Tomaso Poggio
Abstract: (same as above)
3 0.77530771 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
Author: David Wingate, Noah Goodman, Andreas Stuhlmueller, Jeffrey M. Siskind
Abstract: Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals. 1
4 0.6779142 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database
Author: Joshua T. Abbott, Katherine A. Heller, Zoubin Ghahramani, Thomas L. Griffiths
Abstract: How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories. 1
5 0.54450458 244 nips-2011-Selecting Receptive Fields in Deep Networks
Author: Adam Coates, Andrew Y. Ng
Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1
6 0.54261607 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
7 0.535604 227 nips-2011-Pylon Model for Semantic Segmentation
8 0.53548515 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
9 0.53379816 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
10 0.53221691 180 nips-2011-Multiple Instance Filtering
11 0.53154457 156 nips-2011-Learning to Learn with Compound HD Models
12 0.53093702 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
13 0.52993983 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
14 0.52929181 276 nips-2011-Structured sparse coding via lateral inhibition
15 0.52742255 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.52731222 303 nips-2011-Video Annotation and Tracking with Active Learning
17 0.52671838 263 nips-2011-Sparse Manifold Clustering and Embedding
18 0.52668661 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
19 0.52637094 35 nips-2011-An ideal observer model for identifying the reference frame of objects
20 0.52481121 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling