nips nips2010 nips2010-209 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor
Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. [sent-4, score-0.274]
2 We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. [sent-5, score-0.203]
3 By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. [sent-7, score-1.02]
4 We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. [sent-8, score-0.181]
5 We also demonstrate that real-world performance can be improved through the use of synthetic data. [sent-10, score-0.185]
6 1 Introduction Determining the pose of a human body from one or more images is a central problem in Computer Vision. [sent-11, score-0.511]
7 The complex, multi-jointed nature of the body makes the determination of pose challenging, particularly in natural settings where ambiguous and unusual configurations may be observed. [sent-12, score-0.379]
8 The ability to localize the hands is particularly important: they provide tight constraints on the layout of the upper body, yielding a strong cue as to the action and intent of a person. [sent-13, score-0.204]
9 A huge range of techniques, both parametric and non-parametric, exist for inferring body pose from 2D images and 3D datasets [10, 39, 4, 28, 33, 8, 3, 6, 11]. [sent-14, score-0.458]
10 Figure 1: Query image (in left column) and the eight nearest neighbours found by our method. [sent-39, score-0.181]
11 Matches are based on the location of the hands, and more generally body pose - not the individual or the background. [sent-41, score-0.379]
12 We estimate body pose by localizing the hands using a parametric, nonlinear multi-layered embedding of the raw pixel images. [sent-42, score-0.849]
13 Unlike many other metric learning approaches, ours is designed for use with real-world images, having a convolutional architecture that scales gracefully to large images and is invariant to local geometric distortions. [sent-43, score-0.508]
14 Our embedding, trained on both real and synthetic data, is a functional mapping that projects images with similar head and hand positions to lie close-by in a low-dimensional output space. [sent-44, score-0.622]
15 Specifically for this task, we have designed an interface to obtain and verify head and hand labels for thousands of frames through Amazon Mechanical Turk with minimal user intervention. [sent-46, score-0.365]
16 It succeeds in generalizing to body and hand pose when such cues are not explicitly provided in the labels (see Fig. [sent-48, score-0.468]
17 2 Related work Our application domain is related to several approaches in the computer vision literature that propose hand or body pose tracking. [sent-50, score-0.411]
18 In our domain, hands might only occupy a few pixels, and the only body-part that can reliably be detected is the human face ([26, 13]). [sent-52, score-0.313]
19 Many techniques have been proposed that extract, learn, or reason over entire body features. [sent-53, score-0.149]
20 Some extract “shape-context” edge based histograms from the human body [25, 1] or just silhouette features [15]. [sent-58, score-0.202]
21 Our domain contains clutter, lighting variations and low resolution such that it is impossible to separate body features from background successfully. [sent-62, score-0.185]
22 We instead learn relevant features directly from pixels (instead of pre-coded edge or gradient histogram features), and implicitly discover background invariance from training data. [sent-63, score-0.242]
23 We show in this paper several experiments with challenging real video (with crowd-sourced Amazon Mechanical Turk labels), synthetic training data, and hybrid datasets. [sent-65, score-0.368]
24 NCA has also been recently extended to the nonlinear case [34] using MNIST class labels and to linear 1D regression for reinforcement learning [20]. [sent-73, score-0.206]
25 Like NCA, DrLIM uses class neighbourhood structure to drive the optimization: observations with the same class label are driven to be close-by in feature space. [sent-75, score-0.146]
26 3 Learning an invariant mapping by nonlinear embedding We first discuss Neighbourhood Components Analysis [14] and its nonlinear variants. [sent-77, score-0.41]
27 We then propose an alternative objective function optimized for performing nearest neighbour (NN) regression rather than classification. [sent-78, score-0.277]
28 Next, we describe our convolutional architecture which maps images from high-dimensional to low-dimensional space. [sent-79, score-0.386]
29 They only require that neighbourhood relationships be defined between training samples. [sent-83, score-0.16]
30 When labels are real-valued (e.g. pose information for images of people), one alternative is to define neighbourhoods based on the distance in the real-valued label space and proceed as usual. [sent-89, score-0.436]
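To make this alternative concrete, here is a toy NumPy sketch (not taken from the paper) that defines neighbourhood relationships by thresholding distances in the real-valued label space; the threshold value and the 2-D label format are illustrative assumptions.

```python
# Toy sketch: define neighbourhood relationships by thresholding distances in
# the real-valued label space. Threshold and label format are assumptions.
import numpy as np

def label_neighbours(labels, threshold=10.0):
    # labels: (n, d) array of real-valued labels, e.g. (x, y) hand positions
    d = np.linalg.norm(labels[:, None, :] - labels[None, :, :], axis=-1)
    neighbours = d < threshold
    np.fill_diagonal(neighbours, False)  # a sample is not its own neighbour
    return neighbours

poses = np.array([[100.0, 50.0], [104.0, 52.0], [200.0, 80.0]])
print(label_neighbours(poses))
```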
31 Instead of seeking to optimize KNN classification performance, we can use the NCA regression (NCAR) objective [20]: $L_{\mathrm{NCAR}} = \sum_{i=1}^{N} \sum_{j \neq i} p_{ij} \|y_i - y_j\|^2$. [sent-104, score-0.174]
32 The gradient can be computed efficiently as: $\frac{\partial L_{\mathrm{NCAR}}}{\partial z_i} = -2 \sum_{j \neq i} (z_i - z_j)\left[ p_{ij}\left(y_{ij}^2 - \delta_i\right) + p_{ji}\left(y_{ij}^2 - \delta_j\right) \right]$ (4). [sent-111, score-0.187]
33 where we use $y_{ij}^2 = \|y_i - y_j\|^2$ and $\delta_i = \sum_j p_{ij} y_{ij}^2$. [sent-112, score-0.237]
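The following NumPy sketch illustrates the NCAR objective above, with the soft assignments p_ij computed as a softmax over negative squared distances between codes. The array shapes, the unit softmax temperature, and the random data are assumptions for illustration, not details taken from the paper.

```python
# NumPy sketch of the NCAR objective: p_ij is a softmax over negative squared
# code distances, and the loss is the expected squared label distance under
# those soft assignments.
import numpy as np

def ncar_loss(z, y):
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)  # ||z_i - z_j||^2
    np.fill_diagonal(d2, np.inf)                                # exclude j == i
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)                           # soft assignments p_ij
    y2 = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # ||y_i - y_j||^2
    return float(np.sum(p * y2))

rng = np.random.default_rng(0)
codes = rng.normal(size=(8, 32))   # 8 examples embedded in 32-D
labels = rng.normal(size=(8, 6))   # 6-D labels, e.g. head and hand coordinates
print(ncar_loss(codes, labels))
```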
34 However, to avoid such hand-crafted features, which may not be suitable for the task, and to scale to realistically sized inputs, models should take advantage of the pictorial nature of the image input. [sent-121, score-0.157]
35 This is addressed by convolutional architectures [21], which exploit the fact that salient motifs can appear anywhere in the image. [sent-122, score-0.314]
36 By employing successive stages of weight-sharing and feature-pooling, deep convolutional architectures can achieve stable latent representations at each layer that preserve locality, provide invariance to small variations of the input, and drastically reduce the number of free parameters. [sent-123, score-0.367]
37 Our proposed method, which we call Convolutional NCA regression (C-NCAR), is based on a standard convolutional architecture [21, 18]: alternating convolution and subsampling layers followed by a single fully-connected layer (see Fig. [sent-124, score-0.422]
38 It differs from typical convolutional nets in the objective function with which it is trained (i. [sent-126, score-0.258]
39 [16] also use a siamese convolutional network with yet a different objective. [sent-134, score-0.369]
40 [24] have also recently used a convolutional siamese network in which temporal coherence between pairs of frames drives the regularization of the model rather than the objective. [sent-137, score-0.531]
41 Each image is processed by two convolutional and subsampling layers and one fully-connected layer. [sent-141, score-0.315]
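Below is a hypothetical PyTorch sketch of an embedding network in the spirit of C-NCAR: two convolution-plus-subsampling stages followed by a single fully-connected layer mapping an image to a low-dimensional code. The filter counts, kernel sizes, input resolution, code dimension and tanh nonlinearity are assumptions; the paper's exact architecture is not reproduced here.

```python
# Hypothetical sketch of a C-NCAR-style embedding network: two convolution +
# subsampling stages and one fully-connected layer producing a code.
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7), nn.Tanh(), nn.MaxPool2d(2),   # stage 1
            nn.Conv2d(16, 32, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),  # stage 2
        )
        self.fc = nn.LazyLinear(code_dim)  # single fully-connected output layer

    def forward(self, x):
        h = self.features(x)
        return self.fc(h.flatten(start_dim=1))

# Embed a batch of four grayscale 64x64 crops into 32-D codes.
codes = ConvEmbedding()(torch.randn(4, 1, 64, 64))
print(codes.shape)  # torch.Size([4, 32])
```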
42 3 Adding a contrastive loss function Like NCA, DrLIM assumes a discrete notion of similarity or dissimilarity between data pairs, xi and xj . [sent-145, score-0.156]
43 For example, the labels $y_i$ may be discrete, $y_i \in \{1, 2, \ldots\}$. [sent-149, score-0.177]
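As a reference point for the contrastive term discussed here, the following minimal NumPy sketch shows a DrLIM-style loss over a pair of codes: similar pairs are pulled together, dissimilar pairs are pushed apart up to a margin. The margin value and the binary similarity indicator are assumptions used only for illustration.

```python
# Minimal sketch of a DrLIM-style contrastive loss for a pair of codes.
import numpy as np

def contrastive_loss(z_i, z_j, similar, margin=1.0):
    d = np.linalg.norm(z_i - z_j)
    if similar:  # e.g. same discrete label / neighbouring pose
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

z_a, z_b = np.array([0.1, 0.2]), np.array([0.4, -0.1])
print(contrastive_loss(z_a, z_b, similar=True))
print(contrastive_loss(z_a, z_b, similar=False))
```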
44 4 Experimental results We evaluate our approach in real and synthetic environments by performing 1-nearest neighbour (NN) regression using a variety of standard and learned metrics described below. [sent-159, score-0.442]
45 For every query image in a test set, we compute its distance (under the metric) to each of the training points in a database. [sent-160, score-0.228]
46 We then transfer the label (the (x,y) position of the head and hands) of the nearest neighbour to the query example. [sent-163, score-0.365]
47 For evaluation, we compare the ground-truth label of the query to the label of the nearest neighbour. [sent-164, score-0.214]
48 Errors are reported in terms of mean pixel error over each query and each marker: the head (if it is tracked) and each hand. [sent-165, score-0.324]
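A minimal sketch of this 1-NN regression evaluation protocol: embed each query, find the nearest training code, transfer that neighbour's marker labels, and report mean pixel error against ground truth. Array names and shapes are illustrative assumptions.

```python
# Sketch of 1-NN regression evaluation with mean pixel error over markers.
import numpy as np

def one_nn_pixel_error(train_codes, train_labels, test_codes, test_labels):
    # labels: (num_examples, num_markers, 2) arrays of (x, y) pixel positions
    errors = []
    for code, truth in zip(test_codes, test_labels):
        d2 = np.sum((train_codes - code) ** 2, axis=1)
        predicted = train_labels[np.argmin(d2)]                   # transfer the neighbour's label
        errors.append(np.linalg.norm(predicted - truth, axis=1))  # per-marker pixel distance
    return float(np.mean(errors))

rng = np.random.default_rng(1)
print(one_nn_pixel_error(rng.normal(size=(100, 32)), rng.uniform(0, 240, (100, 3, 2)),
                         rng.normal(size=(10, 32)), rng.uniform(0, 240, (10, 3, 2))))
```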
49 We acknowledge that improved results could potentially be obtained by using more than one neighbour or with more sophisticated techniques such as locally weighted regression [36]. [sent-167, score-0.205]
50 The approaches compared are as follows. Pixel distance can be used to find nearest neighbours, though it is not practical in real situations due to the intractability of computing distances in such a high-dimensional space. [sent-169, score-0.227]
51 We are motivated to use GIST by its previous use in nonlinear NCA for image retrieval [38]. [sent-171, score-0.157]
52 In both the linear and nonlinear case, the architecture and training procedure remains the same as NCAR and C-NCAR, respectively. [sent-194, score-0.199]
53 1 Estimating 2D head and hand pose from synthetic data We extracted 10,000 frames of training data and 5,000 frames of test data from Poser renderings of several hours of real motion capture data. [sent-198, score-0.91]
54 Our synthetic data is similar to that considered in [36]; however, we use a variety of backgrounds rather than a constant background. [sent-199, score-0.254]
55 The inputs, x, are 320 × 240 images, and the labels, y, are 6D vectors - the true (x,y) locations of the head and hands. [sent-203, score-0.191]
56 This is perhaps an artifact of the synthetic data. [sent-207, score-0.185]
57 2 Estimating 2D hand pose from real video We digitally recorded all of the contributing and invited speakers at the Learning Workshop (Snowbird) held in April 2010. [sent-209, score-0.522]
58 After each session of talks, blocks of 150 frames were distributed as Human Intelligence Tasks. Table 1: 1-NN regression performance on the synthetic (SY) dataset and the real (RE) dataset. [sent-211, score-0.371]
59 Errors are the mean pixel distance between the nearest neighbour and the ground truth label of the query. [sent-213, score-0.378]
60 For RE we assume the location and scale of the head are given by a face detector and only locate the hands. [sent-215, score-0.247]
61 We were able to obtain accurate hand and head tracks for each of the speakers within a few hours of their talks. [sent-241, score-0.35]
62 For the following experiments, we divided the 30 speakers into a training set (odd-numbered speakers) and a test set (even-numbered speakers). [sent-242, score-0.257]
63 We do not consider cases in which the hands lie outside the frame or are occluded. [sent-247, score-0.286]
64 This yields 39,792 and 37,671 training and test images, respectively, containing the head and both hands. [sent-248, score-0.241]
65 Since the images are head-centered, the labels, y, used during training are 4-dimensional vectors containing the relative offset of each hand from the head. [sent-249, score-0.161]
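A small sketch of how such a 4-dimensional training label could be assembled from the crowd-sourced annotations: the relative (x, y) offset of each hand from the head position. The variable names and coordinate convention are assumptions.

```python
# Build a 4-D label from head and hand positions: relative hand offsets.
import numpy as np

def relative_hand_label(head_xy, left_hand_xy, right_hand_xy):
    head = np.asarray(head_xy, dtype=float)
    return np.concatenate([np.asarray(left_hand_xy, dtype=float) - head,
                           np.asarray(right_hand_xy, dtype=float) - head])

print(relative_hand_label((160, 60), (120, 140), (210, 150)))  # [-40.  80.  50.  90.]
```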
66 We emphasize that finding the hands is an extremely difficult task (sometimes even for human subjects). [sent-250, score-0.257]
67 Frames are low-resolution (typically the hands are 10-15 pixels in diameter) and contain camera movement as well as frequently poor lighting. [sent-251, score-0.275]
68 If the codes are made binary (as in [38]) we could use fast approximate hashing techniques to permit real-time tracking using a database of well over 1 million examples. [sent-256, score-0.169]
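As a rough illustration of the binary-code idea, the sketch below thresholds real-valued codes to bits and ranks database entries by Hamming distance; this is a toy brute-force lookup, not the approximate hashing scheme of [38], and the sign threshold and code length are assumptions.

```python
# Toy binary-code lookup: binarize codes by sign, rank by Hamming distance.
import numpy as np

def to_bits(codes):
    return (codes > 0).astype(np.uint8)  # binarize each code dimension by sign

def hamming_nn(query_bits, db_bits):
    return int(np.argmin(np.count_nonzero(db_bits != query_bits, axis=1)))

rng = np.random.default_rng(2)
db = to_bits(rng.normal(size=(1000, 64)))  # stand-in for a much larger database
query = to_bits(rng.normal(size=(64,)))
print(hamming_nn(query, db))
```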
69 The nonlinear methods show a dramatic improvement over the linear methods, especially our convolutional architectures which learn features from pixels. [sent-257, score-0.414]
70 Though our method is trained only on the relative positions of the hands from the head, it appears to capture something more substantial about body pose in general. [sent-268, score-0.583]
71 We plan on evaluating this result quantitatively, using synthetic data in which we have access to an articulated skeleton. [sent-269, score-0.232]
72 Figure 3: Visualization of the 2D C-NCAR embedding of 1024 points from the RE training set. [sent-271, score-0.153]
73 Note that even with a 2D embedding, we are able to capture pose similarity that is invariant to subject and background. [sent-273, score-0.331]
74 The leftmost column shows the query image, and the remaining columns (left to right) show the nearest neighbour found by: nonlinear C-NCAR regression, linear NCAR, GIST, pixel distance. [sent-291, score-0.461]
75 Circles mark the pose obtained by crowd-sourcing; we superimpose the pose estimated by C-NCAR onto the query with crosses. [sent-292, score-0.53]
76 3 Improving real-world performance with synthetic data There has been recent interest in using synthetic examples to improve performance on real-world vision tasks (e. [sent-294, score-0.416]
77 The subtle differences between real and synthetic data make it difficult to apply existing techniques to a dataset comprised of both types of examples. [sent-297, score-0.237]
78 This problem falls under the domain of transfer learning, but to the best of our knowledge, transfer learning between real and synthetic pairings is relatively unexplored. [sent-298, score-0.283]
79 (b) Adding synthetic data to a fixed dataset of 1024 real examples to improve test performance measured on real data. [sent-311, score-0.335]
80 Error is expressed relative to a training set with no synthetic data. [sent-312, score-0.235]
81 NCAR-1 does not re-initialize weights when more synthetic examples are added. [sent-313, score-0.231]
82 The curves show that adding synthetic examples improves performance up to a point at which the synthetic examples outnumber the real examples 2:1. [sent-315, score-0.56]
83 The pairwise nature of our approach is well-suited to learning such invariance, provided that we have established correspondences between real and synthetic examples. [sent-316, score-0.237]
84 In our case of pose estimation, this comes from the labels. [sent-317, score-0.23]
85 By forcing examples with similar poses (regardless of whether they are real or synthetic) to lie close-by in code space we can implicitly produce a representation at each layer that is invariant to the nature of the input. [sent-318, score-0.304]
86 We have not made an attempt to restrict pairings to be only between real and synthetic examples, though this may further aid in learning invariance. [sent-319, score-0.283]
87 Fig. 5(b) demonstrates the effect of gradually adding synthetic examples from SY to the RE training dataset. [sent-321, score-0.281]
88 We use a reduced-size set of 1024 real examples for training, which is gradually modified to contain synthetic examples, and a fixed set of 1024 real examples for testing. [sent-322, score-0.477]
89 Error is expressed relative to the case of no synthetic examples. [sent-323, score-0.185]
90 In NCAR-1 we do not reset the weights of the model to random each time we adjust the training set to add more synthetic examples. [sent-326, score-0.235]
91 We simply add more synthetic data and continue learning. [sent-327, score-0.185]
92 The overall result is the same for each regime: the addition of synthetic examples to the training set improves test performance on real data up to a level at which the number of synthetic examples is double the number of real examples. [sent-329, score-0.616]
93 5 Conclusions We have presented a nonparametric approach for pose estimation in realistic, challenging video datasets. [sent-330, score-0.311]
94 Our work differs from previous attempts at learning invariant mappings in that it is optimized for nearest neighbour regression rather than classification, and it scales to realistically sized images through the use of convolution and weight-sharing. [sent-332, score-0.453]
95 Recent work has demonstrated that pre-training can successfully be applied to convolutional architectures, both in the context of RBMs [22, 27] and sparse coding [19]. [sent-341, score-0.258]
96 Pictorial structures revisited: People detection and articulated pose estimation. [sent-359, score-0.277]
97 Poselets: Body part detectors trained using 3d human pose annotations. [sent-376, score-0.316]
98 Stacks of convolutional restricted boltzmann machines for shift-invariant feature learning. [sent-508, score-0.258]
99 Learning a nonlinear embedding by preserving class neighbourhood structure. [sent-548, score-0.313]
100 HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. [sent-567, score-0.181]
wordName wordTfidf (topN-words)
[('nca', 0.37), ('convolutional', 0.258), ('ncar', 0.253), ('pose', 0.23), ('gist', 0.226), ('hands', 0.204), ('head', 0.191), ('synthetic', 0.185), ('drlim', 0.161), ('neighbour', 0.156), ('body', 0.149), ('speakers', 0.127), ('lcn', 0.115), ('neighbourhood', 0.11), ('embedding', 0.103), ('nonlinear', 0.1), ('cvpr', 0.088), ('sy', 0.087), ('frames', 0.085), ('video', 0.081), ('ij', 0.08), ('images', 0.079), ('pij', 0.075), ('nearest', 0.072), ('pixels', 0.071), ('query', 0.07), ('backgrounds', 0.069), ('dij', 0.069), ('pictorial', 0.069), ('lncar', 0.069), ('siamese', 0.069), ('talks', 0.069), ('tracking', 0.068), ('invariant', 0.066), ('layer', 0.066), ('amazon', 0.066), ('pixel', 0.063), ('abs', 0.06), ('knn', 0.06), ('yi', 0.06), ('mechanical', 0.058), ('image', 0.057), ('labels', 0.057), ('architectures', 0.056), ('yij', 0.056), ('metric', 0.056), ('face', 0.056), ('shakhnarovich', 0.056), ('ld', 0.056), ('human', 0.053), ('invariance', 0.053), ('codes', 0.052), ('real', 0.052), ('neighbours', 0.052), ('convolutions', 0.052), ('turk', 0.052), ('distance', 0.051), ('training', 0.05), ('yj', 0.05), ('architecture', 0.049), ('hashing', 0.049), ('regression', 0.049), ('iccv', 0.047), ('soft', 0.047), ('articulated', 0.047), ('lnca', 0.046), ('pairings', 0.046), ('examples', 0.046), ('dissimilarity', 0.045), ('people', 0.044), ('temporal', 0.044), ('tanh', 0.044), ('network', 0.042), ('lie', 0.042), ('zi', 0.041), ('mapping', 0.041), ('humaneva', 0.04), ('neighbourhoods', 0.04), ('mobahi', 0.04), ('poselets', 0.04), ('hadsell', 0.04), ('numbered', 0.04), ('xj', 0.04), ('ls', 0.04), ('frame', 0.04), ('pinto', 0.037), ('nn', 0.036), ('contrastive', 0.036), ('background', 0.036), ('label', 0.036), ('similarity', 0.035), ('keller', 0.035), ('detectors', 0.033), ('drives', 0.033), ('implicitly', 0.032), ('hand', 0.032), ('triggs', 0.031), ('forsyth', 0.031), ('seed', 0.031), ('sized', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor
Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data. 1
2 0.23658381 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
Author: Koray Kavukcuoglu, Pierre Sermanet, Y-lan Boureau, Karol Gregor, Michael Mathieu, Yann L. Cun
Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasisparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multistage convolutional network architecture improves performance on a number of visual recognition and detection tasks. 1
3 0.13304517 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
Author: David Weiss, Benjamin Sapp, Ben Taskar
Abstract: For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem. 1
4 0.11982699 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
5 0.11489902 149 nips-2010-Learning To Count Objects in Images
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
6 0.113399 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
7 0.11000943 138 nips-2010-Large Margin Multi-Task Metric Learning
8 0.10953461 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
9 0.1083729 257 nips-2010-Structured Determinantal Point Processes
11 0.10437724 133 nips-2010-Kernel Descriptors for Visual Recognition
12 0.10397749 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
13 0.10386864 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
14 0.10062923 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
15 0.097834341 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
16 0.096357517 103 nips-2010-Generating more realistic images using gated MRF's
17 0.09563946 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
18 0.094823554 94 nips-2010-Feature Set Embedding for Incomplete Data
19 0.091430016 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
20 0.08844611 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
topicId topicWeight
[(0, 0.232), (1, 0.107), (2, -0.137), (3, -0.175), (4, 0.042), (5, -0.033), (6, -0.026), (7, 0.031), (8, -0.038), (9, -0.001), (10, 0.019), (11, -0.076), (12, 0.023), (13, -0.077), (14, -0.013), (15, -0.05), (16, 0.001), (17, -0.063), (18, -0.098), (19, -0.049), (20, 0.006), (21, -0.047), (22, -0.025), (23, -0.032), (24, 0.078), (25, 0.049), (26, -0.023), (27, -0.128), (28, -0.011), (29, -0.02), (30, 0.025), (31, 0.007), (32, -0.069), (33, -0.034), (34, 0.089), (35, 0.006), (36, -0.038), (37, -0.085), (38, 0.05), (39, 0.038), (40, -0.136), (41, 0.011), (42, -0.158), (43, 0.002), (44, 0.066), (45, -0.052), (46, 0.123), (47, 0.01), (48, 0.01), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.94231009 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor
Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data. 1
2 0.70740181 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
Author: Koray Kavukcuoglu, Pierre Sermanet, Y-lan Boureau, Karol Gregor, Michael Mathieu, Yann L. Cun
Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redundancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasisparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multistage convolutional network architecture improves performance on a number of visual recognition and detection tasks. 1
3 0.61742187 271 nips-2010-Tiled convolutional neural networks
Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng
Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. 1
4 0.61540884 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
Author: Peggy Series, David P. Reichert, Amos J. Storkey
Abstract: The Charles Bonnet Syndrome (CBS) is characterized by complex vivid visual hallucinations in people with, primarily, eye diseases and no other neurological pathology. We present a Deep Boltzmann Machine model of CBS, exploring two core hypotheses: First, that the visual cortex learns a generative or predictive model of sensory input, thus explaining its capability to generate internal imagery. And second, that homeostatic mechanisms stabilize neuronal activity levels, leading to hallucinations being formed when input is lacking. We reproduce a variety of qualitative findings in CBS. We also introduce a modification to the DBM that allows us to model a possible role of acetylcholine in CBS as mediating the balance of feed-forward and feed-back processing. Our model might provide new insights into CBS and also demonstrates that generative frameworks are promising as hypothetical models of cortical learning and perception. 1
5 0.58038378 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
6 0.57726204 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
7 0.55969197 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
8 0.55335903 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
9 0.55260426 149 nips-2010-Learning To Count Objects in Images
10 0.54392666 267 nips-2010-The Multidimensional Wisdom of Crowds
11 0.53115958 257 nips-2010-Structured Determinantal Point Processes
12 0.52776635 281 nips-2010-Using body-anchored priors for identifying actions in single images
13 0.5201782 99 nips-2010-Gated Softmax Classification
14 0.50697976 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
15 0.50334281 94 nips-2010-Feature Set Embedding for Incomplete Data
16 0.50055999 138 nips-2010-Large Margin Multi-Task Metric Learning
17 0.49856508 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
18 0.47986594 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
19 0.47609386 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
20 0.4754267 17 nips-2010-A biologically plausible network for the computation of orientation dominance
topicId topicWeight
[(13, 0.059), (17, 0.042), (27, 0.06), (30, 0.043), (35, 0.03), (45, 0.215), (50, 0.055), (52, 0.035), (60, 0.04), (77, 0.048), (78, 0.024), (90, 0.045), (99, 0.244)]
simIndex simValue paperId paperTitle
same-paper 1 0.81128061 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor
Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data. 1
2 0.7980125 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data
Author: Nicholas Fisher, Arunava Banerjee
Abstract: From a functional viewpoint, a spiking neuron is a device that transforms input spike trains on its various synapses into an output spike train on its axon. We demonstrate in this paper that the function mapping underlying the device can be tractably learned based on input and output spike train data alone. We begin by posing the problem in a classification based framework. We then derive a novel kernel for an SRM0 model that is based on PSP and AHP like functions. With the kernel we demonstrate how the learning problem can be posed as a Quadratic Program. Experimental results demonstrate the strength of our approach. 1
3 0.7854712 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
4 0.7104308 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
5 0.7098369 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
6 0.70943558 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
7 0.708009 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
8 0.70793462 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
9 0.70792031 117 nips-2010-Identifying graph-structured activation patterns in networks
10 0.70787853 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.7075659 63 nips-2010-Distributed Dual Averaging In Networks
12 0.70713961 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.70526385 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
14 0.70524633 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
15 0.70520192 158 nips-2010-Learning via Gaussian Herding
16 0.70451885 282 nips-2010-Variable margin losses for classifier design
17 0.70354015 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
18 0.70241785 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
19 0.70228595 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
20 0.70206976 148 nips-2010-Learning Networks of Stochastic Differential Equations