nips nips2011 nips2011-154 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. [sent-2, score-0.685]
2 First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. [sent-4, score-0.853]
3 Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. [sent-5, score-0.892]
4 Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. [sent-6, score-0.47]
5 Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. [sent-7, score-1.621]
6 Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. [sent-8, score-0.413]
7 The key difficulty stems from the fact that the imaged appearance of a person performing a particular action can vary significantly due to many factors such as camera viewpoint, person’s clothing, occlusions, variation of body pose, object appearance and the layout of the scene. [sent-11, score-1.064]
8 As opposed to action recognition in video [6, 27, 31], action recognition in still images has received relatively little attention. [sent-14, score-0.581]
9 A number of previous works [21, 24, 37] focus on exploiting body pose as a cue for action recognition. [sent-15, score-0.767]
10 Reliable estimation of body configurations for people in arbitrary poses, however, remains a very challenging research problem. [sent-17, score-0.46]
11 In this work, we investigate discriminatively trained models of interactions between objects and human body parts. [sent-21, score-0.986]
12 Rather than relying on accurate estimation of body part configurations or accurate object detection in the image, we represent human actions as locally orderless distributions over body parts and objects together with their interactions. [sent-23, score-1.602]
13 By opportunistically learning class-specific object and body part interactions (e.g. [sent-24, score-0.953]
14 relative configuration of leg and horse detections for the riding horse action, see Figure 1), we avoid the extremely challenging task of estimating the full body configuration. [sent-26, score-1.114]
15 First, we replace the quantized HOG/SIFT features, typically used in bag-of-features models [11, 28, 36] with powerful, discriminatively trained, local object and human body part detectors [7, 25]. [sent-29, score-1.156]
16 Second, we develop a part interaction representation, capturing pair-wise relative position and scale between object/body parts, and include this representation in a scale-space spatial pyramid model. [sent-33, score-0.632]
17 Suitable pair-wise interactions are first chosen from a large pool of hundreds of thousands of candidate interactions using a linear support vector machine (SVM) with a sparsity inducing regularizer. [sent-35, score-0.715]
18 The selected interaction features are then input into a final, more computationally expensive, non-linear SVM classifier based on the locally orderless spatial pyramid representation. [sent-36, score-0.618]
19 2 Related work Modeling person-object interactions for action recognition has recently attracted significant attention. [sent-37, score-0.535]
20 [37], and Yao and Fei Fei [40] develop joint models of body pose configuration and object location within the image. [sent-40, score-0.754]
21 While we build on the recent body pose estimation work by using strong pose-specific body part models [7, 25], we explicitly avoid inferring the complete body configuration. [sent-42, score-1.445]
22 [13] avoid inferring the body configuration by representing a small set of body postures using single HOG templates, and represent the relative position of the entire person and an object using simple relations (e.g. [sent-44, score-1.223]
23 They do not explicitly model body parts and their interactions with objects as we do in this work. [sent-47, score-0.858]
24 [38] model the body pose as a latent variable for action recognition. [sent-49, score-0.767]
25 Unlike our method, however, they do not attempt to model interactions between people (their body parts) and objects. [sent-50, score-0.735]
26 [30] also represent people by activation responses of body part detectors (rather than inferring the actual body pose), however, they model only interactions between person and object bounding boxes, not considering individual body parts, as we do in this work. [sent-52, score-2.366]
27 Learning spatial groupings of low-level (SIFT) features for recognizing person-object interactions has been explored by Yao and Fei Fei [39]. [sent-53, score-0.498]
28 While we also learn spatial interactions, we build on powerful body part and object detectors pre-learnt on separate training data, providing a degree of generalization over appearance (e.g. [sent-54, score-1.114]
29 Unlike [39], we deploy discriminative selection of interactions using an SVM with a sparsity inducing regularizer. [sent-57, score-0.312]
30 Spatial-pyramid based bag-of-features models have demonstrated excellent performance on action recognition in still images [1, 11] outperforming body pose based methods [21] or grouplet models [40] on their datasets [11]. [sent-58, score-0.913]
31 Our image representation is defined by the max-pooling of interaction responses over the whole image, solved efficiently by the distance transform. [sent-62, score-0.424]
32 The work in [29], however, does not attempt to model people, body parts and their interactions with objects. [sent-64, score-0.773]
33 Related work also includes models of contextual spatial and co-occurrence relationships between objects [12, 32] as well as objects and the scene [22, 23, 35]. [sent-65, score-0.333]
34 Object part detectors trained from labelled data also form a key ingredient of attribute-based object representations [15, 26]. [sent-66, score-0.606]
35 While we build on this body of work, these approaches do not model interactions of people and their body parts with objects and focus on object/scene recognition rather than recognition of human actions. [sent-67, score-1.527]
36 3 Representing person-object interactions. This section describes our image representation in terms of body parts, objects and interactions among them. [sent-68, score-1.16]
37 3.1 Representing body parts and objects. We assume we have a set of n available detectors d1, . . . , dn [sent-70, score-0.867]
38 which have been pre-trained for different body parts and object classes. [sent-73, score-0.697]
39 We express the positions of detections p in terms of scale-space coordinates p = (x, y, σ̃), where (x, y) corresponds to the spatial location and σ̃ = log σ is an additive scale parameter, log-related to the image scale factor σ, making addition in the position vector space meaningful. [sent-75, score-0.429]
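For illustration, this scale-space encoding can be written down directly. A minimal Python sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def scale_space_position(x, y, scale):
    """Map a detection at pixel (x, y) found at image scale factor `scale`
    to scale-space coordinates p = (x, y, sigma~), with sigma~ = log(scale),
    so that offsets in all three coordinates add meaningfully."""
    return np.array([x, y, np.log(scale)])

# A relative displacement between two detections is then a plain difference:
p1 = scale_space_position(120, 80, 1.0)
p2 = scale_space_position(140, 150, 2.0)
u = p2 - p1   # (dx, dy, d_sigma~) = (20, 70, log 2)
```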
40 For objects we use the LSVM detector [17], trained on PASCAL VOC images for ten object classes (see footnote 1). [sent-77, score-0.505]
41 For body parts we implement the method of [25] and train ten body part detectors (see footnote 2) for each of sixteen pose clusters, giving 160 body part detectors in total (see [25] for further details). [sent-78, score-1.884]
42 Both detector types use Histograms of Oriented Gradients (HOG) [10] as the underlying low-level image representation. [sent-79, score-0.359]
43 Footnote 1: the ten object detectors correspond to the object classes bicycle, car, chair, cow, dining table, horse, motorbike, person, sofa, and tv/monitor. Footnote 2: the ten body part detectors correspond to head, torso, and {left, right} × {forearm, upper arm, lower leg, thigh}. [sent-80, score-1.516]
44 3.2 Representing pairwise interactions. We define interactions by the pairs of detectors (di, dj) as well as by the spatial and scale relations among them. [sent-81, score-1.081]
45 Each pair of detectors constitutes a two-node tree, where the position and scale of the leaf are related to the root by a scale-space offset and a spatial deformation cost. [sent-82, score-0.595]
46 Figure 1 illustrates an example of an interaction between a horse and the left thigh for the horse riding action. [sent-84, score-0.862]
47 We measure the response of the interaction q located at the root position p1 by r(I, q, p1) = max_{p2} [ di(I, p1) + dj(I, p2) − uᵀCu ]   (1), where u = p2 − (p1 + v) is the displacement vector corresponding to the drift of the leaf node with respect to its expected position (p1 + v). [sent-85, score-0.617]
48 For any interaction q we compute its responses for all pairs of node positions p1 , p2 . [sent-87, score-0.428]
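A brute-force reading of eq. (1) is easy to sketch; the paper solves the inner maximization over p2 efficiently with a distance transform, but the naive version below makes the computed quantity explicit (array shapes and argument names are our assumptions):

```python
import numpy as np

def interaction_response(di_scores, dj_scores, positions, p1_idx, v, C):
    """Brute-force evaluation of eq. (1):
        r(I, q, p1) = max_{p2} [ d_i(I, p1) + d_j(I, p2) - u^T C u ],
    with u = p2 - (p1 + v) the drift of the leaf from its expected position.

    di_scores, dj_scores : (N,) detector responses at the N candidate positions
    positions            : (N, 3) scale-space coordinates (x, y, log sigma)
    p1_idx               : index of the root position p1 in `positions`
    v                    : (3,) expected scale-space offset of the leaf
    C                    : (3, 3) positive-definite deformation cost matrix
    """
    p1 = positions[p1_idx]
    u = positions - (p1 + v)                      # drift vector for every p2
    penalty = np.einsum('ni,ij,nj->n', u, C, u)   # u^T C u, one value per p2
    return di_scores[p1_idx] + np.max(dj_scores - penalty)
```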
49 3.3 Representing images by response vectors of pair-wise interactions. Given a set of M interaction pairs q1, · · · , qM, we wish to aggregate their responses (1) over an image region A. [sent-90, score-0.845]
50 Here A can be (i) an (extended) person bounding box, as used for selecting discriminative interaction features (Section 4.2). [sent-91, score-0.575]
51 We define the score s(I, q, A) of an interaction pair q within a region A of an image I by max-pooling, i.e. s(I, q, A) = max_{p∈A} r(I, q, p)   (2). [sent-94, score-0.346]
52 An image region A is then represented by an M-vector of interaction pair scores z = (s1, · · · , sM), with si = s(I, qi, A)   (3). [sent-97, score-0.346]
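In code, eqs. (2) and (3) reduce to a max-pool of each interaction's response map over the region A, followed by stacking. A sketch under the same assumptions as above:

```python
import numpy as np

def region_score(response_map, mask):
    """Eq. (2): max-pool one interaction's responses over a region A.
    response_map : (H, W) responses r(I, q, p) on the image grid
    mask         : (H, W) boolean, True inside the region A
    """
    return response_map[mask].max()

def region_descriptor(response_maps, mask):
    """Eq. (3): represent region A by the M-vector z = (s_1, ..., s_M)."""
    return np.array([region_score(r, mask) for r in response_maps])
```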
53 4 Learning person-object interactions. Given the object and body part interaction pairs q introduced in the previous section, we wish to use them for action classification in still images. [sent-98, score-1.492]
54 A brute-force approach of analyzing all possible interactions, however, is computationally prohibitive since the space of all possible interactions is combinatorial in the number of detectors and scale-space relations among them. [sent-99, score-0.559]
55 To address this problem, we aim in this paper to select a set of M action-specific interaction pairs q1, . . . , qM [sent-100, score-0.327]
56 which are both representative and discriminative for a given action class. [sent-103, score-0.318]
57 First, for each action we generate a large pool of candidate interactions, each comprising a pair of (body part / object) detectors and their relative scale-space displacement. [sent-105, score-0.722]
58 This step is data-driven and selects candidate detection pairs which frequently occur for a particular action in a consistent relative scale-space configuration. [sent-106, score-0.414]
59 Next, from this initial pool of candidate interactions we select a set of M discriminative interactions which best separate the particular action class from other classes in our training set. [sent-107, score-1.03]
60 Finally, the discriminative interactions are combined across classes and used as interaction features in our final non-linear spatial-pyramid like SVM classifier. [sent-109, score-0.695]
61 4.1 Generating a candidate pool of interaction pairs. To initialize our model, we first generate a large pool of candidate interactions in a data-driven manner. [sent-112, score-0.858]
62 Following the suggestion in [17] that the precise choice of the deformation cost C may not be critical, we set C to a reasonable fixed value for all pairs and focus on finding clusters of frequently co-occurring detectors (di, dj) in specific relative configurations. [sent-113, score-0.398]
63 For each detector i and an image I, we first collect a set of positions of all positive detector responses PI = {p | di (I, p) > 0}, where di (I, p) is the response of detector i at position p in image I. [sent-114, score-0.764]
64 For each pair of detectors (di, dj) we then gather the relative displacements between their detections over all training images Ik: Dij = ∪k { pj − pi | pi ∈ PIk for detector i and pj ∈ PIk for detector j }. [sent-117, score-0.666]
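The pool-generation step can be sketched as follows: gather all relative scale-space displacements for a detector pair across the training images, then keep the frequently occurring modes. We use scikit-learn's MeanShift with displacements pre-scaled by the per-dimension radius R as a stand-in for the paper's clustering; the fraction threshold plays the role of η (names and defaults are ours):

```python
import numpy as np
from sklearn.cluster import MeanShift

def candidate_offsets(detections_i, detections_j, radius, min_cluster_frac=0.08):
    """detections_i, detections_j : lists (one per training image) of (k, 3)
    arrays of positive-detection positions for detectors d_i and d_j.
    radius : (3,) per-dimension mean-shift radius R = (Rx, Ry, Rsigma).
    Returns cluster centers, i.e. candidate scale-space offsets v."""
    radius = np.asarray(radius, dtype=float)
    disp = []
    for pi, pj in zip(detections_i, detections_j):
        if len(pi) and len(pj):
            # all pairwise displacements p_j - p_i within this image
            disp.append((pj[None, :, :] - pi[:, None, :]).reshape(-1, 3))
    disp = np.vstack(disp) / radius            # scale so bandwidth is isotropic
    ms = MeanShift(bandwidth=1.0).fit(disp)
    counts = np.bincount(ms.labels_)
    keep = counts >= min_cluster_frac * len(disp)  # keep frequent modes only
    return ms.cluster_centers_[keep] * radius      # undo the pre-scaling
```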
65 4.2 Discriminative selection of interaction pairs. The initialization described above produces a large number of candidate interactions. [sent-123, score-0.396]
66 Selecting the M interaction pairs corresponding to the non-zero elements of w gives the M most discriminative (according to (4)) interaction pairs per action class. [sent-129, score-0.972]
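This selection can be sketched with an L1-regularized linear SVM, keeping the candidate interactions whose learned weights are non-zero (the regularization strength below is illustrative, not the paper's setting):

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_interactions(Z, y, C=0.1):
    """Z : (n_samples, n_candidates) pooled interaction scores, eq. (2)
    y : (n_samples,) +1 for the target action class, -1 otherwise.
    Returns indices of candidates with non-zero weight in the sparse SVM."""
    svm = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=C)
    svm.fit(Z, y)
    w = svm.coef_.ravel()
    return np.flatnonzero(np.abs(w) > 1e-8)   # the selected interaction pairs
```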
67 4.3 Using interaction pairs for classification. Given a set of M discriminative interactions for each action class, obtained as described above, we wish to train a final non-linear action classifier. [sent-133, score-1.132]
68 We use a spatial pyramid-like representation [28], aggregating responses in each cell of the pyramid using max-pooling as described by eq. (2). [sent-134, score-0.34]
69 The discriminative selection step (Section 4.2) is necessary in this case, as applying the non-linear spatial pyramid classifier to the entire pool of candidate interactions would be computationally infeasible. [sent-141, score-0.635]
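The final per-image representation can then be sketched as per-cell max-pooling of the selected interactions' response maps over a spatial pyramid; the concatenated vectors feed the non-linear SVM. Grid sizes below are illustrative:

```python
import numpy as np

def spatial_pyramid_descriptor(response_maps, levels=(1, 2, 4)):
    """Max-pool each (H, W) response map inside every cell of a spatial
    pyramid and concatenate: one feature per (map, pyramid cell)."""
    H, W = response_maps[0].shape
    feats = []
    for n in levels:                        # an n x n grid at each level
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for r in response_maps:
            for i in range(n):
                for j in range(n):
                    feats.append(r[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max())
    return np.array(feats)
```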
70 The Willow-action dataset contains more than 900 images with more than 1100 labelled person detections from 7 human action classes: Interacting with Computer, Photographing, Playing Music, Riding Bike, Riding Horse, Running and Walking. [sent-143, score-0.625]
71 Each training and testing image in both datasets is annotated with the smallest bounding box containing each person and by the performed action(s). [sent-147, score-0.319]
72 In the human action training/test data, we extend each given person bounding box by 50% and resize the image so that the bounding box has a maximum size of 300 pixels. [sent-150, score-0.768]
73 We run the detectors over the transformed bounding boxes and consider the image scales sk = 2^(k/10) for k ∈ {−10, · · · , 10}. [sent-151, score-0.428]
74 At each scale we extract the detector response every 4 pixels and 8 pixels for the body part and object detectors, respectively. [sent-152, score-0.821]
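For reference, the scale set and sampling steps just described are trivial to enumerate:

```python
import numpy as np

scales = 2.0 ** (np.arange(-10, 11) / 10.0)   # s_k = 2^(k/10), k = -10, ..., 10
stride = {'body_part': 4, 'object': 8}        # response sampling step in pixels
```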
75 We generate the candidate interaction pairs by taking the mean-shift radius R = (30, 30, log(2)/2), L = 3 and η = 8%. [sent-154, score-0.396]
76 We select M = 310 discriminative interaction pairs to compute the final spatial pyramid representation of each image. [sent-156, score-0.704]
77 (BOF) is the bag-of-features classifier [11], aggregating quantized responses of densely sampled HOG features in a spatial pyramid representation, using a (non-linear) intersection kernel. [sent-160, score-0.374]
78 To obtain a single classification score for each person bounding box, we take the maximum LSVM detection score among the detections overlapping the extended bounding box with the standard overlap score [14] higher than 0.5. [sent-164, score-0.479]
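Turning per-image detections into a single per-person score can be sketched with the standard intersection-over-union overlap (function names are ours):

```python
import numpy as np

def iou(a, b):
    """Standard overlap score of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def person_score(person_box, det_boxes, det_scores, thresh=0.5):
    """Max LSVM detection score among detections whose overlap with the
    (extended) person bounding box exceeds `thresh`."""
    scores = [s for b, s in zip(det_boxes, det_scores)
              if iou(person_box, b) > thresh]
    return max(scores) if scores else -np.inf
```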
79 (Detectors) is an SVM classifier with an RBF kernel, trained on max-pooled responses of the entire bank of body part and object detectors in a spatial pyramid representation, but without interactions. [sent-167, score-1.407]
80 This baseline is similar in spirit to the object bank representation [29], but here targeted to action classification by including a bank of pose-specific body part detectors as well as object detectors. [sent-168, score-1.512]
81 The largest improvements are obtained on the Riding Bike and Riding Horse actions, for which reliable object detectors are available. [sent-171, score-0.483]
82 with respect to using the plain bank of object and body part detectors c. [sent-173, score-1.012]
83 Example detections of interaction pairs are shown in Figure 2. [sent-175, score-0.446]
84 6 Conclusion. We have developed person-object interaction features based on non-rigid relative scale-space displacements of pairs of body part and object detectors. [sent-187, score-1.129]
85 Further, we have shown that such features can be learnt in a discriminative fashion and can improve action classification performance over a strong bag-of-features baseline in challenging realistic images of common human actions. [sent-188, score-0.531]
86 In addition, the learnt interaction features in some cases correspond to visually meaningful configurations of body parts, or of body parts with objects. [sent-189, score-1.189]
87 Figure 2: Example detections of discriminative interaction pairs. [sent-202, score-0.466]
88 These body part interaction pairs are chosen as discriminative (high positive weight wi ) for action classes indicated on the left. [sent-203, score-1.158]
89 In each row, the first three images show detections on the correct action class. [sent-204, score-0.392]
90 The last image shows a high scoring detection on an incorrect action class. [sent-205, score-0.334]
91 In the examples shown, the interaction features capture either a body part and an object, or two body parts. [sent-206, score-1.238]
92 Note that while these interaction pairs are found to be discriminative, due to detection noise they do not necessarily localize the correct body parts in all images. [sent-207, score-0.872]
93 However, they may still fire at consistent locations across many images as illustrated in the second row, where the head detector consistently detects the camera lens, and the thigh detector fires consistently at the edge of the head. [sent-208, score-0.475]
94 Similarly, the leg detector seems to consistently fire on keyboards (see the third image in the first row for an example), thus improving the confidence of the computer detections for the "Interacting with computer" action. [sent-209, score-0.374]
95 We use only the small set of object detectors available at [2]; however, we are now in a position to include many additional object (camera, computer, laptop) or texture (grass, road, trees) detectors, trained on additional datasets such as ImageNet or LabelMe. [sent-294, score-0.789]
96 Currently, we consider detections of entire objects, but the proposed model can be easily extended to represent interactions between body parts and parts of objects [8]. [sent-295, score-1.064]
97 Poselets: Body part detectors trained using 3D human pose annotations. [sent-332, score-0.664]
98 Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. [sent-472, score-0.351]
99 Grouplet: A structured image representation for recognizing human and object interactions. [sent-543, score-0.496]
100 Modeling mutual context of object and human pose in human-object interaction activities. [sent-548, score-0.697]
wordName wordTfidf (topN-words)
[('body', 0.411), ('detectors', 0.284), ('interactions', 0.275), ('interaction', 0.241), ('action', 0.212), ('object', 0.199), ('riding', 0.197), ('horse', 0.156), ('pose', 0.144), ('person', 0.12), ('detections', 0.119), ('pyramid', 0.118), ('spatial', 0.114), ('human', 0.113), ('thigh', 0.112), ('cyan', 0.107), ('discriminative', 0.106), ('lsvm', 0.105), ('detector', 0.105), ('bike', 0.093), ('bof', 0.093), ('cvpr', 0.091), ('parts', 0.087), ('pairs', 0.086), ('objects', 0.085), ('displacement', 0.085), ('voc', 0.078), ('leg', 0.075), ('actions', 0.075), ('forearm', 0.075), ('orderless', 0.075), ('image', 0.075), ('pascal', 0.071), ('recognizing', 0.07), ('bounding', 0.069), ('responses', 0.069), ('candidate', 0.069), ('part', 0.068), ('deformation', 0.067), ('svm', 0.061), ('images', 0.061), ('fei', 0.06), ('pool', 0.059), ('photographing', 0.056), ('walking', 0.056), ('trained', 0.055), ('box', 0.055), ('di', 0.054), ('classi', 0.053), ('position', 0.052), ('bank', 0.05), ('people', 0.049), ('scene', 0.049), ('leaf', 0.048), ('recognition', 0.048), ('discriminatively', 0.047), ('dj', 0.047), ('detection', 0.047), ('head', 0.046), ('camera', 0.046), ('desai', 0.045), ('motorbike', 0.045), ('clothing', 0.045), ('laptev', 0.045), ('bourdev', 0.045), ('maji', 0.045), ('poselet', 0.045), ('yao', 0.045), ('pj', 0.045), ('iccv', 0.043), ('hog', 0.042), ('gurations', 0.041), ('pi', 0.04), ('consumer', 0.04), ('photographs', 0.04), ('features', 0.039), ('representation', 0.039), ('response', 0.038), ('appearance', 0.038), ('delaitre', 0.037), ('dining', 0.037), ('grouplet', 0.037), ('phoning', 0.037), ('rocquencourt', 0.037), ('scalespace', 0.037), ('inducing', 0.037), ('playing', 0.037), ('blue', 0.036), ('er', 0.036), ('quantized', 0.034), ('con', 0.034), ('classes', 0.034), ('pik', 0.033), ('positions', 0.032), ('poses', 0.031), ('locally', 0.031), ('poselets', 0.03), ('pair', 0.03), ('representing', 0.03), ('paris', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.28806949 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.18409877 193 nips-2011-Object Detection with Grammar Models
Author: Ross B. Girshick, Pedro F. Felzenszwalb, David A. McAllester
Abstract: Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data. 1
4 0.18068655 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
5 0.17686203 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1
6 0.14539689 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
7 0.13094431 180 nips-2011-Multiple Instance Filtering
8 0.12955272 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
9 0.12713814 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
10 0.12188654 168 nips-2011-Maximum Margin Multi-Instance Learning
11 0.1205872 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
12 0.10794964 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
13 0.10780519 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
14 0.10748513 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
15 0.10595272 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
16 0.10148174 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
17 0.095516749 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
18 0.087728314 276 nips-2011-Structured sparse coding via lateral inhibition
19 0.087496631 165 nips-2011-Matrix Completion for Multi-label Image Classification
20 0.086501114 231 nips-2011-Randomized Algorithms for Comparison-based Search
topicId topicWeight
[(0, 0.216), (1, 0.126), (2, -0.134), (3, 0.284), (4, 0.12), (5, 0.086), (6, 0.041), (7, -0.015), (8, 0.056), (9, 0.141), (10, 0.028), (11, -0.034), (12, -0.075), (13, 0.123), (14, 0.011), (15, -0.016), (16, 0.054), (17, 0.022), (18, -0.048), (19, 0.041), (20, -0.005), (21, 0.051), (22, -0.088), (23, 0.104), (24, 0.142), (25, -0.028), (26, 0.003), (27, 0.079), (28, -0.024), (29, 0.029), (30, -0.027), (31, 0.086), (32, 0.008), (33, -0.005), (34, -0.023), (35, -0.005), (36, 0.035), (37, -0.078), (38, -0.106), (39, -0.02), (40, 0.015), (41, -0.053), (42, 0.038), (43, -0.055), (44, 0.034), (45, 0.007), (46, -0.028), (47, -0.045), (48, -0.01), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9782958 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.81914735 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.80029815 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1
4 0.78551286 193 nips-2011-Object Detection with Grammar Models
Author: Ross B. Girshick, Pedro F. Felzenszwalb, David A. McAllester
Abstract: Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data. 1
5 0.7746436 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
Author: Andreas Geiger, Christian Wojek, Raquel Urtasun
Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1
6 0.73666102 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
7 0.72984105 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
8 0.68914622 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
9 0.68820083 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
10 0.63156152 293 nips-2011-Understanding the Intrinsic Memorability of Images
11 0.62807941 127 nips-2011-Image Parsing with Stochastic Scene Grammar
12 0.58457887 180 nips-2011-Multiple Instance Filtering
13 0.58118564 35 nips-2011-An ideal observer model for identifying the reference frame of objects
14 0.57028198 275 nips-2011-Structured Learning for Cell Tracking
15 0.56619883 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
16 0.55480331 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
17 0.52184886 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
18 0.52160621 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
19 0.47190711 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
20 0.44059607 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
topicId topicWeight
[(0, 0.012), (4, 0.112), (20, 0.098), (26, 0.02), (31, 0.061), (33, 0.077), (43, 0.082), (45, 0.103), (57, 0.041), (59, 0.184), (65, 0.012), (74, 0.037), (83, 0.018), (84, 0.013), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.84746206 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.747935 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
3 0.73091304 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
Author: Daniel J. Lizotte
Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1
4 0.72510916 127 nips-2011-Image Parsing with Stochastic Scene Grammar
Author: Yibiao Zhao, Song-chun Zhu
Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1
5 0.71373057 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
6 0.70195699 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
7 0.69910151 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
8 0.69900668 193 nips-2011-Object Detection with Grammar Models
9 0.69650131 227 nips-2011-Pylon Model for Semantic Segmentation
10 0.69174075 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
11 0.68977797 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
12 0.68895084 168 nips-2011-Maximum Margin Multi-Instance Learning
13 0.68659133 139 nips-2011-Kernel Bayes' Rule
14 0.6840874 213 nips-2011-Phase transition in the family of p-resistances
15 0.67935872 303 nips-2011-Video Annotation and Tracking with Active Learning
16 0.67382669 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
17 0.67315227 180 nips-2011-Multiple Instance Filtering
18 0.67222714 231 nips-2011-Randomized Algorithms for Comparison-based Search
19 0.66387987 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
20 0.66229719 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm