nips nips2011 nips2011-154 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. [sent-2, score-0.685]
2 First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. [sent-4, score-0.853]
3 Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. [sent-5, score-0.892]
4 Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. [sent-6, score-0.47]
5 Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. [sent-7, score-1.621]
6 Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. [sent-8, score-0.413]
7 The key difficulty stems from the fact that the imaged appearance of a person performing a particular action can vary significantly due to many factors such as camera viewpoint, person’s clothing, occlusions, variation of body pose, object appearance and the layout of the scene. [sent-11, score-1.064]
8 As opposed to action recognition in video [6, 27, 31], action recognition in still images has received relatively little attention. [sent-14, score-0.581]
9 A number of previous works [21, 24, 37] focus on exploiting body pose as a cue for action recognition. [sent-15, score-0.767]
10 Reliable estimation of body configurations for people in arbitrary poses, however, remains a very challenging research problem. [sent-17, score-0.46]
11 In this work, we investigate discriminatively trained models of interactions between objects and human body parts. [sent-21, score-0.986]
12 Rather than relying on accurate estimation of body part configurations or accurate object detection in the image, we represent human actions as locally orderless distributions over body parts and objects together with their interactions. [sent-23, score-1.602]
13 By opportunistically learning class-specific object and body part interactions (e.g. [sent-24, score-0.953]
14 relative configuration of leg and horse detections for the riding horse action, see Figure 1), we avoid the extremely challenging task of estimating the full body configuration. [sent-26, score-1.114]
15 First, we replace the quantized HOG/SIFT features, typically used in bag-of-features models [11, 28, 36] with powerful, discriminatively trained, local object and human body part detectors [7, 25]. [sent-29, score-1.156]
16 Second, we develop a part interaction representation, capturing pair-wise relative position and scale between object/body parts, and include this representation in a scale-space spatial pyramid model. [sent-33, score-0.632]
17 Suitable pair-wise interactions are first chosen from a large pool of hundreds of thousands of candidate interactions using a linear support vector machine (SVM) with a sparsity inducing regularizer. [sent-35, score-0.715]
18 The selected interaction features are then input into a final, more computationally expensive, non-linear SVM classifier based on the locally orderless spatial pyramid representation. [sent-36, score-0.618]
19 2 Related work Modeling person-object interactions for action recognition has recently attracted significant attention. [sent-37, score-0.535]
20 [37], and Yao and Fei Fei [40] develop joint models of body pose configuration and object location within the image. [sent-40, score-0.754]
21 While we build on the recent body pose estimation work by using strong pose-specific body part models [7, 25], we explicitly avoid inferring the complete body configuration. [sent-42, score-1.445]
22 [13] avoid inferring the body configuration by representing a small set of body postures using single HOG templates, and represent the relative position of the entire person and an object using simple relations (e.g. [sent-44, score-1.223]
23 They do not explicitly model body parts and their interactions with objects as we do in this work. [sent-47, score-0.858]
24 [38] model the body pose as a latent variable for action recognition. [sent-49, score-0.767]
25 Unlike our method, however, they do not attempt to model interactions between people (their body parts) and objects. [sent-50, score-0.735]
26 [30] also represent people by activation responses of body part detectors (rather than inferring the actual body pose), however, they model only interactions between person and object bounding boxes, not considering individual body parts, as we do in this work. [sent-52, score-2.366]
27 Learning spatial groupings of low-level (SIFT) features for recognizing person-object interactions has been explored by Yao and Fei Fei [39]. [sent-53, score-0.498]
28 While we also learn spatial interactions, we build on powerful body part and object detectors pre-learnt on separate training data, providing a degree of generalization over appearance (e.g. [sent-54, score-1.114]
29 Unlike [39], we deploy discriminative selection of interactions using an SVM with a sparsity inducing regularizer. [sent-57, score-0.312]
30 Spatial-pyramid based bag-of-features models have demonstrated excellent performance on action recognition in still images [1, 11] outperforming body pose based methods [21] or grouplet models [40] on their datasets [11]. [sent-58, score-0.913]
31 Our image representation is defined by the max-pooling of interaction responses over the whole image, solved efficiently by the distance transform. [sent-62, score-0.424]
32 The work in [29], however, does not attempt to model people, body parts and their interactions with objects. [sent-64, score-0.773]
33 Related work also includes models of contextual spatial and co-occurrence relationships between objects [12, 32] as well as objects and the scene [22, 23, 35]. [sent-65, score-0.333]
34 Object part detectors trained from labelled data also form a key ingredient of attribute-based object representations [15, 26]. [sent-66, score-0.606]
35 While we build on this body of work, these approaches do not model interactions of people and their body parts with objects and focus on object/scene recognition rather than recognition of human actions. [sent-67, score-1.527]
36 3 Representing person-object interactions. This section describes our image representation in terms of body parts, objects and interactions among them. [sent-68, score-1.16]
37 3.1 Representing body parts and objects. We assume we have a set of n available detectors d1, . . . , dn [sent-70, score-0.867]
38 which have been pre-trained for different body parts and object classes. [sent-73, score-0.697]
39 We express the positions of detections p in terms of scale-space coordinates p = (x, y, σ̃), where (x, y) corresponds to the spatial location and σ̃ = log σ is an additive scale parameter, log-related to the image scale factor σ, making addition in the position vector space meaningful. [sent-75, score-0.429]
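For illustration, this scale-space encoding can be written down directly. A minimal Python sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def scale_space_position(x, y, scale):
    """Map a detection at pixel (x, y) found at image scale factor `scale`
    to scale-space coordinates p = (x, y, sigma~), with sigma~ = log(scale),
    so that offsets in all three coordinates add meaningfully."""
    return np.array([x, y, np.log(scale)])

# A relative displacement between two detections is then a plain difference:
p1 = scale_space_position(120, 80, 1.0)
p2 = scale_space_position(140, 150, 2.0)
u = p2 - p1   # (dx, dy, d_sigma~) = (20, 70, log 2)
```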
40 For objects we use the LSVM detector [17], trained on PASCAL VOC images for ten object classes (see footnote 1). [sent-77, score-0.505]
41 For body parts we implement the method of [25] and train ten body part detectors (see footnote 2) for each of sixteen pose clusters, giving 160 body part detectors in total (see [25] for further details). [sent-78, score-1.884]
42 Both detector types use Histograms of Oriented Gradients (HOG) [10] as the underlying low-level image representation. [sent-79, score-0.359]
43 Footnote 1: the ten object detectors correspond to the object classes bicycle, car, chair, cow, dining table, horse, motorbike, person, sofa, and tv/monitor. Footnote 2: the ten body part detectors correspond to head, torso, and {left, right} × {forearm, upper arm, lower leg, thigh}. [sent-80, score-1.516]
44 3.2 Representing pairwise interactions. We define interactions by the pairs of detectors (di, dj) as well as by the spatial and scale relations among them. [sent-81, score-1.081]
45 Each pair of detectors constitutes a two-node tree, where the position and scale of the leaf are related to the root by a scale-space offset and a spatial deformation cost. [sent-82, score-0.595]
46 Figure 1 illustrates an example of an interaction between a horse and the left thigh for the horse riding action. [sent-84, score-0.862]
47 We measure the response of the interaction q located at the root position p1 by r(I, q, p1) = max_{p2} [ di(I, p1) + dj(I, p2) − uᵀCu ]   (1), where u = p2 − (p1 + v) is the displacement vector corresponding to the drift of the leaf node with respect to its expected position (p1 + v). [sent-85, score-0.617]
48 For any interaction q we compute its responses for all pairs of node positions p1 , p2 . [sent-87, score-0.428]
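A brute-force reading of eq. (1) is easy to sketch; the paper solves the inner maximization over p2 efficiently with a distance transform, but the naive version below makes the computed quantity explicit (array shapes and argument names are our assumptions):

```python
import numpy as np

def interaction_response(di_scores, dj_scores, positions, p1_idx, v, C):
    """Brute-force evaluation of eq. (1):
        r(I, q, p1) = max_{p2} [ d_i(I, p1) + d_j(I, p2) - u^T C u ],
    with u = p2 - (p1 + v) the drift of the leaf from its expected position.

    di_scores, dj_scores : (N,) detector responses at the N candidate positions
    positions            : (N, 3) scale-space coordinates (x, y, log sigma)
    p1_idx               : index of the root position p1 in `positions`
    v                    : (3,) expected scale-space offset of the leaf
    C                    : (3, 3) positive-definite deformation cost matrix
    """
    p1 = positions[p1_idx]
    u = positions - (p1 + v)                      # drift vector for every p2
    penalty = np.einsum('ni,ij,nj->n', u, C, u)   # u^T C u, one value per p2
    return di_scores[p1_idx] + np.max(dj_scores - penalty)
```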
49 3.3 Representing images by response vectors of pair-wise interactions. Given a set of M interaction pairs q1, · · · , qM, we wish to aggregate their responses (1) over an image region A. [sent-90, score-0.845]
50 Here A can be (i) an (extended) person bounding box, as used for selecting discriminative interaction features (Section 4.2). [sent-91, score-0.575]
51 We define the score s(I, q, A) of an interaction pair q within a region A of an image I by max-pooling, i.e. s(I, q, A) = max_{p∈A} r(I, q, p)   (2). [sent-94, score-0.346]
52 An image region A is then represented by an M-vector of interaction pair scores z = (s1, · · · , sM), with si = s(I, qi, A)   (3). [sent-97, score-0.346]
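In code, eqs. (2) and (3) reduce to a max-pool of each interaction's response map over the region A, followed by stacking. A sketch under the same assumptions as above:

```python
import numpy as np

def region_score(response_map, mask):
    """Eq. (2): max-pool one interaction's responses over a region A.
    response_map : (H, W) responses r(I, q, p) on the image grid
    mask         : (H, W) boolean, True inside the region A
    """
    return response_map[mask].max()

def region_descriptor(response_maps, mask):
    """Eq. (3): represent region A by the M-vector z = (s_1, ..., s_M)."""
    return np.array([region_score(r, mask) for r in response_maps])
```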
53 4 Learning person-object interactions. Given the object and body part interaction pairs q introduced in the previous section, we wish to use them for action classification in still images. [sent-98, score-1.492]
54 A brute-force approach of analyzing all possible interactions, however, is computationally prohibitive since the space of all possible interactions is combinatorial in the number of detectors and scale-space relations among them. [sent-99, score-0.559]
55 To address this problem, we aim in this paper to select a set of M action-specific interaction pairs q1, . . . , qM [sent-100, score-0.327]
56 which are both representative and discriminative for a given action class. [sent-103, score-0.318]
57 First, for each action we generate a large pool of candidate interactions, each comprising a pair of (body part / object) detectors and their relative scale-space displacement. [sent-105, score-0.722]
58 This step is data-driven and selects candidate detection pairs which frequently occur for a particular action in a consistent relative scale-space configuration. [sent-106, score-0.414]
59 Next, from this initial pool of candidate interactions we select a set of M discriminative interactions which best separate the particular action class from other classes in our training set. [sent-107, score-1.03]
60 Finally, the discriminative interactions are combined across classes and used as interaction features in our final non-linear spatial-pyramid like SVM classifier. [sent-109, score-0.695]
61 4.1 Generating a candidate pool of interaction pairs. To initialize our model, we first generate a large pool of candidate interactions in a data-driven manner. [sent-112, score-0.858]
62 Following the suggestion in [17] that the precise choice of the deformation cost C may not be critical, we set C to a reasonable fixed value for all pairs and focus on finding clusters of frequently co-occurring detectors (di, dj) in specific relative configurations. [sent-113, score-0.398]
63 For each detector i and an image I, we first collect a set of positions of all positive detector responses PI = {p | di (I, p) > 0}, where di (I, p) is the response of detector i at position p in image I. [sent-114, score-0.764]
64 For each pair of detectors (di, dj) we then gather the relative displacements between their detections over all training images Ik: Dij = ∪k { pj − pi | pi ∈ PIk for detector i and pj ∈ PIk for detector j }. [sent-117, score-0.666]
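The pool-generation step can be sketched as follows: gather all relative scale-space displacements for a detector pair across the training images, then keep the frequently occurring modes. We use scikit-learn's MeanShift with displacements pre-scaled by the per-dimension radius R as a stand-in for the paper's clustering; the fraction threshold plays the role of η (names and defaults are ours):

```python
import numpy as np
from sklearn.cluster import MeanShift

def candidate_offsets(detections_i, detections_j, radius, min_cluster_frac=0.08):
    """detections_i, detections_j : lists (one per training image) of (k, 3)
    arrays of positive-detection positions for detectors d_i and d_j.
    radius : (3,) per-dimension mean-shift radius R = (Rx, Ry, Rsigma).
    Returns cluster centers, i.e. candidate scale-space offsets v."""
    radius = np.asarray(radius, dtype=float)
    disp = []
    for pi, pj in zip(detections_i, detections_j):
        if len(pi) and len(pj):
            # all pairwise displacements p_j - p_i within this image
            disp.append((pj[None, :, :] - pi[:, None, :]).reshape(-1, 3))
    disp = np.vstack(disp) / radius            # scale so bandwidth is isotropic
    ms = MeanShift(bandwidth=1.0).fit(disp)
    counts = np.bincount(ms.labels_)
    keep = counts >= min_cluster_frac * len(disp)  # keep frequent modes only
    return ms.cluster_centers_[keep] * radius      # undo the pre-scaling
```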
65 4.2 Discriminative selection of interaction pairs. The initialization described above produces a large number of candidate interactions. [sent-123, score-0.396]
66 Selecting the M interaction pairs corresponding to the non-zero elements of w gives the M most discriminative (according to (4)) interaction pairs per action class. [sent-129, score-0.972]
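This selection can be sketched with an L1-regularized linear SVM, keeping the candidate interactions whose learned weights are non-zero (the regularization strength below is illustrative, not the paper's setting):

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_interactions(Z, y, C=0.1):
    """Z : (n_samples, n_candidates) pooled interaction scores, eq. (2)
    y : (n_samples,) +1 for the target action class, -1 otherwise.
    Returns indices of candidates with non-zero weight in the sparse SVM."""
    svm = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=C)
    svm.fit(Z, y)
    w = svm.coef_.ravel()
    return np.flatnonzero(np.abs(w) > 1e-8)   # the selected interaction pairs
```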
67 4.3 Using interaction pairs for classification. Given a set of M discriminative interactions for each action class, obtained as described above, we wish to train a final non-linear action classifier. [sent-133, score-1.132]
68 We use a spatial pyramid-like representation [28], aggregating responses in each cell of the pyramid using max-pooling as described by eq. (2). [sent-134, score-0.34]
69 The discriminative selection step (Section 4.2) is necessary in this case, as applying the non-linear spatial pyramid classifier to the entire pool of candidate interactions would be computationally infeasible. [sent-141, score-0.635]
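The final per-image representation can then be sketched as per-cell max-pooling of the selected interactions' response maps over a spatial pyramid; the concatenated vectors feed the non-linear SVM. Grid sizes below are illustrative:

```python
import numpy as np

def spatial_pyramid_descriptor(response_maps, levels=(1, 2, 4)):
    """Max-pool each (H, W) response map inside every cell of a spatial
    pyramid and concatenate: one feature per (map, pyramid cell)."""
    H, W = response_maps[0].shape
    feats = []
    for n in levels:                        # an n x n grid at each level
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for r in response_maps:
            for i in range(n):
                for j in range(n):
                    feats.append(r[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max())
    return np.array(feats)
```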
70 The Willow-action dataset contains more than 900 images with more than 1100 labelled person detections from 7 human action classes: Interacting with Computer, Photographing, Playing Music, Riding Bike, Riding Horse, Running and Walking. [sent-143, score-0.625]
71 Each training and testing image in both datasets is annotated with the smallest bounding box containing each person and by the performed action(s). [sent-147, score-0.319]
72 In the human action training/test data, we extend each given person bounding box by 50% and resize the image so that the bounding box has a maximum size of 300 pixels. [sent-150, score-0.768]
73 We run the detectors over the transformed bounding boxes and consider the image scales sk = 2^(k/10) for k ∈ {−10, · · · , 10}. [sent-151, score-0.428]
74 At each scale we extract the detector response every 4 pixels and 8 pixels for the body part and object detectors, respectively. [sent-152, score-0.821]
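For reference, the scale set and sampling steps just described are trivial to enumerate:

```python
import numpy as np

scales = 2.0 ** (np.arange(-10, 11) / 10.0)   # s_k = 2^(k/10), k = -10, ..., 10
stride = {'body_part': 4, 'object': 8}        # response sampling step in pixels
```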
75 We generate the candidate interaction pairs by taking the mean-shift radius R = (30, 30, log(2)/2), L = 3 and η = 8%. [sent-154, score-0.396]
76 We select M = 310 discriminative interaction pairs to compute the final spatial pyramid representation of each image. [sent-156, score-0.704]
77 (BOF) is the bag-of-features classifier [11], aggregating quantized responses of densely sampled HOG features in a spatial pyramid representation, using a (non-linear) intersection kernel. [sent-160, score-0.374]
78 To obtain a single classification score for each person bounding box, we take the maximum LSVM detection score among the detections overlapping the extended bounding box with the standard overlap score [14] higher than 0.5. [sent-164, score-0.479]
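Turning per-image detections into a single per-person score can be sketched with the standard intersection-over-union overlap (function names are ours):

```python
import numpy as np

def iou(a, b):
    """Standard overlap score of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def person_score(person_box, det_boxes, det_scores, thresh=0.5):
    """Max LSVM detection score among detections whose overlap with the
    (extended) person bounding box exceeds `thresh`."""
    scores = [s for b, s in zip(det_boxes, det_scores)
              if iou(person_box, b) > thresh]
    return max(scores) if scores else -np.inf
```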
79 (Detectors) is an SVM classifier with an RBF kernel, trained on max-pooled responses of the entire bank of body part and object detectors in a spatial pyramid representation, but without interactions. [sent-167, score-1.407]
80 This baseline is similar in spirit to the object bank representation [29], but here targeted to action classification by including a bank of pose-specific body part detectors as well as object detectors. [sent-168, score-1.512]
81 The largest improvements are obtained on the Riding Bike and Riding Horse actions, for which reliable object detectors are available. [sent-171, score-0.483]
82 with respect to using the plain bank of object and body part detectors c. [sent-173, score-1.012]
83 Example detections of interaction pairs are shown in Figure 2. [sent-175, score-0.446]
84 6 Conclusion. We have developed person-object interaction features based on non-rigid relative scale-space displacements of pairs of body part and object detectors. [sent-187, score-1.129]
85 Further, we have shown that such features can be learnt in a discriminative fashion and can improve action classification performance over a strong bag-of-features baseline in challenging realistic images of common human actions. [sent-188, score-0.531]
86 In addition, the learnt interaction features in some cases correspond to visually meaningful configurations of body parts, or of body parts with objects. [sent-189, score-1.189]
87 Figure 2: Example detections of discriminative interaction pairs. [sent-202, score-0.466]
88 These body part interaction pairs are chosen as discriminative (high positive weight wi ) for action classes indicated on the left. [sent-203, score-1.158]
89 In each row, the first three images show detections on the correct action class. [sent-204, score-0.392]
90 The last image shows a high scoring detection on an incorrect action class. [sent-205, score-0.334]
91 In the examples shown, the interaction features capture either a body part and an object, or two body parts. [sent-206, score-1.238]
92 Note that while these interaction pairs are found to be discriminative, due to detection noise they do not necessarily localize the correct body parts in all images. [sent-207, score-0.872]
93 However, they may still fire at consistent locations across many images as illustrated in the second row, where the head detector consistently detects the camera lens, and the thigh detector fires consistently at the edge of the head. [sent-208, score-0.475]
94 Similarly, the leg detector seems to consistently fire on keyboards (see the third image in the first row for an example), thus improving the confidence of the computer detections for the "Interacting with computer" action. [sent-209, score-0.374]
95 We use only the small set of object detectors available at [2]; however, we are now in a position to include many additional object (camera, computer, laptop) or texture (grass, road, trees) detectors, trained on additional datasets such as ImageNet or LabelMe. [sent-294, score-0.789]
96 Currently, we consider detections of entire objects, but the proposed model can be easily extended to represent interactions between body parts and parts of objects [8]. [sent-295, score-1.064]
97 Poselets: Body part detectors trained using 3D human pose annotations. [sent-332, score-0.664]
98 Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. [sent-472, score-0.351]
99 Grouplet: A structured image representation for recognizing human and object interactions. [sent-543, score-0.496]
100 Modeling mutual context of object and human pose in human-object interaction activities. [sent-548, score-0.697]
wordName wordTfidf (topN-words)
[('body', 0.411), ('detectors', 0.284), ('interactions', 0.275), ('interaction', 0.241), ('action', 0.212), ('object', 0.199), ('riding', 0.197), ('horse', 0.156), ('pose', 0.144), ('person', 0.12), ('detections', 0.119), ('pyramid', 0.118), ('spatial', 0.114), ('human', 0.113), ('thigh', 0.112), ('cyan', 0.107), ('discriminative', 0.106), ('lsvm', 0.105), ('detector', 0.105), ('bike', 0.093), ('bof', 0.093), ('cvpr', 0.091), ('parts', 0.087), ('pairs', 0.086), ('objects', 0.085), ('displacement', 0.085), ('voc', 0.078), ('leg', 0.075), ('actions', 0.075), ('forearm', 0.075), ('orderless', 0.075), ('image', 0.075), ('pascal', 0.071), ('recognizing', 0.07), ('bounding', 0.069), ('responses', 0.069), ('candidate', 0.069), ('part', 0.068), ('deformation', 0.067), ('svm', 0.061), ('images', 0.061), ('fei', 0.06), ('pool', 0.059), ('photographing', 0.056), ('walking', 0.056), ('trained', 0.055), ('box', 0.055), ('di', 0.054), ('classi', 0.053), ('position', 0.052), ('bank', 0.05), ('people', 0.049), ('scene', 0.049), ('leaf', 0.048), ('recognition', 0.048), ('discriminatively', 0.047), ('dj', 0.047), ('detection', 0.047), ('head', 0.046), ('camera', 0.046), ('desai', 0.045), ('motorbike', 0.045), ('clothing', 0.045), ('laptev', 0.045), ('bourdev', 0.045), ('maji', 0.045), ('poselet', 0.045), ('yao', 0.045), ('pj', 0.045), ('iccv', 0.043), ('hog', 0.042), ('gurations', 0.041), ('pi', 0.04), ('consumer', 0.04), ('photographs', 0.04), ('features', 0.039), ('representation', 0.039), ('response', 0.038), ('appearance', 0.038), ('delaitre', 0.037), ('dining', 0.037), ('grouplet', 0.037), ('phoning', 0.037), ('rocquencourt', 0.037), ('scalespace', 0.037), ('inducing', 0.037), ('playing', 0.037), ('blue', 0.036), ('er', 0.036), ('quantized', 0.034), ('con', 0.034), ('classes', 0.034), ('pik', 0.033), ('positions', 0.032), ('poses', 0.031), ('locally', 0.031), ('poselets', 0.03), ('pair', 0.03), ('representing', 0.03), ('paris', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.28806949 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.18409877 193 nips-2011-Object Detection with Grammar Models
Author: Ross B. Girshick, Pedro F. Felzenszwalb, David A. McAllester
Abstract: Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data. 1
4 0.18068655 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
5 0.17686203 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1
6 0.14539689 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
7 0.13094431 180 nips-2011-Multiple Instance Filtering
8 0.12955272 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
9 0.12713814 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
10 0.12188654 168 nips-2011-Maximum Margin Multi-Instance Learning
11 0.1205872 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
12 0.10794964 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
13 0.10780519 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
14 0.10748513 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
15 0.10595272 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
16 0.10148174 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
17 0.095516749 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
18 0.087728314 276 nips-2011-Structured sparse coding via lateral inhibition
19 0.087496631 165 nips-2011-Matrix Completion for Multi-label Image Classification
20 0.086501114 231 nips-2011-Randomized Algorithms for Comparison-based Search
topicId topicWeight
[(0, 0.216), (1, 0.126), (2, -0.134), (3, 0.284), (4, 0.12), (5, 0.086), (6, 0.041), (7, -0.015), (8, 0.056), (9, 0.141), (10, 0.028), (11, -0.034), (12, -0.075), (13, 0.123), (14, 0.011), (15, -0.016), (16, 0.054), (17, 0.022), (18, -0.048), (19, 0.041), (20, -0.005), (21, 0.051), (22, -0.088), (23, 0.104), (24, 0.142), (25, -0.028), (26, 0.003), (27, 0.079), (28, -0.024), (29, 0.029), (30, -0.027), (31, 0.086), (32, 0.008), (33, -0.005), (34, -0.023), (35, -0.005), (36, 0.035), (37, -0.078), (38, -0.106), (39, -0.02), (40, 0.015), (41, -0.053), (42, 0.038), (43, -0.055), (44, 0.034), (45, 0.007), (46, -0.028), (47, -0.045), (48, -0.01), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9782958 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.81914735 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.80029815 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1
4 0.78551286 193 nips-2011-Object Detection with Grammar Models
Author: Ross B. Girshick, Pedro F. Felzenszwalb, David A. McAllester
Abstract: Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data. 1
5 0.7746436 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout
Author: Andreas Geiger, Christian Wojek, Raquel Urtasun
Abstract: We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation. 1
6 0.73666102 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
7 0.72984105 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
8 0.68914622 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
9 0.68820083 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
10 0.63156152 293 nips-2011-Understanding the Intrinsic Memorability of Images
11 0.62807941 127 nips-2011-Image Parsing with Stochastic Scene Grammar
12 0.58457887 180 nips-2011-Multiple Instance Filtering
13 0.58118564 35 nips-2011-An ideal observer model for identifying the reference frame of objects
14 0.57028198 275 nips-2011-Structured Learning for Cell Tracking
15 0.56619883 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
16 0.55480331 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
17 0.52184886 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
18 0.52160621 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
19 0.47190711 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
20 0.44059607 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
topicId topicWeight
[(0, 0.012), (4, 0.112), (20, 0.098), (26, 0.02), (31, 0.061), (33, 0.077), (43, 0.082), (45, 0.103), (57, 0.041), (59, 0.184), (65, 0.012), (74, 0.037), (83, 0.018), (84, 0.013), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.84746206 154 nips-2011-Learning person-object interactions for action recognition in still images
Author: Vincent Delaitre, Josef Sivic, Ivan Laptev
Abstract: We investigate a discriminatively trained model of person-object interactions for recognizing common human actions in still images. We build on the locally order-less spatial pyramid bag-of-features model, which was shown to perform extremely well on a range of object, scene and human action recognition tasks. We introduce three principal contributions. First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors. Second, we introduce new person-object interaction features based on spatial co-occurrences of individual body parts and objects. Third, we address the combinatorial problem of a large number of possible interaction pairs and propose a discriminative selection procedure using a linear support vector machine (SVM) with a sparsity inducing regularizer. Learning of action-specific body part and object interactions bypasses the difficult problem of estimating the complete human body pose configuration. Benefits of the proposed model are shown on human action recognition in consumer photographs, outperforming the strong bag-of-features baseline. 1
2 0.747935 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
3 0.73091304 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
Author: Daniel J. Lizotte
Abstract: Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, “Expansion-Constrained Ordinary Least Squares” (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the ∞-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of “averagers,” we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties. 1
4 0.72510916 127 nips-2011-Image Parsing with Stochastic Scene Grammar
Author: Yibiao Zhao, Song-chun Zhu
Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1
5 0.71373057 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
6 0.70195699 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
7 0.69910151 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
8 0.69900668 193 nips-2011-Object Detection with Grammar Models
9 0.69650131 227 nips-2011-Pylon Model for Semantic Segmentation
10 0.69174075 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
11 0.68977797 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
12 0.68895084 168 nips-2011-Maximum Margin Multi-Instance Learning
13 0.68659133 139 nips-2011-Kernel Bayes' Rule
14 0.6840874 213 nips-2011-Phase transition in the family of p-resistances
15 0.67935872 303 nips-2011-Video Annotation and Tracking with Active Learning
16 0.67382669 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
17 0.67315227 180 nips-2011-Multiple Instance Filtering
18 0.67222714 231 nips-2011-Randomized Algorithms for Comparison-based Search
19 0.66387987 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
20 0.66229719 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm