iccv iccv2013 iccv2013-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bangpeng Yao, Jiayuan Ma, Li Fei-Fei
Abstract: Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Object functionality refers to the quality of an object that allows humans to perform some specific actions. [sent-3, score-0.762]
2 It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. [sent-4, score-0.562]
3 In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. [sent-5, score-1.119]
4 Each object functionality is represented by a specific type of human-object interaction. [sent-7, score-0.536]
5 Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. [sent-8, score-0.353]
6 Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach. [sent-9, score-0.633]
7 One view asserts that humans perceive objects by their physical qualities, such as color, shape, size, rigidity, etc. [sent-13, score-0.296]
8 Another idea was proposed by Gibson [15], who suggested that humans perceive objects by looking at their affordances. [sent-14, score-0.296]
9 According to Gibson and his colleagues [14, 2], affordance refers to the quality of an object or an environment which allows humans to perform some specific actions. [sent-15, score-0.521]
10 There are subtle differences between affordance and functionality in psychology. [sent-20, score-0.656]
11 On the one hand, observing the functionality of an object (e.g. [sent-25, score-0.556]
12 how humans interact with it) provides a strong cue for us to recognize the category of the object. [sent-27, score-0.319]
13 On the other hand, inferring object functionality itself is an interesting and useful task. [sent-28, score-0.562]
14 For example, one of the end goals in robotic vision is not to simply tell a robot “this is a violin”, but to teach the robot how to make use of the functionality of the violin - how to play it. [sent-29, score-0.694]
15 Further, learning object functionality also potentially facilitates other tasks in computer vision (e. [sent-30, score-0.561]
16 In this work, our goal is to discover object functionality from weakly labeled images. [sent-35, score-0.667]
17 Given an object, there might be many ways for a human to interact with it, as shown in Fig. [sent-36, score-0.283]
18 As we will show in our experiments, these interactions provide us with some knowledge about the object. [sent-38, score-0.263]
19 Some interactions correspond to the typical functionality of the object while others do not. [sent-40, score-0.722]
20 Furthermore, while inferring these types of interactions, our method also builds a model tailored to object detection and pose estimation for each specific interaction. [sent-42, score-0.372]
21 Using violin as an example, given a set of images of human-violin interactions, we discover different types of human-violin interactions by first estimating human poses and detecting objects, and then clustering the images based on their pairwise distances in terms of human-object interactions. [sent-44, score-0.852]
22 The clustering result can then be used to update the model of human pose estimation and object detection, and hence human-violin interaction. [sent-45, score-0.43]
23 We only need a general human pose estimator and a weak object detector trained from a small subset of training images, which will be updated by our iterative model. [sent-50, score-0.395]
24 Unconstrained human poses: Rather than being limited to a small set of pre-defined poses such as sitting and reaching [16, 13], our method does not have any constraint on human poses. [sent-51, score-0.332]
25 3) because of different camera angles from which the images are taken, we convert 2D human poses to 3D and then measure the similarity between different images. [sent-54, score-0.278]
26 Aiming for details: The functionality we learn refers to the details of human-object interactions, e.g. [sent-56, score-0.502]
27 This makes our work different from most previous functionality work which mainly focuses on object detection [20, 25]. [sent-59, score-0.581]
28 Sec. 3 elaborates on our approach to weakly supervised functionality discovery. [sent-63, score-0.559]
29 Recently, functionality has been used to detect objects [17, 20], where human gestures are recognized and treated as a cue to identify objects. [sent-70, score-0.663]
30 In [16], 3D information is deployed such that one can recognize object affordance even when humans are not observed in test images. [sent-71, score-0.478]
31 Such approaches assume that an object has the same functionality across all images, while our method attempts to infer object functionality given that humans might interact with the same object in many ways. [sent-72, score-1.478]
32 Specifically, because of the advances in human detection [3, 10] and human pose estimation [1, 30], humans are frequently used as cues for other tasks, such as object detection (details below) and scene reconstruction [19, 4, 13]. [sent-75, score-0.808]
33 In this paper, we use human poses as context to discover functionalities of objects. [sent-77, score-0.309]
34 Our method relies on modeling the interactions between humans and objects. [sent-79, score-0.39]
35 Most such approaches first estimate human poses [1, 30] and detect objects [10], and then model human-object spatial relationships to improve action recognition performance [5, 18, 33]. [sent-80, score-0.291]
36 While those approaches usually require detailed annotations on training data, a weakly supervised approach is adopted in [25] to infer the spatial relationship between humans and objects. [sent-82, score-0.328]
37 While our method also uses weak supervision to learn the spatial relationship between humans and objects, it takes into account that humans can interact with the same object in different ways, which correspond to different semantic meanings. [sent-83, score-0.6]
38 In this work, we cluster human action images in 3D, where the clustering results are more consistent with human perception than those from 2D. [sent-88, score-0.459]
39 Given a set of images of human-violin interactions, our goal is to figure out what the groups of interactions between a human and a violin are, and to output a model for this action. [sent-96, score-0.34]
40 Different interactions, such as playing a violin or using a violin as a weapon, correspond to different object functionalities. [sent-102, score-0.624]
41 Our goal is to discover those interactions from weakly supervised data. [sent-103, score-0.351]
42 Given a set of images of humans interacting with a certain object and an initial model of object detection and pose estimation, we propose an iterative approach to discover different types of human-object interactions and obtain a model tailored to each interaction. [sent-106, score-0.919]
43 On the one hand, given a model of object functionality, we detect the object, estimate the human pose, convert 2D key points to 3D, and then measure the distance between each pair of images (Sec. [sent-107, score-0.302]
44 On the other hand, given the clustering results, both the object detectors and human pose estimators can be updated so that the original model can be tailored to specific cases of object functionality (Sec. [sent-113, score-1.074]
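A minimal sketch of this alternation in Python; every callable passed in (detect, lift_to_3d, pair_distance, cluster, retrain) is a hypothetical stand-in for the components described above, not the authors' implementation, and detect(model, image) is assumed to return a dict with at least 'score' and 'pose2d' fields.

import numpy as np

def discover_functionality(images, initial_model, detect, lift_to_3d,
                           pair_distance, cluster, retrain, n_iters=5):
    # Alternate between (a) detection / pose estimation / 3D lifting / pairwise
    # distances and (b) clustering + per-cluster model re-training.
    models = [initial_model]                      # one generic model to start
    labels = np.zeros(len(images), dtype=int)
    for _ in range(n_iters):
        # (a) run all current models on each image and keep the best-scoring result
        results = []
        for img in images:
            best = max((detect(m, img) for m in models), key=lambda r: r["score"])
            best["pose3d"] = lift_to_3d(best["pose2d"])   # 2D key points -> 3D
            results.append(best)
        # pairwise distances between 3D human-object configurations
        n = len(results)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = pair_distance(results[i], results[j])
        # (b) cluster the interactions and re-train one tailored model per cluster,
        #     treating the current detections and poses as pseudo ground truth
        labels = np.asarray(cluster(D))
        models = [retrain([results[i] for i in np.flatnonzero(labels == c)])
                  for c in np.unique(labels)]
    return labels, models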
45 Pairwise distance of human-object interactions: To reduce the semantic gap between human poses and 2D image representation (shown in Fig. [sent-118, score-0.407]
46 First, the 2D locations and orientations of objects and human body parts are obtained using off-the-shelf object detectors [10] and human pose estimation [30] approaches. [sent-122, score-0.78]
47 We first detect objects and estimate human poses in each image, and then convert the key point coordinates to 3D and measure image similarity. [sent-125, score-0.331]
48 We use the flexible mixture-of-parts [30] approach for 2D human pose estimation. [sent-135, score-0.275]
49 Because of camera angle changes, the same human pose might lead to very different 2D configurations of human body parts, as shown in Fig. [sent-142, score-0.505]
50 Therefore, we use a data-driven approach to reconstruct 3D coordinates of human body parts from the result of 2D pose estimation [27]. [sent-144, score-0.435]
51 For the 3D locations of detected objects, we search the nearest body parts in 2D space, and average their 3D locations as the locations of objects in 3D space. [sent-148, score-0.262]
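A minimal sketch of this object-placement step, assuming joints_2d (J x 2) comes from the 2D pose estimator and joints_3d (J x 3) from the data-driven 3D reconstruction; averaging the k = 2 nearest parts is an illustrative choice, as the extract does not state how many parts are used.

import numpy as np

def object_location_3d(obj_xy, joints_2d, joints_3d, k=2):
    # Find the k body joints nearest to the object in the 2D image and
    # average their reconstructed 3D positions to place the object in 3D.
    d = np.linalg.norm(np.asarray(joints_2d, float) - np.asarray(obj_xy, float), axis=1)
    nearest = np.argsort(d)[:k]                   # indices of the closest joints in 2D
    return np.asarray(joints_3d, float)[nearest].mean(axis=0)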
52 It has been shown that pose features perform substantially better than low-level features in measuring human pose similarities [12]. [sent-150, score-0.419]
53 Following this idea and inspired by [32], we measure the distance of two human poses by rotating one 3D pose to match the other, and then consider the point-wise distance of the rotated human poses. [sent-151, score-0.516]
54 We further incorporate the object in our similarity measure by adding the object as one more point in M and assuming that the depth of the object is the same as the hand that is closest to the object. [sent-158, score-0.256]
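A minimal sketch of this pairwise measure, reading the rotation matching as a Kabsch-style alignment over centered point sets (the paper follows [32], whose exact procedure may differ); interaction_points shows how the object could be appended as one extra point whose depth is copied from the nearest hand, assuming the reconstructed x-y coordinates stay roughly aligned with the image plane.

import numpy as np

def align_and_distance(P, Q):
    # Rotate point set P (N x 3) to best match Q after centering both
    # (Kabsch algorithm), then return the mean point-wise distance.
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.linalg.norm(Pc @ R.T - Qc, axis=1).mean()

def interaction_points(joints_3d, obj_xy, hand_idx):
    # Append the object as one extra 3D point; its depth is copied from the
    # hand joint closest to it in the image plane (hand_idx indexes the hands).
    hands = joints_3d[hand_idx]
    closest = hands[np.argmin(np.linalg.norm(hands[:, :2] - obj_xy, axis=1))]
    obj_3d = np.array([obj_xy[0], obj_xy[1], closest[2]])
    return np.vstack([joints_3d, obj_3d])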
55 Clustering based on pairwise distance: The goal here is to cluster the given images so that each cluster corresponds to one human-object interaction, as shown in Fig. [sent-161, score-0.255]
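The extract does not name the clustering algorithm itself, so the sketch below uses average-linkage agglomerative clustering on the precomputed pairwise-distance matrix purely as a plausible stand-in.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_interactions(D, n_clusters):
    # Group images from a symmetric pairwise-distance matrix D into
    # n_clusters groups, each intended to capture one human-object interaction.
    condensed = squareform(np.asarray(D, float), checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")   # labels in 1..n_clusters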
56 Updating the object functionality model: In each iteration, we update the model of object detection and pose estimation for each cluster of human-object interaction. [sent-198, score-0.919]
57 In each cluster, we re-train the models by using object detection and pose estimation results from this iteration as “ground-truth”. [sent-199, score-0.297]
58 In the step of object detection and pose estimation in the next iteration, we apply all the models from different clusters, and choose the one with the largest score of object detection and pose estimation. [sent-201, score-0.61]
59 For performance evaluation, we need a dataset that contains different interactions between humans and each object. [sent-205, score-0.39]
60 The People Playing Musical Instrument (PPMI) dataset [31] contains images of people interacting with twelve different musical instruments: bassoon, cello, clarinet, erhu, flute, French horn, guitar, harp, recorder, saxophone, trumpet, and violin. [sent-206, score-0.402]
61 For each instrument, there are images of people playing the instrument (PPMI+) as well as images of people holding the instrument with different pose, but not performing the playing action (PPMI-). [sent-207, score-1.418]
62 We use the normalized training images to train our models, where there are 100 PPMI+ images and 100 PPMI- images for each musical instrument. [sent-208, score-0.329]
63 For each instrument, our goal is to cluster the images based on different types of human-object interactions, and obtain a model of object detection and pose estimation for each cluster. [sent-209, score-0.406]
64 Ideally, images of humans playing the instruments should be grouped in the same cluster. [sent-210, score-0.686]
65 To begin with, we randomly select 10 images from each instrument and annotate the key point locations of human body parts as [sent-211, score-0.687]
[Table 1 header: Instrument | Object detection (Baseline, Ours) | Pose estimation (Baseline, Ours)]
66 well as object bounding boxes, and train a detector [10] for each musical instrument and a general human pose estimator [30]. [sent-215, score-1.01]
67 The object detectors and human pose estimator will be updated during our model learning process. [sent-216, score-0.432]
68 Table 1 shows the results of object detection and pose estimation. [sent-218, score-0.266]
69 For each musical instrument, we apply the “final” object detectors and pose estimators obtained from our method to the test PPMI images. [sent-219, score-0.571]
70 We compare our method with the initial baseline models that are trained for all musical instruments. [sent-221, score-0.26]
71 For human pose estimation, a body part is considered correctly localized if the end points of its segment lie within 50% of the ground-truth segment length [12]. [sent-224, score-0.34]
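A minimal sketch of this localization criterion (a PCP-style check), where pred_seg and gt_seg each hold the two end points of a body-part segment as (x, y) pairs.

import numpy as np

def part_correct(pred_seg, gt_seg, thresh=0.5):
    # A part counts as correct if both predicted end points lie within
    # thresh (50%) of the ground-truth segment length from their
    # ground-truth counterparts.
    pred, gt = np.asarray(pred_seg, float), np.asarray(gt_seg, float)
    limit = thresh * np.linalg.norm(gt[1] - gt[0])
    return bool(np.all(np.linalg.norm(pred - gt, axis=1) <= limit))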
72 This demonstrates the effectiveness of iteratively updating pose estimators and object detectors. [sent-226, score-0.274]
73 The PPMI dataset provides ground-truth labels indicating which images contain people playing the instrument (PPMI+) and which contain people only holding the instrument but not playing it (PPMI-). [sent-234, score-1.377]
74 For each instrument, ideally, there exists a big cluster of humans playing the instrument, and many other clusters of humans holding the instruments but not playing. [sent-245, score-1.096]
75 Fig. 6 visualizes the average distribution of the number of images per cluster across all musical instruments. [sent-249, score-0.392]
76 In the other baseline, we cluster images based on 2D positions of keypoints of objects and human poses without converting them to 3D. [sent-252, score-0.359]
77 For these two methods, we also choose the number of clusters on each instrument such that the number of images in the largest cluster is as close to 100 as possible. [sent-253, score-0.583]
78 For each instrument, we assume only the largest cluster contains images of people playing the instrument. [sent-254, score-0.431]
79 The reason might be errors in 2D pose estimation and the lack of accurate pose matching caused by camera angle changes. [sent-258, score-0.353]
80 This is due to the large size of French horns, and to the fact that the human poses and the human-object spatial relationships are very similar in images of people playing a French horn and people holding one without playing it. [sent-261, score-0.84]
81 On the instruments such as flute and trumpet, we are able to separate PPMI+ images from the others with high accuracy, because of the unique human poses and human-object spatial relationships on PPMI+ images. [sent-267, score-0.574]
82 Comparison of our functionality discovery method with approaches based on low-level features or 2D key point locations. [sent-269, score-0.481]
83 The poor clustering performance on French horn can also be explained from this figure, where the spatial relationship between humans and French horns is very similar in images of all types of interactions. [sent-306, score-0.429]
84 Fig. 9 visualizes the heatmaps of the locations of human hands with respect to the musical instruments, as well as the locations of objects with respect to the average human pose in different interactions. [sent-308, score-0.875]
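One simple way such a heatmap could be built is sketched below, assuming hand_xy holds one hand location per image and obj_boxes the corresponding instrument bounding boxes (x0, y0, x1, y1); the bin count and normalization range are illustrative choices, not taken from the paper.

import numpy as np

def hand_heatmap(hand_xy, obj_boxes, bins=32):
    # Express each hand position in the coordinate frame of the detected
    # instrument (normalized by its bounding box) and accumulate a 2D histogram.
    hand_xy, obj_boxes = np.asarray(hand_xy, float), np.asarray(obj_boxes, float)
    x0, y0, x1, y1 = obj_boxes.T
    u = (hand_xy[:, 0] - x0) / np.maximum(x1 - x0, 1e-6)
    v = (hand_xy[:, 1] - y0) / np.maximum(y1 - y0, 1e-6)
    H, _, _ = np.histogram2d(v, u, bins=bins, range=[[-1, 2], [-1, 2]])
    return H / max(H.sum(), 1.0)                  # normalized frequency map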
85 On most instruments, we observe more consistent human hand locations in the clusters of people playing the instrument than in the other clusters. [sent-309, score-0.89]
86 This reveals some general rules of how humans interact with a specific type of object, no matter what the functionality of the interaction is. [sent-314, score-0.81]
87 Interestingly, people usually touch different parts of the French horn depending on whether they are playing it, as shown in Fig. [sent-315, score-0.632]
88 Our method learns the interaction between humans and objects. [sent-318, score-0.26]
89 Given a human pose, we would like to know what object the human is manipulating. [sent-319, score-0.339]
90 On the PPMI test images, we apply all the human pose models to each image, and select the human that corresponds to the largest score. [sent-320, score-0.453]
91 (a) Heatmaps of the locations of human hands with respect to musical instruments. [sent-369, score-0.471]
92 We compare our approach with a baseline that runs the deformable part models [10] of all instruments on each image and outputs the instrument that corresponds to the largest calibrated score. [sent-376, score-0.69]
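A minimal sketch of the two prediction rules, where pose_models and detectors map instrument names to models, and score_pose / score_det are hypothetical callables returning a scalar score for a model applied to an image (the detector scores are assumed to be calibrated across instruments, as the baseline requires).

def predict_instrument(image, pose_models, score_pose, detectors, score_det):
    # Ours: run every interaction-specific pose model and keep the best-scoring one.
    by_pose = max(pose_models, key=lambda name: score_pose(pose_models[name], image))
    # Baseline: run every instrument detector and output the largest calibrated score.
    by_detector = max(detectors, key=lambda name: score_det(detectors[name], image))
    return by_pose, by_detector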
93 Table 2 shows that on the musical instruments where the human pose is different from the others, such as flute and violin, our method has good prediction performance. [sent-378, score-0.885]
94 On musical instruments which are played with a similar human pose, such as bassoon, clarinet and saxophone (shown in Fig. [sent-379, score-0.741]
95 Humans tend to touch similar locations on some musical instruments, even when they are not playing them. [sent-382, score-0.552]
96 confirms that both object appearance and functionality are important in perceiving objects and provide complementary information [23]. [sent-387, score-0.585]
97 We consider multiple possible interactions between humans and a certain object, and use an approach that iteratively clusters images based on object functionality and updates models of object detection and pose estimation. [sent-392, score-1.287]
98 On a dataset of people interacting with musical instruments, we show that our model is able to effectively infer object functionalities. [sent-393, score-0.456]
99 Weakly supervised learning of interactions between humans and objects. [sent-591, score-0.424]
100 Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. [sent-646, score-0.389]
wordName wordTfidf (topN-words)
[('functionality', 0.459), ('instrument', 0.355), ('musical', 0.26), ('ppmi', 0.256), ('instruments', 0.254), ('playing', 0.205), ('humans', 0.204), ('affordance', 0.197), ('interactions', 0.186), ('violin', 0.171), ('pose', 0.144), ('french', 0.142), ('human', 0.131), ('flute', 0.096), ('interact', 0.091), ('horn', 0.088), ('cluster', 0.086), ('object', 0.077), ('clusters', 0.072), ('holding', 0.071), ('poses', 0.07), ('people', 0.07), ('weakly', 0.066), ('discover', 0.065), ('body', 0.065), ('bangpeng', 0.064), ('locations', 0.057), ('interaction', 0.056), ('estimators', 0.053), ('delaitre', 0.049), ('tailored', 0.049), ('interacting', 0.049), ('objects', 0.049), ('bassoon', 0.048), ('clarinet', 0.048), ('heatmaps', 0.048), ('humanviolin', 0.048), ('saxophone', 0.048), ('trumpet', 0.048), ('clustering', 0.047), ('largest', 0.047), ('detection', 0.045), ('perceive', 0.043), ('estimator', 0.043), ('horns', 0.043), ('functionalities', 0.043), ('gibson', 0.043), ('action', 0.041), ('yao', 0.04), ('pairwise', 0.04), ('vk', 0.04), ('kij', 0.039), ('detectors', 0.037), ('gupta', 0.035), ('stanford', 0.035), ('parts', 0.034), ('supervised', 0.034), ('guitar', 0.034), ('might', 0.034), ('estimation', 0.031), ('xi', 0.031), ('fouhey', 0.031), ('coordinates', 0.03), ('touch', 0.03), ('convert', 0.029), ('consideration', 0.029), ('efros', 0.027), ('functional', 0.027), ('ways', 0.027), ('inferring', 0.026), ('psychology', 0.026), ('facilitates', 0.025), ('similarity', 0.025), ('cue', 0.024), ('relationship', 0.024), ('sivic', 0.024), ('orientations', 0.024), ('hands', 0.023), ('images', 0.023), ('arms', 0.023), ('visualizes', 0.023), ('key', 0.022), ('robot', 0.022), ('xl', 0.022), ('refers', 0.022), ('colleagues', 0.021), ('itom', 0.021), ('cello', 0.021), ('kjellstrom', 0.021), ('kvk', 0.021), ('objectaction', 0.021), ('recorder', 0.021), ('wiewiora', 0.021), ('revisited', 0.021), ('child', 0.021), ('chair', 0.021), ('distance', 0.02), ('laptev', 0.02), ('play', 0.02), ('observing', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 118 iccv-2013-Discovering Object Functionality
Author: Bangpeng Yao, Jiayuan Ma, Li Fei-Fei
Abstract: Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.
2 0.18020852 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling
Author: Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang
Abstract: Human action can be recognised from a single still image by modelling Human-object interaction (HOI), which infers the mutual spatial structure information between human and object as well as their appearance. Existing approaches rely heavily on accurate detection of human and object, and estimation of human pose. They are thus sensitive to large variations of human poses, occlusion and unsatisfactory detection of small size objects. To overcome this limitation, a novel exemplar based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars, which are density functions describing how a person is interacting with a manipulated object for different activities spatially in a probabilistic way. A representation based on our HOI exemplar thus has great potential for being robust to the errors in human/object detection and pose estimation. A new framework consists of a proposed exemplar based HOI descriptor and an activity specific matching model that learns the parameters is formulated for robust human activity recog- nition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-ofthe-art performance.
3 0.11271337 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
Author: Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele
Abstract: Typical approaches to articulated pose estimation combine spatial modelling of the human body with appearance modelling of body parts. This paper aims to push the state-of-the-art in articulated pose estimation in two ways. First we explore various types of appearance representations aiming to substantially improve the bodypart hypotheses. And second, we draw on and combine several recently proposed powerful ideas such as more flexible spatial models as well as image-conditioned spatial models. In a series of experiments we draw several important conclusions: (1) we show that the proposed appearance representations are complementary; (2) we demonstrate that even a basic tree-structure spatial human body model achieves state-ofthe-art performance when augmented with the proper appearance representation; and (3) we show that the combination of the best performing appearance model with a flexible image-conditioned spatial model achieves the best result, significantly improving over the state of the art, on the “Leeds Sports Poses ” and “Parse ” benchmarks.
4 0.10945076 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
Author: Elisabeta Marinoiu, Dragos Papava, Cristian Sminchisescu
Abstract: Human motion analysis in images and video is a central computer vision problem. Yet, there are no studies that reveal how humans perceive other people in images and how accurate they are. In this paper we aim to unveil some of the processing–as well as the levels of accuracy–involved in the 3D perception of people from images by assessing the human performance. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular the visual and kinematic performance with respect to 3D ground truth when the human subject is presented an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, as well as synchronized eye movement recordings of human subjects, shown a variety of human body configurations, both easy and difficult, as well as their ‘re-enacted’ 3D poses; (3) quantitative analysis revealing the human performance in 3D pose reenactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way it correlates with different poses. We also discuss the implications of our find- ings for the construction of visual human sensing systems.
5 0.10204244 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos
Author: Taegyu Lim, Seunghoon Hong, Bohyung Han, Joon Hee Han
Abstract: We propose an on-line algorithm to extract a human by foreground/background segmentation and estimate pose of the human from the videos captured by moving cameras. We claim that a virtuous cycle can be created by appropriate interactions between the two modules to solve individual problems. This joint estimation problem is divided into two subproblems, , foreground/background segmentation and pose tracking, which alternate iteratively for optimization; segmentation step generates foreground mask for human pose tracking, and human pose tracking step provides foreground response map for segmentation. The final solution is obtained when the iterative procedure converges. We evaluate our algorithm quantitatively and qualitatively in real videos involving various challenges, and present its outstandingperformance compared to the state-of-the-art techniques for segmentation and pose estimation.
6 0.09471903 46 iccv-2013-Allocentric Pose Estimation
7 0.09209691 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
8 0.090861127 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.089708745 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
10 0.088195778 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
11 0.085303918 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
12 0.08461602 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
13 0.083540514 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
14 0.082511187 39 iccv-2013-Action Recognition with Improved Trajectories
15 0.081652522 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
16 0.08162535 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency
17 0.076609589 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
18 0.076050013 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
19 0.075628847 143 iccv-2013-Estimating Human Pose with Flowing Puppets
20 0.074216887 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
topicId topicWeight
[(0, 0.159), (1, 0.032), (2, 0.032), (3, 0.041), (4, 0.079), (5, -0.055), (6, 0.027), (7, -0.004), (8, -0.041), (9, 0.044), (10, 0.05), (11, 0.016), (12, -0.126), (13, -0.07), (14, -0.043), (15, 0.064), (16, 0.005), (17, -0.04), (18, 0.03), (19, 0.04), (20, 0.03), (21, 0.016), (22, 0.082), (23, -0.025), (24, 0.044), (25, -0.008), (26, 0.009), (27, 0.004), (28, -0.031), (29, -0.039), (30, -0.044), (31, -0.017), (32, 0.023), (33, -0.02), (34, 0.045), (35, -0.024), (36, 0.004), (37, -0.034), (38, -0.029), (39, -0.0), (40, 0.054), (41, 0.023), (42, -0.062), (43, 0.015), (44, 0.027), (45, -0.047), (46, -0.049), (47, 0.017), (48, 0.037), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.94388902 118 iccv-2013-Discovering Object Functionality
Author: Bangpeng Yao, Jiayuan Ma, Li Fei-Fei
Abstract: Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.
2 0.85891205 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling
Author: Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang
Abstract: Human action can be recognised from a single still image by modelling Human-object interaction (HOI), which infers the mutual spatial structure information between human and object as well as their appearance. Existing approaches rely heavily on accurate detection of human and object, and estimation of human pose. They are thus sensitive to large variations of human poses, occlusion and unsatisfactory detection of small size objects. To overcome this limitation, a novel exemplar based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars, which are density functions describing how a person is interacting with a manipulated object for different activities spatially in a probabilistic way. A representation based on our HOI exemplar thus has great potential for being robust to the errors in human/object detection and pose estimation. A new framework consists of a proposed exemplar based HOI descriptor and an activity specific matching model that learns the parameters is formulated for robust human activity recog- nition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-ofthe-art performance.
3 0.85148084 46 iccv-2013-Allocentric Pose Estimation
Author: M. José Antonio, Luc De_Raedt, Tinne Tuytelaars
Abstract: The task of object pose estimation has been a challenge since the early days of computer vision. To estimate the pose (or viewpoint) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. Surprisingly, informative features provided by other, external elements in the scene, have so far mostly been ignored. At the same time, contextual cues have been shown to be of great benefit for related tasks such as object detection or action recognition. In this paper, we explore how information from other objects in the scene can be exploited for pose estimation. In particular, we look at object configurations. We show that, starting from noisy object detections and pose estimates, exploiting the estimated pose and location of other objects in the scene can help to estimate the objects’ poses more accurately. We explore both a camera-centered as well as an object-centered representation for relations. Experiments on the challenging KITTI dataset show that object configurations can indeed be used as a complementary cue to appearance-based pose estimation. In addition, object-centered relational representations can also assist object detection.
4 0.84248811 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
Author: Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, Bernt Schiele
Abstract: Typical approaches to articulated pose estimation combine spatial modelling of the human body with appearance modelling of body parts. This paper aims to push the state-of-the-art in articulated pose estimation in two ways. First we explore various types of appearance representations aiming to substantially improve the bodypart hypotheses. And second, we draw on and combine several recently proposed powerful ideas such as more flexible spatial models as well as image-conditioned spatial models. In a series of experiments we draw several important conclusions: (1) we show that the proposed appearance representations are complementary; (2) we demonstrate that even a basic tree-structure spatial human body model achieves state-ofthe-art performance when augmented with the proper appearance representation; and (3) we show that the combination of the best performing appearance model with a flexible image-conditioned spatial model achieves the best result, significantly improving over the state of the art, on the “Leeds Sports Poses ” and “Parse ” benchmarks.
5 0.84225118 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
Author: Elisabeta Marinoiu, Dragos Papava, Cristian Sminchisescu
Abstract: Human motion analysis in images and video is a central computer vision problem. Yet, there are no studies that reveal how humans perceive other people in images and how accurate they are. In this paper we aim to unveil some of the processing–as well as the levels of accuracy–involved in the 3D perception of people from images by assessing the human performance. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular the visual and kinematic performance with respect to 3D ground truth when the human subject is presented an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, as well as synchronized eye movement recordings of human subjects, shown a variety of human body configurations, both easy and difficult, as well as their ‘re-enacted’ 3D poses; (3) quantitative analysis revealing the human performance in 3D pose reenactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way it correlates with different poses. We also discuss the implications of our find- ings for the construction of visual human sensing systems.
6 0.77139926 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion
7 0.7553184 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
8 0.72172582 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency
10 0.67997539 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
11 0.65856498 143 iccv-2013-Estimating Human Pose with Flowing Puppets
12 0.64601821 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
13 0.64065832 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data
14 0.62748444 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos
15 0.62399113 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
16 0.62004352 130 iccv-2013-Dynamic Structured Model Selection
17 0.61880982 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
18 0.61101979 205 iccv-2013-Human Re-identification by Matching Compositional Template with Cluster Sampling
19 0.60053295 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors
20 0.59296918 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
topicId topicWeight
[(2, 0.073), (26, 0.046), (31, 0.035), (35, 0.017), (42, 0.13), (48, 0.012), (64, 0.052), (73, 0.029), (76, 0.29), (78, 0.016), (89, 0.176), (98, 0.012)]
simIndex simValue paperId paperTitle
1 0.8020395 164 iccv-2013-Fibonacci Exposure Bracketing for High Dynamic Range Imaging
Author: Mohit Gupta, Daisuke Iso, Shree K. Nayar
Abstract: Exposure bracketing for high dynamic range (HDR) imaging involves capturing several images of the scene at different exposures. If either the camera or the scene moves during capture, the captured images must be registered. Large exposure differences between bracketed images lead to inaccurate registration, resulting in artifacts such as ghosting (multiple copies of scene objects) and blur. We present two techniques, one for image capture (Fibonacci exposure bracketing) and one for image registration (generalized registration), to prevent such motion-related artifacts. Fibonacci bracketing involves capturing a sequence of images such that each exposure time is the sum of the previous N(N > 1) exposures. Generalized registration involves estimating motion between sums of contiguous sets of frames, instead of between individual frames. Together, the two techniques ensure that motion is always estimated betweenframes of the same total exposure time. This results in HDR images and videos which have both a large dynamic range andminimal motion-relatedartifacts. We show, by results for several real-world indoor and outdoor scenes, that theproposed approach significantly outperforms several ex- isting bracketing schemes.
2 0.78319091 221 iccv-2013-Joint Inverted Indexing
Author: Yan Xia, Kaiming He, Fang Wen, Jian Sun
Abstract: Inverted indexing is a popular non-exhaustive solution to large scale search. An inverted file is built by a quantizer such as k-means or a tree structure. It has been found that multiple inverted files, obtained by multiple independent random quantizers, are able to achieve practically good recall and speed. Instead of computing the multiple quantizers independently, we present a method that creates them jointly. Our method jointly optimizes all codewords in all quantizers. Then it assigns these codewords to the quantizers. In experiments this method shows significant improvement over various existing methods that use multiple independent quantizers. On the one-billion set of SIFT vectors, our method is faster and more accurate than a recent state-of-the-art inverted indexing method.
same-paper 3 0.75675666 118 iccv-2013-Discovering Object Functionality
Author: Bangpeng Yao, Jiayuan Ma, Li Fei-Fei
Abstract: Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.
4 0.73066354 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
Author: Bingbing Ni, Pierre Moulin
Abstract: We aim to unsupervisedly discover human’s action (motion) patterns of manipulating various objects in scenarios such as assisted living. We are motivated by two key observations. First, large variation exists in motion patterns associated with various types of objects being manipulated, thus manually defining motion primitives is infeasible. Second, some motion patterns are shared among different objects being manipulated while others are object specific. We therefore propose a nonparametric Bayesian method that adopts a hierarchical Dirichlet process prior to learn representative manipulation (motion) patterns in an unsupervised manner. Taking easy-to-obtain object detection score maps and dense motion trajectories as inputs, the proposed probabilistic model can discover motion pattern groups associated with different types of objects being manipulated with a shared manipulation pattern dictionary. The size of the learned dictionary is automatically inferred. Com- prehensive experiments on two assisted living benchmarks and a cooking motion dataset demonstrate superiority of our learned manipulation pattern dictionary in representing manipulation actions for recognition.
5 0.68861961 414 iccv-2013-Temporally Consistent Superpixels
Author: Matthias Reso, Jörn Jachalsky, Bodo Rosenhahn, Jörn Ostermann
Abstract: Superpixel algorithms represent a very useful and increasingly popular preprocessing step for a wide range of computer vision applications, as they offer the potential to boost efficiency and effectiveness. In this regards, this paper presents a highly competitive approach for temporally consistent superpixelsfor video content. The approach is based on energy-minimizing clustering utilizing a novel hybrid clustering strategy for a multi-dimensional feature space working in a global color subspace and local spatial subspaces. Moreover, a new contour evolution based strategy is introduced to ensure spatial coherency of the generated superpixels. For a thorough evaluation the proposed approach is compared to state of the art supervoxel algorithms using established benchmarks and shows a superior performance.
6 0.66523695 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples
7 0.63860822 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
8 0.63840741 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification
9 0.63738656 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
10 0.63702142 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
11 0.63694328 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
12 0.63668323 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
13 0.63647848 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
14 0.6364029 149 iccv-2013-Exemplar-Based Graph Matching for Robust Facial Landmark Localization
15 0.63627827 277 iccv-2013-Multi-channel Correlation Filters
16 0.63596445 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
17 0.6358552 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
18 0.63584918 165 iccv-2013-Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies
19 0.63551378 97 iccv-2013-Coupling Alignments with Recognition for Still-to-Video Face Recognition
20 0.63522106 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering