Author: Yun Jiang, Hema Koppula, Ashutosh Saxena
Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
1 For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. [sent-6, score-0.394]
2 Our goal is to learn this hidden human context (i. [sent-7, score-0.297]
3 We present Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. [sent-10, score-0.368]
4 This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. [sent-11, score-0.337]
5 Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. [sent-12, score-0.319]
6 This particular configuration, which is commonly found in offices, can be naturally explained by a human sitting in the chair and working with the computer. [sent-21, score-0.372]
7 Our key hypothesis is that even when the humans are never observed, the human context is helpful. [sent-28, score-0.374]
8 In fact, several recent works have shown promise in using human and object affordances to model the scenes. [sent-29, score-0.556]
9 Jiang, Lim and Saxena [14, 17] used hallucinated humans for learning the object arrangements in a house in order to enable robots to place objects in human-preferred locations. [sent-30, score-0.355]
10 While inspired by these prior works, the key idea in our work is to hallucinate humans in order to learn a generic form of object affordance, and to use them in the task of labeling 3D scenes. [sent-35, score-0.309]
11 While a large corpus of scenes with objects is available, humans and their interactions with objects are observed only a few times for some objects. [sent-36, score-0.291]
12 Therefore, using hallucinated humans gives us the advantage of considering human context while not limited to data that contains real human interactions. [sent-37, score-0.631]
13 However, if the humans are not observed in the scene and we do not know the object affordances either (i. [sent-38, score-0.578]
14 First, while the space of potential unobserved human configurations is large, only a few are likely, and the same holds for the object affordances. [sent-43, score-0.312]
15 For example, if standing on furniture (such as tables and chairs) is very unlikely in the prior, we are less likely to learn affordances such as humans stepping on books. [sent-44, score-0.579]
16 Second, we encourage fewer humans per scene, resulting in different objects sharing the same human configurations. [sent-45, score-0.409]
17 This allows us to explain, but not over-fit, a scene with as few human configurations as necessary. [sent-46, score-0.331]
18 In order to model the scene through hallucinated human configurations and object affordances, we propose a new topic model, which we call Infinite Factored Topic Model (IFTM). [sent-47, score-0.662]
19 Each object in the scene is generated by two types of topics jointly: human-configuration topics and object-affordance topics. [sent-48, score-1.171]
20 We use a sampling algorithm to estimate the human pose distribution in scenes and to optimize the affordance functions that best explain the given scenes. [sent-49, score-0.686]
21 The learned topics are later used as features for building a scene labeling classifier. [sent-50, score-0.65]
22 We test our approach on the tasks of labeling objects and attributes in 3D scenes, and show that the human-object context is informative in that it increases performance over a classifier based on object appearance and shape. [sent-51, score-0.298]
23 However, none of these works consider human context for scene understanding. [sent-61, score-0.322]
24 Only recently, some works [9, 32, 1, 25, 20] have shown that modeling the interaction between human poses and objects in 2D images and videos results in better performance on the tasks of object detection and activity recognition. [sent-64, score-0.397]
25 [4] observe humans in videos for estimating 3D geometry and estimating affordances respectively. [sent-67, score-0.465]
26 However, these works are unable to characterize the relationship between objects in 3D unless a human explicitly interacted with each of the objects and are also limited by the quality of the human poses inferred from 2D data. [sent-68, score-0.544]
27 Our method can extract the hidden human context even from static scenes without humans, based on the object configurations found in the human environments. [sent-69, score-0.656]
28 However, they require explicit training data specifying the human pose associated with an affordance and demonstrated their method on a single object category and affordance. [sent-76, score-0.623]
29 [14] consider many affordances in the form of human-object relation topics which are obtained in a completely unsupervised manner. [sent-78, score-0.837]
30 While they employ the learned affordances to infer reasonable object arrangements in human environments, in this work, we combine these affordances, as functional cues, with other visual and geometric cues to improve the performance of scene labeling. [sent-79, score-0.654]
31 Representation of Human Configurations and Object Affordances We first define the representation of the human configurations and object affordances in the following: The Space of Human Configurations. [sent-81, score-0.659]
32 For poses, we considered human poses from real human activity data (Cornell Activity Dataset-60, [29]) and clustered them using the k-means algorithm, giving us six types of skeletons (three sitting poses and three standing poses), shown in Fig. [sent-84, score-0.566]
33 From left: sitting upright, sitting reclined, sitting forward, reaching, standing and leaning forward. [sent-87, score-0.34]
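The clustering step that produces the six pose types can be illustrated with a plain Lloyd's-style k-means. This is only a minimal sketch: the function name `kmeans`, the random initialization, and the toy 2D points are assumptions for illustration, whereas the paper clusters skeletons from real activity data.

```python
import numpy as np

def kmeans(points, k, num_iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points (an illustrative choice).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(num_iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its points; keep empty clusters fixed.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical stand-in for pose feature vectors: two well-separated groups.
toy_poses = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(toy_poses, k=2)
```

With real skeleton data, each row would be a flattened vector of joint coordinates and `k` would be 6.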
34 A human can use the objects at different distances and orientations from the human body. [sent-89, score-0.387]
35 However, the human context cannot be easily harnessed because the space of possible human configurations and object affordances is rather large. [sent-101, score-0.915]
36 For example, one potential explanation of the scene could be humans floating in the air and preferring to step on every object as the affordance! [sent-103, score-0.294]
37 The key to modeling the large space of latent human context lies in building parsimonious models and providing priors to avoid physically-impossible models. [sent-104, score-0.297]
38 Model Parsimony While there is an infinite number of human configurations in a scene and countless ways to interact with objects, only a few human poses and certain common ways of using objects are needed to explain most parts of a scene. [sent-107, score-0.718]
39 This is analogous to the document topics [30, 18], except that in our case topics will be continuous distributions and factored. [sent-110, score-1.006]
40 Similar to document topics, our human-context topics can be shared across objects and scenes. [sent-111, score-0.553]
41 We describe the two types of topics below: Human Configuration Topics. [sent-115, score-0.527]
42 In a scene, there are certain human configurations that are used more commonly than others. [sent-116, score-0.265]
43 For instance, in an office a sitting pose on the chair and a few poses standing by the desk, shelf and whiteboard are more common. [sent-117, score-0.365]
44 Most of the objects in an office are arranged for these human configurations. [sent-118, score-0.265]
45 For example, both using a keyboard and reading a book require a human pose to be close to objects. [sent-121, score-0.362]
46 Therefore, the affordance of a book would be a mixture of a ‘close-to’ and a ‘spread-out’ topic. [sent-123, score-0.46]
47 Our hallucinated human configurations need to follow basic physics. [sent-128, score-0.36]
48 We consider the following two properties as priors for the generated human configurations [10]: 1) Kinematics. [sent-130, score-0.265]
49 Furthermore, most objects’ affordance should be symmetric in their relative orientation to the humans’ left or right. [sent-136, score-0.376]
50 We encode this information in the design of the function quantifying affordances and as Bayesian priors in the estimation of the function’s parameters, see Section 5. [sent-137, score-0.347]
51 Infinite Factored Topic Model (IFTM) In this work, we model the human configurations and object affordances as two types of ‘factored’ topics. [sent-140, score-0.696]
52 In our previous work [18], we presented a finite factored topic model that discovers different types of topics from text data. [sent-141, score-0.834]
53 Each type of topic is modeled by an independent topic model and a data point is jointly determined by a set of topics, one from each type. [sent-142, score-0.378]
54 By factorizing the original parameter (topic) space into smaller sub-spaces, it uses a small number of topics from different sub-spaces to effectively express a larger number of topics in the original space. [sent-143, score-0.98]
55 In this work, we extend our idea to Infinite Factored Topic Models (IFTM), which can handle not only multiple types of topics but also an unknown number of topics in each type. [sent-144, score-1.017]
56 Furthermore, unlike text data, our topics in this work are continuous distributions, which we model using a Dirichlet process mixture model (DPMM) [30]. [sent-145, score-0.554]
57 In the following, we first briefly review DPMM, and then describe our IFTM and show how to address the challenges induced by the coupling of the topics from different types. [sent-146, score-0.49]
58 Specifically, it first draws an infinite number of topics from a base distribution G, and the topic proportion π: $\theta_k \sim G$, $b_k \sim \mathrm{Beta}(1, \alpha)$, $\pi_k = b_k \prod_{i=1}^{k-1} (1 - b_i)$. [sent-150, score-0.784]
59 The topic assignment z is sampled from the topic proportion π. [sent-155, score-0.417]
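The stick-breaking draw of the topic proportions, followed by sampling assignments z from π, can be sketched as follows. This is a minimal illustration under a finite truncation; the truncation level and the function names are assumptions, since the model itself uses an infinite number of topics.

```python
import numpy as np

def stick_breaking(alpha, num_topics, rng):
    """Truncated stick-breaking weights: pi_k = b_k * prod_{i<k} (1 - b_i)."""
    b = rng.beta(1.0, alpha, size=num_topics)  # b_k ~ Beta(1, alpha)
    # Length of stick remaining before each break: 1, (1-b_1), (1-b_1)(1-b_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    return b * remaining

def sample_assignments(pi, num_points, rng):
    """Draw topic assignments z from the renormalized truncated proportions."""
    return rng.choice(len(pi), size=num_points, p=pi / pi.sum())

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, num_topics=50, rng=rng)
z = sample_assignments(pi, num_points=10, rng=rng)
```

A smaller `alpha` concentrates mass on the first few topics, which matches the preference for explaining a scene with few topics.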
60 Figure 3: DPMM and our 2D infinite factored topic model. [sent-165, score-0.375]
61 DPMM differs from traditional mixture models in that it incorporates a base (prior) distribution of topics and allows the number of topics to change according to the data. [sent-168, score-1.055]
62 , the number of affordances and human poses) is unknown and can vary from scene to scene. [sent-172, score-0.575]
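63 Given the topic assignments $(z^1, \ldots, z^L)$, we then draw $x$ from the distribution parameterized by the selected $L$ topics together: $z^\ell \sim \pi^\ell$, $x \sim F(\theta^1_{z^1}, \ldots, \theta^L_{z^L})$. [sent-200, score-0.527]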
64 mes the two types of topic spaces are independent, it is easy to show that the distribution of z can be factorized into the distribution of z? [sent-225, score-0.3]
65 Its distribution F should reflect the likelihood of the object being at this location given the human configuration and affordance. [sent-236, score-0.277]
66 We therefore define F as (see [14]): $F(x_i; \theta^H, \theta^O) = F_{dist} F_{rel} F_{height}$, (2) where the three terms depict three types of spatial relationships between the object $x_i$ and the human pose $\theta^H$: Euclidean distance, relative angle and height (vertical) distance. [sent-237, score-0.35]
67 We use log-normal, von Mises and normal distributions to characterize the probability of these measurements, and the parameters of these distributions are given by the object affordance topics, i. [sent-238, score-0.475]
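The product of a log-normal term for distance, a von Mises term for relative angle and a normal term for height can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names, the way the three measurements are computed from coordinates (horizontal distance, `atan2` angle relative to an assumed human yaw, vertical offset) and the example parameter values are all assumptions.

```python
import math
import numpy as np

def lognormal_pdf(d, mu, sigma):
    return math.exp(-(math.log(d) - mu) ** 2 / (2 * sigma ** 2)) / (
        d * sigma * math.sqrt(2 * math.pi))

def von_mises_pdf(angle, mu, kappa):
    # np.i0 is the modified Bessel function of the first kind, order 0.
    return math.exp(kappa * math.cos(angle - mu)) / (2 * math.pi * float(np.i0(kappa)))

def normal_pdf(h, mu, sigma):
    return math.exp(-(h - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def affordance_likelihood(obj_xyz, human_xyz, human_yaw, theta_o):
    """F = F_dist * F_rel * F_height for one object and one human pose."""
    mu_d, sigma_d, mu_r, kappa_r, mu_h, sigma_h = theta_o
    dx, dy, dz = (o - h for o, h in zip(obj_xyz, human_xyz))
    dist = math.hypot(dx, dy)                    # assumed: distance in the ground plane
    if dist <= 0.0:
        return 0.0                               # log-normal has no mass at zero
    rel_angle = math.atan2(dy, dx) - human_yaw   # assumed: angle relative to facing direction
    return (lognormal_pdf(dist, mu_d, sigma_d)
            * von_mises_pdf(rel_angle, mu_r, kappa_r)
            * normal_pdf(dz, mu_h, sigma_h))

# Hypothetical affordance topic: about 0.5 m away, in front, at the same height.
theta_o = (math.log(0.5), 0.3, 0.0, 2.0, 0.0, 0.2)
front = affordance_likelihood((0.5, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0, theta_o)
behind = affordance_likelihood((-0.5, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0, theta_o)
```

With the mean angle set to zero, the von Mises term is symmetric between the human's left and right, so mirrored object positions receive the same likelihood.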
68 GH is a uniform distribution over valid human poses in the scene (see Section 4. [sent-242, score-0.359]
69 Learning Human-Context Topics Given a scene, the location of an object xi is observed and our goal is to estimate likely human configurations and affordances in the scene. [sent-246, score-0.697]
70 Moreover, to incorporate the growth of the number of topics, we add m auxiliary topics for a subject to choose from [23]. [sent-252, score-0.49]
71 These auxiliary topics are drawn from the base distribution GH or GO, and the probability of choosing one of these topics is equal to αH/m or αO/m. [sent-253, score-1.017]
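One way to picture the auxiliary-topic step is as a normalized weight vector over existing and auxiliary topics. The sketch below assumes the standard auxiliary-variable scheme, in which an existing topic is weighted by the number of points already assigned to it times the likelihood, and each auxiliary topic by α/m times the likelihood; the count-based weighting, the function name and the numbers are assumptions not spelled out in the text above.

```python
import numpy as np

def assignment_probs(counts, lik_existing, lik_aux, alpha):
    """Probabilities of assigning one object to each existing or auxiliary topic."""
    counts = np.asarray(counts, dtype=float)
    lik_existing = np.asarray(lik_existing, dtype=float)
    lik_aux = np.asarray(lik_aux, dtype=float)
    m = len(lik_aux)
    # Existing topics: n_k * F(x; theta_k). Auxiliary topics: (alpha / m) * F(x; theta_aux).
    weights = np.concatenate((counts * lik_existing, (alpha / m) * lik_aux))
    return weights / weights.sum()

# Two existing topics and m = 2 auxiliary topics drawn from the base distribution.
probs = assignment_probs(counts=[3, 1],
                         lik_existing=[0.2, 0.1],
                         lik_aux=[0.05, 0.3],
                         alpha=1.0)
```

If an auxiliary topic is chosen, it becomes a new topic, which is how the number of topics can grow with the data.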
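72 Given topic assignments, we can compute the posterior distribution of topics and sample topics from it: $p(\theta^H_k = \theta^H \mid \cdot) \propto G^H(\theta^H) \prod_{i: z^H_i = k} F(x_i; \theta^H, \theta^O_{z^O_i})$, $p(\theta^O_j = \theta^O \mid \cdot) \propto G^O(\theta^O) \prod_{i: z^O_i = j} F(x_i; \theta^H_{z^H_i}, \theta^O)$. [sent-257, score-1.206]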
73 It shows two affordance topics, labeled with the most common object label for understanding in this figure. [sent-261, score-0.423]
74 In Iteration #1, the affordance is based only on the prior GO and hence is the same for all objects. [sent-263, score-0.376]
75 For example, an affordance topic θjO = (μd, σd, μr, κr, μh, σh) is updated as follows: The mean and variance (μd, σd) in Fdist. [sent-268, score-0.565]
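76 Given the distance between each object $x_i$ and its associated human pose $\theta^H_{z^H_i}$, denoted by $d_i$, $\mu_d$ and $\sigma_d$ are given by $(\mu_d, \sigma_d) = \arg\max_{\mu, \sigma} G^O_{dist}(\mu, \sigma) \prod_{i: z^O_i = j} \mathrm{LogNormal}(d_i; \mu, \sigma)$. [sent-269, score-0.285]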
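To make the update of the distance parameters concrete, the sketch below handles the special case of a flat prior, where the maximizer of the log-normal likelihood has a closed form: the mean and standard deviation of the log distances. The flat prior and the function name are assumptions; with an informative prior the maximizer would shift toward the prior.

```python
import numpy as np

def fit_lognormal(distances):
    """Maximum-likelihood (mu, sigma) of a log-normal for positive distances d_i."""
    logs = np.log(np.asarray(distances, dtype=float))
    mu = logs.mean()
    sigma = logs.std()  # ddof=0, i.e. the maximum-likelihood estimate
    return mu, sigma

# Hypothetical distances from objects of one affordance topic to their human poses.
mu_d, sigma_d = fit_lognormal(np.exp([0.0, 1.0, 2.0]))
```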
77 As the object affordance is often strongly coupled with the object classes, we use the affordance derived from the learned topics as features that feed into other learning algorithms, similar to the ideas used in supervised topic models [3]. [sent-282, score-1.525]
78 Although IFTM itself is an unsupervised method, in order to obtain more category-oriented topics, we initialize ziO to its object category to encourage topics to be shared by objects from the same class exclusively. [sent-283, score-0.6]
79 Note that when computing the affordance features (for both training and test data), no object labels are used. [sent-284, score-0.423]
80 In detail, we compute the affordance features as follows. [sent-285, score-0.376]
81 We set the affordance topics as the top K sampled topics θkO, ranked by the posterior distribution. [sent-286, score-1.395]
82 Then we use the histogram of sampled ziO as the affordance features for object i. [sent-288, score-0.462]
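The histogram feature can be sketched as counting how often each of the top K affordance topics was sampled for an object. In this minimal illustration the normalization to proportions, the dropping of samples outside the top K and the function name are assumptions.

```python
import numpy as np

def affordance_features(sampled_z, top_topics):
    """Histogram of sampled topic assignments for one object over the top-K topics."""
    index = {topic: k for k, topic in enumerate(top_topics)}
    hist = np.zeros(len(top_topics))
    for z in sampled_z:
        if z in index:          # samples outside the top K are ignored (assumption)
            hist[index[z]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# Topic ids sampled for one object across iterations; topics 2 and 5 are the top K = 2.
features = affordance_features([2, 2, 5, 7, 2], top_topics=[2, 5])
```

The resulting vector can then be appended to an object's other node features before training the labeling classifier.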
83 This is our affordance and human configurations information being used in prediction, without using object-object context. [sent-304, score-0.641]
84 Here we combine the human context (from affordances and human configurations) with object-object context. [sent-331, score-0.765]
85 In detail, we append the node features of each segment with the affordance topic proportions derived from the learned object-affordance topics and learn the semantic labeling model as described in [19]. [sent-332, score-1.176]
86 Being able to hallucinate sensible human poses is critical for learning object affordances. [sent-342, score-0.353]
87 To verify that our algorithm can sample meaningful human poses, we plot a few top sampled poses in the scenes, shown in Fig. [sent-343, score-0.33]
88 In the first home scene, some sampled human poses are sitting on the edge of the bed while others are standing close to the desk (so that they have easy access to objects on the table or the shelf-rack). [sent-345, score-0.64]
89 It is these correctly sampled human poses that make it possible to learn correct object affordances. [sent-352, score-0.342]
90 Our goal is to learn object affordance for each class. [sent-355, score-0.423]
91 6 shows the affordances from the top view and side view respectively for typical object classes. [sent-357, score-0.394]
92 Note that while the affordance topics are unimodal, the affordance for each object is a mixture of these topics and thus could be multimodal and more expressive. [sent-365, score-1.833]
93 1 shows that the affordance topic proportions (human context) as extra features boost the labeling performance. [sent-386, score-0.686]
94 First, when combining human context with the image and shape features, we see a consistent improvement in labeling performance in all evaluation metrics, regardless of the object-object context. [sent-387, score-0.391]
95 In fact, adding object-object context to human-object context was particularly helpful for small objects such as keyboards and books that are not always used by humans together, but still have a spatial correlation between them. [sent-390, score-0.436]
96 Similarly, it confuses cpuTop with chairBase because the CPU-top (placed on the ground) could also afford sitting human poses! [sent-396, score-0.314]
97 Conclusions We presented infinite factored topic models (IFTM) that enabled us to model the generation of a scene containing objects through hallucinated (hidden) human configurations and object affordances, both modeled as topics. [sent-402, score-0.911]
98 Learning object arrangements in 3D scenes using human context. [sent-503, score-0.288]
99 Learning human activities and object affordances from RGB-D videos. [sent-540, score-0.556]
100 Modeling mutual context of object and human pose in human-object interaction activities. [sent-612, score-0.341]
