cvpr cvpr2013 cvpr2013-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yun Jiang, Hema Koppula, Ashutosh Saxena
Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
Reference: text
sentIndex sentText sentNum sentScore
1 For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. [sent-6, score-0.394]
2 Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. [sent-7, score-0.297]
3 We present Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. [sent-10, score-0.368]
4 This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. [sent-11, score-0.337]
5 Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human object relationships. [sent-12, score-0.319]
6 This particular configuration, commonly found in offices, can be naturally explained by a human sitting in the chair and working with the computer. [sent-21, score-0.372]
7 Our key hypothesis is that even when the humans are never observed, the human context is helpful. [sent-28, score-0.374]
8 In fact, several recent works have shown promise in using human and object affordances to model the scenes. [sent-29, score-0.556]
9 Jiang, Lim and Saxena [14, 17] used hallucinated humans for learning the object arrangements in a house in order to enable robots to place objects in human-preferred locations. [sent-30, score-0.355]
10 While inspired by these prior works, the key idea in our work is to hallucinate humans in order to learn a generic form of object affordance, and to use them in the task of labeling 3D scenes. [sent-35, score-0.309]
11 While a large corpus of scenes with objects is available, humans and their interactions with objects are observed only a few times for some objects. [sent-36, score-0.291]
12 Therefore, using hallucinated humans gives us the advantage of considering human context while not being limited to data that contains real human interactions. [sent-37, score-0.631]
13 However, if the humans are not observed in the scene and we do not know the object affordances either (i. [sent-38, score-0.578]
14 First, while the space of potential unobserved human configurations is large, only a few are likely, and so are the object affordances. [sent-43, score-0.312]
15 For example, if standing on furniture (such as tables and chairs) is very unlikely in the prior, we are less likely to learn affordances such as humans stepping on books. [sent-44, score-0.579]
16 Second, we encourage fewer humans per scene, resulting in different objects sharing the same human configurations. [sent-45, score-0.409]
17 This allows us to explain, but not over-fit, a scene with as few human configurations as necessary. [sent-46, score-0.331]
18 In order to model the scene through hallucinated human configurations and object affordances, we propose a new topic model, which we call Infinite Factored Topic Model (IFTM). [sent-47, score-0.662]
19 Each object in the scene is generated by two types of topics jointly: human-configuration topics and object-affordance topics. [sent-48, score-1.171]
20 We use a sampling algorithm to estimate the human pose distribution in scenes and to optimize the affordance functions that best explain the given scenes. [sent-49, score-0.686]
21 The learned topics are later used as features for building a scene labeling classifier. [sent-50, score-0.65]
22 We test our approach on the tasks of labeling objects and attributes in 3D scenes, and show that the human-object context is informative in that it increases performance over a classifier based on object appearance and shape. [sent-51, score-0.298]
23 However, none of these works consider human context for scene understanding. [sent-61, score-0.322]
24 Only recently, some works [9, 32, 1, 25, 20] have shown that modeling the interaction between human poses and objects in 2D images and videos results in better performance on the tasks of object detection and activity recognition. [sent-64, score-0.397]
25 [4] observe humans in videos for estimating 3D geometry and estimating affordances respectively. [sent-67, score-0.465]
26 However, these works are unable to characterize the relationship between objects in 3D unless a human explicitly interacts with each of the objects, and are also limited by the quality of the human poses inferred from 2D data. [sent-68, score-0.544]
27 Our method can extract the hidden human context even from static scenes without humans, based on the object configurations found in the human environments. [sent-69, score-0.656]
28 However, they require explicit training data specifying the human pose associated with an affordance, and demonstrate their method on a single object category and affordance. [sent-76, score-0.623]
29 [14] consider many affordances in the form of human-object relation topics which are obtained in a completely unsupervised manner. [sent-78, score-0.837]
30 While they employ the learned affordances to infer reasonable object arrangements in human environments, in this work, we combine these affordances, as functional cues, with other visual and geometric cues to improve the performance of scene labeling. [sent-79, score-0.654]
31 Representation of Human Configurations and Object Affordances. We first define the representation of the human configurations and object affordances in the following: The Space of Human Configurations. [sent-81, score-0.659]
32 For poses, we considered oriented human poses from real human activity data (Cornell Activity Dataset-60, [29]), and clustered them using the k-means algorithm, giving us six types (three sitting poses and three standing poses) of skeletons, shown in Fig. [sent-84, score-0.566]
33 From left: sitting upright, sitting reclined, sitting forward, reaching, standing and leaning forward. [sent-87, score-0.34]
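As an illustration of this clustering step, here is a minimal sketch (not the paper's pipeline) that groups 3D skeletons from activity data into six pose types with k-means; the joint layout, the root-joint normalization and the scikit-learn usage are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pose_types(pose_joints, n_types=6, seed=0):
    """pose_joints: (n_poses, n_joints, 3) array of 3D joint positions taken
    from activity recordings. Returns a cluster label per pose and the
    n_types mean skeletons (the 'pose types')."""
    poses = np.asarray(pose_joints, dtype=float)
    # Translate each skeleton so its root joint sits at the origin, so the
    # clustering reflects body configuration rather than world position.
    poses = poses - poses[:, :1, :]
    flat = poses.reshape(len(poses), -1)  # one flat feature vector per pose
    km = KMeans(n_clusters=n_types, n_init=10, random_state=seed).fit(flat)
    skeletons = km.cluster_centers_.reshape(n_types, -1, 3)
    return km.labels_, skeletons
```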
34 A human can use the objects at different distances and orientations from the human body. [sent-89, score-0.387]
35 However, the human context cannot be easily harnessed because the space of possible human configurations and object affordances is rather large. [sent-101, score-0.915]
36 For example, one potential explanation of the scene could be humans floating in the air who prefer stepping on every object as the affordance! [sent-103, score-0.294]
37 The key to modeling the large space of latent human context lies in building parsimonious models and providing priors to avoid physically-impossible models. [sent-104, score-0.297]
38 Model Parsimony. While there are an infinite number of human configurations in a scene and countless ways to interact with objects, only a few human poses and certain common ways of using objects are needed to explain most parts of a scene. [sent-107, score-0.718]
39 This is analogous to the document topics [30, 18], except that in our case topics will be continuous distributions and factored. [sent-110, score-1.006]
40 Similar to document topics, our human-context topics can be shared across objects and scenes. [sent-111, score-0.553]
41 We describe the two types of topics below: Human Configuration Topics. [sent-115, score-0.527]
42 In a scene, there are certain human configurations that are used more commonly than others. [sent-116, score-0.265]
43 For instance, in an office a sitting pose on the chair and a few poses standing by the desk, shelf and whiteboard are more common. [sent-117, score-0.365]
44 Most of the objects in an office are arranged for these human configurations. [sent-118, score-0.265]
45 For example, both using a keyboard and reading a book require a human pose to be close to objects. [sent-121, score-0.362]
46 Therefore, the affordance of a book would be a mixture of a ‘close-to’ and a ‘spread-out’ topic. [sent-123, score-0.46]
47 Our hallucinated human configurations need to follow basic physics. [sent-128, score-0.36]
48 We consider the following two properties as priors for the generated human configurations [10]: 1) Kinematics. [sent-130, score-0.265]
49 Furthermore, most objects' affordances should be symmetric in their orientation relative to the human's left or right. [sent-136, score-0.376]
50 We encode this information in the design of the function quantifying affordances and as Bayesian priors in the estimation of the function’s parameters, see Section 5. [sent-137, score-0.347]
51 Infinite Factored Topic Model (IFTM). In this work, we model the human configurations and object affordances as two types of 'factored' topics. [sent-140, score-0.696]
52 In our previous work [18], we presented a finite factored topic model that discovers different types of topics from text data. [sent-141, score-0.834]
53 Each type of topic is modeled by an independent topic model and a data point is jointly determined by a set of topics, one from each type. [sent-142, score-0.378]
54 By factorizing the original parameter (topic) space into smaller sub-spaces, it uses a small number of topics from different sub-spaces to effectively express a larger number of topics in the original space. [sent-143, score-0.98]
55 In this work, we extend our idea to Infinite Factored Topic Models (IFTM), which can not only handle multiple types of topics but also an unknown number of topics in each type. [sent-144, score-1.017]
56 Furthermore, unlike text data, our topics in this work are continuous distributions, which we model using a Dirichlet process mixture model (DPMM) [30]. [sent-145, score-0.554]
57 In the following, we first briefly review DPMM, and then describe our IFTM and show how to address the challenges induced by the coupling of the topics from different types. [sent-146, score-0.49]
58 Specifically, it first draws an infinite number of topics from a base distribution G, and the topic proportions π via stick-breaking: θk ∼ G, bk ∼ Beta(1, α), πk = bk ∏l<k (1 − bl).
59 The topic assignment z is sampled from the topic proportion π. [sent-155, score-0.417]
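To make the generative story above concrete, here is a hedged Python sketch of a truncated stick-breaking DPMM extended to two factored topic types, one for human configurations and one for affordances. The base distributions, the truncation level and the way the two selected topics combine into an object location are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking weights: pi_k = b_k * prod_{l<k} (1 - b_l)."""
    b = rng.beta(1.0, alpha, size=truncation)
    pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return pi / pi.sum()  # renormalize the truncated tail

def sample_factored_scene(n_objects, alpha_h=1.0, alpha_o=1.0,
                          truncation=50, seed=0):
    rng = np.random.default_rng(seed)
    # Two independent topic types: human configurations (H) and affordances (O).
    pi_h = stick_breaking(alpha_h, truncation, rng)
    pi_o = stick_breaking(alpha_o, truncation, rng)
    # Stand-ins for the base distributions G_H and G_O.
    theta_h = rng.uniform(0.0, 5.0, size=(truncation, 2))  # pose locations
    theta_o = rng.lognormal(0.0, 0.5, size=truncation)     # preferred distances
    objects = []
    for _ in range(n_objects):
        z_h = rng.choice(truncation, p=pi_h)  # human-configuration topic
        z_o = rng.choice(truncation, p=pi_o)  # object-affordance topic
        # x is generated jointly by the two selected topics: here, a point at
        # the affordance's preferred distance from the chosen pose.
        ang = rng.uniform(0.0, 2.0 * np.pi)
        x = theta_h[z_h] + theta_o[z_o] * np.array([np.cos(ang), np.sin(ang)])
        objects.append((x, z_h, z_o))
    return objects
```

Truncating the stick-breaking at a finite level is a standard practical approximation to the infinite model.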
60 Figure 3: DPMM and our 2D infinite factored topic model. [sent-165, score-0.375]
61 DPMM is different from traditional mixture models in that it incorporates a base (prior) distribution of topics and allows the number of topics to change according to the data. [sent-168, score-1.055]
62 In our setting, the number of topics (i.e., the number of affordances and human poses) is unknown and can vary from scene to scene. [sent-172, score-0.575]
63 Given the topic assignments z = (z1, . . . , zL), we then draw x from the distribution parameterized by the selected L topics together. [sent-200, score-0.527]
64 Since the two types of topic spaces are independent, it is easy to show that the distribution of z can be factorized into the distributions of zH and zO. [sent-225, score-0.3]
65 Its distribution F should reflect the likelihood of the object being at this location given the human configuration and affordance. [sent-236, score-0.277]
66 We therefore define F as (see [14]): F(xi; θH, θO) = Fdist · Frel · Fheight, (2) where the three terms depict three types of spatial relationships between the object xi and the human pose θH: Euclidean distance, relative angle and height (vertical) distance. [sent-237, score-0.35]
67 We use log-normal, von Mises and normal distributions to characterize the probability of these measurements, and the parameters of these distributions are given by the object-affordance topics, i.e., θO = (μd, σd, μr, κr, μh, σh). [sent-238, score-0.475]
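As a sketch of how Eq. (2) could be evaluated in practice, the snippet below scores one object location against one human pose and one affordance topic using SciPy's log-normal, von Mises and normal densities. The geometry helpers are assumptions, and the parameter tuple follows the θO = (μd, σd, μr, κr, μh, σh) layout used in the text.

```python
import numpy as np
from scipy.stats import lognorm, norm, vonmises

def log_affordance_likelihood(x_obj, pose_xy, pose_z, pose_facing, theta_o):
    """log F = log F_dist + log F_rel + log F_height for one object/pose pair.

    x_obj: (x, y, z) object location; pose_xy, pose_z, pose_facing: pose
    location, height and facing angle; theta_o: affordance-topic parameters
    (mu_d, sigma_d, mu_r, kappa_r, mu_h, sigma_h) -- an assumed layout.
    """
    mu_d, sigma_d, mu_r, kappa_r, mu_h, sigma_h = theta_o
    dx, dy = x_obj[0] - pose_xy[0], x_obj[1] - pose_xy[1]
    d = np.hypot(dx, dy)                    # Euclidean distance in the plane
    rel = np.arctan2(dy, dx) - pose_facing  # angle relative to the pose
    h = x_obj[2] - pose_z                   # vertical (height) offset
    ll = lognorm.logpdf(d, s=sigma_d, scale=np.exp(mu_d))  # F_dist (log-normal)
    ll += vonmises.logpdf(rel, kappa_r, loc=mu_r)          # F_rel (von Mises)
    ll += norm.logpdf(h, loc=mu_h, scale=sigma_h)          # F_height (normal)
    return ll
```

A von Mises density with its mean along the facing direction is also a natural place to encode the left/right symmetry prior mentioned earlier.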
68 GH is a uniform distribution over valid human poses in the scene (see Section 4. [sent-242, score-0.359]
69 Learning Human-Context Topics Given a scene, the location of an object xi is observed and our goal is to estimate likely human configurations and affordances in the scene. [sent-246, score-0.697]
70 Moreover, to incorporate the growth of the number of topics, we add m auxiliary topics for each assignment to choose from [23]. [sent-252, score-0.49]
71 These auxiliary topics are drawn from the base distribution GH or GO, and the probability of choosing one of these topics is equal to αH/m or αO/m. [sent-253, score-1.017]
72 Given topic assignments, we can compute the posterior distribution of topics and sample topics from it: θkH ∝ GH(θkH) ∏{i: ziH=k} F(xi; θkH, θziOO), and θjO ∝ GO(θjO) ∏{i: ziO=j} F(xi; θziHH, θjO).
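A compressed sketch of the resulting assignment step, in the spirit of the auxiliary-topic scheme [23] cited above: existing topics are weighted by their occupancy counts, each of the m fresh topics by α/m, and everything is rescored by the factored likelihood. The bookkeeping and function names are assumptions.

```python
import numpy as np

def resample_assignment(i, z, thetas, counts, x, other_theta,
                        alpha, m, base_sampler, log_lik, rng):
    """Resample z[i] over existing topics plus m auxiliary topics from the base."""
    counts[z[i]] -= 1  # remove object i from its current topic
    aux = [base_sampler(rng) for _ in range(m)]
    cand = list(thetas) + aux
    logp = np.empty(len(cand))
    for k, th in enumerate(cand):
        # Existing topic: prior weight n_k; auxiliary topic: weight alpha / m.
        prior = counts[k] if k < len(thetas) else alpha / m
        logp[k] = np.log(max(prior, 1e-12)) + log_lik(x[i], th, other_theta[i])
    p = np.exp(logp - logp.max())
    z[i] = rng.choice(len(cand), p=p / p.sum())
    if z[i] >= len(thetas):          # an auxiliary topic was chosen:
        thetas.append(cand[z[i]])    # instantiate it as a new topic
        counts = np.append(counts, 0)
        z[i] = len(thetas) - 1
    counts[z[i]] += 1
    return z, thetas, counts
```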
73 It shows two affordance topics, each labeled with its most common object label for clarity in this figure. [sent-261, score-0.423]
74 In Iteration #1, the affordance is based only on the prior GO and hence is the same for all objects. [sent-263, score-0.376]
75 For example, an affordance topic θjO = (μd, σd, μr, κr, μh, σh) is updated as follows, starting with the mean and variance (μd, σd) in Fdist. [sent-268, score-0.565]
76 Given the distance between each object xi and its associated human pose θHziH, denoted by di, μd and σd are given by (μd, σd) = argmax{μ,σ} GOdist(μ, σ) ∏i Fdist(di; μ, σ).
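Because a log-normal over distances is a normal over log-distances, this argmax has a closed form under a conjugate prior. Below is a sketch of that update, with a normal-inverse-chi-squared-style prior standing in for GOdist; the hyperparameter values are assumptions, as the text does not spell them out.

```python
import numpy as np

def update_dist_params(distances, mu0=0.0, kappa0=1.0, nu0=1.0, sigma2_0=1.0):
    """MAP estimate of (mu_d, sigma_d) for F_dist from the distances d_i
    currently assigned to this affordance topic. Log-normal in d is normal
    in log d, so we apply a standard conjugate normal update in log-space."""
    y = np.log(np.asarray(distances, dtype=float))
    n, ybar = y.size, y.mean()
    # Posterior-mode mean: prior pseudo-observations blended with the data.
    mu = (kappa0 * mu0 + n * ybar) / (kappa0 + n)
    # Posterior-mode variance (scaled inverse chi-squared posterior).
    ss = ((y - ybar) ** 2).sum() \
        + kappa0 * n / (kappa0 + n) * (ybar - mu0) ** 2
    sigma2 = (nu0 * sigma2_0 + ss) / (nu0 + n + 2)
    return mu, np.sqrt(sigma2)
```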
77 As the object affordance is often strongly coupled with the object classes, we use the affordance derived from the learned topics as features that feed into other learning algorithms, similar to the ideas used in supervised topic models [3]. [sent-282, score-1.525]
78 Although IFTM itself is an unsupervised method, in order to obtain more category-oriented topics, we initialize ziO to its object category to encourage topics to be shared by objects from the same class exclusively. [sent-283, score-0.6]
79 Note that when computing the affordance features (for both training and test data), no object labels are used. [sent-284, score-0.423]
80 In detail, we compute the affordance features as follows. [sent-285, score-0.376]
81 We set the affordance topics as the top K sampled topics θkO, ranked by the posterior distribution. [sent-286, score-1.395]
82 Then we use the histogram of sampled ziO as the affordance features for object i. [sent-288, score-0.462]
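A small sketch of this feature construction: rank affordance topics by the posterior mass (sample count) they receive across Gibbs sweeps, keep the top K, and histogram each object's sampled ziO over them. Shapes and names are illustrative assumptions.

```python
import numpy as np

def affordance_features(z_samples, K):
    """z_samples: (n_sweeps, n_objects) array of sampled z_i^O across Gibbs
    sweeps. Returns an (n_objects, K) histogram over the K most-used topics,
    which can be appended to each segment's node features."""
    z_samples = np.asarray(z_samples)
    topic_ids, counts = np.unique(z_samples, return_counts=True)
    top_k = topic_ids[np.argsort(counts)[::-1][:K]]  # rank by posterior mass
    feats = np.zeros((z_samples.shape[1], K))
    for j, t in enumerate(top_k):
        feats[:, j] = (z_samples == t).mean(axis=0)  # fraction of sweeps on t
    return feats
```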
83 This is how our affordance and human-configuration information is used in prediction, without using object-object context. [sent-304, score-0.641]
84 Here we combine the human context (from affordances and human configurations) with object-object context. [sent-331, score-0.765]
85 In detail, we append the node features of each segment with the affordance topic proportions derived from the learned object-affordance topics and learn the semantic labeling model as described in [19]. [sent-332, score-1.176]
86 Being able to hallucinate sensible human poses is critical for learning object affordances. [sent-342, score-0.353]
87 To verify that our algorithm can sample meaningful human poses, we plot a few top sampled poses in the scenes, shown in Fig. [sent-343, score-0.33]
88 In the first home scene, some sampled human poses are sitting on the edge of the bed while others are standing close to the desk (so that they have easy access to objects on the table or the shelf-rack). [sent-345, score-0.64]
89 It is these correctly sampled human poses that make it possible to learn correct object affordances. [sent-352, score-0.342]
90 Our goal is to learn object affordance for each class. [sent-355, score-0.423]
91 Fig. 6 shows the affordances from the top view and the side view, respectively, for typical object classes. [sent-357, score-0.394]
92 Note that while the affordance topics are unimodal, the affordance for each object is a mixture of these topics and thus could be multimodal and more expressive. [sent-365, score-1.833]
93 Table 1 shows that the affordance topic proportions (human context) as extra features boost the labeling performance. [sent-386, score-0.686]
94 First, when combining human context with the image and shape features, we see a consistent improvement in labeling performance in all evaluation metrics, regardless of the object-object context. [sent-387, score-0.391]
95 In fact, adding object-object context to human-object context was particularly helpful for small objects such as keyboards and books that are not always used by humans together, but still have a spatial correlation between them. [sent-390, score-0.436]
96 Similarly, it confuses cpuTop with chairBase because the CPU-top (placed on the ground) could also afford sitting human poses! [sent-396, score-0.314]
97 Conclusions. We presented infinite factored topic models (IFTM) that enabled us to model the generation of a scene containing objects through hallucinated (hidden) human configurations and object affordances, both modeled as topics. [sent-402, score-0.911]
98 Learning object arrangements in 3d scenes using human context. [sent-503, score-0.288]
99 Learning human activities and object affordances from rgb-d videos. [sent-540, score-0.556]
100 Modeling mutual context of object and human pose in human-object interaction activities. [sent-612, score-0.341]
wordName wordTfidf (topN-words)
[('topics', 0.49), ('affordance', 0.376), ('affordances', 0.347), ('iftm', 0.217), ('zio', 0.204), ('topic', 0.189), ('human', 0.162), ('dpmm', 0.123), ('factored', 0.118), ('humans', 0.118), ('keyboard', 0.116), ('configurations', 0.103), ('sitting', 0.096), ('hallucinated', 0.095), ('labeling', 0.094), ('poses', 0.094), ('context', 0.094), ('saxena', 0.087), ('zih', 0.082), ('relations', 0.072), ('infinite', 0.068), ('scene', 0.066), ('objects', 0.063), ('offices', 0.061), ('zoio', 0.061), ('jiang', 0.058), ('confuses', 0.056), ('bed', 0.055), ('cputop', 0.054), ('humanobject', 0.054), ('koppula', 0.054), ('gh', 0.054), ('attribute', 0.053), ('standing', 0.052), ('monitor', 0.052), ('hallucinate', 0.05), ('chairbase', 0.05), ('quilt', 0.05), ('gupta', 0.048), ('desk', 0.047), ('scenes', 0.047), ('object', 0.047), ('book', 0.046), ('fouhey', 0.045), ('chairs', 0.045), ('chair', 0.045), ('rgbd', 0.044), ('micro', 0.043), ('pillow', 0.043), ('dirichlet', 0.042), ('keyboards', 0.041), ('objectaffordance', 0.041), ('objectobject', 0.041), ('parsimonious', 0.041), ('parsimony', 0.041), ('zhih', 0.041), ('hidden', 0.041), ('ijrr', 0.041), ('delaitre', 0.041), ('go', 0.04), ('office', 0.04), ('sampled', 0.039), ('macro', 0.038), ('xi', 0.038), ('mixture', 0.038), ('pose', 0.038), ('types', 0.037), ('distribution', 0.037), ('cornell', 0.037), ('bedside', 0.036), ('stepping', 0.036), ('meaningful', 0.035), ('indoor', 0.034), ('tabletop', 0.034), ('yun', 0.034), ('arrangements', 0.032), ('home', 0.032), ('cpus', 0.032), ('activity', 0.031), ('configuration', 0.031), ('ont', 0.03), ('floor', 0.029), ('relationships', 0.028), ('anand', 0.028), ('obj', 0.028), ('jo', 0.028), ('rne', 0.028), ('wall', 0.028), ('xj', 0.027), ('proportions', 0.027), ('watching', 0.027), ('explanation', 0.027), ('furniture', 0.026), ('sampling', 0.026), ('distributions', 0.026), ('reasoning', 0.026), ('books', 0.026), ('kh', 0.026), ('lim', 0.025), ('hi', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
Author: Yun Jiang, Hema Koppula, Ashutosh Saxena
Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
2 0.11789064 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models
Author: Yibiao Zhao, Song-Chun Zhu
Abstract: Indoor functional objects exhibit large view and appearance variations, and thus are difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. “a chair to sit on”; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.
3 0.11135226 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
Author: Gangqiang Zhao, Junsong Yuan, Gang Hua
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo and the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as Latent Dirichlet Allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignores important co-occurrence information among the local features. We show that such data-driven, bottom-up co-occurrence information can conveniently be incorporated in LDA with a Gaussian Markov prior, which combines top-down probabilistic topic modeling with bottom-up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, view-point, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when comparing with topic models without such priors.
4 0.10421123 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
Author: Luca Del_Pero, Joshua Bowdish, Bonnie Kermgard, Emily Hartley, Kobus Barnard
Abstract: We develop a comprehensive Bayesian generative model for understanding indoor scenes. While it is common in this domain to approximate objects with 3D bounding boxes, we propose using strong representations with finer granularity. For example, we model a chair as a set of four legs, a seat and a backrest. We find that modeling detailed geometry improves recognition and reconstruction, and enables more refined use of appearance for scene understanding. We demonstrate this with a new likelihood function that rewards 3D object hypotheses whose 2D projection is more uniform in color distribution. Such a measure would be confused by background pixels if we used a bounding box to represent a concave object like a chair. Complex objects are modeled using a set of re-usable 3D parts, and we show that this representation captures much of the variation among object instances with relatively few parameters. We also designed specific data-driven inference mechanisms for each part that are shared by all objects containing that part, which helps make inference transparent to the modeler. Further, we show how to exploit contextual relationships to detect more objects, by, for example, proposing chairs around and underneath tables. We present results showing the benefits of each of these innovations. The performance of our approach often exceeds that of state-of-the-art methods on the two tasks of room layout estimation and object recognition, as evaluated on two benchmark data sets used in this domain. [Figure caption] 1) Detailed geometric models, such as tables with legs and top (bottom left), provide better reconstructions than plain boxes (top right), when supported by image features such as geometric context [5] (top middle), or an approach to using color introduced here. 2) Non-convex models allow for complex configurations, such as a chair under a table (bottom middle). 3) 3D contextual relationships, such as chairs being around a table, allow identifying objects supported by little image evidence, like the chair behind the table (bottom right). Best viewed in color.
5 0.10402121 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh
Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug in” human subjects for each of the various components in a state-of-the-art conditional random field model (CRF) on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.
6 0.10100137 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
8 0.090989478 406 cvpr-2013-Spatial Inference Machines
9 0.084019274 1 cvpr-2013-3D-Based Reasoning with Blocks, Support, and Stability
10 0.082240917 78 cvpr-2013-Capturing Layers in Image Collections with Componential Models: From the Layered Epitome to the Componential Counting Grid
11 0.081996061 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
12 0.080610074 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
13 0.079720154 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
14 0.074285753 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts
15 0.073860385 40 cvpr-2013-An Approach to Pose-Based Action Recognition
16 0.073190048 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
17 0.065167002 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
18 0.06497971 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings
19 0.064817473 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
20 0.064633347 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
topicId topicWeight
[(0, 0.153), (1, -0.029), (2, 0.008), (3, -0.06), (4, 0.024), (5, 0.015), (6, -0.023), (7, 0.099), (8, 0.003), (9, 0.008), (10, 0.008), (11, 0.011), (12, -0.014), (13, 0.008), (14, -0.028), (15, 0.021), (16, 0.046), (17, 0.109), (18, -0.051), (19, -0.057), (20, 0.016), (21, 0.021), (22, 0.067), (23, -0.014), (24, -0.019), (25, -0.023), (26, 0.009), (27, 0.012), (28, -0.031), (29, -0.038), (30, -0.041), (31, -0.016), (32, -0.044), (33, 0.011), (34, 0.022), (35, -0.03), (36, -0.059), (37, 0.11), (38, -0.006), (39, -0.089), (40, -0.037), (41, -0.016), (42, 0.021), (43, -0.0), (44, 0.047), (45, 0.099), (46, 0.045), (47, -0.024), (48, 0.027), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.9316566 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
Author: Yun Jiang, Hema Koppula, Ashutosh Saxena
Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
2 0.72014093 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese
Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
3 0.7114225 381 cvpr-2013-Scene Parsing by Integrating Function, Geometry and Appearance Models
Author: Yibiao Zhao, Song-Chun Zhu
Abstract: Indoor functional objects exhibit large view and appearance variations, and thus are difficult to recognize with the traditional appearance-based classification paradigm. In this paper, we present an algorithm to parse indoor images based on two observations: i) The functionality is the most essential property to define an indoor object, e.g. “a chair to sit on”; ii) The geometry (3D shape) of an object is designed to serve its function. We formulate the nature of the object function into a stochastic grammar model. This model characterizes a joint distribution over the function-geometry-appearance (FGA) hierarchy. The hierarchical structure includes a scene category, functional groups, functional objects, functional parts and 3D geometric shapes. We use a simulated annealing MCMC algorithm to find the maximum a posteriori (MAP) solution, i.e. a parse tree. We design four data-driven steps to accelerate the search in the FGA space: i) group the line segments into 3D primitive shapes, ii) assign functional labels to these 3D primitive shapes, iii) fill in missing objects/parts according to the functional labels, and iv) synthesize 2D segmentation maps and verify the current parse tree by the Metropolis-Hastings acceptance probability. The experimental results on several challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing algorithms from segmentation and 3D recovery to functional object recognition, but also yields improved overall performance.
4 0.70583111 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
Author: C. Lawrence Zitnick, Devi Parikh
Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
5 0.66783708 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun
Abstract: We are interested in holistic scene understanding where images are accompanied with text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent as well as semantic segmentation, and employs text as well as image information as input. We automatically parse the sentences and extract objects and their relationships, and incorporate them into the model, both via potentials as well as by re-ranking candidate detections. We demonstrate the effectiveness of our approach in the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual only model and detection improvements of 5% AP over deformable part-based models [8].
6 0.65565729 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
8 0.62215239 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
9 0.59066224 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
10 0.58537412 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings
11 0.58522964 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling
12 0.56376052 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
13 0.56323498 440 cvpr-2013-Tracking People and Their Objects
14 0.54206729 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
16 0.53131342 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
17 0.53087199 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
18 0.53066826 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
19 0.51832557 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
20 0.5179382 406 cvpr-2013-Spatial Inference Machines
topicId topicWeight
[(10, 0.107), (12, 0.03), (16, 0.015), (19, 0.269), (26, 0.043), (33, 0.226), (39, 0.011), (67, 0.062), (69, 0.064), (80, 0.014), (87, 0.063)]
simIndex simValue paperId paperTitle
1 0.85937196 463 cvpr-2013-What's in a Name? First Names as Facial Attributes
Author: Huizhong Chen, Andrew C. Gallagher, Bernd Girod
Abstract: This paper introduces a new idea in describing people using their first names, i.e., the name assigned at birth. We show that describing people in terms of similarity to a vector of possible first names is a powerful description of facial appearance that can be used for face naming and building facial attribute classifiers. We build models for 100 common first names used in the United States and, for each pair, construct a pairwise first-name classifier. These classifiers are built using training images downloaded from the internet, with no additional user interaction. This gives our approach important advantages in building practical systems that do not require additional human intervention for labeling. We use the scores from each pairwise name classifier as a set of facial attributes. We show several surprising results. Our name attributes predict the correct first names of test faces at rates far greater than chance. The name attributes are applied to gender recognition and to age classification, outperforming state-of-the-art methods with all training images automatically gathered from the internet.
2 0.81738132 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles
Author: Patrick Lucey, Alina Bialkowski, Peter Carr, Stuart Morgan, Iain Matthews, Yaser Sheikh
Abstract: In this paper, we describe a method to represent and discover adversarial group behavior in a continuous domain. In comparison to other types of behavior, adversarial behavior is heavily structured as the location of a player (or agent) is dependent both on their teammates and adversaries, in addition to the tactics or strategies of the team. We present a method which can exploit this relationship through the use of a spatiotemporal basis model. As players constantly change roles during a match, we show that employing a “role-based” representation instead of one based on player “identity” can best exploit the playing structure. As vision-based systems currently do not provide perfect detection/tracking (e.g. missed or false detections), we show that our compact representation can effectively “denoise” erroneous detections as well as enabling temporal analysis, which was previously prohibitive due to the dimensionality of the signal. To evaluate our approach, we used a fully instrumented field-hockey pitch with 8 fixed high-definition (HD) cameras and evaluated our approach on approximately 200,000 frames of data from a state-of-the-art real-time player detector and compare it to manually labelled data.
same-paper 3 0.79786068 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
Author: Yun Jiang, Hema Koppula, Ashutosh Saxena
Abstract: For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
4 0.74399537 66 cvpr-2013-Block and Group Regularized Sparse Modeling for Dictionary Learning
Author: Yu-Tseh Chi, Mohsen Ali, Ajit Rajwade, Jeffrey Ho
Abstract: This paper proposes a dictionary learning framework that combines the proposed block/group (BGSC) or reconstructed block/group (R-BGSC) sparse coding schemes with the novel Intra-block Coherence Suppression Dictionary Learning (ICS-DL) algorithm. An important and distinguishing feature of the proposed framework is that all dictionary blocks are trained simultaneously with respect to each data group while the intra-block coherence being explicitly minimized as an important objective. We provide both empirical evidence and heuristic support for this feature that can be considered as a direct consequence of incorporating both the group structure for the input data and the block structure for the dictionary in the learning process. The optimization problems for both the dictionary learning and sparse coding can be solved efficiently using block-gradient descent, and the details of the optimization algorithms are presented. We evaluate the proposed methods using well-known datasets, and favorable comparisons with state-of-the-art dictionary learning methods demonstrate the viability and validity of the proposed framework.
5 0.7352913 377 cvpr-2013-Sample-Specific Late Fusion for Visual Category Recognition
Author: Dong Liu, Kuan-Ting Lai, Guangnan Ye, Ming-Syan Chen, Shih-Fu Chang
Abstract: Late fusion addresses the problem of combining the prediction scores of multiple classifiers, in which each score is predicted by a classifier trained with a specific feature. However, the existing methods generally use a fixed fusion weight for all the scores of a classifier, and thus fail to optimally determine the fusion weight for the individual samples. In this paper, we propose a sample-specific late fusion method to address this issue. Specifically, we cast the problem into an information propagation process which propagates the fusion weights learned on the labeled samples to individual unlabeled samples, while enforcing that positive samples have higher fusion scores than negative samples. In this process, we identify the optimal fusion weights for each sample and push positive samples to top positions in the fusion score rank list. We formulate our problem as an L∞ norm constrained optimization problem and apply the Alternating Direction Method of Multipliers for the optimization. Extensive experiment results on various visual categorization tasks show that the proposed method consistently and significantly beats the state-of-the-art late fusion methods. To the best of our knowledge, this is the first method supporting sample-specific fusion weight learning.
6 0.73335463 64 cvpr-2013-Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification
7 0.72145683 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
8 0.70763916 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models
9 0.70735079 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
10 0.7021355 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
11 0.7006681 414 cvpr-2013-Structure Preserving Object Tracking
12 0.70059985 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
13 0.6996088 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
14 0.69958967 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
15 0.69936419 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
16 0.69936097 325 cvpr-2013-Part Discovery from Partial Correspondence
17 0.69911367 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses
18 0.69865578 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
19 0.69768953 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
20 0.69765174 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects