iccv iccv2013 iccv2013-46 knowledge-graph by maker-knowledge-mining

46 iccv-2013-Allocentric Pose Estimation


Source: pdf

Author: M. José Antonio, Luc De Raedt, Tinne Tuytelaars

Abstract: The task of object pose estimation has been a challenge since the early days of computer vision. To estimate the pose (or viewpoint) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. Surprisingly, informative features provided by other, external elements in the scene, have so far mostly been ignored. At the same time, contextual cues have been shown to be of great benefit for related tasks such as object detection or action recognition. In this paper, we explore how information from other objects in the scene can be exploited for pose estimation. In particular, we look at object configurations. We show that, starting from noisy object detections and pose estimates, exploiting the estimated pose and location of other objects in the scene can help to estimate the objects’ poses more accurately. We explore both a camera-centered as well as an object-centered representation for relations. Experiments on the challenging KITTI dataset show that object configurations can indeed be used as a complementary cue to appearance-based pose estimation. In addition, object-centered relational representations can also assist object detection.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 [Affiliations: KU Leuven, ESAT-PSI, iMinds; Luc De Raedt, KU Leuven, CS-DTAI; Tinne Tuytelaars, KU Leuven, ESAT-PSI, iMinds] Abstract: The task of object pose estimation has been a challenge since the early days of computer vision. [sent-2, score-0.505]

2 To estimate the pose (or viewpoint) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. [sent-3, score-0.465]

3 In this paper, we explore how information from other objects in the scene can be exploited for pose estimation. [sent-6, score-0.495]

4 We show that, starting from noisy object detections and pose estimates, exploiting the estimated pose and location of other objects in the scene can help to estimate the objects’ poses more accurately. [sent-8, score-0.984]

5 Experiments on the challenging KITTI dataset show that object configurations can indeed be used as a complementary cue to appearance-based pose estimation. [sent-10, score-0.476]

6 In addition, object-centered relational representations can also assist object detection. [sent-11, score-0.425]

7 Introduction Object pose or viewpoint estimation is an important problem for a wide range of applications, including robotics and road safety systems. [sent-13, score-0.366]

8 Yet, to the best of our knowledge, context information has not yet been exploited for pose estimation. [sent-19, score-0.388]

9 Imagine you are given the task of predicting the pose of the objects below the yellow circles in Fig. [sent-20, score-0.42]

10 First, we need a method to define informative relations between objects. [sent-27, score-0.439]

11 These relations should be robust to viewpoint changes and general enough to be applicable to different classes of objects (i. [sent-28, score-0.466]

12 In this paper, we explore how information from other objects in the scene can be exploited for the task of pose estimation. [sent-32, score-0.532]

13 We capture statistics of typical object configurations using kernel density estimation, and combine this information using collective classification, more specifically a Relational Neighbor classifier [23]. [sent-37, score-0.398]

14 The main contributions of our work are: First, we show that considering configurations between objects can be beneficial for pose estimation: the proposed collective classification method complements state-of-the-art local pose estimation methods. [sent-38, score-1.05]

15 object-centered or camera-centered, used to define relations between objects for object pose estimation and detection. [sent-41, score-0.964]

16 To our knowledge this is the first attempt to exploit relations defined between object entities via collective classification for the task of pose estimation. [sent-42, score-0.988]

17 The following three sections show how we define and learn relations between objects in the scene, and how we combine them with the evidence from local detectors. [sent-45, score-0.563]

18 In this paper we focus on object classes with defined shape and appearance, the “Things”, and methods exploiting relations and configurations between them to predict their pose. [sent-52, score-0.559]

19 In the traditional processing pipeline for pose estimation, first, candidate regions to host object instances are proposed. [sent-55, score-0.4]

20 Similar to these works we define relations between scene elements. [sent-66, score-0.465]

21 However, instead of defining relations between different scene element types such as points, regions or objects, we focus on relations between object instances. [sent-67, score-0.971]

22 In recent years, learning relations between “Things” has gained popularity in the computer vision community, particularly to assist the task of object detection. [sent-69, score-0.579]

23 More recently, [19, 30] use discriminant relations between objects to learn the collective appearance of related objects in order to guide the detection of the individual objects. [sent-75, score-0.735]

24 Similar to these works, we learn relations between object instances. [sent-76, score-0.483]

25 in-frontof, close, near, far) we use continuous measures to define relations between entities as in [4, 28, 29]. [sent-81, score-0.438]

26 Finally, different from existing work, we explore the use of relations defined in an object-centered Frame of Reference. [sent-82, score-0.405]

27 In order to measure the level to which an object fits in a group of objects, first, we need to define relations between objects. [sent-92, score-0.548]

28 We define these relations in two different ways, by changing the location and orientation of the frame of reference (FoR). [sent-94, score-0.502]

29 First an object oi is selected and the frame of reference is centered on it with the Z-axis facing in the frontal direction of the object (see Figure 2b). [sent-97, score-0.516]

30 Then, we measure the relative location and pose of each of the other objects oj, one at a time, producing a rela[...] (Figure 2 caption fragment: [...] relations, b) object-centered relations.) [sent-98, score-0.447]

31 For an image with m objects a total of m(m - 1) pairwise relations are extracted. [sent-101, score-0.466]
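The object-centered relation extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: objects are assumed to lie on the ground plane with coordinates (x, z) and an azimuth theta in radians, the frontal direction of an object is assumed to be (sin theta, cos theta), and the names `Obj`, `wrap_angle` and `object_centered_relations` are hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Obj(NamedTuple):
    x: float      # lateral ground-plane position (assumed layout)
    z: float      # depth ground-plane position (assumed layout)
    theta: float  # azimuth in radians; frontal direction is (sin, cos)


def wrap_angle(a: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def object_centered_relations(
    objs: List[Obj],
) -> List[Tuple[int, int, float, float, float]]:
    """Return one relation (i, j, rx, rz, dtheta) per ordered pair i != j.

    The frame of reference is centered on object i, with its Z-axis
    pointing along the frontal direction of object i.
    """
    rels = []
    for i, oi in enumerate(objs):
        c, s = math.cos(oi.theta), math.sin(oi.theta)
        for j, oj in enumerate(objs):
            if i == j:
                continue
            dx, dz = oj.x - oi.x, oj.z - oi.z
            rx = c * dx - s * dz   # component along the object's right side
            rz = s * dx + c * dz   # component along the object's front
            rels.append((i, j, rx, rz, wrap_angle(oj.theta - oi.theta)))
    return rels
```

With m objects the double loop yields m(m - 1) ordered relations, which matches the count stated in the text. A camera-centered variant would skip the rotation and translation and measure everything in the camera frame.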

32 For these, we use the same relational descriptor as above, yet with everything measured relative to a frame of reference attached to the camera (see Figure 2a). [sent-104, score-0.371]

33 Allocentric Pose Estimation With allocentric pose estimation, we refer to the task of estimating the pose θi of an object oi purely based on the objects in its neighborhood Ni. [sent-109, score-1.471]

34 In our experiments, Ni is the set containing all the other objects oj in the scene. [sent-110, score-0.422]

35 This pose is estimated as follows: $\theta_i^* = \arg\max_{\theta_i} \, pRN(o_i \mid N_i)$ (1) where θi belongs to the discrete set of possible poses and pRN(θi | Ni) is a probabilistic Relational Neighbor classifier (pRN) as introduced in [23]. [sent-111, score-0.468]

36 This classifier operates in a node-centric fashion meaning that it processes one object oi at a time based on a set of m objects oj in its neighborhood Ni. [sent-114, score-0.911]

37 $pRN(o_i \mid N_i) = \frac{1}{Z} \sum_{o_j \in N_i} p(o_i \mid o_j) \, p(\hat{o}_j)$ (2) This classifier is composed of three terms: p(oi | oj), which expresses the influence of the neighboring object oj on the unknown object oi; the term p(ôj), which measures the confidence in the neighbor oj; and the normalization term Z = ? [sent-116, score-1.002]
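The node-centric scoring of Eq. 1 and Eq. 2 can be sketched as below. This is a hedged illustration rather than the paper's code: the pairwise term p(oi | oj) is passed in as a caller-supplied function, each neighbor is represented as a hypothetical (pose, confidence) pair, and Z is taken to be the sum of the unnormalized scores over candidate poses. The source's definition of Z is truncated, so that choice is an assumption.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical neighbor layout: (estimated pose of o_j, confidence p(o_j hat)).
Neighbor = Tuple[float, float]


def prn_scores(
    poses: Sequence[float],
    neighbors: Sequence[Neighbor],
    pairwise: Callable[[float, float], float],
) -> Dict[float, float]:
    """Score each candidate pose as (1/Z) * sum_j p(o_i | o_j) * p(o_j hat)."""
    raw = {
        theta: sum(pairwise(theta, nb_pose) * conf for nb_pose, conf in neighbors)
        for theta in poses
    }
    z = sum(raw.values())  # assumed normalization over candidate poses
    if z == 0.0:
        # No relational evidence: fall back to a uniform distribution.
        return {theta: 1.0 / len(poses) for theta in poses}
    return {theta: v / z for theta, v in raw.items()}


def allocentric_pose(
    poses: Sequence[float],
    neighbors: Sequence[Neighbor],
    pairwise: Callable[[float, float], float],
) -> float:
    """Return the candidate pose with the highest relational score (Eq. 1)."""
    scores = prn_scores(poses, neighbors, pairwise)
    return max(scores, key=scores.get)
```

In practice the pairwise function would be backed by the learned relation statistics; here it is left abstract so the sketch stays independent of how those statistics are estimated.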

38 the same pose (a) and opposite pose (b) respectively. [sent-118, score-0.627]

39 3 to define pairwise relations rij between the hypotheses reported for each image. [sent-125, score-0.707]

40 One group contains relations in which both participants are TP hypotheses and the second group contains relations in which at least one participant is a FP hypothesis. [sent-127, score-1.006]

41 Finally, the relations on these groups are used via Kernel Density Estimation (KDE) to estimate p(rij |oi) and p(rij |¬oi) respectively. [sent-128, score-0.413]

42 For instance, when applied on top of OC relations, it effectively encodes that cars with the same pose tend to be one behind the other as when driving in the same lane, while cars with opposite poses are more likely to be driving on the left - as in opposite lanes (see figure 3). [sent-130, score-0.59]

43 The priors p(oi) and p(¬oi) of the object occurring or not at the given location, are estimated as the percentage of TP hypotheses and FP hypotheses in the validation set, respectively. [sent-131, score-0.45]
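The two likelihoods and the priors described above can be combined as in the following sketch. It makes several simplifying assumptions that are not taken from the source: the relation rij is reduced to a single scalar, a fixed-bandwidth Gaussian kernel is used for the density estimate, and the likelihoods and priors are combined with Bayes' rule. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(samples: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Build a 1-D Gaussian kernel density estimate from training samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def posterior_object(
    r: float,
    kde_tp: Callable[[float], float],
    kde_fp: Callable[[float], float],
    prior_tp: float,
) -> float:
    """Posterior p(o_i | r_ij) from p(r_ij | o_i), p(r_ij | not o_i) and the priors."""
    num = kde_tp(r) * prior_tp
    den = num + kde_fp(r) * (1.0 - prior_tp)
    # With no density support at all, fall back to the prior.
    return num / den if den > 0.0 else prior_tp
```

Here `kde_tp` would be fitted on relations where both participants are TP hypotheses and `kde_fp` on relations involving at least one FP hypothesis, with `prior_tp` set to the fraction of TP hypotheses in the validation set.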

44 In this paper we focus mainly on the task of object pose estimation and as a side experiment in re-ranking object detections (see Sec. [sent-139, score-0.638]

45 Pose Estimation: For the task of object pose estimation, we estimate p(ôj) ∼ p(θ̂j) aiming to compensate for the noise in the poses used to compute rij. [sent-142, score-0.667]

46 Since the scores given as output by state-of-the-art pose aware detectors are indications of the localization of the object rather than of its pose, we exploit the information from the confusion matrix of the pose estimator. [sent-143, score-0.807]

47 Given a 3D object oj with estimated continuous pose (see Sec. [sent-144, score-0.737]

48 While the local classifier (lc) pulls the decision towards individual features, the relational classifier (rc) (Eq. [sent-159, score-0.514]

49 Implementation Details The focus of this paper is on the study of how relations between objects can assist the task of object pose estimation. [sent-171, score-0.962]

50 For this reason, rather than proposing our own object detector and pose estimator, we use state-of-the-art detectors to acquire evidence of objects in the scene. [sent-172, score-0.654]

51 Both methods are based on the popular deformable parts model of [26], and both of them jointly tackle the problems of object detection and pose estimation. [sent-174, score-0.441]

52 These detectors, separately, feed our framework with confidence scores, locations (2D bounding box) and poses of object hypotheses discretized into 8 and 16 partitions respectively. [sent-176, score-0.379]

53 To measure the certainty of this estimation during testing, we perform a linear interpolation of the estimated azimuth angle using the closest discrete pose angles and the confusion table of the local pose estimator as discussed in Sec. [sent-192, score-0.866]
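One possible reading of this interpolation step is sketched below. The details are assumptions, since the referenced section is truncated in the source: discrete pose k is assumed to be centered at k * 360 / n degrees, the per-pose confidence is taken as the row-normalized diagonal of the confusion table, and the confidence of a continuous azimuth is the linear blend of the two closest discrete poses. The name `pose_confidence` is hypothetical.

```python
import math
from typing import List


def pose_confidence(azimuth_deg: float, confusion: List[List[float]]) -> float:
    """Linearly interpolate per-pose accuracy between the two nearest pose bins.

    confusion[k][l] counts samples of true pose k predicted as pose l.
    """
    n = len(confusion)
    step = 360.0 / n
    pos = (azimuth_deg % 360.0) / step      # fractional bin index
    lo = int(math.floor(pos)) % n           # nearest bin center below
    hi = (lo + 1) % n                       # nearest bin center above (wraps)
    w_hi = pos - math.floor(pos)            # weight of the upper bin

    def accuracy(k: int) -> float:
        row_sum = sum(confusion[k])
        return confusion[k][k] / row_sum if row_sum > 0 else 0.0

    return (1.0 - w_hi) * accuracy(lo) + w_hi * accuracy(hi)
```

The wrap-around in `hi` keeps azimuths just below 360 degrees blending toward pose 0 rather than running off the end of the table.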

54 Since one of our objectives is to evaluate the influence of the frame of reference for defining informative relations, we define relations using both CC and OC FoRs. [sent-195, score-0.574]

55 For object-centered relations an additional step is required where the FoR should be centered in the trajector object before any relation attribute can be measured (see Section 3). [sent-197, score-0.483]

56 Dataset Most pose estimation datasets do not include groups of objects in images. [sent-205, score-0.451]

57 We evaluate the influence of the FoR when defining relations between objects in both ideal (annotated) and real (estimated) world settings. [sent-209, score-0.618]

58 For the ideal setting, the dataset provides 3D location and pose vectors for the objects. [sent-210, score-0.406]

59 For the real setting, it provides stereo pairs for each scene and object annotations that allow us to build methods to learn and evaluate the configurations between object instances. [sent-211, score-0.382]

60 Additionally, the multiple cars occurring in each image provide a challenging realistic scenario with occlusions and clutter that will be useful to evaluate our proposed allocentric pose estimator. [sent-212, score-0.721]

61 The first quarter of the set is used for training the relational classifier and estimating the pose estimator confusion matrix. [sent-215, score-0.795]

62 Ideal Scenario Experiment: The first experiment aims at answering the question: “How much information about the object’s pose can be obtained based on the locations and poses of objects in its neighborhood? [sent-231, score-0.517]

63 To this end, we consider the ideal scenario, where the local object detector and pose estimator are 100% accurate for the objects in the neighborhood. [sent-233, score-0.707]

64 In this scenario all the objects of interest in the scene have been detected and their pose has been accurately predicted. [sent-234, score-0.497]

65 The pose of each object is then predicted based on the ground truth locations and poses in its neighborhood. [sent-236, score-0.503]

66 The objective of this experiment is to present the upper limit of the performance that the Relational Classifier (RC) used for allocentric pose estimation can achieve in an ideal setting on the current dataset. [sent-237, score-0.817]

67 We compare 2 ideal allocentric pose estimators that are able to predict 8 and 16 poses respectively. [sent-238, score-0.813]

68 Discussion: Table 1 shows that, in an ideal scenario, the allocentric pose estimator takes advantage of finer discretization of object poses. [sent-239, score-0.88]

69 This experiment shows the upper limits in performance that can be expected from allocentric pose estimation using local detectors [12, 21]. [sent-241, score-0.809]

70 At the same time, this upper bound is similar or even higher than what current state-of-the-art local detectors can obtain (see below), and therefore using context information to improve pose estimation results seems promising. [sent-243, score-0.484]

71 We define object-centered relations between the 3D hypotheses in the scene (i. [sent-245, score-0.639]

72 the 2D object detection back projected onto the ground plane ) and perform pose estimation based on the method proposed in Section 3. [sent-247, score-0.536]

73 The objective of this experiment is to evaluate: a) the performance of the local pose estimators, b) the performance of pose estimation based on object relations alone, and c) the changes in performance brought by the method proposed in Sec. [sent-248, score-1.249]

74 Given a set of 3D hypotheses oi we suppress all the hypotheses that are closer than a threshold value t. [sent-268, score-0.608]
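The distance-based suppression can be sketched as a greedy procedure. The source only states that hypotheses closer than a threshold t are suppressed; processing them in decreasing score order and measuring Euclidean distance on the ground plane are assumptions made here for illustration, as are the tuple layout and the name `nms_3d`.

```python
import math
from typing import List, Tuple

# Hypothetical hypothesis layout: (x, z, score) on the ground plane.
Hyp = Tuple[float, float, float]


def nms_3d(hyps: List[Hyp], t: float) -> List[Hyp]:
    """Keep the best-scoring hypotheses that are at least t apart."""
    kept: List[Hyp] = []
    for h in sorted(hyps, key=lambda h: h[2], reverse=True):
        # Keep h only if no already-kept hypothesis lies closer than t.
        if all(math.hypot(h[0] - k[0], h[1] - k[1]) >= t for k in kept):
            kept.append(h)
    return kept
```

For example, with t = 1.0 two hypotheses half a meter apart collapse to the higher-scoring one, while a third hypothesis several meters away is untouched.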

75 Discussion: The results of this experiment (see Table 2) show that it is possible, even in a real scenario, to estimate, at least to some extent, the pose of objects by looking at the poses and locations of other objects, even if these poses and locations are themselves noisy. [sent-272, score-0.765]

76 While the performance of the relational classifier alone is lower than the one obtained by the local classifier, it is significantly above the chance level (12. [sent-273, score-0.435]

77 Moreover, the combination of both local and relational classifiers brings a mean improvement, over the local classifier, of 2. [sent-276, score-0.42]

78 Additionally, this shows that information encoded by our allocentric pose estimator is complementary to the local detectors and can help in scenarios where evidence of multiple object instances can be obtained. [sent-282, score-0.939]

79 We additionally tried a variation of this setting where pose information is ignored when defining relations between object instances. [sent-286, score-0.909]

80 As expected, allocentric pose estimation in this setting has lower performance. [sent-288, score-0.717]

81 In fact, its performance is close to chance level and is 15% lower than the setting where relations include pose information. [sent-289, score-0.756]

82 Given these observations, we conclude that object pose information plays an important role when modeling configurations between object instances and that it is an intrinsic feature that must be considered in future algorithms that take into account contextual features for reasoning. [sent-290, score-0.671]

83 This is also strong evidence that we are dealing with a true collective classification problem, as the pose of one object depends on the pose of the others. [sent-291, score-0.877]

84 Per set: Top image, hypotheses reported by the detector; bottom left, in bird’s eye view, initial pose prediction given by the standard pose estimator; bottom right, in bird’s eye view, pose prediction when considering object configurations. [sent-292, score-1.17]

85 2 Object Verification While in this paper we focus on the task of pose estimation, the configuration of objects and their poses in a neighborhood around a given object can also be exploited for object verification, i. [sent-297, score-0.767]

86 We define Object Verification as the task of re-ranking the set of hypotheses given by a detector in such a way that the most likely hypotheses get a higher score. [sent-301, score-0.446]

87 For this task we need a relational classifier that predicts the occurrence of an object oi given the objects in its neighborhood Ni. [sent-302, score-0.931]
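Re-ranking with a relational score can be sketched as follows. The combination rule is not recoverable from the extracted sentences, so the convex combination with weight `w` used here is purely an assumption for illustration; the tuple layout and the name `rerank` are hypothetical as well.

```python
from typing import List, Tuple

# Hypothetical layout: (hypothesis id, local detector score, relational score),
# with both scores assumed to be scaled to [0, 1].
Scored = Tuple[str, float, float]


def rerank(hyps: List[Scored], w: float = 0.5) -> List[Scored]:
    """Sort hypotheses by (1 - w) * local + w * relational, best first."""
    return sorted(
        hyps,
        key=lambda h: (1.0 - w) * h[1] + w * h[2],
        reverse=True,
    )
```

Setting `w = 0` reproduces the detector's original ranking, which gives a simple baseline for checking whether the relational evidence changes the ordering at all.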

88 Additionally, we report the performance of using traditional camera-centered relations and our proposed object-centered relations. [sent-312, score-0.472]

89 Considering the fact that we are reasoning in 3D space, we repeat the previous object verification experiment, adding a pre-processing 3DNMS step applied on the 3D hypotheses (Table 4). [sent-342, score-0.367]

90 Discussion: The change in performance brought by the combination of local and relational classifiers, over the local classifier alone, confirms that indeed the proposed relations assist the task of object verification. [sent-343, score-1.039]

91 Future work will address reasoning about the volumetric properties of objects and the effect of the re-estimated poses on the aspect ratios of the hypotheses initially predicted by the detector. [sent-351, score-0.395]

92 3 Object-centered or Camera-centered To analyze the effect of the FoR when defining relations between objects, we evaluated the performance of the relational classifier with camera-centered relations and object-centered relations respectively (Sec. [sent-354, score-1.615]

93 2 involving these types of relations and will provide us with an overview of their effect in such tasks. [sent-360, score-0.381]

94 On the pose estimation problem, previous experiments proved that pose information plays an important role when defining relations. [sent-372, score-0.717]

95 Here object-centered relations bring an improvement of ∼2% over their camera-centered counterparts (Table 5). [sent-373, score-0.381]

96 Even when, in isolation, allocentric pose estimation does not solve the object pose estimation problem, experimental results suggest that the proposed method complements local pose estimators. [sent-376, score-1.534]

97 Experiments also prove the relevance of pose information when describing relations between object instances; a feature that has been largely ignored in existing work that exploits contextual information, even in the context of object verification. [sent-380, score-0.969]

98 This stresses the use of relative pose information as a feature to describe object relations. [sent-381, score-0.425]

99 Complementing this, experiments show how defining relations from an object-centered perspective can increase performance in object pose estimation and detection. [sent-383, score-0.902]

100 Deformable part models revisited: A performance evaluation for object category pose estimation. [sent-541, score-0.4]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('relations', 0.381), ('oj', 0.337), ('allocentric', 0.32), ('pose', 0.298), ('relational', 0.264), ('oi', 0.26), ('hypotheses', 0.174), ('collective', 0.143), ('prn', 0.137), ('kitti', 0.125), ('rij', 0.122), ('mppe', 0.122), ('object', 0.102), ('classifier', 0.094), ('cameracentered', 0.091), ('estimator', 0.091), ('objects', 0.085), ('poses', 0.076), ('configurations', 0.076), ('ccrel', 0.069), ('ocrel', 0.069), ('ryij', 0.069), ('ideal', 0.069), ('estimation', 0.068), ('objectcentered', 0.061), ('detectors', 0.061), ('scenario', 0.06), ('sj', 0.06), ('contextual', 0.06), ('reasoning', 0.06), ('fp', 0.059), ('rc', 0.059), ('assist', 0.059), ('occurrence', 0.056), ('things', 0.055), ('scene', 0.054), ('defining', 0.053), ('complements', 0.051), ('estimators', 0.05), ('lc', 0.05), ('tp', 0.049), ('verification', 0.049), ('stuff', 0.048), ('confusion', 0.048), ('stereo', 0.048), ('ku', 0.047), ('chance', 0.046), ('allocentrism', 0.046), ('additionally', 0.044), ('cars', 0.043), ('detection', 0.041), ('ni', 0.04), ('brought', 0.04), ('location', 0.039), ('leuven', 0.039), ('purely', 0.038), ('geiger', 0.038), ('task', 0.037), ('social', 0.037), ('bao', 0.036), ('evidence', 0.036), ('reason', 0.035), ('group', 0.035), ('driving', 0.034), ('oc', 0.034), ('exploited', 0.034), ('intrinsic', 0.033), ('neighborhood', 0.033), ('estimate', 0.032), ('azimuth', 0.032), ('hypothesis', 0.032), ('elements', 0.032), ('setting', 0.031), ('local', 0.031), ('opposite', 0.031), ('pulls', 0.031), ('detector', 0.031), ('experiment', 0.031), ('influence', 0.03), ('jo', 0.03), ('kde', 0.03), ('iminds', 0.03), ('yet', 0.03), ('define', 0.03), ('informative', 0.028), ('pascal', 0.028), ('forsyth', 0.027), ('entities', 0.027), ('locations', 0.027), ('plane', 0.027), ('reference', 0.027), ('classifiers', 0.026), ('context', 0.026), ('hoiem', 0.026), ('rule', 0.026), ('influences', 0.025), ('relative', 0.025), ('frame', 0.025), ('psychology', 0.025), ('explore', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 46 iccv-2013-Allocentric Pose Estimation

2 0.15043455 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency

Author: Jiongxin Liu, Peter N. Belhumeur

Abstract: In this paper, we propose a novel approach for bird part localization, targeting fine-grained categories with wide variations in appearance due to different poses (including aspect and orientation) and subcategories. As it is challenging to represent such variations across a large set of diverse samples with tractable parametric models, we turn to individual exemplars. Specifically, we extend the exemplar-based models in [4] by enforcing pose and subcategory consistency at the parts. During training, we build pose-specific detectors scoring part poses across subcategories, and subcategory-specific detectors scoring part appearance across poses. At the testing stage, likely exemplars are matched to the image, suggesting part locations whose pose and subcategory consistency are well-supported by the image cues. From these hypotheses, part configuration can be predicted with very high accuracy. Experimental results demonstrate significant performance gains from our method on an extensive dataset: CUB-200-2011 [30], for both localization and classification tasks.

3 0.13702956 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image

Author: Jiyan Pan, Takeo Kanade

Abstract: Objects in a real world image cannot have arbitrary appearance, sizes and locations due to geometric constraints in 3D space. Such a 3D geometric context plays an important role in resolving visual ambiguities and achieving coherent object detection. In this paper, we develop a RANSAC-CRF framework to detect objects that are geometrically coherent in the 3D world. Different from existing methods, we propose a novel generalized RANSAC algorithm to generate global 3D geometry hypotheses from local entities such that outlier suppression and noise reduction are achieved simultaneously. In addition, we evaluate those hypotheses using a CRF which considers both the compatibility of individual objects under global 3D geometric context and the compatibility between adjacent objects under local 3D geometric context. Experiment results show that our approach compares favorably with the state of the art.

4 0.13682453 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation

Author: Joseph J. Lim, Hamed Pirsiavash, Antonio Torralba

Abstract: We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.

5 0.13363114 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion

Author: Ibrahim Radwan, Abhinav Dhall, Roland Goecke

Abstract: In this paper, an automatic approach for 3D pose reconstruction from a single image is proposed. The presence of human body articulation, hallucinated parts and cluttered background leads to ambiguity during the pose inference, which makes the problem non-trivial. Researchers have explored various methods based on motion and shading in order to reduce the ambiguity and reconstruct the 3D pose. The key idea of our algorithm is to impose both kinematic and orientation constraints. The former is imposed by projecting a 3D model onto the input image and pruning the parts, which are incompatible with the anthropomorphism. The latter is applied by creating synthetic views via regressing the input view to multiple oriented views. After applying the constraints, the 3D model is projected onto the initial and synthetic views, which further reduces the ambiguity. Finally, we borrow the direction of the unambiguous parts from the synthetic views to the initial one, which results in the 3D pose. Quantitative experiments are performed on the HumanEva-I dataset and qualitatively on unconstrained images from the Image Parse dataset. The results show the robustness of the proposed approach to accurately reconstruct the 3D pose from a single image.

6 0.1305327 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data

7 0.12616676 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras

8 0.12503614 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses

9 0.12438372 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos

10 0.11988444 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies

11 0.11634201 290 iccv-2013-New Graph Structured Sparsity Model for Multi-label Image Annotations

12 0.11520051 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?

13 0.11478201 291 iccv-2013-No Matter Where You Are: Flexible Graph-Guided Multi-task Learning for Multi-view Head Pose Classification under Target Motion

14 0.11247817 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction

15 0.10970455 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose

16 0.10695446 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding

17 0.10441069 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation

18 0.10188121 379 iccv-2013-Semantic Segmentation without Annotating Segments

19 0.097461358 86 iccv-2013-Concurrent Action Detection with Structural Prediction

20 0.095642291 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.224), (1, -0.033), (2, 0.001), (3, -0.003), (4, 0.133), (5, -0.082), (6, 0.002), (7, -0.011), (8, -0.07), (9, 0.02), (10, 0.053), (11, -0.015), (12, -0.15), (13, -0.082), (14, -0.048), (15, 0.061), (16, -0.004), (17, -0.025), (18, 0.041), (19, 0.049), (20, 0.01), (21, -0.024), (22, 0.106), (23, -0.018), (24, 0.088), (25, -0.064), (26, 0.014), (27, -0.017), (28, 0.002), (29, -0.006), (30, -0.028), (31, -0.009), (32, 0.035), (33, -0.041), (34, 0.041), (35, -0.014), (36, 0.017), (37, -0.039), (38, -0.04), (39, 0.023), (40, -0.001), (41, 0.027), (42, -0.018), (43, -0.0), (44, 0.03), (45, -0.059), (46, -0.002), (47, -0.027), (48, 0.03), (49, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97116345 46 iccv-2013-Allocentric Pose Estimation

2 0.91357601 118 iccv-2013-Discovering Object Functionality

Author: Bangpeng Yao, Jiayuan Ma, Li Fei-Fei

Abstract: Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.

3 0.85553038 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation

Author: Joseph J. Lim, Hamed Pirsiavash, Antonio Torralba

Abstract: We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.

4 0.81000501 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?

Author: Elisabeta Marinoiu, Dragos Papava, Cristian Sminchisescu

Abstract: Human motion analysis in images and video is a central computer vision problem. Yet, there are no studies that reveal how humans perceive other people in images and how accurate they are. In this paper we aim to unveil some of the processing–as well as the levels of accuracy–involved in the 3D perception of people from images by assessing the human performance. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular the visual and kinematic performance with respect to 3D ground truth when the human subject is presented an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, as well as synchronized eye movement recordings of human subjects, shown a variety of human body configurations, both easy and difficult, as well as their ‘re-enacted’ 3D poses; (3) quantitative analysis revealing the human performance in 3D pose reenactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way it correlates with different poses. We also discuss the implications of our findings for the construction of visual human sensing systems.

5 0.80016398 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion

Author: Ibrahim Radwan, Abhinav Dhall, Roland Goecke

Abstract: In this paper, an automatic approach for 3D pose reconstruction from a single image is proposed. The presence of human body articulation, hallucinated parts and cluttered background leads to ambiguity during the pose inference, which makes the problem non-trivial. Researchers have explored various methods based on motion and shading in order to reduce the ambiguity and reconstruct the 3D pose. The key idea of our algorithm is to impose both kinematic and orientation constraints. The former is imposed by projecting a 3D model onto the input image and pruning the parts, which are incompatible with the anthropomorphism. The latter is applied by creating synthetic views via regressing the input view to multiple oriented views. After applying the constraints, the 3D model is projected onto the initial and synthetic views, which further reduces the ambiguity. Finally, we borrow the direction of the unambiguous parts from the synthetic views to the initial one, which results in the 3D pose. Quantitative experiments are performed on the HumanEva-I dataset and qualitatively on unconstrained images from the Image Parse dataset. The results show the robustness of the proposed approach to accurately reconstruct the 3D pose from a single image.

6 0.8000288 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation

7 0.79806066 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling

8 0.75921756 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose

9 0.7472086 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency

10 0.72540247 291 iccv-2013-No Matter Where You Are: Flexible Graph-Guided Multi-task Learning for Multi-view Head Pose Classification under Target Motion

11 0.69303036 130 iccv-2013-Dynamic Structured Model Selection

12 0.68112767 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies

13 0.67800075 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data

14 0.67391104 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image

15 0.66877043 143 iccv-2013-Estimating Human Pose with Flowing Puppets

16 0.66197884 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction

17 0.65664977 205 iccv-2013-Human Re-identification by Matching Compositional Template with Cluster Sampling

18 0.6497348 341 iccv-2013-Real-Time Body Tracking with One Depth Camera and Inertial Sensors

19 0.6479674 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context

20 0.64484364 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.044), (7, 0.013), (26, 0.055), (31, 0.031), (42, 0.537), (64, 0.033), (73, 0.032), (89, 0.152), (98, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98537707 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing

Author: Xu Wang, Stefan Atev, John Wright, Gilad Lerman

Abstract: The problem of efficiently deciding which of a database of models is most similar to a given input query arises throughout modern computer vision. Motivated by applications in recognition, image retrieval and optimization, there has been significant recent interest in the variant of this problem in which the database models are linear subspaces and the input is either a point or a subspace. Current approaches to this problem have poor scaling in high dimensions, and may not guarantee sublinear query complexity. We present a new approach to approximate nearest subspace search, based on a simple, new locality sensitive hash for subspaces. Our approach allows point-to-subspace query for a database of subspaces of arbitrary dimension d, in a time that depends sublinearly on the number of subspaces in the database. The query complexity of our algorithm is linear in the ambient dimension D, allowing it to be directly applied to high-dimensional imagery data. Numerical experiments on model problems in image repatching and automatic face recognition confirm the advantages of our algorithm in terms of both speed and accuracy.

2 0.9783147 422 iccv-2013-Toward Guaranteed Illumination Models for Non-convex Objects

Author: Yuqian Zhang, Cun Mu, Han-Wen Kuo, John Wright

Abstract: Illumination variation remains a central challenge in object detection and recognition. Existing analyses of illumination variation typically pertain to convex, Lambertian objects, and guarantee quality of approximation in an average case sense. We show that it is possible to build models for the set of images across illumination variation with worst-case performance guarantees, for nonconvex Lambertian objects. Namely, a natural verification test based on the distance to the model guarantees to accept any image which can be sufficiently well-approximated by an image of the object under some admissible lighting condition, and guarantees to reject any image that does not have a sufficiently good approximation. These models are generated by sampling illumination directions with sufficient density, which follows from a new perturbation bound for directional illuminated images in the Lambertian model. As the number of such images required for guaranteed verification may be large, we introduce a new formulation for cone preserving dimensionality reduction, which leverages tools from sparse and low-rank decomposition to reduce the complexity, while controlling the approximation error with respect to the original model.

3 0.97795409 96 iccv-2013-Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition

Author: De-An Huang, Yu-Chiang Frank Wang

Abstract: Cross-domain image synthesis and recognition are typically considered as two distinct tasks in the areas of computer vision and pattern recognition. Therefore, it is not clear whether approaches addressing one task can be easily generalized or extended for solving the other. In this paper, we propose a unified model for coupled dictionary and feature space learning. The proposed learning model not only observes a common feature space for associating cross-domain image data for recognition purposes, the derived feature space is able to jointly update the dictionaries in each image domain for improved representation. This is why our method can be applied to both cross-domain image synthesis and recognition problems. Experiments on a variety of synthesis and recognition tasks such as single image super-resolution, cross-view action recognition, and sketch-to-photo face recognition would verify the effectiveness of our proposed learning model.

4 0.97457975 70 iccv-2013-Cascaded Shape Space Pruning for Robust Facial Landmark Detection

Author: Xiaowei Zhao, Shiguang Shan, Xiujuan Chai, Xilin Chen

Abstract: In this paper, we propose a novel cascaded face shape space pruning algorithm for robust facial landmark detection. Through progressively excluding the incorrect candidate shapes, our algorithm can accurately and efficiently achieve the globally optimal shape configuration. Specifically, individual landmark detectors are firstly applied to eliminate wrong candidates for each landmark. Then, the candidate shape space is further pruned by jointly removing incorrect shape configurations. To achieve this purpose, a discriminative structure classifier is designed to assess the candidate shape configurations. Based on the learned discriminative structure classifier, an efficient shape space pruning strategy is proposed to quickly reject most incorrect candidate shapes while preserving the true shape. The proposed algorithm is carefully evaluated on a large set of real world face images. In addition, comparison results on the publicly available BioID and LFW face databases demonstrate that our algorithm outperforms some state-of-the-art algorithms.

5 0.96463943 167 iccv-2013-Finding Causal Interactions in Video Sequences

Author: Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps

Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification problem, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving non-trivial interactions amongst several human subjects.

1. Introduction and Motivation

The problem of identifying causal interactions amongst targets in a video sequence has been the focus of considerable attention in the past few years. A large portion of the existing body of work in this field uses human annotated video to build a storyline that includes both recognizing the activities involved and the causal relationships between them (see for instance [10] and references therein). While these methods are powerful and work well when suitably annotated data is available, annotating video clips is expensive and parsing relevant actions requires domain knowledge which may not be readily available. Indeed, in many situations, unveiling potentially hidden causal relationships is a first step towards building such knowledge.
In this paper we consider the problem of identifying causal interactions amongst targets, not necessarily human, from unannotated video sequences and without prior domain knowledge. Our approach exploits the concept of “Granger Causality” [9], which formalizes the intuitive idea that if a time series $\{x(t)\}$ is causally related to a second one $\{y(t)\}$, then knowledge of the past values of $\{y\}_1^t$ should lead to a better prediction of future values of $\{x\}_{t+k}$. In [14], Prabhakar et al. successfully used a frequency domain reformulation of this concept to uncover pairwise interactions in scenarios involving repeating events, such as social games. This technique was later extended in [17] to model causal correlations between human joints and applied to the problem of activity classification. However, since this approach is based upon estimating the cross-covariance density function between events, it cannot handle situations where these events are non-repeating, are too rare to provide an accurate estimate, or where these estimates are biased by outliers or missing data. Further, estimating a pairwise measure of causal correlation requires a spectral factorization of the cross-covariance, followed by numerical integration and statistical thresholding, limiting the approach to moderately large problems. To circumvent these problems, in this paper we propose an alternative approach based upon recasting the problem into that of identifying the topology of a sparse (directed) graph, where each node corresponds to the time traces of relevant features of a target, and each link corresponds to a regressor. The situation is illustrated in Fig. 1 using as an example the problem of finding causal relations amongst 4 tennis players, leading to a graph with 4 nodes, and potentially 12 (directed) links.

∗ This work was supported by NSF grants IIS-0713003, IIS-1318145, and ECCS-0901433, AFOSR grant FA9559-12-1-0271, and the Alert DHS Center of Excellence under Award Number 2008-ST-061-ED0001.

Note that in general, the problem of identifying causal relationships is ill posed (unless one wants to identify the set of all individuals that could possibly have causal connections), due to the existence of secondary interactions. To illustrate this point, consider a very simplistic scenario with three actors A, B, and C, where A copies (with some delay) the actions of B, which in turn mimics C, also with some delay. In this situation, the actions of A can be explained in terms of either those of B delayed one time sample, or those of C delayed by two samples. Thus, an algorithm based upon a statistical analysis would identify a causal connection between A and C, even though there is no direct link between them. Further, if the actions of C can be explained by some simple autoregressive model of the form:

$$C(t) = \sum_i a_i C(t - i)$$

then it follows that the actions of A can be explained by the same model, e.g.

$$A(t) = \sum_i a_i A(t - i)$$

Hence, multiple graph topologies, some of which include self-loops, can explain the same set of time series. On the other hand, note that in this situation, the sparsest graph (in the sense of having the fewest links) is the one that correctly captures the causality relations: the most direct cause of A is B and that of B is C, with C potentially being explained by a self-loop. To capture this feature and regularize the problem, in the sequel we will seek to find the sparsest graph, in the sense of having the least number of interconnections, that explains the observed data, reflecting the fact that, when alternative models are possible, often the most parsimonious is the correct one.
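The three-actor scenario above can be reproduced numerically. The following is only an illustrative sketch under assumed values: the random driver signal for C and the helper `delay` are hypothetical and do not come from the paper. It checks that the direct explanation A(t) = B(t-1) and the indirect one A(t) = C(t-2) both fit the data exactly, which is why the sparsest graph is needed to pick between them.

```python
import random

def delay(signal, k):
    """Return the signal delayed by k samples, padding the start with zeros."""
    return [0.0] * k + signal[:len(signal) - k]

random.seed(0)
C = [random.uniform(-1.0, 1.0) for _ in range(50)]  # arbitrary driver signal (assumed)
B = delay(C, 1)   # B mimics C one sample later
A = delay(B, 1)   # A copies B one sample later

# Both candidate explanations reproduce A exactly
fits_direct = A == delay(B, 1)     # A explained by B delayed one sample
fits_indirect = A == delay(C, 2)   # A explained by C delayed two samples
print(fits_direct, fits_indirect)
```

Since both regressors have zero residual, a purely statistical fit cannot tell the direct link B to A from the indirect link C to A.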
Our main result shows that the problem of identifying sparse graph structures from observed noisy data can be reduced to a convex optimization problem (via the use of Group Lasso type arguments) that can be efficiently solved. The advantages of the proposed method are:

• Its ability to handle complex scenarios involving non-repeating events, environmental changes, collections of targets that do not necessarily split into well defined groups, outliers and missing data.
• The ability to identify the sparsest interaction structure that explains the observed data (thus avoiding labeling as causal connections those indirect correlations mediated only by an intermediary), together with a sparse “indicator” function whose support set indicates time instants where the interactions between agents change.
• Since the approach is not based on semantic analysis, it can be applied to the motion of arbitrary targets, not necessarily humans (indeed, it applies to arbitrary time series including for instance economic or genetic data).
• From a computational standpoint, the resulting optimization problems have a specific form amenable to be solved by a class of iterative algorithms [5, 3], that require at each step only a combination of thresholding and least-squares approximations. These algorithms have been shown to substantially outperform conventional convex-optimization solvers both in terms of memory and computation time requirements.

The remainder of the paper is organized as follows. In section 2 we provide a formal reformulation of the problem of finding causal relationships between agents as a sparse graph identification problem. In section 3, we show that this problem can be efficiently solved using a re-weighted Group Lasso approach.
Moreover, as shown there, the resulting problem can be solved one node at a time using first order methods, which allows for handling situations involving a large number of agents. Finally, the effectiveness of the proposed method is illustrated in section 4 using both simple scenarios (for which ground truth is readily available) and video clips of sports, involving complex, non-repeating interactions amongst many agents.

Figure 1. Finding causal interactions as a graph identification problem. Top: sample frame from a doubles tennis sequence. Bottom: Representation of this sequence as a graph, where each node represents the time series associated with the position of each player and the links are vector regressive models. Causal interactions exist when one of the time series can be explained as a combination of past values of the others.

2. Preliminaries

For ease of reference, in this section we summarize the notation used in the paper and give a formal definition of the problem under consideration.

2.1. Notation

$\sigma_i(M)$: ith largest singular value of the matrix $M$.
$\|M\|_*$: nuclear norm: $\|M\|_* \doteq \sum_i \sigma_i(M)$.
$\|M\|_F$: Frobenius norm: $\|M\|_F^2 \doteq \sum_{i,j} M_{ij}^2$.
$\|M\|_1$: $\ell_1$ norm: $\|M\|_1 \doteq \sum_{i,j} |M_{ij}|$.
$\|M\|_o$: $\ell_o$ quasi-norm: $\|M\|_o \doteq$ number of non-zero elements in $M$.
$\circ$: Hadamard product of matrices: $(A \circ B)_{i,j} = A_{i,j} B_{i,j}$.

2.2. Statement of the Problem

Next, we formalize the problem under consideration. Consider a scenario with $P$ moving agents, and denote by $\tilde{Q}_p(t)$ the 3D homogeneous coordinates of the pth individual at time $t$. Motivated by the idea of Granger Causality, we will say that the actions of this agent depend causally on those in a set $I_p$ (which can possibly contain $p$ itself), if $\tilde{Q}_p(t)$ can be written as:

$$\tilde{Q}_p(t) = \sum_{j \in I_p} \sum_{n=0}^{N} a_{jp}(n)\, \tilde{Q}_j(t - n) + \tilde{\eta}_p(t) + \tilde{u}_p(t) \tag{1}$$

Here $a_{jp}$ are unknown coefficients, and $\tilde{\eta}_p(t)$ and $\tilde{u}_p(t)$ represent measurement noise and a piecewise constant signal that is intended to account for relatively rare events that cannot be explained by the (past) actions of other agents. Examples include interactions of an agent with the environment, for instance to avoid obstacles, or changes in the interactions between agents. Since these events are infrequent, we will model $\tilde{u}$ as a signal that has (component-wise) a sparse derivative. Note in passing that since (1) involves homogeneous coordinates, the coefficients $a_{jp}(\cdot)$ satisfy the following constraint [1]:

$$\sum_{j \in I_p} \sum_{n=0}^{N} a_{jp}(n) = 1 \tag{2}$$

[1] This follows by considering the third coordinate in (1).

Our goal is to identify causal relationships using as data 2D measurements $q_p(t)$ in $F$ frames of the affine projections of the 3D coordinates $\tilde{Q}_p(t)$ of the targets. Note that, under the affine camera assumption, the 2D coordinates are related exactly by the same regressor parameters [2]. Thus, (1) holds if and only if:

$$q_p(t) = \sum_{j \in I_p} \sum_{n=0}^{N} a_{jp}(n)\, q_j(t - n) + u_p(t) + \eta_p(t) \tag{3}$$

In this context, the problem can be precisely stated as: Given $q_p(t)$ (in $F$ frames) and some a-priori bound $N$ on the order of the regressors (that is, the “memory” of the interactions), find the sparsest set of equations of the form (3) that explains the data, that is:

$$\min_{a_{jp},\, \eta_p,\, u_p} \sum_{p} n_{I_p} \tag{4}$$

subject to (2) and:

$$q_p(t) = \sum_{j \in I_p} \sum_{n=0}^{N} a_{jp}(n)\, q_j(t - n) + u_p(t) + \eta_p(t), \quad p = 1, \dots, P \ \text{and} \ t = 1, \dots, F \tag{5}$$

where $n_{I_p}$ denotes the cardinality of the set $I_p$.
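The statement that the 2D projections obey the same regressor as the 3D points relies on the coefficients summing to one, as in (2). The sketch below illustrates this with a noise-free combination and an explicit affine map (a 2x3 matrix plus a 2D offset) in place of homogeneous coordinates. The matrix `M`, the offset `t`, the points `Q` and the helper names are all hypothetical values chosen for illustration.

```python
def affine_project(point3d, M, t):
    """Affine camera: 2D point = M (2x3) times the 3D point, plus a 2D offset t."""
    return [sum(M[r][c] * point3d[c] for c in range(3)) + t[r] for r in range(2)]

def combine(points, coeffs):
    """Linear combination sum_j coeffs[j] * points[j], coordinate by coordinate."""
    dim = len(points[0])
    return [sum(a * p[d] for a, p in zip(coeffs, points)) for d in range(dim)]

# Hypothetical numbers, chosen only for illustration
M = [[0.9, 0.1, 0.0], [-0.2, 0.8, 0.3]]
t = [5.0, -2.0]
Q = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0], [2.0, 0.0, 1.0]]  # 3D positions Q_j(t - n)
a = [0.5, 0.3, 0.2]                                        # coefficients summing to 1

Qp = combine(Q, a)                                      # 3D regression (noise-free)
lhs = affine_project(Qp, M, t)                          # project the regressed 3D point
rhs = combine([affine_project(q, M, t) for q in Q], a)  # regress the 2D projections
gap = max(abs(l - r) for l, r in zip(lhs, rhs))         # zero up to rounding
print(gap)
```

If the coefficients did not sum to one, the offset `t` would be scaled by their sum on the right-hand side and the two sides would no longer agree.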
Rewriting (5) in matrix form yields:

$$[x_p; y_p] = [B_p,\ I]\,[a_p^T\ u_{xp}^T\ u_{yp}^T]^T + \eta_p \tag{6}$$

where

$q_p(t) = [x_p(t)^T\ y_p(t)^T]^T$
$u_p(t) = [u_{xp}(t)^T\ u_{yp}(t)^T]^T$
$\eta_p(t) = [\eta_{xp}(t)^T\ \eta_{yp}(t)^T]^T$
$x_p = [x_p(F)\ x_p(F-1)\ \dots\ x_p(1)]^T$
$y_p = [y_p(F)\ y_p(F-1)\ \dots\ y_p(1)]^T$
$a_p = [a_{1p}^T,\ a_{2p}^T,\ \dots,\ a_{Pp}^T]^T$
$a_{ip} = [a_{ip}(0),\ a_{ip}(1),\ \dots,\ a_{ip}(N)]^T$
$u_{xp} = [u_{xp}(F)\ u_{xp}(F-1)\ \dots\ u_{xp}(1)]^T$
$u_{yp} = [u_{yp}(F)\ u_{yp}(F-1)\ \dots\ u_{yp}(1)]^T$
$B_p = [X_p; Y_p]$
$X_p = [\mathrm{hankel}(x_1, N),\ \dots,\ \mathrm{hankel}(x_P, N)]$
$Y_p = [\mathrm{hankel}(y_1, N),\ \dots,\ \mathrm{hankel}(y_P, N)]$

and where, for a sequence $z(t)$, $\mathrm{hankel}(z, N)$ denotes its associated Hankel matrix:

$$\mathrm{hankel}(z, N) = \begin{pmatrix} z(F) & z(F-1) & \cdots & z(F-N) \\ z(F-1) & z(F-2) & \cdots & z(F-N-1) \\ \vdots & \vdots & & \vdots \\ z(N+1) & z(N) & \cdots & z(1) \end{pmatrix}$$

It follows that a description of all the interactions amongst agents (that is, the complete graph structure) is captured by a matrix equation of the form:

$$q = [B,\ I]\,[a^T\ u^T]^T + \eta \tag{7}$$

where

$$q = [q_1^T, q_2^T, \dots, q_P^T]^T, \quad a = [a_1^T, a_2^T, \dots, a_P^T]^T, \quad u = [u_1^T, u_2^T, \dots, u_P^T]^T, \quad \eta = [\eta_1^T, \eta_2^T, \dots, \eta_P^T]^T \tag{8}$$

and

$$B = \begin{bmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_P \end{bmatrix}$$

Thus, in this context, the problem of interest can be formalized as finding the block-sparsest solution to the set of linear equations (2) and (7).

The problem of identifying a graph structure subject to sparsity constraints has been the subject of intense research in the past few years. For instance, [1] proposed a Lasso type algorithm to identify a sparse network where each link corresponds to a VAR process. The main idea underlying this method is to exploit the fact that penalizing the $\ell_1$ norm of the vector of regression coefficients tends to produce sparse solutions. However, enforcing sparsity of the entire vector of regressor coefficients does not necessarily result in a sparse graph structure, since the resulting solution can consist of many links, each with a few coefficients. This difficulty can be circumvented by resorting to group Lasso type approaches [18], which seek to enforce block sparsity by using a combination of $\ell_1$ and $\ell_2$ norm constraints on the coefficients of the regressor.
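The Hankel construction used in (6) can be made concrete with a short sketch. It assumes that rows are formed only for the time indices t = F, ..., N+1, where every lagged sample z(t-n) exists; that row range, the helper names and the numeric trajectories are assumptions made for illustration. The check confirms that multiplying the stacked Hankel blocks by the stacked coefficients gives the same predictions as the double sum in (3) without noise or exogenous input.

```python
def hankel(z, N):
    """Rows are [z(t), z(t-1), ..., z(t-N)] for t = F, F-1, ..., N+1.

    z is a list holding z(1), ..., z(F), so z(t) is stored at z[t - 1].
    """
    F = len(z)
    return [[z[t - 1 - n] for n in range(N + 1)] for t in range(F, N, -1)]

def matvec(rows, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(r * x for r, x in zip(row, v)) for row in rows]

# Two hypothetical scalar trajectories and coefficient vectors of order N = 1
N = 1
x1 = [1.0, 2.0, 4.0, 7.0, 11.0]
x2 = [0.0, 1.0, 0.0, -1.0, 0.0]
a1 = [0.0, 0.75]   # a_1p(0), a_1p(1)
a2 = [0.0, 0.25]   # a_2p(0), a_2p(1)

# X = [hankel(x1, N), hankel(x2, N)] applied to the stacked coefficients [a1; a2]
X = [r1 + r2 for r1, r2 in zip(hankel(x1, N), hankel(x2, N))]
pred = matvec(X, a1 + a2)

# Same prediction written as the double sum over agents and lags, t = F, ..., N+1
F = len(x1)
direct = [sum(a1[n] * x1[t - 1 - n] + a2[n] * x2[t - 1 - n] for n in range(N + 1))
          for t in range(F, N, -1)]
```

Each block of N+1 columns in `X` belongs to one candidate link, which is why block sparsity of the coefficient vector corresponds to sparsity of the graph.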
While this approach was shown to work well with artificial data in [11], exact recovery of the underlying network can be only guaranteed when the data satisfies suitable “incoherence” type conditions [4]. Finally, a different approach was pursued in [13], based on the use of a modified Orthogonal Least Squares algorithm, Cyclic Orthogonal Least Squares. However, this approach requires enforcing an a-priori limit on the number of links allowed to point to a single node, and such information may not be readily available, especially in cases where this number has high variability amongst nodes. To address these difficulties, in the next section we develop a convex optimization based approach to the problem of identifying sparse graph structures from observed noisy data. This method is closest in spirit to that in [11], in the sense that it is also based on a group Lasso type argument. The main differences consist in the ability to handle the unknown inputs $u_p(t)$, needed to model exogenous disturbances affecting the agents, and in a reformulation of the problem that allows for using a re-weighted iterative type algorithm, leading to substantially sparser solutions, even when the conditions in [4] fail.

3. Causality Identification Algorithm

In this section we present the main result of this paper, an algorithm to search for block-sparse solutions to (7). For each fixed $p$, the algorithm searches for sparse solutions to (6) by solving (iteratively) the following problem (suggested by the re-weighted heuristic proposed in [7]):

$$\min_{a_p,\, u_{xp},\, u_{yp}} \ \sum_{i=1}^{P} w_i^a \|a_{ip}\|_2 + \lambda \left\| \mathrm{diag}(w_u)\, [\Delta u_{xp}; \Delta u_{yp}] \right\|_1$$

subject to:

$$\|\eta_p\|_\infty \le \epsilon, \quad p = 1, \dots, P,$$
$$\sum_{i=1}^{P} \sum_{n=0}^{N} a_{ip}(n) = 1, \quad p = 1, \dots, P. \tag{9}$$

where $[\Delta u_{xp}; \Delta u_{yp}]$ represents the first order differences of the exogenous input vector $[u_{xp}; u_{yp}]$, $w^a$ and $w_u$ are the weights (the diagonals of the weighting matrices $W_a$ and $W_u$), and $\lambda$ is a Lagrange multiplier that plays the role of a tuning parameter between graph sparsity and event sensitivity.
Intuitively, for a fixed set of weights $w$, the algorithm attempts to find a block sparse solution to (6) and a set of sparse inputs $\Delta u_{xp}; \Delta u_{yp}$, by exploiting the facts that minimizing $\sum_i \|a_{ip}\|_2$ (the $\ell_{2,1}$ norm of the vector sequence $\{a_{ip}\}$) tends to maximize block-sparsity [18], while minimizing the $\ell_1$ norm maximizes sparsity [16]. Once these solutions are found, the weights $w$ are adjusted to penalize those elements of the sequences with small values, so that in the next iteration solutions that set these elements to zero (hence further increasing sparsity) are favored. Note however, that proceeding in this way requires solving at each iteration a problem with $n = P(P n_r + F)$ variables, where $P$ and $F$ denote the number of agents and frames, respectively, and where $n_r$ is a bound on the regressor order. On the other hand, it is easily seen that both the objective function and the constraints in (9) can be partitioned into $P$ groups, with the pth group involving only the variables related to the pth node. It follows then that problem (9) can be solved by solving $P$ smaller problems of the form:

$$\min_{a_p,\, u_{xp},\, u_{yp}} \ \sum_{i=1}^{P} w_i^a \|a_{ip}\|_2 + \lambda \left\| \mathrm{diag}(w_u)\, [\Delta u_{xp}; \Delta u_{yp}] \right\|_1$$
$$\text{subject to: } \|\eta_p\|_\infty \le \epsilon \ \text{ and } \ \sum_{i=1}^{P} \sum_{n=0}^{N} a_{ip}(n) = 1 \tag{10}$$

leading to the algorithm given below:

Algorithm 1: REWEIGHTED CAUSALITY ALGORITHM
for each p
    w^a = [1, 1, ..., 1]
    w_u = [1, 1, ..., 1]
    S > 1 (self-loop weight)
    s = [1, 1, ..., S, ..., 1] (p'th element is S)
    while not converged do
        1. solve (9)
        2. w_i^a = 1/(‖a_ip‖_2 + δ)
        3. w^a = w^a ∘ s (penalization of self-loops)
        4. w_u = 1./(abs([Δu_xp; Δu_yp]) + δ)
    end while
    5. At this point a_jp(·), I_p and u_p(t) have been identified
end for

It is worth emphasizing that, since the computational complexity of standard interior point methods grows as $n^3$, solving these $P$ smaller problems leads to roughly an $O(P^2)$ reduction in computational time over solving a single, larger optimization.
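The weight updates in steps 2 to 4 of Algorithm 1 are simple enough to sketch directly. The code below is a minimal illustration, not the authors' implementation: the convex solve in step 1 is omitted, the function names are hypothetical, the value of `delta` is an assumption (the paper does not state it), and the coefficient values are made up. Only S = 10 and lam = 0.05 are taken from the experimental section.

```python
import math

def reweight(a_blocks, delta_u, p, S=10.0, delta=1e-3):
    """One pass of steps 2-4 of Algorithm 1 for node p (0-based index).

    a_blocks[i] holds the coefficients a_ip(0..N) of the link i -> p;
    delta_u holds the first differences of the exogenous input.
    """
    # Step 2: links with a small coefficient norm get a large weight
    w_a = [1.0 / (math.sqrt(sum(c * c for c in block)) + delta) for block in a_blocks]
    # Step 3: extra penalty on the self-loop p -> p
    w_a[p] *= S
    # Step 4: entries of delta_u that are nearly zero get a large weight
    w_u = [1.0 / (abs(d) + delta) for d in delta_u]
    return w_a, w_u

def objective(a_blocks, delta_u, w_a, w_u, lam):
    """Weighted cost: sum_i w_a[i] * ||a_ip||_2 + lam * ||diag(w_u) delta_u||_1."""
    group = sum(w * math.sqrt(sum(c * c for c in block)) for w, block in zip(w_a, a_blocks))
    events = sum(w * abs(d) for w, d in zip(w_u, delta_u))
    return group + lam * events

# Made-up current solution for node p = 2 with three candidate links
a_blocks = [[0.6, 0.3], [0.0, 0.01], [0.05, 0.05]]
delta_u = [0.0, 0.0, 0.5, 0.0]
w_a, w_u = reweight(a_blocks, delta_u, p=2)
cost = objective(a_blocks, delta_u, w_a, w_u, lam=0.05)
```

In this example the weak link from node 1 receives a much larger weight than the strong link from node 0, so the next solve is pushed towards setting that whole block to zero.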
Thus, this approach can handle moderately large problems using standard, interior-point based, semidefinite optimization solvers. Larger problems can be accommodated by noting that the special form of the objective and constraints allows for using iterative Augmented Lagrangian Type Methods (ALM), based upon computing, at each step, the closed form solution to suitable intermediate optimization problems. While a complete derivation of such an algorithm is beyond the scope of this paper, using results from [12] it can be shown that each step requires only a combination of thresholding and least-squares approximations. Moreover, it can be shown that such an algorithm converges Q-superlinearly.

4. Handling Outliers and Missing Data

The algorithm outlined above assumes an ideal situation where the data matrix $B$ is perfectly known. However, in practice many of its elements may be outliers (due to misidentified correspondences) or missing (due to occlusion). As we briefly show next, these situations can be efficiently handled by performing a structured robust PCA step [3] to obtain a “clean” data matrix, prior to applying Algorithm 1. From equation (6) it follows that, in the absence of exogenous inputs and noise:

$$\begin{bmatrix} x_1 & \cdots & x_P \\ y_1 & \cdots & y_P \end{bmatrix} = \begin{bmatrix} X_1 & \cdots & X_P \\ Y_1 & \cdots & Y_P \end{bmatrix} \begin{bmatrix} a_1 & \cdots & a_P \end{bmatrix} \tag{11}$$

Since $x_i \in \{\mathrm{col}(X_j)\}$ and $y_i \in \{\mathrm{col}(Y_j)\}$, it follows that the sets $\{\mathrm{col}(X_i)\}$ and $\{\mathrm{col}(Y_i)\}$ are self-expressive, or, equivalently, the matrices $X \doteq [X_1 \dots X_P]$ and $Y \doteq [Y_1 \dots Y_P]$ are rank deficient. Consider now the case where some elements $x_i$, $y_i$ of $X$ and $Y$ are missing. From the self-expressive property of $\{\mathrm{col}(X_i)\}$ and $\{\mathrm{col}(Y_i)\}$ it follows that these missing elements are given by:

$$x_i = \arg\min_x \ \mathrm{rank}(X), \qquad y_i = \arg\min_y \ \mathrm{rank}(Y) \tag{12}$$

Similarly, in the presence of outliers, $X$, $Y$ can be decomposed into the sum of a low rank matrix (the clean data) and a sparse one (the outliers) by solving a problem of the form:

$$\min \ \mathrm{rank}\begin{bmatrix} X_o \\ Y_o \end{bmatrix} + \lambda \left\| \begin{bmatrix} E_X \\ E_Y \end{bmatrix} \right\|_o \quad \text{s.t.:} \quad \begin{bmatrix} X_o \\ Y_o \end{bmatrix} + \begin{bmatrix} E_X \\ E_Y \end{bmatrix} = \begin{bmatrix} X \\ Y \end{bmatrix}$$

From the reasoning above it follows that in the presence of noise and exogenous outputs, the clean data record can be recovered from the corrupted, partial measurements by solving the following optimization problem:

$$\min \ \left\| \begin{bmatrix} X_o \\ Y_o \end{bmatrix} \right\|_* + \lambda_1 \left\| \begin{bmatrix} M_X \\ M_Y \end{bmatrix} \circ \begin{bmatrix} E_X \\ E_Y \end{bmatrix} \right\|_1 + \lambda_2 \left\| \begin{bmatrix} M_X \\ M_Y \end{bmatrix} \circ \begin{bmatrix} \Delta U_X \\ \Delta U_Y \end{bmatrix} \right\|_1 + \lambda_3 \left\| \begin{bmatrix} M_X \\ M_Y \end{bmatrix} \circ \begin{bmatrix} \Xi_X \\ \Xi_Y \end{bmatrix} \right\|_F$$
$$\text{subject to:} \quad \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} X_o \\ Y_o \end{bmatrix} + \begin{bmatrix} E_X \\ E_Y \end{bmatrix} + \begin{bmatrix} U_X \\ U_Y \end{bmatrix} + \begin{bmatrix} \Xi_X \\ \Xi_Y \end{bmatrix} \tag{13}$$

where we have used the standard convex relaxations of rank and cardinality [2]. Here $\Xi$ and $U$ denote noise and piecewise constant exogenous matrices, $\Delta U$ denotes the matrix obtained by taking the difference between consecutive elements in $U$, and $M_X$ ($M_Y$) is a “mask” matrix, with $m_{i,j} = 0$ if the element $(i, j)$ in $X$ ($Y$) is missing, $m_{i,j} = 1$ otherwise, used to avoid penalizing elements in $E$, $\Xi$, $U$ corresponding to missing data. Problem (13) is a structured robust PCA problem (due to the Hankel structure of $X$, $Y$) that can be efficiently solved using the first order method proposed in [3], slightly modified to handle the terms containing $\Delta U$.

5. Experimental Results

In this section we illustrate the effectiveness of the proposed approach using several video clips (provided as supplemental material). The results of the experiments are displayed using graphs embedded on the video frames: An arrow indicates causal correlation between agents, with the point of the arrow indicating the agent whose actions are affected by the agent at its tail. The internal parameters of the algorithm were experimentally tuned, leading to the values $\epsilon = 0.1$, $\lambda = 0.05$, self-loop weights $S = 10$.
The algorithm is fairly insensitive to the value of the regularization parameters $\lambda$ and $S$, which could be adjusted up or down by an order of magnitude without affecting the structure of the resulting graph. Finally, we used regressor order N=2 for the first three examples and N=4 for the last one, a choice that is consistent with the frame rate and the complexity of the actions taking place in each clip.

[2] As shown in [6, 8], under suitable conditions these relaxations recover the exact minimum rank solution.

5.1. Clips from the UT-Interaction Data Set

We considered two video clips from the UT Human Interaction Data Set [15] (sequences 6 and 16). Figures 2 and 5 compare the results obtained applying the proposed algorithm versus Group Lasso (GL) [11] and Group Lasso combined with the reweighted heuristic described in (9) (GLRW). In all cases, the inputs to the algorithm were the (approximate) coordinates of the heads of each of the agents, normalized to the interval [−1, 1], artificially corrupted with 10% outliers. Notably, the proposed algorithm was able to correctly identify the correlations between the agents from this very limited amount of information, while the others failed to do so.

Figure 2. Sample frames from the UT sequence 6 with the identified causal connections superimposed. Top: Proposed Method. Center: Reweighted Group Lasso. Bottom: Group Lasso. Only the proposed method identifies the correct connections.

Note in passing that in both cases none of the algorithms were directly applicable, due to some of the individuals leaving the field of view or being occluded. As illustrated in Fig. 3, the missing data was recovered by solving an RPCA problem prior to applying Algorithm 1. Finally, Fig. 4 sheds more light on the key role played by the sparse signal u.
As shown there, changes in u correspond exactly to time instants when the behavior of the corresponding agent deviates from the general pattern followed during most of the clip.

Figure 3. Time traces of the individual heads in the UT sequence 6, artificially corrupted with 10% outliers. The outliers were removed and the missing data due to targets leaving the field of view was estimated by solving a modified RPCA problem.

Figure 4. Sample (derivative sparse) exogenous signals in the UT sequence 6. The changes correspond to the instants when the second person starts moving towards the first, who remains stationary, and when the two persons merge in an embrace.

Figure 5. Sample frames from the UT sequence 16. Top: Correct correlations identified by the Proposed Method. Center and Bottom: Reweighted Group Lasso and Group Lasso (circles indicate self-loops).

5.2. Doubles Tennis Experiment

This experiment considers a non-staged real-life scenario. The data consists of 230 frames of a video clip from the Australian Open Tennis Doubles Final games. The goal here is to identify causal relationships between the different players using time traces of the respective centroid positions. Note that in this case the ground truth is not available. Nevertheless, since players from the same team usually look at their opponents and react to their motions, we expect a strong causality connection between members of opposite teams. This intuition is matched by the correlations unveiled by the algorithm, shown in Fig. 6. The identified sparse input corresponding to the vertical direction is shown in Fig. 7 (similar results for the horizontal component are omitted for space reasons).

Figure 6. Sample frames from the tennis sequence. Top: The proposed method correctly identifies interactions between opposite team members. Center: Reweighted Group Lasso misses the interaction between the two rear-most individuals of opposite teams, generating self loops instead (denoted by the disks). Bottom: Group Lasso yields an almost complete graph.

Figure 7. Exogenous signal corresponding to the vertical axis for the tennis sequence. The change in one component corresponds to the instant when the leftmost player in the bottom team moves from the line towards the net, remaining closer to it from then on.

5.3. Basketball Game Experiment

This experiment considers the interactions amongst players in a basketball game. As in the case of the tennis players, since the data comes from a real-life scenario, the ground truth is not available. However, contrary to the tennis game, this scenario involves complex interactions amongst many players, and causality is hard to discern by inspection. Nevertheless, the results shown in Fig. 8, obtained using the position of the centroids as inputs to our algorithm, match our intuition. Firstly, one would expect a strong cause/effect connection between the actions of the player with the ball and the two defending opponents facing him. These connections (denoted by the yellow arrows) were indeed successfully identified by the algorithm. The next set of causal correlations is represented by the (blue, light green) and (black, white) arrow pairs showing the defending and the opponent players on the far side of the field and under the hoop. An important, counterintuitive connection identified by the algorithm is represented by the magenta arrows between the right winger of the white team and two of his teammates: the one holding the ball and the one running behind all players. While at first sight this connection is not as obvious as the others, it becomes apparent towards the end of the sequence, when the right winger is signaling with a raised arm.
Notably, our algorithm was able to unveil this signaling without the need to perform a semantic analysis (a very difficult task here, since this signaling is apparent only in the last few frames). Rather, it used the fact that the causal correlation was encapsulated in the dynamics of the relative motions of these players.

Figure 8. Sample frames from a Basketball game. Top: proposed method. Center: Reweighted Group Lasso misses the interaction between the signaling player and his teammates. Bottom: Group Lasso yields an almost complete graph.

6. Conclusions

In this paper we propose a new method for detecting causal interactions between agents using video data. The main idea is to recast this problem into a blind directed graph topology identification, where each node corresponds to the observed motion of a given target, each link indicates the presence of a causal correlation and the unknown inputs account for changes in the interaction patterns. In turn, this problem can be reduced to that of finding block-sparse solutions to a set of linear equations, which can be efficiently accomplished using an iterative re-weighted Group-Lasso approach. The ability of the algorithm to correctly identify causal correlations, even in cases where portions of the data record are missing or corrupted by outliers, and the key role played by the unknown exogenous input were illustrated with several examples involving non-trivial interactions amongst several human subjects. Remarkably, the proposed algorithm was able to identify both the correct interactions and the time instants when interactions amongst agents changed, based on minimal motion information: in all cases we used just a single time trace per person. This success indicates that in many scenarios, the dynamic information contained in the motion pattern of a single feature associated with a target is rich enough to enable identifying complex interaction patterns, without the need to track multiple features, perform a semantic analysis or use additional domain knowledge.

References

[1] A. Arnold, Y. Liu, and N. Abe. Estimating brain functional connectivity with sparse multivariate autoregression. In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 66–75, 2007.
[2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps. Dynamic subspace-based coordinated multicamera tracking. In 2011 IEEE ICCV, pages 2462–2469, 2011.
[3] M. Ayazoglu, M. Sznaier, and O. Camps. Fast algorithms for structured robust principal component analysis. In 2012 IEEE CVPR, pages 1704–1711, June 2012.
[4] A. Bolstad, B. Van Veen, and R. Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011.
[6] E. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, (3), 2011.
[7] E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, December 2008.
[8] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., (2):572–596, 2011.
[9] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, pages 424–438, 1969.
[10] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In 2009 IEEE CVPR, pages 2012–2019, 2009.
[11] S. Haufe, G. Nolte, K. R. Muller, and N. Kramer. Sparse causal discovery in multivariate time series. In Neural Information Processing Systems, 2009.
[12] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
[13] D. Materassi, G. Innocenti, and L. Giarre. Reduced complexity models in identification of dynamical networks: Links with sparsification problems. In 48th IEEE Conference on Decision and Control, pages 4796–4801, 2009.
[14] K. Prabhakar, S. Oh, P. Wang, G. Abowd, and J. Rehg. Temporal causality for the analysis of visual events. In IEEE Conf. Comp. Vision and Pattern Recog. (CVPR), pages 1967–1974, 2010.
[15] M. S. Ryoo and J. K. Aggarwal. UT Interaction Dataset, ICPR contest on Semantic Description of Human Activities. http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html, 2010.
[16] J. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
[17] S. Yi and V. Pavlovic. Sparse granger causality graphs for human action classification. In 2012 ICPR, pages 3374–3377.
[18] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006.

6 0.95058227 213 iccv-2013-Implied Feedback: Learning Nuances of User Behavior in Image Search

same-paper 7 0.93211144 46 iccv-2013-Allocentric Pose Estimation

8 0.9193536 184 iccv-2013-Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion

9 0.89687443 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

10 0.87249911 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso

11 0.86853009 14 iccv-2013-A Generalized Iterated Shrinkage Algorithm for Non-convex Sparse Coding

12 0.85775125 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search

13 0.83985704 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration

14 0.83224249 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples

15 0.82770443 398 iccv-2013-Sparse Variation Dictionary Learning for Face Recognition with a Single Training Sample per Person

16 0.81738573 114 iccv-2013-Dictionary Learning and Sparse Coding on Grassmann Manifolds: An Extrinsic Solution

17 0.81647813 52 iccv-2013-Attribute Adaptation for Personalized Image Search

18 0.81336302 154 iccv-2013-Face Recognition via Archetype Hull Ranking

19 0.81158906 45 iccv-2013-Affine-Constrained Group Sparse Coding and Its Application to Image-Based Classifications

20 0.81101775 106 iccv-2013-Deep Learning Identity-Preserving Face Space