iccv iccv2013 iccv2013-231 knowledge-graph by maker-knowledge-mining

231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition


Source: pdf

Author: Behrooz Mahasseni, Sinisa Todorovic

Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. [sent-3, score-0.287]

2 When each viewpoint of a given set of action classes is specified as a learning task, multitask learning appears suitable for achieving view invariance in recognition. [sent-4, score-0.599]

3 We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks). [sent-5, score-0.615]

4 (2) discriminative action parts, along with joint learning of all tasks. [sent-7, score-0.309]

5 This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. [sent-8, score-0.228]

6 Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%. [sent-10, score-0.132]

7 Given a video of a person performing an action (e.g., walking), we want to identify the action class and camera viewpoint. [sent-17, score-0.328]

8 The videos are captured from different camera viewpoints, which are taken to be discrete and indexed by the viewpoint identifier, or viewpoint, for short. [sent-18, score-0.117]

9 Invariance to viewpoint changes is critical for action recognition, because people’s motion trajectories may take arbitrary directions relative to the camera viewpoint while performing an action. [sent-19, score-0.46]

10 In our setting, natural variations of action instances within a class are augmented by variations in their appearance across different viewpoints. [sent-20, score-0.308]

11 One way to achieve view invariance could be to reason about a 3D layout of the scene, or 3D volume of the human body, so that the video features can be adapted from one view to another through geometric transformations [29, 28, 23, 18, 14]. [sent-21, score-0.119]

12 They seek to extend knowledge acquired in training from one or a limited number of views to other target views where recognition will be performed. [sent-28, score-0.205]

13 They either transform view-dependent video features to a new view-invariant feature space [11, 12], or adapt model parameters to the target views [3, 4, 26]. [sent-29, score-0.139]

14 However, these approaches require access to simultaneous multiview observations of the same action instance (except for [11]). [sent-32, score-0.339]

15 To approach our problem, we specify that each viewpoint of a given set of action classes is a learning task. [sent-35, score-0.331]

16 Then, view invariance in recognition could be achieved by jointly learning all the tasks using Multitask Learning (MTL) [2]. [sent-36, score-0.129]

17 While MTL is well known in vision [20, 13, 16, 31], it has never been used for view-invariant action recognition. [sent-38, score-0.287]

18 Our LMTL uses a part-based action representation, instead of the standard BoW. [sent-46, score-0.287]

19 In this way, LMTL is enabled to identify foreground video features which group into discriminative action parts, each corresponding to characteristic movements of a human-body part. [sent-47, score-0.358]

20 In addition, LMTL is enabled to identify latent groupings of correlated viewpoints of a given set of action classes. [sent-48, score-0.681]

21 Thus, LMTL learns a new shared feature space, such that each group of camera viewpoints found to be correlated is allowed to share features, whereas this sharing is prohibited between the groups. [sent-49, score-0.277]

22 We use the latent large-margin framework [30] to formulate LMTL, wherein we incorporate the mixed integer programming of [10] for grouping the viewpoints. [sent-50, score-0.128]

23 Within each group of viewpoints, a shared feature representation is estimated and used for learning parameters of a part-based action model, subject to the trace-norm regularization. [sent-52, score-0.331]
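
The trace-norm regularizer mentioned here encourages the stacked parameters of each viewpoint group to be low-rank, i.e., to lie in a shared subspace. A standard way to handle such a regularizer in proximal or ALM-type solvers is singular value thresholding; the sketch below shows that operator under our own naming, not the authors' exact optimization.

```python
import numpy as np

def prox_trace_norm(W_g, tau):
    """Proximal operator of tau * ||W_g||_* via singular value thresholding.

    Illustrative sketch: W_g stacks the per-view parameters of one group;
    shrinking its singular values pushes the views toward a shared
    low-rank subspace. Not the paper's exact solver.
    """
    U, s, Vt = np.linalg.svd(W_g, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft-threshold singular values
    return (U * s_shrunk) @ Vt
```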

24 DPM has K nodes representing K action parts, connected in a star structure. [sent-64, score-0.287]

25 An action part is a discriminative space-time window in the video volume. [sent-65, score-0.316]

26 The nodes are connected to the root, where the graph edges encode space-time deformations of the action parts. [sent-67, score-0.287]

27 The weights can be learned within the latent large-margin framework [30], aimed at jointly discovering the K latent parts, and estimating their weights so as to maximize the discriminativeness of DPM across a set of action classes. [sent-69, score-0.457]

28 Below, we give an overview of how to ground action parts onto raw pixels, and how to perform view-invariant action recognition. [sent-70, score-0.608]

29 Access to action parts is provided by representing a video by a large set of overlapping space-time windows of different sizes and shapes, V = {V1, V2, . . . }. [sent-71, score-0.427]

30 Features extracted from A, and spatiotemporal displacements of the windows in A, can be represented by a d-dimensional vector, φ(x, y, h), where x denotes the features, y is the action class, and h denotes the space-time locations of the windows in A. [sent-84, score-0.343]

31 For a set of M action classes, we learn a multiclass DPM by using an augmented D-dimensional feature vector. (Figure 1 caption: For a given video, we estimate the action class ŷ and viewpoint v̂ using Latent Multitask Learning (LMTL).) [sent-85, score-0.704]

32 LMTL identifies action parts, h, and groups of correlated camera viewpoints. [sent-86, score-0.348]

33 LMTL learns a linear transformation U to map input view-dependent features to a new feature space, partitioned into subspaces which are shared by viewpoints within the same group. [sent-87, score-0.218]
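
To make this block-partitioned sharing concrete, the sketch below applies a learned transform U whose output dimensions are divided into per-group subspaces; a view only populates its own group's subspace, so views in different groups share no features. All names here (map_to_shared_space, group_of_view, subspace_slices) are ours for illustration, not the paper's.

```python
import numpy as np

def map_to_shared_space(x, U, group_of_view, view, subspace_slices):
    """Map a view-dependent feature x into the learned shared space.

    Sketch under our own naming: U is the learned D_new x D transform;
    subspace_slices[g] indexes the output rows reserved for group g, so
    only viewpoints in the same group populate the same subspace.
    """
    z = np.zeros(U.shape[0])
    g = group_of_view[view]
    rows = subspace_slices[g]        # subspace owned by this view's group
    z[rows] = U[rows, :] @ x         # project into the group's subspace
    return z
```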

34 F_{y,v,h}(x) = w_v^T Φ(x, y, h), (1) where v is the viewpoint, and w_v ∈ R^D are the multiclass DPM parameters. [sent-95, score-0.202]

35 Given a video, the action class and viewpoint are estimated via localizing latent action parts as (ŷ, v̂) = argmax_{y,v,h} F_{y,v,h}(x). (2) [sent-96, score-0.503]
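
The sketch below spells out (2) as a brute-force search over classes, viewpoints, and candidate part placements, assuming the linear score of (1). It is only a readable reference implementation: the paper localizes parts efficiently with distance transforms rather than enumeration.

```python
import numpy as np

def infer_action_and_view(x, w, num_classes, candidate_h, Phi):
    """Brute-force version of (2): (y_hat, v_hat) = argmax_{y,v,h} F_{y,v,h}(x).

    Phi(x, y, h) builds the joint feature of (1); w[v] holds the
    multiclass DPM parameters of viewpoint v. Sketch only.
    """
    best_score, y_hat, v_hat = -np.inf, None, None
    for v, w_v in enumerate(w):                # viewpoints (tasks)
        for y in range(num_classes):           # action classes
            for h in candidate_h:              # latent part placements
                score = float(w_v @ Phi(x, y, h))
                if score > best_score:
                    best_score, y_hat, v_hat = score, y, v
    return y_hat, v_hat
```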

36 The model parameters, defined in (2), can be learned using MTL, where each task v represents recognition of one of M action classes imaged from the given view v. [sent-102, score-0.384]

37 Baseline 1: Let W be a matrix whose columns are w_v, indexed by viewpoints v = 1, . . . , V. [sent-108, score-0.218]

38 Also, let Δ(y, ŷ(w_v, x)) denote a loss function of recognizing action class ŷ when the true class is y in the vth task. [sent-112, score-0.379]
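
Baseline 1, specified in (3), trains each column w_v on its own view's data with no sharing. The sketch below writes that objective with a generic multiclass hinge surrogate for Δ; this is our simplification, since the paper's actual loss also involves the latent parts h.

```python
import numpy as np

def baseline1_loss(W, data_per_view, Phi, num_classes, lam):
    """Sketch of Baseline 1 (eq. (3)): each column w_v of W is fit to its
    own view's data D_v independently, with a simple Frobenius regularizer
    and a margin-rescaled multiclass hinge standing in for Delta.
    """
    total = lam * float(np.sum(W ** 2))
    for v, D_v in enumerate(data_per_view):    # one task per viewpoint
        w_v = W[:, v]
        for x, y in D_v:
            scores = np.array([w_v @ Phi(x, c) for c in range(num_classes)])
            augmented = scores + (np.arange(num_classes) != y)  # 0/1 margin
            total += max(0.0, float(augmented.max() - scores[y]))
    return total
```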

39 where v_g denotes viewpoints that belong to group g. [sent-220, score-0.330]

40 Note that a solution of (7) needs to resolve the latent assignment of viewpoints to groups, in addition to finding Θ and Ω. [sent-221, score-0.303]

41 Let W_g denote the matrix whose columns are parameters of the viewpoints in group g. [sent-234, score-0.240]

42 Mixed integer programming can be used in (8) to identify the latent assignment of viewpoints to groups. [sent-235, score-0.323]

43 Let Q_g ∈ R^{V×V} denote the diagonal assignment matrices for grouping the viewpoints into their respective groups g. [sent-236, score-0.285]
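
A short sketch of what these diagonal assignment matrices look like: Q_g has a 1 on the diagonal for every viewpoint in group g and 0 elsewhere, and the matrices must partition the V viewpoints. The helper name and dict-based encoding are ours.

```python
import numpy as np

def assignment_matrices(groups, V):
    """Build the diagonal assignment matrices Q_g in R^{VxV} of (9).

    groups maps each group id g to the set of viewpoints it contains.
    The assertion checks the constraint that the mixed integer program
    of [10] enforces: each viewpoint belongs to exactly one group.
    """
    Q = {g: np.diag([1.0 if v in members else 0.0 for v in range(V)])
         for g, members in groups.items()}
    assert np.allclose(sum(Q.values()), np.eye(V)), "each view in one group"
    return Q
```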

44 The goal of LMTL is to learn Θ and Ω, defined in (5), and the viewpoint grouping matrices {Q_g}, defined in (9), and use them for inference in (2). [sent-254, score-0.119]

45 Let h*_vi = h*(θ_v, x_i, y_i) denote the estimates of the vth task for latent action parts in a training video, (x_i, y_i) ∈ D_v, given its true action class y_i. [sent-261, score-0.714]

46 Also, let ĥ_vi = ĥ(θ_v, x_i) denote the estimates of the vth task for latent action parts in the same training video, (x_i, y_i) ∈ D_v, without the knowledge of its true action class, but given some estimate ŷ_vi = ŷ(θ_v, x_i). [sent-262, score-0.769]

47 Then, as in [5, 30], we define a loss function, Δ(y_i, h*_vi, ŷ_vi, ĥ_vi), in terms of both action class labels and latent variables, and substitute it in (9). [sent-263, score-0.457]

48 Note that (13) is the standard latent structured SVM formulation [5, 30], and can be efficiently solved using the CCCP algorithm, as in [5, 30]. [sent-342, score-0.106]
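
The CCCP algorithm for a latent structured SVM alternates between imputing the latent variables under the current model and solving the resulting convex structural-SVM problem. The generic loop below is a sketch of that alternation; impute_parts and solve_convex are placeholder callables standing in for the paper's inner routines.

```python
import numpy as np

def cccp_latent_ssvm(data, theta0, impute_parts, solve_convex,
                     n_iters=20, tol=1e-6):
    """Generic CCCP loop for a latent structured SVM such as (13).

    impute_parts(theta, x, y) estimates the latent part placements h*
    under the current model (linearizing the concave part); solve_convex
    is an ordinary structural-SVM solve with the latents held fixed.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        latents = [impute_parts(theta, x, y) for (x, y) in data]  # fix h*
        theta_new = solve_convex(data, latents)                   # convex solve
        if np.linalg.norm(theta_new - theta) < tol:               # converged
            return theta_new
        theta = theta_new
    return theta
```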

49 In the first step, we identify K action parts in V, using the distance transform accommodated for 2D+time volumes [5]. [sent-366, score-0.341]

50 In the second step, from (2) and (14), we recognize the action class and viewpoint of the new video as (ŷ, v̂) = argmax_{y,v} θ_v^T Φ(x, y, ĥ). (15) [sent-372, score-0.413]

51 Features: This section describes our feature vectors x and φ(x, y, h). A video is represented by a large set of overlapping space-time (2D+t) windows of different sizes. [sent-375, score-0.106]

52 φ(x, y, h) is formed by concatenating unary and pairwise potentials of action parts. [sent-382, score-0.287]

53 The pairwise potential is defined as the Euclidean distance between the two closest corners of the 2D+t windows corresponding to the root and the action part. [sent-384, score-0.343]
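
The closest-corner distance is simple to compute once a window encoding is fixed. The sketch below uses our own (x_min, y_min, t_min, x_max, y_max, t_max) tuple encoding, chosen only to make the computation concrete.

```python
import itertools
import numpy as np

def pairwise_potential(root_window, part_window):
    """Pairwise potential of an action part: Euclidean distance between
    the two closest corners of the root and part 2D+t windows.

    Windows are (x_min, y_min, t_min, x_max, y_max, t_max) tuples; this
    encoding is illustrative, not the paper's.
    """
    def corners(w):
        x0, y0, t0, x1, y1, t1 = w
        return np.array(list(itertools.product((x0, x1), (y0, y1), (t0, t1))))
    c_root, c_part = corners(root_window), corners(part_window)
    dists = np.linalg.norm(c_root[:, None, :] - c_part[None, :, :], axis=2)
    return dists.min()   # distance between the two closest corners
```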

54 We evaluate our approach on three benchmark datasets including: the IXMAS dataset [25], the newer version of IXMAS dataset [24] referred to as IXMAS(new), and the i3DPost multiview human action dataset [6]. [sent-387, score-0.325]

55 IXMAS has 12 different actions, each performed by 11 actors three times. (Figure 2 caption: Our average recognition accuracy on i3DPost videos for different input parameters K and g.) [sent-388, score-0.180]

56 IXMAS(new) has the same set of actions as IXMAS, recorded from five different viewpoints with different cameras. [sent-391, score-0.308]

57 A video is split into overlapping 2D+t windows of varying width, height and time duration. [sent-404, score-0.106]
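
A minimal sketch of enumerating such overlapping 2D+t windows over a video volume is given below. The specific window shapes and the half-extent stride are our assumptions for illustration, not the paper's settings.

```python
def spacetime_windows(W, H, T, sizes, stride_frac=0.5):
    """Enumerate overlapping 2D+t windows over a W x H x T video volume.

    sizes is a list of (w, h, t) window shapes; consecutive windows
    overlap by (1 - stride_frac) of their extent along each axis.
    Yields (x0, y0, t0, x1, y1, t1) tuples.
    """
    for (w, h, t) in sizes:
        sx = max(1, int(w * stride_frac))
        sy = max(1, int(h * stride_frac))
        st = max(1, int(t * stride_frac))
        for x in range(0, W - w + 1, sx):
            for y in range(0, H - h + 1, sy):
                for t0 in range(0, T - t + 1, st):
                    yield (x, y, t0, x + w, y + h, t0 + t)
```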

58 Input parameters to our LMTL include: the number of action parts K ∈ {2, 4, 6, 8} and the number of groups g ∈ {1, 5, 6, 7, 8}. [sent-410, score-0.345]

59 Fig. 2 shows that changes of g affect our average accuracy on the action classes of i3DPost. [sent-413, score-0.331]

60 Baseline 1 learns action classifiers separately for different viewpoints, and is specified in (3). [sent-417, score-0.287]

61 Table 1: Average accuracy in [%] of LMTL and Baseline1 (B1), Baseline2 (B2), Baseline3 (B3) for different camera viewpoints on IXMAS. [sent-435, score-0.240]

62 Table 2: Average accuracy in [%] of LMTL and Baseline1 (B1), Baseline2 (B2), Baseline3 (B3) for different camera viewpoints on IXMAS(new). [sent-436, score-0.240]

63 Table 3: Average accuracy in [%] of LMTL and Baseline1 (B1), Baseline2 (B2), Baseline3 (B3) for different camera viewpoints on i3DPost. [sent-446, score-0.240]

64 This setting allows us to compare our LMTL with the baselines and methods which use all viewpoints in training. [sent-448, score-0.237]

65 We have access to videos from one or more source views, and limited (or no) access to videos from other target views. [sent-450, score-0.207]

66 For evaluation, we vary the number of source views, and the number of videos from target views present in training. [sent-451, score-0.177]

67 Tables 1, 2 and 3 show the average accuracy of LMTL and the baselines with respect to different viewpoints on IXMAS, IXMAS(new) and i3DPost respectively in the first setting. [sent-452, score-0.259]

68 We see that sharing features across all viewpoints in Baseline 2 worsens results relative to Baseline 1. [sent-453, score-0.218]

69 This is not a surprise, because the assumption that all viewpoints share a common feature space is too strong. [sent-454, score-0.218]

70 Another interesting observation is the effect of using latent action parts. [sent-458, score-0.372]

71 Table 4: The confusion matrix of LMTL for the IXMAS action classes. [sent-478, score-0.347]

72 The values are in [%]. Table 5: Confusion matrix of LMTL on IXMAS(new) action classes (class labels: CW, CA, GU, KI, PU, PC, SH, SD, TA, WK, WV). [sent-480, score-0.287]

73 Table 6: Confusion matrix of LMTL on i3DPost action classes. [sent-492, score-0.287]

74 This shows the merit of our accounting for action parts. [sent-494, score-0.312]

75 Tables 4, 5 and 6 show the confusion tables of our approach for action classes on the IXMAS, IXMAS(new) and i3DPost datasets. [sent-497, score-0.401]

76 Although we do not model structured actions and actions with more than one actor explicitly in our model, results on the i3DPost dataset show a reasonable accuracy for these actions. [sent-498, score-0.185]

77 Figure 3: Accuracy in [%] of LMTL across different viewpoints on IXMAS. [sent-499, score-0.218]

78 Table 7: The confusion matrix of viewpoints estimated by LMTL on IXMAS. [sent-512, score-0.278]

79 The values are in [%]. Table 8: Confusion matrix of estimated viewpoints for LMTL on IXMAS(new) (cameras CAM0–CAM4). [sent-513, score-0.218]

80 Values are in [%]. Studying recognition accuracy per viewpoint is important, because it shows how well an approach performs in different viewpoints. [sent-514, score-0.116]

81 Table 7 shows the confusion matrix of our viewpoint estimation on the IXMAS dataset. [sent-518, score-0.136]

82 Confusion matrices of our viewpoint estimation on the IXMAS(new) and i3DPost datasets are shown in Tables 8 and 9, respectively. [sent-520, score-0.108]

83 Table 9: Confusion matrix of estimated viewpoints for LMTL on i3DPost. [sent-537, score-0.218]

84 Figure 5: Average accuracy in [%] for different numbers of viewpoints in the IXMAS, IXMAS(new) and i3DPost datasets. [sent-539, score-0.24]

85 Fig. 5 shows the effect of using multiple source views on our average accuracy for the three datasets. [sent-546, score-0.125]

86 In addition to the source view videos, we also use one-third of videos of the target viewpoint in learning. [sent-548, score-0.206]

87 Note that the total number of groups of viewpoints is limited by the number of the source views, which is 5 in IXMAS, and 8 in both IXMAS(new) and i3DPost. [sent-549, score-0.268]

88 We consider two different modes of tests: 1) Unbalanced labeled mode, where we use one-third of videos from the target views in training, and 2) Balanced labeled mode, where we use two-thirds of videos from the target views in training. [sent-573, score-0.302]

89 The best average accuracy on the simple action classes of i3DPost, reported in [8], is 90. [sent-582, score-0.331]

90 Our evaluation on i3DPost shows that the DPM representation of action classes is capable of handling more complex, structured actions. [sent-586, score-0.33]

91 Conclusion We have formulated a new approach to view-invariant action recognition. [sent-591, score-0.287]

92 We have formalized viewpoints of a given set of action classes as learning tasks, which can be jointly learned within the Multitask Learning (MTL) framework. [sent-593, score-0.549]

93 To express that some viewpoints may not be correlated, and that discriminative action parts are subject to occlusion across the views, we have extended the standard MTL to latent MTL (LMTL). [sent-594, score-0.624]

94 Thus, our LMTL identifies groupings of correlated viewpoints, leveraging a multiclass deformable parts model of actions. [sent-595, score-0.138]

95 Our evaluation on the benchmark IXMAS, IXMAS(new), and i3DPost datasets shows that accounting for parts and grouping viewpoints in LMTL leads to significant performance improvements over MTL, and other knowledge-transfer approaches to view-invariant action recognition. [sent-596, score-0.607]

96 Single view human action recognition using key pose matching and Viterbi path searching. [sent-684, score-0.335]

97 Making action recognition robust to occlusions and viewpoint changes. [sent-749, score-0.381]

98 View-invariant action recognition using latent kernelized structural SVM. [sent-760, score-0.390]

99 Mining actionlet ensemble for action recognition with depth cameras. [sent-764, score-0.305]

100 Learning 4D action feature models for arbitrary view action recognition. [sent-771, score-0.604]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lmtl', 0.641), ('ixmas', 0.378), ('action', 0.287), ('mtl', 0.25), ('viewpoints', 0.218), ('qg', 0.211), ('wv', 0.169), ('multitask', 0.11), ('yi', 0.088), ('latent', 0.085), ('views', 0.077), ('viewpoint', 0.076), ('actions', 0.071), ('dv', 0.065), ('confusion', 0.06), ('windows', 0.056), ('wk', 0.053), ('vi', 0.052), ('hv', 0.05), ('trac', 0.048), ('dpm', 0.048), ('baseline', 0.047), ('vg', 0.045), ('cw', 0.044), ('grouping', 0.043), ('hvi', 0.043), ('videos', 0.041), ('unbalanced', 0.041), ('yvi', 0.039), ('ub', 0.039), ('substitute', 0.038), ('pu', 0.037), ('correlated', 0.037), ('weinland', 0.035), ('parts', 0.034), ('groupings', 0.034), ('multiclass', 0.033), ('target', 0.033), ('walk', 0.033), ('access', 0.033), ('ki', 0.033), ('tables', 0.032), ('checkwatch', 0.032), ('crossarms', 0.032), ('getup', 0.032), ('nikolaidis', 0.032), ('pickup', 0.032), ('scratchhead', 0.032), ('tbl', 0.032), ('turnaround', 0.032), ('specifies', 0.03), ('view', 0.03), ('xi', 0.03), ('invariance', 0.03), ('video', 0.029), ('balanced', 0.029), ('tasks', 0.029), ('bow', 0.028), ('sitdown', 0.028), ('jf', 0.028), ('actors', 0.028), ('ca', 0.027), ('imaged', 0.027), ('source', 0.026), ('gu', 0.026), ('loss', 0.026), ('uw', 0.025), ('accounting', 0.025), ('sh', 0.024), ('sd', 0.024), ('groups', 0.024), ('pc', 0.024), ('vth', 0.024), ('punch', 0.023), ('kick', 0.023), ('accuracy', 0.022), ('learning', 0.022), ('bend', 0.022), ('disregarding', 0.022), ('classes', 0.022), ('group', 0.022), ('overlapping', 0.021), ('class', 0.021), ('structured', 0.021), ('trajectories', 0.021), ('identify', 0.02), ('fraction', 0.02), ('yilmaz', 0.02), ('pull', 0.019), ('vv', 0.019), ('nthde', 0.019), ('hw', 0.019), ('multiview', 0.019), ('baselines', 0.019), ('referred', 0.019), ('sensitivity', 0.019), ('ww', 0.019), ('jp', 0.019), ('recorded', 0.019), ('recognition', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

Author: Behrooz Mahasseni, Sinisa Todorovic

Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.

2 0.22764881 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

Author: Jingjing Zheng, Zhuolin Jiang

Abstract: We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.

3 0.21017767 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

Author: Qiang Zhou, Gang Wang, Kui Jia, Qi Zhao

Abstract: Sharing knowledge for multiple related machine learning tasks is an effective strategy to improve the generalization performance. In this paper, we investigate knowledge sharing across categories for action recognition in videos. The motivation is that many action categories are related, where common motion patterns are shared among them (e.g. diving and high jump share the jump motion). We propose a new multi-task learning method to learn latent tasks shared across categories, and reconstruct a classifier for each category from these latent tasks. Compared to previous methods, our approach has two advantages: (1) The learned latent tasks correspond to basic motion patterns instead of full actions, thus enhancing discrimination power of the classifiers. (2) Categories are selected to share information with a sparsity regularizer, avoiding falsely forcing all categories to share knowledge. Experimental results on multiple public data sets show that the proposed approach can effectively transfer knowledge between different action categories to improve the performance of conventional single task learning methods.

4 0.2090123 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

5 0.19082628 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

6 0.18118818 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

7 0.16401918 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

8 0.15038034 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

9 0.14487489 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

10 0.1439047 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

11 0.1398983 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

12 0.13506682 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces

13 0.12867637 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

14 0.12653214 291 iccv-2013-No Matter Where You Are: Flexible Graph-Guided Multi-task Learning for Multi-view Head Pose Classification under Target Motion

15 0.11333466 166 iccv-2013-Finding Actors and Actions in Movies

16 0.11026308 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

17 0.10927725 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

18 0.096776344 39 iccv-2013-Action Recognition with Improved Trajectories

19 0.093580216 276 iccv-2013-Multi-attributed Dictionary Learning for Sparse Coding

20 0.092271388 282 iccv-2013-Multi-view Object Segmentation in Space and Time


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.164), (1, 0.174), (2, 0.061), (3, 0.19), (4, -0.024), (5, -0.018), (6, 0.065), (7, -0.073), (8, -0.003), (9, 0.004), (10, 0.034), (11, 0.046), (12, -0.061), (13, -0.076), (14, 0.153), (15, -0.024), (16, 0.015), (17, -0.019), (18, -0.031), (19, -0.052), (20, -0.023), (21, 0.018), (22, -0.043), (23, -0.088), (24, 0.007), (25, -0.028), (26, -0.021), (27, -0.046), (28, 0.003), (29, 0.02), (30, 0.065), (31, 0.046), (32, 0.002), (33, 0.012), (34, 0.025), (35, -0.01), (36, 0.026), (37, 0.059), (38, 0.016), (39, -0.014), (40, 0.008), (41, -0.116), (42, 0.013), (43, -0.054), (44, 0.092), (45, -0.005), (46, -0.041), (47, 0.022), (48, -0.018), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96385264 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

Author: Behrooz Mahasseni, Sinisa Todorovic

Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task then multitask learning appears suitable for achieving view invariance in recognition. We extend the standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than some others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.5–6%.

2 0.84773362 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only has one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. A detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

3 0.84253442 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

4 0.79642284 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

Author: Qiang Zhou, Gang Wang, Kui Jia, Qi Zhao

Abstract: Sharing knowledge for multiple related machine learning tasks is an effective strategy to improve the generalization performance. In this paper, we investigate knowledge sharing across categories for action recognition in videos. The motivation is that many action categories are related, where common motion patterns are shared among them (e.g. diving and high jump share the jump motion). We propose a new multi-task learning method to learn latent tasks shared across categories, and reconstruct a classifier for each category from these latent tasks. Compared to previous methods, our approach has two advantages: (1) The learned latent tasks correspond to basic motion patterns instead of full actions, thus enhancing discrimination power of the classifiers. (2) Categories are selected to share information with a sparsity regularizer, avoiding falsely forcing all categories to share knowledge. Experimental results on multiple public data sets show that the proposed approach can effectively transfer knowledge between different action categories to improve the performance of conventional single task learning methods.

5 0.78522807 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis

Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localization as additional output. This output sheds further light into detailed action understanding.

6 0.77844399 38 iccv-2013-Action Recognition with Actons

7 0.74939787 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

8 0.73143065 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

9 0.71104968 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

10 0.70227349 166 iccv-2013-Finding Actors and Actions in Movies

11 0.69747502 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces

12 0.66910243 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

13 0.65682256 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

14 0.60979933 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition

15 0.59674013 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

16 0.59036368 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

17 0.58159339 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

18 0.56848621 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

19 0.55044031 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

20 0.54696 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.064), (7, 0.019), (13, 0.01), (26, 0.045), (31, 0.03), (42, 0.495), (64, 0.053), (73, 0.017), (89, 0.148)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98553467 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing

Author: Xu Wang, Stefan Atev, John Wright, Gilad Lerman

Abstract: The problem of efficiently deciding which of a database of models is most similar to a given input query arises throughout modern computer vision. Motivated by applications in recognition, image retrieval and optimization, there has been significant recent interest in the variant of this problem in which the database models are linear subspaces and the input is either a point or a subspace. Current approaches to this problem have poor scaling in high dimensions, and may not guarantee sublinear query complexity. We present a new approach to approximate nearest subspace search, based on a simple, new locality sensitive hash for subspaces. Our approach allows point-to-subspace query for a database of subspaces of arbitrary dimension d, in a time that depends sublinearly on the number of subspaces in the database. The query complexity of our algorithm is linear in the ambient dimension D, allowing it to be directly applied to high-dimensional imagery data. Numerical experiments on model problems in image repatching and automatic face recognition confirm the advantages of our algorithm in terms of both speed and accuracy.

2 0.98096246 96 iccv-2013-Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition

Author: De-An Huang, Yu-Chiang Frank Wang

Abstract: Cross-domain image synthesis and recognition are typically considered as two distinct tasks in the areas of computer vision and pattern recognition. Therefore, it is not clear whether approaches addressing one task can be easily generalized or extended for solving the other. In this paper, we propose a unified model for coupled dictionary and feature space learning. The proposed learning model not only observes a common feature space for associating cross-domain image data for recognition purposes, the derived feature space is able to jointly update the dictionaries in each image domain for improved representation. This is why our method can be applied to both cross-domain image synthesis and recognition problems. Experiments on a variety of synthesis and recognition tasks such as single image super-resolution, cross-view action recognition, and sketchto-photo face recognition would verify the effectiveness of our proposed learning model.

3 0.97818321 422 iccv-2013-Toward Guaranteed Illumination Models for Non-convex Objects

Author: Yuqian Zhang, Cun Mu, Han-Wen Kuo, John Wright

Abstract: Illumination variation remains a central challenge in object detection and recognition. Existing analyses of illumination variation typically pertain to convex, Lambertian objects, and guarantee quality of approximation in an average case sense. We show that it is possible to build models for the set of images across illumination variation with worst-case performance guarantees, for nonconvex Lambertian objects. Namely, a natural verification test based on the distance to the model guarantees to accept any image which can be sufficiently well-approximated by an image of the object under some admissible lighting condition, and guarantees to reject any image that does not have a sufficiently good approximation. These models are generated by sampling illumination directions with sufficient density, which follows from a new perturbation bound for directional illuminated images in the Lambertian model. As the number of such images required for guaranteed verification may be large, we introduce a new formulation for cone preserving dimensionality reduction, which leverages tools from sparse and low-rank decomposition to reduce the complexity, while controlling the approximation error with respect to the original model. 1

4 0.97608876 70 iccv-2013-Cascaded Shape Space Pruning for Robust Facial Landmark Detection

Author: Xiaowei Zhao, Shiguang Shan, Xiujuan Chai, Xilin Chen

Abstract: In this paper, we propose a novel cascaded face shape space pruning algorithm for robust facial landmark detection. Through progressively excluding the incorrect candidate shapes, our algorithm can accurately and efficiently achieve the globally optimal shape configuration. Specifically, individual landmark detectors are firstly applied to eliminate wrong candidates for each landmark. Then, the candidate shape space is further pruned by jointly removing incorrect shape configurations. To achieve this purpose, a discriminative structure classifier is designed to assess the candidate shape configurations. Based on the learned discriminative structure classifier, an efficient shape space pruning strategy is proposed to quickly reject most incorrect candidate shapes while preserve the true shape. The proposed algorithm is carefully evaluated on a large set of real world face images. In addition, comparison results on the publicly available BioID and LFW face databases demonstrate that our algorithm outperforms some state-of-the-art algorithms.

5 0.96781123 167 iccv-2013-Finding Causal Interactions in Video Sequences

Author: Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps

Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving non-trivial interactions amongst several human subjects.
To circumvent these problems, in this paper we propose an alternative approach based upon recasting the problem into that of identifying the topology of a sparse (directed) graph, where each node corresponds to the time traces of relevant features of a target, and each link corresponds to a regressor. The situation is illustrated in Fig. 1 using as an example the problem of finding causal relations amongst 4 tennis players, leading to a graph with 4 nodes, and potentially 12 (directed) links. Note that in general, the problem of identifying causal relationships is ill posed (unless one wants to identify the set of all individuals that could possibly have causal connections), due to the existence of secondary interactions. To illustrate this point, consider a very simplistic scenario with three actors A, B, and C, where A copies (with some delay) the actions of B, which in turn mimics C, also with some delay. In this situation, the ac- tions of A can be explained in terms of either those of B delayed one time sample, or those of C delayed by two samples. Thus, an algorithm based upon a statistical analysis 33556758 would identify a causal connection between A and C, even though there is no direct link between them. Further, if the actions of C can be explained by some simple autoregressive model of the form: = C(t) ?aiC(t − i) then it follows that the acti?ons of A can be explained by the same model, e.g. = A(t) ?aiA(t − i) Hence, multiple graphs topologies, some of which include self-loops, can explain the same set of time-series. On the other hand, note that in this situation, the sparsest graph (in the sense of having the fewest links) is the one that correctly captures the causality relations: the most direct cause of A is B and that of B is C, with C potentially being explained by a self-loop. To capture this feature and regularize the problem, in the sequel we will seek to find the sparsest graph, in the sense of having the least number of interconnections, that explains the observed data, reflecting the fact that, when alternative models are possible, often the most parsimonious is the correct one. Our main result shows that the problem of identifying sparse graph structures from observed noisy data can be reduced to a convex optimization problem (via the use of Group Lasso type arguments) that can be efficiently solved. The advantages of the proposed methods are: • • • • Its ability to handle complex scenarios involving nonrepeating events, een cvoimropnlmeexn stcael changes, clvoillnegct nioonnsof targets that do not necessarily split into well defined groups, outliers and missing data. The ability to identify the sparsest interaction structure tThhaet explains th idee nobtifseyr tvheed s dpaartas e(stthu inst avoiding labeling as causal connections those indirect correlations mediated only by an intermediary), together with a sparse “indicator” function whose support set indicates time instants where the interactions between agents change. Since the approach is not based on semantic analysis, iSt can bt hee applied ctoh ti she n moto btiaosne dof o arbitrary targets, sniost, necessarily humans (indeed, it applies to arbitrary time series including for instance economic or genetic data). From a computational standpoint, the resulting optiFmriozmatio an c problems nhaalve s a specific fthoerm re asmuletinnagbl oep ttiobe solved by a class of iterative algorithms [5, 3], that require at each step only a combination of thresholding and least-squares approximations. 
These algorithms have been shown to substantially outperform conventional convex-optimization solvers both in terms of memory and computation time requirements. The remainder of the paper is organized as follows. In section 2 we provide a formal reformulation of the problem of finding causal relationships between agents as a sparse graph identification problem. In section 3, we show that this problem can be efficiently solved using a re-weighted Group Lasso approach. Moreover, as shown there, the resulting problem can be solved one node at a time using first order methods, which allows for handling situations involving a large number of agents. Finally, the effectiveness of the proposed method is illustrated in section 4 using both simple scenarios (for which ground truth is readily available) and video clips of sports, involving complex, nonrepeating interactions amongst many agents. Figure 1. Finding causal interactions as a graph identification problem. Top: sample frame from a doubles tennis sequence. Bottom: Representation of this sequence as a graph, where each node represents the time series associated with the position of each player and the links are vector regressive models. Causal interactions exist when one of the time series can be explained as a combination of past values of the others. 2. Preliminaries For ease of reference, in this section we summarize the notation used in the paper and give a formal definition of the problem under consideration. 2.1. Notation (M) ?M? ??MM??F ?M?1 ?M?o σi ∗ ◦ ith largest singular value of the matrix M. nuclear norm: ?M? ?i σ?i (M). Fnruocbleeanrio nours norm: ??M?2F? ?i,j Mi2j ?1 norm: ?M? 1 ?i,j |Mij? ?|. ?o quasi-norm: ?M?o number of non-zero ?eleme?nMts i?n M. Hadamard product of matrices: (A ◦ ∗ =.: =. =. =. B)i,j = Ai,jBi,j. 33556769 2.2. Statement of the Problem Next, we formalize the problem under consideration. Consider a scenario with P moving agents, and denote by the 3D homogenous coordinates of the pth individual at time t. Motivated by the idea of Granger Causality, we will say that the actions of this agent depend causally from those in a set Ip (which can possibly contain p itself), if can be written as: Q˜p(t) Q˜p(t) Q˜p(t) ?N = ? ?ajp(n)Q˜j(t − n) +˜ η p(t) +˜ u p(t) (1) j? ?∈Ip ?n=0 Here ajp are unknown coefficients, and ˜η p(t) and up(t) represent measurement noise and a piecewise constant signal that is intended to account for relatively rare events that cannot be explained by the (past) actions of other agents. Examples include interactions of an agent with the environment, for instance to avoid obstacles, or changes in the interactions between agents. Since these events are infrequent, we will model as a signal that has (component-wise) a sparse derivative. Note in passing that since (1) involves homogeneous coordinates, the coefficients aj,p(.) satisfy the following constraint1 u ?N ? ?ajp(n) j? ?∈Ip ?n=0 =1 (2) Our goal is to identify causal relationships using as data 2D measurements qp(t) in F frames of the affine projections of the 3D coordinates Q˜p(t) of the targets. Note that, under the affine camera assumption, the 2D coordinates are related exactly by the same regressor parameters [2]. Thus, (1) holds if and only if: ?N qp(t) = ? 
?ajp(n)qj(t − n) + u˜ p(t) + ηp(t) (3) j?∈Ip ?n=0 In this context, the problem can be precisely stated as: Given qp(t) (in F number of frames) and some a-priori bound N on the order of the regressors (that is the “memory” of the interactions), find the sparsest set of equations of the form (3) that explains the data, that is: aj,pm,ηinp,up?nIp (4) subject to? ?(2) and: = ? ?ajp(n)qj(t − n) + ?N qp(t) j? ?∈Ip ?n=0 up(t) + ηp(t) , p = 1 . . . , P and t = 1, ..F 1This follows by considering the third coordinate in (1) (5) where nIp denotes the cardinality of the set Ip. Rewriting (5) in matrixd efnoormtes yields: [xp; yp] = [Bp, I][apTuxTpuyTp]T + ηp (6) where qp(t) up(t) ηp(t) xp yp ap aip uxp uyp Bp Xp = [xp(t)Typ(t)T]T = [uTxp(t)uyTp(t)]T = [ηxp(t)Tηyp(t)T]T = = [xp(F)xp(F − 1)...xp(1)]T = [yp(F)yp(F − 1)...yp(1)]T [aT1p, a2Tp, ..., aTPp]T = [aip(0), aip(1), ..., aip(N)]T = [uxp(F)uxp(F−1)...uxp(1)]T = [uyp(F)uyp(F−1)...uyp(1)]T = = [Xp; Yp] [hankel(x1 , N) , ..., hankel(xP, N)] Yp = [hankel(y1, N), ..., hankel(yP, N)] and where, for a sequence z(t), hankel(z, N) denotes its associated Hankel matrix: hankel(z, N) = Itfolw⎛⎜⎝ sz t(hNzFa(t. +−a)d1 2e)scrzip(tF io(N. n− )o231f)al· t h· einzt(Frac−zti(.o1N.n)s−a)m12o)⎟ ⎞⎠ ngst uηaq= ? ηuqa1 T ,ηqau2 T ,ηaqu3 T ,· ·, ηauqP T ? T (8) Thus,inthBisc=on⎢⎣⎡teBx0t.1,theB0p.r2ob·le.·m·ofB0 i.nPte⎦⎥r ⎤estcanbeforagents (that is the complete graph structure) is captured by a matrix equation of the form: q = [B, I][aTuT]T + η (7) where and malized as finding the block–sparsest solution to the set of linear equations (2) and (7). 33557770 The problem of identifying a graph structure subject to sparsity constraints, has been the subject of intense research in the past few years. For instance, [1] proposed a Lasso type algorithm to identify a sparse network where each link corresponds to a VAR process. The main idea underlying this method is to exploit the fact that penalizing the ?1 norm of the vector of regression coefficients tends to produce sparse solutions. However, enforcing sparsity of the entire vector of regressor coefficients does not necessarily result in a sparse graph structure, since the resulting solution can consist of many links, each with a few coefficients. This difficulty can be circumvented by resorting to group Lasso type approaches [18], which seek to enforce block sparsity by using a combination of ?1 and ?2 norm constraints on the coefficients of the regressor. While this approach was shown to work well with artificial data in [11], exact recovery of the underlying network can be only guaranteed when the data satisfies suitable “incoherence” type conditions [4]. Finally, a different approach was pursued in [13], based on the use of a modified Orthogonal Least Squares algorithm, Cyclic Orthogonal Least Squares. However, this approach requires enforcing an a-priori limit on the number of links allowed to point to a single node, and such information may not be readily available, specially in cases where this number has high variability amongst nodes. To address these difficulties, in the next section we develop a convex optimization based approach to the problem of identifying sparse graph structures from observed noisy data. This method is closest in spirit to that in [11], in the sense that it is also based on a group Lasso type argument. 
The main differences consist in the ability to handle the unknown inputs up(t), needed to model exogenous disturbances affecting the agents, and in a reformulation of the problem, that allows for using a re-weighted iterative type algorithm, leading to substantially sparser solutions, even when the conditions in [4] fail. 3. Causality Identification Algorithm In this section we present the main result of this paper, an algorithm to search for block-sparse solutions to (7). For each fixed p, the algorithm searches for sparse solutions to (6) by solving (iteratively) the following problem (suggested by the re-weighted heuristic proposed in [7]) ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 subject to: ?ηp ? ≤ p = 1, . . , P. ∞ ?P ?, ?N ??aip(n) i?= ?1 ?n=0 ?. = 1, p = 1,...,P. (9) where [Δuxp ; Δuyp] represents the first order differences of the exogenous input vector [uxp ; uyp], Wa and Wu are weighting matrices, and λ is a Lagrange multiplier that plays the role of a tuning parameter between graph sparsity and event sensitivity. Intuitively, for a fixed set of weights w, the algorithm attempts to find a block sparse solution to (6) and a set of sparse inp?uts Δuxp ; Δuyp , by exploiting the facts that minimizing ?i ?aip ?2 (the ?2,1 norm of the vector sequence {aip}) te?nds? tao m?aximize block-sparsity [18], while minimizing et?nhed s? 1t norm mmaizxeim blizoceks sparsity [ [1168]]. wOhniclee t mheisnesolutions are found, the weights w are adjusted to penalize those elements of the sequences with small values, so that in the next iteration solutions that set these elements to zero (hence further increasing sparsity) are favored. Note however, that proceeding in this way, requires solving at each iteration a problem with n = P(Pnr + F) variables, where P and F denote the number of agents and frames, respectively, and where nr is a bound on the regressor order. On the other hand, it is easily seen that both the objective function and the constraints in (9) can be partitioned into P groups, with the pth group involving only the variables related to the pth node. It follows then that problem (9) can be solved by solving P smaller problems of the form: ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 ?P subject to: ?ηp?∞ ?N ≤ ? and ??aip(n) i?= ?1 ?n=0 leading to the algorithm given below: =1 (10) Algorithm 1: REWEIGHTEDCAUSALITYALGORITHM for each p wa = [1, 1, ..., 1] = [1, 1, ..., 1] S > 1(self loop weight) s = [1, 1, ..., S, ..., 1] (p’th element is S) while not converged do 1. solve (9) 2. wja = 1/( ?aip ?2 + δ) 3. wja = wja ◦ s (Penalization self loops) 4. = 1./(abs([Δuxp ; Δuyp]) + δ) end while 5. At this point ajp(.) , Ip and up(t) have been identified end for wu wu It is worth emphasizing that, since the computational complexity of standard interior point methods grows as n3, solving these smaller P problems leads to roughly a O(P2) 33557781 reduction in computational time over solving a single, larger optimization. Thus, this approach can handle moderately large problems using standard, interior-point based, semidefinite optimization solvers. Larger problems can be accommodated by noting that the special form of the objective and constraints allow for using iterative Augmented La- grangian Type Methods (ALM), based upon computing, at each step, the closed form solution to suitable intermediate optimization problems. 
4. Handling Outliers and Missing Data

The algorithm outlined above assumes an ideal situation where the data matrix B is perfectly known. However, in practice many of its elements may be outliers (due to misidentified correspondences) or missing (due to occlusion). As we briefly show next, these situations can be efficiently handled by performing a structured robust PCA step [3] to obtain a "clean" data matrix prior to applying Algorithm 1. From equation (6) it follows that, in the absence of exogenous inputs and noise:

[x_1 ... x_P; y_1 ... y_P] = [X_1 ... X_P; Y_1 ... Y_P] [a_1 ... a_P]   (11)

Since x_i ∈ {col(X_j)} and y_i ∈ {col(Y_j)}, it follows that the sets {col(X_i)} and {col(Y_i)} are self-expressive, or, equivalently, the matrices X = [X_1 ... X_N] and Y = [Y_1 ... Y_N] are rank deficient. Consider now the case where some elements x_i, y_i of X and Y are missing. From the self-expressive property of {col(X_i)} and {col(Y_i)} it follows that the missing elements are given by:

x_i = argmin_x rank(X),  y_i = argmin_y rank(Y)   (12)

Similarly, in the presence of outliers, X, Y can be decomposed into the sum of a low-rank matrix (the clean data) and a sparse one (the outliers) by solving a problem of the form:

min  rank([X_o; Y_o]) + λ ‖[E_X; E_Y]‖_0   s.t.  [X_o; Y_o] + [E_X; E_Y] = [X; Y]

From the reasoning above it follows that, in the presence of noise and exogenous inputs, the clean data record can be recovered from the corrupted, partial measurements by solving the following optimization problem:

min  ‖[X_o; Y_o]‖_* + λ_1 ‖M_XY ∘ E_XY‖_1 + λ_2 ‖M_XY ∘ ΔU_XY‖_1 + λ_3 ‖Ξ_XY‖_F
subject to:  [X; Y] = [X_o; Y_o] + [E_X; E_Y] + [U_X; U_Y] + [Ξ_X; Ξ_Y]   (13)

where we have used the standard convex relaxations of rank and cardinality². Here Ξ and U denote the noise and the piecewise-constant exogenous matrices, ΔU denotes the matrix obtained by taking the difference between consecutive elements in U, and M_X (M_Y) is a "mask" matrix, with m_ij = 0 if the element (i, j) in X (Y) is missing and m_ij = 1 otherwise, used to avoid penalizing elements in E, Ξ, U corresponding to missing data. Problem (13) is a structured robust PCA problem (due to the Hankel structure of X, Y) that can be efficiently solved using the first-order method proposed in [3], slightly modified to handle the terms containing ΔU.

²As shown in [6, 8], under suitable conditions these relaxations recover the exact minimum-rank solution.
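For readers who want to prototype (13) before reaching for the specialized solver of [3], the following is a minimal cvxpy sketch under our own simplifying assumptions: the Hankel structure of X and Y is not enforced, time is assumed to run along the columns, and the weights lam1, lam2, lam3 are placeholders to be tuned.

```python
import numpy as np
import cvxpy as cp

def clean_data(D, M, lam1=1.0, lam2=1.0, lam3=1.0):
    # D: stacked (possibly corrupted) data [X; Y], with missing entries
    # filled arbitrarily (e.g. 0); M: 0/1 mask, 0 where data is missing.
    # Convex program in the spirit of (13); the Hankel structure of X, Y
    # (which the first-order solver of [3] exploits) is NOT enforced here.
    L = cp.Variable(D.shape)    # low-rank clean part [Xo; Yo]
    E = cp.Variable(D.shape)    # sparse outliers
    U = cp.Variable(D.shape)    # piecewise-constant exogenous part
    Xi = D - L - E - U          # whatever is left is treated as noise
    dU = U[:, 1:] - U[:, :-1]   # differences along time (columns, by assumption)
    obj = (cp.normNuc(L)
           + lam1 * cp.norm(cp.multiply(M, E), 1)
           + lam2 * cp.norm(cp.multiply(M[:, 1:], dU), 1)
           + lam3 * cp.norm(cp.multiply(M, Xi), 'fro'))
    cp.Problem(cp.Minimize(obj)).solve()
    return L.value
```

In practice the structured first-order method of [3] should be preferred, since generic solvers for nuclear-norm problems of this kind do not scale to long sequences.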
5. Experimental Results

In this section we illustrate the effectiveness of the proposed approach using several video clips (provided as supplemental material). The results of the experiments are displayed using graphs embedded on the video frames: an arrow indicates causal correlation between agents, with the point of the arrow indicating the agent whose actions are affected by the agent at its tail. The internal parameters of the algorithm were experimentally tuned, leading to the values ε = 0.1, λ = 0.05, and self-loop weight S = 10. The algorithm is fairly insensitive to the values of the regularization parameters λ and S, which could be adjusted up or down by an order of magnitude without affecting the structure of the resulting graph. Finally, we used regressor order N = 2 for the first three examples and N = 4 for the last one, a choice that is consistent with the frame rate and the complexity of the actions taking place in each clip.

5.1. Clips from the UT-Interaction Data Set

We considered two video clips from the UT Human Interaction Data Set [15] (sequences 6 and 16). Figures 2 and 5 compare the results obtained by the proposed algorithm against Group Lasso (GL) [11] and Group Lasso combined with the reweighted heuristic described in (9) (GL-RW). In all cases, the inputs to the algorithms were the (approximate) coordinates of the heads of each of the agents, normalized to the interval [−1, 1] and artificially corrupted with 10% outliers. Notably, the proposed algorithm was able to correctly identify the correlations between the agents from this very limited amount of information, while the others failed to do so. Note in passing that in both cases none of the algorithms was directly applicable, due to some of the individuals leaving the field of view or being occluded. As illustrated in Fig. 3, the missing data was recovered by solving an RPCA problem prior to applying Algorithm 1. Finally, Fig. 4 sheds more light on the key role played by the sparse signal u: as shown there, changes in u correspond exactly to the time instants at which the behavior of the corresponding agent deviates from the general pattern followed during most of the clip.

Figure 2. Sample frames from UT sequence 6 with the identified causal connections superimposed. Top: proposed method. Center: Reweighted Group Lasso. Bottom: Group Lasso. Only the proposed method identifies the correct connections.

Figure 3. Time traces of the individual heads in UT sequence 6, artificially corrupted with 10% outliers. The outliers were removed, and the missing data due to targets leaving the field of view was estimated by solving a modified RPCA problem.

Figure 4. Sample (derivative-sparse) exogenous signals in UT sequence 6 (horizontal axis: frame number). The changes correspond to the instants when the second person starts moving towards the first, who remains stationary, and when the two persons merge in an embrace.

Figure 5. Sample frames from UT sequence 16. Top: correct correlations identified by the proposed method. Center and Bottom: Reweighted Group Lasso and Group Lasso (circles indicate self-loops).

5.2. Doubles Tennis Experiment

This experiment considers a non-staged, real-life scenario. The data consist of 230 frames of a video clip from the Australian Open Tennis Doubles Final. The goal here is to identify causal relationships between the different players using time traces of the respective centroid positions. Note that in this case the ground truth is not available. Nevertheless, since players from the same team usually look at their opponents and react to their motions, we expect a strong causality connection between members of opposite teams. This intuition is matched by the correlations unveiled by the algorithm, shown in Fig. 6. The identified sparse input corresponding to the vertical direction is shown in Fig. 7 (similar results for the horizontal component are omitted for space reasons).

Figure 6. Sample frames from the tennis sequence. Top: the proposed method correctly identifies interactions between opposite team members. Center: Reweighted Group Lasso misses the interaction between the two rear-most individuals of opposite teams, generating self-loops instead (denoted by the disks). Bottom: Group Lasso yields an almost complete graph.

Figure 7. Exogenous signal corresponding to the vertical axis for the tennis sequence. The change in one component corresponds to the instant when the leftmost player in the bottom team moves from the line towards the net, remaining closer to it from then on.
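Since the recovered exogenous inputs are piecewise constant by construction, reading off the event instants reported in Figs. 4 and 7 amounts to thresholding their first differences. A minimal sketch of this post-processing step (the tolerance tol and the function name are our choices, not the paper's):

```python
import numpy as np

def change_instants(u, tol=1e-2):
    # Return the frame indices t at which the recovered exogenous input u
    # jumps, i.e. |u(t) - u(t-1)| > tol; for a derivative-sparse u these
    # are the candidate instants at which an agent's interaction pattern changes.
    return np.flatnonzero(np.abs(np.diff(u)) > tol) + 1
```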
5.3. Basketball Game Experiment

This experiment considers the interactions amongst players in a basketball game. As in the case of the tennis players, since the data come from a real-life scenario, the ground truth is not available. However, contrary to the tennis game, this scenario involves complex interactions amongst many players, and causality is hard to discern by inspection. Nevertheless, the results shown in Fig. 8, obtained using the positions of the centroids as inputs to our algorithm, match our intuition. Firstly, one would expect a strong cause/effect connection between the actions of the player with the ball and the two defending opponents facing him. These connections (denoted by the yellow arrows) were indeed successfully identified by the algorithm. The next set of causal correlations is represented by the (blue, light green) and (black, white) arrow pairs, showing the defending and the opponent players on the far side of the field and under the hoop. An important, counterintuitive connection identified by the algorithm is represented by the magenta arrows between the right winger of the white team and two of his teammates: the one holding the ball and the one running behind all players. While at first sight this connection is not as obvious as the others, it becomes apparent towards the end of the sequence, when the right winger is signaling with a raised arm. Notably, our algorithm was able to unveil this signaling without the need to perform a semantic analysis (a very difficult task here, since the signaling is apparent only in the last few frames). Rather, it used the fact that the causal correlation was encapsulated in the dynamics of the relative motions of these players.

6. Conclusions

In this paper we propose a new method for detecting causal interactions between agents using video data. The main idea is to recast this problem as blind identification of a directed graph topology, where each node corresponds to the observed motion of a given target, each link indicates the presence of a causal correlation, and the unknown inputs account for changes in the interaction patterns. In turn, this problem can be reduced to that of finding block-sparse solutions to a set of linear equations, which can be efficiently accomplished using an iterative re-weighted Group-Lasso approach. The ability of the algorithm to correctly identify causal correlations, even in cases where portions of the data record are missing or corrupted by outliers, and the key role played by the unknown exogenous input, were illustrated with several examples involving non-trivial interactions amongst several human subjects. Remarkably, the proposed algorithm was able to identify both the correct interactions and the time instants when interactions amongst agents changed, based on minimal motion information: in all cases we used just a single time trace per person.
This success indicates that in many scenarios, the dynamic information contained in the motion pattern of a single feature associated with a target is rich enough to enable identifying complex interaction patterns, without the need to track multiple features, perform a semantic analysis, or use additional domain knowledge.

Figure 8. Sample frames from a basketball game. Top: proposed method. Center: Reweighted Group Lasso misses the interaction between the signaling player and his teammates. Bottom: Group Lasso yields an almost complete graph.

References
[1] A. Arnold, Y. Liu, and N. Abe. Estimating brain functional connectivity with sparse multivariate autoregression. In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 66–75, 2007.
[2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps. Dynamic subspace-based coordinated multicamera tracking. In 2011 IEEE ICCV, pages 2462–2469, 2011.
[3] M. Ayazoglu, M. Sznaier, and O. Camps. Fast algorithms for structured robust principal component analysis. In 2012 IEEE CVPR, pages 1704–1711, June 2012.
[4] A. Bolstad, B. Van Veen, and R. Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011.
[6] E. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3), 2011.
[7] E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, December 2008.
[8] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., 21(2):572–596, 2011.
[9] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, pages 424–438, 1969.
[10] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In 2009 IEEE CVPR, pages 2012–2019, 2009.
[11] S. Haufe, G. Nolte, K. R. Muller, and N. Kramer. Sparse causal discovery in multivariate time series. In Neural Information Processing Systems, 2009.
[12] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
[13] D. Materassi, G. Innocenti, and L. Giarre. Reduced complexity models in identification of dynamical networks: Links with sparsification problems. In 48th IEEE Conference on Decision and Control, pages 4796–4801, 2009.
[14] K. Prabhakar, S. Oh, P. Wang, G. Abowd, and J. Rehg. Temporal causality for the analysis of visual events. In IEEE Conf. Comp. Vision and Pattern Recog. (CVPR), pages 1967–1974, 2010.
[15] M. S. Ryoo and J. K. Aggarwal. UT Interaction Dataset, ICPR contest on Semantic Description of Human Activities. http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html, 2010.
[16] J. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
[17] S. Yi and V. Pavlovic. Sparse granger causality graphs for human action classification. In 2012 ICPR, pages 3374–3377.
[18] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

6 0.95762908 213 iccv-2013-Implied Feedback: Learning Nuances of User Behavior in Image Search

7 0.93772489 46 iccv-2013-Allocentric Pose Estimation

8 0.92381895 184 iccv-2013-Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion

same-paper 9 0.91030788 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

10 0.87977189 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso

11 0.87614322 14 iccv-2013-A Generalized Iterated Shrinkage Algorithm for Non-convex Sparse Coding

12 0.8723591 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search

13 0.84604257 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration

14 0.83964753 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples

15 0.83914047 398 iccv-2013-Sparse Variation Dictionary Learning for Face Recognition with a Single Training Sample per Person

16 0.83205652 52 iccv-2013-Attribute Adaptation for Personalized Image Search

17 0.82875973 154 iccv-2013-Face Recognition via Archetype Hull Ranking

18 0.82771295 114 iccv-2013-Dictionary Learning and Sparse Coding on Grassmann Manifolds: An Extrinsic Solution

19 0.8257944 44 iccv-2013-Adapting Classification Cascades to New Domains

20 0.8254118 106 iccv-2013-Deep Learning Identity-Preserving Face Space