iccv iccv2013 iccv2013-418 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
Reference: text
sentIndex sentText sentNum sentScore
1 of Electrical and Computer Engineering Northeastern University, Boston, MA 02115 camps @ coe . [sent-4, score-0.177]
2 The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. [sent-12, score-0.777]
3 Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. [sent-13, score-0.207]
4 The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. [sent-14, score-0.644]
5 Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor. [sent-15, score-0.172]
6 Introduction Recent advances in the accuracy and efficiency of object detectors [13, 16], particularly pedestrian detectors, have inspired and fueled multi-target tracking approaches by detection. [sent-17, score-0.104]
7 These techniques proceed by detecting the targets frame by frame using a high quality object detector and then associating these detections by using online or offline trackers [6, 3 1, 33]. [sent-18, score-0.347]
8 Often, these associations are based on appearance and location similarity, while the start and end of the tracks are handled using “source” and “sink” nodes. [sent-19, score-0.193]
9 Theirapearnce does not help, but their motion aids to disambiguate them. [sent-23, score-0.104]
10 patterns, and source and sink nodes can be naturally placed at the boundaries of the field of view. [sent-24, score-0.154]
11 However, these algorithms do not perform as well when targets have similar appearance, do not move with the assumed dynamics, or come out in the middle of the field of view as in the example shown in Figure 1. [sent-26, score-0.287]
12 While there are trackers that rely less on appearance [3, 8, 9, 12, 30], they often require tuning of a large number of parameters and expertise to adapt the algorithms to these more challenging scenarios. [sent-27, score-0.155]
13 It is also possible to track solely based on dynamics using Kalman [20] or particle [19] filters to predict the target location and associate the closest detection to this prediction. [sent-28, score-0.397]
14 However, these approaches must assume a dynamic model a priori and have trouble distinguishing close to each other targets. [sent-29, score-0.066]
15 [14] showed that it is possible to use dynamics to compare tracks and disambiguate between targets without assuming a motion model a priori. [sent-31, score-0.715]
16 Instead, comparisons are based on the complexity of the underlying dynamics which is estimated by minimizing the rank of a Hankel matrix constructed directly from the available data, potentially fragmented and corrupted by noise. [sent-32, score-0.521]
17 However, the computational and memory complexity of IP methods is O(l6) and O(l4), respectively, where l is the length of the trajectory. [sent-34, score-0.133]
18 Thus, until now this approach has been limited to stitching short trajectories of a few targets. [sent-35, score-0.094]
19 22330044 In this paper we propose a framework to use motion dynamics for multi-target tracking by detection that can efficiently handle large numbers of targets, long trajectories, missing data, and arbitrary motions. [sent-36, score-0.496]
20 This is accomplished by i) formulating the problem as a generalized linear assignment (GLA) of tracklets which are incrementally associated into longer trajectories based on their dynamics-based similarity, and ii) using efficient algorithms to estimate these similarity measures. [sent-37, score-0.626]
21 Contributions The benefit of using a GLA approach is that it avoids the need to make an a priori commitment about the start and termination nodes for location or time of the considered trajectories. [sent-40, score-0.289]
22 We explore two algorithms that differ on the method they use to estimate dynamics similarity. [sent-41, score-0.319]
23 The first method replaces the IP optimization step used in [14] with an alternating direction method of multipliers (ADMM) [4] with computational complexity O(l3) and low memory requirements. [sent-42, score-0.133]
24 Similarly to IP methods, this approach solves a relaxation instead of the original problem and requires setting a parameter weighting the noise penalty and a singular value threshold to estimate the rank which are both difficult to choose. [sent-43, score-0.403]
25 The second method IHTLS (iterative Han- kel Total Least Squares) is a new algorithm that we propose to directly estimate the rank of incomplete, noisy Hankel matrices. [sent-44, score-0.238]
26 The algorithm is based on a noise cleaning algorithm for Hankel matrices introduced in [24] that we modified to handle missing data and estimate rank. [sent-45, score-0.126]
27 The advantages of IHTLS is that it solves the original problem instead of a convex relaxation, it does not require choosing a singular values threshold and it has computational complexity O((l −n)n3), where n < lis the rank of the Hankel matrix. [sent-46, score-0.398]
28 These algorithms were tested and compared against state-of-the-art approaches [7, 12] on a set of videos with challenging scenarios where targets are difficult to discriminate based on appearance alone. [sent-49, score-0.393]
29 The dataset, annotated with ground truth target locations, is available for public use at our website. [sent-50, score-0.066]
30 The experiments show that multi-target tracking using IHTLS performs faster and more accurately than when using ADMM and that both techniques perform significantly better and faster than the state of the art, where performance is measured using the MOTA metric. [sent-51, score-0.072]
31 Finally, it should be noted that the proposed methods are complementary to appearance-based methods: they can improve the performance of appearance-based methods when visual discrimination is possible, yet retain target identities when such information is not available. [sent-52, score-0.066]
32 Themthod f[7]ailsduetocmplexdynamicsand unexpected start and ending locations for the targets trajectories. [sent-54, score-0.396]
33 Other Related Work Often, multi-target tracking is formulated as a network flow problem [33] or variations of it [5, 7, 25, 26, 30]. [sent-57, score-0.112]
34 Most methods rely to a large degree on target appearance and assume simple motion models a priori that work well in settings where the targets are pedestrians or vehicles. [sent-58, score-0.542]
35 However, their performance deteriorates in more challenging scenarios as shown in Figure 2 where the algorithm from [7] fails to track the balls shown in Figure 1. [sent-59, score-0.073]
36 A drawback of using a network flow formulation is that it requires setting up beginning and ending locations and/or times of the targets which, depending on the scenario, may be hard to pinpoint in advance1 . [sent-60, score-0.473]
37 An alternative to network flow are max-clique type formulations. [sent-61, score-0.04]
38 [11] used a maximum weighted set formulation but they only consider two-frames relations. [sent-63, score-0.036]
39 [32] proposed using a General Maximum Clique Partitioning formulation picking one-best candidate from each track- let to achieve global association. [sent-65, score-0.036]
40 The GLA formulation used here is similar to the Linear Assignment formulation of [12, 7, 33], but it has the advantage that allows tracks to start and terminate anywhere in position and time. [sent-66, score-0.21]
41 In spirit, we are also similar to [18] and [32] since we also solve the associations iteratively from easy to hard. [sent-67, score-0.048]
42 Like [32] our algorithm can operate at the tracklet level, but we use all the data on a tracklet rather than a selected portion. [sent-68, score-0.486]
43 In addition, we do not assume any priors for the target motion as in [12, 30]. [sent-69, score-0.125]
44 This allows our algorithm to capture long term dynamics as targets may change their motion behavior over time. [sent-70, score-0.625]
45 Lastly, the dynamics based similarity measure used by our algorithm can deal with different types of scenes, target and camera dynamics (including zooming) while providing a natural way to “inpaint” missing data. [sent-71, score-0.768]
46 There are few multi-target trackers for non-human targets, and they either rely on appearance [11] or prior motion models [8]. [sent-72, score-0.183]
47 More recently, [12] and [3] formulate the multi-target tracking problem using higher order motion models which are one-step better than models previously used but still limited. [sent-74, score-0.131]
48 In contrast, our method can handle arbitrarily high order motions with a run time O(N2), where N << d is the number of tracklets and requires only two parameters. [sent-78, score-0.306]
49 Dynamics-based Multi-tracklet Association Given a set of short tracklets, possibly of different lengths and with no appearance information, we want to associate together those that belong to the same trajectory. [sent-80, score-0.15]
50 There are four challenges that makes this association task difficult: 1. [sent-81, score-0.032]
51 To address these challenges, we formulate the multitarget tracking problem as a Generalized Linear Assignment (GLA) using a dynamics-based similarity measure. [sent-86, score-0.13]
52 Generalized Linear Assignment Problem Given N tracklets {α1 , . [sent-89, score-0.306]
53 , αN}, the Linear Assignment (LAG)i vPeronb Nle tmra cisk sletattse {dα as the optimization problem: ? [sent-92, score-0.037]
54 j=1 where Pij is a suitable similarity measure between αi and αj, such that P is a predecessor-successor matrix, i. [sent-103, score-0.058]
55 The optimization ivsar −ia∞ble Xij indicates that αi is the predecessor of αj when Xij = 1and that they will be merged considering their gap. [sent-106, score-0.075]
56 Equation (1) is the max-flow formulation used by [7, 33] where the constraints enforce that each tracklet has to be assigned to one predecessor and one successor. [sent-107, score-0.354]
57 In general, the problem is augmented with source and sink nodes to simulate the entrance and termination of the tracklets. [sent-108, score-0.315]
58 To avoid this requirement, we use the Generalized Linear Assignment (GLA) [27, 28] formulation (2): ? [sent-109, score-0.036]
59 j=1 The main difference between LA and GLA, is that in the latter, tracklets are not forced to begin, terminate or associate with any other tracklet. [sent-120, score-0.415]
60 avoids setting up sink source nodes and learning tracklet entrance and termination probabilities; and 2. [sent-122, score-0.592]
61 In our experiments, it was observed that the dynamics-based similarity measure we use reduces possible ambiguities and leads softassign to fast and accurate convergence (on average converges in 10 iterations and never takes more than 100 iterations). [sent-125, score-0.127]
62 Tracklet Dynamics and Similarity Measure A tracklet α consists of an ordered sequence of measurements yk, s ≤ k ≤ e, where s and e are the starting and ending tim, ses, ≤ respectively. [sent-128, score-0.316]
63 eTrhee s underlying dynamics ogf a atnhed tracklet can be represented using a linear regressor, since linear regressors are universal approximators [10]: ? [sent-129, score-0.636]
64 The order of the regressor n yk measures the “complexity” of the underlying dynamics and in the absence of noise, n = where is the Hankel matrix with m ≥ n columns: rank(H(αm)) Hα(m) ThenH,(αthme)=˙dy⎡⎢⎣ nyaem−ysi. [sent-133, score-0.418]
65 mbe −t1w⎦⎥ ⎤entw(4o) tracklets αi, αj is defined [14] as: Pij =˙⎨⎧−ra∞nmki(nHβijαir)a+nrka(nHkα(iHjα)j)− 1 oifth αeirwanisde αjconflict where ⎩αij = [αi βij (5) αj] is the joint tracklet padded with βij tracklet at the gap between αi and αj values. [sent-138, score-0.823]
66 The intuition behind the above similarity measure is that if two tracklets are portions of the same trajectory they can be approximated by a single, relatively low order regressor. [sent-139, score-0.406]
67 On the other hand, if two tracklets belong to different trajectories, explaining a merged/joined trajectory requires a higher order regressor than the regressors of each tracklet2. [sent-140, score-0.493]
68 Thus, intuitively, if rank(Hαi ) = ri and rank(Hαj ) = rj, then rank(Hαij ) = rij ≤ (ri + rj). [sent-141, score-0.088]
69 Accordingly, if αi and αj belong to the same trajectory, then ri = rj = rij and Pij = 1, but if αi and αj are not related Pij ≈ 0. [sent-142, score-0.182]
70 Dynamics-based Similarity Computation A major challenge in computing (5) is that one has to estimate the rank of noisy and incomplete structured matrices. [sent-145, score-0.258]
71 [14] addressed this problem by using IP methods to 2This assumes that the object does not drastically change dynamics between tracklets. [sent-146, score-0.279]
72 This is a fair assumption for tracklets that are close to each other in time and akin to assume that appearance will stay similar over-time and change slowly in appearance-based target tracking [6, 11, 18]. [sent-147, score-0.508]
73 However, this minimization has computational complexity O(l6) and O(l4) memory requirements, due to the need to store the Hessian. [sent-155, score-0.133]
74 [22] introduced a different similarity score, based on the subspace angle between the column spaces of the Hankel matrices, that is efficient to compute and does not require estimating rank. [sent-158, score-0.058]
75 However, since the subspace angle is invariant to the initial conditions of the trajectories, it cannot be used to distinguish two targets with the same dynamics and hence is not suitable for this application. [sent-159, score-0.566]
76 We propose two alternatives to address these issues: Rank estimation using ADMM An alternative to using IP methods is to solve a similar convex relaxation using ADMM [4]: mβiijn? [sent-160, score-0.117]
77 The procedure has very modest memory requirements and computational cost O(l3) at each iteration. [sent-166, score-0.117]
78 Thus, it scales well as the length of the trajectories increase. [sent-167, score-0.094]
79 Rank estimation using IHTLS We propose as a second alternative a new algorithm to estimate the rank of an incomplete Hankel matrix corrupted with additive noise. [sent-168, score-0.258]
80 The algorithm is based on two simple modifications of the Hankel Total Least Squares (HTLS) algorithm [24] which allow us to handle missing data and to estimate rank. [sent-169, score-0.126]
81 The first modification is the introduction of an “indicator” binary vector to flag missing data and allow its recovery while performing inpainting to stitch tracklets with gaps. [sent-170, score-0.497]
82 The second modification is to run this algorithm, iteratively, for increasing rank values to find the optimal rank. [sent-171, score-0.2]
83 Thus, like the approaches in [14] and in [4] the algorithm, described in detail below, does not only estimate the rank, but also cleans and completes the data while respecting the structural constraint that H ∈ SH. [sent-172, score-0.075]
wordName wordTfidf (topN-words)
[('hankel', 0.345), ('tracklets', 0.306), ('targets', 0.287), ('dynamics', 0.279), ('gla', 0.276), ('tracklet', 0.243), ('xij', 0.175), ('ihtls', 0.168), ('rank', 0.161), ('pij', 0.139), ('admm', 0.119), ('ij', 0.117), ('sink', 0.115), ('ip', 0.113), ('coe', 0.112), ('trajectories', 0.094), ('neu', 0.089), ('singular', 0.088), ('missing', 0.086), ('entrance', 0.084), ('mxaxi', 0.084), ('assignment', 0.084), ('relaxation', 0.083), ('termination', 0.077), ('predecessor', 0.075), ('ending', 0.073), ('tracking', 0.072), ('softassign', 0.069), ('priori', 0.066), ('target', 0.066), ('camps', 0.065), ('appearance', 0.064), ('regressor', 0.061), ('rj', 0.06), ('trackers', 0.06), ('motion', 0.059), ('similarity', 0.058), ('incomplete', 0.057), ('terminate', 0.057), ('ses', 0.052), ('associate', 0.052), ('rij', 0.05), ('regressors', 0.05), ('memory', 0.049), ('complexity', 0.048), ('associations', 0.048), ('tracks', 0.045), ('disambiguate', 0.045), ('yk', 0.045), ('generalized', 0.044), ('scenarios', 0.042), ('trajectory', 0.042), ('network', 0.04), ('estimate', 0.04), ('modification', 0.039), ('nodes', 0.039), ('ri', 0.038), ('pinpoint', 0.037), ('sear', 0.037), ('tmha', 0.037), ('dicle', 0.037), ('eccs', 0.037), ('octavia', 0.037), ('sznaier', 0.037), ('tmra', 0.037), ('kel', 0.037), ('commitment', 0.037), ('computational', 0.036), ('formulation', 0.036), ('start', 0.036), ('vai', 0.035), ('alert', 0.035), ('dhs', 0.035), ('mario', 0.035), ('cleans', 0.035), ('etrhee', 0.035), ('syst', 0.035), ('ems', 0.035), ('hna', 0.035), ('stitch', 0.035), ('convex', 0.034), ('avoids', 0.034), ('belong', 0.034), ('underlying', 0.033), ('inpaint', 0.033), ('northeastern', 0.033), ('consequences', 0.033), ('sh', 0.032), ('association', 0.032), ('pedestrian', 0.032), ('requirements', 0.032), ('thresholding', 0.031), ('deteriorates', 0.031), ('expertise', 0.031), ('mota', 0.031), ('flag', 0.031), ('atnhed', 0.031), ('padded', 0.031), ('crossings', 0.031), ('solves', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 418 iccv-2013-The Way They Move: Tracking Multiple Targets with Similar Appearance
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
2 0.28797081 393 iccv-2013-Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos
Author: Baoyuan Wu, Siwei Lyu, Bao-Gang Hu, Qiang Ji
Abstract: We describe a novel method that simultaneously clusters and associates short sequences of detected faces (termed as face tracklets) in videos. The rationale of our method is that face tracklet clustering and linking are related problems that can benefit from the solutions of each other. Our method is based on a hidden Markov random field model that represents the joint dependencies of cluster labels and tracklet linking associations . We provide an efficient algorithm based on constrained clustering and optimal matching for the simultaneous inference of cluster labels and tracklet associations. We demonstrate significant improvements on the state-of-the-art results in face tracking and clustering performances on several video datasets.
3 0.16278879 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
4 0.15174289 120 iccv-2013-Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
Author: K.C. Amit Kumar, Christophe De_Vleeschouwer
Abstract: Given a set of plausible detections, detected at each time instant independently, we investigate how to associate them across time. This is done by propagating labels on a set of graphs that capture how the spatio-temporal and the appearance cues promote the assignment of identical or distinct labels to a pair of nodes. The graph construction is driven by the locally linear embedding (LLE) of either the spatio-temporal or the appearance features associated to the detections. Interestingly, the neighborhood of a node in each appearance graph is defined to include all nodes for which the appearance feature is available (except the ones that coexist at the same time). This allows to connect the nodes that share the same appearance even if they are temporally distant, which gives our framework the uncommon ability to exploit the appearance features that are available only sporadically along the sequence of detections. Once the graphs have been defined, the multi-object tracking is formulated as the problem of finding a label assignment that is consistent with the constraints captured by each of the graphs. This results into a difference of convex program that can be efficiently solved. Experiments are performed on a basketball and several well-known pedestrian datasets in order to validate the effectiveness of the proposed solution.
5 0.13357973 167 iccv-2013-Finding Causal Interactions in Video Sequences
Author: Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps
Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving non–trivial interactions amongst several human subjects. 1. Introduction and Motivation The problem of identifying causal interactions amongst targets in a video sequence has been the focus of considerable attention in the past few years. A large portion of the existing body of work in this field uses human annotated video to build a storyline that includes both recognizing the activities involved and the causal relationships between them (see for instance [10] and references therein). While these methods are powerful and work well when suitably annotated data is available, annotating video clips is expensive and parsing relevant actions requires domain knowledge which may not be readily available. Indeed, in many situations, unveiling potentially hidden causal relationships is a first step towards building such knowledge. In this paper we consider the problem of identifying causal interactions amongst targets, not necessarily human, ∗This work was supported by NSF grants IIS–0713003, IIS-1318145, and ECCS–0901433, AFOSR grant FA9559–12–1–0271, and the Alert DHS Center of Excellence under Award Number 2008-ST-061-ED0001 . from unannotated video sequences and without prior domain knowledge. Our approach exploits the concept of “Granger Causality” [9], that formalizes the intuitive idea that ifa time series {x(t)} is causally related to a second one {thya(tt)if}a, ttihmene knowledge }oifs tchaeu past vrealluateesd otfo a{yse}c1to should l{eya(dt t}o, a ebnett kern prediction o thf efu ptuasret vvaalulueess ooff {{yx}}tt+k. In [l1ea4d], Pora ab bheatktearr eprt.e aicl.t successfully vuasleude a frequency domain reformulation of this concept to uncover pairwise interactions in scenarios involving repeating events, such as social games. This technique was later extended in [17] to model causal correlations between human joints and applied to the problem of activity classification. However, since this approach is based upon estimating the crosscovariance density function between events, it cannot handle situations where these events are non repeating, are too rare to provide an accurate estimate, or where these estimates are biased by outliers or missing data. Further, estimating a pairwise measure of causal correlation requires a spectral factorization of the cross-covariance, followed by numerical integration and statistical thresholding, limiting the approach to moderately large problems. To circumvent these problems, in this paper we propose an alternative approach based upon recasting the problem into that of identifying the topology of a sparse (directed) graph, where each node corresponds to the time traces of relevant features of a target, and each link corresponds to a regressor. The situation is illustrated in Fig. 1 using as an example the problem of finding causal relations amongst 4 tennis players, leading to a graph with 4 nodes, and potentially 12 (directed) links. Note that in general, the problem of identifying causal relationships is ill posed (unless one wants to identify the set of all individuals that could possibly have causal connections), due to the existence of secondary interactions. To illustrate this point, consider a very simplistic scenario with three actors A, B, and C, where A copies (with some delay) the actions of B, which in turn mimics C, also with some delay. In this situation, the ac- tions of A can be explained in terms of either those of B delayed one time sample, or those of C delayed by two samples. Thus, an algorithm based upon a statistical analysis 33556758 would identify a causal connection between A and C, even though there is no direct link between them. Further, if the actions of C can be explained by some simple autoregressive model of the form: = C(t) ?aiC(t − i) then it follows that the acti?ons of A can be explained by the same model, e.g. = A(t) ?aiA(t − i) Hence, multiple graphs topologies, some of which include self-loops, can explain the same set of time-series. On the other hand, note that in this situation, the sparsest graph (in the sense of having the fewest links) is the one that correctly captures the causality relations: the most direct cause of A is B and that of B is C, with C potentially being explained by a self-loop. To capture this feature and regularize the problem, in the sequel we will seek to find the sparsest graph, in the sense of having the least number of interconnections, that explains the observed data, reflecting the fact that, when alternative models are possible, often the most parsimonious is the correct one. Our main result shows that the problem of identifying sparse graph structures from observed noisy data can be reduced to a convex optimization problem (via the use of Group Lasso type arguments) that can be efficiently solved. The advantages of the proposed methods are: • • • • Its ability to handle complex scenarios involving nonrepeating events, een cvoimropnlmeexn stcael changes, clvoillnegct nioonnsof targets that do not necessarily split into well defined groups, outliers and missing data. The ability to identify the sparsest interaction structure tThhaet explains th idee nobtifseyr tvheed s dpaartas e(stthu inst avoiding labeling as causal connections those indirect correlations mediated only by an intermediary), together with a sparse “indicator” function whose support set indicates time instants where the interactions between agents change. Since the approach is not based on semantic analysis, iSt can bt hee applied ctoh ti she n moto btiaosne dof o arbitrary targets, sniost, necessarily humans (indeed, it applies to arbitrary time series including for instance economic or genetic data). From a computational standpoint, the resulting optiFmriozmatio an c problems nhaalve s a specific fthoerm re asmuletinnagbl oep ttiobe solved by a class of iterative algorithms [5, 3], that require at each step only a combination of thresholding and least-squares approximations. These algorithms have been shown to substantially outperform conventional convex-optimization solvers both in terms of memory and computation time requirements. The remainder of the paper is organized as follows. In section 2 we provide a formal reformulation of the problem of finding causal relationships between agents as a sparse graph identification problem. In section 3, we show that this problem can be efficiently solved using a re-weighted Group Lasso approach. Moreover, as shown there, the resulting problem can be solved one node at a time using first order methods, which allows for handling situations involving a large number of agents. Finally, the effectiveness of the proposed method is illustrated in section 4 using both simple scenarios (for which ground truth is readily available) and video clips of sports, involving complex, nonrepeating interactions amongst many agents. Figure 1. Finding causal interactions as a graph identification problem. Top: sample frame from a doubles tennis sequence. Bottom: Representation of this sequence as a graph, where each node represents the time series associated with the position of each player and the links are vector regressive models. Causal interactions exist when one of the time series can be explained as a combination of past values of the others. 2. Preliminaries For ease of reference, in this section we summarize the notation used in the paper and give a formal definition of the problem under consideration. 2.1. Notation (M) ?M? ??MM??F ?M?1 ?M?o σi ∗ ◦ ith largest singular value of the matrix M. nuclear norm: ?M? ?i σ?i (M). Fnruocbleeanrio nours norm: ??M?2F? ?i,j Mi2j ?1 norm: ?M? 1 ?i,j |Mij? ?|. ?o quasi-norm: ?M?o number of non-zero ?eleme?nMts i?n M. Hadamard product of matrices: (A ◦ ∗ =.: =. =. =. B)i,j = Ai,jBi,j. 33556769 2.2. Statement of the Problem Next, we formalize the problem under consideration. Consider a scenario with P moving agents, and denote by the 3D homogenous coordinates of the pth individual at time t. Motivated by the idea of Granger Causality, we will say that the actions of this agent depend causally from those in a set Ip (which can possibly contain p itself), if can be written as: Q˜p(t) Q˜p(t) Q˜p(t) ?N = ? ?ajp(n)Q˜j(t − n) +˜ η p(t) +˜ u p(t) (1) j? ?∈Ip ?n=0 Here ajp are unknown coefficients, and ˜η p(t) and up(t) represent measurement noise and a piecewise constant signal that is intended to account for relatively rare events that cannot be explained by the (past) actions of other agents. Examples include interactions of an agent with the environment, for instance to avoid obstacles, or changes in the interactions between agents. Since these events are infrequent, we will model as a signal that has (component-wise) a sparse derivative. Note in passing that since (1) involves homogeneous coordinates, the coefficients aj,p(.) satisfy the following constraint1 u ?N ? ?ajp(n) j? ?∈Ip ?n=0 =1 (2) Our goal is to identify causal relationships using as data 2D measurements qp(t) in F frames of the affine projections of the 3D coordinates Q˜p(t) of the targets. Note that, under the affine camera assumption, the 2D coordinates are related exactly by the same regressor parameters [2]. Thus, (1) holds if and only if: ?N qp(t) = ? ?ajp(n)qj(t − n) + u˜ p(t) + ηp(t) (3) j?∈Ip ?n=0 In this context, the problem can be precisely stated as: Given qp(t) (in F number of frames) and some a-priori bound N on the order of the regressors (that is the “memory” of the interactions), find the sparsest set of equations of the form (3) that explains the data, that is: aj,pm,ηinp,up?nIp (4) subject to? ?(2) and: = ? ?ajp(n)qj(t − n) + ?N qp(t) j? ?∈Ip ?n=0 up(t) + ηp(t) , p = 1 . . . , P and t = 1, ..F 1This follows by considering the third coordinate in (1) (5) where nIp denotes the cardinality of the set Ip. Rewriting (5) in matrixd efnoormtes yields: [xp; yp] = [Bp, I][apTuxTpuyTp]T + ηp (6) where qp(t) up(t) ηp(t) xp yp ap aip uxp uyp Bp Xp = [xp(t)Typ(t)T]T = [uTxp(t)uyTp(t)]T = [ηxp(t)Tηyp(t)T]T = = [xp(F)xp(F − 1)...xp(1)]T = [yp(F)yp(F − 1)...yp(1)]T [aT1p, a2Tp, ..., aTPp]T = [aip(0), aip(1), ..., aip(N)]T = [uxp(F)uxp(F−1)...uxp(1)]T = [uyp(F)uyp(F−1)...uyp(1)]T = = [Xp; Yp] [hankel(x1 , N) , ..., hankel(xP, N)] Yp = [hankel(y1, N), ..., hankel(yP, N)] and where, for a sequence z(t), hankel(z, N) denotes its associated Hankel matrix: hankel(z, N) = Itfolw⎛⎜⎝ sz t(hNzFa(t. +−a)d1 2e)scrzip(tF io(N. n− )o231f)al· t h· einzt(Frac−zti(.o1N.n)s−a)m12o)⎟ ⎞⎠ ngst uηaq= ? ηuqa1 T ,ηqau2 T ,ηaqu3 T ,· ·, ηauqP T ? T (8) Thus,inthBisc=on⎢⎣⎡teBx0t.1,theB0p.r2ob·le.·m·ofB0 i.nPte⎦⎥r ⎤estcanbeforagents (that is the complete graph structure) is captured by a matrix equation of the form: q = [B, I][aTuT]T + η (7) where and malized as finding the block–sparsest solution to the set of linear equations (2) and (7). 33557770 The problem of identifying a graph structure subject to sparsity constraints, has been the subject of intense research in the past few years. For instance, [1] proposed a Lasso type algorithm to identify a sparse network where each link corresponds to a VAR process. The main idea underlying this method is to exploit the fact that penalizing the ?1 norm of the vector of regression coefficients tends to produce sparse solutions. However, enforcing sparsity of the entire vector of regressor coefficients does not necessarily result in a sparse graph structure, since the resulting solution can consist of many links, each with a few coefficients. This difficulty can be circumvented by resorting to group Lasso type approaches [18], which seek to enforce block sparsity by using a combination of ?1 and ?2 norm constraints on the coefficients of the regressor. While this approach was shown to work well with artificial data in [11], exact recovery of the underlying network can be only guaranteed when the data satisfies suitable “incoherence” type conditions [4]. Finally, a different approach was pursued in [13], based on the use of a modified Orthogonal Least Squares algorithm, Cyclic Orthogonal Least Squares. However, this approach requires enforcing an a-priori limit on the number of links allowed to point to a single node, and such information may not be readily available, specially in cases where this number has high variability amongst nodes. To address these difficulties, in the next section we develop a convex optimization based approach to the problem of identifying sparse graph structures from observed noisy data. This method is closest in spirit to that in [11], in the sense that it is also based on a group Lasso type argument. The main differences consist in the ability to handle the unknown inputs up(t), needed to model exogenous disturbances affecting the agents, and in a reformulation of the problem, that allows for using a re-weighted iterative type algorithm, leading to substantially sparser solutions, even when the conditions in [4] fail. 3. Causality Identification Algorithm In this section we present the main result of this paper, an algorithm to search for block-sparse solutions to (7). For each fixed p, the algorithm searches for sparse solutions to (6) by solving (iteratively) the following problem (suggested by the re-weighted heuristic proposed in [7]) ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 subject to: ?ηp ? ≤ p = 1, . . , P. ∞ ?P ?, ?N ??aip(n) i?= ?1 ?n=0 ?. = 1, p = 1,...,P. (9) where [Δuxp ; Δuyp] represents the first order differences of the exogenous input vector [uxp ; uyp], Wa and Wu are weighting matrices, and λ is a Lagrange multiplier that plays the role of a tuning parameter between graph sparsity and event sensitivity. Intuitively, for a fixed set of weights w, the algorithm attempts to find a block sparse solution to (6) and a set of sparse inp?uts Δuxp ; Δuyp , by exploiting the facts that minimizing ?i ?aip ?2 (the ?2,1 norm of the vector sequence {aip}) te?nds? tao m?aximize block-sparsity [18], while minimizing et?nhed s? 1t norm mmaizxeim blizoceks sparsity [ [1168]]. wOhniclee t mheisnesolutions are found, the weights w are adjusted to penalize those elements of the sequences with small values, so that in the next iteration solutions that set these elements to zero (hence further increasing sparsity) are favored. Note however, that proceeding in this way, requires solving at each iteration a problem with n = P(Pnr + F) variables, where P and F denote the number of agents and frames, respectively, and where nr is a bound on the regressor order. On the other hand, it is easily seen that both the objective function and the constraints in (9) can be partitioned into P groups, with the pth group involving only the variables related to the pth node. It follows then that problem (9) can be solved by solving P smaller problems of the form: ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 ?P subject to: ?ηp?∞ ?N ≤ ? and ??aip(n) i?= ?1 ?n=0 leading to the algorithm given below: =1 (10) Algorithm 1: REWEIGHTEDCAUSALITYALGORITHM for each p wa = [1, 1, ..., 1] = [1, 1, ..., 1] S > 1(self loop weight) s = [1, 1, ..., S, ..., 1] (p’th element is S) while not converged do 1. solve (9) 2. wja = 1/( ?aip ?2 + δ) 3. wja = wja ◦ s (Penalization self loops) 4. = 1./(abs([Δuxp ; Δuyp]) + δ) end while 5. At this point ajp(.) , Ip and up(t) have been identified end for wu wu It is worth emphasizing that, since the computational complexity of standard interior point methods grows as n3, solving these smaller P problems leads to roughly a O(P2) 33557781 reduction in computational time over solving a single, larger optimization. Thus, this approach can handle moderately large problems using standard, interior-point based, semidefinite optimization solvers. Larger problems can be accommodated by noting that the special form of the objective and constraints allow for using iterative Augmented La- grangian Type Methods (ALM), based upon computing, at each step, the closed form solution to suitable intermediate optimization problems. While a complete derivation of such an algorithm is beyond the scope of this paper, using results from [12] it can be shown that each step requires only a combination of thresholding and least-squares approximations. Moreover, it can be shown that such an algorithm converges Q-superlinearly. 4. Handling Outliers and Missing Data The algorithm outlined above assumes an ideal situation where the data matrix B is perfectly known. However, in practice many of its elements may be outliers (due to misidentified correspondences) or missing (due to occlusion). As we briefly show next, these situations can be efficiently handled by performing a structured robust PCA step [3] to obtain a “clean” data matrix, prior to applying Algorithm 1. From equation (6) it follows that, in the absence of exogenous inputs and noise: ?xy11.. . .yxPP? = ?XY11.. . .YXPP? ?a1...aP? (11) Since xi ∈ {col(Xj)} and yi ∈ {col(Yj }), it follows that the sets {∈co {l(cXoli(X)} a)n}d a n{dco yl(Y∈i) { }c? are self-ex?pressive, or, ?equivalently?, Xthe }ma atnridce {sc oXl( =.) }? aXre1 . . . fX-eNxp? eanssdiv eY, ?Y1 ...YN? are mraantkri cdeesfic Xient. ?Consider no?w the case =.r, w?here some ?elements xi, yi of X and Y are missing. From ?the self-expressive property ooff {Xco aln(Xd Yi)} a raen dm i{scsoinlg(Y. Fi)ro} mit tfhoello swelsf tehxaptr ethsessieve missing eyle omf {encotsl are given by: xi = argmin rank(X) , yi x = argmin rank(Y) (12) y Similarly, in the presence of outliers, X, Y can be decomposed irnlyto, itnhe t sum oesfe a lcoew o fra onkut mlieartsr,ix X (,thYe ccalenan b eda dtae)c oamnda sparse one (the outliers) by solving a problem of the form minrank?YXoo?+ λ????EEYX????os. t.: ?XYoo?+?EEYX?=?YX? From the reasoning? abov?e it follows that in the presence of noise and exogenous outputs, the clean data record can be recovered from the corrupted, partial measurements by solving the following optimization problem: s+muλibn3je? ? ? cYXtM ot ? Y?X:∗◦ +Ξ λYX1? ? ? FM XY◦ E XY? ?1+λ2? ?M YX◦ Δ U YX? ?1 ?YX?=?XYoo?+?EEXY?+?UUYX?+?ΞΞYX? (13) where we have used the standard convex relaxations of rank and cardinality2. Here Ξ and U denote noise and piecewise constant exogenous matrices, ΔU denotes the matrix obtained by taking the difference between consecutive elements in U, and MX (MY) is a “mask” matrix, with mi,j = 0 if the element (i, j) in X ( Y) is missing, mi,j = 1 otherw=i0s e, i tuhseed e etom aenvtoi (di, penalizing )e lisem miesnstisn gin, mE, Ξ, U corresponding to missing data. Problem (13) is a structured robust PCA problem (due to the Hankel structure of X, Y) trhobatu can C bAe efficiently suoelv teod t using tkheel fsitrrsut oturrdeer o mf Xeth,oYd) proposed in [3], slightly modified to handle the terms containing ΔU. 5. Experimental Results In this section we illustrate the effectiveness of the proposed approach using several video clips (provided as supplemental material). The results of the experiments are displayed using graphs embedded on the video frames: An arrow indicates causal correlation between agents, with the point of the arrow indicating the agent whose actions are affected by the agent at its tail. The internal parameters of the algorithm were experimentally tuned, leading to the values ? = 0.1, = 0.05, self loop weights S = 10. The algorithm is fairly insensitive to the value of the regularization parameters and S, which could be adjusted up or down by an order of magnitude without affecting the structure of the resulting graph. Finally, we used regressor order N=2 for the first three examples and N=4 for the last one, a choice that is consistent with the frame rate and the complexity of λ λ the actions taking place in each clip. 5.1. Clips from the UT-Interaction Data Set We considered two video clips from the UT Human Interaction Data Set [15] (sequences 6 and 16). Figures 2 and 5 compare the results obtained applying the proposed algorithm versus Group Lasso (GL) [11] and Group Lasso combined with the reweighted heuristic described in (9) (GLRW). In all cases, the inputs to the algorithm were the (approximate) coordinates of the heads of each of the agents, normalized to the interval [−1, 1], artificially corrupted ,w niothrm m10al%iz eodut tloie trhs.e Notably, [t−he1 proposed algorithm 2As shown in [6, 8] under suitable conditions these relaxations the exact minimum rank solution. 33557792 recover Figure 2. Sample frames from the UT sequence 6 with the identified causal connections superimposed. Top: Proposed Method. Center: Reweighted Group Lasso. Bottom: Group Lasso. Only the proposed method identifies the correct connections. was able to correctly identify the correlations between the agents from this very limited amount of information, while the others failed to do so. Note in passing that in both cases none of the algorithms were directly applicable, due to some of the individuals leaving the field of view or being occluded. As illustrated in Fig. 3, the missing data was recovered by solving an RPCA problem prior to applying Algorithm 1. Finally, Fig. 4 sheds more insight on the key role played by the sparse signal u. As shown there, changes in u correspond exactly to time instants when the behavior of the corresponding agent deviates from the general pattern followed during most of the clip. Figure 3. Time traces of the individual heads in the UT sequence 6, artificially corrupted with 10 % outliers. The outliers were removed and the missing data due to targets leaving the field of view was estimated solving a modified RPCA problem. Frame number Figure 4. Sample (derivative sparse) exogenous signals in the UT sequence 6. The changes correspond to the instants when the second person starts moving towards the first, who remains stationary, and when the two persons merge in an embrace. Figure 5. Sample frames from the UT sequence 16. Top: Correct correlations identified by the Proposed Method. Center and Bottom: Reweighted Group Lasso and Group Lasso (circles indicate self-loops). 5.2. Doubles Tennis Experiment This experiment considers a non-staged real-life scenario. The data consists of 230 frames of a video clip from the Australian Open Tennis Doubles Final games. The goal here is to identify causal relationships between the different players using time traces of the respective centroid positions. Note that in this case the ground truth is not available. Nevertheless, since players from the same team usually look at their opponents and react to their motions, we expect a strong causality connection between members of 33557803 opposite teams. This intuition is matched by the correlations unveiled by the algorithm, shown in Fig. 6. The identified sparse input corresponding to the vertical direction is shown in Fig. 7 (similar results for the horizontal component are omitted due to space reasons.) Figure 6. Sample frames from the tennis sequence. Top: The proposed method correctly identifies interactions between opposite team members. Center: Reweighted Group Lasso misses the interaction between the two rear-most individuals of opposite teams, generating self loops instead (denoted by the disks). Bottom: Group Lasso yields an almost complete graph. Figure 7. Exogenous signal corresponding to the vertical axis for the tennis sequence. The change in one component corresponds to the instant when the leftmost player in the bottom team moves from the line towards the net, remaining closer to it from then on. 5.3. Basketball Game Experiment This experiment considers the interactions amongst players in a basketball game. As in the case ofthe tennis players, since the data comes from a real life scenario, the ground truth is not available. However, contrary to the tennis game, this scenario involves complex interactions amongst many players, and causality is hard to discern by inspection. Nevertheless, the results shown in Fig. 8, obtained using the position of the centroids as inputs to our algorithm, match our intuition. Firstly, one would expect a strong cause/effect connection between the actions of the player with the ball and the two defending opponents facing him. These connections (denoted by the yellow arrows) were indeed successfully identified by the algorithm. The next set of causal correlations is represented by the (blue, light green) and (black, white) arrow pairs showing the defending and the opponent players on the far side of the field and under the hoop. An important, counterintuitive, connection identified by the algorithm is represented by the magenta arrows be- tween the right winger of the white team with two of his teammates: the one holding the ball and the one running behind all players. While at first sight this connection is not as obvious as the others, it becomes apparent towards the end of the sequence, when the right winger player is signaling with a raised arm. Notably, our algorithm was able to unveil this signaling without the need to perform a semantic analysis (a very difficult task here, since this signaling is apparent only in the last few frames). Rather, it used the fact that the causal correlation was encapsulated in the dynamics of the relative motions of these players. 6. Conclusions In this paper we propose a new method for detecting causal interactions between agents using video data. The main idea is to recast this problem into a blind directed graph topology identification, where each node corresponds to the observed motion of a given target, each link indicates the presence of a causal correlation and the unknown inputs account for changes in the interaction patterns. In turn, this problem can be reduced to that of finding block-sparse solutions to a set of linear equations, which can be efficiently accomplished using an iterative re-weighted Group-Lasso approach. The ability of the algorithm to correctly identify causal correlations, even in cases where portions of the data record are missing or corrupted by outliers, and the key role played by the unknown exogenous input were illustrated with several examples involving non–trivial inter- actions amongst several human subjects. Remarkably, the proposed algorithm was able to identify both the correct interactions and the time instants when interactions amongst agents changed, based on minimal motion information: in all cases we used just a single time trace per person. This success indicates that in many scenarios, the dynamic information contained in the motion pattern of a single feature associated with a target is rich enough to enable identifying complex interaction patterns, without the need to track multiple features, perform a semantic analysis or use additional domain knowledge. 33557814 Figure 8. Sample frames from a Basketball game. Top: proposed method. Center: Reweighted Group the signaling player and his teammates. Bottom: Group Lasso yields an almost complete graph. Lasso misses the interaction between References [1] A. Arnold, Y. Liu, and N. Abe. Estimating brain functional connectivity with sparse multivariate autoregression. In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 66–75, 2007. 4 [2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps. Dynamic subspace-based coordinated multicamera tracking. In 2011 IEEE ICCV, pages 2462–2469, 2011. 3 [3] M. Ayazoglu, M. Sznaier, and O. Camps. Fast algorithms for structured robust principal component analysis. In 2012 IEEE CVPR, pages 1704–171 1, June 2012. 2, 5 [4] A. Bolstad, B. Van Veen, and R. Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011. 4 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Dis- [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] tributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011. 2 E. Candes, X. Li, Y. Ma, and J.Wright. Robust principal component analysis? J. ACM, (3), 2011. 5 E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, December 2008. 4 V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. Siam J. Optim., (2):572–596, 2011. 5 C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, pages 424–438l, 1969. 1 A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In 2009 IEEE CVPR, pages 2012–2019, 2009. 1 S. Haufe, G. Nolte, K. R. Muller, and N. Kramer. Sparse causal discovery in multivariate time series. In Neural Information Processing Systems, 2009. 4, 5 G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 1663–670, 2010. 5 D. Materassi, G. Innocenti, and L. Giarre. Reduced complexity models in identification of dynamical networks: Links with sparsification problems. In 48th IEEE Conference on Decision and Control, pages 4796–4801, 2009. 4 K. Prabhakar, S. Oh, P. Wang, G. Abowd, and J. Rehg. Temporal causality for the analysis ofvisual events. In IEEE Conf Comp. Vision and Pattern Recog. (CVPR)., pages 1967– 1974, 2010. 1 M. S. Ryoo and J. K. Aggarwal. UT Interaction Dataset, ICPR contest on Semantic Description of Human Activities. http://cvrc.ece.utexas.edu/SDHA2010/Human Interaction.html, 2010. 5 [16] J. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3): 1030–1051, 2006. 4 [17] S. Yi and V. Pavlovic. Sparse granger causality graphs for human action classification. In 2012 ICPR, pages 3374–3377. 1 [18] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006. 4 33557825
6 0.12669447 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
7 0.1163691 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
8 0.11176997 434 iccv-2013-Unifying Nuclear Norm and Bilinear Factorization Approaches for Low-Rank Matrix Decomposition
9 0.1094723 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
10 0.1067242 297 iccv-2013-Online Motion Segmentation Using Dynamic Label Propagation
11 0.1065859 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
12 0.10107736 310 iccv-2013-Partial Sum Minimization of Singular Values in RPCA for Low-Level Vision
13 0.088890634 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
14 0.085373811 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
15 0.085039489 263 iccv-2013-Measuring Flow Complexity in Videos
16 0.084737934 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
17 0.083481543 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
18 0.078978226 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
19 0.078238547 87 iccv-2013-Conservation Tracking
20 0.076054901 39 iccv-2013-Action Recognition with Improved Trajectories
topicId topicWeight
[(0, 0.17), (1, -0.026), (2, -0.006), (3, 0.059), (4, -0.022), (5, 0.018), (6, -0.015), (7, 0.122), (8, 0.089), (9, 0.094), (10, -0.036), (11, -0.077), (12, 0.005), (13, 0.047), (14, 0.034), (15, 0.022), (16, 0.007), (17, 0.043), (18, -0.028), (19, 0.017), (20, -0.102), (21, -0.059), (22, 0.037), (23, -0.01), (24, 0.057), (25, 0.078), (26, 0.084), (27, -0.145), (28, -0.026), (29, -0.017), (30, 0.086), (31, -0.03), (32, 0.067), (33, 0.084), (34, 0.093), (35, 0.05), (36, 0.056), (37, 0.081), (38, 0.036), (39, -0.02), (40, 0.041), (41, -0.055), (42, -0.064), (43, 0.053), (44, -0.202), (45, 0.08), (46, 0.043), (47, -0.066), (48, -0.035), (49, 0.103)]
simIndex simValue paperId paperTitle
same-paper 1 0.92240143 418 iccv-2013-The Way They Move: Tracking Multiple Targets with Similar Appearance
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
2 0.67210275 393 iccv-2013-Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos
Author: Baoyuan Wu, Siwei Lyu, Bao-Gang Hu, Qiang Ji
Abstract: We describe a novel method that simultaneously clusters and associates short sequences of detected faces (termed as face tracklets) in videos. The rationale of our method is that face tracklet clustering and linking are related problems that can benefit from the solutions of each other. Our method is based on a hidden Markov random field model that represents the joint dependencies of cluster labels and tracklet linking associations . We provide an efficient algorithm based on constrained clustering and optimal matching for the simultaneous inference of cluster labels and tracklet associations. We demonstrate significant improvements on the state-of-the-art results in face tracking and clustering performances on several video datasets.
3 0.64341813 167 iccv-2013-Finding Causal Interactions in Video Sequences
Author: Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps
Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving non–trivial interactions amongst several human subjects. 1. Introduction and Motivation The problem of identifying causal interactions amongst targets in a video sequence has been the focus of considerable attention in the past few years. A large portion of the existing body of work in this field uses human annotated video to build a storyline that includes both recognizing the activities involved and the causal relationships between them (see for instance [10] and references therein). While these methods are powerful and work well when suitably annotated data is available, annotating video clips is expensive and parsing relevant actions requires domain knowledge which may not be readily available. Indeed, in many situations, unveiling potentially hidden causal relationships is a first step towards building such knowledge. In this paper we consider the problem of identifying causal interactions amongst targets, not necessarily human, ∗This work was supported by NSF grants IIS–0713003, IIS-1318145, and ECCS–0901433, AFOSR grant FA9559–12–1–0271, and the Alert DHS Center of Excellence under Award Number 2008-ST-061-ED0001 . from unannotated video sequences and without prior domain knowledge. Our approach exploits the concept of “Granger Causality” [9], that formalizes the intuitive idea that ifa time series {x(t)} is causally related to a second one {thya(tt)if}a, ttihmene knowledge }oifs tchaeu past vrealluateesd otfo a{yse}c1to should l{eya(dt t}o, a ebnett kern prediction o thf efu ptuasret vvaalulueess ooff {{yx}}tt+k. In [l1ea4d], Pora ab bheatktearr eprt.e aicl.t successfully vuasleude a frequency domain reformulation of this concept to uncover pairwise interactions in scenarios involving repeating events, such as social games. This technique was later extended in [17] to model causal correlations between human joints and applied to the problem of activity classification. However, since this approach is based upon estimating the crosscovariance density function between events, it cannot handle situations where these events are non repeating, are too rare to provide an accurate estimate, or where these estimates are biased by outliers or missing data. Further, estimating a pairwise measure of causal correlation requires a spectral factorization of the cross-covariance, followed by numerical integration and statistical thresholding, limiting the approach to moderately large problems. To circumvent these problems, in this paper we propose an alternative approach based upon recasting the problem into that of identifying the topology of a sparse (directed) graph, where each node corresponds to the time traces of relevant features of a target, and each link corresponds to a regressor. The situation is illustrated in Fig. 1 using as an example the problem of finding causal relations amongst 4 tennis players, leading to a graph with 4 nodes, and potentially 12 (directed) links. Note that in general, the problem of identifying causal relationships is ill posed (unless one wants to identify the set of all individuals that could possibly have causal connections), due to the existence of secondary interactions. To illustrate this point, consider a very simplistic scenario with three actors A, B, and C, where A copies (with some delay) the actions of B, which in turn mimics C, also with some delay. In this situation, the ac- tions of A can be explained in terms of either those of B delayed one time sample, or those of C delayed by two samples. Thus, an algorithm based upon a statistical analysis 33556758 would identify a causal connection between A and C, even though there is no direct link between them. Further, if the actions of C can be explained by some simple autoregressive model of the form: = C(t) ?aiC(t − i) then it follows that the acti?ons of A can be explained by the same model, e.g. = A(t) ?aiA(t − i) Hence, multiple graphs topologies, some of which include self-loops, can explain the same set of time-series. On the other hand, note that in this situation, the sparsest graph (in the sense of having the fewest links) is the one that correctly captures the causality relations: the most direct cause of A is B and that of B is C, with C potentially being explained by a self-loop. To capture this feature and regularize the problem, in the sequel we will seek to find the sparsest graph, in the sense of having the least number of interconnections, that explains the observed data, reflecting the fact that, when alternative models are possible, often the most parsimonious is the correct one. Our main result shows that the problem of identifying sparse graph structures from observed noisy data can be reduced to a convex optimization problem (via the use of Group Lasso type arguments) that can be efficiently solved. The advantages of the proposed methods are: • • • • Its ability to handle complex scenarios involving nonrepeating events, een cvoimropnlmeexn stcael changes, clvoillnegct nioonnsof targets that do not necessarily split into well defined groups, outliers and missing data. The ability to identify the sparsest interaction structure tThhaet explains th idee nobtifseyr tvheed s dpaartas e(stthu inst avoiding labeling as causal connections those indirect correlations mediated only by an intermediary), together with a sparse “indicator” function whose support set indicates time instants where the interactions between agents change. Since the approach is not based on semantic analysis, iSt can bt hee applied ctoh ti she n moto btiaosne dof o arbitrary targets, sniost, necessarily humans (indeed, it applies to arbitrary time series including for instance economic or genetic data). From a computational standpoint, the resulting optiFmriozmatio an c problems nhaalve s a specific fthoerm re asmuletinnagbl oep ttiobe solved by a class of iterative algorithms [5, 3], that require at each step only a combination of thresholding and least-squares approximations. These algorithms have been shown to substantially outperform conventional convex-optimization solvers both in terms of memory and computation time requirements. The remainder of the paper is organized as follows. In section 2 we provide a formal reformulation of the problem of finding causal relationships between agents as a sparse graph identification problem. In section 3, we show that this problem can be efficiently solved using a re-weighted Group Lasso approach. Moreover, as shown there, the resulting problem can be solved one node at a time using first order methods, which allows for handling situations involving a large number of agents. Finally, the effectiveness of the proposed method is illustrated in section 4 using both simple scenarios (for which ground truth is readily available) and video clips of sports, involving complex, nonrepeating interactions amongst many agents. Figure 1. Finding causal interactions as a graph identification problem. Top: sample frame from a doubles tennis sequence. Bottom: Representation of this sequence as a graph, where each node represents the time series associated with the position of each player and the links are vector regressive models. Causal interactions exist when one of the time series can be explained as a combination of past values of the others. 2. Preliminaries For ease of reference, in this section we summarize the notation used in the paper and give a formal definition of the problem under consideration. 2.1. Notation (M) ?M? ??MM??F ?M?1 ?M?o σi ∗ ◦ ith largest singular value of the matrix M. nuclear norm: ?M? ?i σ?i (M). Fnruocbleeanrio nours norm: ??M?2F? ?i,j Mi2j ?1 norm: ?M? 1 ?i,j |Mij? ?|. ?o quasi-norm: ?M?o number of non-zero ?eleme?nMts i?n M. Hadamard product of matrices: (A ◦ ∗ =.: =. =. =. B)i,j = Ai,jBi,j. 33556769 2.2. Statement of the Problem Next, we formalize the problem under consideration. Consider a scenario with P moving agents, and denote by the 3D homogenous coordinates of the pth individual at time t. Motivated by the idea of Granger Causality, we will say that the actions of this agent depend causally from those in a set Ip (which can possibly contain p itself), if can be written as: Q˜p(t) Q˜p(t) Q˜p(t) ?N = ? ?ajp(n)Q˜j(t − n) +˜ η p(t) +˜ u p(t) (1) j? ?∈Ip ?n=0 Here ajp are unknown coefficients, and ˜η p(t) and up(t) represent measurement noise and a piecewise constant signal that is intended to account for relatively rare events that cannot be explained by the (past) actions of other agents. Examples include interactions of an agent with the environment, for instance to avoid obstacles, or changes in the interactions between agents. Since these events are infrequent, we will model as a signal that has (component-wise) a sparse derivative. Note in passing that since (1) involves homogeneous coordinates, the coefficients aj,p(.) satisfy the following constraint1 u ?N ? ?ajp(n) j? ?∈Ip ?n=0 =1 (2) Our goal is to identify causal relationships using as data 2D measurements qp(t) in F frames of the affine projections of the 3D coordinates Q˜p(t) of the targets. Note that, under the affine camera assumption, the 2D coordinates are related exactly by the same regressor parameters [2]. Thus, (1) holds if and only if: ?N qp(t) = ? ?ajp(n)qj(t − n) + u˜ p(t) + ηp(t) (3) j?∈Ip ?n=0 In this context, the problem can be precisely stated as: Given qp(t) (in F number of frames) and some a-priori bound N on the order of the regressors (that is the “memory” of the interactions), find the sparsest set of equations of the form (3) that explains the data, that is: aj,pm,ηinp,up?nIp (4) subject to? ?(2) and: = ? ?ajp(n)qj(t − n) + ?N qp(t) j? ?∈Ip ?n=0 up(t) + ηp(t) , p = 1 . . . , P and t = 1, ..F 1This follows by considering the third coordinate in (1) (5) where nIp denotes the cardinality of the set Ip. Rewriting (5) in matrixd efnoormtes yields: [xp; yp] = [Bp, I][apTuxTpuyTp]T + ηp (6) where qp(t) up(t) ηp(t) xp yp ap aip uxp uyp Bp Xp = [xp(t)Typ(t)T]T = [uTxp(t)uyTp(t)]T = [ηxp(t)Tηyp(t)T]T = = [xp(F)xp(F − 1)...xp(1)]T = [yp(F)yp(F − 1)...yp(1)]T [aT1p, a2Tp, ..., aTPp]T = [aip(0), aip(1), ..., aip(N)]T = [uxp(F)uxp(F−1)...uxp(1)]T = [uyp(F)uyp(F−1)...uyp(1)]T = = [Xp; Yp] [hankel(x1 , N) , ..., hankel(xP, N)] Yp = [hankel(y1, N), ..., hankel(yP, N)] and where, for a sequence z(t), hankel(z, N) denotes its associated Hankel matrix: hankel(z, N) = Itfolw⎛⎜⎝ sz t(hNzFa(t. +−a)d1 2e)scrzip(tF io(N. n− )o231f)al· t h· einzt(Frac−zti(.o1N.n)s−a)m12o)⎟ ⎞⎠ ngst uηaq= ? ηuqa1 T ,ηqau2 T ,ηaqu3 T ,· ·, ηauqP T ? T (8) Thus,inthBisc=on⎢⎣⎡teBx0t.1,theB0p.r2ob·le.·m·ofB0 i.nPte⎦⎥r ⎤estcanbeforagents (that is the complete graph structure) is captured by a matrix equation of the form: q = [B, I][aTuT]T + η (7) where and malized as finding the block–sparsest solution to the set of linear equations (2) and (7). 33557770 The problem of identifying a graph structure subject to sparsity constraints, has been the subject of intense research in the past few years. For instance, [1] proposed a Lasso type algorithm to identify a sparse network where each link corresponds to a VAR process. The main idea underlying this method is to exploit the fact that penalizing the ?1 norm of the vector of regression coefficients tends to produce sparse solutions. However, enforcing sparsity of the entire vector of regressor coefficients does not necessarily result in a sparse graph structure, since the resulting solution can consist of many links, each with a few coefficients. This difficulty can be circumvented by resorting to group Lasso type approaches [18], which seek to enforce block sparsity by using a combination of ?1 and ?2 norm constraints on the coefficients of the regressor. While this approach was shown to work well with artificial data in [11], exact recovery of the underlying network can be only guaranteed when the data satisfies suitable “incoherence” type conditions [4]. Finally, a different approach was pursued in [13], based on the use of a modified Orthogonal Least Squares algorithm, Cyclic Orthogonal Least Squares. However, this approach requires enforcing an a-priori limit on the number of links allowed to point to a single node, and such information may not be readily available, specially in cases where this number has high variability amongst nodes. To address these difficulties, in the next section we develop a convex optimization based approach to the problem of identifying sparse graph structures from observed noisy data. This method is closest in spirit to that in [11], in the sense that it is also based on a group Lasso type argument. The main differences consist in the ability to handle the unknown inputs up(t), needed to model exogenous disturbances affecting the agents, and in a reformulation of the problem, that allows for using a re-weighted iterative type algorithm, leading to substantially sparser solutions, even when the conditions in [4] fail. 3. Causality Identification Algorithm In this section we present the main result of this paper, an algorithm to search for block-sparse solutions to (7). For each fixed p, the algorithm searches for sparse solutions to (6) by solving (iteratively) the following problem (suggested by the re-weighted heuristic proposed in [7]) ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 subject to: ?ηp ? ≤ p = 1, . . , P. ∞ ?P ?, ?N ??aip(n) i?= ?1 ?n=0 ?. = 1, p = 1,...,P. (9) where [Δuxp ; Δuyp] represents the first order differences of the exogenous input vector [uxp ; uyp], Wa and Wu are weighting matrices, and λ is a Lagrange multiplier that plays the role of a tuning parameter between graph sparsity and event sensitivity. Intuitively, for a fixed set of weights w, the algorithm attempts to find a block sparse solution to (6) and a set of sparse inp?uts Δuxp ; Δuyp , by exploiting the facts that minimizing ?i ?aip ?2 (the ?2,1 norm of the vector sequence {aip}) te?nds? tao m?aximize block-sparsity [18], while minimizing et?nhed s? 1t norm mmaizxeim blizoceks sparsity [ [1168]]. wOhniclee t mheisnesolutions are found, the weights w are adjusted to penalize those elements of the sequences with small values, so that in the next iteration solutions that set these elements to zero (hence further increasing sparsity) are favored. Note however, that proceeding in this way, requires solving at each iteration a problem with n = P(Pnr + F) variables, where P and F denote the number of agents and frames, respectively, and where nr is a bound on the regressor order. On the other hand, it is easily seen that both the objective function and the constraints in (9) can be partitioned into P groups, with the pth group involving only the variables related to the pth node. It follows then that problem (9) can be solved by solving P smaller problems of the form: ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 ?P subject to: ?ηp?∞ ?N ≤ ? and ??aip(n) i?= ?1 ?n=0 leading to the algorithm given below: =1 (10) Algorithm 1: REWEIGHTEDCAUSALITYALGORITHM for each p wa = [1, 1, ..., 1] = [1, 1, ..., 1] S > 1(self loop weight) s = [1, 1, ..., S, ..., 1] (p’th element is S) while not converged do 1. solve (9) 2. wja = 1/( ?aip ?2 + δ) 3. wja = wja ◦ s (Penalization self loops) 4. = 1./(abs([Δuxp ; Δuyp]) + δ) end while 5. At this point ajp(.) , Ip and up(t) have been identified end for wu wu It is worth emphasizing that, since the computational complexity of standard interior point methods grows as n3, solving these smaller P problems leads to roughly a O(P2) 33557781 reduction in computational time over solving a single, larger optimization. Thus, this approach can handle moderately large problems using standard, interior-point based, semidefinite optimization solvers. Larger problems can be accommodated by noting that the special form of the objective and constraints allow for using iterative Augmented La- grangian Type Methods (ALM), based upon computing, at each step, the closed form solution to suitable intermediate optimization problems. While a complete derivation of such an algorithm is beyond the scope of this paper, using results from [12] it can be shown that each step requires only a combination of thresholding and least-squares approximations. Moreover, it can be shown that such an algorithm converges Q-superlinearly. 4. Handling Outliers and Missing Data The algorithm outlined above assumes an ideal situation where the data matrix B is perfectly known. However, in practice many of its elements may be outliers (due to misidentified correspondences) or missing (due to occlusion). As we briefly show next, these situations can be efficiently handled by performing a structured robust PCA step [3] to obtain a “clean” data matrix, prior to applying Algorithm 1. From equation (6) it follows that, in the absence of exogenous inputs and noise: ?xy11.. . .yxPP? = ?XY11.. . .YXPP? ?a1...aP? (11) Since xi ∈ {col(Xj)} and yi ∈ {col(Yj }), it follows that the sets {∈co {l(cXoli(X)} a)n}d a n{dco yl(Y∈i) { }c? are self-ex?pressive, or, ?equivalently?, Xthe }ma atnridce {sc oXl( =.) }? aXre1 . . . fX-eNxp? eanssdiv eY, ?Y1 ...YN? are mraantkri cdeesfic Xient. ?Consider no?w the case =.r, w?here some ?elements xi, yi of X and Y are missing. From ?the self-expressive property ooff {Xco aln(Xd Yi)} a raen dm i{scsoinlg(Y. Fi)ro} mit tfhoello swelsf tehxaptr ethsessieve missing eyle omf {encotsl are given by: xi = argmin rank(X) , yi x = argmin rank(Y) (12) y Similarly, in the presence of outliers, X, Y can be decomposed irnlyto, itnhe t sum oesfe a lcoew o fra onkut mlieartsr,ix X (,thYe ccalenan b eda dtae)c oamnda sparse one (the outliers) by solving a problem of the form minrank?YXoo?+ λ????EEYX????os. t.: ?XYoo?+?EEYX?=?YX? From the reasoning? abov?e it follows that in the presence of noise and exogenous outputs, the clean data record can be recovered from the corrupted, partial measurements by solving the following optimization problem: s+muλibn3je? ? ? cYXtM ot ? Y?X:∗◦ +Ξ λYX1? ? ? FM XY◦ E XY? ?1+λ2? ?M YX◦ Δ U YX? ?1 ?YX?=?XYoo?+?EEXY?+?UUYX?+?ΞΞYX? (13) where we have used the standard convex relaxations of rank and cardinality2. Here Ξ and U denote noise and piecewise constant exogenous matrices, ΔU denotes the matrix obtained by taking the difference between consecutive elements in U, and MX (MY) is a “mask” matrix, with mi,j = 0 if the element (i, j) in X ( Y) is missing, mi,j = 1 otherw=i0s e, i tuhseed e etom aenvtoi (di, penalizing )e lisem miesnstisn gin, mE, Ξ, U corresponding to missing data. Problem (13) is a structured robust PCA problem (due to the Hankel structure of X, Y) trhobatu can C bAe efficiently suoelv teod t using tkheel fsitrrsut oturrdeer o mf Xeth,oYd) proposed in [3], slightly modified to handle the terms containing ΔU. 5. Experimental Results In this section we illustrate the effectiveness of the proposed approach using several video clips (provided as supplemental material). The results of the experiments are displayed using graphs embedded on the video frames: An arrow indicates causal correlation between agents, with the point of the arrow indicating the agent whose actions are affected by the agent at its tail. The internal parameters of the algorithm were experimentally tuned, leading to the values ? = 0.1, = 0.05, self loop weights S = 10. The algorithm is fairly insensitive to the value of the regularization parameters and S, which could be adjusted up or down by an order of magnitude without affecting the structure of the resulting graph. Finally, we used regressor order N=2 for the first three examples and N=4 for the last one, a choice that is consistent with the frame rate and the complexity of λ λ the actions taking place in each clip. 5.1. Clips from the UT-Interaction Data Set We considered two video clips from the UT Human Interaction Data Set [15] (sequences 6 and 16). Figures 2 and 5 compare the results obtained applying the proposed algorithm versus Group Lasso (GL) [11] and Group Lasso combined with the reweighted heuristic described in (9) (GLRW). In all cases, the inputs to the algorithm were the (approximate) coordinates of the heads of each of the agents, normalized to the interval [−1, 1], artificially corrupted ,w niothrm m10al%iz eodut tloie trhs.e Notably, [t−he1 proposed algorithm 2As shown in [6, 8] under suitable conditions these relaxations the exact minimum rank solution. 33557792 recover Figure 2. Sample frames from the UT sequence 6 with the identified causal connections superimposed. Top: Proposed Method. Center: Reweighted Group Lasso. Bottom: Group Lasso. Only the proposed method identifies the correct connections. was able to correctly identify the correlations between the agents from this very limited amount of information, while the others failed to do so. Note in passing that in both cases none of the algorithms were directly applicable, due to some of the individuals leaving the field of view or being occluded. As illustrated in Fig. 3, the missing data was recovered by solving an RPCA problem prior to applying Algorithm 1. Finally, Fig. 4 sheds more insight on the key role played by the sparse signal u. As shown there, changes in u correspond exactly to time instants when the behavior of the corresponding agent deviates from the general pattern followed during most of the clip. Figure 3. Time traces of the individual heads in the UT sequence 6, artificially corrupted with 10 % outliers. The outliers were removed and the missing data due to targets leaving the field of view was estimated solving a modified RPCA problem. Frame number Figure 4. Sample (derivative sparse) exogenous signals in the UT sequence 6. The changes correspond to the instants when the second person starts moving towards the first, who remains stationary, and when the two persons merge in an embrace. Figure 5. Sample frames from the UT sequence 16. Top: Correct correlations identified by the Proposed Method. Center and Bottom: Reweighted Group Lasso and Group Lasso (circles indicate self-loops). 5.2. Doubles Tennis Experiment This experiment considers a non-staged real-life scenario. The data consists of 230 frames of a video clip from the Australian Open Tennis Doubles Final games. The goal here is to identify causal relationships between the different players using time traces of the respective centroid positions. Note that in this case the ground truth is not available. Nevertheless, since players from the same team usually look at their opponents and react to their motions, we expect a strong causality connection between members of 33557803 opposite teams. This intuition is matched by the correlations unveiled by the algorithm, shown in Fig. 6. The identified sparse input corresponding to the vertical direction is shown in Fig. 7 (similar results for the horizontal component are omitted due to space reasons.) Figure 6. Sample frames from the tennis sequence. Top: The proposed method correctly identifies interactions between opposite team members. Center: Reweighted Group Lasso misses the interaction between the two rear-most individuals of opposite teams, generating self loops instead (denoted by the disks). Bottom: Group Lasso yields an almost complete graph. Figure 7. Exogenous signal corresponding to the vertical axis for the tennis sequence. The change in one component corresponds to the instant when the leftmost player in the bottom team moves from the line towards the net, remaining closer to it from then on. 5.3. Basketball Game Experiment This experiment considers the interactions amongst players in a basketball game. As in the case ofthe tennis players, since the data comes from a real life scenario, the ground truth is not available. However, contrary to the tennis game, this scenario involves complex interactions amongst many players, and causality is hard to discern by inspection. Nevertheless, the results shown in Fig. 8, obtained using the position of the centroids as inputs to our algorithm, match our intuition. Firstly, one would expect a strong cause/effect connection between the actions of the player with the ball and the two defending opponents facing him. These connections (denoted by the yellow arrows) were indeed successfully identified by the algorithm. The next set of causal correlations is represented by the (blue, light green) and (black, white) arrow pairs showing the defending and the opponent players on the far side of the field and under the hoop. An important, counterintuitive, connection identified by the algorithm is represented by the magenta arrows be- tween the right winger of the white team with two of his teammates: the one holding the ball and the one running behind all players. While at first sight this connection is not as obvious as the others, it becomes apparent towards the end of the sequence, when the right winger player is signaling with a raised arm. Notably, our algorithm was able to unveil this signaling without the need to perform a semantic analysis (a very difficult task here, since this signaling is apparent only in the last few frames). Rather, it used the fact that the causal correlation was encapsulated in the dynamics of the relative motions of these players. 6. Conclusions In this paper we propose a new method for detecting causal interactions between agents using video data. The main idea is to recast this problem into a blind directed graph topology identification, where each node corresponds to the observed motion of a given target, each link indicates the presence of a causal correlation and the unknown inputs account for changes in the interaction patterns. In turn, this problem can be reduced to that of finding block-sparse solutions to a set of linear equations, which can be efficiently accomplished using an iterative re-weighted Group-Lasso approach. The ability of the algorithm to correctly identify causal correlations, even in cases where portions of the data record are missing or corrupted by outliers, and the key role played by the unknown exogenous input were illustrated with several examples involving non–trivial inter- actions amongst several human subjects. Remarkably, the proposed algorithm was able to identify both the correct interactions and the time instants when interactions amongst agents changed, based on minimal motion information: in all cases we used just a single time trace per person. This success indicates that in many scenarios, the dynamic information contained in the motion pattern of a single feature associated with a target is rich enough to enable identifying complex interaction patterns, without the need to track multiple features, perform a semantic analysis or use additional domain knowledge. 33557814 Figure 8. Sample frames from a Basketball game. Top: proposed method. Center: Reweighted Group the signaling player and his teammates. Bottom: Group Lasso yields an almost complete graph. Lasso misses the interaction between References [1] A. Arnold, Y. Liu, and N. Abe. Estimating brain functional connectivity with sparse multivariate autoregression. In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 66–75, 2007. 4 [2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps. Dynamic subspace-based coordinated multicamera tracking. In 2011 IEEE ICCV, pages 2462–2469, 2011. 3 [3] M. Ayazoglu, M. Sznaier, and O. Camps. Fast algorithms for structured robust principal component analysis. In 2012 IEEE CVPR, pages 1704–171 1, June 2012. 2, 5 [4] A. Bolstad, B. Van Veen, and R. Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011. 4 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Dis- [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] tributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011. 2 E. Candes, X. Li, Y. Ma, and J.Wright. Robust principal component analysis? J. ACM, (3), 2011. 5 E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, December 2008. 4 V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. Siam J. Optim., (2):572–596, 2011. 5 C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, pages 424–438l, 1969. 1 A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In 2009 IEEE CVPR, pages 2012–2019, 2009. 1 S. Haufe, G. Nolte, K. R. Muller, and N. Kramer. Sparse causal discovery in multivariate time series. In Neural Information Processing Systems, 2009. 4, 5 G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 1663–670, 2010. 5 D. Materassi, G. Innocenti, and L. Giarre. Reduced complexity models in identification of dynamical networks: Links with sparsification problems. In 48th IEEE Conference on Decision and Control, pages 4796–4801, 2009. 4 K. Prabhakar, S. Oh, P. Wang, G. Abowd, and J. Rehg. Temporal causality for the analysis ofvisual events. In IEEE Conf Comp. Vision and Pattern Recog. (CVPR)., pages 1967– 1974, 2010. 1 M. S. Ryoo and J. K. Aggarwal. UT Interaction Dataset, ICPR contest on Semantic Description of Human Activities. http://cvrc.ece.utexas.edu/SDHA2010/Human Interaction.html, 2010. 5 [16] J. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3): 1030–1051, 2006. 4 [17] S. Yi and V. Pavlovic. Sparse granger causality graphs for human action classification. In 2012 ICPR, pages 3374–3377. 1 [18] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006. 4 33557825
4 0.59543425 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking
Author: Aleksandr V. Segal, Ian Reid
Abstract: We propose a novel parametrization of the data association problem for multi-target tracking. In our formulation, the number of targets is implicitly inferred together with the data association, effectively solving data association and model selection as a single inference problem. The novel formulation allows us to interpret data association and tracking as a single Switching Linear Dynamical System (SLDS). We compute an approximate posterior solution to this problem using a dynamic programming/message passing technique. This inference-based approach allows us to incorporate richer probabilistic models into the tracking system. In particular, we incorporate inference over inliers/outliers and track termination times into the system. We evaluate our approach on publicly available datasets and demonstrate results competitive with, and in some cases exceeding the state of the art.
5 0.59023863 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
Author: Ernesto Brau, Jinyan Guan, Kyle Simek, Luca Del Pero, Colin Reimer Dawson, Kobus Barnard
Abstract: Jinyan Guan† j guan1 @ emai l ari z ona . edu . Kyle Simek† ks imek@ emai l ari z ona . edu . Colin Reimer Dawson‡ cdaws on@ emai l ari z ona . edu . ‡School of Information University of Arizona Kobus Barnard‡ kobus @ s i sta . ari z ona . edu ∗School of Informatics University of Edinburgh for tracking an unknown and changing number of people in a scene using video taken from a single, fixed viewpoint. We develop a Bayesian modeling approach for tracking people in 3D from monocular video with unknown cameras. Modeling in 3D provides natural explanations for occlusions and smoothness discontinuities that result from projection, and allows priors on velocity and smoothness to be grounded in physical quantities: meters and seconds vs. pixels and frames. We pose the problem in the context of data association, in which observations are assigned to tracks. A correct application of Bayesian inference to multitarget tracking must address the fact that the model’s dimension changes as tracks are added or removed, and thus, posterior densities of different hypotheses are not comparable. We address this by marginalizing out the trajectory parameters so the resulting posterior over data associations has constant dimension. This is made tractable by using (a) Gaussian process priors for smooth trajectories and (b) approximately Gaussian likelihood functions. Our approach provides a principled method for incorporating multiple sources of evidence; we present results using both optical flow and object detector outputs. Results are comparable to recent work on 3D tracking and, unlike others, our method requires no pre-calibrated cameras.
6 0.55843771 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
7 0.53699511 120 iccv-2013-Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
8 0.53694183 60 iccv-2013-Bayesian Robust Matrix Factorization for Image and Video Processing
9 0.52646393 200 iccv-2013-Higher Order Matching for Consistent Multiple Target Tracking
10 0.5119468 434 iccv-2013-Unifying Nuclear Norm and Bilinear Factorization Approaches for Low-Rank Matrix Decomposition
11 0.49719307 310 iccv-2013-Partial Sum Minimization of Singular Values in RPCA for Low-Level Vision
12 0.48498324 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
13 0.47939408 87 iccv-2013-Conservation Tracking
14 0.47843403 357 iccv-2013-Robust Matrix Factorization with Unknown Noise
15 0.4301489 289 iccv-2013-Network Principles for SfM: Disambiguating Repeated Structures with Local Context
16 0.42571381 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation
17 0.42221168 68 iccv-2013-Camera Alignment Using Trajectory Intersections in Unsynchronized Videos
18 0.42114249 292 iccv-2013-Non-convex P-Norm Projection for Robust Sparsity
19 0.41862002 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
20 0.40977314 55 iccv-2013-Automatic Kronecker Product Model Based Detection of Repeated Patterns in 2D Urban Images
topicId topicWeight
[(2, 0.051), (7, 0.018), (19, 0.281), (26, 0.057), (27, 0.01), (31, 0.048), (34, 0.016), (42, 0.13), (48, 0.012), (64, 0.063), (73, 0.041), (89, 0.174), (95, 0.012), (98, 0.014)]
simIndex simValue paperId paperTitle
1 0.80697352 243 iccv-2013-Learning Slow Features for Behaviour Analysis
Author: Lazaros Zafeiriou, Mihalis A. Nicolaou, Stefanos Zafeiriou, Symeon Nikitidis, Maja Pantic
Abstract: A recently introduced latent feature learning technique for time varying dynamic phenomena analysis is the socalled Slow Feature Analysis (SFA). SFA is a deterministic component analysis technique for multi-dimensional sequences that by minimizing the variance of the first order time derivative approximation of the input signal finds uncorrelated projections that extract slowly-varying features ordered by their temporal consistency and constancy. In this paper, we propose a number of extensions in both the deterministic and the probabilistic SFA optimization frameworks. In particular, we derive a novel deterministic SFA algorithm that is able to identify linear projections that extract the common slowest varying features of two or more sequences. In addition, we propose an Expectation Maximization (EM) algorithm to perform inference in a probabilistic formulation of SFA and similarly extend it in order to handle two and more time varying data sequences. Moreover, we demonstrate that the probabilistic SFA (EMSFA) algorithm that discovers the common slowest varying latent space of multiple sequences can be combined with dynamic time warping techniques for robust sequence timealignment. The proposed SFA algorithms were applied for facial behavior analysis demonstrating their usefulness and appropriateness for this task.
same-paper 2 0.76434577 418 iccv-2013-The Way They Move: Tracking Multiple Targets with Similar Appearance
Author: Caglayan Dicle, Octavia I. Camps, Mario Sznaier
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, on scenarios where appearance cues are not available or poor.
3 0.72571123 371 iccv-2013-Saliency Detection via Absorbing Markov Chain
Author: Bowen Jiang, Lihe Zhang, Huchuan Lu, Chuan Yang, Ming-Hsuan Yang
Abstract: In this paper, we formulate saliency detection via absorbing Markov chain on an image graph model. We jointly consider the appearance divergence and spatial distribution of salient objects and the background. The virtual boundary nodes are chosen as the absorbing nodes in a Markov chain and the absorbed time from each transient node to boundary absorbing nodes is computed. The absorbed time of transient node measures its global similarity with all absorbing nodes, and thus salient objects can be consistently separated from the background when the absorbed time is used as a metric. Since the time from transient node to absorbing nodes relies on the weights on the path and their spatial distance, the background region on the center of image may be salient. We further exploit the equilibrium distribution in an ergodic Markov chain to reduce the absorbed time in the long-range smooth background regions. Extensive experiments on four benchmark datasets demonstrate robustness and efficiency of the proposed method against the state-of-the-art methods.
4 0.7255488 383 iccv-2013-Semi-supervised Learning for Large Scale Image Cosegmentation
Author: Zhengxiang Wang, Rujie Liu
Abstract: This paper introduces to use semi-supervised learning for large scale image cosegmentation. Different from traditional unsupervised cosegmentation that does not use any segmentation groundtruth, semi-supervised cosegmentation exploits the similarity from both the very limited training image foregrounds, as well as the common object shared between the large number of unsegmented images. This would be a much practical way to effectively cosegment a large number of related images simultaneously, where previous unsupervised cosegmentation work poorly due to the large variances in appearance between different images and the lack ofsegmentation groundtruthfor guidance in cosegmentation. For semi-supervised cosegmentation in large scale, we propose an effective method by minimizing an energy function, which consists of the inter-image distance, the intraimage distance and the balance term. We also propose an iterative updating algorithm to efficiently solve this energy function, which decomposes the original energy minimization problem into sub-problems, and updates each image alternatively to reduce the number of variables in each subproblem for computation efficiency. Experiment results on iCoseg and Pascal VOC datasets show that the proposed cosegmentation method can effectively cosegment hundreds of images in less than one minute. And our semi-supervised cosegmentation is able to outperform both unsupervised cosegmentation as well asfully supervised single image segmentation, especially when the training data is limited.
5 0.72159809 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
Author: Zhuwen Li, Jiaming Guo, Loong-Fah Cheong, Steven Zhiying Zhou
Abstract: This paper addresses real-world challenges in the motion segmentation problem, including perspective effects, missing data, and unknown number of motions. It first formulates the 3-D motion segmentation from two perspective views as a subspace clustering problem, utilizing the epipolar constraint of an image pair. It then combines the point correspondence information across multiple image frames via a collaborative clustering step, in which tight integration is achieved via a mixed norm optimization scheme. For model selection, wepropose an over-segment and merge approach, where the merging step is based on the property of the ?1-norm ofthe mutual sparse representation oftwo oversegmented groups. The resulting algorithm can deal with incomplete trajectories and perspective effects substantially better than state-of-the-art two-frame and multi-frame methods. Experiments on a 62-clip dataset show the significant superiority of the proposed idea in both segmentation accuracy and model selection.
6 0.68430787 190 iccv-2013-Handling Occlusions with Franken-Classifiers
7 0.6522631 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
8 0.64985347 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
9 0.64948362 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification
10 0.64829654 182 iccv-2013-GOSUS: Grassmannian Online Subspace Updates with Structured-Sparsity
11 0.64810449 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
12 0.64785713 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
13 0.64715791 338 iccv-2013-Randomized Ensemble Tracking
14 0.64711165 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
15 0.64685035 97 iccv-2013-Coupling Alignments with Recognition for Still-to-Video Face Recognition
16 0.64670432 277 iccv-2013-Multi-channel Correlation Filters
17 0.64668745 349 iccv-2013-Regionlets for Generic Object Detection
18 0.64633119 157 iccv-2013-Fast Face Detector Training Using Tailored Views
19 0.64583015 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
20 0.64582014 379 iccv-2013-Semantic Segmentation without Annotating Segments