cvpr cvpr2013 cvpr2013-121 knowledge-graph by maker-knowledge-mining

121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking


Source: pdf

Author: Anton Milan, Konrad Schindler, Stefan Roth

Abstract: When tracking multiple targets in crowded scenarios, modeling mutual exclusion between distinct targets becomes important at two levels: (1) in data association, each target observation should support at most one trajectory and each trajectory should be assigned at most one observation per frame; (2) in trajectory estimation, two trajectories should remain spatially separated at all times to avoid collisions. Yet, existing trackers often sidestep these important constraints. We address this using a mixed discrete-continuous conditional random field (CRF) that explicitly models both types of constraints: Exclusion between conflicting observations with supermodular pairwise terms, and exclusion between trajectories by generalizing global label costs to suppress the co-occurrence of incompatible labels (trajectories). We develop an expansion move-based MAP estimation scheme that handles both non-submodular constraints and pairwise global label costs. Furthermore, we perform a statistical analysis of ground-truth trajectories to derive appropriate CRF potentials for modeling data fidelity, target dynamics, and inter-target occlusion.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 We develop an expansion move-based MAP estimation scheme that handles both non-submodular constraints and pairwise global label costs. [sent-4, score-0.355]

2 Furthermore, we perform a statistical analysis of ground-truth trajectories to derive appropriate CRF potentials for modeling data fidelity, target dynamics, and inter-target occlusion. [sent-5, score-0.442]

3 Introduction The task of visual multi-target tracking is to recover the spatio-temporal trajectories of a (usually unknown) number of targets from a video sequence. [sent-7, score-0.578]

4 This is not entirely surprising, since the solution space grows rapidly as the number of visible targets and the length of their trajectories increases. [sent-12, score-0.458]

5 Moreover, physical limits mandate a growing number of constraints (such as mutual exclusion) as more targets are in close proximity to each other. [sent-13, score-0.319]

6 Typical failure cases (top) are addressed with the proposed discrete-continuous CRF (bottom): Detections are forced to take on different labels (a) and physically overlapping trajectories are suppressed even if they do not share detections (b). [sent-16, score-0.478]

7 Batch-type multi-target trackers typically formulate a joint energy function for all targets in all frames. [sent-25, score-0.3]

8 We can distinguish two categories of batch approaches: The vast majority focuses on purely discrete optimization for solving data association [23, 26, etc.]. [sent-26, score-0.208]

9 The disadvantage is that the trajectories need to be discretized themselves, hence the necessarily finite spatial resolution can limit tracking performance and lead to visible artifacts. [sent-31, score-0.43]

10 Moreover, exclusion is only handled either at the data-association level [23, 26], or at the trajectory level [5]. [sent-32, score-0.969]

11 While these allow estimating target trajectories in continuous space, they have other drawbacks. [sent-34, score-0.435]

12 [3] needed to employ an optimization scheme with ad-hoc jump moves; [4] used the label cost framework of [12], which is not designed for imposing exclusion constraints. [sent-35, score-0.793]

13 In this paper, we propose a mixed discrete-continuous conditional random field (CRF) for multi-target tracking that aims to combine the advantages of continuous-space trajectory estimation with the advantages of discrete methods for enforcing exclusion constraints. [sent-36, score-1.156]

14 We specifically address mutual exclusion both at the data-association and at the trajectory level (cf. [sent-37, score-1.024]

15 We thus go beyond previous discrete-continuous trackers [4] that do not perform explicit exclusion reasoning, and beyond previous discrete approaches that model exclusion only at the trajectory [5] or at the data-association level [26]. [sent-40, score-1.615]

16 To the best of our knowledge, our approach is the first to combine both unique data association of individual observations and physical collision-avoidance at the trajectory level in a common model. [sent-43, score-0.529]

17 In the presence of a single target, tracking can be performed by estimating the target location and motion in every frame and building the trajectory through interpolation [17]. [sent-46, score-0.552]

18 In the presence of multiple targets an additional challenge arises, often referred to as data association: each detection must be assigned a target identifier or discarded as a false alarm. [sent-47, score-0.284]

19 The mixed discrete-continuous model of [4] preserves the benefits of a continuous trajectory space, while allowing for discrete data association. [sent-55, score-0.503]

20 Trajectory level constraints are formulated as global label costs, and optimized with a graph cut-based discrete-continuous scheme [12]. [sent-56, score-0.231]

21 Mutual exclusion, i.e. a term that penalizes or entirely disallows solutions where two or more targets collide, is a crucial property of multi-target tracking. [sent-59, score-0.187]

22 Discrete approaches [21, 23, 26] usually represent the state space of target trajectories through the underlying detections. [sent-62, score-0.348]

23 While this allows a one-to-one mapping between each detection and each trajectory, situations where the data is missing are not captured properly, hence two trajectories may in fact intersect. [sent-63, score-0.271]

24 Some approaches [5] explicitly model mutual exclusion between target locations at the trajectory level by imposing linear constraints. [sent-66, score-1.101]

25 However, such a discrete grid is a somewhat crude approximation of the continuous trajectory space. [sent-67, score-0.455]

26 A continuous mutual exclusion term [3] leads to a non-convex objective that is optimized with ad-hoc jumps. [sent-68, score-0.759]

27 Aside from representing trajectories in continuous space, this allows us to impose mutual exclusion simultaneously at the data association and at the trajectory level using the discrete part of the model. [sent-70, score-1.562]

28 While these constraints go beyond the capabilities of the label cost optimization framework of [12], we show how it can be extended appropriately. [sent-71, score-0.212]

29 To make different tracking systems comparable, we use publicly available detector responses [3, 19, 24] throughout, which are generated by popular object detectors. [sent-74, score-0.203]

30 (ii) The resulting trajectories have to explain the observations in a physically plausible way, i.e. [sent-82, score-0.38]

31 all velocities must remain within physical limits and trajectories must not overlap, because two objects cannot occupy the same physical space at the same time. [sent-84, score-0.369]

32 In this work we address both challenges, at two different levels: The first one is handled by introducing pairwise terms between competing detections to avoid an unnatural interpretation of the data. [sent-86, score-0.244]

33 That constraint on its own is insufficient, however, since it could lead to phantom trajectories that mirror existing ones, but are not physically plausible. [sent-87, score-0.332]

34 The second challenge is thus approached at the trajectory level. [sent-88, score-0.316]

35 We introduce a novel pairwise co-occurrence label cost that is applied only if both labels are present in a solution. [sent-89, score-0.239]

36 Although these model components introduce pairwise terms that are supermodular, as well as global terms relating many variables, our proposed optimization scheme is able to efficiently minimize the CRF energy to a local minimum. [sent-90, score-0.248]

37 We begin by describing the proposed exclusion handling and later summarize the remaining model. [sent-92, score-0.581]

38 Detection-level exclusion We first describe how we integrate mutual exclusion at the detection level. [sent-95, score-1.253]

39 Assuming a target size s, it is impossible that two detections originating from the same frame and being at least a distance s apart are caused by the same object. [sent-100, score-0.168]
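
The following minimal sketch (not the authors' code) illustrates how such detection-level exclusion terms could be assembled: for every pair of detections from the same frame that are at least the assumed target size s apart, a large penalty is charged whenever both are assigned the same non-outlier trajectory label. The data layout, function name, and penalty constant are illustrative assumptions.

```python
import math

def detection_exclusion_terms(detections, s, penalty=1e3):
    """Pairwise exclusion terms between same-frame detections.

    detections: list of dicts with keys 'frame', 'x', 'y'.
    Returns (i, j, cost_fn) triples, where cost_fn(label_i, label_j) is a
    supermodular pairwise cost penalizing equal, non-outlier labels for
    detections that are too far apart to stem from one target.
    """
    terms = []
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            di, dj = detections[i], detections[j]
            if di['frame'] != dj['frame']:
                continue
            dist = math.hypot(di['x'] - dj['x'], di['y'] - dj['y'])
            if dist >= s:  # too far apart to be caused by the same object
                def cost(li, lj, _p=penalty):
                    # label 0 is reserved for the outlier / false-alarm class
                    return _p if (li == lj and li != 0) else 0.0
                terms.append((i, j, cost))
    return terms
```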

40 Factor graph of the underlying CRF with black circular nodes representing the random variables (assignment of detections) and square nodes representing the pairwise potentials. [sent-131, score-0.163]

41 In addition to simple temporal smoothing factors (red) in (a), we model pairwise exclusion between detections within the same time step (blue, subset shown) to prevent implausible data association (b). [sent-133, score-0.976]

42 Note that only considering exclusion at the detection level is not enough to prevent collisions between targets. [sent-134, score-0.662]

43 In fact, the optimization may otherwise be forced to pick two almost identical trajectories in order to satisfy these inter-object constraints. [sent-135, score-0.326]

44 Trajectory-level exclusion Let us now turn to the more challenging task of enforcing exclusion at the level of continuous trajectories. [sent-139, score-1.285]

45 It is obvious that multi-target tracking should take care to prevent situations where two or more targets occupy the same physical space at the same time. [sent-140, score-0.401]

46 It has been proposed to encode a collision penalty into a global label cost [4], such that the graph cut framework of [12] can be used for MAP estimation. [sent-142, score-0.298]

47 The penalty considered how much a trajectory overlapped with any other trajectory (whether active or not). [sent-143, score-0.714]

48 To ensure that only one of two overlapping trajectories is suppressed, the penalty was added only to one trajectory in each competing pair. [sent-144, score-0.698]

49 In particular, we describe the corresponding factor graph for a single α-expansion step, where 0 corresponds to no label change and 1 means a variable is switched to label α. [sent-151, score-0.346]

50 An illustration of the factor graph (without unary and pairwise terms) is depicted in Fig. [sent-152, score-0.177]

51 Similar to [12], one auxiliary node for each existing label is added and connected to each variable that carries, or may carry, the corresponding label (Aβ, Aγ, and Aα in Fig. [sent-155, score-0.301]

52 Random variables and their current labels are represented by solid circles, while auxiliary variables are outlined with dashed circles. [sent-158, score-0.16]

53 The corresponding potentials are depicted on the right with L◦ and L◦◦ being the respective label costs for a single label and a pair of labels. [sent-160, score-0.321]

54 Note that all factors that are unrelated to the label cost are omitted for clarity. [sent-161, score-0.198]

55 which act as indicator switches for each label: The auxiliary variable contributes the cost L◦ (black factors) of having a certain label only once if it is switched on, otherwise its associated cost is 0. [sent-162, score-0.439]

56 Having the same graph structure as before, it is possible to insert a connecting factor between each pair of auxiliary variables (red and cyan). [sent-167, score-0.173]

57 It is therefore reasonable to apply a suitable penalty L◦◦ = ζ if and only if both corresponding auxiliary variables are switched on, i.e. hX(Ti, Tj, f) = ζ if the labels of both trajectories Ti and Tj (i ≠ j) are active, and 0 otherwise (3). [sent-169, score-0.25]

58 Here Ti denotes a continuous trajectory of target i. [sent-173, score-0.48]
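
A hedged sketch of how the pairwise co-occurrence label cost could be evaluated for a candidate labeling: the penalty ζ is charged whenever two distinct trajectory labels are both active and their positions come closer than an assumed overlap radius. The per-frame overlap test and the data layout are illustrative assumptions, not the paper's exact definition.

```python
def cooccurrence_label_cost(labeling, trajectories, zeta, overlap_radius):
    """Sum the pairwise co-occurrence label costs over active trajectory pairs.

    labeling: dict detection_id -> trajectory_id (0 = outlier).
    trajectories: dict trajectory_id -> {frame: (x, y)}.
    """
    active = sorted({lab for lab in labeling.values() if lab != 0})
    total = 0.0
    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            Ti, Tj = trajectories[active[a]], trajectories[active[b]]
            for f in set(Ti) & set(Tj):      # frames where both trajectories exist
                xi, yi = Ti[f]
                xj, yj = Tj[f]
                if (xi - xj) ** 2 + (yi - yj) ** 2 < overlap_radius ** 2:
                    total += zeta            # both labels active and colliding
    return total
```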

59 Discrete-continuous multiple object tracking Given a set of detections D and an over-complete set of potential trajectories T, the goal is to find a data association for D. [sent-191, score-0.649]

60 At the same time, the geometric shape of all active trajectories T∗ ⊆ T should be fitted to the corresponding detections, such that the residuals between the observation and the tracker output are minimized. [sent-194, score-0.342]

61 To do so, we extend the discrete-continuous multitarget tracking approach of [4] with our exclusion handling. [sent-195, score-0.77]

62 A CRF energy E(f, T), defined over the discrete labeling f as well as the continuous trajectories T, is minimized to obtain a plausible solution. [sent-203, score-0.422]

63 To facilitate the optimization, we alternate between updating the discrete variables while keeping the continuous ones fixed, and updating the continuous variables with the discrete ones fixed. [sent-204, score-0.318]
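
A schematic sketch of this alternation; the discrete and continuous update routines are passed in as placeholders for the expansion-move inference and the trajectory fitting described in the text, so all names here are assumptions.

```python
def alternating_minimization(energy, discrete_step, continuous_step,
                             f_init, T_init, max_iters=20, tol=1e-6):
    """Alternate between the discrete labeling f and the continuous trajectories T."""
    f, T = f_init, T_init
    e_prev = energy(f, T)
    for _ in range(max_iters):
        f = discrete_step(f, T)        # update labels, trajectories held fixed
        T = continuous_step(f, T)      # refit trajectories, labels held fixed
        e = energy(f, T)
        if e_prev - e < tol:           # stop once the CRF energy stops decreasing
            break
        e_prev = e
    return f, T
```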

64 The CRF energy consists of the following components: The unaries φ measure how well the trajectories follow the detector evidence; they are described in more detail in the following section. [sent-221, score-0.315]

65 The first pairwise term ψS encourages temporally smooth data association with a standard generalized Potts model (cf. [sent-222, score-0.214]

66 The factors ψS are defined on pairs of detections in adjacent frames that are spatially close. [sent-225, score-0.166]
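
A sketch of how the temporal smoothing factors ψS might be instantiated, assuming a generalized Potts penalty λS is paid whenever two spatially close detections in adjacent frames receive different labels; the distance gate and names are illustrative assumptions.

```python
import math

def temporal_smoothness_terms(detections, max_dist, lam_s):
    """Potts factors between spatially close detections in adjacent frames."""
    terms = []
    for i, di in enumerate(detections):
        for j, dj in enumerate(detections):
            if dj['frame'] != di['frame'] + 1:
                continue
            if math.hypot(di['x'] - dj['x'], di['y'] - dj['y']) > max_dist:
                continue
            # generalized Potts: cost lam_s if the labels disagree, 0 otherwise
            terms.append((i, j, lambda li, lj, _l=lam_s: _l if li != lj else 0.0))
    return terms
```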

67 The second pairwise term ψX comprises the detection-level exclusion constraints from Eq. [sent-243, score-0.698]

68 The first higher-order term (label cost) hf(Ti, f) = hang(Ti) + hlin(Ti) + hocc(Ti) + hper(Ti) (8) models the plausibility of each trajectory in terms of its dynamics and persistence. [sent-245, score-0.316]

69 The second higher-order term hX is the pairwise co-occurrence label cost from Eq. [sent-247, score-0.239]

70 In many cases the potentials (or energy components) are handcrafted, guided by intuition or mathematical convenience. [sent-258, score-0.169]

71 Here, we systematically analyze the distribution of various trajectory properties based on eight video sequences (PETS [14] and TUD-Stadtmitte [2]) with ground truth annotations. [sent-260, score-0.352]

72 To construct more realistic energies, we analyze the empirical frequencies of the trajectory properties that we model in our CRF, see Fig. [sent-264, score-0.316]

73 We observe that the energy grows linearly with the distance, suggesting a linear penalty for the data term (corresponding to an exponential residual distribution). [sent-275, score-0.185]
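
A minimal illustrative sketch of such a linear data term: the penalty grows linearly with the distance between a detection and the trajectory it is assigned to, which corresponds to an exponential residual distribution. The per-detection weight and the trajectory interface are assumptions.

```python
import math

def data_term(detection, trajectory, weight=1.0):
    """Linear penalty on the detection-to-trajectory residual distance.

    detection: dict with 'frame', 'x', 'y'; trajectory: dict frame -> (x, y).
    """
    tx, ty = trajectory[detection['frame']]
    residual = math.hypot(detection['x'] - tx, detection['y'] - ty)
    return weight * residual  # energy grows linearly with the distance
```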

74 This is not a major limitation for realistic trajectories. [sent-286, score-0.271]

75 We therefore model the penalty for trajectories that are not supported by detections through multiple consecutive frames as a Cauchy-Lorentz distribution hocc(Ti), weighted by λocc. [sent-311, score-0.474]

76 Here, γj denotes the number of frames in which trajectory i has no detections close by. [sent-316, score-0.437]
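
A hedged sketch of the occlusion label cost: each detection-free gap contributes the negative log of a Cauchy-Lorentz profile, written here with a hypothetical scale parameter gamma0; the exact parametrization used in the paper may differ.

```python
import math

def occlusion_cost(gap_lengths, lam_occ, gamma0=5.0):
    """Cauchy-Lorentz style penalty for detection-free gaps of one trajectory.

    gap_lengths: list of gap lengths in frames (no supporting detections).
    """
    # negative log of a Lorentzian profile, up to an additive constant
    return lam_occ * sum(math.log(1.0 + (g / gamma0) ** 2) for g in gap_lengths)
```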

77 Assuming that the scene does not contain doors or other openings where objects might disappear, a trajectory will always start and terminate close to the border of the image (or the tracking area). [sent-318, score-0.475]

78 To prevent fragmented trajectories and allow a buffer entry zone τ, we impose a soft threshold hper(Ti) with weight λper. [sent-320, score-0.316]
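
A sketch of the persistence cost under the assumption that it softly thresholds the distance from each trajectory endpoint to the image (or tracking-area) border at the entry-zone width τ; the exact functional form is an assumption based on the surrounding text.

```python
def persistence_cost(start_pt, end_pt, image_size, lam_per, tau):
    """Penalize trajectories that start or end far from the image border."""
    w, h = image_size

    def border_dist(pt):
        x, y = pt
        return min(x, y, w - x, h - y)  # distance to the closest image border

    # soft threshold: no extra penalty beyond the entry buffer zone tau
    return lam_per * (min(tau, border_dist(start_pt)) +
                      min(tau, border_dist(end_pt)))
```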

79 The temporal length of trajectories varies significantly across sequences and does not exhibit a consistent behavior. [sent-324, score-0.307]

80 Moreover, the label space of each random variable is reduced to only those (few) trajectory hypotheses that lie within reasonable reach of a detection. [sent-333, score-0.447]

81 Like [4, 12] we perform MAP estimation by alternatingly minimizing the energy of the discrete and the continuous variables. [sent-336, score-0.242]

82 The emphasis of this work is on designing a physically and statistically plausible model, with the consequence that the resulting optimization problem becomes harder, even with such an alternation scheme. [sent-337, score-0.171]

83 Since the energy is non-submodular, we use TRW-S [18] for each binary expansion step. [sent-339, score-0.208]

84 It may seem unnatural to use message passing within α-expansion instead of an st-cut, since message passing algorithms are generally capable of performing inference in multi-label problems. [sent-344, score-0.184]

85 The factor graph for each expansion move, on the other hand, is much smaller. [sent-346, score-0.194]

86 The continuous part of the proposed energy function is not convex and cannot be minimized in closed form. [sent-347, score-0.19]

87 We minimize a simplified energy including only the unary terms φ and the continuous label costs hang and hlin. [sent-349, score-0.448]

88 The solution of this simplified energy minimization step is discarded if it does not decrease the full CRF energy from Eq. [sent-350, score-0.206]
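
A small sketch of this guarded update (all names are placeholders): the candidate trajectories obtained from the simplified energy are kept only if they lower the full CRF energy.

```python
def guarded_continuous_update(full_energy, f, T_current, T_candidate):
    """Accept the candidate trajectories only if the full energy decreases."""
    if full_energy(f, T_candidate) < full_energy(f, T_current):
        return T_candidate   # improvement: keep the refit trajectories
    return T_current         # otherwise discard the candidate solution
```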

89 The computational effort is similar to the discrete-continuous optimization without mutual exclusion [4]. [sent-353, score-0.7]

90 On one hand, tracking by detection critically depends on which object detector is used. [sent-370, score-0.203]

91 The Multiple Object Tracking Accuracy (MOTA) combines all false positives, false negatives, and identity switches into a single number, and Multiple Object Tracking Precision (MOTP) measures the average distance between the ground truth and the tracker output. [sent-389, score-0.211]
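
For reference, the standard CLEAR-MOT accuracy score combines these error counts as follows (a small helper for illustration, not taken from the paper):

```python
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy: 1 minus the normalized error count."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / float(num_gt_objects)
```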

92 Due to its extraordinary speed, we use the tracklets generated by baseline 1 as proposal trajectories for our optimization. [sent-399, score-0.358]

93 Baseline 2 is a recent method based on discrete-continuous optimization [4], but unlike our approach does not properly model inter-object exclusion and uses hand-defined energies that are not derived statistically. [sent-400, score-0.671]

94 We report the results of our full method (combined), as well as three intermediate results: only using the statistically motivated energies (statistics), adding only the detection-level exclusion factors (det. [sent-404, score-0.722]

95 Due to its extraordinary speed, we use the tracklets generated by DP as proposal trajectories for our optimization. [sent-457, score-0.358]

96 Since our CRF does not model these, we postprocess our tracker output with a simple extrapolation-based track linking scheme to explore the capabilities of our method when combined with such track linking. [sent-459, score-0.238]

97 We suggested an expansion move-based optimization scheme to handle the non-submodular energy with global co-occurrence label costs. [sent-464, score-0.411]

98 Our experiments show state-of-the-art results on public benchmarks, with clear improvements from the simultaneous exclusion constraints. [sent-465, score-0.581]

99 Globally-optimal greedy algorithms for tracking a variable number of objects. [sent-605, score-0.188]

100 Global data association for multi-object tracking using network flows. [sent-636, score-0.319]


similar papers computed by the tf-idf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('exclusion', 0.581), ('trajectory', 0.316), ('trajectories', 0.271), ('crf', 0.19), ('tracking', 0.159), ('targets', 0.148), ('association', 0.128), ('expansion', 0.105), ('energy', 0.103), ('label', 0.102), ('mutual', 0.091), ('detections', 0.091), ('continuous', 0.087), ('pairwise', 0.086), ('switches', 0.084), ('penalty', 0.082), ('ti', 0.081), ('target', 0.077), ('tracker', 0.071), ('auxiliary', 0.068), ('potentials', 0.066), ('supermodular', 0.065), ('ethms', 0.065), ('pti', 0.065), ('dit', 0.065), ('costs', 0.064), ('energies', 0.062), ('physically', 0.061), ('opengm', 0.06), ('hang', 0.06), ('switched', 0.054), ('discrete', 0.052), ('cost', 0.051), ('physical', 0.049), ('djt', 0.049), ('extraordinary', 0.049), ('hper', 0.049), ('trackers', 0.049), ('mixed', 0.048), ('plausible', 0.048), ('linking', 0.048), ('pets', 0.046), ('variables', 0.046), ('prevent', 0.045), ('factors', 0.045), ('detector', 0.044), ('hx', 0.044), ('track', 0.044), ('pjt', 0.043), ('hlin', 0.043), ('hocc', 0.043), ('cooccurrence', 0.042), ('id', 0.042), ('message', 0.041), ('angular', 0.04), ('entirely', 0.039), ('fd', 0.038), ('andres', 0.038), ('beier', 0.038), ('unnatural', 0.038), ('filmed', 0.038), ('pit', 0.038), ('tracklets', 0.038), ('batches', 0.036), ('andriyenko', 0.036), ('busy', 0.036), ('sequences', 0.036), ('level', 0.036), ('mot', 0.035), ('mota', 0.035), ('dp', 0.034), ('statistically', 0.034), ('velocity', 0.033), ('occlusion', 0.033), ('network', 0.032), ('collision', 0.032), ('unary', 0.032), ('passing', 0.032), ('graph', 0.031), ('tthe', 0.031), ('scheme', 0.031), ('assigned', 0.031), ('constraints', 0.031), ('move', 0.03), ('frames', 0.03), ('multitarget', 0.03), ('variable', 0.029), ('competing', 0.029), ('hf', 0.029), ('aside', 0.029), ('ml', 0.029), ('increasingly', 0.029), ('false', 0.028), ('optimization', 0.028), ('moves', 0.028), ('factor', 0.028), ('suppressed', 0.028), ('derive', 0.028), ('forced', 0.027), ('fidelity', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999857 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking

Author: Anton Milan, Konrad Schindler, Stefan Roth

Abstract: When tracking multiple targets in crowded scenarios, modeling mutual exclusion between distinct targets becomes important at two levels: (1) in data association, each target observation should support at most one trajectory and each trajectory should be assigned at most one observation per frame; (2) in trajectory estimation, two trajectories should remain spatially separated at all times to avoid collisions. Yet, existing trackers often sidestep these important constraints. We address this using a mixed discrete-continuous conditional random field (CRF) that explicitly models both types of constraints: Exclusion between conflicting observations with supermodular pairwise terms, and exclusion between trajectories by generalizing global label costs to suppress the co-occurrence of incompatible labels (trajectories). We develop an expansion move-based MAP estimation scheme that handles both non-submodular constraints and pairwise global label costs. Furthermore, we perform a statistical analysis of ground-truth trajectories to derive appropriate CRF potentials for modeling data fidelity, target dynamics, and inter-target occlusion.

2 0.30401477 199 cvpr-2013-Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization

Author: Shoou-I Yu, Yi Yang, Alexander Hauptmann

Abstract: A device just like Harry Potter’s Marauder’s Map, which pinpoints the location of each person-of-interest at all times, provides invaluable information for analysis of surveillance videos. To make this device real, a system would be required to perform robust person localization and tracking in real world surveillance scenarios, especially for complex indoor environments with many walls causing occlusion and long corridors with sparse surveillance camera coverage. We propose a tracking-by-detection approach with nonnegative discretization to tackle this problem. Given a set of person detection outputs, our framework takes advantage of all important cues such as color, person detection, face recognition and non-background information to perform tracking. Local learning approaches are used to uncover the manifold structure in the appearance space with spatio-temporal constraints. Nonnegative discretization is used to enforce the mutual exclusion constraint, which guarantees a person detection output to only belong to exactly one individual. Experiments show that our algorithm performs robust localization and tracking of persons-of-interest not only in outdoor scenes, but also in a complex indoor real-world nursing home environment.

3 0.26939565 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

Author: Guillem Palou, Philippe Salembier

Abstract: As an early stage of video processing, we introduce an iterative trajectory merging algorithm that produces a region-based and hierarchical representation of the video sequence, called the Trajectory Binary Partition Tree (BPT). From this representation, many analysis and graph cut techniques can be used to extract partitions or objects that are useful in the context of specific applications. In order to define trajectories and to create a precise merging algorithm, color and motion cues have to be used. Both types of information are very useful to characterize objects but present strong differences of behavior in the spatial and the temporal dimensions. On the one hand, scenes and objects are rich in their spatial color distributions, but these distributions are rather stable over time. Object motion, on the other hand, presents simple structures and low spatial variability but may change from frame to frame. The proposed algorithm takes into account this key difference and relies on different models and associated metrics to deal with color and motion information. We show that the proposed algorithm outperforms existing hierarchical video segmentation algorithms and provides more stable and precise regions.

4 0.23122753 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts

Author: Varun Ramakrishna, Takeo Kanade, Yaser Sheikh

Abstract: The human body is structurally symmetric. Tracking-by-detection approaches for human pose suffer from double counting, where the same image evidence is used to explain two separate but symmetric parts, such as the left and right feet. Double counting, if left unaddressed, can critically affect subsequent processes, such as action recognition, affordance estimation, and pose reconstruction. In this work, we present an occlusion-aware algorithm for tracking human pose in an image sequence that addresses the problem of double counting. Our key insight is that tracking human pose can be cast as a multi-target tracking problem where the "targets" are related by an underlying articulated structure. The human body is modeled as a combination of singleton parts (such as the head and neck) and symmetric pairs of parts (such as the shoulders, knees, and feet). Symmetric body parts are jointly tracked with mutual exclusion constraints to prevent double counting by reasoning about occlusion. We evaluate our algorithm on an outdoor dataset with natural background clutter, a standard indoor dataset (HumanEva-I), and compare against a state of the art pose estimation algorithm.

5 0.23002973 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Author: Fernando Flores-Mangas, Allan D. Jepson

Abstract: The problem of rigid motion segmentation of trajectory data under orthography has been long solved for nondegenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state of the art segmentation accuracies, at the expense of a large computational cost. This paper proposes a method that dramatically reduces this cost (by two or three orders of magnitude) with minimal accuracy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standard Hopkins 155 dataset). Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation.
This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3] in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then (spectral) clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic on the number of trajectories. The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles. Re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but the computational cost is close to 10 times bigger than SSC’s. In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art. 1.1. Affine Motion Projective Geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also been proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths, that are unaccounted for by the planar surface model. Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercise all of their degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real world video sequences. The incremental nature of the decompositions described above, facilitate the transition between degenerate motions and nondegenerate ones. Planar Model Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation: ⎣⎡xy1c ⎦⎤=?0D? 
1t?⎡⎣yx1w ⎦⎤= A2wD→c⎣⎡yx1w ⎦⎤, (1) where D ∈ R2×2 is an invertible matrix, and t ∈ R2 is a threarnesl Datio ∈n v Rector. Trajectory icboloer mdiantartixe,s (axnwd , tyw ∈) are in the plane’s reference frame (modulo a 2D affine transformation) and (xc, yc) are image coordinates. Now, let W ∈ R2F×P be matrix of trajectory data that conNtaoiwns, tlehet x a n∈d y image coordinates of P feature points tracked through F frames, as in TocmputehW pa=r⎢m⎣ ⎡⎢ etyx e1F r.,s 1 ofA· . ·2.Dfyx r1Fo m., P tra⎦⎥ ⎤je.ctorydat,(l2e)t C = [c1, c2 , c3] ∈ R2f×3 be three columns (three full trajectories) from W, and let = be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as: cif [ci2f−1, ci2f]? ?c1f1 c12f c1f3?= A2sD→f?c11s and A2s→Df c12s c13s?, (3) ?−1. (4) can be simply computed as: A2sD→f= ? c11f c12f c13f ? ? c11s c12s c13s The inverse in the right-hand side matrix of Eq. 4 exists so long as the points cis are not collinear. For simplicity we refer to as and consequently A2sD is the identity matrix. A2s→Df A2fD 222222556088 3D Model In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in: ⎣⎡xy1c ⎦⎤=?0?D? t1? ⎡⎣xy1w ⎦⎤+ δzw⎣⎡a 0213 ⎦⎤. The depth δzw is relative to the arbitrary plane tion is modeled by A2D; a point that lies on would have δzw = 0. We call the orthographic the plane plus parallax decomposition, the 2D Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to (5) whose mosuch plane version of Affine Plus wher⎣⎡ ixyt1c is⎤⎦cl=ear⎡⎣tha 120t hea p201a2rma2e10t3erst1o2f⎦⎤A⎣⎢⎡3Dδyxd1zwefin⎦⎥⎤ean(o6r)thographically projected 3D affine transformation. Deter- mining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work. 2. Overview of the Method The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = s{tMag1e , . . . , rMst,M a} p oiso generated using a omdeetlh hoydp tohthate csoesm Mbine =s a MRandom Sampling naenrda eCdon usseinngsu as m(ReAthNodS tAhaCt) [o4m] bteinche-s nique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than those expected necessary (M ? N) in an attempt ienlscr tehaasne tthhoes elik eexlpiheocotedd o nfe generating Mco ?rrec Nt m) iond aenl proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3). Ialnl ethste n suemcobnedr stage, msubetseertss o§f3 )m.otion models from M are ncom thebi sneecdo ntod explain ualbl sthetes trajectories mino tdheel sequence. The problem is framed as an objective function that must be minimized. 
The objective function is the negative loglikelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could be reported as a segmentation result (§5). ioTnhe r tshuilrtd ( stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6). 3. Motion Model Instantiation Each model M ∈ M is instantiated independently using RacAhN mSAodCel. MThis ∈ c Mhoi cies niss manotitiavteatded in bdeecpaeunsdee otlfy th usismethod’s well-known computational efficiency and robustness to outliers, but also because of its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings. The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data Wˆ[2F×I] ⊆ W[2F×P] (§3.1). In turn, at each RANSAC trial, the algorithm draw(§s3 uniformly d,i asttr eibaucthed R, A rNanSdoAmC subsets of three control trajectories (C[2F×3] ⊂ Wˆ[2F×I]). Each set of control trajectories is used to estim⊂ate the family of 2D affine transformations {A1, . . . , AF} between the iblyase o ffr 2aDm aef ainnde aralln sotfoherrm fartaimoness { iAn the sequence, wtwheicehn are then used to determine a complete set of model parameters M = {B, σ, C, ω}. The matrix B ∈ {0, 1}[F×I] indicates Mwhe =the {rB t,hσe ,iC-th, trajectory asthroixu Bld ∈b e predicted by model M at frame f (inlier, bif = 1) or not (outlier, bif = 0), σ = {σ1 , . . . , σF} are estimates of the magnitude of the σnois =e {foσr each fram}e a, aen eds ω ∈at {s2 oDf, t3hDe} m isa tnhietu edsetim ofa ttehde nmooidseel f type. hTh fera goal aisn dto ω ωfin ∈d {t2heD c,3oDntr}ol is points tainmda ttehed associated parameters that minimize the objective function O(Wˆ,M) =f?∈Fi?∈IbifLω? wˆif| Af,σf?+ Ψ(ω) + Γ(B) across (7) wˆfi a number of RANSAC trials, where = = are the coordinates of the i-th trajectory from the support subset at frame f. The negative log-likelihood term Lω (·) penalizes reconstruction error, while Ψ(·) and Γ(·) are regularizers. Tcohen tthrureceti otenr mer-s are ,d wefhinileed Ψ Ψ b(e·l)ow an. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, Ψ(ω) regularizes over model complexity using: (xif, yif) ( wˆ 2if−1, wˆ i2f) Wˆ Ψ(ω) =?86((FF − − 1 1)), i f ωω== 32DD. 222222556199 (8) Γ(B) strongly penalizes models that describe too few trajectories: Γ(B) =?0∞,, oifth?erwI?iseFbif< Fλi (9) The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M. 2D likelihoods. For the planar case (ω = 2D) the negative log-likelihood term is evaluated with: L2D( wˆif| Af,σf) = −log?2π|Σ1|21exp?−21rif?Σ−1rif??, (10) which is a zero-mean 2D Normal distribution evaluated at the residuals The spherical covariance matrix is Σ = rif. rif (σf)2I. 
The residuals are determined by the differences between the predictions made by a hypothesized model Af, and the observations at each frame ?r?1f?=? w˜1?f?− Af? w˜1?s?. (11) 3D likelihoods. The negative log-likelihood term for the 3D case is based on the the 2DAPE decomposition. The 2D affinities Af and residuals rf are reused, but to account for the effect of relative depth, an epipolar line segment ef is robustly fit to the residual data at each frame (please see supplementary material for details on the segment fitting algorithm). The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector ef indicates the orthogonal direction to ef, then the negativ⊥e log-likelihood term for the 3D case is estimated with: L3D( wˆfi| Af,σf) = −2log⎜⎝⎛√21πσfexp⎪⎨⎪⎧−?r2if(?σfe)f⊥2?2⎪⎬⎪⎫⎞⎟⎠, ⎠(12,) which is also a zero-mean 2D Norma⎩l distribution ⎭computed as the product of two identical, separable, singlevariate, normal distributions, evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of ef , which is analyti- cally computed using rif?ef. The seco⊥nd one corresponds to an estimate of the deviat⊥ion in the perpendicular direction (ef), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to rif ? ef, which is a plausible estimate under the isotropic noise as⊥sumption. Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as it is typical for RANSAC. Instead, we found that better motion models resulted from Algorithm 1: Motion model instantiation × Algorithm 1: Motion model instantiation Input:b Traasejec frtoamrye d bata W[2F×P], number of RANSAC trials K, arbitrary Output: Parameters of the motion model M = {B , σn , ω} // determine the training set c ← rand( 1, P) ; r ← rand(rmin , rmax ) // random center and radius I] ← t ra j e ct oriesWithinDis k (W, r,c) // support subset X ← homoCoords(Wˆb) // points at base frame for K RANSAC trials do Wˆ[2F Wˆ return M = {B} optimizing over the accuracy ofthe model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly uncounted. Figure 1.b shows an example of class-labeled trajectory data, 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model’s control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine Wˆ, as well as B, σ, and ω follow. 3.1. Local Coherence The subset of trajectories Wˆ given to RANSAC to generate a model M is constrained to a spatially local region. The probability ofchoosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is: p = 1 − (1 − r3)k. This means that the number of trials pne =ede 1d −to (fi1n d− a subset of 3 inliers with probability p is k =lloogg((11 − − r p3)). (13) A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing 222222666200 Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. 
The magnitude of the extrapolation error changes with the distance between the control points. k, and with it, RANSAC’s computation time by a proportional amount. The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors. Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by Wˆ = {wi | (xbi − ox)2 + (ybi − oy)2 < r2}, (14) where (ox , oy) are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory (ox , oy) = (xbi, yib) with uniformly distributed i ∼ U(1, P) and base frame b) w∼i h U u(1n,i fFor)m. yTo d asltrloibwu efodr idi ∼ffer Ue(n1t, region ds bizaesse, tfhraem read bius ∼ r is( ,cFho)s.en T ofro amllo a u fnoirfo dirmffe rdeinsttr riebugtiioonn r ∼s, tUh(erm raidni,u ursm rax i)s. Ihfo tsheenre f are mI a trajectories swtritihbiunt othne support region, then ∈ R2F×I. It is worth noting that the construction of theW support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence will likely contain trajectories that originate from more than one independently moving object (Figure 3). Wˆ Wˆ Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class. 4. Characterizing the Residual Distribution At each RANSAC iteration, residuals rf are computed using the 2D affine model Af that results from the constraints provided by the control trajectories C. Characterizing the distribution of rf has three initial goals. The first one is to determine 2D model inliers b2fD (§4.1), the second one is to compute estimates of the magnitude ,o tfh thee s ncooinsed at every frame σ2fD (§4.2), and the third one is to determine whether the residual( §d4i.s2t)r,ib auntidon th originates efr iosm to a planar or a 3D object (§4.3). If the object is suspected 3D, then two more goals n (§e4ed.3 )to. bIfe t achieved. sT shues pfiercstt one Dis, t hoe nde ttweromine 3D model inliers b3fD (§4.4), and the second one is to estimate the magnitude of the noise of a 3D model (§4.5). (σf3D) to reflect the use 4.1. 
2D Inlier Detection Suppose the matrix Wˆ contains trajectories Wˆ1 ∈ R2F×I and Wˆ2 ∈ R2F×J from two independently moving objects, and ∈tha Rt these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance η ∼ N(0, (σf)2I): Wˆ = ?Wˆ1|Wˆ2? + η. (15) A1f Now, assume we know the true affine transformations and that describe the motion of trajectories for the subsets Wˆ1 and Wˆ2, respectively. If is used to compute predictions for all of Wˆ (at frame f), the expected value (denoted by ?·?) of the magnitude of the residuals (rf from Eq. 11) for trajectories aing nWiˆtud1 will be in the order of the magnitude of the underlying noise ?|rif |? = σf for each i∈ {1, . . . , I}. But in this scenario, trajectories in Wˆ2 ewaicl h b ie ∈ predicted using tth ien wrong emnaodrioel,, resulting isn i nr esid?uals? wit?h magnitudes de?termined by the motion differential A2f ???rif??? A1f ???(A1f − A2f) wˆib???. If we = can assume that the motion ?d??riff???er =en???t(iAal is bigger tha???n. tIhfe w deis cpalnac aesmsuemnte d thuea t toh eno miseo:t ???(A1f − A2f)wib??? 222222666311 > σf, (16) then the model inliers can be determined by thresholding | with the magnitude of the noise, scaled by a constant |(τr =| w wλitσhσtf h):e |rif bif=?10,, |orthife|r ≤wi τse. (17) But because σf is generally unknown, the threshold (τ) is estimated from the residual data. To do so, let be the vector of residual magnitudes where rˆi ≤ ˆ ri+1 . Now, let r˜ = median ( rˆi+1 −ˆ r i). The threshold i≤s trˆ h en defined as r τ = min{ rˆi | (ri+1 − ri) > λr˜ r}, (18) which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows classlabeled residuals. Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin), and the rest. 4.2. Magnitude of the Noise, 2D Model r2fD Let contain only the residuals of the inlier trajectories (those where = 1), and let USV? be the singular value decomposition of the covariance matrix bif ofˆ r2fD: USV?= svd??1bpf( rˆ2fD)? rˆ2fD?.(19) Then the magnitude of the n?oise corresponds to the largest singular value σ2 = s1, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next. 4.3. Model Capacity The ratio of largest over smallest singular values (s1/s2) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio s1/s2 ? 1. If on the other hand, if s1/s2 ≈ 1, then there is no in 1d.ic Iafti oonn tohfe unexplained relative depth, tihn wnh thicehr case, fitting a olinne o tfo u spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry. 4.4. 3D Inlier Detection When the residual distribution is elongated (s1/s2 ? 1), a line segment is robustly fit to the (potentially con?tam 1i)-, nated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. 
Further details about this algorithm can be found in the supplementary material. Inlier detection The resulting line segment is used to determine 3D model inliers. Trajectory ibecomes an inlier at frame f if it satisfies two conditions. First, the projection of rif onto the line must lie within the segment limits (β ≤ ≤ γ). Second, the normalized distance to the rif?ef (ef?rif line must be below a threshold ≤ σ2λd). Notice that the threshold depends on the smalle≤st singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar (ef). 4.5. Magnitude of the Noise, 3D Model let rˆf3D Similarly to the 2D case, contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of r3fD (as in Eq. 19). In this case, the largest singular value s1 captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, s2 captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making σ2 = s2 a reasonable estimate for its magnitude. Optimal model parameters When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (7) becomes the winning model for the current RANSAC run. The same penalized negative loglikelihood metric is used to determine the better model from across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming hthee p pool MM, a=n d{ tMhe1 , . . . , MssM is} r.e 5. Optimal Model Subset The next step is to find the model combination M? ⊂ M thhea tn mxta xstiempiz iess t prediction accuracy finora othne Mwhol⊂e trajectory mdaaxtaim Wize,s w phreiledi minimizing cmyod foerl complexity and modelling overlap. For this purpose, let Mj = {Mj,1 , . . . , Mj,N} be the j-th m thoisdel p ucorpmosbein,at lieotn, M Mand let {Mj} be the set o}f baell MheC jN-th = m N!(MM−!N)!) caotimonb,in aantdio lnest of N-sized models than can be draNw!(nM fr−oNm) M). The model soefle Nct-sioinze problem sis t hthanen c faonr bmeu dlartaewdn as M?= ar{gMmj}inOS(Mj), (20) 222222666422 where the objective is ?N ?P OS(Mj) = ??πp,nE (wp,Mj,n) ?n=1p?P=1 + λΦ?Φ(wp,Mj,n) + λΨ?Ψ(Mj,n). ?N (21) i?= ?1 n?= ?1 The first term accounts for prediction accuracy, the other two are regularization terms. Details follow. Prediction Accuracy In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3. 1). Instead, model parametferorsm are computed owinittsh a efeac §to3r.1iz)a.ti Ionnst e baadse,d m [o1d0e]l mpaertahmode.Given the inlier labeling B in M, let WB be the subset of trajectories where bif = 1for at least half of the frames. The orthonormal basis S of a ω = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of WB. Using S as the model’s motion matrices, prediction accuracy can be computed using: E(w, M) = ??SS?w − w??2 , (22) which is the sum of squared?? Euclidean d??eviations from the predictions (SS?w), to the observed data (w). 
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise. Ownership variables Π ∈ {0, 1}[P×N] indicate whether trajectory p i ps explained by t {he0 ,n1-}th model (πp,n = 1) or not (πp,n = 0), and are determined by maximum prediction accuracy (i.e. minimum Euclidean deviation): πp,n=⎨⎧01,, oift hMerjw,nis=e. aMrg∈mMinjE(wp,M) (23) Regularization terms The second term from Eq. 21 penalizes situations where multiple models explain a trajectory (w) with relatively small residuals. For brevity, let M) = exp{−E(w, M)}, then: Eˆ(w, Φ(w,Mj) = −logMMm∈?∈aMMxjE ˆ ( w , MM)). (24) The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants λΦ and λΨ modulate the effect of the corresponding regularizer. Table 1: Accuracy and run-time for the H155 dataset. Naive RANSAC included as a baseline with overall accuracy and total computation time estimated using data from [12]. SOCARAPSulgrCAoNs[S r[31itA]4h]CmAverage89 A689 c. 71c 7695u racy[%]Compu1 t4a 237t1i506o70 n0 time[s] 6. Refinement The optimal model subset M? yields ownership variableTsh Πe o? wtimhicahl can already tb e M interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings Πt from the top T subsets {Mt? | 1 ≤ t ≤ T} closest to optimal. Multiple labellings are incorporated oinsetos an affinity matrix F, where the fi,j entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labelli?ngs, weighted b?y the relative objective function O˜t = exp ?−OOSS((WW||MMt??))? for such a labelling: fi,j=?tT=11O˜tt?T=1?πit,:πjt,?:?O˜t (25) Note that the inne?r product between the label vectors (πi,:πj?,:) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method’s final segmentation result. 7. Experiments Evaluation was made through three experimental setups. Hopkins 155 The Hopkins 155 (H155) dataset has been the standard evaluation metric for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state of the art methods, with minimal accuracy loss. Computation times are measured in the same (or very similar) hardware architectures. Like in CPA, our implementation uses a single set of parameters for all the experiments, but as others had pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper. 222222666533 Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise. Artificial Noise The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes towards the segmentation accuracy of our method, in comparison with the state of the art. 
We noted that H155 contains structured, long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of the (mean-subtracted) trajectories of each segment. Multiple versions of H155 were then computed by contaminating the noise-free dataset with Gaussian noise of magnitude σn ∈ {0.01, 0.25, 0.5, 1, 2, 4, 8}. Our method, as well as SSC and CPA, was run on these noise-controlled datasets; results are shown in Figure 4. The error bars for SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA was generated with only one run per dataset (running time: 11.95 days). The graph indicates that our method compromises accuracy only at large noise levels, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

KLT Tracking The last experimental setup evaluates the applicability of the algorithm under real-world conditions, using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

Figure 5: Results on raw KLT tracks from four Formula 1 sequences (Italian Grand Prix ©2012 Formula 1). In this figure, all trajectories are given a motion label, including outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Its efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found useful for motion segmentation of real-world, raw trajectory data.

References

[1] http://www.ces.clemson.edu/~stb/klt. 8
[2] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 1998. 1
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009. 2, 7
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981. 3
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. ECCV, 1996. 2
[6] K. Kanatani. Motion segmentation by subspace separation: Model selection and reliability evaluation. International Journal of Image and Graphics, 2002. 1
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987. 1
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159–177, 2008. 1
[9] C. Tomasi and T. Kanade. Shape and motion without depth. In Proc. ICCV, 1990. 1
[10] C. Tomasi and T. Kanade.
Shape and motion from image streams under orthography: A factorization method. IJCV, 1992. 1, 3, 7
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 1998. 1
[12] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In Proc. CVPR, 2007. 7
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008. 1
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. In Proc. ACCV, 2011. 2, 7
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. In Proc. CVPR, 2003. 1

6 0.21902941 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow

7 0.21424411 209 cvpr-2013-Hypergraphs for Joint Multi-view Reconstruction and Multi-object Tracking

8 0.20864645 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models

9 0.19643781 301 cvpr-2013-Multi-target Tracking by Rank-1 Tensor Approximation

10 0.16682382 440 cvpr-2013-Tracking People and Their Objects

11 0.16599907 263 cvpr-2013-Learning the Change for Automatic Image Cropping

12 0.15875451 314 cvpr-2013-Online Object Tracking: A Benchmark

13 0.15595782 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking

14 0.13856009 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

15 0.13802758 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

16 0.13636073 457 cvpr-2013-Visual Tracking via Locality Sensitive Histograms

17 0.13067779 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential

18 0.12941273 414 cvpr-2013-Structure Preserving Object Tracking

19 0.12626842 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters

20 0.1233367 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.247), (1, 0.049), (2, 0.008), (3, -0.103), (4, 0.04), (5, -0.034), (6, 0.193), (7, -0.13), (8, 0.034), (9, 0.224), (10, 0.055), (11, -0.006), (12, -0.102), (13, 0.053), (14, -0.005), (15, 0.102), (16, -0.021), (17, 0.052), (18, 0.06), (19, -0.044), (20, 0.062), (21, -0.039), (22, -0.07), (23, 0.166), (24, 0.106), (25, 0.046), (26, 0.086), (27, -0.039), (28, -0.113), (29, -0.146), (30, -0.037), (31, -0.081), (32, 0.015), (33, -0.019), (34, 0.068), (35, -0.016), (36, -0.055), (37, -0.036), (38, 0.008), (39, 0.036), (40, -0.011), (41, 0.008), (42, 0.087), (43, -0.064), (44, -0.046), (45, 0.042), (46, -0.04), (47, -0.053), (48, 0.043), (49, -0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95398885 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking


2 0.87014282 301 cvpr-2013-Multi-target Tracking by Rank-1 Tensor Approximation

Author: Xinchu Shi, Haibin Ling, Junling Xing, Weiming Hu

Abstract: In this paper we formulate multi-target tracking (MTT) as a rank-1 tensor approximation problem and propose an ?1 norm tensor power iteration solution. In particular, a high order tensor is constructed based on trajectories in the time window, with each tensor element as the affinity of the corresponding trajectory candidate. The local assignment variables are the ?1 normalized vectors, which are used to approximate the rank-1 tensor. Our approach provides a flexible and effective formulation where both pairwise and high-order association energies can be used expediently. We also show the close relation between our formulation and the multi-dimensional assignment (MDA) model. To solve the optimization in the rank-1 tensor approximation, we propose an algorithm that iteratively powers the intermediate solution followed by an ?1 normalization. Aside from effectively capturing high-order motion information, the proposed solver runs efficiently with proved convergence. The experimental validations are conducted on two challenging datasets and our method demonstrates promising performances on both.

3 0.82455093 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models

Author: Jingchen Liu, Peter Carr, Robert T. Collins, Yanxi Liu

Abstract: We employ hierarchical data association to track players in team sports. Player movements are often complex and highly correlated with both nearby and distant players. A single model would require many degrees of freedom to represent the full motion diversity and could be difficult to use in practice. Instead, we introduce a set of Game Context Features extracted from noisy detections to describe the current state of the match, such as how the players are spatially distributed. Our assumption is that players react to the current situation in only a finite number of ways. As a result, we are able to select an appropriate simplified affinity model for each player and time instant using a random decisionforest based on current track and game contextfeatures. Our context-conditioned motion models implicitly incorporate complex inter-object correlations while remaining tractable. We demonstrate significant performance improvements over existing multi-target tracking algorithms on basketball and field hockey sequences several minutes in duration and containing 10 and 20 players respectively.

4 0.78691757 209 cvpr-2013-Hypergraphs for Joint Multi-view Reconstruction and Multi-object Tracking

Author: Martin Hofmann, Daniel Wolf, Gerhard Rigoll

Abstract: We generalize the network flow formulation for multiobject tracking to multi-camera setups. In the past, reconstruction of multi-camera data was done as a separate extension. In this work, we present a combined maximum a posteriori (MAP) formulation, which jointly models multicamera reconstruction as well as global temporal data association. A flow graph is constructed, which tracks objects in 3D world space. The multi-camera reconstruction can be efficiently incorporated as additional constraints on the flow graph without making the graph unnecessarily large. The final graph is efficiently solved using binary linear programming. On the PETS 2009 dataset we achieve results that significantly exceed the current state of the art.

5 0.67218846 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

Author: Horst Possegger, Sabine Sternig, Thomas Mauthner, Peter M. Roth, Horst Bischof

Abstract: Combining foreground images from multiple views by projecting them onto a common ground-plane has been recently applied within many multi-object tracking approaches. These planar projections introduce severe artifacts and constrain most approaches to objects moving on a common 2D ground-plane. To overcome these limitations, we introduce the concept of an occupancy volume exploiting the full geometry and the objects ’ center of mass and develop an efficient algorithm for 3D object tracking. Individual objects are tracked using the local mass density scores within a particle filter based approach, constrained by a Voronoi partitioning between nearby trackers. Our method benefits from the geometric knowledge given by the occupancy volume to robustly extract features and train classifiers on-demand, when volumetric information becomes unreliable. We evaluate our approach on several challenging real-world scenarios including the public APIDIS dataset. Experimental evaluations demonstrate significant improvements compared to state-of-theart methods, while achieving real-time performance. – –

6 0.67148685 199 cvpr-2013-Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization

7 0.65791142 440 cvpr-2013-Tracking People and Their Objects

8 0.64633703 300 cvpr-2013-Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow

9 0.64364111 224 cvpr-2013-Information Consensus for Distributed Multi-target Tracking

10 0.62609911 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis

11 0.59640682 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

12 0.58326477 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

13 0.54820824 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

14 0.54439455 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking

15 0.51976317 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking

16 0.5013082 314 cvpr-2013-Online Object Tracking: A Benchmark

17 0.49492911 439 cvpr-2013-Tracking Human Pose by Tracking Symmetric Parts

18 0.48889408 267 cvpr-2013-Least Soft-Threshold Squares Tracking

19 0.48063445 24 cvpr-2013-A Principled Deep Random Field Model for Image Segmentation

20 0.47450402 308 cvpr-2013-Nonlinearly Constrained MRFs: Exploring the Intrinsic Dimensions of Higher-Order Cliques


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.153), (14, 0.124), (16, 0.022), (26, 0.05), (33, 0.327), (34, 0.01), (67, 0.057), (69, 0.054), (87, 0.112)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96219599 169 cvpr-2013-Fast Patch-Based Denoising Using Approximated Patch Geodesic Paths

Author: Xiaogang Chen, Sing Bing Kang, Jie Yang, Jingyi Yu

Abstract: Patch-based methods such as Non-Local Means (NLM) and BM3D have become the de facto gold standard for image denoising. The core of these approaches is to use similar patches within the image as cues for denoising. The operation usually requires expensive pair-wise patch comparisons. In this paper, we present a novel fast patch-based denoising technique based on Patch Geodesic Paths (PatchGP). PatchGPs treat image patches as nodes and patch differences as edge weights for computing the shortest (geodesic) paths. The path lengths can then be used as weights of the smoothing/denoising kernel. We first show that, for natural images, PatchGPs can be effectively approximated by minimum hop paths (MHPs) that generally correspond to Euclidean line paths connecting two patch nodes. To construct the denoising kernel, we further discretize the MHP search directions and use only patches along the search directions. Along each MHP, we apply a weightpropagation scheme to robustly and efficiently compute the path distance. To handle noise at multiple scales, we conduct wavelet image decomposition and apply PatchGP scheme at each scale. Comprehensive experiments show that our approach achieves comparable quality as the state-of-the-art methods such as NLM and BM3D but is a few orders of magnitude faster.

2 0.95989096 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling

Author: Xiaoshuai Sun, Xin-Jing Wang, Hongxun Yao, Lei Zhang

Abstract: In this paper, we propose a computational model of visual representativeness by integrating cognitive theories of representativeness heuristics with computer vision and machine learning techniques. Unlike previous models that build their representativeness measure based on the visible data, our model takes the initial inputs as explicit positive reference and extend the measure by exploring the implicit negatives. Given a group of images that contains obvious visual concepts, we create a customized image ontology consisting of both positive and negative instances by mining the most related and confusable neighbors of the positive concept in ontological semantic knowledge bases. The representativeness of a new item is then determined by its likelihoods for both the positive and negative references. To ensure the effectiveness of probability inference as well as the cognitive plausibility, we discover the potential prototypes and treat them as an intermediate representation of semantic concepts. In the experiment, we evaluate the performance of representativeness models based on both human judgements and user-click logs of commercial image search engine. Experimental results on both ImageNet and image sets of general concepts demonstrate the superior performance of our model against the state-of-the-arts.

same-paper 3 0.9565087 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking


4 0.949516 299 cvpr-2013-Multi-source Multi-scale Counting in Extremely Dense Crowd Images

Author: Haroon Idrees, Imran Saleemi, Cody Seibert, Mubarak Shah

Abstract: We propose to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd visible in a single image. Due to problems including perspective, occlusion, clutter, and few pixels per person, counting by human detection in such images is almost impossible. Instead, our approach relies on multiple sources such as low confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with confidence associated with observing individuals, in an image region. Secondly, we employ a global consistency constraint on counts using Markov Random Field. This caters for disparity in counts in local neighborhoods and across scales. We tested our approach on a new dataset of fifty crowd images containing 64K annotated humans, with the head counts ranging from 94 to 4543. This is in stark con- trast to datasets usedfor existing methods which contain not more than tens of individuals. We experimentally demonstrate the efficacy and reliability of the proposed approach by quantifying the counting performance.

5 0.94665432 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

Author: Horst Possegger, Sabine Sternig, Thomas Mauthner, Peter M. Roth, Horst Bischof

Abstract: Combining foreground images from multiple views by projecting them onto a common ground-plane has been recently applied within many multi-object tracking approaches. These planar projections introduce severe artifacts and constrain most approaches to objects moving on a common 2D ground-plane. To overcome these limitations, we introduce the concept of an occupancy volume exploiting the full geometry and the objects ’ center of mass and develop an efficient algorithm for 3D object tracking. Individual objects are tracked using the local mass density scores within a particle filter based approach, constrained by a Voronoi partitioning between nearby trackers. Our method benefits from the geometric knowledge given by the occupancy volume to robustly extract features and train classifiers on-demand, when volumetric information becomes unreliable. We evaluate our approach on several challenging real-world scenarios including the public APIDIS dataset. Experimental evaluations demonstrate significant improvements compared to state-of-theart methods, while achieving real-time performance. – –

6 0.9448185 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

7 0.94289422 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

8 0.94229132 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds

9 0.94055349 143 cvpr-2013-Efficient Large-Scale Structured Learning

10 0.94030881 227 cvpr-2013-Intrinsic Scene Properties from a Single RGB-D Image

11 0.93909627 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

12 0.93871236 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

13 0.93858761 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

14 0.9381609 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image

15 0.93788588 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

16 0.93763447 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning

17 0.93758267 19 cvpr-2013-A Minimum Error Vanishing Point Detection Approach for Uncalibrated Monocular Images of Man-Made Environments

18 0.93740314 414 cvpr-2013-Structure Preserving Object Tracking

19 0.9373675 71 cvpr-2013-Boundary Cues for 3D Object Shape Recovery

20 0.93729419 147 cvpr-2013-Ensemble Learning for Confidence Measures in Stereo Vision