
Fast Rigid Motion Segmentation via Incrementally-Complex Local Models



Authors: Fernando Flores-Mangas, Allan D. Jepson

Abstract: The problem of rigid motion segmentation of trajectory data under orthography has long been solved for nondegenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state-of-the-art segmentation accuracies, at the expense of a large computational cost. This paper proposes a method that dramatically reduces this cost (by two to three orders of magnitude) with minimal accuracy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standard Hopkins 155 dataset). Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation.

1. Rigid Motion Segmentation

Rigid motion segmentation (MS) consists of separating regions, features, or trajectories from a video sequence into spatio-temporally coherent subsets that correspond to independent, rigidly-moving objects in the scene (Figure 1.b or 1.f). The problem currently receives renewed attention, partly because of the extensive amount of video sources and applications that benefit from MS to perform higher-level computer vision tasks, but also because the state of the art is reaching functional maturity. Motion segmentation methods are widely diverse, but most capture only a small subset of the constraints or algebraic properties that govern the image formation process of moving objects and their corresponding trajectories, such as the rank limit theorem [9, 10], the linear independence constraint (between trajectories from independent motions) [2, 13], the epipolar constraint [7], and the reduced rank property [11, 15, 13].

Figure 1: Model instantiation and segmentation. a) f-th original frame, Italian Grand Prix (c) 2012 Formula 1. b) Class-labeled trajectory data (red, green, blue and black correspond to chassis, helmet, background and outlier classes, respectively). c) Spatially-local support subset $\hat{W}_f$ for a candidate motion in blue. d) Candidate motion model inliers in red, control points $c_f^i$ from Eq. 3 in white. e) Residuals $r_f^i$ from Eq. 11, color-coded with label data; the radial coordinate is logarithmic. f) Segmentation result.

Model-selection based methods [11, 6, 8] balance model complexity with modeling accuracy and have been successful at incorporating more of these aspects into a single formulation. For instance, in [8] most model parameters are estimated automatically from the data, including the number of independent motions and their complexity, as well as the segmentation labels (including outliers). However, because of the large number of motion hypotheses that need to be instantiated, as well as the varying and potentially very large number of model parameters that must be estimated, the flexibility offered by this method comes at a large computational cost. Current state-of-the-art methods follow the trend of using sparse low-dimensional subspaces to represent trajectory data.
This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3], in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then spectrally clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic in the number of trajectories.

The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated with each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles, and the re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but its computational cost is close to 10 times that of SSC.

In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 and 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art.

1.1. Affine Motion

Projective geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths that are unaccounted for by the planar surface model. Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy.

In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercises all of its degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real-world video sequences. The incremental nature of the decompositions described above facilitates the transition between degenerate and nondegenerate motions.

Planar Model. Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation

$$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} = A^{2D}_{w \to c} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}, \qquad (1)$$

where $D \in \mathbb{R}^{2 \times 2}$ is an invertible matrix and $t \in \mathbb{R}^2$ is a translation vector. Trajectory coordinates $(x_w, y_w)$ are in the plane's reference frame (modulo a 2D affine transformation) and $(x_c, y_c)$ are image coordinates.

Now, let $W \in \mathbb{R}^{2F \times P}$ be the matrix of trajectory data that contains the x and y image coordinates of P feature points tracked through F frames, as in

$$W = \begin{bmatrix} x_1^1 & \cdots & x_P^1 \\ y_1^1 & \cdots & y_P^1 \\ \vdots & & \vdots \\ x_1^F & \cdots & x_P^F \\ y_1^F & \cdots & y_P^F \end{bmatrix}. \qquad (2)$$

To compute the parameters of $A^{2D}$ from trajectory data, let $C = [c^1, c^2, c^3] \in \mathbb{R}^{2F \times 3}$ be three columns (three full trajectories) from W, and let $c_f^i = [c^i_{2f-1}, c^i_{2f}]^\top$ be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as

$$\begin{bmatrix} c_f^1 & c_f^2 & c_f^3 \\ 1 & 1 & 1 \end{bmatrix} = A^{2D}_{s \to f} \begin{bmatrix} c_s^1 & c_s^2 & c_s^3 \\ 1 & 1 & 1 \end{bmatrix}, \qquad (3)$$

and $A^{2D}_{s \to f}$ can be simply computed as

$$A^{2D}_{s \to f} = \begin{bmatrix} c_f^1 & c_f^2 & c_f^3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c_s^1 & c_s^2 & c_s^3 \\ 1 & 1 & 1 \end{bmatrix}^{-1}. \qquad (4)$$

The inverse on the right-hand side of Eq. 4 exists so long as the points $c_s^i$ are not collinear. For simplicity we refer to $A^{2D}_{s \to f}$ as $A^{2D}_f$, and consequently $A^{2D}_s$ is the identity matrix.
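As a concrete illustration of Eqs. 3 and 4, the following Python sketch (ours, not the authors' implementation; the function name, the array layout of Eq. 2, and the synthetic example are assumptions) recovers the inter-frame affine transform from three control trajectories:

```python
# Sketch: estimating the 2D affine map of Eqs. 3-4 from three control
# trajectories, assuming W is a (2F x P) array laid out as in Eq. 2.
import numpy as np

def affine_from_controls(W, ctrl_idx, s, f):
    """A_{s->f}: 3x3 affine map between frames s and f (0-based), from Eq. 4.

    W        : (2F, P) trajectory matrix.
    ctrl_idx : indices of the three control trajectories.
    """
    def homo(frame):
        pts = W[2 * frame:2 * frame + 2, ctrl_idx]    # 2x3 control points
        return np.vstack([pts, np.ones((1, 3))])      # homogeneous, 3x3
    Cs, Cf = homo(s), homo(f)
    # Eq. 4: A = C_f C_s^{-1}; requires non-collinear control points.
    return Cf @ np.linalg.inv(Cs)

# Tiny usage example with synthetic data (F=2 frames, P=4 points):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal((2, 4))
    A_true = np.array([[1.1, 0.1, 0.5], [-0.2, 0.9, -0.3], [0.0, 0.0, 1.0]])
    moved = (A_true @ np.vstack([base, np.ones((1, 4))]))[:2]
    W = np.vstack([base, moved])                      # (2F x P) = 4x4
    A = affine_from_controls(W, [0, 1, 2], s=0, f=1)
    print(np.allclose(A, A_true))                     # True
```

A nearly collinear control set makes the inverse in Eq. 4 ill-conditioned, so in practice a conditioning check before inverting is advisable.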
3D Model. In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted for using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in

$$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} + \delta z_w \begin{bmatrix} a_{13} \\ a_{23} \\ 0 \end{bmatrix}. \qquad (5)$$

The depth $\delta z_w$ is relative to the arbitrary plane whose motion is modeled by $A^{2D}$; a point that lies on such plane would have $\delta z_w = 0$. We call the orthographic version of the plane plus parallax decomposition the 2D Affine Plus Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to

$$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & t_1 \\ a_{21} & a_{22} & a_{23} & t_2 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ \delta z_w \\ 1 \end{bmatrix}, \qquad (6)$$

where it is clear that the parameters of $A^{3D}$ define an orthographically projected 3D affine transformation. Determining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work.
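To make the decomposition concrete, here is a minimal sketch of the 2DAPE prediction of Eq. 5 (illustrative names and example; not from the paper's code):

```python
# Sketch of the 2DAPE prediction in Eq. 5: a planar prediction by A2D plus an
# epipolar residual term, i.e. a per-point relative depth delta_z scaled by a
# fixed direction vector.
import numpy as np

def predict_2dape(A2D, e, pts_w, delta_z):
    """Predict image points under the 2DAPE decomposition.

    A2D     : (3, 3) homogeneous 2D affine transform (Eq. 1).
    e       : (2,) epipolar direction vector ([a13, a23] in Eq. 5).
    pts_w   : (2, P) points in the plane's reference frame.
    delta_z : (P,) relative depths; points on the plane have delta_z == 0.
    """
    P = pts_w.shape[1]
    planar = (A2D @ np.vstack([pts_w, np.ones((1, P))]))[:2]  # planar part
    return planar + np.outer(e, delta_z)                      # epipolar part

# A point with delta_z = 0 is predicted by the planar model alone:
A2D = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 1.0], [0.0, 1.0]])
print(predict_2dape(A2D, np.array([0.3, 0.1]), pts, np.array([0.0, 2.0])))
```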
2. Overview of the Method

The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = {M_1, ..., M_M} is generated using a method that combines a Random Sampling and Consensus (RANSAC) [4] technique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than the number expected necessary (M >> N) in an attempt to increase the likelihood of generating correct model proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3).

In the second stage, subsets of motion models from M are combined to explain all the trajectories in the sequence. The problem is framed as an objective function that must be minimized. The objective function is the negative log-likelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could already be reported as a segmentation result (§5).

The third stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6).

3. Motion Model Instantiation

Each model M ∈ M is instantiated independently using RANSAC. This choice is motivated by the method's well-known computational efficiency and robustness to outliers, but also because of its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings.

The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data $\hat{W}_{[2F \times I]} \subseteq W_{[2F \times P]}$ (§3.1). In turn, at each RANSAC trial, the algorithm draws uniformly distributed, random subsets of three control trajectories ($C_{[2F \times 3]} \subset \hat{W}_{[2F \times I]}$). Each set of control trajectories is used to estimate the family of 2D affine transformations $\{A_1, \ldots, A_F\}$ between the base frame and all other frames in the sequence, which are then used to determine a complete set of model parameters M = {B, σ, C, ω}. The matrix $B \in \{0, 1\}_{[F \times I]}$ indicates whether the i-th trajectory should be predicted by model M at frame f (inlier, $b_f^i = 1$) or not (outlier, $b_f^i = 0$), $\sigma = \{\sigma_1, \ldots, \sigma_F\}$ are estimates of the magnitude of the noise for each frame, and $\omega \in \{2D, 3D\}$ is the estimated model type. The goal is to find the control points and the associated parameters that minimize the objective function

$$O(\hat{W}, M) = \sum_{f \in F} \sum_{i \in I} b_f^i \, L_\omega\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) + \Psi(\omega) + \Gamma(B) \qquad (7)$$

across a number of RANSAC trials, where $\hat{w}_f^i = (\hat{w}^i_{2f-1}, \hat{w}^i_{2f}) = (x_f^i, y_f^i)$ are the coordinates of the i-th trajectory from the support subset $\hat{W}$ at frame f. The negative log-likelihood term $L_\omega(\cdot)$ penalizes reconstruction error, while $\Psi(\cdot)$ and $\Gamma(\cdot)$ are regularizers; the three terms are defined below. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, $\Psi(\omega)$ regularizes over model complexity using

$$\Psi(\omega) = \begin{cases} 6(F-1), & \text{if } \omega = 2D \\ 8(F-1), & \text{if } \omega = 3D. \end{cases} \qquad (8)$$

$\Gamma(B)$ strongly penalizes models that describe too few trajectories:

$$\Gamma(B) = \begin{cases} 0, & \text{if } \sum_{i \in I} \sum_{f \in F} b_f^i \ge F\lambda \\ \infty, & \text{otherwise.} \end{cases} \qquad (9)$$

The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M.

2D likelihoods. For the planar case (ω = 2D) the negative log-likelihood term is evaluated with

$$L_{2D}\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) = -\log\!\left( \frac{1}{2\pi |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} \, r_f^{i\top} \Sigma^{-1} r_f^i \right) \right), \qquad (10)$$

which is a zero-mean 2D normal distribution evaluated at the residuals $r_f^i$. The spherical covariance matrix is $\Sigma = (\sigma_f)^2 I$. The residuals $r_f^i$ are determined by the differences between the predictions made by a hypothesized model $A_f$ and the observations at each frame:

$$r_f^i = \tilde{w}_f^i - A_f \, \tilde{w}_s^i. \qquad (11)$$
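A minimal sketch of Eqs. 10 and 11 (our illustration, vectorized over all trajectories, with assumed array shapes) follows:

```python
# Sketch of the per-frame 2D likelihood of Eqs. 10-11: the residual is the
# difference between the affine prediction and the observation, and the
# negative log-likelihood is that of an isotropic 2D Gaussian.
import numpy as np

def residuals_2d(A_f, pts_s, pts_f):
    """r_f^i = w_f^i - A_f w_s^i (Eq. 11), for all trajectories at once.

    A_f   : (3, 3) affine map from the base frame s to frame f.
    pts_s : (2, P) points at the base frame; pts_f: (2, P) points at frame f.
    """
    P = pts_s.shape[1]
    pred = (A_f @ np.vstack([pts_s, np.ones((1, P))]))[:2]
    return pts_f - pred                          # (2, P)

def nll_2d(r, sigma_f):
    """L_2D per trajectory (Eq. 10), with Sigma = sigma_f^2 I, so that
    |Sigma|^(1/2) = sigma_f^2 and r^T Sigma^{-1} r = ||r||^2 / sigma_f^2."""
    quad = (r ** 2).sum(axis=0) / sigma_f ** 2
    return np.log(2 * np.pi * sigma_f ** 2) + 0.5 * quad
```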
3D likelihoods. The negative log-likelihood term for the 3D case is based on the 2DAPE decomposition. The 2D affinities $A_f$ and residuals $r_f$ are reused, but to account for the effect of relative depth, an epipolar line segment $e_f$ is robustly fit to the residual data at each frame (please see the supplementary material for details on the segment fitting algorithm). The 2DAPE decomposition does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unit vector $e_f^\perp$ indicates the direction orthogonal to $e_f$, then the negative log-likelihood term for the 3D case is estimated with

$$L_{3D}\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) = -2\log\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\!\left( -\frac{\left( r_f^{i\top} e_f^\perp \right)^2}{2\sigma_f^2} \right) \right), \qquad (12)$$

which is also a zero-mean 2D normal distribution, computed as the product of two identical, separable, single-variate normal distributions, evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of $e_f^\perp$, which is analytically computed using $r_f^{i\top} e_f^\perp$. The second one corresponds to an estimate of the deviation in the perpendicular direction ($e_f$), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to $r_f^{i\top} e_f^\perp$, which is a plausible estimate under the isotropic noise assumption.

Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as is typical for RANSAC. Instead, we found that better motion models resulted from optimizing over the accuracy of the model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly discounted. Figure 1.b shows an example of class-labeled trajectory data, and 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model's control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion model instantiation procedure is provided in Algorithm 1. Details on how to determine $\hat{W}$, as well as B, σ, and ω, follow.

Algorithm 1: Motion model instantiation
  Input: Trajectory data W[2F×P], number of RANSAC trials K, arbitrary base frame b
  Output: Parameters of the motion model M = {B, σ, C, ω}
  // determine the training set
  c ← rand(1, P); r ← rand(rmin, rmax)            // random center and radius
  Ŵ[2F×I] ← trajectoriesWithinDisk(W, r, c)       // support subset (§3.1)
  X ← homoCoords(Ŵb)                              // points at base frame
  for K RANSAC trials do
      C ← three control trajectories drawn uniformly from Ŵ
      {A1, ..., AF} ← affine transforms from C (Eq. 4)
      rf ← residuals at each frame (Eq. 11)
      B, σ, ω ← inliers, noise magnitudes and model type from the residuals (§4)
      keep {B, σ, C, ω} if it minimizes O(Ŵ, M) (Eq. 7)
  return M = {B, σ, C, ω}
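Before those details, a small sketch of the 3D likelihood of Eq. 12 (our illustration, not the authors' code; it assumes a unit epipolar direction e_f has already been fit as described above):

```python
# Sketch of the 3D likelihood of Eq. 12: only the residual component
# orthogonal to the fitted epipolar direction e_f is measured; the same
# magnitude also stands in for the deviation along e_f, under the
# isotropic-noise assumption described in the text.
import numpy as np

def nll_3d(r, e_f, sigma_f):
    """L_3D per trajectory (Eq. 12). r: (2, P) residuals; e_f: (2,) unit vector."""
    e_perp = np.array([-e_f[1], e_f[0]])          # direction orthogonal to e_f
    d = e_perp @ r                                # r_f^i . e_f_perp, shape (P,)
    # -2 log( (1 / (sqrt(2 pi) sigma)) exp(-d^2 / (2 sigma^2)) )
    return 2 * (0.5 * np.log(2 * np.pi) + np.log(sigma_f)
                + d ** 2 / (2 * sigma_f ** 2))
```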
3.1. Local Coherence

The subset of trajectories $\hat{W}$ given to RANSAC to generate a model M is constrained to a spatially local region. The probability of choosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is $p = 1 - (1 - r^3)^k$. This means that the number of trials needed to find a subset of 3 inliers with probability p is

$$k = \frac{\log(1 - p)}{\log(1 - r^3)}. \qquad (13)$$

A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing k, and with it, RANSAC's computation time by a proportional amount. The trade-off that results from drawing model control points from a small region, however, is extrapolation error.

A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero-mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors.

Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. The magnitude of the extrapolation error changes with the distance between the control points.

Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by

$$\hat{W} = \{ w_i \mid (x_b^i - o_x)^2 + (y_b^i - o_y)^2 < r^2 \}, \qquad (14)$$

where $(o_x, o_y)$ are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory $(o_x, o_y) = (x_b^i, y_b^i)$ with uniformly distributed $i \sim U(1, P)$ and base frame $b \sim U(1, F)$. To allow for different region sizes, the radius r is chosen from a uniform distribution $r \sim U(r_{min}, r_{max})$. If there are I trajectories within the support region, then $\hat{W} \in \mathbb{R}^{2F \times I}$.

It is worth noting that the construction of the support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence will likely contain trajectories that originate from more than one independently moving object (Figure 3).

Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class.
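The sampling procedure of Eq. 14 is compact enough to state directly; the sketch below is our illustration (function name and return convention are assumptions, not the paper's code):

```python
# Sketch of the local support sampling of Eq. 14: a disk with random center
# trajectory, random base frame and random radius selects the support subset.
import numpy as np

def sample_support(W, r_min, r_max, rng):
    """Return indices of trajectories inside a randomly drawn disk, and the
    base frame used to measure distances.  W: (2F, P) trajectory matrix."""
    F2, P = W.shape
    F = F2 // 2
    i = rng.integers(P)                 # disk centered on a random trajectory
    b = rng.integers(F)                 # random base frame
    r = rng.uniform(r_min, r_max)       # random radius
    ox, oy = W[2 * b, i], W[2 * b + 1, i]
    d2 = (W[2 * b] - ox) ** 2 + (W[2 * b + 1] - oy) ** 2
    return np.flatnonzero(d2 < r ** 2), b
```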
4. Characterizing the Residual Distribution

At each RANSAC iteration, residuals $r_f$ are computed using the 2D affine model $A_f$ that results from the constraints provided by the control trajectories C. Characterizing the distribution of $r_f$ has three initial goals. The first one is to determine 2D model inliers $b_f^{2D}$ (§4.1), the second one is to compute estimates of the magnitude of the noise at every frame $\sigma_f^{2D}$ (§4.2), and the third one is to determine whether the residual distribution originates from a planar or a 3D object (§4.3). If the object is suspected 3D, then two more goals need to be achieved. The first one is to determine 3D model inliers $b_f^{3D}$ (§4.4), and the second one is to estimate the magnitude of the noise ($\sigma_f^{3D}$) to reflect the use of a 3D model (§4.5).

4.1. 2D Inlier Detection

Suppose the matrix $\hat{W}$ contains trajectories $\hat{W}_1 \in \mathbb{R}^{2F \times I}$ and $\hat{W}_2 \in \mathbb{R}^{2F \times J}$ from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance $\eta \sim N(0, (\sigma_f)^2 I)$:

$$\hat{W} = \left[ \hat{W}_1 \mid \hat{W}_2 \right] + \eta. \qquad (15)$$

Now, assume we know the true affine transformations $A_f^1$ and $A_f^2$ that describe the motion of trajectories for the subsets $\hat{W}_1$ and $\hat{W}_2$, respectively. If $A_f^1$ is used to compute predictions for all of $\hat{W}$ (at frame f), the expected value (denoted by $\langle \cdot \rangle$) of the magnitude of the residuals ($r_f$ from Eq. 11) for trajectories in $\hat{W}_1$ will be in the order of the magnitude of the underlying noise, $\langle |r_f^i| \rangle = \sigma_f$ for each $i \in \{1, \ldots, I\}$. But in this scenario, trajectories in $\hat{W}_2$ will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential, $\langle |r_f^i| \rangle = \| (A_f^1 - A_f^2) \tilde{w}_b^i \|$. If we can assume that the motion differential is bigger than the displacement due to noise,

$$\left\| \left( A_f^1 - A_f^2 \right) \tilde{w}_b^i \right\| > \sigma_f, \qquad (16)$$

then the model inliers can be determined by thresholding $|r_f^i|$ with the magnitude of the noise, scaled by a constant ($\tau = \lambda_\sigma \sigma_f$):

$$b_f^i = \begin{cases} 1, & |r_f^i| \le \tau \\ 0, & \text{otherwise.} \end{cases} \qquad (17)$$

But because $\sigma_f$ is generally unknown, the threshold τ is estimated from the residual data. To do so, let $\hat{r}$ be the vector of sorted residual magnitudes, so that $\hat{r}_i \le \hat{r}_{i+1}$, and let $\tilde{r} = \operatorname{median}_i(\hat{r}_{i+1} - \hat{r}_i)$. The threshold is then defined as

$$\tau = \min\{ \hat{r}_i \mid (\hat{r}_{i+1} - \hat{r}_i) > \lambda \tilde{r} \}, \qquad (18)$$

which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals. Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin), and the rest.

4.2. Magnitude of the Noise, 2D Model

Let $\hat{r}_f^{2D}$ contain only the residuals of the inlier trajectories (those where $b_f^i = 1$), and let $USV^\top$ be the singular value decomposition of the covariance matrix of $\hat{r}_f^{2D}$:

$$U S V^\top = \operatorname{svd}\!\left( \frac{1}{\sum_p b_f^p} \, \hat{r}_f^{2D} \left( \hat{r}_f^{2D} \right)^\top \right). \qquad (19)$$

Then the magnitude of the noise corresponds to the largest singular value, $\sigma^2 = s_1$, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of largest over smallest singular values ($s_1/s_2$) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio $s_1/s_2 \gg 1$. If, on the other hand, $s_1/s_2 \approx 1$, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.
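The gap test of Eq. 18 and the planarity test of §4.3 both operate on the same residual data; the following Python sketch (ours, with assumed names and shapes) shows both:

```python
# Sketch of the data-driven inlier threshold of Eq. 18 and the capacity test
# of Section 4.3: tau is the largest residual magnitude before a salient gap,
# and the singular-value ratio s1/s2 of the inlier residual covariance
# indicates whether upgrading to a 3D model is warranted.
import numpy as np

def inlier_threshold(res_mag, lam):
    """Eq. 18: smallest sorted magnitude preceding a gap > lam * median gap."""
    r_hat = np.sort(res_mag)
    gaps = np.diff(r_hat)
    r_tilde = np.median(gaps)
    salient = np.flatnonzero(gaps > lam * r_tilde)
    return r_hat[salient[0]] if salient.size else r_hat[-1]

def capacity_ratio(r_inliers):
    """s1/s2 from the SVD of the residual covariance (Eq. 19); >> 1 suggests 3D.

    r_inliers : (2, n) inlier residuals at one frame.
    """
    cov = (r_inliers @ r_inliers.T) / r_inliers.shape[1]
    s = np.linalg.svd(cov, compute_uv=False)      # singular values, descending
    return s[0] / s[1], s                         # ratio and (s1, s2)
```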
4.4. 3D Inlier Detection

When the residual distribution is elongated ($s_1/s_2 \gg 1$), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. Further details about this algorithm can be found in the supplementary material.

Inlier detection. The resulting line segment is used to determine 3D model inliers. Trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of $r_f^i$ onto the line must lie within the segment limits ($\beta \le r_f^{i\top} e_f \le \gamma$). Second, the normalized distance to the line must be below a threshold ($|e_f^{\perp\top} r_f^i| \le \lambda_d \sigma_2$). Notice that the threshold depends on the smallest singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line.

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let $\hat{r}_f^{3D}$ contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of $\hat{r}_f^{3D}$ (as in Eq. 19). In this case, the largest singular value $s_1$ captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, $s_2$ captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making $\sigma^2 = s_2$ a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (Eq. 7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the best model across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming the pool M = {M_1, ..., M_M}.

5. Optimal Model Subset

The next step is to find the model combination M* ⊂ M that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modeling overlap. For this purpose, let $\mathcal{M}_j = \{M_{j,1}, \ldots, M_{j,N}\}$ be the j-th model combination, and let $\{\mathcal{M}_j\}$ be the set of all $\binom{M}{N} = \frac{M!}{N!(M-N)!}$ combinations of N models that can be drawn from M. The model selection problem is then formulated as

$$\mathcal{M}^* = \arg\min_{\mathcal{M}_j} O_S(\mathcal{M}_j), \qquad (20)$$

where the objective is

$$O_S(\mathcal{M}_j) = \sum_{n=1}^{N} \sum_{p=1}^{P} \pi_{p,n} \, E\!\left(w_p, M_{j,n}\right) + \lambda_\Phi \sum_{p=1}^{P} \Phi\!\left(w_p, \mathcal{M}_j\right) + \lambda_\Psi \sum_{n=1}^{N} \Psi\!\left(M_{j,n}\right). \qquad (21)$$

The first term accounts for prediction accuracy; the other two are regularization terms. Details follow.

Prediction Accuracy. In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3.1). Instead, model parameters are computed with a factorization-based [10] method. Given the inlier labeling B in M, let $W_B$ be the subset of trajectories where $b_f^i = 1$ for at least half of the frames. The orthonormal basis S of a ω = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of $W_B$. Using S as the model's motion matrices, prediction accuracy can be computed using

$$E(w, M) = \left\| S S^\top w - w \right\|^2, \qquad (22)$$

which is the sum of squared Euclidean deviations from the predictions ($SS^\top w$) to the observed data (w). Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise.
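A minimal sketch of Eq. 22 (ours; it takes the number of basis vectors from the text, 2 for ω = 2D and 3 for ω = 3D, and assumes W_B has already been assembled from the inlier labeling):

```python
# Sketch of the factorization-based prediction error of Eq. 22: the model's
# motion basis S is taken from the leading left singular vectors of the
# inlier trajectory matrix WB, and the error is the squared Euclidean
# distance between a trajectory and its projection onto span(S).
import numpy as np

def prediction_error(WB, w, rank):
    """E(w, M) = ||S S^T w - w||^2 with a rank-2 (2D) or rank-3 (3D) basis.

    WB : (2F, n) inlier trajectories; w : (2F,) trajectory to evaluate.
    """
    U, _, _ = np.linalg.svd(WB, full_matrices=False)
    S = U[:, :rank]                               # orthonormal motion basis
    return np.sum((S @ (S.T @ w) - w) ** 2)
```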
Ownership variables $\Pi \in \{0, 1\}_{[P \times N]}$ indicate whether trajectory p is explained by the n-th model ($\pi_{p,n} = 1$) or not ($\pi_{p,n} = 0$), and are determined by maximum prediction accuracy (i.e. minimum Euclidean deviation):

$$\pi_{p,n} = \begin{cases} 1, & \text{if } M_{j,n} = \arg\min_{M \in \mathcal{M}_j} E(w_p, M) \\ 0, & \text{otherwise.} \end{cases} \qquad (23)$$

Regularization terms. The second term from Eq. 21 penalizes situations where multiple models explain a trajectory w with relatively small residuals. For brevity, let $\hat{E}(w, M) = \exp\{-E(w, M)\}$; then

$$\Phi(w, \mathcal{M}_j) = -\log \frac{\max_{M \in \mathcal{M}_j} \hat{E}(w, M)}{\sum_{M \in \mathcal{M}_j} \hat{E}(w, M)}. \qquad (24)$$

The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants $\lambda_\Phi$ and $\lambda_\Psi$ modulate the effect of the corresponding regularizer.

Table 1: Accuracy and run-time for the H155 dataset. Naive RANSAC included as a baseline with overall accuracy and total computation time estimated using data from [12]. (Rows: SSC [3], CPA [14], RANSAC, Ours; columns: average accuracy [%] and total computation time [s].)

6. Refinement

The optimal model subset M* yields ownership variables Π*, which can already be interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings $\Pi_t$ from the top T subsets $\{\mathcal{M}_t^* \mid 1 \le t \le T\}$ closest to the optimal. Multiple labellings are incorporated into an affinity matrix F, where the $f_{i,j}$ entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labellings, weighted by the relative objective function $\tilde{O}_t = \exp\!\left( -\frac{O_S(W \mid \mathcal{M}_t^*)}{O_S(W \mid \mathcal{M}^*)} \right)$ for such a labelling:

$$f_{i,j} = \frac{1}{\sum_{t=1}^{T} \tilde{O}_t} \sum_{t=1}^{T} \left( \pi_{i,:}^t \, \pi_{j,:}^{t\top} \right) \tilde{O}_t. \qquad (25)$$

Note that the inner product between the label vectors ($\pi_{i,:} \pi_{j,:}^\top$) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method's final segmentation result.
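A minimal sketch of the affinity accumulation of Eq. 25 (our illustration; it assumes integer label vectors per labelling and that the first objective value corresponds to the optimal subset):

```python
# Sketch of Eq. 25: labellings from the top-T model subsets vote for pairs of
# trajectories, weighted by each subset's relative objective value O~_t.
import numpy as np

def affinity_matrix(labellings, objectives):
    """labellings : (T, P) integer labels; objectives : (T,) O_S values,
    with objectives[0] belonging to the optimal subset M*."""
    T, P = labellings.shape
    objectives = np.asarray(objectives, dtype=float)
    O_tilde = np.exp(-objectives / objectives[0])
    F = np.zeros((P, P))
    for t in range(T):
        # pi_i . pi_j^T is 1 exactly when trajectories i and j share a label
        same = labellings[t][:, None] == labellings[t][None, :]
        F += O_tilde[t] * same
    return F / O_tilde.sum()
```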
7. Experiments

Evaluation was made through three experimental setups.

Hopkins 155. The Hopkins 155 (H155) dataset has been the standard evaluation benchmark for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state-of-the-art methods, with minimal accuracy loss. Computation times are measured on the same (or very similar) hardware architectures. Like CPA, our implementation uses a single set of parameters for all the experiments, but as others have pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper.

Artificial Noise. The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art. We noted that H155 contains structured long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of (mean-subtracted) trajectories for each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes $\sigma_n \in \{0.01, 0.25, 0.5, 1, 2, 4, 8\}$. Our method, as well as SSC and CPA, were run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA is generated with only one run for each dataset (running time: 11.95 days). The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise.

KLT Tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

Figure 5: Motion segmentation results on raw KLT tracks from four Formula 1 sequences. Italian Grand Prix (c) 2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

References

[1] http://www.ces.clemson.edu/~stb/klt
[2] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: Model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159-177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Soc. of London, Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. Proc. CVPR, 2003.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract The problem of rigid motion segmentation of trajectory data under orthography has been long solved for nondegenerate motions in the absence of noise. [sent-4, score-0.876]

2 But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state of the art segmentation accuracies, at the expense of a large computational cost. [sent-5, score-1.323]

3 Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. [sent-9, score-0.594]

4 Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation. [sent-10, score-0.392]

5 Rigid Motion Segmentation Rigid motion segmentation (MS) consists on separating regions, features, or trajectories from a video sequence into spatio-temporally coherent subsets that correspond to independent, rigidly-moving objects in the scene (Figure 1. [sent-12, score-0.716]

6 b) Classilanbaell feradm, trajectory Gdaratan Wd P r(rixed ? [sent-20, score-0.3]

7 c) Spatially-local support subset for a candidate motion in blue. [sent-23, score-0.299]

8 d) Candidate motion model inliers in red, control points from Eq. [sent-24, score-0.523]

9 However, because of the large number of necessary motion hypotheses that need to be instantiated, as well as the varying and potentially very large number of 222222555977 model parameters that must be estimated, the flexibility offered by this method comes at a large computational cost. [sent-31, score-0.251]

10 Current state of the art methods follow the trend of using sparse low-dimensional subspaces to represent trajectory data. [sent-32, score-0.394]

11 A prime example of this type of method is Sparse Subspace Clustering (SSC) [3] in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. [sent-34, score-0.3]

12 The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). [sent-35, score-0.949]

13 Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. [sent-40, score-0.3]

14 In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. [sent-44, score-0.729]

15 The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art. [sent-45, score-0.266]

16 Affine Motion Projective Geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. [sent-48, score-0.687]

17 For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. [sent-50, score-0.553]

18 The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths, that are unaccounted for by the planar surface model. [sent-51, score-0.766]

19 Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. [sent-52, score-0.817]

20 The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. [sent-53, score-0.851]

21 In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercise all of their degrees of freedom, such as when the object only translates, or when the camera only rotates. [sent-55, score-0.779]

22 Planar Model Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation: ⎣⎡xy1c ⎦⎤=? [sent-58, score-0.601]

23 Now, let W ∈ R2F×P be matrix of trajectory data that conNtaoiwns, tlehet x a n∈d y image coordinates of P feature points tracked through F frames, as in TocmputehW pa=r⎢m⎣ ⎡⎢ etyx e1F r. [sent-63, score-0.336]

24 ctorydat,(l2e)t C = [c1, c2 , c3] ∈ R2f×3 be three columns (three full trajectories) from W, and let = be the x and y coordinates of the i-th control trajectory at frame f. [sent-68, score-0.498]

25 Then the transformation between points from an arbitrary source frame s to a target frame f can be written as: cif [ci2f−1, ci2f]? [sent-69, score-0.286]

26 A2s→Df A2fD 222222556088 3D Model In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted using the epipolar residual displacements. [sent-82, score-0.527]

27 Overview of the Method The proposed motion segmentation algorithm has three stages. [sent-96, score-0.312]

28 First, a pool of M motion model hypotheses M = s{tMag1e , . [sent-97, score-0.305]

29 The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. [sent-101, score-0.251]

30 A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3). [sent-105, score-0.512]

31 otion models from M are ncom thebi sneecdo ntod explain ualbl sthetes trajectories mino tdheel sequence. [sent-107, score-0.357]

32 The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data Wˆ[2F×I] ⊆ W[2F×P] (§3. [sent-117, score-0.505]

33 In turn, at each RANSAC trial, the algorithm draw(§s3 uniformly d,i asttr eibaucthed R, A rNanSdoAmC subsets of three control trajectories (C[2F×3] ⊂ Wˆ[2F×I]). [sent-119, score-0.537]

34 Each set of control trajectories is used to estim⊂ate the family of 2D affine transformations {A1, . [sent-120, score-0.632]

35 The matrix B ∈ {0, 1}[F×I] indicates Mwhe =the {rB t,hσe ,iC-th, trajectory asthroixu Bld ∈b e predicted by model M at frame f (inlier, bif = 1) or not (outlier, bif = 0), σ = {σ1 , . [sent-124, score-0.625]

36 + Ψ(ω) + Γ(B) across (7) wˆfi a number of RANSAC trials, where = = are the coordinates of the i-th trajectory from the support subset at frame f. [sent-132, score-0.451]

37 , (10) which is a zero-mean 2D Normal distribution evaluated at the residuals The spherical covariance matrix is Σ = rif. [sent-148, score-0.311]

38 The residuals are determined by the differences between the predictions made by a hypothesized model Af, and the observations at each frame ? [sent-150, score-0.441]

39 The 2D affinities Af and residuals rf are reused, but to account for the effect of relative depth, an epipolar line segment ef is robustly fit to the residual data at each frame (please see supplementary material for details on the segment fitting algorithm). [sent-162, score-0.944]

40 The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. [sent-163, score-0.561]

41 2⎪⎬⎪⎫⎞⎟⎠, ⎠(12,) which is also a zero-mean 2D Norma⎩l distribution ⎭computed as the product of two identical, separable, singlevariate, normal distributions, evaluated at the distance from the residual to the epipolar line. [sent-167, score-0.314]

42 The seco⊥nd one corresponds to an estimate of the deviat⊥ion in the perpendicular direction (ef), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to rif ? [sent-170, score-0.268]

43 A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. [sent-181, score-0.33]

44 Local Coherence The subset of trajectories Wˆ given to RANSAC to generate a model M is constrained to a spatially local region. [sent-185, score-0.445]

45 The probability ofchoosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is: p = 1 − (1 − r3)k. [sent-186, score-0.359]

46 (13) A common assumption is that trajectories from the same underlying motion are locally coherent. [sent-188, score-0.57]

47 Hence, a compact region is likely to increase r, exponentially reducing 222222666200 Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). [sent-189, score-0.463]

48 The magnitude of the extrapolation error changes with the distance between the control points. [sent-192, score-0.374]

49 The trade-off that results from drawing model control points from a small region, however, is extrapolation error. [sent-194, score-0.283]

50 A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. [sent-195, score-0.823]

51 Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero mean Gaussian noise of small magnitude. [sent-197, score-0.283]

52 Identical noise is added to the same trajectory in all four scenarios. [sent-198, score-0.414]

53 The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors. [sent-199, score-0.41]

54 To promote uniform image coverage, the disk is centered at a randomly chosen trajectory (ox , oy) = (xbi, yib) with uniformly distributed i ∼ U(1, P) and base frame b) w∼i h U u(1n,i fFor)m. [sent-204, score-0.401]

55 Ihfo tsheenre f are mI a trajectories swtritihbiunt othne support region, then ∈ R2F×I. [sent-207, score-0.43]

56 It is worth noting that the construction of theW support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence will likely contain trajectories that originate from more than one independently moving object (Figure 3). [sent-208, score-0.675]

57 Left: A mixed set with some trajectories from the blue and green classes. [sent-210, score-0.357]

58 Right: Another mixed set with all of the trajectories in the red class and some from the blue class. [sent-211, score-0.357]

59 Characterizing the Residual Distribution At each RANSAC iteration, residuals rf are computed using the 2D affine model Af that results from the constraints provided by the control trajectories C. [sent-213, score-0.941]

60 4), and the second one is to estimate the magnitude of the noise of a 3D model (§4. [sent-224, score-0.317]

61 2D Inlier Detection Suppose the matrix Wˆ contains trajectories Wˆ1 ∈ R2F×I and Wˆ2 ∈ R2F×J from two independently moving objects, and ∈tha Rt these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance η ∼ N(0, (σf)2I): Wˆ = ? [sent-228, score-0.948]

62 (15) A1f Now, assume we know the true affine transformations and that describe the motion of trajectories for the subsets Wˆ1 and Wˆ2, respectively. [sent-231, score-0.759]

63 11) for trajectories aing nWiˆtud1 will be in the order of the magnitude of the underlying noise ? [sent-235, score-0.636]

64 But in this scenario, trajectories in Wˆ2 ewaicl h b ie ∈ predicted using tth ien wrong emnaodrioel,, resulting isn i nr esid? [sent-241, score-0.357]

65 222222666311 > σf, (16) then the model inliers can be determined by thresholding | with the magnitude of the noise, scaled by a constant |(τr =| w wλitσhσtf h):e |rif bif=? [sent-277, score-0.339]

66 The threshold i≤s trˆ h en defined as r τ = min{ rˆi | (ri+1 − ri) > λr˜ r}, (18) which corresponds to the smallest residual magnitude before a salient magnitude gap. [sent-282, score-0.516]

67 Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin), and the rest. [sent-286, score-0.654]

68 Magnitude of the Noise, 2D Model r2fD Let contain only the residuals of the inlier trajectories (those where = 1), and let USV? [sent-289, score-0.712]

69 oise corresponds to the largest singular value σ2 = s1, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. [sent-297, score-0.427]

70 When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. [sent-302, score-0.725]

71 ic Iafti oonn tohfe unexplained relative depth, tihn wnh thicehr case, fitting a olinne o tfo u spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. [sent-306, score-0.453]

72 A small spherical residual covariance strongly suggests a planar underlying geometry. [sent-307, score-0.341]

73 First, the projection of rif onto the line must lie within the segment limits (β ≤ ≤ γ). [sent-317, score-0.311]

74 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar (ef). [sent-322, score-0.277]

75 Magnitude of the Noise, 3D Model let rˆf3D Similarly to the 2D case, contain the residual data from the corresponding 3D inlier trajectories. [sent-325, score-0.283]

76 An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of r3fD (as in Eq. [sent-326, score-0.445]

77 In this case, the largest singular value s1 captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. [sent-328, score-0.875]

78 However, s2 captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making σ2 = s2 a reasonable estimate for its magnitude. [sent-329, score-0.242]

79 ⊂ M thhea tn mxta xstiempiz iess t prediction accuracy finora othne Mwhol⊂e trajectory mdaaxtaim Wize,s w phreiledi minimizing cmyod foerl complexity and modelling overlap. [sent-338, score-0.424]

80 Prediction Accuracy In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. [sent-364, score-0.558]

81 However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3. [sent-365, score-0.416]

82 Given the inlier labeling B in M, let WB be the subset of trajectories where bif = 1for at least half of the frames. [sent-370, score-0.65]

83 The orthonormal basis S of a ω = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of WB. [sent-371, score-0.359]

84 Using S as the model’s motion matrices, prediction accuracy can be computed using: E(w, M) = ? [sent-372, score-0.3]

85 Ownership variables Π ∈ {0, 1}[P×N] indicate whether trajectory p i ps explained by t {he0 ,n1-}th model (πp,n = 1) or not (πp,n = 0), and are determined by maximum prediction accuracy (i. [sent-384, score-0.494]

86 21 penalizes situations where multiple models explain a trajectory (w) with relatively small residuals. [sent-388, score-0.332]

87 Multiple labellings are incorporated oinsetos an affinity matrix F, where the fi,j entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labelli? [sent-403, score-0.637]

88 Hopkins 155 The Hopkins 155 (H155) dataset has been the standard evaluation metric for the problem of motion segmentation of trajectory data since 2007. [sent-420, score-0.612]

89 The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state of the art methods, with minimal accuracy loss. [sent-425, score-0.347]

90 The goal is to determine the effects of noise of different magnitudes towards the segmentation accuracy of our method, in comparison with the state of the art. [sent-430, score-0.404]

91 To generate such a dataset, ground-truth labels were used to compute a rank 3 reconstruction of (mean-subtracted) trajectories for each segment. [sent-432, score-0.357]

92 The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms. [sent-443, score-0.289]

93 Several sequences were tracked and the resulting trajectories classified by our method. [sent-445, score-0.357]

94 Figure 5 shows qualitatively good motion segmentation results for four sequences. [sent-446, score-0.312]

95 Conclusions We introduced a computationally efficient motion segmentation algorithm for trajectory data. [sent-449, score-0.612]

96 Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. [sent-450, score-0.594]

97 Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. [sent-451, score-0.266]

98 The method was also found to be useful for motion segmentation of real-world, raw trajectory data. [sent-458, score-0.612]

99 Shape and motion from image streams under orthography: a factorization method. [sent-514, score-0.25]

100 Adaptive motion segmentation algorithm based on the principal angles configuration. [sent-540, score-0.312]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('trajectories', 0.357), ('trajectory', 0.3), ('rif', 0.235), ('residuals', 0.223), ('motion', 0.213), ('magnitude', 0.165), ('epipolar', 0.163), ('residual', 0.151), ('affine', 0.142), ('ransac', 0.14), ('control', 0.133), ('ssc', 0.132), ('inlier', 0.132), ('instantiation', 0.117), ('noise', 0.114), ('bif', 0.111), ('ef', 0.11), ('inliers', 0.103), ('af', 0.102), ('planar', 0.102), ('segmentation', 0.099), ('cpa', 0.086), ('unaccounted', 0.086), ('trials', 0.084), ('predictions', 0.082), ('orthography', 0.081), ('rigid', 0.079), ('outliers', 0.078), ('extrapolation', 0.076), ('singular', 0.075), ('magnitudes', 0.074), ('degenerate', 0.07), ('parallax', 0.067), ('frame', 0.065), ('mj', 0.062), ('instantiated', 0.061), ('zw', 0.059), ('orders', 0.058), ('prediction', 0.057), ('prix', 0.057), ('usv', 0.057), ('oy', 0.057), ('pool', 0.054), ('motions', 0.053), ('covariance', 0.053), ('nondegenerate', 0.051), ('cif', 0.051), ('multibody', 0.051), ('italian', 0.051), ('art', 0.051), ('subset', 0.05), ('rf', 0.048), ('ownership', 0.047), ('degeneracies', 0.047), ('xbi', 0.047), ('subsets', 0.047), ('ox', 0.046), ('mechanisms', 0.045), ('incorporates', 0.045), ('reused', 0.044), ('tha', 0.044), ('determine', 0.044), ('displacements', 0.043), ('state', 0.043), ('loglikelihood', 0.042), ('relative', 0.041), ('winning', 0.04), ('grand', 0.04), ('line', 0.04), ('formula', 0.039), ('model', 0.038), ('hopkins', 0.038), ('subspace', 0.038), ('plane', 0.037), ('factorization', 0.037), ('originate', 0.037), ('eds', 0.037), ('labellings', 0.037), ('othne', 0.037), ('competitive', 0.036), ('points', 0.036), ('segment', 0.036), ('support', 0.036), ('explained', 0.036), ('disk', 0.036), ('spherical', 0.035), ('transformation', 0.035), ('regularizes', 0.035), ('smallest', 0.035), ('plus', 0.034), ('arbitrary', 0.034), ('determined', 0.033), ('df', 0.032), ('penalizes', 0.032), ('moving', 0.032), ('depth', 0.032), ('affinities', 0.031), ('mm', 0.031), ('accuracy', 0.03), ('ofa', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000014 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Author: Fernando Flores-Mangas, Allan D. Jepson

Abstract: The problem of rigid motion segmentation of trajectory data under orthography has been long solved for nondegenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state of the art segmentation accuracies, at the expense of a large computational cost. This paperproposes a method that dramatically reduces this cost (by two or three orders of magnitude) with minimal accuracy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standardHopkins 155 dataset). Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation. 1. Rigid Motion Segmentation Rigid motion segmentation (MS) consists on separating regions, features, or trajectories from a video sequence into spatio-temporally coherent subsets that correspond to independent, rigidly-moving objects in the scene (Figure 1.b or 1.f). The problem currently receives renewed attention, partly because of the extensive amount of video sources and applications that benefit from MS to perform higher level computer vision tasks, but also because the state of the art is reaching functional maturity. Motion Segmentation methods are widely diverse, but most capture only a small subset of constraints or algebraic properties from those that govern the image formation process of moving objects and their corresponding trajectories, such as the rank limit theorem [9, 10], the linear independence constraint (between trajectories from independent motions) [2, 13], the epipolar constraint [7], and the reduced rank property [11, 15, 13]. Model-selection based (a)Orignalvideoframes(b)Clas -label dtrajectories (c)Modelsup ort egion(d)Modelinlersandcontrolpoints (e)Modelresiduals (f) Segmenta ion result Figure 1: Model instantiation and segmentation. a) fth original frame, Italian Grand Prix ?c 2012 Formula 1. b) Classilanbaell feradm, trajectory Gdaratan Wd P r(rixed ?,c green, bolrumeu alnad 1 .bbl a)c Ck correspond to chassis, helmet, background and outlier classes respectively). c) Spatially-local support subset for a candidate motion in blue. d) Candidate motion model inliers in red, control points from Eq. 3) in white. e) Residuals from Eq. 11) color-coded with label data, the radial coordinate is logarithmic. f) Segmentation result. Wˆf (rif (cif methods [11, 6, 8] balance model complexity with modeling accuracy and have been successful at incorporating more of these aspects into a single formulation. For instance, in [8] most model parameters are estimated automatically from the data, including the number of independent motions and their complexity, as well as the segmentation labels (including outliers). However, because of the large number of necessary motion hypotheses that need to be instantiated, as well as the varying and potentially very large number of 222222555977 model parameters that must be estimated, the flexibility offered by this method comes at a large computational cost. Current state of the art methods follow the trend of using sparse low-dimensional subspaces to represent trajectory data. 
This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3], in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then spectrally clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic in the number of trajectories.

The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles, and the re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but its computational cost is nearly 10 times that of SSC.

In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 and 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art.

1.1. Affine Motion

Projective geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths that are unaccounted for by the planar surface model.

Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercises all of its degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real-world video sequences. The incremental nature of the decompositions described above facilitates the transition between degenerate and nondegenerate motions.

Planar Model. Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation:

  \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} = A^{2D}_{w \to c} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}, \quad (1)

where D \in R^{2\times 2} is an invertible matrix and t \in R^2 is a translation vector. Trajectory coordinates (x_w, y_w) are in the plane's reference frame (modulo a 2D affine transformation) and (x_c, y_c) are image coordinates.

Now, let W \in R^{2F\times P} be the matrix of trajectory data that contains the x and y image coordinates of P feature points tracked through F frames:

  W = \begin{bmatrix} x_{1,1} & \cdots & x_{1,P} \\ y_{1,1} & \cdots & y_{1,P} \\ \vdots & & \vdots \\ x_{F,1} & \cdots & x_{F,P} \\ y_{F,1} & \cdots & y_{F,P} \end{bmatrix}. \quad (2)

To compute the parameters of A^{2D} from trajectory data, let C = [c^1, c^2, c^3] \in R^{2F\times 3} be three columns (three full trajectories) from W, and let c^i_f = [c^i_{2f-1}, c^i_{2f}]^\top be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as:

  \begin{bmatrix} c^1_f & c^2_f & c^3_f \\ 1 & 1 & 1 \end{bmatrix} = A^{2D}_{s \to f} \begin{bmatrix} c^1_s & c^2_s & c^3_s \\ 1 & 1 & 1 \end{bmatrix}, \quad (3)

and A^{2D}_{s \to f} can be simply computed as:

  A^{2D}_{s \to f} = \begin{bmatrix} c^1_f & c^2_f & c^3_f \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c^1_s & c^2_s & c^3_s \\ 1 & 1 & 1 \end{bmatrix}^{-1}. \quad (4)

The inverse in the right-hand side matrix of Eq. 4 exists so long as the points c^i_s are not collinear. For simplicity we refer to A^{2D}_{s \to f} as A^{2D}_f, and consequently A^{2D}_s is the identity matrix.
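In code, Eq. 4 reduces to a single homogeneous-coordinate solve. The following minimal numpy sketch (function and array names are ours, not the authors') computes A^{2D}_{s \to f} from the three control points at frames s and f:

    import numpy as np

    def affine_from_controls(c_s, c_f):
        """Eq. 4: 3x3 affine transform mapping the three control points
        c_s (2x3, frame s) onto c_f (2x3, frame f)."""
        ones = np.ones((1, 3))
        Cs = np.vstack([c_s, ones])  # homogeneous coordinates
        Cf = np.vstack([c_f, ones])
        # Eq. 3 reads Cf = A @ Cs; the inverse exists as long as the
        # control points are not collinear
        return Cf @ np.linalg.inv(Cs)

    # sanity check: the transform from a frame to itself is the identity
    c = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    assert np.allclose(affine_from_controls(c, c), np.eye(3))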
3D Model. In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted for using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point:

  \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} + \delta z_w \begin{bmatrix} a_{13} \\ a_{23} \\ 0 \end{bmatrix}. \quad (5)

The depth \delta z_w is relative to the arbitrary plane whose motion is modeled by A^{2D}; a point that lies on such a plane would have \delta z_w = 0. We call this orthographic version of the plane-plus-parallax decomposition the 2D Affine Plus Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to

  \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & t_1 \\ a_{21} & a_{22} & a_{23} & t_2 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ \delta z_w \\ 1 \end{bmatrix}, \quad (6)

where it is clear that the parameters of A^{3D} define an orthographically projected 3D affine transformation.

Determining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work.
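As a quick illustration of Eq. 5, here is a hedged sketch of the 2DAPE prediction (all names are ours; A2d is the 3x3 planar transform of Eq. 1, and (a13, a23) is the epipolar direction as reconstructed above):

    import numpy as np

    def project_2dape(A2d, a13, a23, xw, yw, dzw):
        """Eq. 5: orthographic plane-plus-parallax prediction. The planar
        part moves the point with A2d; the epipolar part displaces it
        along (a13, a23), scaled by the relative depth dzw."""
        p = A2d @ np.array([xw, yw, 1.0])
        return p[:2] + dzw * np.array([a13, a23])

A point on the reference plane (dzw = 0) is predicted by the planar model alone, which is what makes the transition between degenerate and nondegenerate models incremental.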
2. Overview of the Method

The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = {M_1, ..., M_M} is generated using a method that combines a RANdom SAmpling and Consensus (RANSAC) [4] technique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than expected necessary (M >> N) in an attempt to increase the likelihood of generating correct model proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3).

In the second stage, subsets of motion models from M are combined to explain all the trajectories in the sequence. The problem is framed as an objective function that must be minimized. The objective function is the negative log-likelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could already be reported as a segmentation result (§5).

The third stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6).

3. Motion Model Instantiation

Each model M \in M is instantiated independently using RANSAC. This choice is motivated by this method's well-known computational efficiency and robustness to outliers, but also by its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings.

The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data \hat{W}_{[2F\times I]} \subseteq W_{[2F\times P]} (§3.1). In turn, at each RANSAC trial, the algorithm draws uniformly distributed, random subsets of three control trajectories (C_{[2F\times 3]} \subset \hat{W}_{[2F\times I]}). Each set of control trajectories is used to estimate the family of 2D affine transformations {A_1, ..., A_F} between the base frame and all other frames in the sequence, which are then used to determine a complete set of model parameters M = {B, \sigma, C, \omega}. The matrix B \in {0, 1}^{[F\times I]} indicates whether the i-th trajectory should be predicted by model M at frame f (inlier, b^i_f = 1) or not (outlier, b^i_f = 0), \sigma = {\sigma_1, ..., \sigma_F} are estimates of the magnitude of the noise for each frame, and \omega \in {2D, 3D} is the estimated model type.

The goal is to find the control points and the associated parameters that minimize the objective function

  O(\hat{W}, M) = \sum_{f \in F} \sum_{i \in I} b^i_f L_\omega(\hat{w}^i_f \mid A_f, \sigma_f) + \Psi(\omega) + \Gamma(B) \quad (7)

across a number of RANSAC trials, where \hat{w}^i_f = (x^i_f, y^i_f) = (\hat{w}^i_{2f-1}, \hat{w}^i_{2f}) are the coordinates of the i-th trajectory from the support subset \hat{W} at frame f. The negative log-likelihood term L_\omega(\cdot) penalizes reconstruction error, while \Psi(\cdot) and \Gamma(\cdot) are regularizers; the three terms are defined below.

Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, \Psi(\omega) regularizes over model complexity using:

  \Psi(\omega) = \begin{cases} 6(F-1), & \text{if } \omega = 2D \\ 8(F-1), & \text{if } \omega = 3D \end{cases} \quad (8)

\Gamma(B) strongly penalizes models that describe too few trajectories:

  \Gamma(B) = \begin{cases} \infty, & \text{if } \sum_{i \in I} \sum_{f \in F} b^i_f < F\lambda \\ 0, & \text{otherwise} \end{cases} \quad (9)

The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M.

2D likelihoods. For the planar case (\omega = 2D) the negative log-likelihood term is evaluated with:

  L_{2D}(\hat{w}^i_f \mid A_f, \sigma_f) = -\log\left( \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (r^i_f)^\top \Sigma^{-1} r^i_f \right) \right), \quad (10)

which is a zero-mean 2D Normal distribution evaluated at the residuals r^i_f. The spherical covariance matrix is \Sigma = (\sigma_f)^2 I.
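The terms of Eq. 7 are simple to evaluate. A minimal sketch of Eqs. 8-10 (our own naming; lam stands for the \lambda of Eq. 9, whose value the paper leaves as a parameter):

    import numpy as np

    def psi(omega, F):
        # Eq. 8: 6 (2D) or 8 (3D) degrees of freedom per frame pair
        return (6 if omega == '2D' else 8) * (F - 1)

    def gamma(B, lam):
        # Eq. 9: infinite penalty for models describing too few trajectories
        F = B.shape[0]
        return 0.0 if B.sum() >= F * lam else np.inf

    def nll_2d(r, sigma):
        """Eq. 10 for residuals r (Nx2) under a zero-mean 2D Normal with
        spherical covariance sigma^2 I (so |Sigma|^(1/2) = sigma^2)."""
        quad = (r ** 2).sum(axis=1) / sigma ** 2  # r^T Sigma^{-1} r
        return np.log(2 * np.pi * sigma ** 2) + 0.5 * quad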
The residuals r^i_f are determined by the differences between the predictions made by a hypothesized model A_f and the observations at each frame:

  r^i_f = \begin{bmatrix} \tilde{w}^i_f \\ 1 \end{bmatrix} - A_f \begin{bmatrix} \tilde{w}^i_s \\ 1 \end{bmatrix}. \quad (11)

3D likelihoods. The negative log-likelihood term for the 3D case is based on the 2DAPE decomposition. The 2D affinities A_f and residuals r_f are reused, but to account for the effect of relative depth, an epipolar line segment e_f is robustly fit to the residual data at each frame (please see the supplementary material for details on the segment fitting algorithm). The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector e_f^\perp indicates the orthogonal direction to e_f, then the negative log-likelihood term for the 3D case is estimated with:

  L_{3D}(\hat{w}^i_f \mid A_f, \sigma_f) = -2 \log\left( \frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\left( -\frac{(r^i_f \cdot e_f^\perp)^2}{2\sigma_f^2} \right) \right), \quad (12)

which is also a zero-mean 2D Normal distribution, computed as the product of two identical, separable, single-variate normal distributions evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of e_f^\perp, which is analytically computed using r^i_f \cdot e_f^\perp. The second one corresponds to an estimate of the deviation in the perpendicular direction (e_f), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to r^i_f \cdot e_f^\perp, which is a plausible estimate under the isotropic noise assumption.

Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as is typical for RANSAC. Instead, we found that better motion models resulted from optimizing over the accuracy of the model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly discounted. Figure 1.b shows an example of class-labeled trajectory data, and 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model's control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine \hat{W}, as well as B, \sigma, and \omega follow.

  Algorithm 1: Motion model instantiation
  Input: trajectory data W[2F×P], number of RANSAC trials K, arbitrary base frame b
  Output: parameters of the motion model M = {B, σ, C, ω}
  // determine the training set
  c ← rand(1, P); r ← rand(r_min, r_max)          // random center and radius
  Ŵ[2F×I] ← trajectoriesWithinDisk(W, r, c)       // support subset
  X ← homoCoords(Ŵ_b)                             // points at base frame
  for K RANSAC trials do
      C ← three random control trajectories from Ŵ
      {A_1, ..., A_F} ← affine transforms from C (Eq. 4)
      r_f ← residuals for all trajectories in Ŵ (Eq. 11)
      B, σ, ω ← inliers, noise estimates and model type (§4)
      keep {B, σ, C, ω} if it minimizes Eq. 7 so far
  return M = {B, σ, C, ω}

3.1. Local Coherence

The subset of trajectories \hat{W} given to RANSAC to generate a model M is constrained to a spatially local region. The probability of choosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is p = 1 - (1 - r^3)^k. This means that the number of trials needed to find a subset of 3 inliers with probability p is

  k = \frac{\log(1 - p)}{\log(1 - r^3)}. \quad (13)

A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing k, and with it, RANSAC's computation time by a proportional amount.
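Eq. 13 makes the payoff of local sampling concrete; a small illustrative computation (the values here are ours):

    import math

    def ransac_trials(p, r):
        # Eq. 13: trials needed to draw 3 uncontaminated control
        # trajectories with probability p, given inlier ratio r
        return math.log(1 - p) / math.log(1 - r ** 3)

    print(round(ransac_trials(0.99, 0.3)))  # ~168 trials
    print(round(ransac_trials(0.99, 0.6)))  # ~19 trials

Doubling the inlier ratio of the support region cuts the required number of trials by roughly an order of magnitude.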
The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when it is used to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero-mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors.

Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. The magnitude of the extrapolation error changes with the distance between the control points.

Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by

  \hat{W} = \{ w^i \mid (x^i_b - o_x)^2 + (y^i_b - o_y)^2 < r^2 \}, \quad (14)

where (o_x, o_y) are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory, (o_x, o_y) = (x^i_b, y^i_b), with uniformly distributed i \sim U(1, P) and base frame b \sim U(1, F). To allow for different region sizes, the radius r is chosen from a uniform distribution r \sim U(r_{min}, r_{max}). If there are I trajectories within the support region, then \hat{W} \in R^{2F\times I}.

It is worth noting that the construction of the support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence it will likely contain trajectories that originate from more than one independently moving object (Figure 3).

Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class.
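A minimal sketch of the support-region sampling of Eq. 14 (names are ours; the rows of W alternate x and y coordinates per frame, as in Eq. 2):

    import numpy as np

    def sample_support(W, r_min, r_max, rng=None):
        """Eq. 14: disk-shaped support subset of W (2F x P)."""
        rng = rng or np.random.default_rng()
        F, P = W.shape[0] // 2, W.shape[1]
        i = rng.integers(P)            # disk centered on a random trajectory
        b = rng.integers(F)            # ...at a random base frame b
        r = rng.uniform(r_min, r_max)  # ...with a random radius
        ox, oy = W[2 * b, i], W[2 * b + 1, i]
        d2 = (W[2 * b] - ox) ** 2 + (W[2 * b + 1] - oy) ** 2
        return W[:, d2 < r ** 2]       # W-hat in R^{2F x I}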
4. Characterizing the Residual Distribution

At each RANSAC iteration, residuals r_f are computed using the 2D affine model A_f that results from the constraints provided by the control trajectories C. Characterizing the distribution of r_f has three initial goals. The first one is to determine the 2D model inliers b^{2D}_f (§4.1), the second one is to compute estimates of the magnitude of the noise at every frame \sigma^{2D}_f (§4.2), and the third one is to determine whether the residual distribution originates from a planar or a 3D object (§4.3). If the object is suspected to be 3D, then two more goals need to be achieved: the first one is to determine the 3D model inliers b^{3D}_f (§4.4), and the second one is to estimate the magnitude of the noise of a 3D model (\sigma^{3D}_f) to reflect the use of a 3D model (§4.5).

4.1. 2D Inlier Detection

Suppose the matrix \hat{W} contains trajectories \hat{W}_1 \in R^{2F\times I} and \hat{W}_2 \in R^{2F\times J} from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance, \eta \sim N(0, (\sigma_f)^2 I):

  \hat{W} = [\,\hat{W}_1 \mid \hat{W}_2\,] + \eta. \quad (15)

Now, assume we know the true affine transformations A^1_f and A^2_f that describe the motion of trajectories for the subsets \hat{W}_1 and \hat{W}_2, respectively. If A^1_f is used to compute predictions for all of \hat{W} (at frame f), the expected value (denoted by \langle\cdot\rangle) of the magnitude of the residuals (r_f from Eq. 11) for trajectories in \hat{W}_1 will be in the order of the magnitude of the underlying noise, \langle |r^i_f| \rangle = \sigma_f for each i \in \{1, ..., I\}. But in this scenario, trajectories in \hat{W}_2 will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential, \langle |r^i_f| \rangle = |(A^1_f - A^2_f)\, \hat{w}^i_b|. If we can assume that the motion differential is bigger than the displacement due to noise:

  |(A^1_f - A^2_f)\, w^i_b| > \sigma_f, \quad (16)

then the model inliers can be determined by thresholding |r^i_f| with the magnitude of the noise, scaled by a constant (\tau = \lambda_\sigma \sigma_f):

  b^i_f = \begin{cases} 1, & \text{if } |r^i_f| \le \tau \\ 0, & \text{otherwise} \end{cases} \quad (17)

But because \sigma_f is generally unknown, the threshold \tau is estimated from the residual data. To do so, let \hat{r} be the vector of sorted residual magnitudes, so that \hat{r}_i \le \hat{r}_{i+1}. Now, let \tilde{r} = \mathrm{median}_i(\hat{r}_{i+1} - \hat{r}_i). The threshold is then defined as

  \tau = \min\{ \hat{r}_i \mid (\hat{r}_{i+1} - \hat{r}_i) > \lambda_r \tilde{r} \}, \quad (18)

which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals. Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin), and the rest.
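The gap test of Eqs. 17-18 is a few lines of numpy. In this sketch (our naming), lam_r is the \lambda_r of Eq. 18; its default value and the no-gap fallback are our own placeholders, not the paper's:

    import numpy as np

    def gap_threshold(mags, lam_r):
        """Eq. 18: smallest residual magnitude before a salient gap."""
        r = np.sort(mags)
        if r.size < 2:
            return r[-1]
        gaps = np.diff(r)
        salient = np.nonzero(gaps > lam_r * np.median(gaps))[0]
        # if no salient gap exists, accept everything (our fallback)
        return r[salient[0]] if salient.size else r[-1]

    def inliers_2d(residuals, lam_r=5.0):
        """Eq. 17 with the data-driven threshold of Eq. 18.
        residuals: Nx2 array at one frame; returns a boolean mask."""
        mags = np.linalg.norm(residuals, axis=1)
        return mags <= gap_threshold(mags, lam_r)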
4.2. Magnitude of the Noise, 2D Model

Let \hat{r}^{2D}_f contain only the residuals of the inlier trajectories (those where b^i_f = 1), and let USV^\top be the singular value decomposition of the covariance matrix of \hat{r}^{2D}_f:

  USV^\top = \mathrm{svd}\left( \frac{1}{\sum_i b^i_f} (\hat{r}^{2D}_f)^\top \hat{r}^{2D}_f \right). \quad (19)

Then the magnitude of the noise corresponds to the largest singular value, \sigma_{2D} = s_1, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of largest over smallest singular values (s_1/s_2) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio s_1/s_2 >> 1. If on the other hand s_1/s_2 \approx 1, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.

4.4. 3D Inlier Detection

When the residual distribution is elongated (s_1/s_2 >> 1), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. Further details about this algorithm can be found in the supplementary material.

The resulting line segment is used to determine the 3D model inliers. Trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of r^i_f onto the line must lie within the segment limits (\beta \le r^i_f \cdot e_f \le \gamma). Second, the normalized distance to the line must be below a threshold (e_f^\perp \cdot r^i_f \le \sigma_2 \lambda_d). Notice that the threshold depends on the smallest singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line (e_f).

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let \hat{r}^{3D}_f contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of \hat{r}^{3D}_f (as in Eq. 19). In this case, the largest singular value s_1 captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, s_2 captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making \sigma_{3D} = s_2 a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (Eq. 7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the best model from across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming the pool M = {M_1, ..., M_M}.

5. Optimal Model Subset

The next step is to find the model combination M^* \subset M that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modeling overlap. For this purpose, let M_j = {M_{j,1}, ..., M_{j,N}} be the j-th model combination, and let {M_j} be the set of all M!/(N!(M-N)!) combinations of N-sized models that can be drawn from M. The model selection problem is then formulated as

  M^* = \arg\min_{M_j} O_S(M_j), \quad (20)

where the objective is

  O_S(M_j) = \sum_{n=1}^{N} \sum_{p=1}^{P} \left[ \pi_{p,n} E(w_p, M_{j,n}) + \lambda_\Phi \Phi(w_p, M_{j,n}) \right] + \lambda_\Psi \sum_{n=1}^{N} \Psi(M_{j,n}). \quad (21)

The first term accounts for prediction accuracy, and the other two are regularization terms. Details follow.

Prediction Accuracy. In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3.1). Instead, model parameters are computed with a factorization-based method [10]. Given the inlier labeling B in M, let W_B be the subset of trajectories where b^i_f = 1 for at least half of the frames. The orthonormal basis S of an \omega = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of W_B. Using S as the model's motion matrices, prediction accuracy can be computed using:

  E(w, M) = \| S S^\top w - w \|^2, \quad (22)

which is the sum of squared Euclidean deviations from the predictions (S S^\top w) to the observed data (w).
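A hedged sketch of the factorization-based error of Eq. 22 (our naming; W_B stacks the inlier trajectories as columns, and must contain at least 2 or 3 of them):

    import numpy as np

    def prediction_error(w, W_B, omega='2D'):
        """Eq. 22: squared deviation between a trajectory w (2F-vector)
        and its projection onto the model's motion basis S, taken as the
        first 2 (2D) or 3 (3D) left singular vectors of W_B."""
        k = 2 if omega == '2D' else 3
        U, _, _ = np.linalg.svd(W_B, full_matrices=False)
        S = U[:, :k]
        e = S @ (S.T @ w) - w
        return float(e @ e)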
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise.

Ownership variables \Pi \in \{0, 1\}^{[P\times N]} indicate whether trajectory p is explained by the n-th model (\pi_{p,n} = 1) or not (\pi_{p,n} = 0), and are determined by maximum prediction accuracy (i.e. minimum Euclidean deviation):

  \pi_{p,n} = \begin{cases} 1, & \text{if } M_{j,n} = \arg\min_{M \in M_j} E(w_p, M) \\ 0, & \text{otherwise} \end{cases} \quad (23)

Regularization terms. The second term from Eq. 21 penalizes situations where multiple models explain a trajectory (w) with relatively small residuals. For brevity, let \hat{E}(w, M) = \exp\{-E(w, M)\}; then:

  \Phi(w, M_j) = -\log \frac{\max_{M \in M_j} \hat{E}(w, M)}{\sum_{M \in M_j} \hat{E}(w, M)}. \quad (24)

The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants \lambda_\Phi and \lambda_\Psi modulate the effect of the corresponding regularizer.

Table 1: Accuracy and run-time for the H155 dataset (average accuracy [%] and total computation time [s] for SSC [3], CPA [14], RANSAC and Ours). Naive RANSAC included as a baseline with overall accuracy and total computation time estimated using data from [12].

6. Refinement

The optimal model subset M^* yields ownership variables \Pi^*, which can already be interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings \Pi_t from the top T subsets {M^*_t | 1 \le t \le T} closest to optimal. Multiple labellings are incorporated into an affinity matrix F, where the f_{i,j} entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labellings, weighted by the relative objective function \tilde{O}_t = \exp\{ -O_S(W \mid M^*_t) / O_S(W \mid M^*) \} for such a labelling:

  f_{i,j} = \frac{1}{\sum_{t=1}^{T} \tilde{O}_t} \sum_{t=1}^{T} \left( \pi^t_{i,:} \cdot \pi^t_{j,:} \right) \tilde{O}_t \quad (25)

Note that the inner product between the label vectors (\pi_{i,:} \cdot \pi^\top_{j,:}) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method's final segmentation result.
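Because the label vectors \pi_{i,:} are one-hot, the inner product in Eq. 25 is just a label-equality test, which makes F cheap to build. A minimal sketch (names are ours; objectives holds the precomputed \tilde{O}_t weights, and each labelling is an integer numpy array of length P):

    import numpy as np

    def affinity_matrix(labellings, objectives):
        """Eq. 25: affinity F from T labellings, weighted by their
        relative objective values O~_t."""
        O = np.asarray(objectives, dtype=float)
        P = len(labellings[0])
        F = np.zeros((P, P))
        for t, labels in enumerate(labellings):
            same = labels[:, None] == labels[None, :]  # pi_i . pi_j
            F += O[t] * same
        return F / O.sum()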
7. Experiments

Evaluation was made through three experimental setups.

Hopkins 155. The Hopkins 155 (H155) dataset has been the standard evaluation benchmark for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state-of-the-art methods, with minimal accuracy loss. Computation times are measured on the same (or very similar) hardware architectures. Like CPA, our implementation uses a single set of parameters for all the experiments, but as others have pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper.

Artificial Noise. The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art. We noted that H155 contains structured long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of (mean-subtracted) trajectories for each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes \sigma_n \in \{0.01, 0.25, 0.5, 1, 2, 4, 8\}. Our method, as well as SSC and CPA, were run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA is generated with only one run for each dataset (running time: 11.95 days). The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise.

KLT Tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

Figure 5: Motion segmentation results using KLT tracks from four Formula 1 sequences. Italian Grand Prix ©2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

References
[1] http://www.ces.clemson.edu/~stb/klt
[2] J. P. Costeira and T. Kanade. A Multibody Factorization Method for Independently Moving Objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: Model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159-177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Soc. of Lond. Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. Proc. CVPR, 2003.

2 0.32190478 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

Author: Guillem Palou, Philippe Salembier

Abstract: As an early stage of video processing, we introduce an iterative trajectory merging algorithm that produces a region-based and hierarchical representation of the video sequence, called the Trajectory Binary Partition Tree (BPT). From this representation, many analysis and graph cut techniques can be used to extract partitions or objects that are useful in the context of specific applications. In order to define trajectories and to create a precise merging algorithm, color and motion cues have to be used. Both types of information are very useful to characterize objects but present strong differences of behavior in the spatial and the temporal dimensions. On the one hand, scenes and objects are rich in their spatial color distributions, but these distributions are rather stable over time. Object motion, on the other hand, presents simple structures and low spatial variability but may change from frame to frame. The proposed algorithm takes into account this key difference and relies on different models and associated metrics to deal with color and motion information. We show that the proposed algorithm outperforms existing hierarchical video segmentation algorithms and provides more stable and precise regions.

3 0.23002973 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking

Author: Anton Milan, Konrad Schindler, Stefan Roth

Abstract: When tracking multiple targets in crowded scenarios, modeling mutual exclusion between distinct targets becomes important at two levels: (1) in data association, each target observation should support at most one trajectory and each trajectory should be assigned at most one observation per frame; (2) in trajectory estimation, two trajectories should remain spatially separated at all times to avoid collisions. Yet, existing trackers often sidestep these important constraints. We address this using a mixed discrete-continuous conditional random field (CRF) that explicitly models both types of constraints: exclusion between conflicting observations with supermodular pairwise terms, and exclusion between trajectories by generalizing global label costs to suppress the co-occurrence of incompatible labels (trajectories). We develop an expansion move-based MAP estimation scheme that handles both non-submodular constraints and pairwise global label costs. Furthermore, we perform a statistical analysis of ground-truth trajectories to derive appropriate CRF potentials for modeling data fidelity, target dynamics, and inter-target occlusion.

4 0.21845682 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy

Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities: divergence, curl and shear features. It captures additional information on the local motion patterns, enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperforming all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports.

1. Introduction and related work

Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in unconstrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatio-temporal interest points [12, 31]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large sets of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag-of-words technique [24], which quantizes the local features using a k-means codebook.

Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition.

Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems of classifying simple human actions in controlled environments and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an incomplete handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as it is often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and realistic situations.
Although some attempts have been made to compensate for camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description, remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to suppress constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32], where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation.

As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation, such as the technique proposed in [20], which considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown away, as it contains complementary information that is worth using to recognize certain action categories.

Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1], which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors.

As a last contribution, we bring an encoding technique known as VLAD (vector of locally aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered.

The organization of the paper is as follows.
Section 2 introduces the motion properties that we will consider throughout this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and benchmarked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper.

2. Motion Separation and Kinematic Features

In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will amount to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: the 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear, for a more comprehensive description of the visual motion.

2.1. Affine motion for compensating camera motion

Among polynomial motion models, we consider the 2D affine motion model. Simpler motion models, such as the 4-parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones, such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency, which is of primary importance when processing a huge video database. It does have limitations since, strictly speaking, it implies a single-plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderate with respect to the distance to the camera.

The affine flow vector at point p = (x, y) and at time t is defined as

  w_{aff}(p_t) = \begin{pmatrix} c_1(t) \\ c_2(t) \end{pmatrix} + \begin{pmatrix} a_1(t) & a_2(t) \\ a_3(t) & a_4(t) \end{pmatrix} \begin{pmatrix} x_t \\ y_t \end{pmatrix}, \quad (1)

where u_{aff}(p_t) = c_1(t) + a_1(t) x_t + a_2(t) y_t and v_{aff}(p_t) = c_2(t) + a_3(t) x_t + a_4(t) y_t are the horizontal and vertical components of w_{aff}(p_t) respectively. Let us denote the optical flow vector at point p at time t as w(p_t) = (u(p_t), v(p_t)). We introduce the flow vector \omega(p_t) obtained by removing the affine flow vector from the optical flow vector:

  \omega(p_t) = w(p_t) - w_{aff}(p_t). \quad (2)

The dominant motion (estimated as w_{aff}(p_t)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example, in case of a close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case; however, the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the "compensated" flow as ω-flow. Figure 1 displays the computed optical flow and the ω-flow. We compute the affine flow with the publicly available Motion2D software¹ [20], which implements a real-time robust multiresolution incremental estimation framework.
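A hedged sketch of the compensation of Eq. 2, given the six affine parameters of Eq. 1 (in the paper these are estimated by Motion2D; here they are simply function arguments, and all names are ours):

    import numpy as np

    def compensate_flow(flow, params):
        """Eqs. 1-2: subtract the dominant affine motion from a dense
        flow field. flow: HxWx2 array; params: (c1, c2, a1, a2, a3, a4)."""
        c1, c2, a1, a2, a3, a4 = params
        H, W = flow.shape[:2]
        ys, xs = np.mgrid[0:H, 0:W]      # pixel coordinate grids
        u_aff = c1 + a1 * xs + a2 * ys   # Eq. 1, horizontal component
        v_aff = c2 + a3 * xs + a4 * ys   # Eq. 1, vertical component
        omega = np.array(flow, dtype=float)
        omega[..., 0] -= u_aff           # Eq. 2: the omega-flow
        omega[..., 1] -= v_aff
        return omega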
The affine motion model has correctly accounted for the motion induced by the camera movement, which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of the dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion.

2.2. Local kinematic features

By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow, so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as

  \begin{cases} \operatorname{div}(p_t) = \partial u(p_t)/\partial x + \partial v(p_t)/\partial y \\ \operatorname{curl}(p_t) = -\partial u(p_t)/\partial y + \partial v(p_t)/\partial x \\ \operatorname{hyp}_1(p_t) = \partial u(p_t)/\partial x - \partial v(p_t)/\partial y \\ \operatorname{hyp}_2(p_t) = \partial u(p_t)/\partial y + \partial v(p_t)/\partial x \end{cases} \quad (3)

The divergence is related to axial motion, expansion and scaling effects, and the curl to rotation in the image plane. The hyperbolic terms express the shear of the visual flow, corresponding to more complex configurations. We take into account the shear quantity only:

  \operatorname{shear}(p_t) = \sqrt{ \operatorname{hyp}_1^2(p_t) + \operatorname{hyp}_2^2(p_t) }. \quad (4)

¹ http://www.irisa.fr/vista/Motion2D/

In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical flow or the compensated flow, ω-flow.
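A minimal numpy sketch of Eqs. 3-4 (our naming; the hyperbolic-term signs follow the standard first-order decomposition used in the reconstruction above, and np.gradient returns the row derivative, i.e. d/dy, before d/dx):

    import numpy as np

    def kinematic_features(u, v):
        """Eqs. 3-4: divergence, curl and shear maps of a flow field
        with horizontal component u and vertical component v (HxW)."""
        du_dy, du_dx = np.gradient(u)
        dv_dy, dv_dx = np.gradient(v)
        div = du_dx + dv_dy
        curl = -du_dy + dv_dx
        hyp1 = du_dx - dv_dy             # hyperbolic terms
        hyp2 = du_dy + dv_dx
        shear = np.hypot(hyp1, hyp2)     # Eq. 4
        return div, curl, shear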
3. Datasets and evaluation

This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-features model and the classification scheme used to encode the descriptors which will be introduced in Section 4.

Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into a train set and a test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation.

HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition.

Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided² train/test split; there are 17 to 56 training samples and 4 to 11 test samples per class. Mean AP is used for the evaluation, which is the standard choice.

Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k = 4,000 elements. Note that Section 6 will consider another encoding technique for descriptors. For the classification, we use a non-linear SVM with a χ²-kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]:

  K(x_i, x_j) = \exp\left( -\sum_c \frac{1}{\gamma_c} D(x_i^c, x_j^c) \right), \quad (5)

where D(x_i^c, x_j^c) is the χ² distance between videos x_i^c and x_j^c with respect to the c-th channel, corresponding to the c-th descriptor. The quantity γ_c is the mean value of the χ² distances between the training samples for the c-th channel. The multi-class classification problem that we consider is addressed by applying a one-against-rest approach.

² http://vision.stanford.edu/Datasets/OlympicSports/

4. Compensated descriptors

This section describes how the compensation of the dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term "compensated descriptors". First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description parts.

4.1. Dense trajectories and local descriptors

Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories, the performance improves considerably compared to when they are computed over spatio-temporal features [30].

Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with a step size of 5 pixels and over 8 scales. Each feature point p_t = (x_t, y_t) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (u_t, v_t) as follows:

  p_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * F)|_{(\bar{x}_t, \bar{y}_t)}, \quad (6)

where M is the kernel of the median filtering and (\bar{x}_t, \bar{y}_t) is the rounded position of (x_t, y_t). The tracking is limited to L (= 15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the space-time volume of the video.

Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, the local motion pattern and the appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation:

1. The Trajectory descriptor encodes the shape of the trajectory, represented by the normalized relative coordinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points.
2. HOF is computed using the orientations and magnitudes of the flow field.
3. MBH is designed to capture the gradient of the horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent.
4. HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor, yet the position where the descriptor is computed depends on the trajectory shape.

As in [29], the volume around a feature point is divided into a 2×2×3 space-time grid. The orientations are quantized into 8 bins for HOG and 9 bins for HOF (with one additional zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each.
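The tracking step of Eq. 6 can be sketched directly; this is a simplified, single-scale version (names are ours, with no sampling grid or trajectory pruning):

    import numpy as np

    def track_step(points, flow, ksize=3):
        """Eq. 6: advance each point by the median-filtered flow at its
        rounded position. flow: HxWx2; points: list of (x, y)."""
        H, W, _ = flow.shape
        h = ksize // 2
        advanced = []
        for x, y in points:
            r = min(max(int(round(y)), 0), H - 1)
            c = min(max(int(round(x)), 0), W - 1)
            patch = flow[max(r - h, 0):r + h + 1, max(c - h, 0):c + h + 1]
            u = np.median(patch[..., 0])   # median filtering (M * F)
            v = np.median(patch[..., 1])
            advanced.append((x + u, y + v))
        return advanced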
4.2. Impact of motion compensation

The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2.1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ω-trajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving in the foreground, while those localized in the background are now parallel to the time axis, enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor's motion.

Figure 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory over 15 frames, with a red dot indicating the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted every 5 frames in this example.

Impact on Trajectory and HOG descriptors. Table 1 reports the impact of ω-trajectories on the Trajectory and HOG descriptors, which are both significantly improved by 3%-4% of mAP on the two datasets. When improved by ω-flow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper.

Table 1. ω-Trajdesc and ω-HOG: impact of compensating the flow on the Trajectory and HOG descriptors (baseline [29] reproduced vs. compensated versions; mAP for Hollywood2 and average accuracy for HMDB51).

Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no longer computed around the very same tracked physical point in the space-time volume, but around points lying in a patch around the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a "patch-based" computation capturing more information about the appearance of the background or of the moving foreground.
As for the ω-trajectories, they are closer to the real trajectories of the moving actors, as they usually cancel the camera movement, and are therefore easier to train on and recognize.

Impact on HOF. The ω-flow impacts both the trajectory computation and the flow used as input to the HOF computation itself. Therefore, HOF can be computed along both types of trajectories (ω-trajectories or those extracted from the flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either "tracking flow" or "tracking ω-flow", has a limited impact in this case. From now on, we only consider the "tracking ω-flow" case, where HOF is computed along ω-trajectories.

Table 2. Impact of using the ω-flow on HOF descriptors (tracking with flow or ω-flow, HOF [29] baseline vs. ω-HOF): mAP for Hollywood2 and average accuracy for HMDB51. The ω-HOF is used in subsequent evaluations.

Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and that the affine flow that was subtracted from the ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood2 and of +7.8% on HMDB51.

Table 3. Impact of using the ω-flow on MBH descriptors (tracking with flow or ω-flow, MBH [29]): mAP for Hollywood2 and average accuracy for HMDB51.

Impact on MBH. Since MBH is computed from the gradient of the flow and cancels constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking the ω-flow, the performance improves by around 1.3% for the HMDB51 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back to this descriptor when considering another encoding scheme for local descriptors in Section 6.

4.3. Summary of compensated descriptors

Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both the ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the original ones, with the noticeable exception of ω-MBH, which gives mixed performance with a bag-of-features encoding scheme. But we already mention at this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6.

Table 4. Summary of the compensated descriptors (ω-Trajdesc, ω-HOG, ω-HOF, ω-MBH): which flow is used for the tracking and for the descriptor computation.

Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9.16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively.
Note that exploiting both the flow and the ω-flow does not induce much computational overhead, as the latter is obtained from the flow and the affine flow, which is computed in real time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure.

5. Divergence-Curl-Shear descriptor

This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper.

Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently.

Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH and HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div, curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 × 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid of size n_x × n_y × n_t, with n_x = n_y = 2 and n_t = 3. These parameters have been fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = n_x × n_y × n_t × 8 × 3. At the video level, these descriptors are encoded into a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section.

6. VLAD in actions

VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections.

VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2, ..., ck} of k centroids learned by k-means. The representation is obtained by summing, for each visual word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a vector representation of length d×k, where d is the dimension of the local descriptors. We use the codebook size k = 256.

[Table 5: Performance (mAP on Hollywood2 and average accuracy on HMDB51) of VLAD versus BOF with the ω-Trajdesc, ω-HOG, ω-HOF, ω-MBH and ω-DCS descriptors and their combination.]
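Before turning to the encoding comparison, a minimal sketch of the kinematic features behind DCS. Equation 3 is defined in Section 2.2, outside this excerpt; the code below uses the standard first-order definitions of divergence, curl and shear magnitude, which is an assumption about what Eq. 3 contains.

```python
import numpy as np

def kinematic_features(flow):
    """Divergence, curl and shear of a dense flow field (first-order kinematics).

    flow: (H, W, 2) array of (u, v) displacements.
    """
    u, v = flow[..., 0], flow[..., 1]
    u_y, u_x = np.gradient(u)        # derivatives along rows (y) and columns (x)
    v_y, v_x = np.gradient(v)
    div = u_x + v_y
    curl = v_x - u_y
    shear = np.hypot(u_x - v_y, u_y + v_x)   # magnitude of the hyperbolic terms
    return div, curl, shear

def pair_histogram(a, b, n_bins=8, full_circle=True):
    """8-bin orientation histogram of the 2-D vectors (a, b), magnitude-weighted,
    as used for each DCS pair: range 2*pi for (div, curl), pi otherwise."""
    span = 2 * np.pi if full_circle else np.pi
    ang = np.arctan2(b, a) % span
    mag = np.hypot(a, b)
    bins = np.minimum((ang / span * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())
    return hist
```

Per cell of the 2 × 2 × 3 grid this yields 3 pairs × 8 bins, i.e., the 288-dimensional descriptor once concatenated over the 12 cells.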
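The VLAD aggregation just described is also compact enough to sketch; this assumes hard nearest-centroid assignment, and the final L2 normalization is standard VLAD practice [8] rather than something restated in the text here. The power normalization with α = 0.2 anticipates the post-processing discussed next.

```python
import numpy as np

def vlad_encode(descriptors, centroids, alpha=0.2):
    """VLAD encoding of a set of local descriptors [8].

    descriptors: (N, d); centroids: (k, d) k-means codebook.
    Returns a d*k vector: per-word sums of residuals x - c_i, followed by
    component-wise power normalization and L2 normalization.
    """
    k, d = centroids.shape
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)          # nearest visual word per descriptor
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            vlad[i] = (members - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.abs(vlad) ** alpha   # power normalization
    n = np.linalg.norm(vlad)
    return vlad / n if n > 0 else vlad
```

With d = 288 for DCS and k = 256, the encoded video vector has 73,728 components, which motivates the remark on dimensionality that follows.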
Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a component-wise power normalization, which dramatically improves its performance [8]. While cross-validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and a one-against-rest approach everywhere, unless stated otherwise.

Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they were initially optimized [29]. The results for MBH and ω-MBH in Table 5 support this argument. When using VLAD instead of BOF, the scores are stable in both cases and there is no mixed behavior like that observed in Table 3. VLAD also has a significant positive influence on the accuracy of the ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still, DCS is probably not best utilized in the current setting of parameters. In the case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both datasets. ω-HOF with VLAD improves on HMDB51, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF.

7. Comparison with the state of the art

This section reports our results with all descriptors combined and compares our method with the state of the art.

[Table 6: Combination of descriptors with the VLAD representation: the Trajectory+HOG+HOF+MBH baseline versus all five compensated ω-descriptors combined, the latter reaching 62.5% on Hollywood2 and 52.1% on HMDB51.]

[Table 7: Comparison with the state of the art on the Hollywood2 and HMDB51 datasets, including Wang et al. [29], Vig et al. [28] and Jiang et al. [9]. *Vig et al. [28] gets 61.9% by using external eye-movement data. *Jiang et al. [9] used a one-vs-one multi-class SVM while our and the other methods use one-vs-rest SVMs. With a one-against-one multi-class SVM we obtain 45.1% for HMDB51.]

Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated to be the combination of Trajectory, HOG, HOF and MBH with the VLAD representation. When DCS is added to the baseline, there is an improvement of 0.9% and 1.2%. With the combination of all five compensated descriptors, we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported result to date is more than 11% in average accuracy. Jiang et al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores.
With a similar multi-class SVM approach, our method obtains 45.1%, which remains significantly better than their result. All other results were reported with the one-against-rest approach. On the Olympic Sports dataset we obtain a mAP of 83.2% with 'All ω-descriptors combined', and the improvement is mostly because of VLAD and ω-flow. The best reported mAPs on this dataset are Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] report the best average accuracy of 82.7%.

8. Conclusions

This paper first demonstrates the interest of canceling the dominant motion (predominantly camera motion) to make the visual motion truly related to actions, for both the trajectory extraction and descriptor computation stages. It produces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which is proved complementary to the other descriptors. Finally, we show that the VLAD encoding technique instead of bag-of-words boosts several action descriptors, and overall exhibits a significantly better performance when combining different types of descriptors. Our contributions are all complementary and significantly outperform the state of the art when combined, as demonstrated by our extensive experiments on the Hollywood2, HMDB51 and Olympic Sports datasets.

Acknowledgments

This work was supported by the Quaero project, funded by Oseo, the French agency for innovation. We acknowledge Heng Wang's help in reproducing some of their results.

References

[1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010.
[2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006.
[5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005.
[6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012.
[7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(11):1533–1543, 2008.
[8] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local descriptors into compact codes. IEEE T-PAMI, 34(9):1704–1716, 2012.
[9] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012.
[10] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012.
[11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, Nov. 2011.
[12] I. Laptev and T.
Lindeberg. Space-time interest points. In ICCV, Oct. 2003.
[13] I. Laptev, M. Marzalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008.
[14] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[16] M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009.
[18] R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009.
[19] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010.
[20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, Dec. 1995.
[21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(11):3417–3430, 2006.
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012.
[23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, Oct. 2003.
[25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009.
[26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008.
[27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-of-features action recognition with non-local cues. In BMVC, Sep. 2010.
[28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012.
[29] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011.
[30] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009.
[31] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008.
[32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In ICCV, Nov. 2011.

5 0.19734208 317 cvpr-2013-Optimal Geometric Fitting under the Truncated L2-Norm

Author: Erik Ask, Olof Enqvist, Fredrik Kahl

Abstract: This paper is concerned with model fitting in the presence of noise and outliers. Previously it has been shown that the number of outliers can be minimized with polynomial complexity in the number of measurements. This paper improves on these results in two ways. First, it is shown that for a large class of problems, the statistically more desirable truncated L2-norm can be optimized with the same complexity. Then, with the same methodology, it is shown how to transform multi-model fitting into a purely combinatorial problem—with worst-case complexity that is polynomial in the number of measurements, though exponential in the number of models. We apply our framework to a series of hard registration and stitching problems demonstrating that the approach is not only of theoretical interest. It gives a practical method for simultaneously dealing with measurement noise and large amounts of outliers for fitting problems with low-dimensional models.

6 0.17352794 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

7 0.16307902 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models

8 0.16289502 306 cvpr-2013-Non-rigid Structure from Motion with Diffusion Maps Prior

9 0.16100414 113 cvpr-2013-Dense Variational Reconstruction of Non-rigid Surfaces from Monocular Video

10 0.15486822 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

11 0.15276864 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

12 0.15129767 379 cvpr-2013-Scalable Sparse Subspace Clustering

13 0.14904082 334 cvpr-2013-Pose from Flow and Flow from Pose

14 0.12470976 88 cvpr-2013-Compressible Motion Fields

15 0.12351609 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

16 0.1227531 440 cvpr-2013-Tracking People and Their Objects

17 0.12185223 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration

18 0.11817452 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation

19 0.10934523 349 cvpr-2013-Reconstructing Gas Flows Using Light-Path Approximation

20 0.10747349 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.238), (1, 0.141), (2, -0.002), (3, -0.028), (4, -0.043), (5, -0.044), (6, 0.058), (7, -0.165), (8, -0.04), (9, 0.115), (10, 0.111), (11, 0.142), (12, 0.002), (13, -0.07), (14, 0.093), (15, 0.039), (16, -0.011), (17, 0.003), (18, -0.108), (19, 0.034), (20, 0.105), (21, -0.062), (22, 0.011), (23, 0.032), (24, 0.201), (25, 0.069), (26, -0.005), (27, -0.06), (28, 0.024), (29, -0.094), (30, -0.064), (31, 0.017), (32, -0.074), (33, -0.036), (34, 0.058), (35, 0.003), (36, -0.136), (37, -0.032), (38, -0.013), (39, 0.049), (40, -0.055), (41, 0.016), (42, 0.025), (43, -0.053), (44, -0.037), (45, 0.06), (46, 0.062), (47, 0.038), (48, 0.133), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95360404 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Author: Fernando Flores-Mangas, Allan D. Jepson

This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3], in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then (spectral) clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic in the number of trajectories.

The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles. Re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but the computational cost is close to 10 times bigger than SSC's.

In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art.

1.1. Affine Motion

Projective geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also been proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths that are unaccounted for by the planar surface model.

Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercises all of its degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real-world video sequences. The incremental nature of the decompositions described above facilitates the transition between degenerate motions and non-degenerate ones.

Planar Model. Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation:
$$ \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} = A^{2D}_{w\to c} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}, \quad (1) $$
where D ∈ ℝ^{2×2} is an invertible matrix and t ∈ ℝ² is a translation vector. Trajectory coordinates (x_w, y_w) are in the plane's reference frame (modulo a 2D affine transformation) and (x_c, y_c) are image coordinates.

Now, let W ∈ ℝ^{2F×P} be the matrix of trajectory data that contains the x and y image coordinates of P feature points tracked through F frames, as in
$$ W = \begin{bmatrix} x_1^1 & \cdots & x_P^1 \\ y_1^1 & \cdots & y_P^1 \\ \vdots & & \vdots \\ x_1^F & \cdots & x_P^F \\ y_1^F & \cdots & y_P^F \end{bmatrix}. \quad (2) $$
To compute the parameters of A^{2D} from trajectory data, let C = [c_1, c_2, c_3] ∈ ℝ^{2F×3} be three columns (three full trajectories) from W, and let c_f^i = [c_{2f-1}^i, c_{2f}^i]^\top be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as:
$$ \begin{bmatrix} c_f^1 & c_f^2 & c_f^3 \\ 1 & 1 & 1 \end{bmatrix} = A^{2D}_{s\to f} \begin{bmatrix} c_s^1 & c_s^2 & c_s^3 \\ 1 & 1 & 1 \end{bmatrix}, \quad (3) $$
and A^{2D}_{s\to f} can be simply computed as:
$$ A^{2D}_{s\to f} = \begin{bmatrix} c_f^1 & c_f^2 & c_f^3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c_s^1 & c_s^2 & c_s^3 \\ 1 & 1 & 1 \end{bmatrix}^{-1}. \quad (4) $$
The inverse of the right-hand-side matrix in Eq. 4 exists so long as the points c_s^i are not collinear. For simplicity we refer to A^{2D}_{s\to f} as A_f^{2D}, and consequently A_s^{2D} is the identity matrix.

3D Model. In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted for using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in:
$$ \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} + \delta z_w \begin{bmatrix} a_{13} \\ a_{23} \\ 0 \end{bmatrix}. \quad (5) $$
The depth δz_w is relative to the arbitrary plane whose motion is modeled by A^{2D}; a point that lies on such plane would have δz_w = 0. We call the orthographic version of the plane plus parallax decomposition the 2D Affine Plus Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to
$$ \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & t_1 \\ a_{21} & a_{22} & a_{23} & t_2 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ \delta z_w \\ 1 \end{bmatrix}, \quad (6) $$
where it is clear that the parameters of A^{3D} define an orthographically projected 3D affine transformation. Determining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work.

2. Overview of the Method

The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = {M_1, ..., M_M} is generated using a method that combines a Random Sampling and Consensus (RANSAC) [4] technique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than those expected necessary (M ≫ N) in an attempt to increase the likelihood of generating correct model proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3).
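Eq. 4 amounts to a single 3×3 inverse on homogeneous control points; the sketch below makes this concrete (0-based frame indexing and the helper names are implementation choices, not the paper's code).

```python
import numpy as np

def affine_from_controls(C, s, f):
    """2D affine transform A_{s->f} from three control trajectories (Eqs. 3-4).

    C: (2F, 3) matrix of three full trajectories; rows 2f, 2f+1 hold the x and
    y coordinates at frame f. Returns the 3x3 homogeneous affine mapping frame
    s to frame f. np.linalg.inv fails if the control points at s are collinear.
    """
    def homog(frame):
        pts = C[2 * frame:2 * frame + 2, :]        # 2x3 control points
        return np.vstack([pts, np.ones(3)])        # 3x3 homogeneous matrix
    return homog(f) @ np.linalg.inv(homog(s))

def predict(A, pts):
    """Apply a homogeneous 2D affine to a 2xN array of points."""
    h = np.vstack([pts, np.ones(pts.shape[1])])
    return (A @ h)[:2]
```

Because the last row of both homogeneous point matrices is all ones, the resulting A keeps the proper affine form with bottom row [0, 0, 1].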
In the second stage, subsets of motion models from M are combined to explain all the trajectories in the sequence. The problem is framed as an objective function that must be minimized. The objective function is the negative log-likelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could be reported as a segmentation result (§5).

The third stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6).

3. Motion Model Instantiation

Each model M ∈ M is instantiated independently using RANSAC. This choice is motivated by this method's well-known computational efficiency and robustness to outliers, but also by its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings.

The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data Ŵ[2F×I] ⊆ W[2F×P] (§3.1). In turn, at each RANSAC trial, the algorithm draws uniformly distributed, random subsets of three control trajectories (C[2F×3] ⊂ Ŵ[2F×I]). Each set of control trajectories is used to estimate the family of 2D affine transformations {A_1, ..., A_F} between the base frame and all other frames in the sequence, which are then used to determine a complete set of model parameters M = {B, σ, C, ω}. The matrix B ∈ {0, 1}[F×I] indicates whether the i-th trajectory should be predicted by model M at frame f (inlier, b_f^i = 1) or not (outlier, b_f^i = 0), σ = {σ_1, ..., σ_F} are estimates of the magnitude of the noise for each frame, and ω ∈ {2D, 3D} is the estimated model type. The goal is to find the control points and the associated parameters that minimize the objective function
$$ O(\hat{W}, M) = \sum_{f \in F} \sum_{i \in I} b_f^i\, L_\omega\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) + \Psi(\omega) + \Gamma(B) \quad (7) $$
across a number of RANSAC trials, where ŵ_f^i = (x_f^i, y_f^i) = (ŵ_{2f−1}^i, ŵ_{2f}^i) are the coordinates of the i-th trajectory from the support subset Ŵ at frame f. The negative log-likelihood term L_ω(·) penalizes reconstruction error, while Ψ(·) and Γ(·) are regularizers. The three terms are defined below.

Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, Ψ(ω) regularizes over model complexity using:
$$ \Psi(\omega) = \begin{cases} 6(F-1), & \text{if } \omega = 2D \\ 8(F-1), & \text{if } \omega = 3D. \end{cases} \quad (8) $$
Γ(B) strongly penalizes models that describe too few trajectories:
$$ \Gamma(B) = \begin{cases} \infty, & \text{if } \sum_{i \in I} \sum_{f \in F} b_f^i < F\lambda \\ 0, & \text{otherwise.} \end{cases} \quad (9) $$
The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M.

2D likelihoods. For the planar case (ω = 2D), the negative log-likelihood term is evaluated with:
$$ L_{2D}\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) = -\log\!\left( \frac{1}{2\pi|\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2}\, r_f^{i\top} \Sigma^{-1} r_f^i \right\} \right), \quad (10) $$
which is a zero-mean 2D Normal distribution evaluated at the residuals r_f^i. The spherical covariance matrix is Σ = (σ_f)² I.
The residuals are determined by the differences between the predictions made by a hypothesized model A_f and the observations at each frame:
$$ r_f^i = \tilde{w}_f^i - A_f\, \tilde{w}_s^i. \quad (11) $$

3D likelihoods. The negative log-likelihood term for the 3D case is based on the 2DAPE decomposition. The 2D affinities A_f and residuals r_f are reused, but to account for the effect of relative depth, an epipolar line segment e_f is robustly fit to the residual data at each frame (please see the supplementary material for details on the segment fitting algorithm). The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector e_f^⊥ indicates the orthogonal direction to e_f, then the negative log-likelihood term for the 3D case is estimated with:
$$ L_{3D}\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) = -2\log\!\left( \frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\!\left\{ -\frac{\left(r_f^{i\top} e_f^{\perp}\right)^2}{2\sigma_f^2} \right\} \right), \quad (12) $$
which is also a zero-mean 2D Normal distribution, computed as the product of two identical, separable, single-variate normal distributions, evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of e_f^⊥, which is analytically computed using r_f^{i⊤} e_f^⊥. The second one corresponds to an estimate of the deviation in the perpendicular direction (e_f), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to r_f^{i⊤} e_f^⊥, which is a plausible estimate under the isotropic noise assumption.

Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as is typical for RANSAC. Instead, we found that better motion models resulted from optimizing over the accuracy of the model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly discounted. Figure 1.b shows an example of class-labeled trajectory data, 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model's control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine Ŵ, as well as B, σ, and ω, follow.

Algorithm 1: Motion model instantiation
Input: Trajectory data W[2F×P], number of RANSAC trials K, arbitrary base frame b
Output: Parameters of the motion model M = {B, σ, ω}
// determine the training set
c ← rand(1, P); r ← rand(r_min, r_max) // random center and radius
Ŵ[2F×I] ← trajectoriesWithinDisk(W, r, c) // support subset
X ← homoCoords(Ŵ_b) // points at base frame
for K RANSAC trials do
  ...
return M = {B, σ, ω}

3.1. Local Coherence

The subset of trajectories Ŵ given to RANSAC to generate a model M is constrained to a spatially local region. The probability of choosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is p = 1 − (1 − r³)^k. This means that the number of trials needed to find a subset of 3 inliers with probability p is
$$ k = \frac{\log(1-p)}{\log(1-r^3)}. \quad (13) $$

[Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. The magnitude of the extrapolation error changes with the distance between the control points.]
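To put Eq. 13 in numbers, a two-line helper (the values chosen for r and p are illustrative):

```python
import math

def ransac_trials(inlier_ratio, p=0.99, sample_size=3):
    """Number of RANSAC trials needed to draw one all-inlier sample
    of `sample_size` trajectories with probability p (Eq. 13)."""
    return math.ceil(math.log(1 - p) / math.log(1 - inlier_ratio ** sample_size))

print(ransac_trials(0.5))  # 35 trials at 50% inliers
print(ransac_trials(0.9))  # 4 trials at 90% inliers
```

The drop from 35 to 4 trials is the quantitative payoff of local sampling discussed next.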
A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing k and, with it, RANSAC's computation time by a proportional amount. The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero-mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors.

Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by
$$ \hat{W} = \left\{ w_i \mid (x_b^i - o_x)^2 + (y_b^i - o_y)^2 < r^2 \right\}, \quad (14) $$
where (o_x, o_y) are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory, (o_x, o_y) = (x_b^i, y_b^i), with uniformly distributed i ∼ U(1, P) and base frame b ∼ U(1, F). To allow for different region sizes, the radius r is chosen from a uniform distribution r ∼ U(r_min, r_max). If there are I trajectories within the support region, then Ŵ ∈ ℝ^{2F×I}. It is worth noting that the construction of the support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence it will likely contain trajectories that originate from more than one independently moving object (Figure 3).

[Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class.]

4. Characterizing the Residual Distribution

At each RANSAC iteration, residuals r_f are computed using the 2D affine model A_f that results from the constraints provided by the control trajectories C. Characterizing the distribution of r_f has three initial goals. The first one is to determine the 2D model inliers b_f^{2D} (§4.1), the second one is to compute estimates of the magnitude of the noise at every frame, σ_f^{2D} (§4.2), and the third one is to determine whether the residual distribution originates from a planar or a 3D object (§4.3). If the object is suspected 3D, then two more goals need to be achieved. The first one is to determine the 3D model inliers b_f^{3D} (§4.4), and the second one is to estimate the magnitude of the noise (σ_f^{3D}) to reflect the use of a 3D model (§4.5).
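A minimal sketch of the support sampling of Eq. 14 (0-based indexing and the rng setup are implementation choices):

```python
import numpy as np

rng = np.random.default_rng()

def sample_support(W, F, r_min, r_max):
    """Randomly drawn disk-shaped support subset (Eq. 14).

    W: (2F, P) trajectory matrix. Returns the column indices of trajectories
    whose position at a random base frame b lies within a disk of random
    radius, centered on a randomly chosen trajectory.
    """
    P = W.shape[1]
    b = rng.integers(F)                 # base frame, b ~ U(1, F)
    i = rng.integers(P)                 # disk-center trajectory, i ~ U(1, P)
    r = rng.uniform(r_min, r_max)       # disk radius, r ~ U(r_min, r_max)
    x, y = W[2 * b], W[2 * b + 1]       # all trajectory positions at frame b
    ox, oy = x[i], y[i]
    return np.flatnonzero((x - ox) ** 2 + (y - oy) ** 2 < r ** 2)
```

Note that nothing here looks at motion, which is why the resulting subsets can mix trajectories from several objects, as Figure 3 illustrates.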
4.1. 2D Inlier Detection

Suppose the matrix Ŵ contains trajectories Ŵ₁ ∈ ℝ^{2F×I} and Ŵ₂ ∈ ℝ^{2F×J} from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance, η ∼ N(0, (σ_f)²I):
$$ \hat{W} = \left[ \hat{W}_1 \mid \hat{W}_2 \right] + \eta. \quad (15) $$
Now, assume we know the true affine transformations A_f^1 and A_f^2 that describe the motion of trajectories for the subsets Ŵ₁ and Ŵ₂, respectively. If A_f^1 is used to compute predictions for all of Ŵ (at frame f), the expected value (denoted by ⟨·⟩) of the magnitude of the residuals (r_f from Eq. 11) for trajectories in Ŵ₁ will be in the order of the magnitude of the underlying noise, ⟨|r_f^i|⟩ = σ_f for each i ∈ {1, ..., I}. But in this scenario, trajectories in Ŵ₂ will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential, ⟨|r_f^i|⟩ = ‖(A_f^1 − A_f^2) w_b^i‖. If we can assume that the motion differential is bigger than the displacement due to noise,
$$ \left\| (A_f^1 - A_f^2)\, w_b^i \right\| > \sigma_f, \quad (16) $$
then the model inliers can be determined by thresholding |r_f^i| with the magnitude of the noise, scaled by a constant (τ = λ_σ σ_f):
$$ b_f^i = \begin{cases} 1, & |r_f^i| \le \tau \\ 0, & \text{otherwise.} \end{cases} \quad (17) $$
But because σ_f is generally unknown, the threshold (τ) is estimated from the residual data. To do so, let r̂ be the vector of residual magnitudes sorted so that r̂_i ≤ r̂_{i+1}. Now, let r̃ = median(r̂_{i+1} − r̂_i). The threshold is then defined as
$$ \tau = \min\left\{ \hat{r}_i \mid (\hat{r}_{i+1} - \hat{r}_i) > \lambda_r \tilde{r} \right\}, \quad (18) $$
which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals. Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin) and the rest.

4.2. Magnitude of the Noise, 2D Model

Let r̂_f^{2D} contain only the residuals of the inlier trajectories (those where b_f^i = 1), and let USVᵀ be the singular value decomposition of the covariance matrix of r̂_f^{2D}:
$$ USV^\top = \mathrm{svd}\!\left( \frac{1}{\sum_p b_f^p}\, \left(\hat{r}_f^{2D}\right)^{\!\top} \hat{r}_f^{2D} \right). \quad (19) $$
Then the magnitude of the noise corresponds to the largest singular value, σ² = s₁, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of largest over smallest singular values (s₁/s₂) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio s₁/s₂ ≫ 1. If, on the other hand, s₁/s₂ ≈ 1, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.

4.4. 3D Inlier Detection

When the residual distribution is elongated (s₁/s₂ ≫ 1), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. Further details about this algorithm can be found in the supplementary material.
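Before the 3D inlier test, the 2D analysis of §4.1-§4.3 fits in a few lines. This sketch processes one frame's residuals; the value of the gap constant λ_r is an illustrative assumption, as the paper does not state it in this excerpt.

```python
import numpy as np

def inliers_and_noise(residuals, lambda_r=2.0):
    """Eqs. 17-19 for the residuals of a single frame.

    residuals: (I, 2) residual vectors r_f^i (Eq. 11).
    Returns (inlier mask, noise estimate sigma, capacity ratio s1/s2).
    """
    mags = np.linalg.norm(residuals, axis=1)
    r_sorted = np.sort(mags)
    gaps = np.diff(r_sorted)
    if len(gaps):
        salient = np.flatnonzero(gaps > lambda_r * np.median(gaps))
        tau = r_sorted[salient[0]] if len(salient) else r_sorted[-1]
    else:
        tau = r_sorted[-1]
    inliers = mags <= tau                           # Eq. 17
    R = residuals[inliers]
    cov = (R.T @ R) / max(len(R), 1)                # Eq. 19 covariance
    s = np.linalg.svd(cov, compute_uv=False)        # s1 >= s2
    sigma = np.sqrt(s[0])
    capacity = s[0] / max(s[1], 1e-12)              # >> 1 suggests a 3D object
    return inliers, sigma, capacity
```

When `capacity` is large, the residuals are elongated and the 3D upgrade of §4.4-§4.5 is triggered; otherwise the planar model is kept.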
Inlier detection. The resulting line segment is used to determine the 3D model inliers. Trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of r_f^i onto the line must lie within the segment limits (β ≤ e_f^⊤ r_f^i ≤ γ). Second, the normalized distance to the line must be below a threshold (e_f^{⊥⊤} r_f^i ≤ σ₂ λ_d). Notice that the threshold depends on the smallest singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line (e_f).

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let r̂_f^{3D} contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of r̂_f^{3D} (as in Eq. 19). In this case, the largest singular value s₁ captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, s₂ captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making σ² = s₂ a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the better model from across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming the pool M = {M_1, ..., M_M}.

5. Optimal Model Subset

The next step is to find the model combination M* ⊂ M that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modelling overlap. For this purpose, let M_j = {M_{j,1}, ..., M_{j,N}} be the j-th model combination, and let {M_j} be the set of all C(M, N) = M!/(N!(M−N)!) combinations of N-sized models that can be drawn from M. The model selection problem is then formulated as
$$ M^{\star} = \underset{\{M_j\}}{\arg\min}\; O_S(M_j), \quad (20) $$
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise. Ownership variables Π ∈ {0, 1}[P×N] indicate whether trajectory p i ps explained by t {he0 ,n1-}th model (πp,n = 1) or not (πp,n = 0), and are determined by maximum prediction accuracy (i.e. minimum Euclidean deviation): πp,n=⎨⎧01,, oift hMerjw,nis=e. aMrg∈mMinjE(wp,M) (23) Regularization terms The second term from Eq. 21 penalizes situations where multiple models explain a trajectory (w) with relatively small residuals. For brevity, let M) = exp{−E(w, M)}, then: Eˆ(w, Φ(w,Mj) = −logMMm∈?∈aMMxjE ˆ ( w , MM)). (24) The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants λΦ and λΨ modulate the effect of the corresponding regularizer. Table 1: Accuracy and run-time for the H155 dataset. Naive RANSAC included as a baseline with overall accuracy and total computation time estimated using data from [12]. SOCARAPSulgrCAoNs[S r[31itA]4h]CmAverage89 A689 c. 71c 7695u racy[%]Compu1 t4a 237t1i506o70 n0 time[s] 6. Refinement The optimal model subset M? yields ownership variableTsh Πe o? wtimhicahl can already tb e M interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings Πt from the top T subsets {Mt? | 1 ≤ t ≤ T} closest to optimal. Multiple labellings are incorporated oinsetos an affinity matrix F, where the fi,j entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labelli?ngs, weighted b?y the relative objective function O˜t = exp ?−OOSS((WW||MMt??))? for such a labelling: fi,j=?tT=11O˜tt?T=1?πit,:πjt,?:?O˜t (25) Note that the inne?r product between the label vectors (πi,:πj?,:) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method’s final segmentation result. 7. Experiments Evaluation was made through three experimental setups. Hopkins 155 The Hopkins 155 (H155) dataset has been the standard evaluation metric for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state of the art methods, with minimal accuracy loss. Computation times are measured in the same (or very similar) hardware architectures. Like in CPA, our implementation uses a single set of parameters for all the experiments, but as others had pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper. 222222666533 Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise. Artificial Noise The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes towards the segmentation accuracy of our method, in comparison with the state of the art. 
[Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise.]

Artificial Noise. The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art. We noted that H155 contains structured long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of the (mean-subtracted) trajectories for each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes σ_n ∈ {0.01, 0.25, 0.5, 1, 2, 4, 8}. Our method, as well as SSC and CPA, were run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA is generated with only one run for each dataset (running time: 11.95 days). The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

KLT Tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

[Figure 5: Segmentation results on KLT tracks from four Formula 1 sequences. Italian Grand Prix ©2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.]

References

[1] http://www.ces.clemson.edu/~stb/klt.
[2] J. P. Costeira and T. Kanade. A Multibody Factorization Method for Independently Moving Objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: Model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159–177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. In Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade.
Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Soc. of London, Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. In Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. In Proc. CVPR, 2003.

2 0.77662927 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

Author: Guillem Palou, Philippe Salembier

Abstract: As an early stage of video processing, we introduce an iterative trajectory merging algorithm that produces a region-based and hierarchical representation of the video sequence, called the Trajectory Binary Partition Tree (BPT). From this representation, many analysis and graph cut techniques can be used to extract partitions or objects that are useful in the context of specific applications. In order to define trajectories and to create a precise merging algorithm, color and motion cues have to be used. Both types of information are very useful to characterize objects but present strong differences of behavior in the spatial and the temporal dimensions. On the one hand, scenes and objects are rich in their spatial color distributions, but these distributions are rather stable over time. Object motion, on the other hand, presents simple structures and low spatial variability but may change from frame to frame. The proposed algorithm takes into account this key difference and relies on different models and associated metrics to deal with color and motion information. We show that the proposed algorithm outperforms existing hierarchical video segmentation algorithms and provides more stable and precise regions.

3 0.71825773 290 cvpr-2013-Motion Estimation for Self-Driving Cars with a Generalized Camera

Author: Gim Hee Lee, Friedrich Faundorfer, Marc Pollefeys

Abstract: In this paper, we present a visual ego-motion estimation algorithm for a self-driving car equipped with a close-to-market multi-camera system. By modeling the multi-camera system as a generalized camera and applying the non-holonomic motion constraint of a car, we show that this leads to a novel 2-point minimal solution for the generalized essential matrix where the full relative motion including metric scale can be obtained. We provide the analytical solutions for the general case with at least one inter-camera correspondence and a special case with only intra-camera correspondences. We show that up to a maximum of 6 solutions exist for both cases. We identify the existence of degeneracy when the car undergoes straight motion in the special case with only intra-camera correspondences where the scale becomes unobservable and provide a practical alternative solution. Our formulation can be efficiently implemented within RANSAC for robust estimation. We verify the validity of our assumptions on the motion model by comparing our results on a large real-world dataset collected by a car equipped with 4 cameras with minimally overlapping fields of view against the GPS/INS ground truth.

4 0.68186122 46 cvpr-2013-Articulated and Restricted Motion Subspaces and Their Signatures

Author: Bastien Jacquet, Roland Angst, Marc Pollefeys

Abstract: Articulated objects represent an important class of objects in our everyday environment. Automatic detection of the type of articulated or otherwise restricted motion and extraction of the corresponding motion parameters are therefore of high value, e.g. in order to augment an otherwise static 3D reconstruction with dynamic semantics, such as rotation axes and allowable translation directions for certain rigid parts or objects. Hence, in this paper, a novel theory to analyse relative transformations between two motion-restricted parts will be presented. The analysis is based on linear subspaces spanned by relative transformations. Moreover, a signature for relative transformations will be introduced which uniquely specifies the type of restricted motion encoded in these relative transformations. This theoretic framework enables the derivation of novel algebraic constraints, such as low-rank constraints for subsequent rotations around two fixed axes for example. Lastly, given the type of restricted motion as predicted by the signature, the paper shows how to extract all the motion parameters with matrix manipulations from linear algebra. Our theory is verified on several real data sets, such as a rotating blackboard or a wheel rolling on the floor amongst others.

5 0.6751008 301 cvpr-2013-Multi-target Tracking by Rank-1 Tensor Approximation

Author: Xinchu Shi, Haibin Ling, Junling Xing, Weiming Hu

Abstract: In this paper we formulate multi-target tracking (MTT) as a rank-1 tensor approximation problem and propose an ℓ1-norm tensor power iteration solution. In particular, a high order tensor is constructed based on trajectories in the time window, with each tensor element as the affinity of the corresponding trajectory candidate. The local assignment variables are the ℓ1-normalized vectors, which are used to approximate the rank-1 tensor. Our approach provides a flexible and effective formulation where both pairwise and high-order association energies can be used expediently. We also show the close relation between our formulation and the multi-dimensional assignment (MDA) model. To solve the optimization in the rank-1 tensor approximation, we propose an algorithm that iteratively powers the intermediate solution followed by an ℓ1 normalization. Aside from effectively capturing high-order motion information, the proposed solver runs efficiently with proved convergence. The experimental validations are conducted on two challenging datasets and our method demonstrates promising performances on both.

6 0.65640616 441 cvpr-2013-Tracking Sports Players with Context-Conditioned Motion Models

7 0.6475389 88 cvpr-2013-Compressible Motion Fields

8 0.62996286 113 cvpr-2013-Dense Variational Reconstruction of Non-rigid Surfaces from Monocular Video

9 0.60401577 244 cvpr-2013-Large Displacement Optical Flow from Nearest Neighbor Fields

10 0.59416795 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

11 0.58317065 124 cvpr-2013-Determining Motion Directly from Normal Flows Upon the Use of a Spherical Eye Platform

12 0.57212979 306 cvpr-2013-Non-rigid Structure from Motion with Diffusion Maps Prior

13 0.5708257 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking

14 0.54427725 118 cvpr-2013-Detecting Pulse from Head Motions in Video

15 0.52795571 341 cvpr-2013-Procrustean Normal Distribution for Non-rigid Structure from Motion

16 0.52691329 356 cvpr-2013-Representing and Discovering Adversarial Team Behaviors Using Player Roles

17 0.52449471 283 cvpr-2013-Megastereo: Constructing High-Resolution Stereo Panoramas

18 0.5232653 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization

19 0.51400489 317 cvpr-2013-Optimal Geometric Fitting under the Truncated L2-Norm

20 0.51147234 23 cvpr-2013-A Practical Rank-Constrained Eight-Point Algorithm for Fundamental Matrix Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.159), (12, 0.174), (16, 0.034), (26, 0.055), (28, 0.028), (33, 0.308), (67, 0.042), (69, 0.051), (87, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93830317 406 cvpr-2013-Spatial Inference Machines

Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli

Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 17-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.

same-paper 2 0.92090619 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Author: Fernando Flores-Mangas, Allan D. Jepson

This subspace representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3], in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then (spectrally) clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic in the number of trajectories.

The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated with each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles, and the re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but its computational cost is close to ten times that of SSC.

In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately two and three orders of magnitude faster than [3] and [14] respectively, which are currently considered the state of the art.

1.1. Affine Motion

Projective geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also been proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths that are unaccounted for by the planar surface model.

Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercises all of its degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real-world video sequences. The incremental nature of the decompositions described above facilitates the transition between degenerate motions and non-degenerate ones.

Planar Model. Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation

$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} = A^{2D}_{w \to c} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}$, (1)

where $D \in \mathbb{R}^{2 \times 2}$ is an invertible matrix and $t \in \mathbb{R}^2$ is a translation vector. Trajectory coordinates $(x_w, y_w)$ are in the plane's reference frame (modulo a 2D affine transformation) and $(x_c, y_c)$ are image coordinates.

Now, let $W \in \mathbb{R}^{2F \times P}$ be the matrix of trajectory data that contains the x and y image coordinates of P feature points tracked through F frames, as in

$W = \begin{bmatrix} x_1^1 & \cdots & x_1^P \\ y_1^1 & \cdots & y_1^P \\ \vdots & & \vdots \\ x_F^1 & \cdots & x_F^P \\ y_F^1 & \cdots & y_F^P \end{bmatrix}$. (2)

To compute the parameters of $A^{2D}$ from trajectory data, let $C = [c^1, c^2, c^3] \in \mathbb{R}^{2F \times 3}$ be three columns (three full trajectories) from W, and let $c^i_f = [c^i_{2f-1}, c^i_{2f}]^\top$ be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as

$\begin{bmatrix} c^1_f & c^2_f & c^3_f \\ 1 & 1 & 1 \end{bmatrix} = A^{2D}_{s \to f} \begin{bmatrix} c^1_s & c^2_s & c^3_s \\ 1 & 1 & 1 \end{bmatrix}$, (3)

and $A^{2D}_{s \to f}$ can be simply computed as

$A^{2D}_{s \to f} = \begin{bmatrix} c^1_f & c^2_f & c^3_f \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c^1_s & c^2_s & c^3_s \\ 1 & 1 & 1 \end{bmatrix}^{-1}$. (4)

The inverse in the right-hand side matrix of Eq. 4 exists so long as the points $c^i_s$ are not collinear. For simplicity we refer to $A^{2D}_{s \to f}$ as $A^{2D}_f$, and consequently $A^{2D}_s$ is the identity matrix.

3D Model. In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted for using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in

$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} D & t \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} + \delta z_w \begin{bmatrix} a_{13} \\ a_{23} \\ 0 \end{bmatrix}$. (5)

The depth $\delta z_w$ is relative to the arbitrary plane whose motion is modeled by $A^{2D}$; a point that lies on such a plane would have $\delta z_w = 0$. We call this orthographic version of the plane-plus-parallax decomposition the 2D Affine Plus Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to

$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & t_1 \\ a_{21} & a_{22} & a_{23} & t_2 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ \delta z_w \\ 1 \end{bmatrix}$, (6)

where it is clear that the parameters of $A^{3D}$ define an orthographically projected 3D affine transformation. Determining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work.
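As a concrete illustration of Eqs. 3-4, the following numpy sketch (ours, not the authors' implementation; the array shapes and the toy affine transform are assumptions) estimates the frame-to-frame affine from three control trajectories and verifies it on synthetic points:

```python
import numpy as np

def affine_from_controls(C_s, C_f):
    """Eq. 4: the 2D affine transform mapping three control points at a
    source frame s to a target frame f, in homogeneous coordinates.
    C_s, C_f: 2x3 arrays whose columns are (x, y) control coordinates."""
    H_s = np.vstack([C_s, np.ones(3)])   # [c1 c2 c3; 1 1 1] at frame s (Eq. 3)
    H_f = np.vstack([C_f, np.ones(3)])   # [c1 c2 c3; 1 1 1] at frame f
    # The inverse exists as long as the source points are not collinear.
    return H_f @ np.linalg.inv(H_s)

# Toy check: a known affine motion is recovered exactly from 3 points.
rng = np.random.default_rng(0)
A_true = np.array([[1.1, 0.05, 2.0], [-0.03, 0.95, -1.0], [0.0, 0.0, 1.0]])
C_s = rng.uniform(0.0, 100.0, size=(2, 3))
C_f = (A_true @ np.vstack([C_s, np.ones(3)]))[:2]
assert np.allclose(affine_from_controls(C_s, C_f), A_true)
```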
2. Overview of the Method

The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses $\mathcal{M} = \{M_1, \ldots, M_M\}$ is generated using a method that combines a Random Sampling and Consensus (RANSAC) [4] technique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than are expected to be necessary ($M \gg N$) in an attempt to increase the likelihood of generating correct model proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3).

In the second stage, subsets of motion models from $\mathcal{M}$ are combined to explain all the trajectories in the sequence. The problem is framed as an objective function that must be minimized. The objective function is the negative log-likelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could already be reported as a segmentation result (§5).

The third stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6).

3. Motion Model Instantiation

Each model $M \in \mathcal{M}$ is instantiated independently using RANSAC. This choice is motivated by this method's well-known computational efficiency and robustness to outliers, but also by its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings.

The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data $\hat{W}_{[2F \times I]} \subseteq W_{[2F \times P]}$ (§3.1). In turn, at each RANSAC trial, the algorithm draws uniformly distributed, random subsets of three control trajectories ($C_{[2F \times 3]} \subset \hat{W}_{[2F \times I]}$). Each set of control trajectories is used to estimate the family of 2D affine transformations $\{A_1, \ldots, A_F\}$ between the base frame and all other frames in the sequence, which are then used to determine a complete set of model parameters $M = \{B, \sigma, C, \omega\}$. The matrix $B \in \{0, 1\}_{[F \times I]}$ indicates whether the i-th trajectory should be predicted by model M at frame f (inlier, $b^i_f = 1$) or not (outlier, $b^i_f = 0$), $\sigma = \{\sigma_1, \ldots, \sigma_F\}$ are estimates of the magnitude of the noise for each frame, and $\omega \in \{2D, 3D\}$ is the estimated model type. The goal is to find the control points and the associated parameters that minimize the objective function

$O(\hat{W}, M) = \sum_{f \in F} \sum_{i \in I} b^i_f \, L_\omega(\hat{w}^i_f \mid A_f, \sigma_f) + \Psi(\omega) + \Gamma(B)$ (7)

across a number of RANSAC trials, where $\hat{w}^i_f = (\hat{w}^i_{2f-1}, \hat{w}^i_{2f}) = (x^i_f, y^i_f)$ are the coordinates of the i-th trajectory from the support subset $\hat{W}$ at frame f. The negative log-likelihood term $L_\omega(\cdot)$ penalizes reconstruction error, while $\Psi(\cdot)$ and $\Gamma(\cdot)$ are regularizers; the three terms are defined below. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, $\Psi(\omega)$ regularizes over model complexity using

$\Psi(\omega) = \begin{cases} 6(F-1), & \text{if } \omega = 2D \\ 8(F-1), & \text{if } \omega = 3D. \end{cases}$ (8)

$\Gamma(B)$ strongly penalizes models that describe too few trajectories:

$\Gamma(B) = \begin{cases} \infty, & \text{if } \sum_{i \in I} \sum_{f \in F} b^i_f < F \lambda_i \\ 0, & \text{otherwise.} \end{cases}$ (9)

The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates $\mathcal{M}$.

2D likelihoods. For the planar case ($\omega = 2D$) the negative log-likelihood term is evaluated with

$L_{2D}(\hat{w}^i_f \mid A_f, \sigma_f) = -\log\left( \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} r^{i\top}_f \Sigma^{-1} r^i_f \right) \right)$, (10)

which is a zero-mean 2D Normal distribution evaluated at the residuals $r^i_f$. The spherical covariance matrix is $\Sigma = (\sigma_f)^2 I$. The residuals are determined by the differences between the predictions made by a hypothesized model $A_f$ and the observations at each frame:

$r^i_f = \tilde{w}^i_f - A_f \tilde{w}^i_s$. (11)
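The pieces of Eq. 7 are straightforward to prototype. A minimal sketch under our assumptions (2D likelihood only, and a scalar `lam` standing in for the unspecified λ of Eq. 9):

```python
import numpy as np

def nll_2d(res_f, sigma_f):
    """Eq. 10: negative log-likelihood of residuals under a zero-mean
    2D Normal with spherical covariance (sigma_f^2) * I.
    res_f: (I, 2) residuals r_f^i at one frame (Eq. 11)."""
    sq = (res_f ** 2).sum(axis=1)
    return np.log(2.0 * np.pi * sigma_f ** 2) + sq / (2.0 * sigma_f ** 2)

def psi(omega, F):
    """Eq. 8: model-complexity regularizer (6 vs 8 degrees of freedom)."""
    return 6 * (F - 1) if omega == "2D" else 8 * (F - 1)

def gamma(B, lam):
    """Eq. 9: infinite penalty for models describing too few trajectories.
    B: (F, I) binary inlier matrix; lam is our stand-in for lambda_i."""
    return 0.0 if B.sum() >= B.shape[0] * lam else np.inf

def objective(res, B, sigmas, omega, lam=3.0):
    """Eq. 7 (2D case): inlier-masked NLL plus both regularizers.
    res: (F, I, 2) residuals; sigmas: (F,) per-frame noise estimates."""
    F = res.shape[0]
    nll = sum((B[f] * nll_2d(res[f], sigmas[f])).sum() for f in range(F))
    return nll + psi(omega, F) + gamma(B, lam)
```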
3D likelihoods. The negative log-likelihood term for the 3D case is based on the 2DAPE decomposition. The 2D affinities $A_f$ and residuals $r_f$ are reused, but to account for the effect of relative depth, an epipolar line segment $e_f$ is robustly fit to the residual data at each frame (please see the supplementary material for details on the segment fitting algorithm). The 2DAPE decomposition does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector $e^\perp_f$ indicates the direction orthogonal to $e_f$, then the negative log-likelihood term for the 3D case is estimated with

$L_{3D}(\hat{w}^i_f \mid A_f, \sigma_f) = -2 \log\left( \frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\left( -\frac{(r^{i\top}_f e^\perp_f)^2}{2\sigma_f^2} \right) \right)$, (12)

which is also a zero-mean 2D Normal distribution, computed as the product of two identical, separable, single-variate normal distributions evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of $e^\perp_f$, which is analytically computed using $r^{i\top}_f e^\perp_f$. The second one corresponds to an estimate of the deviation in the perpendicular direction ($e_f$), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to $r^{i\top}_f e^\perp_f$, which is a plausible estimate under the isotropic noise assumption.

Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as is typical for RANSAC. Instead, we found that better motion models resulted from optimizing over the accuracy of the model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly discounted.

Algorithm 1: Motion model instantiation
Input: trajectory data W[2F×P], number of RANSAC trials K, arbitrary base frame b
Output: parameters of the motion model M = {B, σ, C, ω}
  // determine the training set
  c ← rand(1, P) ; r ← rand(rmin, rmax)            // random center and radius
  Ŵ[2F×I] ← trajectoriesWithinDisk(W, r, c)        // support subset
  X ← homoCoords(Ŵb)                               // points at base frame
  for K RANSAC trials do
    C ← three random control trajectories from Ŵ
    {A1, ..., AF} ← affine transformations from C (Eqs. 3-4)
    r ← residuals for all trajectories in Ŵ (Eq. 11)
    B, σ, ω ← inliers, noise magnitudes and model type from r (§4)
    keep {B, σ, C, ω} if it minimizes O(Ŵ, M) (Eq. 7)
  return M = {B, σ, C, ω}

Figure 1.b shows an example of class-labeled trajectory data, and 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model's control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine $\hat{W}$, as well as B, σ and ω, follow.

3.1. Local Coherence

The subset of trajectories $\hat{W}$ given to RANSAC to generate a model M is constrained to a spatially local region. The probability of choosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is $p = 1 - (1 - r^3)^k$. This means that the number of trials needed to find a subset of 3 inliers with probability p is

$k = \frac{\log(1 - p)}{\log(1 - r^3)}$. (13)

A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing k, and with it, RANSAC's computation time by a proportional amount.
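A quick sketch of Eq. 13 (ours; the success probability p = 0.99 is an arbitrary choice) illustrates how strongly an inlier-rich support region reduces the required number of trials:

```python
import numpy as np

def ransac_trials(p, r):
    """Eq. 13: trials k needed to draw one all-inlier set of 3 control
    trajectories with probability p, given an inlier ratio r."""
    return int(np.ceil(np.log(1.0 - p) / np.log(1.0 - r ** 3)))

for r in (0.3, 0.6, 0.9):
    print(r, ransac_trials(0.99, r))   # prints roughly 169, 19 and 4
```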
Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. The magnitude of the extrapolation error changes with the distance between the control points.

The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when it is used to make predictions for trajectories outside the region defined by the control points. The magnitude of the modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero-mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors.

Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is determined by

$\hat{W} = \{ w^i \mid (x^i_b - o_x)^2 + (y^i_b - o_y)^2 < r^2 \}$, (14)

where $(o_x, o_y)$ are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory, $(o_x, o_y) = (x^i_b, y^i_b)$, with uniformly distributed $i \sim U(1, P)$ and base frame $b \sim U(1, F)$. To allow for different region sizes, the radius r is chosen from a uniform distribution $r \sim U(r_{min}, r_{max})$. If there are I trajectories within the support region, then $\hat{W} \in \mathbb{R}^{2F \times I}$.

It is worth noting that the construction of the support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence it will likely contain trajectories that originate from more than one independently moving object (Figure 3).

Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class.
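A minimal sketch of the support-set construction of Eq. 14 (ours; the row layout of W follows Eq. 2, and the random-generator plumbing is an assumption):

```python
import numpy as np

def sample_support(W, F, r_min, r_max, rng):
    """Eq. 14: draw a disk-shaped, spatially-local support subset.
    W: (2F, P) trajectory matrix with rows (x_1, y_1, ..., x_F, y_F)."""
    P = W.shape[1]
    b = rng.integers(F)                    # base frame b ~ U(1, F)
    xb, yb = W[2 * b], W[2 * b + 1]        # image coordinates at frame b
    i = rng.integers(P)                    # disk centred on trajectory i ~ U(1, P)
    r = rng.uniform(r_min, r_max)          # radius r ~ U(r_min, r_max)
    inside = (xb - xb[i]) ** 2 + (yb - yb[i]) ** 2 < r ** 2
    return W[:, inside]                    # (2F, I) support subset W-hat

rng = np.random.default_rng(1)
W = rng.uniform(0, 640, size=(2 * 30, 500))   # 500 synthetic tracks, 30 frames
print(sample_support(W, 30, r_min=40, r_max=160, rng=rng).shape)
```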
4. Characterizing the Residual Distribution

At each RANSAC iteration, residuals $r_f$ are computed using the 2D affine model $A_f$ that results from the constraints provided by the control trajectories C. Characterizing the distribution of $r_f$ has three initial goals: to determine the 2D model inliers $b^{2D}_f$ (§4.1), to compute estimates of the magnitude of the noise at every frame $\sigma^{2D}_f$ (§4.2), and to determine whether the residual distribution originates from a planar or a 3D object (§4.3). If the object is suspected to be 3D, then two more goals need to be achieved: to determine the 3D model inliers $b^{3D}_f$ (§4.4), and to estimate the magnitude of the noise ($\sigma^{3D}_f$) that reflects the use of a 3D model (§4.5).

4.1. 2D Inlier Detection

Suppose the matrix $\hat{W}$ contains trajectories $\hat{W}_1 \in \mathbb{R}^{2F \times I}$ and $\hat{W}_2 \in \mathbb{R}^{2F \times J}$ from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance, $\eta \sim N(0, (\sigma_f)^2 I)$:

$\hat{W} = [\hat{W}_1 \mid \hat{W}_2] + \eta$. (15)

Now, assume we know the true affine transformations $A^1_f$ and $A^2_f$ that describe the motion of trajectories for the subsets $\hat{W}_1$ and $\hat{W}_2$, respectively. If $A^1_f$ is used to compute predictions for all of $\hat{W}$ (at frame f), the expected value (denoted by $\langle \cdot \rangle$) of the magnitude of the residuals ($r_f$ from Eq. 11) for trajectories in $\hat{W}_1$ will be in the order of the magnitude of the underlying noise, $\langle |r^i_f| \rangle = \sigma_f$, for each $i \in \{1, \ldots, I\}$. But in this scenario, trajectories in $\hat{W}_2$ will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential, $\langle |r^i_f| \rangle = |(A^1_f - A^2_f) w^i_b|$. If we can assume that the motion differential is bigger than the displacement due to noise,

$|(A^1_f - A^2_f) w^i_b| > \sigma_f$, (16)

then the model inliers can be determined by thresholding $|r^i_f|$ with the magnitude of the noise, scaled by a constant ($\tau = \lambda_\sigma \sigma_f$):

$b^i_f = \begin{cases} 1, & \text{if } |r^i_f| \le \tau \\ 0, & \text{otherwise.} \end{cases}$ (17)

But because $\sigma_f$ is generally unknown, the threshold τ is estimated from the residual data. To do so, let $\hat{r}$ be the vector of sorted residual magnitudes, so that $\hat{r}_i \le \hat{r}_{i+1}$, and let $\tilde{r} = \text{median}_i(\hat{r}_{i+1} - \hat{r}_i)$. The threshold is then defined as

$\tau = \min\{ \hat{r}_i \mid (\hat{r}_{i+1} - \hat{r}_i) > \lambda_r \tilde{r} \}$, (18)

which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals; notice the presence of a (low-density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin) and the rest.

4.2. Magnitude of the Noise, 2D Model

Let $\hat{r}^{2D}_f$ contain only the residuals of the inlier trajectories (those with $b^i_f = 1$), and let $USV^\top$ be the singular value decomposition of the covariance matrix of $\hat{r}^{2D}_f$:

$USV^\top = \text{svd}\left( \frac{1}{\sum_p b^p_f} (\hat{r}^{2D}_f)^\top \hat{r}^{2D}_f \right)$. (19)

Then the magnitude of the noise corresponds to the largest singular value, $\sigma^2 = s_1$, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of largest to smallest singular values ($s_1/s_2$) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is left unaccounted for. This produces a covariance matrix with a large ratio $s_1/s_2 \gg 1$. If, on the other hand, $s_1/s_2 \approx 1$, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.
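Both residual tests are easy to prototype. A sketch under our assumptions (λ = 3 for the gap test of Eq. 18, and zero-mean residuals for the covariance of Eq. 19):

```python
import numpy as np

def gap_threshold(res_mag, lam=3.0):
    """Eq. 18: the smallest residual magnitude before a salient gap.
    res_mag: magnitudes |r_f^i| for one frame; per Eq. 17, inliers are
    the trajectories with |r_f^i| <= tau."""
    r = np.sort(res_mag)
    gaps = np.diff(r)
    salient = np.flatnonzero(gaps > lam * np.median(gaps))
    return r[salient[0]] if salient.size else r[-1]

def noise_and_ratio(res_inliers):
    """Eq. 19: SVD of the inlier residual covariance. s1 estimates the
    2D-model noise; s1/s2 >> 1 flags the elongated residual spread of
    unexplained relative depth, i.e. a 3D object (Sec. 4.3).
    res_inliers: (I, 2) inlier residuals, assumed zero-mean."""
    cov = res_inliers.T @ res_inliers / len(res_inliers)   # 2 x 2
    s = np.linalg.svd(cov, compute_uv=False)
    return s[0], s[0] / s[1]
```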
4.4. 3D Inlier Detection

When the residual distribution is elongated ($s_1/s_2 \gg 1$), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. Further details about this algorithm can be found in the supplementary material.

The resulting line segment is used to determine the 3D model inliers. A trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of $r^i_f$ onto the line must lie within the segment limits ($\beta \le e^\top_f r^i_f \le \gamma$). Second, the normalized distance to the line must be below a threshold ($r^{i\top}_f e^\perp_f \le \sigma_2 \lambda_d$). Notice that the threshold depends on the smallest singular value from Eq. 19, to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line ($e_f$).

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let $\hat{r}^{3D}_f$ contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of $\hat{r}^{3D}_f$ (as in Eq. 19). In this case, the largest singular value $s_1$ captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, $s_2$ captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making $\sigma^2 = s_2$ a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (Eq. 7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the best model across all RANSAC iterations. The winning model is added to the pool $\mathcal{M}$, and the process is repeated M times, forming the pool $\mathcal{M} = \{M_1, \ldots, M_M\}$.

5. Optimal Model Subset

The next step is to find the model combination $\mathcal{M}^* \subset \mathcal{M}$ that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modeling overlap. For this purpose, let $\mathcal{M}_j = \{M_{j,1}, \ldots, M_{j,N}\}$ be the j-th model combination, and let $\{\mathcal{M}_j\}$ be the set of all $C^N_M = \frac{M!}{N!(M-N)!}$ combinations of N-sized models that can be drawn from $\mathcal{M}$. The model selection problem is then formulated as

$\mathcal{M}^* = \arg\min_{\mathcal{M}_j} O_S(\mathcal{M}_j)$, (20)

where the objective is

$O_S(\mathcal{M}_j) = \sum_{n=1}^{N} \sum_{p=1}^{P} \pi_{p,n} E(w_p, M_{j,n}) + \lambda_\Phi \sum_{p=1}^{P} \sum_{n=1}^{N} \Phi(w_p, M_{j,n}) + \lambda_\Psi \sum_{n=1}^{N} \Psi(M_{j,n})$. (21)

The first term accounts for prediction accuracy; the other two are regularization terms. Details follow.

Prediction Accuracy. In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3.1). Instead, model parameters are computed with a factorization-based method [10]. Given the inlier labeling B in M, let $W_B$ be the subset of trajectories where $b^i_f = 1$ for at least half of the frames. The orthonormal basis S of an ω = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of $W_B$. Using S as the model's motion matrices, prediction accuracy can be computed using

$E(w, M) = \| SS^\top w - w \|^2$, (22)

which is the sum of squared Euclidean deviations from the predictions ($SS^\top w$) to the observed data (w).
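A sketch of the factorization-based error of Eq. 22 (ours; it assumes $W_B$ is already assembled with the inlier trajectories as its 2F-dimensional columns):

```python
import numpy as np

def prediction_error(w, W_B, omega="2D"):
    """Eq. 22: E(w, M) = ||S S^T w - w||^2, with S the 2 (2D) or 3 (3D)
    leading left singular vectors of the inlier data W_B (2F x |B|)."""
    k = 2 if omega == "2D" else 3
    S = np.linalg.svd(W_B, full_matrices=False)[0][:, :k]
    d = S @ (S.T @ w) - w                 # deviation from the projection
    return float(d @ d)
```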
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise.

Ownership variables. $\Pi \in \{0, 1\}_{[P \times N]}$ indicates whether trajectory p is explained by the n-th model ($\pi_{p,n} = 1$) or not ($\pi_{p,n} = 0$), and is determined by maximum prediction accuracy (i.e., minimum Euclidean deviation):

$\pi_{p,n} = \begin{cases} 1, & \text{if } M_{j,n} = \arg\min_{M \in \mathcal{M}_j} E(w_p, M) \\ 0, & \text{otherwise.} \end{cases}$ (23)

Regularization terms. The second term from Eq. 21 penalizes situations where multiple models explain a trajectory w with relatively small residuals. For brevity, let $\hat{E}(w, M) = \exp\{-E(w, M)\}$; then

$\Phi(w, \mathcal{M}_j) = -\log \frac{\max_{M \in \mathcal{M}_j} \hat{E}(w, M)}{\sum_{M \in \mathcal{M}_j} \hat{E}(w, M)}$. (24)

The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants $\lambda_\Phi$ and $\lambda_\Psi$ modulate the effect of the corresponding regularizer.

Table 1: Accuracy and run-time for the H155 dataset, reporting the average accuracy [%] and computation time [s] of SSC [3], CPA [14], naive RANSAC and our method. Naive RANSAC is included as a baseline, with overall accuracy and total computation time estimated using data from [12].

6. Refinement

The optimal model subset $\mathcal{M}^*$ yields ownership variables $\Pi^*$ which can already be interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings $\Pi_t$ from the top T subsets $\{\mathcal{M}^*_t \mid 1 \le t \le T\}$ closest to the optimal. Multiple labellings are incorporated into an affinity matrix F, where the entry $f_{i,j}$ indicates the frequency with which trajectory i is given the same label as trajectory j across all T labellings, weighted by the relative objective function $\tilde{O}_t = \exp\{-O_S(W \mid \mathcal{M}^*_t)\}$ for such a labelling:

$f_{i,j} = \frac{1}{\sum_{t=1}^{T} \tilde{O}_t} \sum_{t=1}^{T} \left( \pi^t_{i,:} \pi^{t\top}_{j,:} \right) \tilde{O}_t$. (25)

Note that the inner product between the label vectors ($\pi^t_{i,:} \pi^{t\top}_{j,:}$) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method's final segmentation result.
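Eq. 25 amounts to a weighted sum of label co-occurrence matrices. A sketch (ours; the spectral clustering call is left as a comment because the paper does not name a specific implementation):

```python
import numpy as np

def refinement_affinity(Pis, objectives):
    """Eq. 25: aggregate the top-T labellings into an affinity matrix F.
    Pis: list of (P, N) one-hot ownership matrices Pi_t (Eq. 23);
    objectives: corresponding values O_S(W | M*_t) from Eq. 21."""
    # O-tilde_t weights; subtracting min(objectives) first would improve
    # numerical stability when the objectives are large.
    O = np.exp(-np.asarray(objectives, dtype=float))
    # (Pi Pi^T)[i, j] = 1 exactly when trajectories i and j share a label.
    F = sum(o * (Pi @ Pi.T) for o, Pi in zip(O, Pis))
    return F / O.sum()

# F would then be spectrally clustered, e.g.:
# from sklearn.cluster import SpectralClustering
# labels = SpectralClustering(n_clusters=N,
#                             affinity="precomputed").fit_predict(F)
```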
7. Experiments

Evaluation was made through three experimental setups.

Hopkins 155. The Hopkins 155 (H155) dataset has been the standard evaluation benchmark for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance; Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state-of-the-art methods, with minimal accuracy loss. Computation times are measured on the same (or very similar) hardware architectures. Like CPA, our implementation uses a single set of parameters for all the experiments, but as others have pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper.

Artificial Noise. The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effect of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art. We noted that H155 contains structured, long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of the (mean-subtracted) trajectories for each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes $\sigma_n \in \{0.01, 0.25, 0.5, 1, 2, 4, 8\}$. Our method, as well as SSC and CPA, were run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA is generated with only one run for each dataset (running time: 11.95 days). The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise.

KLT Tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions, using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

Figure 5: Motion segmentation results on raw KLT tracks from four Formula 1 sequences. Italian Grand Prix ©2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found to be competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

References
[1] http://www.ces.clemson.edu/~stb/klt/
[2] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159-177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. In Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. In Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. In Proc. CVPR, 2003.

3 0.90165454 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

Author: Chao-Yeh Chen, Kristen Grauman

Abstract: We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process. Using unlabeled video containing various human activities, the system first learns how body pose tends to change locally in time. Then, given a small number of labeled static images, it uses that model to extrapolate beyond the given exemplars and generate “synthetic” training examples: poses that could link the observed images and/or immediately precede or follow them in time. In this way, we expand the training set without requiring additional manually labeled examples. We explore both example-based and manifold-based methods to implement our idea. Applying our approach to recognize actions in both images and video, we show it enhances a state-of-the-art technique when very few labeled training examples are available.

4 0.89915937 270 cvpr-2013-Local Fisher Discriminant Analysis for Pedestrian Re-identification

Author: Sateesh Pedagadi, James Orwell, Sergio Velastin, Boghos Boghossian

Abstract: Metric learning methods for person re-identification estimate a scaling for distances in a vector space that is optimized for picking out observations of the same individual. This paper presents a novel approach to the pedestrian re-identification problem that uses metric learning to improve the state-of-the-art performance on standard public datasets. Very high dimensional features are extracted from the source color image. A first processing stage performs unsupervised PCA dimensionality reduction, constrained to maintain the redundancy in color-space representation. A second stage further reduces the dimensionality, using a Local Fisher Discriminant Analysis defined by a training set. A regularization step is introduced to avoid singular matrices during this stage. The experiments conducted on three publicly available datasets confirm that the proposed method outperforms the state-of-the-art performance, including all other known metric learning methods. Furthermore, the method is an effective way to process observations comprising multiple shots, and is non-iterative: the computation times are relatively modest. Finally, a novel statistic is derived to characterize the Match Characteristic: the normalized entropy reduction can be used to define the 'Proportion of Uncertainty Removed' (PUR). This measure is invariant to test set size and provides an intuitive indication of performance.

5 0.89276886 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration

Author: Jiayi Ma, Ji Zhao, Jinwen Tian, Zhuowen Tu, Alan L. Yuille

Abstract: We present a new point matching algorithm for robust nonrigid registration. The method iteratively recovers the point correspondence and estimates the transformation between two point sets. In the first step of the iteration, feature descriptors such as shape context are used to establish rough correspondence. In the second step, we estimate the transformation using a robust estimator called L2E. This is the main novelty of our approach and it enables us to deal with the noise and outliers which arise in the correspondence step. The transformation is specified in a functional space, more specifically a reproducing kernel Hilbert space. We apply our method to nonrigid sparse image feature correspondence on 2D images and 3D surfaces. Our results quantitatively show that our approach outperforms state-of-the-art methods, particularly when there are a large number of outliers. Moreover, our method of robustly estimating transformations from correspondences is general and has many other applications.

6 0.89104313 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking

7 0.89095515 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

8 0.89075017 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

9 0.89049965 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

10 0.89041817 414 cvpr-2013-Structure Preserving Object Tracking

11 0.88989103 267 cvpr-2013-Least Soft-Threshold Squares Tracking

12 0.8895303 143 cvpr-2013-Efficient Large-Scale Structured Learning

13 0.889498 245 cvpr-2013-Layer Depth Denoising and Completion for Structured-Light RGB-D Cameras

14 0.88908744 325 cvpr-2013-Part Discovery from Partial Correspondence

15 0.88908482 227 cvpr-2013-Intrinsic Scene Properties from a Single RGB-D Image

16 0.8888883 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

17 0.88885969 121 cvpr-2013-Detection- and Trajectory-Level Exclusion in Multiple Object Tracking

18 0.88814282 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

19 0.88803393 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

20 0.88790077 68 cvpr-2013-Blur Processing Using Double Discrete Wavelet Transform