cvpr cvpr2013 cvpr2013-406 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 17-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
Reference: text
sentIndex sentText sentNum sentScore
1 by adding spatial factors that model mid-range and long-range dependencies inherent in the data. [sent-4, score-0.462]
2 During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. [sent-6, score-0.679]
3 Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. [sent-8, score-0.285]
4 Prediction using these models typically comprises two key steps: learning, which involves estimation of the dependencies between variables from training data, and inference, which involves estimation of the most probable values of the variables of interest under the model. [sent-13, score-0.492]
5 Pairwise Markov random field (MRF) is a widely-used variant of graphical models that incorporates dependencies between pairs of variables. [sent-17, score-0.293]
6 The dependencies in pairwise random fields are typically sparse (with few exceptions). [sent-20, score-0.287]
7 This enables efficient inference of the MAP solution under the model, but the low expressive power of such sparse MRFs limits their prediction accuracy. [sent-24, score-0.285]
8 Sequential classification is an alternative prediction mechanism that can also handle dependencies between variables. [sent-27, score-0.321]
9 [15] interpret this idea as performing generalized message-passing inference in graphical models and develop the inference machines framework to explain it. [sent-46, score-0.456]
10 Under this framework, the solution to the labelling problem is found by performing inference in a graphical model. [sent-47, score-0.304]
11 However, instead of learning the parameters of potential functions and then performing inference using a message passing algorithm, they propose to learn the messages passed by the inference algorithm directly. [sent-48, score-0.792]
12 While in theory inference machines can employ factors that span different parts of the scene, it remains unclear how to define them. [sent-53, score-0.412]
13 The source (group of variables) of the factor affects the belief about the label of the destination (individual variable) by passing a message in this direction; we incorporate prior knowledge about different kinds of contextual dependencies by means of factor types. [sent-59, score-1.39]
14 A separate prediction function is learned for each factor type, which helps to keep them simple. [sent-60, score-0.413]
15 The aim of factor type design is to model mid-range (scene context) dependencies that can be highly anisotropic, unlike close-range (low-level) dependencies. [sent-61, score-0.585]
16 We show how to use these factors to account for spatial semantic context in 3D point cloud segmentation. [sent-62, score-0.419]
17 Prior to describing the spatial inference machines framework, we review the formulation of graphical models in terms of factor graphs to settle the notation. [sent-69, score-0.693]
18 In semantic segmentation, each variable state yv may denote the label assigned to the corresponding super-pixel—a group of co-located similar image pixels or 3D points. [sent-71, score-0.435]
19 We define a factor f ∈ F as a group of variables. [sent-73, score-0.304]
20 A pairwise random field is a special case of this formulation where each factor is a set of exactly two variables. [sent-80, score-0.304]
21 For segmentation of image pixels or 3D point clouds, those pairwise factors typically correspond to the pairs of superpixels that are spatially close to each other and/or have similar appearance. [sent-81, score-0.408]
22 The potential function of any given factor is typically modeled as a parametric function defined over the variables from the factor scope and features of local scene appearance. [sent-82, score-0.776]
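To make the factor-graph formulation concrete, the sketch below shows one plausible in-code representation of variables, factors and parametric potentials; the class and field names are illustrative choices of this sketch, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class Factor:
    scope: Tuple[int, ...]   # indices of the variables in this factor's scope
    features: np.ndarray     # local appearance features of the scope (x_f)
    # Parametric potential: maps a joint label assignment of the scope,
    # together with the local features, to a non-negative score.
    potential: Callable[[Tuple[int, ...], np.ndarray], float]


@dataclass
class FactorGraph:
    num_variables: int
    num_labels: int
    factors: List[Factor] = field(default_factory=list)

    def factors_of(self, v: int) -> List[Factor]:
        """All factors whose scope contains variable v."""
        return [f for f in self.factors if v in f.scope]
```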
23 These methods work by iteratively passing messages between variables and factors. [sent-88, score-0.456]
24 The messages sent from factors to variables in the n-th iteration of the algorithm are computed as $\mu^n_{f \to v}(y_v) = \sum_{\mathbf{y}_{f \setminus \{v\}}} \phi_f(\mathbf{y}_f) \prod_{v' \in f \setminus \{v\}} \mu^n_{v' \to f}(y_{v'})$ (2), [sent-89, score-0.72]
25 where $\mu^n_{v \to f}$ are the messages from variables to factors: $\mu^n_{v \to f}(y_v) = \prod_{f' \ni v,\, f' \neq f} \mu^{n-1}_{f' \to v}(y_v)$ (3). [sent-99, score-0.425]
26 To obtain the final unnormalized estimates of marginal probabilities for each variable, messages from the corresponding factors are multiplied: $p(y_v) \propto \prod_{f \ni v} \mu^N_{f \to v}(y_v)$. [sent-108, score-0.523]
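A deliberately unoptimized rendering of updates (2)-(3) and the final marginal product, building on the FactorGraph sketch above; keying the message dictionaries by (variable, factor-id) pairs is a bookkeeping choice of this sketch, not of the paper.

```python
import itertools

import numpy as np


def factor_to_variable(factor, v, var_to_fac, num_labels):
    """Eq. (2): sum over joint assignments of f \\ {v} of the potential
    times the incoming variable-to-factor messages."""
    others = [u for u in factor.scope if u != v]
    msg = np.zeros(num_labels)
    for y_v in range(num_labels):
        for assign in itertools.product(range(num_labels), repeat=len(others)):
            joint = dict(zip(others, assign))
            joint[v] = y_v
            y_f = tuple(joint[u] for u in factor.scope)
            term = factor.potential(y_f, factor.features)
            for u in others:
                term *= var_to_fac[(u, id(factor))][joint[u]]
            msg[y_v] += term
    return msg


def variable_to_factor(graph, v, factor, fac_to_var, num_labels):
    """Eq. (3): product of previous-iteration messages from all the
    other factors that contain v."""
    msg = np.ones(num_labels)
    for g in graph.factors_of(v):
        if g is not factor:
            msg *= fac_to_var[(id(g), v)]
    return msg


def marginal(graph, v, fac_to_var):
    """Unnormalized marginal: product of all incoming factor messages,
    normalized at the end."""
    p = np.ones(graph.num_labels)
    for g in graph.factors_of(v):
        p *= fac_to_var[(id(g), v)]
    return p / p.sum()
```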
27 Note that the message-passing formalism allows us to directly deal with messages rather than factors forming the graphical model. [sent-117, score-0.575]
28 So instead of learning the factors, one may try to directly learn how to update messages that are circulating during loopy belief propagation. [sent-118, score-0.362]
29 In their method, messages to factors are treated as arbitrary functions of the previous iteration messages to factors: $\mu^n_{v \to f}(y_v) = \bar{g}^n\big(x_f,\ \bigoplus_{f' \ni v,\, f' \neq f} \mu^{n-1}_{v \to f'}\big)$ (4). [sent-121, score-0.942]
30 Here, the operation ⊕ is domain-specific and can imply, for instance, feature averaging or stacking (concatenation), and xf is a vector of appearance features for the factor scope. [sent-144, score-0.377]
31 Similar to belief propagation, inference machines recompute messages iteratively, applying (4). [sent-147, score-0.597]
32 In the last iteration, the classifier applies a function with a slightly different signature: for each variable it takes as arguments all the messages from the neighbouring factors and returns the predicted label of the variable. [sent-148, score-0.615]
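A minimal sketch of the learned update (4): the analytic message computation is replaced by a trained multi-class predictor whose probabilistic output serves as the outgoing message. Both the random-forest choice for ḡ and the use of averaging for ⊕ are assumptions of this sketch; the text only says the aggregation is domain-specific.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def learned_message(clf: RandomForestClassifier, x_f, incoming_msgs):
    """Eq. (4) sketch: aggregate previous-iteration messages (here '⊕'
    is instantiated as averaging), append the factor features x_f, and
    let the trained predictor emit a label distribution as the message."""
    aggregated = np.mean(np.asarray(incoming_msgs), axis=0)  # ⊕ = averaging
    phi = np.concatenate([np.asarray(x_f), aggregated])
    return clf.predict_proba(phi.reshape(1, -1))[0]
```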
33 The function g¯n depends on the factor f only by means of its features xf. [sent-151, score-0.304]
34 During training, the previous iteration messages are estimated on the hold-out set. [sent-154, score-0.406]
35 The rest of the data are used for fitting the prediction functions, where ground truth values of yv serve as values of the target variables. [sent-155, score-0.403]
36 Spatial inference machines. This section explains how we extend the inference machine model by employing the new notion of factor types. [sent-157, score-0.679]
37 We first define a d-factor (short for directed factor) using a pair p = (df, Sf) composed of a destination variable df and a set of source variables Sf. [sent-158, score-0.538]
38 Each d-factor f belongs to one of the factor types t(f) ∈ T, which encodes how the source variables’ labels and corresponding features impact the label of the destination. [sent-159, score-0.498]
39 We compute the messages from the source to the destination variables using learned factor-type-specific functions: $\mu^n_f(y_{d_f}) = g^n_{t(f)}\big(x_{d_f},\ p^{n-1}(y_{d_f}),\ x_f,\ x_{S_f},\ \mathbb{E}_{v \sim S_f}[p^{n-1}(y_v)]\big)$ (5). [sent-163, score-1.137]
40 The argument thus combines the destination features xdf, the previous-iteration beliefs about its label pn−1(ydf), the d-factor features xf, aggregated source features xSf, and the averaged source-set beliefs from the previous iteration, Ev∼Sf(pn−1(yv)). [sent-167, score-0.972]
41 In the n-th iteration of the algorithm, the belief (probability of each label yv) is defined as the weighted product of the d-factor messages from Sf to df: $p^n(y_v) \propto \prod_{f:\, d_f = v} \big(\mu^n_f(y_v)\big)^{\alpha_{t(f)}}$ (6), [sent-168, score-0.484]
42 where αt(f) is the weight of the response of the factor type t(f). [sent-172, score-0.373]
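The two equations compose as sketched below: g_t stands for the learned per-type prediction function of (5), and the α-weighted product (6) is taken in log space for numerical stability. All names and the small epsilon guard are choices of this sketch.

```python
import numpy as np


def dfactor_message(g_t, x_dst, prev_belief_dst, x_f, x_src, mean_src_beliefs):
    """Eq. (5) sketch: the factor-type-specific predictor g_t consumes the
    destination features, previous-iteration destination beliefs, d-factor
    features, aggregated source features and averaged source beliefs."""
    phi = np.concatenate([x_dst, prev_belief_dst, x_f, x_src, mean_src_beliefs])
    return g_t(phi)  # a distribution over the labels of d_f


def combine_messages(messages, factor_types, alpha):
    """Eq. (6) sketch: the belief is the alpha-weighted product of the
    messages of all d-factors whose destination is this variable."""
    log_p = sum(alpha[t] * np.log(m + 1e-12)
                for m, t in zip(messages, factor_types))
    p = np.exp(log_p - np.max(log_p))
    return p / p.sum()
```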
43 This causes the loss of information about the layout of the labels within the source region, but prevents overfitting and makes inference computationally tractable and independent of point density. [sent-174, score-0.355]
44 [15] (a) use the previous iteration messages that have been sent to all variables that share a factor with the marked variable. [sent-180, score-0.847]
45 Also, the scope of our predictors is smaller: they take the predicted labels for only the variables involved in a single factor, while the prediction functions used by Ross et al. [sent-185, score-0.406]
46 [15] take concatenated messages from all the variables that share a factor with v, excluding the target factor f. [sent-186, score-1.033]
47 We combine the results of the small-scope predictors explicitly using (6) to get the beliefs for the next iteration or final marginal probabilities. [sent-189, score-0.333]
48 Since message prediction functions depend on the beliefs from the previous iteration, usage of the same training set for estimation of previous iteration beliefs might lead to a biased model. [sent-193, score-0.68]
49 For a certain fold, the predictors for each factor type are trained on all the other folds, and then used to get the fold labels for the next iteration. [sent-196, score-0.49]
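A sketch of that fold scheme, assuming a random-forest message predictor and four folds (both illustrative): beliefs for each fold come only from predictors trained on the remaining folds, so no variable's next-iteration input was produced by a model that saw its ground truth.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


def heldout_beliefs(X, y, num_labels, n_folds=4, seed=0):
    """Train per-fold predictors on the other folds and fill in
    beliefs for the held-out fold only."""
    beliefs = np.zeros((len(y), num_labels))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, hold_idx in kf.split(X):
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[hold_idx])
        # predict_proba columns follow clf.classes_, which may be a subset
        # of all labels if some class is absent from the training folds;
        # labels are assumed to be integers in [0, num_labels).
        beliefs[np.ix_(hold_idx, clf.classes_)] = proba
    return beliefs
```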
50 Spatial factors are important special types of d-factors that model dependencies between variables that represent points in some coordinate space. [sent-200, score-0.601]
51 For example, factor types could be parameterized by a pair (δv, r). [sent-201, score-1.076]
52 Algorithm 1 (inference machine training) takes as input a labeled instance (x, y), a set of factors F divided into folds, a set of factor types T, and the number of iterations N. [sent-224, score-0.459]
53 In each iteration n it specifies the factor type weights αn, e.g. by setting αtn = 1 or by maximizing (7); if n < N, it then obtains, for all yv ∈ y, this iteration’s beliefs pn(yv), estimated on the hold-out sets using (6). [sent-228, score-0.503]
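Putting the pieces together, here is a rough sketch of the training flow implied by the recovered parts of Algorithm 1 (much of the original listing is lost to extraction); `fit_type_predictor` and `heldout_iteration_beliefs` are hypothetical helpers standing in for the per-type fitting and the hold-out belief estimation sketched elsewhere in this summary.

```python
import numpy as np


def train_inference_machine(X, y, dfactors_by_type, num_labels, n_iters):
    """Per iteration: fit one message predictor per factor type against
    hold-out beliefs, choose the type weights (uniform here; the paper
    alternatively maximizes (7)), then refresh beliefs via Eq. (6)."""
    beliefs = np.full((len(y), num_labels), 1.0 / num_labels)
    predictors, weights = [], []
    for n in range(n_iters):
        g_n = {t: fit_type_predictor(fs, X, y, beliefs)   # hypothetical helper
               for t, fs in dfactors_by_type.items()}
        alpha_n = {t: 1.0 for t in dfactors_by_type}      # or maximize (7)
        predictors.append(g_n)
        weights.append(alpha_n)
        if n < n_iters - 1:
            # hold-out estimate of this iteration's beliefs, as in the text
            beliefs = heldout_iteration_beliefs(g_n, alpha_n, X, y, beliefs)
    return predictors, weights
```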
54 Pairwise dependencies between neighboring points may form a special factor type too. [sent-231, score-0.585]
55 Using a large number of factor types may lead to overfitting and degradation of final predictive performance. [sent-235, score-0.459]
56 The rows contain names of factor types and their acronyms, followed by relative coordinates of the corresponding support regions. [sent-245, score-0.452]
57 L1 regularization of weights is used to remove weak factor types and leads to a sparse representation. [sent-246, score-0.538]
58 In particular, if the data lacks mid-range dependencies, the corresponding factor types’ weights are set to zero due to regularization (see Section 5). [sent-247, score-0.434]
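One concrete way to realize this selection is proximal gradient descent on the negative log-likelihood of (6) with soft-thresholding for the L1 term, sketched below; the tensor layout of `log_msgs` and every hyperparameter are assumptions of the sketch rather than values from the paper.

```python
import numpy as np


def fit_weights_l1(log_msgs, y, lam=0.1, lr=0.05, iters=500):
    """Proximal-gradient sketch for L1-regularized weight learning.
    log_msgs[i, t, l]: summed log-messages of factor type t arriving at
    variable i for label l; y[i]: ground-truth label of variable i."""
    n, T, L = log_msgs.shape
    alpha = np.ones(T)
    for _ in range(iters):
        scores = np.einsum('t,itl->il', alpha, log_msgs)
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # gradient of the negative log-likelihood w.r.t. alpha
        grad = np.einsum('il,itl->t', p, log_msgs)
        grad -= log_msgs[np.arange(n), :, y].sum(axis=0)
        alpha -= lr * grad / n
        # soft-thresholding = proximal step for the L1 penalty;
        # weights of weak factor types are driven exactly to zero
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - lr * lam, 0.0)
    return alpha
```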
59 Note that this factor type selection step is conceptually similar to the one in the things and stuff model [6], although the authors solve a different problem, i.e. object detection. [sent-250, score-0.402]
60 Their spatial dependencies of the form “the detection i is about 100 pixels away from the region j” are similar to our spatial factors. [sent-253, score-0.358]
61 Figure 2: Regions used to collect source points when the destination element is represented by a point (a) or a superpixel (b), for the Down factor type; (a) point source region, (b) superpixel source region. [sent-255, score-0.859]
62 The red sphere and planar segment denote the destination point and superpixel in the point cloud, respectively. [sent-256, score-0.405]
63 See Table 1 for the list of spatial factor types we use. [sent-276, score-0.481]
64 For example, the Down factor type (line 4) assumes that the spatial d-factor source includes all the points lower than the destination, with a 10 cm gap, in the 60 cm × 60 cm corridor (Fig. 2a). [sent-277, score-0.944]
65 Thus, for a tabletop segment, the Down factor type collects all the points below it with a 10 cm gap, including a 30 cm border (Fig. 2b). [sent-280, score-0.578]
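A geometric sketch of that source-selection rule; it assumes z is the up axis and metric units, and it approximates the superpixel case by simply widening the corridor with the 30 cm border rather than growing the segment's actual footprint.

```python
import numpy as np


def down_source_mask(points, dest_xyz, gap=0.10, half_width=0.30, border=0.0):
    """Source selection for the 'Down' factor type: points at least `gap`
    metres below the destination, inside a vertical corridor of width
    2 * (half_width + border) around it (60 cm x 60 cm for a point
    destination; `border`=0.30 approximates the superpixel case).
    `points` is an (N, 3) array with the z axis pointing up."""
    w = half_width + border
    in_x = np.abs(points[:, 0] - dest_xyz[0]) <= w
    in_y = np.abs(points[:, 1] - dest_xyz[1]) <= w
    below = points[:, 2] <= dest_xyz[2] - gap
    return in_x & in_y & below
```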
66 The basic classifiers (5) use concatenation of local feature vectors and previous iteration labels, which may depend on the factor type. [sent-296, score-0.393]
67 For spatial d-factors we concatenate local features of the destination with the mean previous-iteration labels of the source and the destination, 56 + 2|L| values in total, where |L| is the number of class labels. [sent-297, score-0.296]
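A sketch of assembling that input vector; the 56-dimensional local descriptor stands for whatever point-cloud features the system extracts, and the assertions merely document the expected sizes.

```python
import numpy as np


def spatial_dfactor_input(x_dst_local, src_label_mean, dst_label_mean):
    """Classifier input for a spatial d-factor: 56 local destination
    features concatenated with the mean previous-iteration label
    distributions of the source set and of the destination, giving
    56 + 2|L| values in total."""
    assert x_dst_local.shape == (56,)
    assert src_label_mean.shape == dst_label_mean.shape  # both length |L|
    return np.concatenate([x_dst_local, src_label_mean, dst_label_mean])
```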
68 Each of the 24 office and 28 home scenes was reconstructed from 8–9 scans. [sent-312, score-0.286]
69 We perform 4-fold cross-validation for both home and office scenes such that no scene can be shared by two or more folds. [sent-318, score-0.286]
70 The structural links used by our model correspond to the pairwise factors used in [8], which only operate on points that are labelled with one of the 17 classes. [sent-319, score-0.358]
71 This results in 690 super-pixels for the office scenes and 800 for the home scenes. [sent-322, score-0.286]
72 The model with only structural dependencies (STR) works better than the linear CRF [8], although the same structure and features are used. [sent-334, score-0.348]
73 It turns out that adding spatial factor types without learning weights (STR+SPAT), in spite of being well-motivated from a probabilistic viewpoint, performs worse than just using structural factors. [sent-336, score-0.733]
74 STR+SPAT: structural and all spatial factor types are combined, all weights are assumed to be 1. [sent-340, score-0.703]
75 STR+SPAT C: 10 factor types, learn coefficients with regularization, C = 0. [sent-341, score-0.304]
76 Figure 3: Scene example where spatial factor types improve segmentation quality; (a) source colored point cloud, (b) only structural factor types, (c) with spatial factor types. [sent-348, score-1.674]
77 The model with only structural factor types (b) misclassifies the book superpixel (on the left) and the floor superpixel (on the right), while the model with structural and spatial factor types (c) classifies the whole scene correctly. [sent-349, score-1.537]
78 In practice, large regularization coefficients lead to the spatial factor types being discarded (they are assigned zero weights). [sent-353, score-0.452]
79 In office scenes learned spatial factor types gain 1–1. [sent-354, score-0.642]
80 While a single wall is present in office data, corners are common in home data. [sent-357, score-0.344]
81 Near the corners, the relative angle to the wall is ambiguous, so “horizontal” spatial factor types are not reliable. [sent-358, score-0.579]
82 While “vertical” factor types are still reliable in this case, they are often overpowered by structural factors, which connect the superpixels within 0. [sent-359, score-0.628]
83 Figure 4: Factor type weights averaged across iterations and folds, and the rate of non-null d-factors of each type, for the Office and Home data [8]; (a) Office weights, (b) Home weights. [sent-386, score-0.432]
84 It can be seen that structural (S) factor type weights are never null, while left (Lr) and right (Rr) factor types are almost useless. [sent-387, score-1.003]
85 Spatial factor types (Table 1) were defined suboptimally, which also provides room for improvement. [sent-389, score-0.408]
86 The parametrization of spatial factor types with continuous variables potentially allows for efficient search of the bestperforming factor types using gradient-based or sampling techniques, which is a promising direction for future work. [sent-390, score-0.997]
87 The model with only structural factors misclassifies the book on the left in Fig. 3. [sent-391, score-0.429]
88 Spatial features of structural dependencies are not expressive enough to forbid cpuTop anywhere except on the top of cpuFront and cpuSide. [sent-393, score-0.384]
89 Also, structural factors are restricted to the length of 0. [sent-396, score-0.313]
90 Increasing the distance threshold for structural dependencies would lead to capturing spurious dependencies given the limited training data [8]. [sent-398, score-0.594]
91 The model with spatial factor types classifies the floor superpixel correctly. [sent-399, score-0.694]
92 For the model with the structural and 9 spatial factor types, average inference time for an office scene is 0. [sent-402, score-0.774]
93 Note that this does not include pre-processing time, where source indices and features of structural and spatial d-factors are computed. [sent-404, score-0.299]
94 Right: error after 5 iterations when the depth of trees is restricted during training. Relevance of spatial factor types. [sent-499, score-0.447]
95 The null weights mean irrelevant factor types, so the weights can give us a clue on which subset of factor types is sufficient for modeling the relations (see Fig. 4). [sent-501, score-0.884]
96 In our experiments, we observed that in the first iteration only the weights corresponding to structural factors were assigned non-zero values, for both the home and office datasets. [sent-504, score-0.684]
97 This hints that structural factors define strong relationships, which can later be improved by spatial d-factors. [sent-505, score-0.386]
98 Another explanation is that there are typically several structural factors for each destination (that have a single weight for all) and one d-factor of each spatial type (with separate weights), so the regularization penalizes the latter more. [sent-507, score-0.778]
99 During the entire training process, accuracy is uniformly higher in the model with spatial factor types than without them. [sent-512, score-0.515]
100 Conclusions. In this paper we introduce a method for 3D point cloud segmentation that builds on the inference machines framework [15] by explicitly accounting for spatial semantic context. [sent-514, score-0.519]
wordName wordTfidf (topN-words)
[('messages', 0.317), ('factor', 0.304), ('yv', 0.294), ('destination', 0.249), ('dependencies', 0.212), ('factors', 0.177), ('koppula', 0.175), ('beliefs', 0.142), ('inference', 0.14), ('structural', 0.136), ('home', 0.125), ('message', 0.122), ('office', 0.121), ('prediction', 0.109), ('variables', 0.108), ('types', 0.104), ('tabletop', 0.099), ('wall', 0.098), ('superpixel', 0.096), ('ydf', 0.096), ('munoz', 0.096), ('machines', 0.095), ('ross', 0.091), ('source', 0.09), ('iteration', 0.089), ('weights', 0.086), ('cpufront', 0.085), ('spat', 0.085), ('str', 0.084), ('superpixels', 0.084), ('labelling', 0.083), ('graphical', 0.081), ('predictors', 0.073), ('spatial', 0.073), ('xsf', 0.072), ('semantic', 0.071), ('clouds', 0.07), ('type', 0.069), ('cloud', 0.068), ('floor', 0.068), ('book', 0.067), ('pn', 0.067), ('cpuside', 0.064), ('cputop', 0.064), ('df', 0.054), ('cm', 0.053), ('sequential', 0.052), ('overfitting', 0.051), ('sf', 0.051), ('argument', 0.051), ('classifies', 0.049), ('vn', 0.049), ('folds', 0.049), ('predic', 0.048), ('shapovalov', 0.048), ('stf', 0.048), ('tabledrawer', 0.048), ('tableleg', 0.048), ('tpit', 0.048), ('xdf', 0.048), ('bagnell', 0.046), ('belief', 0.045), ('pairwise', 0.045), ('regularization', 0.044), ('ff', 0.044), ('labels', 0.044), ('xf', 0.043), ('chairbackrest', 0.043), ('fini', 0.043), ('fpi', 0.043), ('segmentation', 0.042), ('functions', 0.042), ('scenes', 0.04), ('chairbase', 0.04), ('stitched', 0.037), ('variable', 0.037), ('iterations', 0.036), ('expressive', 0.036), ('prevent', 0.036), ('colorado', 0.036), ('springs', 0.036), ('speech', 0.035), ('keyboard', 0.034), ('training', 0.034), ('label', 0.033), ('ev', 0.032), ('passing', 0.031), ('kinect', 0.031), ('june', 0.031), ('fn', 0.03), ('scope', 0.03), ('probabilistic', 0.03), ('typically', 0.03), ('point', 0.03), ('sent', 0.029), ('stuff', 0.029), ('tagging', 0.029), ('marginal', 0.029), ('colored', 0.028), ('ft', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 406 cvpr-2013-Spatial Inference Machines
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 17-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
2 0.19270375 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
Author: Yan Wang, Rongrong Ji, Shih-Fu Chang
Abstract: Recent years have witnessed a growing interest in understanding the semantics of point clouds in a wide variety of applications. However, point cloud labeling remains an open problem, due to the difficulty in acquiring sufficient 3D point labels towards training effective classifiers. In this paper, we overcome this challenge by utilizing the existing massive 2D semantic labeled datasets from decade-long community efforts, such as ImageNet and LabelMe, and a novel “cross-domain” label propagation approach. Our proposed method consists of two major novel components, Exemplar SVM based label propagation, which effectively addresses the cross-domain issue, and a graphical model based contextual refinement incorporating 3D constraints. Most importantly, the entire process does not require any training data from the target scenes, also with good scalability towards large scale applications. We evaluate our approach on the well-known Cornell Point Cloud Dataset, achieving much greater efficiency and comparable accuracy even without any 3D training data. Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.
3 0.15027755 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
Author: Gautam Singh, Jana Kosecka
Abstract: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing demonstrating the superiority of the method compared to state-of-the-art approaches.
4 0.12992802 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
Author: Matthieu Guillaumin, Luc Van_Gool, Vittorio Ferrari
Abstract: Pairwise discrete energies defined over graphs are ubiquitous in computer vision. Many algorithms have been proposed to minimize such energies, often concentrating on sparse graph topologies or specialized classes of pairwise potentials. However, when the graph is fully connected and the pairwise potentials are arbitrary, the complexity of even approximate minimization algorithms such as TRW-S grows quadratically both in the number of nodes and in the number of states a node can take. Moreover, recent applications are using more and more computationally expensive pairwise potentials. These factors make it very hard to employ fully connected models. In this paper we propose a novel, generic algorithm to approximately minimize any discrete pairwise energy function. Our method exploits tractable sub-energies to filter the domain of the function. The parameters of the filter are learnt from instances of the same class of energies with good candidate solutions. Compared to existing methods, it efficiently handles fully connected graphs, with many states per node, and arbitrary pairwise potentials, which might be expensive to compute. We demonstrate experimentally on two applications that our algorithm is much more efficient than other generic minimization algorithms such as TRW-S, while returning essentially identical solutions.
5 0.12367433 186 cvpr-2013-GeoF: Geodesic Forests for Learning Coupled Predictors
Author: Peter Kontschieder, Pushmeet Kohli, Jamie Shotton, Antonio Criminisi
Abstract: Conventional decision forest based methods for image labelling tasks like object segmentation make predictions for each variable (pixel) independently [3, 5, 8]. This prevents them from enforcing dependencies between variables and translates into locally inconsistent pixel labellings. Random field models, instead, encourage spatial consistency of labels at increased computational expense. This paper presents a new and efficient forest based model that achieves spatially consistent semantic image segmentation by encoding variable dependencies directly in the feature space the forests operate on. Such correlations are captured via new long-range, soft connectivity features, computed via generalized geodesic distance transforms. Our model can be thought of as a generalization of the successful Semantic Texton Forest, Auto-Context, and Entangled Forest models. A second contribution is to show the connection between the typical Conditional Random Field (CRF) energy and the forest training objective. This analysis yields a new objective for training decision forests that encourages more accurate structured prediction. Our GeoF model is validated quantitatively on the task of semantic image segmentation, on four challenging and very diverse image datasets. GeoF outperforms both state-of-the-art forest models and the conventional pairwise CRF.
6 0.12069701 6 cvpr-2013-A Comparative Study of Modern Inference Techniques for Discrete Energy Minimization Problems
7 0.12064551 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
8 0.11466022 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes
9 0.11235045 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
10 0.10526214 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
11 0.10298854 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
12 0.10278564 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection
13 0.098277621 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
14 0.097806796 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
15 0.096863583 335 cvpr-2013-Poselet Conditioned Pictorial Structures
16 0.095392354 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
17 0.093508072 262 cvpr-2013-Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
18 0.090989478 197 cvpr-2013-Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
19 0.090866506 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
20 0.089954421 29 cvpr-2013-A Video Representation Using Temporal Superpixels
topicId topicWeight
[(0, 0.215), (1, 0.009), (2, 0.031), (3, -0.003), (4, 0.113), (5, 0.021), (6, 0.042), (7, 0.104), (8, -0.077), (9, -0.0), (10, 0.108), (11, -0.023), (12, -0.04), (13, 0.08), (14, -0.071), (15, 0.05), (16, 0.026), (17, 0.016), (18, -0.004), (19, -0.033), (20, 0.013), (21, -0.038), (22, -0.051), (23, 0.029), (24, -0.025), (25, 0.021), (26, -0.063), (27, 0.012), (28, -0.005), (29, -0.01), (30, -0.067), (31, -0.014), (32, -0.014), (33, -0.036), (34, -0.083), (35, -0.043), (36, -0.103), (37, 0.042), (38, -0.032), (39, 0.03), (40, -0.06), (41, -0.01), (42, -0.041), (43, -0.049), (44, 0.113), (45, -0.017), (46, 0.083), (47, 0.017), (48, 0.009), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.95262915 406 cvpr-2013-Spatial Inference Machines
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 17-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
2 0.73905987 262 cvpr-2013-Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
Author: Aurélien Lucchi, Yunpeng Li, Pascal Fua
Abstract: We propose a working set based approximate subgradient descent algorithm to minimize the margin-sensitive hinge loss arising from the soft constraints in max-margin learning frameworks, such as the structured SVM. We focus on the setting of general graphical models, such as loopy MRFs and CRFs commonly used in image segmentation, where exact inference is intractable and the most violated constraints can only be approximated, voiding the optimality guarantees of the structured SVM’s cutting plane algorithm as well as reducing the robustness of existing subgradient based methods. We show that the proposed method obtains better approximate subgradients through the use of working sets, leading to improved convergence properties and increased reliability. Furthermore, our method allows new constraints to be randomly sampled instead of computed using the more expensive approximate inference techniques such as belief propagation and graph cuts, which can be used to reduce learning time at only a small cost of performance. We demonstrate the strength of our method empirically on the segmentation of a new publicly available electron microscopy dataset as well as the popular MSRC data set and show state-of-the-art results.
3 0.73547673 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
Author: Yan Wang, Rongrong Ji, Shih-Fu Chang
Abstract: Recent years have witnessed a growing interest in understanding the semantics of point clouds in a wide variety of applications. However, point cloud labeling remains an open problem, due to the difficulty in acquiring sufficient 3D point labels towards training effective classifiers. In this paper, we overcome this challenge by utilizing the existing massive 2D semantic labeled datasets from decade-long community efforts, such as ImageNet and LabelMe, and a novel “cross-domain” label propagation approach. Our proposed method consists of two major novel components, Exemplar SVM based label propagation, which effectively addresses the cross-domain issue, and a graphical model based contextual refinement incorporating 3D constraints. Most importantly, the entire process does not require any training data from the target scenes, also with good scalability towards large scale applications. We evaluate our approach on the well-known Cornell Point Cloud Dataset, achieving much greater efficiency and comparable accuracy even without any 3D training data. Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.
4 0.71737409 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation
Author: Heesoo Myeong, Kyoung Mu Lee
Abstract: We propose a novel nonparametric approach for semantic segmentation using high-order semantic relations. Conventional context models mainly focus on learning pairwise relationships between objects. Pairwise relations, however, are not enough to represent high-level contextual knowledge within images. In this paper, we propose semantic relation transfer, a method to transfer high-order semantic relations of objects from annotated images to unlabeled images, analogous to label transfer techniques where label information is transferred. We first define semantic tensors representing high-order relations of objects. The semantic relation transfer problem is then formulated as semi-supervised learning using a quadratic objective function of the semantic tensors. By exploiting the low-rank property of the semantic tensors and employing Kronecker sum similarity, an efficient approximation algorithm is developed. Based on the predicted high-order semantic relations, we reason semantic segmentation and evaluate the performance on several challenging datasets.
5 0.71245939 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
Author: Gautam Singh, Jana Kosecka
Abstract: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing demonstrating the superiority of the method compared to state-of-the-art approaches.
6 0.68348438 186 cvpr-2013-GeoF: Geodesic Forests for Learning Coupled Predictors
8 0.67521185 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
9 0.66835886 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
10 0.64935654 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
11 0.63978195 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
12 0.63418281 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
13 0.6340732 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
14 0.62505651 13 cvpr-2013-A Higher-Order CRF Model for Road Network Extraction
15 0.61685109 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
16 0.60553539 284 cvpr-2013-Mesh Based Semantic Modelling for Indoor and Outdoor Scenes
17 0.60485446 26 cvpr-2013-A Statistical Model for Recreational Trails in Aerial Images
18 0.60068262 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
19 0.57744461 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
20 0.57512641 366 cvpr-2013-Robust Region Grouping via Internal Patch Statistics
topicId topicWeight
[(10, 0.179), (12, 0.206), (16, 0.016), (19, 0.015), (26, 0.049), (28, 0.019), (33, 0.251), (67, 0.066), (69, 0.043), (80, 0.011), (87, 0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.88102889 406 cvpr-2013-Spatial Inference Machines
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 17-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
2 0.84979033 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models
Author: Fernando Flores-Mangas, Allan D. Jepson
Abstract: The problem of rigid motion segmentation of trajectory data under orthography has been long solved for nondegenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state of the art segmentation accuracies, at the expense of a large computational cost. This paper proposes a method that dramatically reduces this cost (by two or three orders of magnitude) with minimal accuracy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standard Hopkins 155 dataset). Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation.
This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3] in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then (spectral) clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic on the number of trajectories. The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles. Re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but the computational cost is close to 10 times bigger than SSC’s. In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art. 1.1. Affine Motion Projective Geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also been proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths, that are unaccounted for by the planar surface model. Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercise all of their degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real world video sequences. The incremental nature of the decompositions described above, facilitate the transition between degenerate motions and nondegenerate ones. Planar Model Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation: ⎣⎡xy1c ⎦⎤=?0D? 
1t?⎡⎣yx1w ⎦⎤= A2wD→c⎣⎡yx1w ⎦⎤, (1) where D ∈ R2×2 is an invertible matrix, and t ∈ R2 is a threarnesl Datio ∈n v Rector. Trajectory icboloer mdiantartixe,s (axnwd , tyw ∈) are in the plane’s reference frame (modulo a 2D affine transformation) and (xc, yc) are image coordinates. Now, let W ∈ R2F×P be matrix of trajectory data that conNtaoiwns, tlehet x a n∈d y image coordinates of P feature points tracked through F frames, as in TocmputehW pa=r⎢m⎣ ⎡⎢ etyx e1F r.,s 1 ofA· . ·2.Dfyx r1Fo m., P tra⎦⎥ ⎤je.ctorydat,(l2e)t C = [c1, c2 , c3] ∈ R2f×3 be three columns (three full trajectories) from W, and let = be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as: cif [ci2f−1, ci2f]? ?c1f1 c12f c1f3?= A2sD→f?c11s and A2s→Df c12s c13s?, (3) ?−1. (4) can be simply computed as: A2sD→f= ? c11f c12f c13f ? ? c11s c12s c13s The inverse in the right-hand side matrix of Eq. 4 exists so long as the points cis are not collinear. For simplicity we refer to as and consequently A2sD is the identity matrix. A2s→Df A2fD 222222556088 3D Model In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in: ⎣⎡xy1c ⎦⎤=?0?D? t1? ⎡⎣xy1w ⎦⎤+ δzw⎣⎡a 0213 ⎦⎤. The depth δzw is relative to the arbitrary plane tion is modeled by A2D; a point that lies on would have δzw = 0. We call the orthographic the plane plus parallax decomposition, the 2D Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to (5) whose mosuch plane version of Affine Plus wher⎣⎡ ixyt1c is⎤⎦cl=ear⎡⎣tha 120t hea p201a2rma2e10t3erst1o2f⎦⎤A⎣⎢⎡3Dδyxd1zwefin⎦⎥⎤ean(o6r)thographically projected 3D affine transformation. Deter- mining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work. 2. Overview of the Method The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = s{tMag1e , . . . , rMst,M a} p oiso generated using a omdeetlh hoydp tohthate csoesm Mbine =s a MRandom Sampling naenrda eCdon usseinngsu as m(ReAthNodS tAhaCt) [o4m] bteinche-s nique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than those expected necessary (M ? N) in an attempt ienlscr tehaasne tthhoes elik eexlpiheocotedd o nfe generating Mco ?rrec Nt m) iond aenl proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3). Ialnl ethste n suemcobnedr stage, msubetseertss o§f3 )m.otion models from M are ncom thebi sneecdo ntod explain ualbl sthetes trajectories mino tdheel sequence. The problem is framed as an objective function that must be minimized. 
The objective function is the negative loglikelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could be reported as a segmentation result (§5). ioTnhe r tshuilrtd ( stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6). 3. Motion Model Instantiation Each model M ∈ M is instantiated independently using RacAhN mSAodCel. MThis ∈ c Mhoi cies niss manotitiavteatded in bdeecpaeunsdee otlfy th usismethod’s well-known computational efficiency and robustness to outliers, but also because of its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings. The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data Wˆ[2F×I] ⊆ W[2F×P] (§3.1). In turn, at each RANSAC trial, the algorithm draw(§s3 uniformly d,i asttr eibaucthed R, A rNanSdoAmC subsets of three control trajectories (C[2F×3] ⊂ Wˆ[2F×I]). Each set of control trajectories is used to estim⊂ate the family of 2D affine transformations {A1, . . . , AF} between the iblyase o ffr 2aDm aef ainnde aralln sotfoherrm fartaimoness { iAn the sequence, wtwheicehn are then used to determine a complete set of model parameters M = {B, σ, C, ω}. The matrix B ∈ {0, 1}[F×I] indicates Mwhe =the {rB t,hσe ,iC-th, trajectory asthroixu Bld ∈b e predicted by model M at frame f (inlier, bif = 1) or not (outlier, bif = 0), σ = {σ1 , . . . , σF} are estimates of the magnitude of the σnois =e {foσr each fram}e a, aen eds ω ∈at {s2 oDf, t3hDe} m isa tnhietu edsetim ofa ttehde nmooidseel f type. hTh fera goal aisn dto ω ωfin ∈d {t2heD c,3oDntr}ol is points tainmda ttehed associated parameters that minimize the objective function O(Wˆ,M) =f?∈Fi?∈IbifLω? wˆif| Af,σf?+ Ψ(ω) + Γ(B) across (7) wˆfi a number of RANSAC trials, where = = are the coordinates of the i-th trajectory from the support subset at frame f. The negative log-likelihood term Lω (·) penalizes reconstruction error, while Ψ(·) and Γ(·) are regularizers. Tcohen tthrureceti otenr mer-s are ,d wefhinileed Ψ Ψ b(e·l)ow an. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, Ψ(ω) regularizes over model complexity using: (xif, yif) ( wˆ 2if−1, wˆ i2f) Wˆ Ψ(ω) =?86((FF − − 1 1)), i f ωω== 32DD. 222222556199 (8) Γ(B) strongly penalizes models that describe too few trajectories: Γ(B) =?0∞,, oifth?erwI?iseFbif< Fλi (9) The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M. 2D likelihoods. For the planar case (ω = 2D) the negative log-likelihood term is evaluated with: L2D( wˆif| Af,σf) = −log?2π|Σ1|21exp?−21rif?Σ−1rif??, (10) which is a zero-mean 2D Normal distribution evaluated at the residuals The spherical covariance matrix is Σ = rif. rif (σf)2I. 
The residuals are determined by the differences between the predictions made by a hypothesized model Af, and the observations at each frame ?r?1f?=? w˜1?f?− Af? w˜1?s?. (11) 3D likelihoods. The negative log-likelihood term for the 3D case is based on the the 2DAPE decomposition. The 2D affinities Af and residuals rf are reused, but to account for the effect of relative depth, an epipolar line segment ef is robustly fit to the residual data at each frame (please see supplementary material for details on the segment fitting algorithm). The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector ef indicates the orthogonal direction to ef, then the negativ⊥e log-likelihood term for the 3D case is estimated with: L3D( wˆfi| Af,σf) = −2log⎜⎝⎛√21πσfexp⎪⎨⎪⎧−?r2if(?σfe)f⊥2?2⎪⎬⎪⎫⎞⎟⎠, ⎠(12,) which is also a zero-mean 2D Norma⎩l distribution ⎭computed as the product of two identical, separable, singlevariate, normal distributions, evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of ef , which is analyti- cally computed using rif?ef. The seco⊥nd one corresponds to an estimate of the deviat⊥ion in the perpendicular direction (ef), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to rif ? ef, which is a plausible estimate under the isotropic noise as⊥sumption. Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as it is typical for RANSAC. Instead, we found that better motion models resulted from Algorithm 1: Motion model instantiation × Algorithm 1: Motion model instantiation Input:b Traasejec frtoamrye d bata W[2F×P], number of RANSAC trials K, arbitrary Output: Parameters of the motion model M = {B , σn , ω} // determine the training set c ← rand( 1, P) ; r ← rand(rmin , rmax ) // random center and radius I] ← t ra j e ct oriesWithinDis k (W, r,c) // support subset X ← homoCoords(Wˆb) // points at base frame for K RANSAC trials do Wˆ[2F Wˆ return M = {B} optimizing over the accuracy ofthe model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly uncounted. Figure 1.b shows an example of class-labeled trajectory data, 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model’s control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine Wˆ, as well as B, σ, and ω follow. 3.1. Local Coherence The subset of trajectories Wˆ given to RANSAC to generate a model M is constrained to a spatially local region. The probability ofchoosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is: p = 1 − (1 − r3)k. This means that the number of trials pne =ede 1d −to (fi1n d− a subset of 3 inliers with probability p is k =lloogg((11 − − r p3)). (13) A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing 222222666200 Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. 
The magnitude of the extrapolation error changes with the distance between the control points. k, and with it, RANSAC’s computation time by a proportional amount. The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors. Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by Wˆ = {wi | (xbi − ox)2 + (ybi − oy)2 < r2}, (14) where (ox , oy) are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory (ox , oy) = (xbi, yib) with uniformly distributed i ∼ U(1, P) and base frame b) w∼i h U u(1n,i fFor)m. yTo d asltrloibwu efodr idi ∼ffer Ue(n1t, region ds bizaesse, tfhraem read bius ∼ r is( ,cFho)s.en T ofro amllo a u fnoirfo dirmffe rdeinsttr riebugtiioonn r ∼s, tUh(erm raidni,u ursm rax i)s. Ihfo tsheenre f are mI a trajectories swtritihbiunt othne support region, then ∈ R2F×I. It is worth noting that the construction of theW support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence will likely contain trajectories that originate from more than one independently moving object (Figure 3). Wˆ Wˆ Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class. 4. Characterizing the Residual Distribution At each RANSAC iteration, residuals rf are computed using the 2D affine model Af that results from the constraints provided by the control trajectories C. Characterizing the distribution of rf has three initial goals. The first one is to determine 2D model inliers b2fD (§4.1), the second one is to compute estimates of the magnitude ,o tfh thee s ncooinsed at every frame σ2fD (§4.2), and the third one is to determine whether the residual( §d4i.s2t)r,ib auntidon th originates efr iosm to a planar or a 3D object (§4.3). If the object is suspected 3D, then two more goals n (§e4ed.3 )to. bIfe t achieved. sT shues pfiercstt one Dis, t hoe nde ttweromine 3D model inliers b3fD (§4.4), and the second one is to estimate the magnitude of the noise of a 3D model (§4.5). (σf3D) to reflect the use 4.1. 
4.1. 2D Inlier Detection

Suppose the matrix Wˆ contains trajectories Wˆ₁ ∈ R^{2F×I} and Wˆ₂ ∈ R^{2F×J} from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance, η ∼ N(0, (σ_f)² I):

Wˆ = [Wˆ₁ | Wˆ₂] + η.  (15)

Now, assume we know the true affine transformations A_f^1 and A_f^2 that describe the motion of trajectories in the subsets Wˆ₁ and Wˆ₂, respectively. If A_f^1 is used to compute predictions for all of Wˆ (at frame f), the expected value (denoted by ⟨·⟩) of the magnitude of the residuals (r_f from Eq. 11) for trajectories in Wˆ₁ will be on the order of the magnitude of the underlying noise: ⟨|r_f^i|⟩ = σ_f for each i ∈ {1, ..., I}. But in this scenario, trajectories in Wˆ₂ will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential: ⟨|r_f^i|⟩ = ⟨|(A_f^1 − A_f^2) ŵ_b^i|⟩. If we can assume that the motion differential is bigger than the displacement due to noise,

|(A_f^1 − A_f^2) ŵ_b^i| > σ_f,  (16)

then the model inliers can be determined by thresholding |r_f^i| with the magnitude of the noise, scaled by a constant (τ = λ_σ σ_f):

b_f^i = 1 if |r_f^i| ≤ τ, and 0 otherwise.  (17)

But because σ_f is generally unknown, the threshold τ is estimated from the residual data. To do so, let rˆ be the vector of sorted residual magnitudes, where rˆ_i ≤ rˆ_{i+1}. Now, let r˜ = median_i(rˆ_{i+1} − rˆ_i). The threshold is then defined as

τ = min{ rˆ_i | (rˆ_{i+1} − rˆ_i) > λ_r r˜ },  (18)

which corresponds to the smallest residual magnitude before a salient magnitude gap (sketched in code below). Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals. Notice the presence of a (low-density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin) and the rest.

4.2. Magnitude of the Noise, 2D Model

Let rˆ_f^{2D} contain only the residuals of the inlier trajectories (those where b_f^i = 1), and let USVᵀ be the singular value decomposition of the covariance matrix of rˆ_f^{2D}:

USVᵀ = svd( (1/|b_f|) (rˆ_f^{2D})(rˆ_f^{2D})ᵀ ).  (19)

Then the magnitude of the noise corresponds to the largest singular value, σ² = s₁, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of the largest over the smallest singular value (s₁/s₂) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio, s₁/s₂ ≫ 1. If, on the other hand, s₁/s₂ ≈ 1, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.

4.4. 3D Inlier Detection

When the residual distribution is elongated (s₁/s₂ ≫ 1), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform.
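Before continuing with 3D inlier detection, note that the adaptive threshold of Eq. 18 reduces to a sort, a diff, and a median. The sketch below is our own reading of it; the fallback when no salient gap exists (keeping all residuals as inliers) is our assumption.

```python
import numpy as np

def gap_threshold(res_mag, lam_r):
    """Estimate tau from the residual magnitudes at one frame, Eq. (18)."""
    r_hat = np.sort(np.asarray(res_mag))      # sorted magnitudes, r_i <= r_{i+1}
    gaps = np.diff(r_hat)
    r_tilde = np.median(gaps)                 # typical inter-residual gap
    salient = np.nonzero(gaps > lam_r * r_tilde)[0]
    # Smallest residual magnitude before a salient magnitude gap; if no
    # salient gap exists we (by assumption) keep every residual as an inlier.
    return r_hat[salient[0]] if salient.size else r_hat[-1]

# 2D inliers, Eq. (17): b_f = res_mag <= gap_threshold(res_mag, lam_r)
```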
Further details about this algorithm can be found in the supplementary material.

Inlier detection. The resulting line segment is used to determine 3D model inliers. Trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of r_f^i onto the line must lie within the segment limits: β ≤ r_f^i · e_f ≤ γ. Second, the normalized distance to the line must be below a threshold: r_f^i · e_f⊥ ≤ σ₂ λ_d. Notice that the threshold depends on the smallest singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line (e_f⊥).

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let rˆ_f^{3D} contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of rˆ_f^{3D} (as in Eq. 19). In this case, the largest singular value s₁ captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, s₂ captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making σ² = s₂ a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the best model across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming the pool M = {M₁, ..., M_M}.

5. Optimal Model Subset

The next step is to find the model combination M* ⊂ M that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modelling overlap. For this purpose, let M_j = {M_{j,1}, ..., M_{j,N}} be the j-th model combination, and let {M_j} be the set of all C_N^M = M!/(N!(M−N)!) combinations of N-sized models that can be drawn from M. The model selection problem is then formulated as

M* = argmin_{M_j} O_S(M_j),  (20)

where the objective is

O_S(M_j) = Σ_{n=1..N} Σ_{p=1..P} π_{p,n} E(w_p, M_{j,n}) + λ_Φ Σ_{p=1..P} Φ(w_p, M_j) + λ_Ψ Σ_{n=1..N} Ψ(M_{j,n}).  (21)

The first term accounts for prediction accuracy; the other two are regularization terms. Details follow.

Prediction accuracy. In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3.1). Instead, model parameters are computed with a factorization-based method [10]. Given the inlier labeling B in M, let W_B be the subset of trajectories where b_f^i = 1 for at least half of the frames. The orthonormal basis S of a 2D (or 3D) motion model can be determined by the ω = 2 (or 3) leading left singular vectors of W_B. Using S as the model's motion matrices, prediction accuracy can be computed using

E(w, M) = ‖S Sᵀ w − w‖²,  (22)

which is the sum of squared Euclidean deviations from the predictions (S Sᵀ w) to the observed data (w).
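Eq. 22 with the factorization-based basis of [10] can be sketched in a few lines. This is our own reading, not the authors' code: the SVD convention (leading left singular vectors of the stacked inlier matrix) and the function signature are assumptions.

```python
import numpy as np

def prediction_error(W_B, w, omega):
    """Sketch of Eq. (22): score a trajectory w against one motion model.

    W_B  : (2F, |B|) inlier trajectories of the model, stacked as columns
    omega: 2 for a 2D motion model, 3 for a 3D one
    """
    U, _, _ = np.linalg.svd(W_B, full_matrices=False)
    S = U[:, :omega]                 # orthonormal basis: leading left singular vectors
    dev = S @ (S.T @ w) - w          # deviation between projection and observation
    return float(dev @ dev)          # E(w, M) = ||S S^T w - w||^2
```

Because S is estimated from all inliers rather than from three control points, the score degrades gracefully for trajectories far from the support region, which is exactly the failure mode of re-using the RANSAC affinities.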
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise.

Ownership variables. Ownership variables Π ∈ {0, 1}^{P×N} indicate whether trajectory p is explained by the n-th model (π_{p,n} = 1) or not (π_{p,n} = 0), and are determined by maximum prediction accuracy (i.e., minimum Euclidean deviation):

π_{p,n} = 1 if M_{j,n} = argmin_{M ∈ M_j} E(w_p, M), and 0 otherwise.  (23)

Regularization terms. The second term in Eq. 21 penalizes situations where multiple models explain a trajectory w with relatively small residuals. For brevity, let Ê(w, M) = exp{−E(w, M)}; then

Φ(w, M_j) = −log[ max_{M ∈ M_j} Ê(w, M) / Σ_{M ∈ M_j} Ê(w, M) ].  (24)

The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants λ_Φ and λ_Ψ modulate the effect of the corresponding regularizer.

Table 1: Accuracy and run-time for the H155 dataset (columns: average accuracy [%] and computation time [s]; rows: SSC [3], CPA [14], RANSAC, and Ours). Naive RANSAC is included as a baseline, with overall accuracy and total computation time estimated using data from [12].

6. Refinement

The optimal model subset M* yields ownership variables Π*, which can already be interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings Π_t from the top T subsets {M_t* | 1 ≤ t ≤ T} closest to optimal. Multiple labellings are incorporated into an affinity matrix F, where the entry f_{i,j} indicates the frequency with which trajectory i is given the same label as trajectory j across all T labellings, weighted by the relative objective function Õ_t = exp{ −O_S(W | M_t*) / O_S(W | M*) } for each labelling:

f_{i,j} = Σ_{t=1..T} (π_{i,:}^t · π_{j,:}^t) Õ_t / Σ_{t=1..T} Õ_t.  (25)

Note that the inner product between the label vectors (π_{i,:}^t · π_{j,:}^t) is equal to one only when the labels are the same (a short code sketch of this weighting is given further below). A spectral clustering method is applied to F to produce the method's final segmentation result.

7. Experiments

Evaluation was made through three experimental setups.

Hopkins 155. The Hopkins 155 (H155) dataset has been the standard evaluation benchmark for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm runs close to 2 or 3 orders of magnitude faster than the state-of-the-art methods, with minimal accuracy loss. Computation times were measured on the same (or very similar) hardware architectures. Like CPA, our implementation uses a single set of parameters for all the experiments, but as others have pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper.

Figure 4: Accuracy error bars across artificial H155 datasets with controlled levels of Gaussian noise.

Artificial noise. The second experimental setup explores a dimension missing from the H155 dataset: noise. The goal is to determine the effect of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art.
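The label-affinity weighting of §6 (Eq. 25, referenced above) is sketched below. It is our own illustration: we assume each labelling is stored as a one-hot ownership matrix Π_t of size P×N_t, that the objective values are positive, and that the minimum objective corresponds to the optimal subset M*.

```python
import numpy as np

def label_affinity(Pi_list, objectives):
    """Sketch of Eq. (25): affinity matrix F from the top-T labellings.

    Pi_list    : list of T one-hot ownership matrices, each of shape (P, N_t)
    objectives : O_S(W | M_t*) for each labelling; min() plays the role of O_S(W | M*)
    """
    obj = np.asarray(objectives, dtype=float)
    O_t = np.exp(-obj / obj.min())            # relative objective weights O~_t
    # (Pi @ Pi.T)[i, j] is 1 iff trajectories i and j share a label in labelling t
    F = sum(o * (Pi @ Pi.T) for o, Pi in zip(O_t, Pi_list))
    return F / O_t.sum()                      # entries f_ij lie in [0, 1]
```

Spectral clustering of the resulting F then yields the final segmentation, with labellings close to optimal contributing most of the weight.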
We noted that H155 contains structured, long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of the (mean-subtracted) trajectories of each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes σ_n ∈ {0.01, 0.25, 0.5, 1, 2, 4, 8}. Our method, as well as SSC and CPA, was run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA was generated with only one run per dataset (running time: 11.95 days). The graph indicates that our method compromises accuracy only at large noise levels, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

KLT tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions, using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

Figure 5: Motion segmentation results on raw KLT tracks from four Formula 1 sequences. Italian Grand Prix ©2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.

References

[1] http://www.ces.clemson.edu/~stb/klt
[2] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. In Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159-177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. In Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade.
Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Soc. of London, Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. In Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. In Proc. CVPR, 2003.
3 0.83275747 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors' ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
4 0.83222377 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van_der_Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
5 0.828996 314 cvpr-2013-Online Object Tracking: A Benchmark
Author: Yi Wu, Jongwoo Lim, Ming-Hsuan Yang
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
6 0.82867646 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
7 0.82742566 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
8 0.82733583 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning
9 0.82633233 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
10 0.82550067 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
11 0.82422489 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
12 0.82401937 186 cvpr-2013-GeoF: Geodesic Forests for Learning Coupled Predictors
13 0.8232373 325 cvpr-2013-Part Discovery from Partial Correspondence
14 0.82314432 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems
15 0.82231677 462 cvpr-2013-Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines
16 0.82172585 131 cvpr-2013-Discriminative Non-blind Deblurring
17 0.81980985 154 cvpr-2013-Explicit Occlusion Modeling for 3D Object Class Representations
18 0.81961977 198 cvpr-2013-Handling Noise in Single Image Deblurring Using Directional Filters
19 0.81921124 360 cvpr-2013-Robust Estimation of Nonrigid Transformation for Point Set Registration
20 0.8185941 386 cvpr-2013-Self-Paced Learning for Long-Term Tracking