cvpr cvpr2013 cvpr2013-270 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sateesh Pedagadi, James Orwell, Sergio Velastin, Boghos Boghossian
Abstract: Metric learning methods, for person re-identification, estimate a scaling for distances in a vector space that is optimized for picking out observations of the same individual. This paper presents a novel approach to the pedestrian re-identification problem that uses metric learning to improve the state-of-the-art performance on standard public datasets. Very high dimensional features are extracted from the source color image. A first processing stage performs unsupervised PCA dimensionality reduction, constrained to maintain the redundancy in color-space representation. A second stage further reduces the dimensionality, using a Local Fisher Discriminant Analysis defined by a training set. A regularization step is introduced to avoid singular matrices during this stage. The experiments conducted on three publicly available datasets confirm that the proposed method outperforms the state-of-the-art performance, including all other known metric learning methods. Furthermore, the method is an effective way to process observations comprising multiple shots, and is non-iterative: the computation times are relatively modest. Finally, a novel statistic is derived to characterize the Match Characteristic: the normalized entropy reduction can be used to define the ’Proportion of Uncertainty Removed’ (PUR). This measure is invariant to test set size and provides an intuitive indication of performance.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper presents a novel approach to the pedestrian re-identification problem that uses metric learning to improve the state-of-the-art performance on standard public datasets. [sent-2, score-0.113]
2 Very high dimensional features are extracted from the source color image. [sent-3, score-0.129]
3 A first processing stage performs unsupervised PCA dimensionality reduction, constrained to maintain the redundancy in color-space representation. [sent-4, score-0.218]
4 A second stage further reduces the dimensionality, using a Local Fisher Discriminant Analysis defined by a training set. [sent-5, score-0.106]
5 Finally, a novel statistic is derived to characterize the Match Characteristic: the normalized entropy reduction can be used to define the ’Proportion of Uncertainty Removed’ (PUR). [sent-9, score-0.137]
6 The main challenges in person reidentification can be attributed to the variations in pose, illumination and viewpoint. [sent-13, score-0.505]
7 Many researchers have approached the problem by proposing feature-based methods which estimate a reliable and distinctive signature per person regardless of the scene by combining one or more extracted feature types. [sent-15, score-0.204]
8 A recent study [4] evaluated the performance of various local features for person re-identification: it has been proposed [10] to augment maximally stable color regions (MSCR) with histograms and recurrent local color patches. [sent-17, score-0.436]
9 A similar proposal [5] is to use histograms along with ‘epitome’ - a collection of recurrent stable color patches extracted over a series of frames. [sent-18, score-0.155]
10 Person re-identification can be formulated as a data association problem, to match observations obtained from pairs of cameras. [sent-39, score-0.117]
11 In this case, it is assumed that a mapping space common to both sets of observations can be found, in which a nearest neighbor classification solves the data association problem. [sent-40, score-0.112]
12 Dimensionality reduction and distance metric learning are known to have a significant contextual dependency. [sent-48, score-0.136]
13 In the case of unsupervised distance metric learning, a low-dimensional manifold space is estimated for which the objective is to maximize the preservation of geometric relationships. [sent-49, score-0.154]
14 d(x, y) = (y − x)⊤Dm(y − x) (2) Typically, a single embedding space based on the above approximation may not be adequate for this class of nearest neighbor classification problems. [sent-56, score-0.098]
15 Hence, the idea of projecting simple features into different, local, embedding spaces is explored in this work. [sent-57, score-0.123]
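To make Eq. 2 concrete, here is a minimal sketch (not from the paper; numpy only, all names hypothetical) showing that a metric of the form Dm = TT⊤ is equivalent to a plain Euclidean distance measured after projecting the data with T, which is the view taken by the two-stage method described next.

```python
import numpy as np

def metric_distance_sq(x, y, T):
    """Squared metric distance (y - x)^T Dm (y - x) with Dm = T @ T.T,
    computed as a Euclidean distance in the embedded space z = T.T @ x."""
    diff = T.T @ (y - x)
    return float(diff @ diff)

# Toy usage with a random 10-D -> 3-D projection.
rng = np.random.default_rng(0)
T = rng.standard_normal((10, 3))
x, y = rng.standard_normal(10), rng.standard_normal(10)
d2 = metric_distance_sq(x, y, T)
```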
16 The first and second stages of dimensionality reduction are described in sections 2. [sent-63, score-0.19]
17 Proposed Method The proposed method LF uses a feature extraction technique followed by unsupervised and then supervised dimensionality reduction stages. [sent-71, score-0.284]
18 Feature Extraction The color descriptor ui is extracted for the observation indexed with i, working in the HSV color space. [sent-75, score-0.227]
19 It is the concatenation of parametric c and non-parametric h representations of colors in each of the m densely sampled (overlapping) tiles defined within the observation (see Fig. [sent-76, score-0.108]
20 For each tile j, cj comprises the first three moments calculated separately on each of the three components in the color space. [sent-78, score-0.180]
21 In addition, the color descriptor vi is defined identically except that it uses the YUV color space as the input. [sent-80, score-0.228]
22 Both of these vectors have 33m dimensions: in the experiments, m = 341, implying an initial dimensionality of 11,253. [sent-81, score-0.104]
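A rough sketch of this descriptor follows, assuming OpenCV's 8-bit channel layout and a VIPeR-sized 128x48 image; the file name, and the choice of mean/standard deviation/third central moment as the "first three moments", are assumptions rather than details confirmed by the text.

```python
import cv2
import numpy as np

def tile_descriptor(tile):
    """33-dim descriptor for one tile: per channel, an 8-bin histogram
    plus the first three moments (here: mean, std, third central moment)."""
    feats = []
    for c in range(3):
        ch = tile[..., c].astype(np.float64).ravel()
        hist, _ = np.histogram(ch, bins=8, range=(0, 256), density=True)
        feats.extend(hist)
        feats.extend([ch.mean(), ch.std(), ((ch - ch.mean()) ** 3).mean()])
    return np.asarray(feats)          # 3 * (8 + 3) = 33 values

def color_descriptor(bgr, code):
    """Concatenated tile descriptors over a dense 8x8 grid with
    50% overlap (stride 4), in the requested color space."""
    img = cv2.cvtColor(bgr, code)
    h, w = img.shape[:2]
    tiles = [tile_descriptor(img[y:y + 8, x:x + 8])
             for y in range(0, h - 7, 4)
             for x in range(0, w - 7, 4)]
    return np.concatenate(tiles)

# For a 128x48 image this gives 31 * 11 = 341 tiles, i.e. a
# 341 * 33 = 11,253-dimensional vector per color space.
image = cv2.imread("probe.png")                 # hypothetical file
u = color_descriptor(image, cv2.COLOR_BGR2HSV)  # HSV descriptor u_i
v = color_descriptor(image, cv2.COLOR_BGR2YUV)  # YUV descriptor v_i
```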
23 Unsupervised Dimensionality Reduction The first stage employs an unsupervised dimensionality reduction technique to estimate a low dimensional embedding space from the high dimensional feature space. [sent-85, score-0.506]
24 Let u denote the descriptor matrix of all individuals in HSV color space. [sent-92, score-0.119]
25 Similarly, v is the color descriptor matrix in YUV color space. [sent-93, score-0.231]
26 The use of color descriptors defined in more than one color space is useful for estimating a reliable embedding space in the subsequent stages. [sent-94, score-0.284]
27 The descriptor matrix u is typically high dimensional, and the accumulation of descriptors from a dense grid is likely to introduce noise. [sent-95, score-0.159]
28 For a given color space, the matrix u is de-noised, and its dimensions reduced, by using Principal Component Analysis (PCA) [20]. [sent-96, score-0.112]
29 [Figure: proposed feature extraction and first processing stage.] The data projected into the low-dimensional manifold estimated by PCA is written ui′ [sent-98, score-0.125]
30 = Duui, where Du is the embedding transformation matrix corresponding to the Eigenvectors derived from PCA. [sent-99, score-0.147]
31 The overall output xi from the first stage is the concatenation of the two sets of principal components: xi = {ui′, vi′}. [sent-103, score-0.249]
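A minimal sketch of this first stage, assuming scikit-learn's PCA and random stand-ins for the descriptor matrices (the 20/80 component split is the one reported later in the experiments):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
U = rng.standard_normal((200, 11253))   # stand-in HSV descriptors u_i
V = rng.standard_normal((200, 11253))   # stand-in YUV descriptors v_i

# One PCA per color space; the color-space redundancy is kept by
# reducing each space separately rather than jointly.
pca_hsv = PCA(n_components=20).fit(U)
pca_yuv = PCA(n_components=80).fit(V)

# Stage-1 output: x_i = {u_i', v_i'}, a 100-dim vector per observation.
X = np.hstack([pca_hsv.transform(U), pca_yuv.transform(V)])
```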
32 Supervised Dimensionality Reduction The second stage uses a supervised dimensionality reduction method to estimate a lower dimensional embedding space into which xi may be transformed. [sent-109, score-0.509]
33 In this work, Local Fisher Discriminant Analysis (LFDA) [29] is employed to learn a distance metric between a set of descriptor pairs corresponding to pairs of observations of a set of individuals. [sent-111, score-0.172]
34 It combines the supervised aspect of Fisher Discriminant Analysis (FDA) [11] with the local neighborhood structure preservation of Locality-Preserving Projection (LPP) [17]. [sent-113, score-0.117]
35 zi = T⊤xi (3) where zi ∈ Rm (1 ≤ m ≤ d) is the projection of xi into a lower dimensional space of dimensionality m. [sent-130, score-0.269]
36 Then, an eigenvalue problem can be formulated using S(w) and S(b) as S(b)ϕ = λS(w)ϕ, where {ϕi}i=1d are the eigenvectors associated with the eigenvalues {λ1 ≥ λ2 ≥ · · · ≥ λd}. [sent-146, score-0.105]
37 The Locality-Preserving Projection (LPP) uses an n × n affinity matrix A, describing the affinity between various samples within the data. [sent-155, score-0.168]
38 Typically, the affinity value between two samples separated by a small Euclidean distance will be higher than that between two samples separated by a larger distance. [sent-156, score-0.138]
39 Tlpp = (ϕ1 |ϕ2 | · · · |ϕd) (6) where {ϕi} are the d eigenvectors associated with the d eigenvalues {γi} of the following eigenvalue problem: XLX⊤ϕ = γXDX⊤ϕ. [sent-165, score-0.153]
40 LFDA combines the above with Fisher Discriminant Analysis: both the within-class scatter matrix SW and the between-class scatter matrix SB are weighted by the affinity matrix A of the data. [sent-170, score-0.271]
41 With the use of the affinity matrix, the contribution made by far-apart in-class pairs of samples is almost negligible in the calculation of SW and SB: SW = (1/2) Σi,j Wij(w)(xi − xj)(xi − xj)⊤, with SB defined analogously using the between-class weights Wij(b). [sent-174, score-0.102]
42 Similar to FDA, the estimation of Tlfda ∈ Rd×m is achieved by representing the above as a generalized eigenvalue problem, SBϕ = λSWϕ, where {ϕi} and {λi} are the eigenvectors and eigenvalues of this system. [sent-192, score-0.105]
43 Using equation 3, the final projection into the embedding space characterized by LFDA is written as zi = Tlfda⊤xi. [sent-193, score-0.136]
44 Regularization in LFDA The size of the transformation matrix in LFDA is d × m, where d is the dimensionality after stage 1 and m is the dimensionality of the embedding space. [sent-196, score-0.324]
45 The resulting eigenvectors constitute the final transformation matrix, Tlfda, for projecting X into the space in which similarities can be measured. [sent-207, score-0.14]
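The sketch below is one plausible reading of this stage (not the authors' code): a local-scaling affinity, affinity-weighted scatter matrices, and a ridge-style regularizer on SW so the generalized eigenproblem stays well posed. The weighting scheme follows the standard LFDA formulation [29]; the regularization constant and neighborhood size are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, m, knn=7, reg=1e-3):
    """Regularized LFDA sketch: returns T_lfda (d x m) with z = T.T @ x."""
    n, d = X.shape
    # Pairwise squared distances and a local-scaling affinity matrix A.
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = np.sqrt(np.sort(D2, axis=1)[:, knn])       # k-th neighbor scale
    A = np.exp(-D2 / np.maximum(np.outer(sigma, sigma), 1e-12))

    same = y[:, None] == y[None, :]
    nc = np.array([np.sum(y == c) for c in y], dtype=float)  # class sizes
    Ww = np.where(same, A / nc[:, None], 0.0)          # within-class weights
    Wb = np.where(same, A * (1.0 / n - 1.0 / nc[:, None]), 1.0 / n)

    def scatter(W):
        # 0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T as a Laplacian form.
        return X.T @ (np.diag(W.sum(axis=1)) - W) @ X

    Sw, Sb = scatter(Ww), scatter(Wb)
    # Regularization step: keep Sw non-singular before solving Sb v = lam Sw v.
    evals, evecs = eigh(Sb, Sw + reg * np.eye(d))
    order = np.argsort(evals)[::-1][:m]                # m largest eigenvalues
    return evecs[:, order]

# Usage on the 100-dim stage-1 output X with person labels y:
# T = lfda(X, np.asarray(y), m=40); Z = X @ T
```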
46 Performance Evaluation The standard methodology to evaluate person reidentification algorithms is adopted here, with one additional contribution. [sent-209, score-0.469]
47 This methodology assumes separate training and test sets, both containing exactly two observations of a number of different subjects. [sent-210, score-0.113]
48 The test procedure is to select a ‘probe’ observation and compare against a ‘gallery’ of observations including the single remaining ‘correct’ observation of this probe subject; the remainder of the gallery are distractors. [sent-212, score-0.219]
49 Each re-identification method is required to rank the gallery observations, in order of their likelihood of representing the same individual as the probe observation. [sent-213, score-0.261]
50 The rank of the correct observation is recorded and aggregated over the entire test set to generate a Match Characteristic M(r), the probability that the rank r choice will be correct. [sent-214, score-0.271]
51 In these particular experimental conditions, in which a probe observation is compared against a gallery set of S observations, let us assume that each member of the set has an equal prior probability of being correct. [sent-220, score-0.203]
52 After a measurement (by any chosen method), the members of the gallery set are ranked by order of their similarity to the probe measurement. [sent-222, score-0.141]
53 Assuming a stationary and ergodic data source, the posterior probability that the member at rank r is the correct match is equal to the match characteristic M(r). [sent-223, score-0.218]
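A small sketch of this evaluation loop, assuming a precomputed probe-gallery distance matrix (all names hypothetical):

```python
import numpy as np

def match_characteristic(dist, probe_ids, gallery_ids):
    """M(r): fraction of probes whose correct gallery match lands at rank r.
    `dist` has shape (n_probes, gallery_size S)."""
    n_probes, S = dist.shape
    M = np.zeros(S)
    order = np.argsort(dist, axis=1)             # ascending distance
    for p in range(n_probes):
        rank = int(np.nonzero(gallery_ids[order[p]] == probe_ids[p])[0][0])
        M[rank] += 1.0
    return M / n_probes

# The CMC curve reported in the experiments is its cumulative sum:
# cmc = np.cumsum(match_characteristic(dist, probe_ids, gallery_ids))
```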
54 Furthermore, it is useful to normalise this entropy reduction by the original entropy: this ‘Proportion of Uncertainty Removed’ (PUR) is invariant to the test data set size S, and the base of the logarithm used for the entropy:
55 PUR = (log(S) + Σr=1S M(r) log(M(r))) / log(S) (16) Thus, the PUR is the normalised entropy reduction between a randomized rank and any given method’s output rank. [sent-230, score-0.257]
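A direct implementation of Eq. 16 as reconstructed above (a sketch; the clipping guard for empty ranks is an assumption):

```python
import numpy as np

def pur(M):
    """Proportion of Uncertainty Removed from the match characteristic M(r):
    (log S - H(M)) / log S, with H(M) = -sum_r M(r) log M(r)."""
    S = len(M)
    M = np.clip(np.asarray(M, dtype=float), 1e-12, 1.0)  # guard log(0)
    return (np.log(S) + np.sum(M * np.log(M))) / np.log(S)

# A perfect method (M(1) = 1) gives PUR ~ 1; a random ranking
# (M(r) = 1/S for all r) gives PUR = 0, independent of S and log base.
```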
56 [Figure: VIPER dataset - pairs of consecutive images belong to a person in 2 cameras.] The dataset is divided into testing and training in a random fashion over multiple runs in the experimental set-up. [sent-241, score-0.278]
57 In each run, the training images are divided into 8x8 blocks with 50% overlap, from which 8-bin histograms and 3 moments are estimated for each of the color channels in the HSV and YUV color spaces. [sent-242, score-0.320]
58 For a given color space and a block, the resulting histograms and moments within that block are concatenated as a single vector of 33 dimensions. [sent-243, score-0.242]
59 In the same color space, for each image, the concatenation of all color descriptors will result in a 11,253-dimensional vector. [sent-244, score-0.19]
60 The PUR column represents the proportion of uncertainty removed (see Sec 3. [sent-251, score-0.188]
61 1) and columns [1,10,25,50] represent the recognition percentage scored by each of the methods at given rank index. [sent-252, score-0.198]
62 PCA is performed on the training data, after which the first 20 and 80 principal vectors are chosen for the HSV and YUV color spaces respectively. [sent-253, score-0.188]
63 The training data is then projected into each of the respective spaces and the resulting transformed data is concatenated across the two color spaces to form a 100-dimensional subspace for Stage 2. [sent-254, score-0.276]
64 Color descriptors for test data are extracted in the same manner as in training, and they are then projected into the PCA and LFDA manifolds (in the same order) using the transformation matrices estimated in the training phase. [sent-256, score-0.149]
65 The number of principal components in Stage 1 for each color space and the LFDA subspace dimensionality in LF are fixed to the above-mentioned values for all the reported experimental results. [sent-257, score-0.328]
66 The VIPER experiment shows that at rank 1, LF was able to achieve 1. [sent-262, score-0.12]
67 3DPES The 3DPES dataset consists of images of 191 individuals captured on multiple occasions along their trajectory through an academic campus, from 8 different surveillance cameras. [sent-271, score-0.161]
68 A significant aspect of the dataset is that it contains a high viewing angle for the cameras, which is typical of surveillance cameras in outdoor scenes. [sent-273, score-0.161]
69 [Figure: 3DPES dataset - pairs of consecutive images above belong to a person in 2 cameras.] [sent-280, score-0.245]
70 The PUR column represents the proportion of uncertainty removed (see Sec 3. [sent-286, score-0.188]
71 1) and columns [1,10,25,50] represent the recognition percentage scored by each of the methods at given rank index. [sent-287, score-0.198]
72 Results for this experiment are reported as a CMC curve in Figure 4, with the recognition percentage at a given rank and PUR values in table 2. [sent-288, score-0.227]
73 LF outperforms the other two methods by significant margins, especially at the lower rank indices {1, 10, 25}. [sent-289, score-0.120]
74 CAVIAR CAVIAR is another person re-identification dataset containing images of 72 individuals captured from 2 cameras in a shopping center scenario. [sent-292, score-0.297]
75 Moreover, images from the second camera are dominated by background lighting making the dataset very challenging when compared to the existing datasets in person re-identification literature. [sent-294, score-0.204]
76 [Figure: CAVIAR re-identification dataset - pairs of consecutive images above belong to a person in 2 cameras.] [sent-297, score-0.204]
77 It is unclear how the authors of (A)HPE carried out their experiment for this dataset, but given that their method does not involve training, it is assumed that all 50 persons in one camera were matched against their observations in the other camera. [sent-308, score-0.114]
78 The experiment conducted for LF makes use of the multi-shot nature of the dataset, for which 5 images per person were used during the training phase. [sent-309, score-0.237]
79 In the test set, the features extracted from all observations belonging to a person are averaged to give one feature vector. [sent-311, score-0.284]
80 Second, the average of all features is likely to be an estimate of the centroid of all samples and hence is a good representation for a one-on-one matching criterion for each person during the matching stage. [sent-314, score-0.240]
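A sketch of this multi-shot pooling followed by nearest-neighbor matching in the learned space (names hypothetical):

```python
import numpy as np

def pooled_feature(shots, T):
    """Average the projected features of all shots of one person."""
    Z = np.asarray(shots) @ T          # project each shot with T_lfda
    return Z.mean(axis=0)              # centroid estimate, one vector

# Matching: rank gallery persons by Euclidean distance to the probe centroid.
# probe = pooled_feature(probe_shots, T)
# dists = [np.linalg.norm(probe - pooled_feature(g, T)) for g in gallery_sets]
```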
81 Similar to the results reported for the other datasets in this work, the recognition percentage for a given rank index and PUR values are reported in table 4. [sent-316, score-0.256]
82 A joint colour encoding was also evaluated by performing PCA on the combined features generated in the HSV and YUV colour spaces. [sent-323, score-0.227]
83 The recognition performance was better for the case of separately encoding each of the colour spaces when compared to the joint encoding; for example, the rank 1 result is 24. [sent-324, score-0.482]
84 This can be attributed to the fact that the colour space conversion between HSV and YUV is non-linear in nature, and separately encoding the features collected in each colour space with PCA retains the most informative principal components. [sent-327, score-0.657]
85 Histogram features are robust compared to colour moments, and they achieve about 7% improvement in recognition performance. [sent-331, score-0.256]
86 This robustness alone is not enough when compared to the joint feature constructed by adding the colour moments, the parametric representation of the data. [sent-333, score-0.268]
87 The joint features outperform histogram features with an increased recognition performance of 6% at rank 1 and 13% for the case of colour moments alone. [sent-334, score-0.479]
88 Colour Space Analysis The features generated from YUV colour space provide better recognition performance compared to the HSV colour space. [sent-337, score-0.515]
89 This is reflected by the number of principal components chosen during the PCA projection step for each colour space: 80 principal components contribute to the YUV colour space, whereas only 20 components from the HSV colour space are used. [sent-338, score-0.853]
90 In this case, the features extracted from both colour spaces improve the recognition performance at rank 1 by 2% over YUV and 13% over the HSV colour space respectively. [sent-339, score-0.692]
91 It is also interesting to note the recognition performance without the stage 1 process of projecting the feature data into the PCA subspace. [sent-340, score-0.102]
92 The application of LFDA alone on accumulated parametric and non-parametric features from both the colour spaces provides a recognition percentage of 17% at rank 1. [sent-341, score-0.523]
93 Conclusion This work presented a novel approach for person reidentification by combining parametric (color moments) and non-parametric (histograms) representations of color as features. [sent-346, score-0.510]
94 The proposed method is a two-stage, training-based low-dimensional manifold learning framework using unsupervised (PCA) and supervised (LFDA) dimensionality reduction methods. [sent-347, score-0.390]
95 A regularized form of the supervised dimensionality reduction was also proposed for the LFDA technique used in the second stage of the framework. [sent-348, score-0.316]
96 Additionally, for performance evaluation purposes, the Proportion of Uncertainty Removed (PUR) was presented as a dataset-size-invariant measure for recognition problems. [sent-349, score-0.212]
97 Experiments were conducted to evaluate and compare the proposed method on three different person re-identification datasets, showing better performance than state-of-the-art methods. [sent-350, score-0.204]
98 Evaluation of local features for person re-identification in image sequences. [sent-379, score-0.204]
99 Multiple person reidentification using part based spatio-temporal color appearance model. [sent-401, score-0.546]
100 Kernel analysis over riemannian manifolds for visual recognition of actions, pedestrians and textures. [sent-472, score-0.096]
wordName wordTfidf (topN-words)
[('lfda', 0.403), ('reidentification', 0.265), ('pur', 0.229), ('colour', 0.227), ('lf', 0.211), ('yuv', 0.208), ('person', 0.204), ('cmc', 0.159), ('viper', 0.155), ('hsv', 0.141), ('rank', 0.12), ('lpp', 0.119), ('pca', 0.115), ('discriminant', 0.11), ('tlfda', 0.108), ('dimensionality', 0.104), ('moments', 0.103), ('conf', 0.099), ('fisher', 0.099), ('eldfv', 0.095), ('fda', 0.088), ('reduction', 0.086), ('uncertainty', 0.082), ('observations', 0.08), ('surveillance', 0.079), ('color', 0.077), ('probe', 0.077), ('caviar', 0.076), ('avss', 0.076), ('stage', 0.073), ('kissme', 0.072), ('pages', 0.07), ('embedding', 0.066), ('affinity', 0.066), ('sw', 0.065), ('hpe', 0.064), ('gallery', 0.064), ('pedestrian', 0.063), ('eigenvectors', 0.062), ('int', 0.061), ('bazzani', 0.06), ('cristani', 0.06), ('perina', 0.057), ('spaces', 0.057), ('proportion', 0.054), ('principal', 0.054), ('jungling', 0.054), ('xdx', 0.054), ('supervised', 0.053), ('individuals', 0.052), ('removed', 0.052), ('dimensional', 0.052), ('entropy', 0.051), ('scatter', 0.05), ('metric', 0.05), ('xj', 0.05), ('encoding', 0.049), ('percentage', 0.049), ('eigenvalue', 0.048), ('ltd', 0.048), ('recurrent', 0.048), ('swain', 0.048), ('sb', 0.047), ('transformation', 0.046), ('adaboost', 0.044), ('workshops', 0.043), ('eigenvalues', 0.043), ('xi', 0.043), ('descriptor', 0.042), ('pcca', 0.042), ('cameras', 0.041), ('parametric', 0.041), ('unsupervised', 0.041), ('farenzena', 0.04), ('zi', 0.038), ('manifolds', 0.037), ('match', 0.037), ('locality', 0.036), ('samples', 0.036), ('attributed', 0.036), ('concatenation', 0.036), ('matrix', 0.035), ('persons', 0.034), ('training', 0.033), ('advanced', 0.032), ('space', 0.032), ('preserving', 0.032), ('member', 0.031), ('preservation', 0.031), ('log', 0.031), ('jn', 0.031), ('observation', 0.031), ('characteristic', 0.03), ('accumulation', 0.03), ('histograms', 0.03), ('equivalence', 0.03), ('academic', 0.03), ('riemannian', 0.03), ('reported', 0.029), ('recognition', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 270 cvpr-2013-Local Fisher Discriminant Analysis for Pedestrian Re-identification
Author: Sateesh Pedagadi, James Orwell, Sergio Velastin, Boghos Boghossian
Abstract: Metric learning methods, for person re-identification, estimate a scaling for distances in a vector space that is optimized for picking out observations of the same individual. This paper presents a novel approach to the pedestrian re-identification problem that uses metric learning to improve the state-of-the-art performance on standard public datasets. Very high dimensional features are extracted from the source color image. A first processing stage performs unsupervised PCA dimensionality reduction, constrained to maintain the redundancy in color-space representation. A second stage further reduces the dimensionality, using a Local Fisher Discriminant Analysis defined by a training set. A regularization step is introduced to avoid singular matrices during this stage. The experiments conducted on three publicly available datasets confirm that the proposed method outperforms the state-of-the-art performance, including all other known metric learning methods. Furthermore, the method is an effective way to process observations comprising multiple shots, and is non-iterative: the computation times are relatively modest. Finally, a novel statistic is derived to characterize the Match Characteristic: the normalized entropy reduction can be used to define the ’Proportion of Uncertainty Removed’ (PUR). This measure is invariant to test set size and provides an intuitive indication of performance.
2 0.19551922 271 cvpr-2013-Locally Aligned Feature Transforms across Views
Author: Wei Li, Xiaogang Wang
Abstract: In this paper, we propose a new approach for matching images observed in different camera views with complex cross-view transforms and apply it to person reidentification. It jointly partitions the image spaces of two camera views into different configurations according to the similarity of cross-view transforms. The visual features of an image pair from different views are first locally aligned by being projected to a common feature space and then matched with softly assigned metrics which are locally optimized. The features optimal for recognizing identities are different from those for clustering cross-view transforms. They are jointly learned by utilizing sparsity-inducing norm and information theoretical regularization. This approach can be generalized to the settings where test images are from new camera views, not the same as those in the training set. Extensive experiments are conducted on public datasets and our own dataset. Comparisons with the state-of-the-art metric learning and person re-identification methods show the superior performance of our approach.
3 0.1786368 451 cvpr-2013-Unsupervised Salience Learning for Person Re-identification
Author: Rui Zhao, Wanli Ouyang, Xiaogang Wang
Abstract: Human eyes can recognize person identities based on some small salient regions. However, such valuable salient information is often hidden when computing similarities of images with existing approaches. Moreover, many existing approaches learn discriminative features and handle drastic viewpoint change in a supervised way and require labeling new training data for a different pair of camera views. In this paper, we propose a novel perspective for person re-identification based on unsupervised salience learning. Distinctive features are extracted without requiring identity labels in the training procedure. First, we apply adjacency constrained patch matching to build dense correspondence between image pairs, which shows effectiveness in handling misalignment caused by large viewpoint and pose variations. Second, we learn human salience in an unsupervised manner. To improve the performance of person re-identification, human salience is incorporated in patch matching to find reliable and discriminative matched patches. The effectiveness of our approach is validated on the widely used VIPeR dataset and ETHZ dataset.
Author: Shoou-I Yu, Yi Yang, Alexander Hauptmann
Abstract: A device just like Harry Potter’s Marauder’s Map, which pinpoints the location ofeachperson-of-interest at all times, provides invaluable information for analysis of surveillance videos. To make this device real, a system would be required to perform robust person localization and tracking in real world surveillance scenarios, especially for complex indoor environments with many walls causing occlusion and long corridors with sparse surveillance camera coverage. We propose a tracking-by-detection approach with nonnegative discretization to tackle this problem. Given a set of person detection outputs, our framework takes advantage of all important cues such as color, person detection, face recognition and non-background information to perform tracking. Local learning approaches are used to uncover the manifold structure in the appearance space with spatio-temporal constraints. Nonnegative discretization is used to enforce the mutual exclusion constraint, which guarantees a person detection output to only belong to exactly one individual. Experiments show that our algorithm performs robust lo- calization and tracking of persons-of-interest not only in outdoor scenes, but also in a complex indoor real-world nursing home environment.
5 0.10552134 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification
Author: Zhen Li, Shiyu Chang, Feng Liang, Thomas S. Huang, Liangliang Cao, John R. Smith
Abstract: This paper considers the person verification problem in modern surveillance and video retrieval systems. The problem is to identify whether a pair of face or human body images is about the same person, even if the person is not seen before. Traditional methods usually look for a distance (or similarity) measure between images (e.g., by metric learning algorithms), and make decisions based on a fixed threshold. We show that this is nevertheless insufficient and sub-optimal for the verification problem. This paper proposes to learn a decision function for verification that can be viewed as a joint model of a distance metric and a locally adaptive thresholding rule. We further formulate the inference on our decision function as a second-order large-margin regularization problem, and provide an efficient algorithm in its dual from. We evaluate our algorithm on both human body verification and face verification problems. Our method outperforms not only the classical metric learning algorithm including LMNN and ITML, but also the state-of-the-art in the computer vision community.
6 0.095129646 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
7 0.093201891 237 cvpr-2013-Kernel Learning for Extrinsic Classification of Manifold Features
8 0.091466457 64 cvpr-2013-Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification
9 0.087542981 111 cvpr-2013-Dense Reconstruction Using 3D Object Shape Priors
10 0.087115347 191 cvpr-2013-Graph-Laplacian PCA: Closed-Form Solution and Robustness
11 0.082985453 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
12 0.078871615 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
13 0.077774324 138 cvpr-2013-Efficient 2D-to-3D Correspondence Filtering for Scalable 3D Object Recognition
14 0.073991776 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
15 0.073767401 383 cvpr-2013-Seeking the Strongest Rigid Detector
16 0.072674297 328 cvpr-2013-Pedestrian Detection with Unsupervised Multi-stage Feature Learning
17 0.070760809 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video
18 0.070058078 405 cvpr-2013-Sparse Subspace Denoising for Image Manifolds
19 0.069405876 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
20 0.069144398 238 cvpr-2013-Kernel Methods on the Riemannian Manifold of Symmetric Positive Definite Matrices
topicId topicWeight
[(0, 0.198), (1, -0.026), (2, -0.035), (3, 0.011), (4, 0.025), (5, -0.009), (6, -0.016), (7, -0.089), (8, -0.016), (9, -0.04), (10, -0.018), (11, -0.026), (12, -0.005), (13, -0.072), (14, 0.003), (15, -0.014), (16, -0.085), (17, -0.007), (18, -0.02), (19, -0.039), (20, 0.045), (21, 0.079), (22, -0.072), (23, 0.015), (24, -0.045), (25, -0.025), (26, -0.017), (27, 0.102), (28, -0.066), (29, -0.062), (30, -0.011), (31, 0.003), (32, -0.019), (33, -0.009), (34, 0.02), (35, 0.022), (36, -0.018), (37, 0.102), (38, -0.023), (39, -0.057), (40, 0.028), (41, 0.069), (42, 0.132), (43, 0.009), (44, 0.035), (45, -0.085), (46, 0.004), (47, -0.038), (48, -0.033), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.92304397 270 cvpr-2013-Local Fisher Discriminant Analysis for Pedestrian Re-identification
2 0.81129438 271 cvpr-2013-Locally Aligned Feature Transforms across Views
3 0.74948812 451 cvpr-2013-Unsupervised Salience Learning for Person Re-identification
4 0.68354994 252 cvpr-2013-Learning Locally-Adaptive Decision Functions for Person Verification
5 0.63406354 391 cvpr-2013-Sensing and Recognizing Surface Textures Using a GelSight Sensor
Author: Rui Li, Edward H. Adelson
Abstract: Sensing surface textures by touch is a valuable capability for robots. Until recently it was difficult to build a compliant sensor with high sensitivity and high resolution. The GelSight sensor is compliant and offers sensitivity and resolution exceeding that of the human fingertips. This opens the possibility of measuring and recognizing highly detailed surface textures. The GelSight sensor, when pressed against a surface, delivers a height map. This can be treated as an image, and processed using the tools of visual texture analysis. We have devised a simple yet effective texture recognition system based on local binary patterns, and enhanced it by the use of a multi-scale pyramid and a Hellinger distance metric. We built a database with 40 classes of tactile textures using materials such as fabric, wood, and sandpaper. Our system can correctly categorize materials from this database with high accuracy. This suggests that the GelSight sensor can be useful for material recognition by robots.
6 0.62980968 178 cvpr-2013-From Local Similarity to Global Coding: An Application to Image Classification
8 0.60630083 272 cvpr-2013-Long-Term Occupancy Analysis Using Graph-Based Optimisation in Thermal Imagery
9 0.60371375 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
10 0.58399141 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
11 0.57984793 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos
12 0.56639171 239 cvpr-2013-Kernel Null Space Methods for Novelty Detection
13 0.55620867 276 cvpr-2013-MKPLS: Manifold Kernel Partial Least Squares for Lipreading and Speaker Identification
14 0.55418539 109 cvpr-2013-Dense Non-rigid Point-Matching Using Random Projections
15 0.55331534 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
16 0.54959249 237 cvpr-2013-Kernel Learning for Extrinsic Classification of Manifold Features
17 0.54762769 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
18 0.54736626 191 cvpr-2013-Graph-Laplacian PCA: Closed-Form Solution and Robustness
19 0.54520392 259 cvpr-2013-Learning a Manifold as an Atlas
20 0.53910917 210 cvpr-2013-Illumination Estimation Based on Bilayer Sparse Coding
topicId topicWeight
[(10, 0.096), (12, 0.204), (16, 0.081), (26, 0.044), (33, 0.264), (67, 0.093), (69, 0.054), (76, 0.014), (80, 0.013), (87, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.85673034 270 cvpr-2013-Local Fisher Discriminant Analysis for Pedestrian Re-identification
2 0.84767544 406 cvpr-2013-Spatial Inference Machines
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 1 7-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
3 0.83243704 363 cvpr-2013-Robust Multi-resolution Pedestrian Detection in Traffic Scenes
Author: Junjie Yan, Xucong Zhang, Zhen Lei, Shengcai Liao, Stan Z. Li
Abstract: The serious performance decline with decreasing resolution is the major bottleneck for current pedestrian detection techniques [14, 23]. In this paper, we take pedestrian detection in different resolutions as different but related problems, and propose a Multi-Task model to jointly consider their commonness and differences. The model contains resolution aware transformations to map pedestrians in different resolutions to a common space, where a shared detector is constructed to distinguish pedestrians from background. For model learning, we present a coordinate descent procedure to learn the resolution aware transformations and deformable part model (DPM) based detector iteratively. In traffic scenes, there are many false positives located around vehicles, therefore, we further build a context model to suppress them according to the pedestrian-vehicle relationship. The context model can be learned automatically even when the vehicle annotations are not available. Our method reduces the mean miss rate to 60% for pedestrians taller than 30 pixels on the Caltech Pedestrian Benchmark, which noticeably outperforms previous state-of-the-art (71%).
4 0.82999116 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Author: Chao-Yeh Chen, Kristen Grauman
Abstract: We propose an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process. Using unlabeled video containing various human activities, the system first learns how body pose tends to change locally in time. Then, given a small number of labeled static images, it uses that model to extrapolate beyond the given exemplars and generate “synthetic ” training examples—poses that could link the observed images and/or immediately precede or follow them in time. In this way, we expand the training set without requiring additional manually labeled examples. We explore both example-based and manifold-based methods to implement our idea. Applying our approach to recognize actions in both images and video, we show it enhances a state-of-the-art technique when very few labeled training examples are available.
5 0.82918775 170 cvpr-2013-Fast Rigid Motion Segmentation via Incrementally-Complex Local Models
Author: Fernando Flores-Mangas, Allan D. Jepson
Abstract: The problem of rigid motion segmentation of trajectory data under orthography has been long solved for nondegenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, motion degeneracies and motion dependencies, recently proposed motion segmentation methods resort to non-trivial representations to achieve state of the art segmentation accuracies, at the expense of a large computational cost. This paperproposes a method that dramatically reduces this cost (by two or three orders of magnitude) with minimal accuracy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standardHopkins 155 dataset). Computational efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model complexity are chosen from a pool of candidates, which are then used for segmentation. 1. Rigid Motion Segmentation Rigid motion segmentation (MS) consists on separating regions, features, or trajectories from a video sequence into spatio-temporally coherent subsets that correspond to independent, rigidly-moving objects in the scene (Figure 1.b or 1.f). The problem currently receives renewed attention, partly because of the extensive amount of video sources and applications that benefit from MS to perform higher level computer vision tasks, but also because the state of the art is reaching functional maturity. Motion Segmentation methods are widely diverse, but most capture only a small subset of constraints or algebraic properties from those that govern the image formation process of moving objects and their corresponding trajectories, such as the rank limit theorem [9, 10], the linear independence constraint (between trajectories from independent motions) [2, 13], the epipolar constraint [7], and the reduced rank property [11, 15, 13]. Model-selection based (a)Orignalvideoframes(b)Clas -label dtrajectories (c)Modelsup ort egion(d)Modelinlersandcontrolpoints (e)Modelresiduals (f) Segmenta ion result Figure 1: Model instantiation and segmentation. a) fth original frame, Italian Grand Prix ?c 2012 Formula 1. b) Classilanbaell feradm, trajectory Gdaratan Wd P r(rixed ?,c green, bolrumeu alnad 1 .bbl a)c Ck correspond to chassis, helmet, background and outlier classes respectively). c) Spatially-local support subset for a candidate motion in blue. d) Candidate motion model inliers in red, control points from Eq. 3) in white. e) Residuals from Eq. 11) color-coded with label data, the radial coordinate is logarithmic. f) Segmentation result. Wˆf (rif (cif methods [11, 6, 8] balance model complexity with modeling accuracy and have been successful at incorporating more of these aspects into a single formulation. For instance, in [8] most model parameters are estimated automatically from the data, including the number of independent motions and their complexity, as well as the segmentation labels (including outliers). However, because of the large number of necessary motion hypotheses that need to be instantiated, as well as the varying and potentially very large number of 222222555977 model parameters that must be estimated, the flexibility offered by this method comes at a large computational cost. Current state of the art methods follow the trend of using sparse low-dimensional subspaces to represent trajectory data. 
This representation is then fed into a clustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspace Clustering (SSC) [3] in which each trajectory is represented as a sparse linear combination of a few other basis trajectories. The assumption is that the basis trajectories must belong to the same rigid motion as the reconstructed trajectory (or else, the reconstruction would be impossible). When the assumption is true, the sparse mixing coefficients can be interpreted as the connectivity weights of a graph (or a similarity matrix), which is then (spectral) clustered to obtain a segmentation result. At the time of publication, SSC produced segmentation results three times more accurate than the best predecessor. The practical downside, however, is the inherently large computational cost of finding the optimal sparse representation, which is at least cubic on the number of trajectories. The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on clustering the principal angles (CPA) of the local subspaces associated to each trajectory and its nearest neighbors. The clustering re-weights a traditional metric of subspace affinity between principal angles. Re-weighted affinities are then used for segmentation. The approach produces segmentation results with accuracies similar to those of SSC, but the computational cost is close to 10 times bigger than SSC’s. In this work we argue that competitive segmentation results are possible using a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [14] respectively, currently considered the state of the art. 1.1. Affine Motion Projective Geometry is often used to model the image motion of trajectories from rigid objects between pairs of frames. However, alternative geometric relationships that facilitate parameter computation have also been proven useful for this purpose. For instance, in perspective projection, general image motion from rigid objects can be modeled via the composition of two elements: a 2D homography, and parallax residual displacements [5]. The homography describes the motion of an arbitrary plane, and the parallax residuals account for relative depths, that are unaccounted for by the planar surface model. Under orthography, in contrast, image motion of rigid objects can be modeled via the composition of a 2D affine transformation plus epipolar residual displacements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals account for relative depths. Crucially, these two components can be computed separately and incrementally, which enables an explicit mechanism to deal with motion degeneracy. In the context of 3D motion, a motion is degenerate when the trajectories originate from a planar (or linear) object, or when neither the camera nor the imaged object exercise all of their degrees of freedom, such as when the object only translates, or when the camera only rotates. These are common situations in real world video sequences. The incremental nature of the decompositions described above, facilitate the transition between degenerate motions and nondegenerate ones. Planar Model Under orthography, the projection of trajectories from a planar surface can be modeled with the affine transformation: ⎣⎡xy1c ⎦⎤=?0D? 
1t?⎡⎣yx1w ⎦⎤= A2wD→c⎣⎡yx1w ⎦⎤, (1) where D ∈ R2×2 is an invertible matrix, and t ∈ R2 is a threarnesl Datio ∈n v Rector. Trajectory icboloer mdiantartixe,s (axnwd , tyw ∈) are in the plane’s reference frame (modulo a 2D affine transformation) and (xc, yc) are image coordinates. Now, let W ∈ R2F×P be matrix of trajectory data that conNtaoiwns, tlehet x a n∈d y image coordinates of P feature points tracked through F frames, as in TocmputehW pa=r⎢m⎣ ⎡⎢ etyx e1F r.,s 1 ofA· . ·2.Dfyx r1Fo m., P tra⎦⎥ ⎤je.ctorydat,(l2e)t C = [c1, c2 , c3] ∈ R2f×3 be three columns (three full trajectories) from W, and let = be the x and y coordinates of the i-th control trajectory at frame f. Then the transformation between points from an arbitrary source frame s to a target frame f can be written as: cif [ci2f−1, ci2f]? ?c1f1 c12f c1f3?= A2sD→f?c11s and A2s→Df c12s c13s?, (3) ?−1. (4) can be simply computed as: A2sD→f= ? c11f c12f c13f ? ? c11s c12s c13s The inverse in the right-hand side matrix of Eq. 4 exists so long as the points cis are not collinear. For simplicity we refer to as and consequently A2sD is the identity matrix. A2s→Df A2fD 222222556088 3D Model In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be accounted using the epipolar residual displacements. This means extending Eq. 1 with a direction vector, scaled by the corresponding relative depth of each point, as in: ⎣⎡xy1c ⎦⎤=?0?D? t1? ⎡⎣xy1w ⎦⎤+ δzw⎣⎡a 0213 ⎦⎤. The depth δzw is relative to the arbitrary plane tion is modeled by A2D; a point that lies on would have δzw = 0. We call the orthographic the plane plus parallax decomposition, the 2D Epipolar (2DAPE) decomposition. Eq. 5 is equivalent to (5) whose mosuch plane version of Affine Plus wher⎣⎡ ixyt1c is⎤⎦cl=ear⎡⎣tha 120t hea p201a2rma2e10t3erst1o2f⎦⎤A⎣⎢⎡3Dδyxd1zwefin⎦⎥⎤ean(o6r)thographically projected 3D affine transformation. Deter- mining the motion and structure parameters of a 3D model from point correspondences can be done using the classical matrix factorization approach [10], but besides being sensitive to noise and outliers, the common scenario where the solution becomes degenerate makes the approach difficult to use in real-world applications. Appropriately accommodating and dealing with the degenerate cases is one of the key features of our work. 2. Overview of the Method The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = s{tMag1e , . . . , rMst,M a} p oiso generated using a omdeetlh hoydp tohthate csoesm Mbine =s a MRandom Sampling naenrda eCdon usseinngsu as m(ReAthNodS tAhaCt) [o4m] bteinche-s nique with the 2DAPE decomposition. The goal is to generate one motion model for each of the N independent, rigidly-moving objects in the scene; N is assumed to be known a priori. The method instantiates many more models than those expected necessary (M ? N) in an attempt ienlscr tehaasne tthhoes elik eexlpiheocotedd o nfe generating Mco ?rrec Nt m) iond aenl proposals for all N motions. A good model accurately describes a large subset of coherently moving trajectories with the smallest number of parameters (§3). Ialnl ethste n suemcobnedr stage, msubetseertss o§f3 )m.otion models from M are ncom thebi sneecdo ntod explain ualbl sthetes trajectories mino tdheel sequence. The problem is framed as an objective function that must be minimized. 
The objective function is the negative loglikelihood over prediction accuracy, regularized by model complexity (number of model parameters) and modeling overlap (trajectories explained by multiple models). Notice that after this stage, the segmentation that results from the optimal model combination could be reported as a segmentation result (§5). ioTnhe r tshuilrtd ( stage incorporates the results from a set of model combinations that are closest to the optimal. Segmentation results are aggregated into an affinity matrix, which is then passed to a spectral clustering algorithm to produce the final segmentation result. This refinement stage generally results in improved accuracy and reduced segmentation variability (§6). 3. Motion Model Instantiation Each model M ∈ M is instantiated independently using RacAhN mSAodCel. MThis ∈ c Mhoi cies niss manotitiavteatded in bdeecpaeunsdee otlfy th usismethod’s well-known computational efficiency and robustness to outliers, but also because of its ability to incorporate spatially local constraints and (as explained below) because most of the computations necessary to evaluate a planar model can be reused to estimate the likelihoods of a potentially necessary 3D model, yielding significant computational savings. The input to our model instantiation algorithm is a spatially-local, randomly drawn subset of trajectory data Wˆ[2F×I] ⊆ W[2F×P] (§3.1). In turn, at each RANSAC trial, the algorithm draw(§s3 uniformly d,i asttr eibaucthed R, A rNanSdoAmC subsets of three control trajectories (C[2F×3] ⊂ Wˆ[2F×I]). Each set of control trajectories is used to estim⊂ate the family of 2D affine transformations {A1, . . . , AF} between the iblyase o ffr 2aDm aef ainnde aralln sotfoherrm fartaimoness { iAn the sequence, wtwheicehn are then used to determine a complete set of model parameters M = {B, σ, C, ω}. The matrix B ∈ {0, 1}[F×I] indicates Mwhe =the {rB t,hσe ,iC-th, trajectory asthroixu Bld ∈b e predicted by model M at frame f (inlier, bif = 1) or not (outlier, bif = 0), σ = {σ1 , . . . , σF} are estimates of the magnitude of the σnois =e {foσr each fram}e a, aen eds ω ∈at {s2 oDf, t3hDe} m isa tnhietu edsetim ofa ttehde nmooidseel f type. hTh fera goal aisn dto ω ωfin ∈d {t2heD c,3oDntr}ol is points tainmda ttehed associated parameters that minimize the objective function O(Wˆ,M) =f?∈Fi?∈IbifLω? wˆif| Af,σf?+ Ψ(ω) + Γ(B) across (7) wˆfi a number of RANSAC trials, where = = are the coordinates of the i-th trajectory from the support subset at frame f. The negative log-likelihood term Lω (·) penalizes reconstruction error, while Ψ(·) and Γ(·) are regularizers. Tcohen tthrureceti otenr mer-s are ,d wefhinileed Ψ Ψ b(e·l)ow an. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respectively, Ψ(ω) regularizes over model complexity using: (xif, yif) ( wˆ 2if−1, wˆ i2f) Wˆ Ψ(ω) =?86((FF − − 1 1)), i f ωω== 32DD. 222222556199 (8) Γ(B) strongly penalizes models that describe too few trajectories: Γ(B) =?0∞,, oifth?erwI?iseFbif< Fλi (9) The control set C whose M minimizes Eq. 7 across a number of RANSAC trials becomes part of the pool of candidates M. 2D likelihoods. For the planar case (ω = 2D) the negative log-likelihood term is evaluated with: L2D( wˆif| Af,σf) = −log?2π|Σ1|21exp?−21rif?Σ−1rif??, (10) which is a zero-mean 2D Normal distribution evaluated at the residuals The spherical covariance matrix is Σ = rif. rif (σf)2I. 
The residuals are determined by the differences between the predictions made by a hypothesized model A_f, and the observations at each frame:

$$r_f^i = \tilde{w}_f^i - A_f\, \tilde{w}_s^i, \quad (11)$$

where w̃^i denotes the trajectory coordinates in homogeneous form (the third component of the difference vanishes, so r_f^i is effectively a 2D displacement).

3D likelihoods. The negative log-likelihood term for the 3D case is based on the 2DAPE decomposition. The 2D affinities A_f and residuals r_f are reused, but to account for the effect of relative depth, an epipolar line segment e_f is robustly fit to the residual data at each frame (please see the supplementary material for details on the segment fitting algorithm). The 2DAPE does not constrain relative depths to remain constant across frames, but only requires trajectories to be close to the epipolar line. So, if the unitary vector e_f^⊥ indicates the direction orthogonal to e_f, then the negative log-likelihood term for the 3D case is estimated with:

$$L_{3D}\!\left(\hat{w}_f^i \mid A_f, \sigma_f\right) = -2\log\!\left(\frac{1}{\sqrt{2\pi}\,\sigma_f} \exp\!\left\{-\frac{\left(r_f^{i\top} e_f^{\perp}\right)^2}{2(\sigma_f)^2}\right\}\right), \quad (12)$$

which is also a zero-mean 2D Normal distribution, computed as the product of two identical, separable, single-variate normal distributions, evaluated at the distance from the residual to the epipolar line. The first one corresponds to the actual deviation in the direction of e_f^⊥, which is analytically computed using r_f^{i⊤} e_f^⊥. The second one corresponds to an estimate of the deviation in the perpendicular direction (e_f), which cannot be determined using the 2DAPE decomposition model, but can be approximated to be equal to r_f^{i⊤} e_f^⊥, which is a plausible estimate under the isotropic noise assumption.

Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as is typical for RANSAC. Instead, we found that better motion models resulted from optimizing over the accuracy of the model predictions for an (estimated) inlier subset, which also means that the effect of outliers is explicitly discounted. Figure 1.b shows an example of class-labeled trajectory data, and 1.c shows a typical spatially-local support subset. Figures 1.d and 1.e show a model's control points and its corresponding (class-labeled) residuals, respectively. A pseudocode description of the motion instantiation algorithm is provided in Algorithm 1. Details on how to determine Ŵ, as well as B, σ, and ω, follow.

Algorithm 1: Motion model instantiation
Input: trajectory data W_{[2F×P]}, number of RANSAC trials K, arbitrary base frame b
Output: parameters of the motion model M = {B, σ, C, ω}
// determine the training set
c ← rand(1, P); r ← rand(r_min, r_max)          // random center and radius
Ŵ_{[2F×I]} ← trajectoriesWithinDisk(W, r, c)    // support subset
X ← homoCoords(Ŵ_b)                             // points at base frame
for K RANSAC trials do
    C ← three control trajectories drawn uniformly from Ŵ
    {A_1, . . . , A_F} ← 2D affine transformations from C and X (Eq. 4)
    r ← residuals between Ŵ and its predictions (Eq. 11)
    B, σ, ω ← inliers, noise estimates and model type from r (§4)
    keep M = {B, σ, C, ω} if it minimizes Eq. 7 so far
return M

3.1. Local Coherence

The subset of trajectories Ŵ given to RANSAC to generate a model M is constrained to a spatially local region. The probability of choosing an uncontaminated set of 3 control trajectories, necessary to compute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is p = 1 − (1 − r^3)^k. This means that the number of trials needed to find a subset of 3 inliers with probability p is

$$k = \frac{\log(1 - p)}{\log(1 - r^3)}. \quad (13)$$

A common assumption is that trajectories from the same underlying motion are locally coherent. Hence, a compact region is likely to increase r, exponentially reducing k, and with it, RANSAC's computation time by a proportional amount.

Figure 2: Predictions (red) from a 2D affine model with standard Gaussian noise (green) on one of the control points (black). Noiseless model predictions in blue. All four scenarios have identical noise. The magnitude of the extrapolation error changes with the distance between the control points.
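As a quick illustration of how local coherence pays off, the adaptive trial count of Eq. 13 can be computed directly. This is a minimal sketch; the function name and the default success probability are illustrative choices.

```python
import numpy as np

def ransac_trials(inlier_ratio, p=0.99):
    """Trials k needed to draw 3 uncontaminated control trajectories
    with probability p from data with the given inlier ratio (Eq. 13)."""
    return int(np.ceil(np.log(1 - p) / np.log(1 - inlier_ratio ** 3)))

# Doubling the inlier ratio cuts the required trials dramatically:
print(ransac_trials(0.3), ransac_trials(0.6))  # 169 vs. 19
```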
The trade-off that results from drawing model control points from a small region, however, is extrapolation error. A motion model is extrapolated when utilized to make predictions for trajectories outside the region defined by the control points. The magnitude of modeling error depends on the magnitude of the noise affecting the control points, and although hard to characterize in general, extrapolation error can be expected to grow with the distance from the prediction to the control points, and inversely with the distance between the control points themselves. Figure 2 shows a series of synthetic scenarios where one of the control points is affected by zero-mean Gaussian noise of small magnitude. Identical noise is added to the same trajectory in all four scenarios. The figure illustrates the relation between the distance between the control points and the magnitude of the extrapolation errors.

Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the scale of the objects in the scene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and locations to construct a diverse set of support subsets. Each support subset is then determined by

$$\hat{W} = \left\{ w^i \mid (x_b^i - o_x)^2 + (y_b^i - o_y)^2 < r^2 \right\}, \quad (14)$$

where (o_x, o_y) are the coordinates of the center of a disk of radius r. To promote uniform image coverage, the disk is centered at a randomly chosen trajectory, (o_x, o_y) = (x_b^i, y_b^i), with uniformly distributed i ∼ U(1, P) and base frame b ∼ U(1, F). To allow for different region sizes, the radius r is chosen from a uniform distribution r ∼ U(r_min, r_max). If there are I trajectories within the support region, then Ŵ ∈ R^{2F×I}. It is worth noting that the construction of the support region does not incorporate any knowledge about the motion of objects in the scene, and in consequence will likely contain trajectories that originate from more than one independently moving object (Figure 3).

Figure 3: Two randomly drawn local support sets. Left: A mixed set with some trajectories from the blue and green classes. Right: Another mixed set with all of the trajectories in the red class and some from the blue class.
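For concreteness, here is a minimal sketch of the disk sampling of Eq. 14; the function name and layout assumptions (trajectories stacked as in Eq. 2) are illustrative, not the reference implementation.

```python
import numpy as np

def sample_support_subset(W, r_min, r_max, rng=np.random.default_rng()):
    """Draw one disk-shaped support subset W_hat (Eq. 14).
    W: (2F, P) trajectory matrix laid out as in Eq. 2."""
    two_F, P = W.shape
    b = rng.integers(two_F // 2)           # base frame b ~ U(1, F)
    i = rng.integers(P)                    # center the disk on a random trajectory
    r = rng.uniform(r_min, r_max)          # radius r ~ U(r_min, r_max)
    xb, yb = W[2 * b], W[2 * b + 1]        # coordinates of all points at frame b
    inside = (xb - xb[i]) ** 2 + (yb - yb[i]) ** 2 < r ** 2
    return W[:, inside]                    # W_hat, shape (2F, I)
```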
4. Characterizing the Residual Distribution

At each RANSAC iteration, residuals r_f are computed using the 2D affine model A_f that results from the constraints provided by the control trajectories C. Characterizing the distribution of r_f has three initial goals. The first one is to determine 2D model inliers b_f^{2D} (§4.1), the second one is to compute estimates of the magnitude of the noise at every frame σ_f^{2D} (§4.2), and the third one is to determine whether the residual distribution originates from a planar or a 3D object (§4.3). If the object is suspected 3D, then two more goals need to be achieved. The first one is to determine 3D model inliers b_f^{3D} (§4.4), and the second one is to estimate the magnitude of the noise (σ_f^{3D}) to reflect the use of a 3D model (§4.5).

4.1. 2D Inlier Detection

Suppose the matrix Ŵ contains trajectories Ŵ_1 ∈ R^{2F×I} and Ŵ_2 ∈ R^{2F×J} from two independently moving objects, and that these trajectories are contaminated with zero-mean Gaussian noise of spherical covariance, η ∼ N(0, (σ_f)^2 I):

$$\hat{W} = \left[\, \hat{W}_1 \mid \hat{W}_2 \,\right] + \eta. \quad (15)$$

Now, assume we know the true affine transformations A_f^1 and A_f^2 that describe the motion of trajectories for the subsets Ŵ_1 and Ŵ_2, respectively. If A_f^1 is used to compute predictions for all of Ŵ (at frame f), the expected value (denoted by ⟨·⟩) of the magnitude of the residuals (r_f from Eq. 11) for trajectories in Ŵ_1 will be in the order of the magnitude of the underlying noise: ⟨|r_f^i|⟩ = σ_f for each i ∈ {1, . . . , I}. But in this scenario, trajectories in Ŵ_2 will be predicted using the wrong model, resulting in residuals with magnitudes determined by the motion differential: ⟨|r_f^i|⟩ = ⟨|(A_f^1 − A_f^2) w̃_b^i|⟩. If we can assume that the motion differential is bigger than the displacement due to noise:

$$\left| \left( A_f^1 - A_f^2 \right) \tilde{w}_b^i \right| > \sigma_f, \quad (16)$$

then the model inliers can be determined by thresholding |r_f^i| with the magnitude of the noise, scaled by a constant (τ = λ_σ σ_f):

$$b_f^i = \begin{cases} 1, & \text{if } |r_f^i| \le \tau \\ 0, & \text{otherwise.} \end{cases} \quad (17)$$

But because σ_f is generally unknown, the threshold (τ) is estimated from the residual data. To do so, let r̂ be the vector of sorted residual magnitudes, where r̂_i ≤ r̂_{i+1}. Now, let r̃ = median_i(r̂_{i+1} − r̂_i). The threshold is then defined as

$$\tau = \min\left\{ \hat{r}_i \mid (\hat{r}_{i+1} - \hat{r}_i) > \lambda_r \tilde{r} \right\}, \quad (18)$$

which corresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be efficient and effective. Figure 1.e shows class-labeled residuals. Notice the presence of a (low density) gap between the residuals from the trajectories explained by the correct model (in red, close to the origin), and the rest.

4.2. Magnitude of the Noise, 2D Model

Let r̂_f^{2D} contain only the residuals of the inlier trajectories (those where b_f^i = 1), and let USV^⊤ be the singular value decomposition of the covariance matrix of r̂_f^{2D}:

$$USV^\top = \mathrm{svd}\!\left( \frac{1}{|b_f|} \left( \hat{r}_f^{2D} \right)^{\!\top} \hat{r}_f^{2D} \right), \quad (19)$$

where |b_f| is the number of inliers at frame f. Then the magnitude of the noise corresponds to the largest singular value, σ^2 = s_1, because if the underlying geometry is in fact planar, then the only unaccounted displacements captured by the residuals are due to noise. Model capacity can also be determined from S, as explained next.

4.3. Model Capacity

The ratio of largest over smallest singular values (s_1/s_2) determines when upgrading to a 3D model is beneficial. When the underlying geometry is actually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), reflecting that their relative depth is being unaccounted for. This produces a covariance matrix with a large ratio s_1/s_2 ≫ 1. If, on the other hand, s_1/s_2 ≈ 1, then there is no indication of unexplained relative depth, in which case fitting a line to spherically distributed residuals will only increase the model complexity without explaining the residual variance much better. A small spherical residual covariance strongly suggests a planar underlying geometry.

4.4. 3D Inlier Detection

When the residual distribution is elongated (s_1/s_2 ≫ 1), a line segment is robustly fit to the (potentially contaminated) set of residuals. The segment must go through the origin and its parameters are computed using a Hough transform. Further details about this algorithm can be found in the supplementary material.
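The gap test of Eq. 18 is easy to state in code. The sketch below sorts the residual magnitudes and returns the last magnitude before the first salient gap; the function name and the fallback when no gap exists are illustrative assumptions.

```python
import numpy as np

def gap_threshold(res_mag, lambda_r=3.0):
    """Adaptive inlier threshold tau of Eq. 18.
    res_mag: 1-D array of residual magnitudes |r_f^i| for one frame."""
    r = np.sort(res_mag)
    gaps = np.diff(r)                      # r_{i+1} - r_i
    r_tilde = np.median(gaps)              # typical spacing between magnitudes
    salient = np.nonzero(gaps > lambda_r * r_tilde)[0]
    return r[salient[0]] if salient.size else r[-1]  # no salient gap: accept all
```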
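Similarly, the noise estimate of §4.2 and the capacity test of §4.3 both fall out of one small SVD of the inlier-residual covariance (Eq. 19). This is a sketch; the ratio threshold in the usage line is a hypothetical parameter, not a value given in the paper.

```python
import numpy as np

def residual_spectrum(inlier_res):
    """Singular values of the inlier-residual covariance (Eq. 19).
    inlier_res: (n, 2) residual vectors r_f^i for one frame."""
    cov = inlier_res.T @ inlier_res / len(inlier_res)
    return np.linalg.svd(cov, compute_uv=False)

# Elongated residuals hint at unexplained parallax (Section 4.3):
s1, s2 = residual_spectrum(np.random.randn(100, 2) * [5.0, 0.1])
upgrade_to_3d = s1 / s2 > 10.0   # ratio threshold is a hypothetical parameter
```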
Inlier detection. The resulting line segment is used to determine 3D model inliers. Trajectory i becomes an inlier at frame f if it satisfies two conditions. First, the projection of r_f^i onto the line must lie within the segment limits (β ≤ r_f^{i⊤} e_f ≤ γ). Second, the normalized distance to the line must be below a threshold (e_f^{⊥⊤} r_f^i ≤ σ_2 λ_d). Notice that the threshold depends on the smallest singular value from Eq. 19 to (roughly) account for the presence of noise in the direction perpendicular to the epipolar line (e_f).

4.5. Magnitude of the Noise, 3D Model

Similarly to the 2D case, let r̂_f^{3D} contain the residual data from the corresponding 3D inlier trajectories. An estimate for the magnitude of the noise that reflects the use of a 3D model can be obtained from the singular value decomposition of the covariance matrix of r̂_f^{3D} (as in Eq. 19). In this case, the largest singular value s_1 captures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displacements due to relative depth. However, s_2 captures deviations from the epipolar line, which in a rigid 3D object can only be attributed to noise, making σ^2 = s_2 a reasonable estimate for its magnitude.

Optimal model parameters. When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (Eq. 7) becomes the winning model for the current RANSAC run. The same penalized negative log-likelihood metric is used to determine the best model across all RANSAC iterations. The winning model is added to the pool M, and the process is repeated M times, forming the pool M = {M_1, . . . , M_M}.

5. Optimal Model Subset

The next step is to find the model combination M* ⊂ M that maximizes prediction accuracy for the whole trajectory data W, while minimizing model complexity and modeling overlap. For this purpose, let M_j = {M_{j,1}, . . . , M_{j,N}} be the j-th model combination, and let {M_j} be the set of all C(M, N) = M!/(N!(M−N)!) combinations of N models that can be drawn from M. The model selection problem is then formulated as

$$M^{\star} = \underset{M_j}{\arg\min}\; O_S(M_j), \quad (20)$$

where the objective is

$$O_S(M_j) = \sum_{n=1}^{N} \sum_{p=1}^{P} \pi_{p,n}\, E(w_p, M_{j,n}) + \lambda_\Phi \sum_{p=1}^{P} \Phi(w_p, M_j) + \lambda_\Psi \sum_{n=1}^{N} \Psi(M_{j,n}). \quad (21)$$

The first term accounts for prediction accuracy; the other two are regularization terms. Details follow.

Prediction Accuracy. In order to determine how well a model M predicts an arbitrary trajectory w, the affine transformations estimated by RANSAC could be re-used. However, the inherent noise in the control points, and the potentially short distance between them, often render this approach impractical, particularly when w is spatially distant from the control points (see §3.1). Instead, model parameters are computed with a factorization-based method [10]. Given the inlier labeling B in M, let W_B be the subset of trajectories where b_f^i = 1 for at least half of the frames. The orthonormal basis S of a ω = 2D (or 3D) motion model can be determined by the 2 (or 3) left singular vectors of W_B. Using S as the model's motion matrices, prediction accuracy can be computed using:

$$E(w, M) = \left\| S S^\top w - w \right\|^2, \quad (22)$$

which is the sum of squared Euclidean deviations from the predictions (SS^⊤w) to the observed data (w).
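Eq. 22's projection error takes only a few lines with an SVD. The sketch below is a minimal reading of the factorization-based predictor; the helper name and array shapes are assumptions, and mean-centering is omitted.

```python
import numpy as np

def prediction_error(w, W_B, omega_rank):
    """Subspace prediction error E(w, M) of Eq. 22.
    w: (2F,) trajectory; W_B: (2F, n) inlier trajectories of model M;
    omega_rank: 2 for a 2D motion model, 3 for 3D."""
    U, _, _ = np.linalg.svd(W_B, full_matrices=False)
    S = U[:, :omega_rank]                 # orthonormal motion basis
    return float(np.sum((S @ (S.T @ w) - w) ** 2))
```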
Our experiments indicated that, although sensitive to outliers, these model predictions are much more robust to noise.

Ownership variables Π ∈ {0, 1}^{[P×N]} indicate whether trajectory p is explained by the n-th model (π_{p,n} = 1) or not (π_{p,n} = 0), and are determined by maximum prediction accuracy (i.e. minimum Euclidean deviation):

$$\pi_{p,n} = \begin{cases} 1, & \text{if } M_{j,n} = \underset{M \in M_j}{\arg\min}\, E(w_p, M) \\ 0, & \text{otherwise.} \end{cases} \quad (23)$$

Regularization terms. The second term from Eq. 21 penalizes situations where multiple models explain a trajectory (w) with relatively small residuals. For brevity, let Ê(w, M) = exp{−E(w, M)}; then:

$$\Phi(w, M_j) = -\log \frac{\max_{M \in M_j} \hat{E}(w, M)}{\sum_{M \in M_j} \hat{E}(w, M)}. \quad (24)$$

The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The constants λ_Φ and λ_Ψ modulate the effect of the corresponding regularizer.

Table 1: Accuracy and run-time for the H155 dataset. Naive RANSAC included as a baseline with overall accuracy and total computation time estimated using data from [12].

Algorithm    Average Accuracy [%]    Computation time [s]
SSC [3]      …                       …
CPA [14]     …                       …
RANSAC       …                       …
Ours         …                       …

6. Refinement

The optimal model subset M* yields ownership variables Π*, which can already be interpreted as a segmentation result. However, we found that segmentation accuracy can be improved by incorporating the labellings Π_t from the top T subsets {M_t* | 1 ≤ t ≤ T} closest to the optimal. Multiple labellings are incorporated into an affinity matrix F, where the f_{i,j} entry indicates the frequency with which trajectory i is given the same label as trajectory j across all T labellings, weighted by the relative objective function Õ_t = exp(−O_S(W | M_t*) / O_S(W | M*)) for such a labelling:

$$f_{i,j} = \frac{1}{\sum_{t=1}^{T} \tilde{O}_t} \sum_{t=1}^{T} \left( \pi_{i,:}^{t}\, \pi_{j,:}^{t\top} \right) \tilde{O}_t. \quad (25)$$

Note that the inner product between the label vectors (π_{i,:} π_{j,:}^⊤) is equal to one only when the labels are the same. A spectral clustering method is applied on F to produce the method's final segmentation result.

7. Experiments

Evaluation was made through three experimental setups.

Hopkins 155. The Hopkins 155 (H155) dataset has been the standard evaluation benchmark for the problem of motion segmentation of trajectory data since 2007. It consists of checkerboard, traffic and articulated sequences with either 2 or 3 motions. Data was automatically tracked, but tracking errors were manually corrected; further details are available in [12]. The use of a standard dataset enables direct comparison of accuracy and run-time performance. Table 1 shows the relevant figures for the two most competitive algorithms that we are aware of. The data indicates that our algorithm has run-times that are close to 2 or 3 orders of magnitude faster than the state-of-the-art methods, with minimal accuracy loss. Computation times are measured on the same (or very similar) hardware architectures. As in CPA, our implementation uses a single set of parameters for all the experiments, but as others have pointed out [14], it remains unclear whether the same is true for the results reported in the original SSC paper.

Figure 4: Accuracy error-bars across artificial H155 datasets with controlled levels of Gaussian noise.
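Looking back at the refinement stage (§6), Eq. 25 amounts to a weighted vote over the top-T labellings. The sketch below is a simplified reading in which each labelling's weight Õ_t is derived from its objective value; the exact normalization inside the exponent follows one reading of Eq. 25 and should be treated as an assumption.

```python
import numpy as np

def aggregate_affinity(labellings, objectives):
    """Affinity matrix F of Eq. 25 from the top-T labellings.
    labellings: (T, P) integer labels; objectives: (T,) values O_S(W | M_t)."""
    obj = np.asarray(objectives, dtype=float)
    w = np.exp(-obj / obj.min())          # relative weights O~_t (assumed form)
    F = np.zeros((labellings.shape[1],) * 2)
    for lab, wt in zip(labellings, w):
        F += wt * (lab[:, None] == lab[None, :])  # 1 iff labels agree
    return F / w.sum()
```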
Artificial Noise. The second experimental setup complements an unexplored dimension in the H155 dataset: noise. The goal is to determine the effects of noise of different magnitudes on the segmentation accuracy of our method, in comparison with the state of the art. We noted that H155 contains structured long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate such a dataset, ground-truth labels were used to compute a rank-3 reconstruction of (mean-subtracted) trajectories for each segment. Then, multiple versions of H155 were computed by contaminating the noise-free dataset with Gaussian noise of magnitudes σ_n ∈ {0.01, 0.25, 0.5, 1, 2, 4, 8}. Our method, as well as SSC and CPA, were run on these noise-controlled datasets; results are shown in Figure 4. The error bars on SSC and Ours indicate one standard deviation, computed over 20 runs. The plot for CPA is generated with only one run for each dataset (running time: 11.95 days). The graph indicates that our method only compromises accuracy for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most competitive algorithms.

KLT Tracking. The last experimental setup evaluates the applicability of the algorithm in real-world conditions using raw tracks from an off-the-shelf implementation [1] of the Kanade-Lucas-Tomasi algorithm. Several sequences were tracked and the resulting trajectories classified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequences. Challenges include very small relative motions, tracking noise, and a large presence of outliers.

8. Conclusions

We introduced a computationally efficient motion segmentation algorithm for trajectory data. Efficiency comes from the use of a simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Run-time comparisons indicate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in accuracy. The robustness of our method to Gaussian noise of different magnitudes was found competitive with the state of the art, while retaining the inherent computational efficiency. The method was also found to be useful for motion segmentation of real-world, raw trajectory data.

Figure 5: Tracks from four Formula 1 sequences, Italian Grand Prix ©2012 Formula 1. In this figure, all trajectories are given a motion label, including outliers.

References
[1] http://www.ces.clemson.edu/~stb/klt/
[2] J. P. Costeira and T. Kanade. A Multibody Factorization Method for Independently Moving Objects. IJCV, 1998.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. CVPR, 2009.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3D scene analysis. Proc. ECCV, 1996.
[6] K. Kanatani. Motion segmentation by subspace separation: Model selection and reliability evaluation. International Journal of Image and Graphics, 2002.
[7] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, M. A. Fischler and O. Firschein, eds., 1987.
[8] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2):159–177, 2008.
[9] C. Tomasi and T. Kanade. Shape and motion without depth. Proc. ICCV, 1990.
[10] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.
[11] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. of the Royal Soc. of London, Series A: Mathematical, Physical and Engineering Sciences, 1998.
[12] R. Tron and R. Vidal. A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms. In Proc. CVPR, 2007.
[13] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. PAMI, 2008.
[14] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. Proc. ACCV, 2011.
[15] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications in multi-body and multi-sequence factorizations. Proc. CVPR, 2003.
6 0.81918812 118 cvpr-2013-Detecting Pulse from Head Motions in Video
7 0.81768632 271 cvpr-2013-Locally Aligned Feature Transforms across Views
8 0.81751239 315 cvpr-2013-Online Robust Dictionary Learning
9 0.81734699 138 cvpr-2013-Efficient 2D-to-3D Correspondence Filtering for Scalable 3D Object Recognition
10 0.81506634 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
11 0.81505042 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
12 0.81435424 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
13 0.81366801 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
15 0.81223446 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
16 0.81174481 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
17 0.81145465 438 cvpr-2013-Towards Pose Robust Face Recognition
18 0.81116331 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
19 0.81053197 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
20 0.810359 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence