Author: Xin Guo, Dong Liu, Brendan Jou, Mojun Zhu, Anni Cai, Shih-Fu Chang

Abstract: Object co-detection aims at simultaneous detection of objects of the same category from a pool of related images by exploiting consistent visual patterns present in candidate objects in the images. The related image set may contain a mixture of annotated objects and candidate objects generated by automatic detectors. Co-detection differs from the conventional object detection paradigm in which detection over each test image is determined one-by-one independently without taking advantage of common patterns in the data pool. In this paper, we propose a novel, robust approach to dramatically enhance co-detection by extracting a shared low-rank representation of the object instances in multiple feature spaces. The idea is analogous to that of the well-known Robust PCA [28], but has not been explored in object co-detection so far. The representation is based on a linear reconstruction over the entire data set and the low-rank approach enables effective removal of noisy and outlier samples. The extracted low-rank representation can be used to detect the target objects by spectral clustering. Extensive experiments over diverse benchmark datasets demonstrate consistent and significant performance gains of the proposed method over the state-of-the-art object co-detection method and the generic object detection methods without co-detection formulations.

1 Abstract Object co-detection aims at simultaneous detection of objects of the same category from a pool of related images by exploiting consistent visual patterns present in candidate objects in the images. [sent-4, score-0.473]

2 Co-detection differs from the conventional object detection paradigm in which detection over each test image is determined one-by-one independently without taking advantage of common patterns in the data pool. [sent-6, score-0.246]

3 In this paper, we propose a novel, robust approach to dramatically enhance co-detection by extracting a shared low-rank representation of the object instances in multiple feature spaces. [sent-7, score-0.211]

4 Extensive experiments over diverse benchmark datasets demonstrate consistent and significant performance gains of the proposed method over the state-of-the-art object co-detection method and the generic object detection methods without co-detection formulations. [sent-11, score-0.365]

5 Introduction Given an image and a target object category, the goal of object detection is to localize the instance of the given category within the image, often up to bounding box precision. [sent-13, score-0.993]

6 The classical approach to object detection is to train object detectors from manually labeled bounding boxes in a set of training images and then apply the detectors on the individual test images. [sent-15, score-1.182]

7 Given the automatically detected candidate regions and the training bounding box set, we represent them using K different features. [sent-19, score-0.86]

8 For each feature matrix, we perform linear reconstruction, representing each bounding box as a linear combination of other bounding boxes where the resulting coefficient matrix measures the mutual dependency of bounding boxes. [sent-20, score-2.39]

9 We derive a shared low-rank reconstruction matrix from the K reconstructions while removing the noisy and outlying bounding boxes in each feature matrix in a sparse residue matrix. [sent-21, score-1.414]

10 The low-rank reconstruction coefficient matrix is then fed into Normalized Cuts clustering to yield co-detection results. [sent-22, score-0.478]

11 Given a target object category and a training corpus with bounding box annotations, we first train several state-of-the-art object detectors so that diverse appearances of the target object can be covered and the family of detectors can collectively reach a high recall in detection accuracy. [sent-34, score-1.366]

12 These detectors are then applied to the test images to obtain an initial bounding box candidate pool. [sent-35, score-0.901]

13 With the bounding boxes from the training set and the initial candidate pool over the test images, we extract K low-level features from each of them. [sent-36, score-1.105]

14 For each feature, we perform a linear reconstruction task to represent each bounding box as a linear combination of other bounding boxes such that the reconstruction coefficients represent the dependency of one bounding box to the others. [sent-37, score-2.395]
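
The per-feature linear reconstruction described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' solver: it uses a ridge-regularized least-squares fit, and the function name `self_reconstruct` and the regularizer `lam` are hypothetical. Each column of X (one bounding box) is reconstructed from all the other columns, so the diagonal of Z stays zero.

```python
import numpy as np

def self_reconstruct(X, lam=0.1):
    """Reconstruct each column of X from the other columns (ridge least squares).

    X has shape (m, n): one m-dimensional feature vector per bounding box.
    Returns Z of shape (n, n) with zero diagonal, so that X is roughly X @ Z.
    """
    m, n = X.shape
    Z = np.zeros((n, n))
    for i in range(n):
        others = [j for j in range(n) if j != i]
        A = X[:, others]  # dictionary made of every box except box i
        # Ridge solution: (A^T A + lam * I)^-1 A^T x_i
        coef = np.linalg.solve(A.T @ A + lam * np.eye(n - 1), A.T @ X[:, i])
        Z[others, i] = coef
    return Z

# Toy data: 6 boxes whose features lie in a 2-dimensional subspace of R^5.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 6))
Z = self_reconstruct(X, lam=1e-6)
E = X - X @ Z  # reconstruction residue
```

Entry Z[j, i] is the contribution of box j when reconstructing box i. Because the toy boxes share a subspace, the residue E is close to zero.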

15 We seek to find a shared low-rank reconstruction coefficient matrix across these K reconstructions that captures the global structure of the object space while removing noise and outliers in each feature space via a sparse residue matrix. [sent-38, score-0.93]

16 However, the difference is that the unlabeled data is not given arbitrarily, but corresponds to potential bounding boxes generated by multiple detectors. [sent-42, score-0.766]

17 The use of low-rank constraints on the coefficient matrix is particularly important for discovering the mutual dependence that may exist between bounding boxes, which we refer to as the “global structure”. [sent-44, score-0.875]

18 To capture this structure on bounding boxes, we assume that the reconstruction coefficient vectors are dependent on each other. [sent-45, score-0.792]

19 While different features may yield different low-rank coefficient matrices, a shared low-rank coefficient matrix is necessary because it captures object dependency across these features and in so doing, ensures robustness. [sent-47, score-0.694]

20 Noise and outliers from each feature space can also be removed via a sparse residue matrix which reduces ambiguity that may have been introduced by each feature. [sent-48, score-0.37]

21 Our experiments on benchmark datasets used by [2] as well as on PASCAL VOC 2007 and 2009 show consistent and significant margins of improvement over generic object detectors using little prior and the state-of-the-art object co-detector. [sent-49, score-0.22]

22 Specifically, they represent an object category using part-based object representations and measure appearance consistency between objects by pairwise similarity matching. [sent-63, score-0.206]

23 In contrast, we focus on collectively discovering global structure from an object bounding box pool and concurrently removing outliers, which we believe leads to robust object co-detection, able to handle noise. [sent-66, score-1.105]

24 [15] proposed the low-rank representation method which can be used to discover the underlying subspace structures by imposing the low-rank constraint on the representation coefficient matrix while using ? [sent-71, score-0.332]

25 Our method is distinct in that we develop a low-rank coefficient matrix that is shared over multiple reconstructions derived from different features. [sent-74, score-0.41]

26 We note that related work can also be found in multi-task joint sparse representation [30], but it seeks to find stable training images across multiple features to classify test images rather than using them to discover the global structure as we do toward object localization in images. [sent-75, score-0.312]

27 We first present how we generate an initial pool of candidate bounding boxes of the target object and then describe our problem formulation. [sent-78, score-1.14]

28 Finally, we explain how to use a learned low-rank coefficient matrix for co-detection. [sent-79, score-0.272]

29 Bounding Box Candidate Pool Generation Exhaustive window scanning will generate a massive number of bounding boxes that dramatically increases the computational burden of the object detector. [sent-82, score-0.827]

30 Therefore, an initial bounding box generation procedure is necessary to prune the windows that do not contain any target object. [sent-83, score-0.795]

31 Given a target object category and its associated training bounding boxes, we train two kinds of object detectors: Deformable Part-based Model (DPM) [12] and Ensemble of Exemplar-SVMs (ESVMs) [18]. [sent-84, score-0.767]

32 A similar bounding box candidate pool generation method was adopted in [2] using DPM. [sent-86, score-0.976]

33 We apply the detectors on each test image and select the top B bounding boxes with the highest detection scores as the potential localizations on that image. [sent-87, score-0.95]

34 We set B to be twice the average number of bounding boxes in the training images1 . [sent-88, score-0.809]

35 Because we have two detectors, there are 2B bounding box suggestions for each test image. [sent-89, score-0.735]

36 After removing the duplicate bounding boxes with non-maximum suppression, we obtain an initial bounding box pool with a high recall. [sent-90, score-1.708]
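
The pool generation steps above (top-B boxes per detector, then duplicate removal by non-maximum suppression) can be sketched in plain Python. The names `candidate_pool` and `nms_thresh`, the 0.5 suppression threshold, and the direct comparison of scores from different detectors are all assumptions made for illustration; the text does not specify them.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def candidate_pool(detections_per_detector, B, nms_thresh=0.5):
    """Keep the top-B boxes of each detector, then greedily drop duplicates."""
    merged = []
    for dets in detections_per_detector:
        merged += sorted(dets, key=lambda d: d[1], reverse=True)[:B]
    merged.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in merged:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) <= nms_thresh for k in kept):
            kept.append((box, score))
    return kept

# Hypothetical (box, score) outputs of two detectors on one image.
dpm = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.4), ((100, 100, 110, 110), 0.1)]
esvm = [((1, 1, 11, 11), 0.8), ((200, 200, 220, 220), 0.7)]
pool = candidate_pool([dpm, esvm], B=2)
```

Here the second detector's box at (1, 1, 11, 11) is suppressed as a duplicate of the first detector's top box, leaving three candidates.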

37 We note that other bounding box pool generation methods, such as objectness detection [1], may also be considered as alternatives to these two detectors. [sent-91, score-0.977]

38 Problem Formulation Given an object category, suppose we have l training bounding boxes and u potential bounding boxes from the initial bounding box pool. [sent-94, score-1.636]

39 Footnote 1: Note that studying the optimal choice of B is a legitimate research problem but not the main focus of this paper. [sent-95, score-0.686]

40 We extract low-level features from each bounding box and obtain a feature matrix X = [x1, . [sent-96, score-0.817]

41 , xl+u], where xi ∈ Rm is the feature vector of the i-th bounding box (i = 1, . [sent-99, score-0.757]

42 , z_{l+u}] ∈ R^{(l+u)×(l+u)} is the reconstruction coefficient matrix, with z_i ∈ R^{l+u} denoting the reconstruction coefficient vector of bounding box x_i. [sent-106, score-1.236]

43 Notably, the j-th entry in vector zi is the contribution of the bounding box xj in reconstructing the bounding box xi, and measures the mutual dependence between xi and xj . [sent-107, score-1.522]

44 E is the reconstruction residue matrix of the given feature matrix X. [sent-108, score-0.461]

45 First, it finds the reconstruction coefficient vector for each bounding box individually, and hence does not take into account the global structure of the bounding boxes. [sent-110, score-1.51]

46 The minimization of rank(Z) forces the reconstruction coefficient matrix to have the lowest rank possible. [sent-127, score-0.454]

47 As a result, the reconstruction coefficient vectors of different bounding boxes influence each other in such a way as to encourage bounding boxes to be linearly spanned by only a few bases. [sent-128, score-1.807]

48 The matrix Z then represents the global structure of the bounding boxes. [sent-129, score-0.641]
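
Rank minimization is usually handled by relaxing rank(Z) to the nuclear norm (the sum of singular values), whose proximal operator is singular value thresholding. The sketch below shows that single building block on a toy matrix. It is an assumption that a solver for this problem would use it; the function name and the threshold `tau` are hypothetical.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# A rank-2 matrix plus small dense noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
noisy = low_rank + 0.01 * rng.standard_normal((8, 8))
Z = singular_value_threshold(noisy, tau=0.2)
```

The small singular values introduced by the noise fall below the threshold and are set to zero, so the result is again low rank.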

49 By removing E from X, the feature representations of the bounding boxes become more compact, reducing potential ambiguity in the detection process. [sent-136, score-0.922]
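
A residue matrix in which only a few columns (bounding boxes) are non-zero is commonly obtained with a column-wise penalty such as the ℓ2,1 norm. Assuming such a penalty, its proximal operator shrinks the ℓ2 norm of every column, as sketched below. The function name `shrink_columns` and the threshold `tau` are hypothetical.

```python
import numpy as np

def shrink_columns(Q, tau):
    """Proximal operator of tau * l2,1 norm: shrink each column's l2 norm by tau."""
    norms = np.linalg.norm(Q, axis=0)
    # Columns with norm below tau are zeroed; the others are scaled down.
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

# Three boxes (columns); only the middle one has a large residue.
Q = np.array([[0.1, 3.0, 0.0],
              [0.1, 4.0, 0.2]])
E = shrink_columns(Q, tau=1.0)
outliers = np.flatnonzero(np.linalg.norm(E, axis=0) > 0)
```

Only the column with a large residue survives the shrinkage, which flags that bounding box as a likely outlier.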

50 In general, we require more than one feature to discover the global structure of the objects given their diverse visual appearance. [sent-138, score-0.2]

51 A more promising alternative is to find a reconstruction coefficient matrix shared across multiple features, whose entries can more precisely reflect the degree of contribution from features on the mutual dependence between any two bounding boxes. [sent-139, score-0.605]

52 Footnote 2: Linear reconstruction has been successfully applied in several recent works on sparse representation [29], subspace clustering [15], etc. [sent-143, score-0.603]

53 , x^k_{l+u}] be the feature matrix of all the bounding boxes, where x^k_i ∈ R^{m_k} is the feature vector of the i-th bounding box (i = 1, . [sent-150, score-1.673]

54 , K, where Ek is the residue matrix removed from Xk. [sent-163, score-0.297]

55 Note that the coefficient matrix Z is shared across K features. [sent-164, score-0.386]

56 Object Co-Detection with Matrix Z∗ After solving for the global structure matrix Z∗ from (4), we can use it to simultaneously detect all target objects from a bounding box collection consisting of the training annotations and the initial bounding box pool from Section 3. [sent-187, score-1.818]

57 We accomplish this task via a clustering procedure which partitions the bounding boxes so that each cluster contains objects with the same visual appearance. [sent-189, score-0.799]

58 Since the coefficient matrix Z∗ inherently captures the mutual dependence of the bounding boxes, it is natural to employ it as an affinity measure for clustering. [sent-190, score-0.923]

59 To ensure the symmetric property of affinity matrices, we convert Z∗ into a symmetric affinity matrix W via the relation [15]: ∗ W =21? [sent-191, score-0.188]

60 ploy Normalized Cuts [25] to segment bounding boxes into N clusters {C1, . [sent-197, score-0.81]
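
The symmetrization and clustering steps can be sketched as follows. The exact symmetrization formula is cut off in the text above, so the averaging W = (|Z| + |Z|^T) / 2 used here is an assumption. The two-way cut uses the second eigenvector of the normalized Laplacian, which is the standard relaxation behind Normalized Cuts; splitting into N clusters would need recursion or more eigenvectors. All function names are hypothetical.

```python
import numpy as np

def symmetric_affinity(Z):
    """Assumed symmetrization: average of |Z| and its transpose."""
    A = np.abs(Z)
    return 0.5 * (A + A.T)

def two_way_ncut(W):
    """Two-way normalized cut from the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)      # eigenvalues come back in ascending order
    y = d_inv_sqrt * vecs[:, 1]      # solves (D - W) y = lambda * D y
    return (y > 0).astype(int)

# Toy coefficient matrix: boxes 0-2 depend on each other, as do boxes 3-4.
Z = np.array([[0.0, 0.9, 0.8, 0.0, 0.1],
              [0.7, 0.0, 0.9, 0.0, 0.0],
              [0.8, 0.6, 0.0, 0.1, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.9],
              [0.0, 0.1, 0.0, 0.8, 0.0]])
W = symmetric_affinity(Z)
labels = two_way_ncut(W)
```

The sign of an eigenvector is arbitrary, so only the grouping is meaningful: boxes 0 to 2 receive one label and boxes 3 and 4 the other.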

61 [Equation (6)] where I_i is an indicator specifying the index of the cluster to which the i-th test bounding box belongs, P(C_q) is the set of positive training bounding boxes in cluster C_q, and |·| denotes the cardinality of a set. [sent-208, score-1.297]

62 This is accomplished by dividing the number of positive training samples in the same cluster as the i-th sample by the highest number of per-cluster positive training samples across all clusters. [sent-214, score-0.189]

63 The result is that clusters with more positive training samples have higher voting power and thus, the scores for test samples in those clusters are likely to have higher weight. [sent-215, score-0.214]
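
The cluster voting weight described above (positive training boxes in a box's own cluster, divided by the largest per-cluster count of positive training boxes) can be computed as in this sketch. It covers only that ratio, not the full score of Equation (6). The function and variable names are hypothetical.

```python
from collections import Counter

def cluster_vote_weights(cluster_of, is_positive_training):
    """Weight of each box's cluster: |P(C_q)| / max over clusters of |P(C_q')|."""
    positives = Counter(
        c for c, pos in zip(cluster_of, is_positive_training) if pos
    )
    largest = max(positives.values(), default=0)
    if largest == 0:
        return [0.0] * len(cluster_of)
    return [positives.get(c, 0) / largest for c in cluster_of]

# Boxes 0-3 are positive training boxes; boxes 4-8 are test candidates.
cluster_of = [0, 0, 0, 1, 1, 2, 0, 1, 2]
is_positive_training = [True, True, True, True, False, False, False, False, False]
weights = cluster_vote_weights(cluster_of, is_positive_training)
```

A test box in cluster 0, which holds three of the four positive training boxes, gets the full weight. A test box in a cluster with no positive training box gets zero.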

64 With these scores on test bounding boxes, we can then obtain a rank list in which the highest positive detections are ranked in the top positions. [sent-217, score-0.655]

65 These top ranking bounding boxes correspond to the result of our co-detection for that respective object category. [sent-218, score-0.827]

66 Multi-Feature Matching (MFM): We first generate an initial bounding box pool through DPM and ESVMs, then rank all the candidate bounding boxes based on their average similarity with respect to the l training bounding boxes. [sent-307, score-2.113]

67 y based on the k-th feature modality, d(xik ,xjk ) is the χ2 distance between xik and xjk, and σk is the mean value of all pairwise χ2 distances between the candidate and training bounding boxes. [sent-320, score-0.717]
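
A sketch of the MFM baseline score follows. The similarity formula itself is not present in the extracted text, so the kernel exp(-d / σ_k), the 1/2 factor in the χ2 distance, and the names `chi2_distance` and `mfm_score` are assumptions made for illustration.

```python
import math

def chi2_distance(p, q):
    """Chi-square distance between two non-negative histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def mfm_score(candidate_feats, training_feats, sigmas):
    """Average over features k and training boxes of exp(-d_k / sigma_k)."""
    total, count = 0.0, 0
    for k, sigma in enumerate(sigmas):
        for train in training_feats:
            d = chi2_distance(candidate_feats[k], train[k])
            total += math.exp(-d / sigma)  # assumed similarity kernel
            count += 1
    return total / count

# One training box described by K = 2 histogram features.
training = [[[0.5, 0.5], [1.0, 0.0]]]
sigmas = [1.0, 1.0]  # in the text, sigma_k is the mean pairwise chi-square distance
same = mfm_score([[0.5, 0.5], [1.0, 0.0]], training, sigmas)
different = mfm_score([[0.0, 1.0], [0.0, 1.0]], training, sigmas)
```

A candidate identical to the training box gets the maximum score, and dissimilar candidates rank lower.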

68 We do not include these methods in our comparison, but emphasize that our detection framework is applicable to any generic feature and outperforms the generic detection methods without using any of those priors or contexts. [sent-322, score-0.237]

69 We extract three kinds of features from each bounding box including SIFT Bag-of-Words (BoW) [17], Gabor [19], and LBP [21] features. [sent-324, score-0.686]

70 We then train a codebook with 1,024 codewords and quantize the descriptors in each bounding box into a 1,024-dimension histogram. [sent-326, score-0.485]

71 For the Gabor feature, we partition each bounding box into 2 × 2 blocks and apply a set of Gabor filters over 4 scales and 6 orientations in each block. [sent-327, score-0.686]

72 Following the evaluation method in PASCAL VOC challenge, a predicted bounding box is considered correct if it overlaps more than 50% with the ground-truth bounding box, otherwise it is considered a false detection. [sent-342, score-1.171]
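
The PASCAL VOC overlap criterion measures the area of intersection divided by the area of union of the predicted and ground-truth boxes. A minimal sketch is given below, assuming boxes in continuous (x1, y1, x2, y2) coordinates; the function names are hypothetical.

```python
def overlap_ratio(pred, gt):
    """Area of intersection divided by area of union for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred, gt, threshold=0.5):
    """A prediction counts as correct when the overlap exceeds the threshold."""
    return overlap_ratio(pred, gt) > threshold

good = is_correct_detection((0, 0, 10, 10), (0, 0, 10, 8))    # overlap 0.8
bad = is_correct_detection((5, 0, 15, 10), (0, 0, 10, 10))    # overlap 1/3
```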

73 The stereo image pairs are obtained from a stereo camera, meaning most images contain matching objects. [sent-394, score-0.21]

74 The authors only provide 354 test stereo pairs for the Ford Car dataset while the other pairs are not publicly available. [sent-397, score-0.259]

75 To ensure a similar setting, we select 300 stereo pairs from the 354 available stereo pairs and select 300 random pairs from the whole dataset for testing on the Ford Car dataset. [sent-398, score-0.35]

76 For the Pedestrian dataset, since there are not any test pairs available, we follow the same stereo pair generation method as in the released pairs of the Ford Car dataset. [sent-399, score-0.302]

77 We select 200 stereo pairs from test frames with the constraint that each pair consists of two frames whose frame interval is at most three within the video sequence. [sent-400, score-0.189]

78 Figure 2 shows example incorrect bounding boxes successfully removed by our method (corresponding to bounding boxes with non-zero columns in the residue matrix). [sent-411, score-1.773]

79 On each dataset, we rank the bounding boxes via the score K1 ? [sent-418, score-0.853]

80 x Ek, and we pick the top three bounding boxes as examples here. [sent-422, score-0.766]

81 The average recall rate across the 20 categories in the bounding box candidate pool is 59. [sent-427, score-0.968]

82 7 bounding box candidates in each image after duplicate removal. [sent-429, score-0.745]

83 Example detection results and removed incorrect bounding boxes on PASCAL VOC 2007 te? [sent-436, score-0.932]

84 Incorrect bounding boxes are picked from the top two bounding boxes ranked by scores K1 ? [sent-439, score-1.532]

85 This is because MLRR infers a shared low-rank coefficient matrix and can aggregate evidence from multiple features, resulting in a more cohesive representation. [sent-575, score-0.383]

86 In Figure 3, we show some detection results and removed noisy bounding boxes by our method. [sent-576, score-0.896]

87 The average recall rate across the 20 categories in the initial bounding box candidate pool is 61. [sent-581, score-0.968]

88 1 bounding box candidates in each test image after duplicate removal. [sent-583, score-0.794]

89 The scalability of our method is dictated by the size of the bounding box candidate pool during the co-detection process. [sent-590, score-1.044]

90 Though large-scale co-detection is not the primary focus of this work, we note that the complexity can be controlled by dividing the test bounding boxes into clusters of moderate size and applying our method within each cluster. [sent-591, score-0.859]

91 For a new image, it is possible to apply the traditional out-of-sample extensions from transductive learning [3] to acquire detection scores of bounding boxes. [sent-593, score-0.553]

92 When testing a new image, we begin by applying DPM and ESVMs on it to obtain its bounding box candidates as before. [sent-594, score-0.686]

93 For each candidate z, we can use its low-level feature to search a set of nearest neighbors {x_i}_{i=1}^{T} from all the bounding boxes in the original dataset, where x_i is a neighbor of z and T is the total number of the neighbors. [sent-595, score-0.936]

94 The result is a detection score for an unseen bounding box. [sent-601, score-0.553]
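
One common way to realize such an out-of-sample extension is to average the scores of the nearest neighbors, weighted by similarity. The paper's actual formula is not present in the extracted text, so the sketch below is a swapped-in heuristic: the Euclidean distance, the exp(-distance) weights, and the name `out_of_sample_score` are all assumptions.

```python
import math

def out_of_sample_score(z, pool_feats, pool_scores, T=3):
    """Similarity-weighted average of the scores of the T nearest pool boxes."""
    nearest = sorted(
        (math.dist(z, x), s) for x, s in zip(pool_feats, pool_scores)
    )[:T]
    weights = [math.exp(-d) for d, _ in nearest]  # closer neighbors weigh more
    return sum(w * s for w, (_, s) in zip(weights, nearest)) / sum(weights)

# Hypothetical 2-D features and co-detection scores of boxes already in the pool.
pool_feats = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
pool_scores = [1.0, 0.8, 0.1, 0.0]
score = out_of_sample_score((0.0, 0.5), pool_feats, pool_scores, T=2)
```

The new box sits halfway between the two high-scoring pool boxes, so its score is the plain average of their scores.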

95 Given a bounding box pool represented in multiple feature spaces, we perform multiple linear reconstructions, each of which produces a reconstruction coefficient matrix measuring the mutual dependency of the bounding boxes. [sent-604, score-1.867]

96 The co-detection problem is formulated as inferring a shared low-rank coefficient matrix across all reconstructions with noise and outlier removing constraints within each [sent-605, score-0.528]

97 The low-rank coefficient matrix captures the global structure of objects across these multiple features and can be used to produce the co-detections using spectral clustering. [sent-741, score-0.404]

98 Experimental results on various object detection benchmarks show that our method outperforms the state-of-the-art generic object detection methods. [sent-742, score-0.289]

99 For future work, we will investigate inductive object co-detection methods which not only infer a reconstruction coefficient matrix to leverage global structure but also build a decision function for bounding boxes unseen in the candidate pool. [sent-743, score-1.389]

100 A unified approach to salient object detection via low rank matrix recovery. [sent-884, score-0.308]

simIndex simValue paperId paperTitle

1 0.92780811 402 cvpr-2013-Social Role Discovery in Human Events

Author: Vignesh Ramanathan, Bangpeng Yao, Li Fei-Fei

Abstract: We deal with the problem of recognizing social roles played by people in an event. Social roles are governed by human interactions, and form a fundamental component of human event description. We focus on a weakly supervised setting, where we are provided different videos belonging to an event class, without training role labels. Since social roles are described by the interaction between people in an event, we propose a Conditional Random Field to model the inter-role interactions, along with person specific social descriptors. We develop tractable variational inference to simultaneously infer model weights, as well as role assignment to all people in the videos. We also present a novel YouTube social roles dataset with ground truth role annotations, and introduce annotations on a subset of videos from the TRECVID-MED11 [1] event kits for evaluation purposes. The performance of the model is compared against different baseline methods on these datasets.

2 0.91497505 358 cvpr-2013-Robust Canonical Time Warping for the Alignment of Grossly Corrupted Sequences

Author: Yannis Panagakis, Mihalis A. Nicolaou, Stefanos Zafeiriou, Maja Pantic

Abstract: Temporal alignment of human behaviour from visual data is a very challenging problem due to numerous reasons, including possible large temporal scale differences, inter/intra subject variability and, more importantly, due to the presence of gross errors and outliers. Gross errors are often in abundance due to incorrect localization and tracking, presence of partial occlusion, etc. Furthermore, such errors rarely follow a Gaussian distribution, which is the de facto assumption in machine learning methods. In this paper, building on recent advances on rank minimization and compressive sensing, a novel temporal alignment method that is robust to gross errors is proposed. While previous approaches combine the dynamic time warping (DTW) with low-dimensional projections that maximally correlate two sequences, we aim to learn two underlying projection matrices (one for each sequence), which not only maximally correlate the sequences but, at the same time, efficiently remove the possible corruptions in any datum in the sequences. The projections are obtained by minimizing the weighted sum of nuclear and ℓ1 norms, by solving a sequence of convex optimization problems, while the temporal alignment is found by applying the DTW in an alternating fashion. The superiority of the proposed method against the state-of-the-art time alignment methods, namely the canonical time warping and the generalized time warping, is indicated by the experimental results on both synthetic and real datasets.

3 0.9145059 18 cvpr-2013-A Max-Margin Riffled Independence Model for Image Tag Ranking

Author: Tian Lan, Greg Mori

Abstract: We propose Max-Margin Riffled Independence Model (MMRIM), a new method for image tag ranking modeling the structured preferences among tags. The goal is to predict a ranked tag list for a given image, where tags are ordered by their importance or relevance to the image content. Our model integrates the max-margin formalism with riffled independence factorizations proposed in [10], which naturally allows for structured learning and efficient ranking. Experimental results on the SUN Attribute and LabelMe datasets demonstrate the superior performance of the proposed model compared with baseline tag ranking methods. We also apply the predicted rank list of tags to several higher-level computer vision applications in image understanding and retrieval, and demonstrate that MMRIM significantly improves the accuracy of these applications.

4 0.87910283 146 cvpr-2013-Enriching Texture Analysis with Semantic Data

Author: Tim Matthews, Mark S. Nixon, Mahesan Niranjan

Abstract: We argue for the importance of explicit semantic modelling in human-centred texture analysis tasks such as retrieval, annotation, synthesis, and zero-shot learning. To this end, low-level attributes are selected and used to define a semantic space for texture. 319 texture classes varying in illumination and rotation are positioned within this semantic space using a pairwise relative comparison procedure. Low-level visual features used by existing texture descriptors are then assessed in terms of their correspondence to the semantic space. Textures with strong presence of attributes connoting randomness and complexity are shown to be poorly modelled by existing descriptors. In a retrieval experiment semantic descriptors are shown to outperform visual descriptors. Semantic modelling of texture is thus shown to provide considerable value in both feature selection and in analysis tasks.

5 0.87615436 292 cvpr-2013-Multi-agent Event Detection: Localization and Role Assignment

Author: Suha Kwak, Bohyung Han, Joon Hee Han

Abstract: We present a joint estimation technique of event localization and role assignment when the target video event is described by a scenario. Specifically, to detect multi-agent events from video, our algorithm identifies agents involved in an event and assigns roles to the participating agents. Instead of iterating through all possible agent-role combinations, we formulate the joint optimization problem as two efficient subproblems—quadratic programming for role assignment followed by linear programming for event localization. Additionally, we reduce the computational complexity significantly by applying role-specific event detectors to each agent independently. We test the performance of our algorithm in natural videos, which contain multiple target events and nonparticipating agents.

same-paper 6 0.8672424 364 cvpr-2013-Robust Object Co-detection

7 0.86712778 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

8 0.83098495 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

9 0.82713509 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

10 0.82413715 231 cvpr-2013-Joint Detection, Tracking and Mapping by Semantic Bundle Adjustment

11 0.82404393 172 cvpr-2013-Finding Group Interactions in Social Clutter

12 0.82398307 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models

13 0.82293147 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

14 0.82257086 432 cvpr-2013-Three-Dimensional Bilateral Symmetry Plane Estimation in the Phase Domain

15 0.82132608 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

16 0.82112813 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation

17 0.82112503 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

18 0.82098776 414 cvpr-2013-Structure Preserving Object Tracking

19 0.82039499 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects

20 0.81890738 325 cvpr-2013-Part Discovery from Partial Correspondence