cvpr cvpr2013 cvpr2013-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joseph Tighe, Svetlana Lazebnik
Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-theart accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. [sent-3, score-0.23]
2 The system combines region-level features with per-exemplar sliding window detectors. [sent-4, score-0.143]
3 Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. [sent-5, score-1.045]
4 The proposed system achieves state-of-theart accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels. [sent-6, score-0.144]
5 Our goal is achieving broad coverage – the ability to recognize hundreds or thousands of object classes that commonly occur in everyday street scenes and indoor environments. [sent-9, score-0.178]
6 But a much larger number of “thing” classes – people, cars, dogs, mailboxes, vases, stop signs – occupy a small percentage of image pixels and have relatively few instances each. [sent-13, score-0.158]
7 “Stuff” categories have no consistent shape but fairly consistent texture, so they can be adequately handled by image parsing systems based on pixel- or region-level features [5, 7, 8, 18, 21, 22, 25, 26, 27, 29]. [sent-14, score-0.387]
8 In order to improve performance on “things,” a few recent image parsing approaches [1, 10, 12, 14, 16] have attempted to incorporate sliding window detectors. [sent-16, score-0.403]
9 Many of these approaches rely on detectors like HOG templates [6] and deformable part-based models (DPMs) [9], which produce only bounding box hypotheses. [sent-17, score-0.333]
10 Trying to infer a pixel-level segmentation from a bounding box is a complex and error-prone process. [sent-19, score-0.142]
11 None of these schemes are well suited for handling large numbers of sparsely-sampled classes with high intra-class variation. [sent-21, score-0.133]
12 In this paper, we propose an image parsing system that integrates region-based cues with the promising novel framework of per-exemplar detectors or exemplarSVMs [19]. [sent-22, score-0.673]
13 Per-exemplar detectors are more appropriate than traditional sliding window detectors for classes with few training samples and wide variability. [sent-23, score-0.695]
14 They also meet our need for pixel-level localization: when a per-exemplar detector fires on a test image, we can take the segmentation mask from the corresponding training exemplar and transfer it into the test image to form a segmentation hypothesis. [sent-24, score-0.524]
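As an illustration of the mask-transfer step just described, the sketch below pastes an exemplar's binary segmentation mask into a test image at a detected bounding box. This is a minimal Python/NumPy sketch under assumed data layouts (the mask is cropped to the exemplar's box, the box is given in pixel coordinates); the function name and the nearest-neighbor resizing are illustrative choices, not the paper's exact implementation.

import numpy as np

def transfer_mask(exemplar_mask, det_box, test_shape):
    # exemplar_mask: (h, w) bool array, exemplar segmentation cropped to its box.
    # det_box: (x0, y0, x1, y1) detection box in test-image coordinates.
    # test_shape: (H, W) of the test image.
    x0, y0, x1, y1 = det_box
    bh, bw = y1 - y0, x1 - x0
    # Nearest-neighbor resize of the exemplar mask to the detection box size.
    rows = np.arange(bh) * exemplar_mask.shape[0] // bh
    cols = np.arange(bw) * exemplar_mask.shape[1] // bw
    resized = exemplar_mask[rows[:, None], cols[None, :]]
    # Paste into an empty test-sized mask, clipping boxes that leave the image.
    hypothesis = np.zeros(test_shape, dtype=bool)
    yy0, xx0 = max(y0, 0), max(x0, 0)
    yy1, xx1 = min(y1, test_shape[0]), min(x1, test_shape[1])
    hypothesis[yy0:yy1, xx0:xx1] = resized[yy0 - y0:yy1 - y0, xx0 - x0:xx1 - x0]
    return hypothesis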
15 The idea of transferring object segmentation masks from training to test images – either whole or in “fragments” – has been explored before in the literature; see, e. [sent-25, score-0.271]
16 However, most existing work uses local feature matches to transfer mask hypotheses, and focuses on one class at a time. [sent-28, score-0.197]
17 To our knowledge, our approach is the first to transfer masks using per-exemplar detectors (Malisiewicz et al. [sent-29, score-0.372]
18 It combines the region-based parser from our earlier work [27] with a novel parser based on per-exemplar detectors. [sent-32, score-0.46]
19 Each parser produces a score or data term for each possible label at each pixel location, and the data terms are combined using a support vector machine (SVM) to generate the final labeling. [sent-33, score-0.357]
20 In particular, the LM+SUN dataset, with 45,676 images and 232 labels, has the broadest coverage of any image parsing benchmark to date. [sent-35, score-0.4]
21 The test image (a) contains a bus – a relatively rare “thing” class. [sent-42, score-0.182]
22 Our region-based parsing system [27] computes class likelihoods (b) based on superpixel features, and it correctly identifies “stuff” regions like sky, road, and trees, but is not able to get the bus (c). [sent-43, score-0.624]
23 To find “things” like bus and car, we run per-exemplar detectors [19] on the test image (d) and transfer masks corresponding to detected training exemplars (e). [sent-44, score-0.628]
24 Since the detectors are not well suited for “stuff,” the result of detector-based parsing (f) is poor. [sent-45, score-0.623]
25 However, combining region-based and detection-based data terms (g) gives the highest accuracy of all and correctly labels most of the bus and part of the car. [sent-46, score-0.153]
26 Method This section presents our hybrid image parsing method as illustrated in Figure 1. [sent-48, score-0.35]
27 Region-Based Parsing For region-based parsing, we use the scalable nonparametric system we have developed earlier [27]. [sent-55, score-0.212]
28 Given a query image, this system first uses global image descriptors to identify a retrieval set of training images similar to the query. [sent-56, score-0.269]
29 For large-scale datasets with many labels (SIFT Flow and LM+SUN in our experiments), we obtain the log-likelihood ratio score based on nonparametric nearest-neighbor estimates (see [27] for details). [sent-60, score-0.207]
30 For smaller-scale datasets with few classes (CamVid), we obtain it from the output of a boosted decision tree classifier. [sent-61, score-0.169]
31 Either way, we use this score to define our region-based data term ER for each pixel p and class c: ER(p, c) = L(sp, c) , (2) where sp is the region containing p. [sent-62, score-0.196]
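In code, Eq. (2) just broadcasts each superpixel's score to the pixels it contains. A minimal sketch, assuming superpixel_labels holds the index s_p of the region containing each pixel and region_scores holds L(s, c); both array names are placeholders, and the scores themselves would come from the nearest-neighbor or boosted-tree estimates described above.

import numpy as np

def region_data_term(superpixel_labels, region_scores):
    # superpixel_labels: (H, W) int array with the superpixel index s_p of each pixel.
    # region_scores: (num_superpixels, num_classes) array of scores L(s, c).
    # Returns E_R as an (H, W, num_classes) array: E_R(p, c) = L(s_p, c).
    return region_scores[superpixel_labels]

# Example: 2 superpixels, 3 classes.
labels = np.array([[0, 0], [1, 1]])
scores = np.array([[0.2, -1.0, 0.5], [1.3, 0.0, -0.7]])
e_r = region_data_term(labels, scores)   # shape (2, 2, 3)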
32 While it may seem intuitive to only train detectors for “thing” categories, we train them for all categories, including ones seemingly inappropriate for a sliding window approach, such as “sky. [sent-67, score-0.415]
33 We follow the detector training procedure of [19], with negative mining done on all training images that do not contain an object of the same class. [sent-69, score-0.23]
34 For our largest LM+SUN dataset we only do negative mining on 1,000 training images most similar to the positive exemplar’s image (we have found that using more does not increase the detection accuracy). [sent-70, score-0.174]
35 At test time, given an image that needs to be parsed, we first obtain a retrieval set of globally similar training images as in Section 2. [sent-71, score-0.204]
36 Then we run the detectors associated with the first k instances of each class in that retrieval set (the instances are ranked in decreasing order of the similarity of their image to the test image, and different instances in the same image are ranked arbitrarily). [sent-73, score-0.619]
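The detector-selection rule in the previous sentence can be sketched as follows. The instance record fields (class, image_rank, detector) are assumed for illustration and are not the paper's actual data structures.

from collections import defaultdict

def select_detectors(retrieval_set_instances, k):
    # retrieval_set_instances: list of dicts with keys
    #   'class'      - semantic label of the training exemplar,
    #   'image_rank' - rank of its source image in the retrieval set (0 = most similar),
    #   'detector'   - the trained per-exemplar detector.
    # Keeps the detectors of the first k instances of each class.
    per_class = defaultdict(list)
    # Sort by increasing image rank, i.e. decreasing similarity to the test image;
    # instances from the same image keep an arbitrary relative order, as in the text.
    for inst in sorted(retrieval_set_instances, key=lambda i: i['image_rank']):
        if len(per_class[inst['class']]) < k:
            per_class[inst['class']].append(inst['detector'])
    return [det for dets in per_class.values() for det in dets]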
37 For each detection we project the associated object mask into the detected bounding box (Figure 2). [sent-76, score-0.174]
38 To compute the detector-based data term ED for a class c and pixel p, we simply take the sum of all detection masks from that class weighted by their detection scores: ED(p, c) = Σ_{d ∈ Dp,c} (wd − td), (3) where Dp,c is the set of all detections for class c whose transferred mask overlaps pixel p and wd is the detection score of d. [sent-77, score-0.159] [sent-81, score-0.376]
39 Figure 2 caption: For each positive detection (green bounding box) in the test image (middle row) we transfer the mask (red polygon) from the associated exemplar (top) into the test image. [sent-79, score-0.343]
40 The data term for “car” (bottom) is obtained by summing all the masks weighted by their detector responses. [sent-80, score-0.199]
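In code, Eq. (3) amounts to accumulating, for every detection of class c, its transferred mask weighted by the margin of the detection score wd over the detector threshold td. A hedged NumPy sketch with assumed detection-record fields:

import numpy as np

def detector_data_term(detections, image_shape, num_classes):
    # detections: list of dicts with keys
    #   'class'     - class index c,
    #   'score'     - detection score w_d,
    #   'threshold' - per-exemplar detector threshold t_d,
    #   'mask'      - (H, W) bool array, the exemplar mask transferred into the test image.
    # Returns E_D as an (H, W, num_classes) array.
    H, W = image_shape
    e_d = np.zeros((H, W, num_classes))
    for d in detections:
        e_d[:, :, d['class']] += d['mask'] * (d['score'] - d['threshold'])
    return e_d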
42 Note that the full training framework of [19] includes computationally intensive calibration and contextual pooling procedures that are meant to make scores of different per-exemplar detectors more consistent. [sent-83, score-0.316]
43 SVM Combination and MRF Smoothing Once we run the parsing systems of Sections 2. [sent-87, score-0.35]
44 2 on a test image, for each pixel p and each class c we end up with two data terms, ER(p, c) and ED (p, c), as defined by eqs. [sent-89, score-0.17]
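Conceptually, the combination step stacks the two data terms over all classes into a single feature vector per pixel (2 × 232 = 464 dimensions on LM+SUN, matching the feature dimensionality quoted in the running-time discussion below) and trains one one-vs-all SVM per class on such vectors. The sketch below uses scikit-learn's LinearSVC as a stand-in for the paper's linear SVM; the helper names and the fixed regularization constant are illustrative, and the subsampling and normalization actually used are described in the next few sentences.

import numpy as np
from sklearn.svm import LinearSVC

def pixel_features(e_r, e_d):
    # e_r, e_d: (H, W, num_classes) region- and detector-based data terms.
    # Returns an (H*W, 2*num_classes) per-pixel feature matrix.
    H, W, C = e_r.shape
    return np.concatenate([e_r, e_d], axis=2).reshape(H * W, 2 * C)

def train_combination_svms(features, pixel_labels, num_classes):
    # One-vs-all linear SVMs: one binary classifier per class.
    svms = []
    for c in range(num_classes):
        clf = LinearSVC(C=1.0)  # regularization constant tuned by cross-validation in the paper
        clf.fit(features, (pixel_labels == c).astype(int))
        svms.append(clf)
    return svms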
45 To make SVM training feasible, we must subsample the data – a tricky task given the unbalanced class frequencies in our many-category datasets. [sent-95, score-0.201]
46 Conversely, subsampling the data so that each class has a roughly equal number of points produces a bias towards the rare classes. [sent-100, score-0.116]
47 For training one-vs-all SVMs, we normalize each feature dimension by its standard deviation and use fivefold cross-validation to find the regularization constant. [sent-104, score-0.229]
48 Since it is infeasible to train a nonlinear SVM with the RBF kernel on our largest dataset, we approximate it by training a linear SVM on top of the random Fourier feature embedding [23]. [sent-107, score-0.314]
49 We set the dimensionality of the embedding to 4,000 and find the kernel bandwidth using fivefold cross-validation. [sent-108, score-0.207]
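For reference, the random Fourier feature construction of [23] maps each input x to z(x) = sqrt(2/D) · cos(Wᵀx + b), with the rows of W drawn from the Gaussian spectral density of the RBF kernel, so that z(x)·z(y) ≈ exp(-gamma ||x - y||^2); a linear SVM on z then approximates the RBF-kernel SVM. A minimal sketch (gamma stands for the cross-validated bandwidth; this is not the authors' code):

import numpy as np

def random_fourier_features(X, num_features=4000, gamma=1.0, seed=0):
    # Maps X of shape (n, d) to z(X) of shape (n, num_features) such that
    # z(x) . z(y) approximates the RBF kernel exp(-gamma * ||x - y||^2).
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    # Frequencies sampled from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)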
50 Let ESVM (pi, ci) denote the response of the SVM for class ci at pixel pi. [sent-111, score-0.147]
51 We smooth the labels with an MRF energy function similar to [18, 25] defined over the field of pixel labels c: J(c) = Σ_{pi ∈ I} max[0, M − ESVM(pi, ci)] + λ Σ_{(pi, pj) ∈ ε} Esmooth(ci, cj), [sent-113, score-0.118]
52 where ε is the set of adjacent pixels, M is the highest expected value of the SVM response (about 10 on our data), λ is a smoothing constant (we set λ = 16), and Esmooth(ci, cj) imposes a penalty when two adjacent pixels (pi, pj) are similar but are assigned different labels (ci, cj) (see eq. [sent-115, score-0.118]
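To make the energy concrete, the sketch below evaluates J(c) for a candidate labeling. The unary hinge term follows the formula above with M = 10 and λ = 16 as stated in the text; the contrast-sensitive pairwise penalty is only an illustrative stand-in for the Esmooth of [27], and minimizing J (the actual MRF inference) is a separate step not shown here.

import numpy as np

def mrf_energy(labels, svm_scores, image, M=10.0, lam=16.0):
    # labels:     (H, W) int array, candidate labeling c.
    # svm_scores: (H, W, num_classes) array of SVM responses E_SVM(p, c).
    # image:      (H, W, 3) float array, used for an illustrative contrast-sensitive penalty.
    H, W = labels.shape
    rows, cols = np.indices((H, W))
    unary = np.maximum(0.0, M - svm_scores[rows, cols, labels]).sum()

    def pairwise(axis):
        diff_label = np.diff(labels, axis=axis) != 0
        color_diff = np.linalg.norm(np.diff(image, axis=axis), axis=-1)
        # Penalize label changes more heavily where adjacent pixels look similar.
        return (diff_label * np.exp(-color_diff)).sum()

    return unary + lam * (pairwise(0) + pairwise(1))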
53 It has 2,488 training images, 200 test images, and 33 labels. [sent-122, score-0.143]
54 Region + Thing uses the SVM trained on the full region data term and the subset of the detector data term corresponding to “thing” classes. [sent-126, score-0.199]
55 Note that training the exact RBF on the largest LM+SUN dataset was computationally infeasible. [sent-137, score-0.174]
56 We use the split of [27], which consists of 45,176 training and 500 test images. [sent-139, score-0.143]
57 For training detectors, we fit a bounding box and a segmentation mask to each connected component of the same label type. [sent-145, score-0.299]
58 [11], and use boosted decision tree classifiers instead of nonparametric likelihood estimates. [sent-150, score-0.116]
59 To obtain training data for the SVM, we compute the responses of the boosted decision tree classifiers on the same images on which they were trained (we have found this to work better than cross-validation on this dataset). [sent-151, score-0.161]
60 On all datasets, we report the overall per-pixel rate (percent of test set pixels correctly labeled), which is dominated by the most common classes, as well as the average of per-class rates, which is dominated by the rarer classes. [sent-153, score-0.218]
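Both reported numbers follow directly from per-pixel comparisons against the ground truth; a small sketch is given below (the handling of unlabeled pixels and of classes absent from the test set is an assumption here, not something specified in the text).

import numpy as np

def parsing_rates(pred, gt, num_classes, ignore_label=-1):
    # pred, gt: (H, W) int arrays of predicted and ground-truth labels.
    # Returns (overall per-pixel rate, average per-class rate).
    valid = gt != ignore_label
    per_pixel = (pred[valid] == gt[valid]).mean()
    class_rates = []
    for c in range(num_classes):
        mask = valid & (gt == c)
        if mask.any():  # skip classes with no labeled test pixels
            class_rates.append((pred[mask] == c).mean())
    return per_pixel, float(np.mean(class_rates))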
61 On the LM+SUN dataset, which has the largest number of rare “thing” classes, the detector-based data term actually obtains higher per-class accuracy than the region-based one. [sent-160, score-0.149]
62 As observed in [27], MRF inference further raises the per-pixel rate, but often lowers the per-class rate by smoothing away some of the smaller objects. [sent-162, score-0.161]
63 Figure 3. Classification rates of individual classes (ordered from most to least frequent) on the SIFT Flow dataset for region-based, detector-based, and combined parsing. [sent-165, score-0.249]
64 Figure 4. Classification rates of the most common individual classes (ordered from most to least frequent) on the LM+SUN dataset for region-based, detector-based, and combined parsing. [sent-176, score-0.249]
65 Interestingly, the results for this setup are weaker than those of the full combined system using both “thing” and “stuff” detectors. [sent-183, score-0.15]
66 Figures 3 and 4 show the per-class rates of our system on the most common classes in the SIFT Flow and LM+SUN datasets, respectively. [sent-187, score-0.242]
67 As expected, adding detectors significantly improves many “thing” classes (including car, sign, and balcony) but also some “stuff” classes (road, sea, sidewalk, fence). [sent-188, score-0.419]
68 Figure 5 gives a close-up look at our performance on many small object categories, and Figure 6 shows several parsing examples on the LM+SUN dataset. [sent-189, score-0.35]
69 Table 3 compares our combined system to a number of state-of-the-art approaches on the SIFT Flow dataset. [sent-190, score-0.15]
70 We outperform them, in many cases beating the average per-class rate by up to 10% while maintaining or exceeding the per-pixel rates. [sent-191, score-0.12]
71 When their system is tuned to a per-pixel rate similar to ours, their average per-class rate drops significantly below ours. [sent-194, score-0.204]
72 On LM+SUN, which has an order of magnitude more images and labels than SIFT Flow, the only previously reported results are from our earlier region-based system [27]. [sent-195, score-0.173]
73 As Table 4 shows, by augmenting the region-based term with a novel detector-based data term and SVM inference, we are able to raise the per-pixel rate from 54. [sent-196, score-0.155]
74 When compared to our region-based system [27], we improve performance for every class except for building and sky, towards which the region-based parser seems to be overly biased. [sent-202, score-0.368]
75 Running Time Finally, we examine the computational requirements of our system on our largest dataset, LM+SUN, by timing our MATLAB implementation (feature extraction and file I/O excluded) on a six-core 3. [sent-208, score-0.144]
76 There are a total of 354,592 objects in the training set, and we train a per-exemplar detector for each of them. [sent-210, score-0.189]
77 For each class we show a crop of an image, the SVM combined output, and the smoothed final result. [sent-230, score-0.13]
78 The caption for each class shows: (# of training instances of that class) / (# of test instances) (per-pixel rate on the test set)%. [sent-231, score-0.395]
79 Leave-one-out parsing of the training set (see below for average region- and detector-based parsing times per image) takes 939 hours on a single CPU, or about two hours on the cluster. [sent-234, score-0.883]
80 Next, training a set of 232 one-vs-all SVMs takes a total of one hour on a single machine for the linear SVM and ten hours for the approximate RBF. [sent-235, score-0.167]
81 Note that the respective feature dimensionalities are 464 and 4,000; this nearly tenfold dimensionality increase accounts for the tenfold increase in running time. [sent-236, score-0.249]
82 Tuning the SVM parameters by fivefold cross-validation on the cluster only increases the training time by a factor of two. [sent-237, score-0.189]
83 At test time, the region-based parsing takes an average of 27. [sent-238, score-0.41]
84 The detector-based parser runs an average of 4,842 detectors per image in 47. [sent-240, score-0.441]
85 9 seconds for the linear kernel and 124 seconds for the approximate RBF (once again, the tenfold increase in feature dimensionality and the overhead of computing the embedding account for the increase in running time). [sent-243, score-0.417]
86 At test time, we would like to reduce the number of detectors that need to be run per image. [sent-251, score-0.293]
87 Instead, we want to develop methods for dynamically selecting detectors for each test image based on context. [sent-255, score-0.293]
88 Also, SVM testing with the approximate RBF embedding imposes a heavy overhead in our current implementation. [sent-256, score-0.167]
89 Ultimately, we want our system to function on open universe datasets, such as LabelMe [24], that are constantly evolving and do not have a pre-defined list of classes of interest. [sent-258, score-0.183]
90 In principle, per-exemplar detectors are also compatible with the open-universe setting, since they can be trained independently as new exemplars come in. [sent-260, score-0.27]
91 Our SVM combination step is the only one that relies on batch offline training (including leave-one-out parsing of the entire training set). [sent-261, score-0.516]
92 Second through fourth columns: region-based data term (top), detector-based data term (middle), and SVM combination (bottom) for three selected class labels. [sent-266, score-0.168]
93 Fifth column: region-based parsing results (top) and detector-based parsing results (bottom) without SVM or MRF smoothing. [sent-267, score-0.7]
94 In (b), the system correctly identifies the wheels of the cars and the headlight of the left car. [sent-270, score-0.171]
95 In (c), the detectors correctly identify the wall and most of the bed. [sent-271, score-0.271]
96 Note that the region-based parser alone mislabels most of the bed as “sea”; the detector-based parser does much better but still mislabels part of the bed as “mountain. [sent-272, score-0.634]
97 ” In this example, the detector-based parser also finds two pictures and a lamp that do not survive in the final output. [sent-273, score-0.253]
98 Notice how the detectors are able to complete the car in (a) and (b). [sent-288, score-0.288]
99 Poselets: Body part detectors trained using 3d human pose annotations. [sent-308, score-0.233]
100 Scene parsing with multiscale feature learning, purity trees, and optimal covers. [sent-339, score-0.35]
wordName wordTfidf (topN-words)
[('parsing', 0.35), ('thing', 0.294), ('lm', 0.271), ('detectors', 0.233), ('parser', 0.208), ('camvid', 0.208), ('svm', 0.198), ('stuff', 0.194), ('mrf', 0.138), ('sun', 0.136), ('rbf', 0.107), ('fivefold', 0.106), ('tenfold', 0.106), ('classes', 0.093), ('system', 0.09), ('masks', 0.086), ('training', 0.083), ('nonparametric', 0.078), ('bus', 0.076), ('mask', 0.074), ('svms', 0.071), ('flow', 0.071), ('mislabels', 0.071), ('class', 0.07), ('things', 0.066), ('instances', 0.065), ('embedding', 0.065), ('road', 0.065), ('plate', 0.065), ('detector', 0.064), ('sift', 0.063), ('sea', 0.063), ('perclass', 0.063), ('retrieval', 0.061), ('combined', 0.06), ('test', 0.06), ('rates', 0.059), ('exemplarsvms', 0.058), ('lowers', 0.058), ('labelme', 0.057), ('rate', 0.057), ('hariharan', 0.056), ('car', 0.055), ('esvm', 0.055), ('farabet', 0.055), ('largest', 0.054), ('sliding', 0.053), ('transfer', 0.053), ('seconds', 0.052), ('loglikelihood', 0.052), ('fik', 0.052), ('esmooth', 0.052), ('coverage', 0.05), ('hours', 0.05), ('bounding', 0.05), ('box', 0.05), ('labeling', 0.05), ('term', 0.049), ('subsample', 0.048), ('td', 0.047), ('tighe', 0.047), ('smoothing', 0.046), ('exemplar', 0.046), ('rare', 0.046), ('inappropriate', 0.045), ('lamp', 0.045), ('suppresses', 0.044), ('ladick', 0.044), ('earlier', 0.044), ('grundmann', 0.043), ('cars', 0.043), ('train', 0.042), ('segmentation', 0.042), ('sky', 0.041), ('pixel', 0.04), ('crossvalidation', 0.04), ('sturgess', 0.04), ('suited', 0.04), ('labels', 0.039), ('alahari', 0.039), ('arxiv', 0.039), ('bed', 0.038), ('datasets', 0.038), ('correctly', 0.038), ('boosted', 0.038), ('categories', 0.037), ('running', 0.037), ('exemplars', 0.037), ('region', 0.037), ('dataset', 0.037), ('ci', 0.037), ('kernel', 0.036), ('wd', 0.036), ('er', 0.035), ('overhead', 0.035), ('indoor', 0.035), ('query', 0.035), ('approximate', 0.034), ('june', 0.034), ('imposes', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
Author: Joseph Tighe, Svetlana Lazebnik
Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-theart accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
2 0.18371437 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
Author: Gautam Singh, Jana Kosecka
Abstract: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing demonstrating the superiority of the method compared to the state of art approaches.
3 0.15308757 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
4 0.14425994 152 cvpr-2013-Exemplar-Based Face Parsing
Author: Brandon M. Smith, Li Zhang, Jonathan Brandt, Zhe Lin, Jianchao Yang
Abstract: In this work, we propose an exemplar-based face image segmentation algorithm. We take inspiration from previous works on image parsing for general scenes. Our approach assumes a database of exemplar face images, each of which is associated with a hand-labeled segmentation map. Given a test image, our algorithm first selects a subset of exemplar images from the database, Our algorithm then computes a nonrigid warp for each exemplar image to align it with the test image. Finally, we propagate labels from the exemplar images to the test image in a pixel-wise manner, using trained weights to modulate and combine label maps from different exemplars. We evaluate our method on two challenging datasets and compare with two face parsing algorithms and a general scene parsing algorithm. We also compare our segmentation results with contour-based face alignment results; that is, we first run the alignment algorithms to extract contour points and then derive segments from the contours. Our algorithm compares favorably with all previous works on all datasets evaluated.
5 0.13810611 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained objects detectors are usually degraded in unconstrained video environments due to variant illuminations, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video specificfeatures like consistent colorpatterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using Conditional Random Field (CRF) along with our super pixel-based BoW model, we develop and algorithm to segment the object from the background . Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. The experiments on four recent datasets demonstrate the effectiveness of our approach and significantly improves the state-of-art detector by 5-16% in average precision.
6 0.13684262 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
7 0.12357303 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation
8 0.12355558 187 cvpr-2013-Geometric Context from Videos
9 0.12261891 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings
10 0.1162274 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
11 0.11579831 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
12 0.11344586 335 cvpr-2013-Poselet Conditioned Pictorial Structures
13 0.1124791 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
14 0.11169402 107 cvpr-2013-Deformable Spatial Pyramid Matching for Fast Dense Correspondences
15 0.11002173 364 cvpr-2013-Robust Object Co-detection
16 0.10913881 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
17 0.1075798 325 cvpr-2013-Part Discovery from Partial Correspondence
18 0.10497475 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
19 0.10302538 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
20 0.10166217 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
topicId topicWeight
[(0, 0.257), (1, -0.065), (2, 0.026), (3, -0.063), (4, 0.116), (5, 0.044), (6, 0.046), (7, 0.094), (8, -0.055), (9, -0.033), (10, 0.014), (11, -0.024), (12, 0.067), (13, -0.011), (14, 0.012), (15, -0.033), (16, 0.062), (17, -0.063), (18, -0.056), (19, -0.016), (20, 0.005), (21, -0.036), (22, 0.071), (23, 0.056), (24, 0.022), (25, 0.034), (26, 0.013), (27, 0.055), (28, -0.036), (29, -0.0), (30, -0.034), (31, -0.027), (32, 0.016), (33, 0.019), (34, -0.072), (35, -0.056), (36, 0.022), (37, -0.061), (38, -0.068), (39, 0.003), (40, -0.016), (41, -0.002), (42, -0.036), (43, -0.051), (44, 0.048), (45, -0.003), (46, 0.046), (47, 0.04), (48, 0.034), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.95440364 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
Author: Joseph Tighe, Svetlana Lazebnik
Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-theart accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
2 0.74606216 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
Author: Payman Yadollahpour, Dhruv Batra, Gregory Shakhnarovich
Abstract: This paper introduces a two-stage approach to semantic image segmentation. In the first stage a probabilistic model generates a set of diverse plausible segmentations. In the second stage, a discriminatively trained re-ranking model selects the best segmentation from this set. The re-ranking stage can use much more complex features than what could be tractably used in the probabilistic model, allowing a better exploration of the solution space than possible by simply producing the most probable solution from the probabilistic model. While our proposed approach already achieves state-of-the-art results (48.1%) on the challenging VOC 2012 dataset, our machine and human analyses suggest that even larger gains are possible with such an approach.
3 0.73358184 247 cvpr-2013-Learning Class-to-Image Distance with Object Matchings
Author: Guang-Tong Zhou, Tian Lan, Weilong Yang, Greg Mori
Abstract: We conduct image classification by learning a class-toimage distance function that matches objects. The set of objects in training images for an image class are treated as a collage. When presented with a test image, the best matching between this collage of training image objects and those in the test image is found. We validate the efficacy of the proposed model on the PASCAL 07 and SUN 09 datasets, showing that our model is effective for object classification and scene classification tasks. State-of-the-art image classification results are obtained, and qualitative results demonstrate that objects can be accurately matched.
4 0.71752501 406 cvpr-2013-Spatial Inference Machines
Author: Roman Shapovalov, Dmitry Vetrov, Pushmeet Kohli
Abstract: This paper addresses the problem of semantic segmentation of 3D point clouds. We extend the inference machines framework of Ross et al. by adding spatial factors that model mid-range and long-range dependencies inherent in the data. The new model is able to account for semantic spatial context. During training, our method automatically isolates and retains factors modelling spatial dependencies between variables that are relevant for achieving higher prediction accuracy. We evaluate the proposed method by using it to predict 1 7-category semantic segmentations on sets of stitched Kinect scans. Experimental results show that the spatial dependencies learned by our method significantly improve the accuracy of segmentation. They also show that our method outperforms the existing segmentation technique of Koppula et al.
5 0.70970553 145 cvpr-2013-Efficient Object Detection and Segmentation for Fine-Grained Recognition
Author: Anelia Angelova, Shenghuo Zhu
Abstract: We propose a detection and segmentation algorithm for the purposes of fine-grained recognition. The algorithm first detects low-level regions that could potentially belong to the object and then performs a full-object segmentation through propagation. Apart from segmenting the object, we can also ‘zoom in ’ on the object, i.e. center it, normalize it for scale, and thus discount the effects of the background. We then show that combining this with a state-of-the-art classification algorithm leads to significant improvements in performance especially for datasets which are considered particularly hard for recognition, e.g. birds species. The proposed algorithm is much more efficient than other known methods in similar scenarios [4, 21]. Our method is also simpler and we apply it here to different classes of objects, e.g. birds, flowers, cats and dogs. We tested the algorithm on a number of benchmark datasets for fine-grained categorization. It outperforms all the known state-of-the-art methods on these datasets, sometimes by as much as 11%. It improves the performance of our baseline algorithm by 3-4%, consistently on all datasets. We also observed more than a 4% improvement in the recognition performance on a challenging largescale flower dataset, containing 578 species of flowers and 250,000 images.
6 0.70555896 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
7 0.70373052 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
8 0.70273703 144 cvpr-2013-Efficient Maximum Appearance Search for Large-Scale Object Detection
9 0.70213705 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
10 0.70003629 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses
11 0.69936067 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
12 0.69812274 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
13 0.69562191 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
14 0.69492352 25 cvpr-2013-A Sentence Is Worth a Thousand Pixels
15 0.68887907 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
16 0.67582935 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
17 0.6670633 262 cvpr-2013-Learning for Structured Prediction Using Approximate Subgradient Descent with Working Sets
18 0.66576558 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
19 0.66398448 425 cvpr-2013-Tensor-Based High-Order Semantic Relation Transfer for Semantic Scene Segmentation
20 0.66098464 143 cvpr-2013-Efficient Large-Scale Structured Learning
topicId topicWeight
[(10, 0.109), (16, 0.035), (24, 0.153), (26, 0.088), (28, 0.021), (33, 0.26), (67, 0.096), (69, 0.055), (77, 0.011), (87, 0.095)]
simIndex simValue paperId paperTitle
1 0.92030013 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video
Author: Shaodi You, Robby T. Tan, Rei Kawakami, Katsushi Ikeuchi
Abstract: Raindrops adhered to a windscreen or window glass can significantly degrade the visibility of a scene. Detecting and removing raindrops will, therefore, benefit many computer vision applications, particularly outdoor surveillance systems and intelligent vehicle systems. In this paper, a method that automatically detects and removes adherent raindrops is introduced. The core idea is to exploit the local spatiotemporal derivatives ofraindrops. First, it detects raindrops based on the motion and the intensity temporal derivatives of the input video. Second, relying on an analysis that some areas of a raindrop completely occludes the scene, yet the remaining areas occludes only partially, the method removes the two types of areas separately. For partially occluding areas, it restores them by retrieving as much as possible information of the scene, namely, by solving a blending function on the detected partially occluding areas using the temporal intensity change. For completely occluding areas, it recovers them by using a video completion technique. Experimental results using various real videos show the effectiveness of the proposed method.
same-paper 2 0.88628924 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
Author: Joseph Tighe, Svetlana Lazebnik
Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-theart accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
3 0.88058794 311 cvpr-2013-Occlusion Patterns for Object Class Detection
Author: Bojan Pepikj, Michael Stark, Peter Gehler, Bernt Schiele
Abstract: Despite the success of recent object class recognition systems, the long-standing problem of partial occlusion remains a major challenge, and a principled solution is yet to be found. In this paper we leave the beaten path of methods that treat occlusion as just another source of noise instead, we include the occluder itself into the modelling, by mining distinctive, reoccurring occlusion patterns from annotated training data. These patterns are then used as training data for dedicated detectors of varying sophistication. In particular, we evaluate and compare models that range from standard object class detectors to hierarchical, part-based representations of occluder/occludee pairs. In an extensive evaluation we derive insights that can aid further developments in tackling the occlusion challenge. –
4 0.87983716 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.
5 0.87888432 384 cvpr-2013-Segment-Tree Based Cost Aggregation for Stereo Matching
Author: Xing Mei, Xun Sun, Weiming Dong, Haitao Wang, Xiaopeng Zhang
Abstract: This paper presents a novel tree-based cost aggregation method for dense stereo matching. Instead of employing the minimum spanning tree (MST) and its variants, a new tree structure, ”Segment-Tree ”, is proposed for non-local matching cost aggregation. Conceptually, the segment-tree is constructed in a three-step process: first, the pixels are grouped into a set of segments with the reference color or intensity image; second, a tree graph is created for each segment; and in the final step, these independent segment graphs are linked to form the segment-tree structure. In practice, this tree can be efficiently built in time nearly linear to the number of the image pixels. Compared to MST where the graph connectivity is determined with local edge weights, our method introduces some ’non-local’ decision rules: the pixels in one perceptually consistent segment are more likely to share similar disparities, and therefore their connectivity within the segment should be first enforced in the tree construction process. The matching costs are then aggregated over the tree within two passes. Performance evaluation on 19 Middlebury data sets shows that the proposed method is comparable to previous state-of-the-art aggregation methods in disparity accuracy and processing speed. Furthermore, the tree structure can be refined with the estimated disparities, which leads to consistent scene segmentation and significantly better aggregation results.
6 0.87519151 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image
7 0.87509871 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
8 0.87454641 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
9 0.87434447 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
10 0.87402672 433 cvpr-2013-Top-Down Segmentation of Non-rigid Visual Objects Using Derivative-Based Search on Sparse Manifolds
11 0.8735404 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
12 0.87095022 152 cvpr-2013-Exemplar-Based Face Parsing
13 0.87003392 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors
14 0.86971909 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
15 0.86905628 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
16 0.86852968 414 cvpr-2013-Structure Preserving Object Tracking
17 0.86844867 325 cvpr-2013-Part Discovery from Partial Correspondence
18 0.86840862 440 cvpr-2013-Tracking People and Their Objects
19 0.86815459 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
20 0.86707455 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds