Author: Joseph Tighe, Svetlana Lazebnik
Abstract: This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-the-art accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
1 This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. [sent-3, score-0.23]
2 The system combines region-level features with per-exemplar sliding window detectors. [sent-4, score-0.143]
3 Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. [sent-5, score-1.045]
4 The proposed system achieves state-of-the-art accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels. [sent-6, score-0.144]
5 Our goal is achieving broad coverage: the ability to recognize hundreds or thousands of object classes that commonly occur in everyday street scenes and indoor environments. [sent-9, score-0.178]
6 But a much larger number of “thing” classes (people, cars, dogs, mailboxes, vases, stop signs) occupy a small percentage of image pixels and have relatively few instances each. [sent-13, score-0.158]
7 “Stuff” categories have no consistent shape but fairly consistent texture, so they can be adequately handled by image parsing systems based on pixel- or region-level features [5, 7, 8, 18, 21, 22, 25, 26, 27, 29]. [sent-14, score-0.387]
8 In order to improve performance on “things,” a few recent image parsing approaches [1, 10, 12, 14, 16] have attempted to incorporate sliding window detectors. [sent-16, score-0.403]
9 Many of these approaches rely on detectors like HOG templates [6] and deformable part-based models (DPMs) [9], which produce only bounding box hypotheses. [sent-17, score-0.333]
10 Inferring a pixel-level segmentation from a bounding box is a complex and error-prone process. [sent-19, score-0.142]
11 None of these schemes are well suited for handling large numbers of sparsely-sampled classes with high intra-class variation. [sent-21, score-0.133]
12 In this paper, we propose an image parsing system that integrates region-based cues with the promising novel framework of per-exemplar detectors or exemplar-SVMs [19]. [sent-22, score-0.673]
13 Per-exemplar detectors are more appropriate than traditional sliding window detectors for classes with few training samples and wide variability. [sent-23, score-0.695]
14 They also meet our need for pixel-level localization: when a per-exemplar detector fires on a test image, we can take the segmentation mask from the corresponding training exemplar and transfer it into the test image to form a segmentation hypothesis. [sent-24, score-0.524]
15 The idea of transferring object segmentation masks from training to test images, either whole or in “fragments,” has been explored before in the literature; see, e. [sent-25, score-0.271]
16 However, most existing work uses local feature matches to transfer mask hypotheses, and focuses on one class at a time. [sent-28, score-0.197]
17 To our knowledge, our approach is the first to transfer masks using per-exemplar detectors (Malisiewicz et al. [sent-29, score-0.372]
18 It combines the region-based parser from our earlier work [27] with a novel parser based on per-exemplar detectors. [sent-32, score-0.46]
19 Each parser produces a score or data term for each possible label at each pixel location, and the data terms are combined using a support vector machine (SVM) to generate the final labeling. [sent-33, score-0.357]
20 In particular, the LM+SUN dataset, with 45,676 images and 232 labels, has the broadest coverage of any image parsing benchmark to date. [sent-35, score-0.4]
21 The test image (a) contains a bus – a relatively rare “thing” class. [sent-42, score-0.182]
22 Our region-based parsing system [27] computes class likelihoods (b) based on superpixel features, and it correctly identifies “stuff” regions like sky, road, and trees, but is not able to get the bus (c). [sent-43, score-0.624]
23 To find “things” like bus and car, we run per-exemplar detectors [19] on the test image (d) and transfer masks corresponding to detected training exemplars (e). [sent-44, score-0.628]
24 Since the detectors are not well suited for “stuff,” the result of detector-based parsing (f) is poor. [sent-45, score-0.623]
25 However, combining region-based and detection-based data terms (g) gives the highest accuracy of all and correctly labels most of the bus and part of the car. [sent-46, score-0.153]
26 Method. This section presents our hybrid image parsing method as illustrated in Figure 1. [sent-48, score-0.35]
27 Region-Based Parsing. For region-based parsing, we use the scalable nonparametric system we have developed earlier [27]. [sent-55, score-0.212]
28 Given a query image, this system first uses global image descriptors to identify a retrieval set of training images similar to the query. [sent-56, score-0.269]
29 For large-scale datasets with many labels (SIFT Flow and LM+SUN in our experiments), we obtain the log-likelihood ratio score based on nonparametric nearest-neighbor estimates (see [27] for details). [sent-60, score-0.207]
30 For smaller-scale datasets with few classes (CamVid), we obtain it from the output of a boosted decision tree classifier. [sent-61, score-0.169]
31 Either way, we use this score to define our region-based data term $E_R$ for each pixel $p$ and class $c$: $E_R(p, c) = L(s_p, c)$ (2), where $s_p$ is the region containing $p$. [sent-62, score-0.196]
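As a minimal sketch of eq. (2), the region-based data term is a lookup: every pixel inherits the score of the superpixel that contains it. The dictionary layout, the function name `region_data_term`, and the toy scores below are illustrative assumptions, not part of the original system.

```python
def region_data_term(superpixel_of, log_ratio):
    """Return E_R as {pixel: {class: score}}, with E_R(p, c) = L(s_p, c).

    superpixel_of: {pixel: superpixel id}   (hypothetical layout)
    log_ratio:     {superpixel id: {class: L(s, c)}}
    """
    return {p: dict(log_ratio[s]) for p, s in superpixel_of.items()}


# Toy example: three pixels, two superpixels, two classes.
superpixel_of = {(0, 0): "s1", (0, 1): "s1", (1, 0): "s2"}
log_ratio = {
    "s1": {"sky": 1.5, "road": -0.7},
    "s2": {"sky": -2.0, "road": 0.9},
}
E_R = region_data_term(superpixel_of, log_ratio)
```

Pixels `(0, 0)` and `(0, 1)` share superpixel `s1`, so they receive identical scores for every class.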
32 While it may seem intuitive to only train detectors for “thing” categories, we train them for all categories, including ones seemingly inappropriate for a sliding window approach, such as “sky.” [sent-67, score-0.415]
33 We follow the detector training procedure of [19], with negative mining done on all training images that do not contain an object of the same class. [sent-69, score-0.23]
34 For our largest LM+SUN dataset we only do negative mining on 1,000 training images most similar to the positive exemplar’s image (we have found that using more does not increase the detection accuracy). [sent-70, score-0.174]
35 At test time, given an image that needs to be parsed, we first obtain a retrieval set of globally similar training images as in Section 2. [sent-71, score-0.204]
36 Then we run the detectors associated with the first k instances of each class in that retrieval set (the instances are ranked in decreasing order of the similarity of their image to the test image, and different instances in the same image are ranked arbitrarily). [sent-73, score-0.619]
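The selection rule in the sentence above can be sketched as follows, assuming the retrieval set is available as a list of images with a similarity score and a list of annotated instances. The tuple layout and the name `select_detectors` are hypothetical; ties within one image are broken by list order, which stands in for the "arbitrary" ranking mentioned in the text.

```python
from collections import defaultdict


def select_detectors(retrieval_set, k):
    """Pick the first k instances of each class from a ranked retrieval set.

    retrieval_set: list of (image_id, similarity, [(instance_id, class), ...])
    Returns {class: [instance_id, ...]} with at most k entries per class.
    """
    # Rank images by decreasing similarity to the test image.
    ranked = sorted(retrieval_set, key=lambda item: item[1], reverse=True)
    chosen = defaultdict(list)
    for _image_id, _similarity, instances in ranked:
        for instance_id, label in instances:
            if len(chosen[label]) < k:
                chosen[label].append(instance_id)
    return dict(chosen)


retrieval_set = [
    ("a", 0.90, [("a1", "car"), ("a2", "car")]),
    ("b", 0.95, [("b1", "car"), ("b2", "bus")]),
    ("c", 0.50, [("c1", "car")]),
]
selected = select_detectors(retrieval_set, k=2)
```

With `k=2`, the two "car" detectors come from the two most similar images (`b`, then `a`), and the remaining car instances are skipped.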
37 For each detection we project the associated object mask into the detected bounding box (Figure 2). [sent-76, score-0.174]
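One way to picture the mask projection is a nearest-neighbor resize of the exemplar's binary mask into the detected box. The sketch below assumes masks stored as nested lists and boxes given as `(x0, y0, x1, y1)` with exclusive upper bounds; the source does not say which interpolation was used, so nearest-neighbor is an assumption.

```python
def project_mask(mask, box, image_h, image_w):
    """Stretch a binary exemplar mask over a detected bounding box.

    mask: 2D list of 0/1 covering the exemplar's own bounding box.
    box:  (x0, y0, x1, y1) in test-image pixels, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    mask_h, mask_w = len(mask), len(mask[0])
    box_h, box_w = y1 - y0, x1 - x0
    out = [[0] * image_w for _ in range(image_h)]
    # Clip the box to the image, then sample the mask by nearest neighbor.
    for y in range(max(y0, 0), min(y1, image_h)):
        for x in range(max(x0, 0), min(x1, image_w)):
            src_y = (y - y0) * mask_h // box_h
            src_x = (x - x0) * mask_w // box_w
            out[y][x] = mask[src_y][src_x]
    return out


exemplar_mask = [[1, 0],
                 [1, 1]]
projected = project_mask(exemplar_mask, box=(1, 1, 5, 5), image_h=6, image_w=6)
```

Here each mask cell expands to a 2x2 block inside the 4x4 box, and pixels outside the box stay zero.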
38 To compute the detector-based data term $E_D$ for a class $c$ and pixel [sent-77, score-0.159]
39 Figure 2. For each positive detection (green bounding box) in the test image (middle row) we transfer the mask (red polygon) from the associated exemplar (top) into the test image. [sent-79, score-0.343]
40 The data term for “car” (bottom) is obtained by summing all the masks weighted by their detector responses. [sent-80, score-0.199]
41 $p$, we simply take the sum of all detection masks from that class weighted by their detection scores: $$E_D(p, c) = \sum_{d \in D_{p,c}} (w_d - t_d), \qquad (3)$$ where $D_{p,c}$ is the set of all detections for class $c$ whose transferred mask overlaps pixel $p$ and $w_d$ is the detection score of $d$. [sent-81, score-0.376]
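Eq. (3) amounts to accumulating one weighted mask per detection. The sketch below assumes each detection carries its class, its score $w_d$, an offset $t_d$ (read here as a per-detection threshold, which this passage does not define), and a mask already projected into test-image coordinates. The numeric values are made up for illustration.

```python
def detector_data_term(detections, image_h, image_w):
    """Accumulate E_D(p, c) = sum over detections d of (w_d - t_d).

    detections: list of (class, w_d, t_d, mask), where mask is a 2D 0/1 list
    the size of the test image. t_d is assumed to be a detection threshold.
    """
    E_D = {}
    for label, w_d, t_d, mask in detections:
        term = E_D.setdefault(
            label, [[0.0] * image_w for _ in range(image_h)]
        )
        for y in range(image_h):
            for x in range(image_w):
                if mask[y][x]:  # detection d belongs to D_{p,c} at this pixel
                    term[y][x] += w_d - t_d
    return E_D


detections = [
    ("car", -0.5, -1.0, [[1, 1], [0, 0]]),   # contributes 0.5 per pixel
    ("car", 0.25, -1.0, [[1, 0], [1, 0]]),   # contributes 1.25 per pixel
]
E_D = detector_data_term(detections, image_h=2, image_w=2)
```

The top-left pixel is covered by both masks and therefore receives the sum of both contributions.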
42 Note that the full training framework of [19] includes computationally intensive calibration and contextual pooling procedures that are meant to make scores of different per-exemplar detectors more consistent. [sent-83, score-0.316]
43 SVM Combination and MRF Smoothing. Once we run the parsing systems of Sections 2. [sent-87, score-0.35]
44 2 on a test image, for each pixel $p$ and each class $c$ we end up with two data terms, $E_R(p, c)$ and $E_D(p, c)$, as defined by eqs. [sent-89, score-0.17]
45 To make SVM training feasible, we must subsample the data – a tricky task given the unbalanced class frequencies in our many-category datasets. [sent-95, score-0.201]
46 Conversely, subsampling the data so that each class has a roughly equal number of points produces a bias towards the rare classes. [sent-100, score-0.116]
47 For training one-vs-all SVMs, we normalize each feature dimension by its standard deviation and use fivefold cross-validation to find the regularization constant. [sent-104, score-0.229]
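The normalization step can be sketched as dividing every feature column by its standard deviation. Whether the population or sample estimate was used is not stated, so the population form (`statistics.pstdev`) and the guard for constant columns are assumptions of this sketch.

```python
import statistics


def scale_by_std(rows):
    """Divide each feature dimension by its standard deviation.

    rows: list of equal-length feature vectors.
    Returns (scaled rows, per-dimension standard deviations).
    """
    columns = list(zip(*rows))
    # A constant column has zero deviation; leave it unscaled (assumption).
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [[v / s for v, s in zip(row, stds)] for row in rows]
    return scaled, stds


features = [[2.0, 5.0],
            [6.0, 5.0]]
scaled, stds = scale_by_std(features)
```

The same `stds` computed on the training data would be reused to scale test features.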
48 Since it is infeasible to train a nonlinear SVM with the RBF kernel on our largest dataset, we approximate it by training a linear SVM on top of the random Fourier feature embedding [23]. [sent-107, score-0.314]
49 We set the dimensionality of the embedding to 4,000 and find the kernel bandwidth using fivefold cross-validation. [sent-108, score-0.207]
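For readers unfamiliar with random Fourier features, the sketch below shows the standard construction for the RBF kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$: inner products of the embedded vectors approximate the kernel, so a linear SVM on the embedding approximates an RBF SVM. This is the textbook form, not the authors' code; `gamma`, the seed, and the toy vectors are illustrative, and only the embedding size of 4,000 comes from the text.

```python
import math
import random


def make_rff(input_dim, embed_dim, gamma, seed=0):
    """Build a random Fourier feature map for exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # std of the Gaussian frequency samples
    W = [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
         for _ in range(embed_dim)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(embed_dim)]
    scale = math.sqrt(2.0 / embed_dim)

    def embed(x):
        return [
            scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bi)
            for w, bi in zip(W, b)
        ]

    return embed


gamma = 0.5
embed = make_rff(input_dim=3, embed_dim=4000, gamma=gamma)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]
approx = sum(a * c for a, c in zip(embed(x), embed(y)))
exact = math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))
```

The approximation error shrinks roughly as one over the square root of the embedding dimension, which is why a fairly large embedding is needed.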
50 Let $E_{SVM}(p_i, c_i)$ denote the response of the SVM for class $c_i$ at pixel $p_i$. [sent-111, score-0.147]
51 We smooth the labels with an MRF energy function similar to [18, 25] defined over the field of pixel labels $\mathbf{c}$: $$J(\mathbf{c}) = \sum_{p_i \in I} \max[0, M - E_{SVM}(p_i, c_i)] + \lambda \sum_{(p_i, p_j) \in \mathcal{E}} E_{smooth}(c_i, c_j),$$ [sent-113, score-0.118]
52 $\mathcal{E}$ is the set of adjacent pixels, $M$ is the highest expected value of the SVM response (about 10 on our data), $\lambda$ is a smoothing constant (we set $\lambda = 16$), and $E_{smooth}(c_i, c_j)$ imposes a penalty when two adjacent pixels $(p_i, p_j)$ are similar but are assigned different labels $(c_i, c_j)$ (see eq. [sent-115, score-0.118]
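To make the energy concrete, the sketch below evaluates it for a given labeling: a hinge-like data cost per pixel plus a weighted smoothness cost per adjacent pair. The exact form of the smoothing penalty is not given in this excerpt, so it is passed in as a function; the constant penalty used in the example is a hypothetical stand-in. `M = 10` and `lam = 16` follow the values quoted in the text.

```python
def mrf_energy(labels, e_svm, edges, e_smooth, M=10.0, lam=16.0):
    """Evaluate the data cost plus lam times the smoothness cost.

    labels:   {pixel: class}
    e_svm:    {pixel: {class: SVM response}}
    edges:    list of adjacent pixel pairs (p_i, p_j)
    e_smooth: function (c_i, c_j) -> penalty; its true form is not given here
    """
    data = sum(max(0.0, M - e_svm[p][c]) for p, c in labels.items())
    smooth = sum(e_smooth(labels[p], labels[q]) for p, q in edges)
    return data + lam * smooth


def constant_penalty(c_i, c_j):
    # Hypothetical stand-in: fixed cost whenever neighboring labels differ.
    return 0.0 if c_i == c_j else 0.25


e_svm = {
    0: {"sky": 9.0, "road": 2.0},
    1: {"sky": 4.0, "road": 6.0},
    2: {"sky": 1.0, "road": 8.0},
}
edges = [(0, 1), (1, 2)]
energy_mixed = mrf_energy({0: "sky", 1: "road", 2: "road"},
                          e_svm, edges, constant_penalty)
energy_flat = mrf_energy({0: "road", 1: "road", 2: "road"},
                         e_svm, edges, constant_penalty)
```

In this toy case the mixed labeling has lower energy than the all-"road" labeling, because the strong "sky" response at pixel 0 outweighs the cost of one label boundary. Finding the minimizing labeling requires an MRF solver, which this sketch does not attempt.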
53 It has 2,488 training images, 200 test images, and 33 labels. [sent-122, score-0.143]
54 Region + Thing uses the SVM trained on the full region data term and the subset of the detector data term corresponding to “thing” classes. [sent-126, score-0.199]
55 Note that training the exact RBF on the largest LM+SUN dataset was computationally infeasible. [sent-137, score-0.174]
56 We use the split of [27], which consists of 45,176 training and 500 test images. [sent-139, score-0.143]
57 For training detectors, we fit a bounding box and a segmentation mask to each connected component of the same label type. [sent-145, score-0.299]
58 [11], and use boosted decision tree classifiers instead of nonparametric likelihood estimates. [sent-150, score-0.116]
59 To obtain training data for the SVM, we compute the responses of the boosted decision tree classifiers on the same images on which they were trained (we have found this to work better than crossvalidation on this dataset). [sent-151, score-0.161]
60 On all datasets, we report the overall per-pixel rate (percent of test set pixels correctly labeled), which is dominated by the most common classes, as well as the average of per-class rates, which is dominated by the rarer classes. [sent-153, score-0.218]
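The two evaluation measures can be sketched on flat lists of ground-truth and predicted pixel labels. The function name and the toy labels are illustrative; the example shows how a rare class barely moves the per-pixel rate but strongly affects the per-class average.

```python
from collections import Counter


def parsing_rates(true_labels, predicted_labels):
    """Return (per-pixel rate, average per-class rate) as fractions."""
    total = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    per_pixel = sum(correct.values()) / len(true_labels)
    # Average the recall of each class that occurs in the ground truth.
    per_class = sum(correct[c] / n for c, n in total.items()) / len(total)
    return per_pixel, per_class


truth = ["sky"] * 8 + ["bus"] * 2
prediction = ["sky"] * 8 + ["sky", "bus"]
per_pixel, per_class = parsing_rates(truth, prediction)
```

Missing one of the two "bus" pixels costs only ten points of per-pixel rate (0.9) but pulls the per-class average down to 0.75.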
61 On the LM+SUN dataset, which has the largest number of rare “thing” classes, the detector-based data term actually obtains higher per-class accuracy than the region-based one. [sent-160, score-0.149]
62 As observed in [27], MRF inference further raises the per-pixel rate, but often lowers the per-class rate by smoothing away some of the smaller objects. [sent-162, score-0.161]
63 Figure 3. Classification rates of individual classes (ordered from most to least frequent) on the SIFT Flow dataset for region-based, detector-based, and combined parsing. [sent-165, score-0.249]
64 Figure 4. Classification rates of the most common individual classes (ordered from most to least frequent) on the LM+SUN dataset for region-based, detector-based, and combined parsing. [sent-176, score-0.249]
65 Interestingly, the results for this setup are weaker than those of the full combined system using both “thing” and “stuff” detectors. [sent-183, score-0.15]
66 Figures 3 and 4 show the per-class rates of our system on the most common classes in the SIFT Flow and LM+SUN datasets, respectively. [sent-187, score-0.242]
67 As expected, adding detectors significantly improves many “thing” classes (including car, sign, and balcony) but also some “stuff” classes (road, sea, sidewalk, fence). [sent-188, score-0.419]
68 Figure 5 gives a close-up look at our performance on many small object categories, and Figure 6 shows several parsing examples on the LM+SUN dataset. [sent-189, score-0.35]
69 Table 3 compares our combined system to a number of state-of-the-art approaches on the SIFT Flow dataset. [sent-190, score-0.15]
70 We outperform them, in many cases beating the average per-class rate by up to 10% while maintaining or exceeding the per-pixel rates. [sent-191, score-0.12]
71 When their system is tuned to a per-pixel rate similar to ours, their average per-class rate drops significantly below ours. [sent-194, score-0.204]
72 On LM+SUN, which has an order of magnitude more images and labels than SIFT Flow, the only previously reported results are from our earlier region-based system [27]. [sent-195, score-0.173]
73 As Table 4 shows, by augmenting the region-based term with a novel detector-based data term and SVM inference, we are able to raise the per-pixel rate from 54. [sent-196, score-0.155]
74 When compared to our region-based system [27], we improve performance for every class except for building and sky, towards which the region-based parser seems to be overly biased. [sent-202, score-0.368]
75 Running Time. Finally, we examine the computational requirements of our system on our largest dataset, LM+SUN, by timing our MATLAB implementation (feature extraction and file I/O excluded) on a six-core 3. [sent-208, score-0.144]
76 There are a total of 354,592 objects in the training set, and we train a per-exemplar detector for each of them. [sent-210, score-0.189]
77 For each class we show a crop of an image, the SVM combined output, and the smoothed final result. [sent-230, score-0.13]
78 The caption for each class shows: (# of training instances of that class) / (# of test instances) (per-pixel rate on the test set)%. [sent-231, score-0.395]
79 Leave-one-out parsing of the training set (see below for average region- and detector-based parsing times per image) takes 939 hours on a single CPU, or about two hours on the cluster. [sent-234, score-0.883]
80 Next, training a set of 232 one-vs-all SVMs takes a total of one hour on a single machine for the linear SVM and ten hours for the approximate RBF. [sent-235, score-0.167]
81 Note that the respective feature dimensionalities are 464 and 4,000; this nearly tenfold dimensionality increase accounts for the tenfold increase in running time. [sent-236, score-0.249]
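The quoted dimensionalities can be checked with a line of arithmetic. Reading 464 as two data terms (region-based and detector-based) for each of the 232 labels is an inference from the surrounding description, not something this sentence states.

```latex
\[
\frac{4000}{464} \approx 8.6, \qquad 464 = 2 \times 232 .
\]
```

A ratio of about 8.6 is consistent with the "nearly tenfold" increase mentioned in the text.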
82 Tuning the SVM parameters by fivefold cross-validation on the cluster only increases the training time by a factor of two. [sent-237, score-0.189]
83 At test time, the region-based parsing takes an average of 27. [sent-238, score-0.41]
84 The detector-based parser runs an average of 4,842 detectors per image in 47. [sent-240, score-0.441]
85 9 seconds for the linear kernel and 124 seconds for the approximate RBF (once again, the tenfold increase in feature dimensionality and the overhead of computing the embedding account for the increase in running time). [sent-243, score-0.417]
86 At test time, we would like to reduce the number of detectors that need to be run per image. [sent-251, score-0.293]
87 Instead, we want to develop methods for dynamically selecting detectors for each test image based on context. [sent-255, score-0.293]
88 Also, SVM testing with the approximate RBF embedding imposes a heavy overhead in our current implementation. [sent-256, score-0.167]
89 Ultimately, we want our system to function on open universe datasets, such as LabelMe [24], that are constantly evolving and do not have a pre-defined list of classes of interest. [sent-258, score-0.183]
90 In principle, per-exemplar detectors are also compatible with the open-universe setting, since they can be trained independently as new exemplars come in. [sent-260, score-0.27]
91 Our SVM combination step is the only one that relies on batch offline training (including leave-one-out parsing of the entire training set). [sent-261, score-0.516]
92 Second through fourth columns: region-based data term (top), detector-based data term (middle), and SVM combination (bottom) for three selected class labels. [sent-266, score-0.168]
93 Fifth column: region-based parsing results (top) and detector-based parsing results (bottom) without SVM or MRF smoothing. [sent-267, score-0.7]
94 In (b), the system correctly identifies the wheels of the cars and the headlight of the left car. [sent-270, score-0.171]
95 In (c), the detectors correctly identify the wall and most of the bed. [sent-271, score-0.271]
96 Note that the region-based parser alone mislabels most of the bed as “sea”; the detector-based parser does much better but still mislabels part of the bed as “mountain.” [sent-272, score-0.634]
97 In this example, the detector-based parser also finds two pictures and a lamp that do not survive in the final output. [sent-273, score-0.253]
98 Notice how the detectors are able to complete the car in (a) and (b). [sent-288, score-0.288]
99 Poselets: Body part detectors trained using 3d human pose annotations. [sent-308, score-0.233]
100 Scene parsing with multiscale feature learning, purity trees, and optimal covers. [sent-339, score-0.35]
