Author: Gautam Singh, Jana Kosecka

Abstract: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours, using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing, demonstrating the superiority of the method compared to state-of-the-art approaches.

1 Nonparametric scene parsing with adaptive feature relevance and semantic context Gautam Singh, Jana Košecká, George Mason University, Fairfax, VA {gsinghc,kosecka}@cs. [sent-1, score-0.729]

2 edu Abstract This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. [sent-3, score-0.676]

3 We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. [sent-4, score-0.457]

4 To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates. [sent-5, score-1.082]

5 The approach is validated by experiments on several datasets used for semantic parsing, demonstrating the superiority of the method compared to state-of-the-art approaches. [sent-6, score-0.542]

6 Introduction The problem of semantic labelling requires simultaneous segmentation of an image into regions and categorization of all the image pixels. [sent-8, score-0.38]

7 With the increasing complexity and size of the datasets used for evaluation of semantic segmentation, nonparametric techniques [15, 26] combined with various context driven retrieval strategies have demonstrated notable improvement in the performance. [sent-11, score-0.848]

8 These methods typically start with an oversegmentation of an image into superpixels followed by the computation of a rich set of features characterizing both appearance and local geometry at the superpixel level. [sent-12, score-0.479]

9 In the proposed work, we follow a nonparametric approach and make the following contributions: (i) We forgo the use of large superpixels and complex features and tackle the problem of semantic segmentation using local patches characterized by gradient orientation, color and location features. [sent-14, score-0.849]

10 The proposed approach is validated extensively on several semantic segmentation datasets, consistently showing improved performance over state-of-the-art methods. [sent-16, score-0.486]

11 Related Work In recent years, a large number of approaches for semantic segmentation have been proposed. [sent-18, score-0.38]

12 Context is often captured by a retrieval set of images similar to the query and methods developed for establishing matches between image regions (at pixel or superpixel level) for labelling the image. [sent-28, score-0.951]

13 The authors of [26] work at the superpixel level and retrieve similar images using global image features, followed by superpixel-level matching using local features and a Markov random field (MRF) to incorporate neighbourhood context. [sent-30, score-0.345]

14 The work of [26] was extended by [4] by training per-superpixel, per-feature weights and also by incorporating superpixel-level semantic context. [sent-31, score-0.743]

15 A set of partially similar images is used in [31] by searching for matches for each region of the query image and then using the retrieval set for label transfer. [sent-32, score-0.538]

16 A nonparametric method which avoids the construction of a retrieval set is [8] which instead addresses the problem of semantic labelling by building a graph of patch correspondences across image sets and transfers annotations to unlabeled images using the established correspondences. [sent-33, score-1.059]

17 Our work is closely related to the work of [26, 4] in that we also pursue a nonparametric approach, but differs in the choice of elementary regions, features, feature relevance learning and the method for computing the retrieval set for k-NN classification. [sent-35, score-0.631]

18 In our case, the retrieval set is obtained in a feedback manner using a novel semantic label descriptor computed from the initial semantic segmentation. [sent-36, score-1.132]

19 Similarly to [4], we follow the observation that a single global distance metric is often not sufficient for handling the large variations within a class and propose to compute weights for individual feature channels. [sent-37, score-0.314]

20 The computation of the feature relevance we adopt falls into a broad class of distance metric learning techniques which have been shown to be beneficial for many problems like image classification [5], object segmentation [17] and image annotation [9]. [sent-39, score-0.372]

21 Approach In this section, we will describe our baseline approach, followed by the method of weight computation in Section 4 and semantic contextual retrieval in Section 5. [sent-42, score-0.709]

22 Problem Formulation We formulate the semantic segmentation of an image segmented into small superpixels. [sent-45, score-0.38]

23 The output of the semantic segmentation is a labelling L = (l1, l2, . [sent-46, score-0.367]

24 The posterior probability of a labelling L given the observed appearance feature vectors A = [a1, a2 , . [sent-57, score-0.408]

25 We estimate the labelling L as the maximum a posteriori (MAP) estimate, $\arg\max_L P(L|A) = \arg\max_L P(A|L)\,P(L)$. [sent-35, score-0.384]

26 Superpixels and features For an image, we extract superpixels utilizing a segmentation method [29] where superpixel boundaries are obtained as watersheds on a negative absolute Laplacian image with LoG extrema as seeds. [sent-65, score-0.479]

27 These blob-based superpixels are efficient to compute and naturally consistent with the boundaries. [sent-66, score-0.272]

28 Similarly to [18], for each superpixel, we compute a 133-dimensional feature vector ai comprising a SIFT descriptor (128 dimensions), the color mean over the pixels of the superpixel in Lab color space (3 dimensions) and the location of the superpixel centroid (2 dimensions). [sent-67, score-0.898]

29 The SIFT descriptor for a superpixel is computed at a fixed scale and orientation using publicly available code [27]. [sent-68, score-0.362]
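
As an illustration of the 133-dimensional layout described above, the following minimal Python sketch concatenates the three channels and splits them again. The function names and the ordering of the channels inside the vector (SIFT first, then color, then location) are assumptions made for this example; the text only states the dimensions.

```python
# Hypothetical layout of the concatenated vector: SIFT, then Lab mean, then centroid.
CHANNELS = {
    "sift": slice(0, 128),
    "color": slice(128, 131),
    "location": slice(131, 133),
}


def superpixel_feature(sift, lab_mean, centroid):
    """Concatenate the three channels into one 133-dimensional list."""
    if len(sift) != 128 or len(lab_mean) != 3 or len(centroid) != 2:
        raise ValueError("expected 128 + 3 + 2 dimensions")
    return list(sift) + list(lab_mean) + list(centroid)


def split_channels(feature):
    """Return a dict mapping channel name to its slice of the vector."""
    return {name: feature[s] for name, s in CHANNELS.items()}


a = superpixel_feature([0.0] * 128, [50.0, 1.0, -2.0], [0.25, 0.75])
print(len(a), split_channels(a)["color"])
```

Keeping the channel boundaries in one place makes it easy to compute distances per channel later on.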

30 The individual label likelihood P(ai |lj) for a superpixel si is obtained using a k-NN method. [sent-77, score-0.462]

31 Since a superpixel is uniquely represented by its feature vector, we use the symbols si and ai interchangeably. [sent-78, score-0.347]

32 We compute the normalized label likelihood score using the individual label likelihood: P(ai|lj) = nLL(ai, lj) ? [sent-81, score-0.412]

33 1 A straightforward way to compute the neighbourhood Nik is to use the concatenated feature ai (Section 3. [sent-86, score-0.386]

34 2) and retrieve the k nearest points by computing distance to superpixels in G. [sent-87, score-0.367]

35 Such a retrieval can be efficiently performed by the use of approximate nearest neighbour methods like k-d trees [19]. [sent-88, score-0.422]
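
The neighbourhood retrieval can be sketched as follows. This example uses exact brute-force search for clarity, whereas the text relies on approximate k-d tree search; the helper names and the toy data are assumptions, and the label counting is only the raw input to a likelihood score, not the score itself.

```python
import heapq
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, points, k):
    """Indices of the k points closest to the query, nearest first."""
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: euclidean(query, points[i])
    )


def neighbour_label_counts(query, points, labels, k):
    """Count how often each label occurs among the k nearest neighbours."""
    return Counter(labels[i] for i in k_nearest(query, points, k))


points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
labels = ["sky", "tree", "road", "sky"]
print(neighbour_label_counts([0.0, 0.0], points, labels, 3))
```

For large retrieval sets the linear scan would be replaced by a spatial index, at the cost of approximate results.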

36 (2) can be rewritten in log-space and the optimal labelling L∗ achieved as argmin_L ? [sent-93, score-0.32]

37 For example, when trying to label a seaside image, it is more helpful if we search for the nearest neighbours in images of beaches and discard views from street scenes. [sent-108, score-0.336]

38 It helps discard images which are dissimilar to the query image and provides a scene-level context which can help improve the labelling performance. [sent-110, score-0.589]

39 The retrieval subset will serve as the source of image annotations which will be used to label the query image. [sent-111, score-0.538]

40 All the images in the training set T are ranked for each individual global image feature in ascending order of the Euclidean distance from the query image. [sent-113, score-0.465]

41 Finally, we select a subset of images Tg from the training set T as the retrieval set. [sent-115, score-0.269]
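
A possible sketch of building the retrieval set from per-feature rankings is given below. The text states that images are ranked per global feature by ascending Euclidean distance and that a subset is then selected; how the per-feature ranks are combined is not stated here, so taking the minimum rank across features is an assumption of this example, as are all names.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query_feat, train_feats):
    """Rank of every training image (0 = closest) for one global feature."""
    order = sorted(
        range(len(train_feats)), key=lambda i: euclidean(query_feat, train_feats[i])
    )
    ranks = [0] * len(train_feats)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def retrieval_set(query, train, size):
    """Pick `size` training images by their best (minimum) per-feature rank.

    `query` maps feature name -> vector; `train` is a list of such dicts.
    The minimum-rank aggregation is an assumption, not taken from the text.
    """
    per_feature = [
        rank_by_distance(query[name], [t[name] for t in train]) for name in query
    ]
    best = [min(r[i] for r in per_feature) for i in range(len(train))]
    return sorted(range(len(train)), key=lambda i: best[i])[:size]


query = {"gist": [0.0], "color": [0.0]}
train = [
    {"gist": [3.0], "color": [0.1]},
    {"gist": [1.0], "color": [2.0]},
    {"gist": [2.0], "color": [3.0]},
]
print(retrieval_set(query, train, 2))
```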

42 In the next two sections, we describe in detail the two contributions of this work: a method for weighting different feature channels and the strategy for improving the retrieval set. [sent-120, score-0.37]

43 Weighted k-NN The baseline k-NN approach uses Euclidean distance to compute the neighbourhood around the point. [sent-122, score-0.275]

44 We propose to use a weighted k-NN method to compute the neighbourhood of a query point. [sent-123, score-0.483]

45 To compute a weighted distance between two superpixels ai and aj, we split the feature vector into three feature channels of gradient orientation, color and location and first compute distances in individual feature spaces: $d^f_{ij} = [d^c_{ij}, d^s_{ij}, d^l_{ij}]^\top$ (7) [sent-124, score-1.004]

46 where $d^c_{ij}$, $d^s_{ij}$ and $d^l_{ij}$ are the Euclidean distances between the color, SIFT and location channels of the feature vectors ai and aj of the two superpixels, respectively. [sent-125, score-0.534]

47 We now define a weighted distance between the two superpixels as $d^w_{ij} = w^\top d^f_{ij}$ (8) [sent-126, score-0.31]

48 (8), we can now obtain the neighbourhood $N_i^k$ around a superpixel by applying it to the feature distance vector $d^f_{ij}$ between ai and aj ∈ G to compute the label likelihood scores in Eq. [sent-130, score-0.941]
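
The per-channel distances and their weighted combination can be illustrated with a short Python sketch. It assumes the weighted distance is the inner product of a three-element weight vector with the channel distance vector, and it assumes a SIFT/color/location ordering inside the 133-dimensional feature; both are assumptions of this example rather than details confirmed by the text.

```python
import math

# Assumed positions of the channels inside the 133-dimensional vector.
SIFT = slice(0, 128)
COLOR = slice(128, 131)
LOCATION = slice(131, 133)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def channel_distances(a, b):
    """Distance vector [color, SIFT, location] between two superpixel features."""
    return [
        euclidean(a[COLOR], b[COLOR]),
        euclidean(a[SIFT], b[SIFT]),
        euclidean(a[LOCATION], b[LOCATION]),
    ]


def weighted_distance(a, b, w):
    """Weighted sum of the three channel distances (w has three entries)."""
    return sum(wi * di for wi, di in zip(w, channel_distances(a, b)))


a = [0.0] * 133
b = [0.0] * 128 + [3.0, 4.0, 0.0] + [1.0, 0.0]
print(channel_distances(a, b), weighted_distance(a, b, [0.5, 0.3, 0.2]))
```

Because each channel is reduced to a single distance first, a high-dimensional channel such as SIFT does not dominate the low-dimensional color and location channels by sheer size.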

49 Weight computation With the varying nature of the retrieval set for individual query images, we use the locally adaptive metric approach of [3] for the weight computation. [sent-133, score-0.709]

50 In our setting, the test points are the individual superpixels of the query image. [sent-135, score-0.501]

51 The goal is to estimate the relevance of a feature channel i by evaluating its ability to predict class posterior probabilities locally at a query point. [sent-136, score-0.522]

52 For the query point x0, the relevance for feature i can be computed by averaging the $r_i(z)$'s in its neighbourhood: $\bar{r}_i(x_0) = \frac{1}{|N(x_0)|} \sum_{z \in N(x_0)} r_i(z)$ [sent-140, score-0.599]

53 where m is the number of individual feature channels (three in our case), c is a parameter which determines the influence of $\bar{r}_i$ (at c = 0, all three feature channels have equal weights) and $R_i(x_0) = \max_{p=1}^{m} \{\bar{r}_p(x_0)\} - \bar{r}_i(x_0)$. [sent-145, score-0.349]
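
A minimal sketch of turning averaged relevance values into channel weights is shown below. The exponential normalisation used here follows the usual form of locally adaptive metric schemes such as [3]; the exact weight formula is not reproduced in this text, so it should be read as an assumption. It does agree with the stated property that c = 0 yields equal weights.

```python
import math


def average_relevance(r_neighbours):
    """Average per-channel values r_i(z) over the neighbours z of a query point.

    `r_neighbours` is a list with one entry per neighbour, each a list of m values.
    """
    m = len(r_neighbours[0])
    n = len(r_neighbours)
    return [sum(r[i] for r in r_neighbours) / n for i in range(m)]


def channel_weights(r_bar, c):
    """Weights from R_i = max_p(r_bar_p) - r_bar_i via an assumed softmax form."""
    r_max = max(r_bar)
    big_r = [r_max - r for r in r_bar]
    exps = [math.exp(c * value) for value in big_r]
    total = sum(exps)
    return [e / total for e in exps]


print(channel_weights([0.2, 0.5, 0.1], 0.0))  # equal weights at c = 0
print(channel_weights([0.2, 0.5, 0.1], 5.0))
```

Under this form, a channel with a smaller averaged value gets a larger R_i and therefore a larger weight, and c controls how sharply the weights concentrate.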

54 Semantic Contextual Retrieval The semantic labelling of an image, even if inaccurate, provides a strong cue about the presence and absence of different categories in the image. [sent-150, score-0.653]

55 While the idea of using context to improve the labelling has been explored in the past for image superpixels [20, 4], here we examine the effectiveness of this idea in the stage of improving the entire retrieval set. [sent-151, score-0.86]

56 In order to do so, we propose a global descriptor derived from the initial labelling of the image which will be used to improve the retrieval set. [sent-152, score-0.707]

57 To summarize the semantic label information of a labeled image, we introduce the semantic label descriptor for a labelled image. [sent-153, score-1.102]

58 Our proposed descriptor helps encode the positional information of each category in the image and can be used for semantic contextual retrieval. [sent-156, score-0.515]

59 ls of the layout more precisely but be more prone to classification errors while a lower value for n would be less sensitive to errors in the labelling but does not encode the spatial position of the semantic categories as well. [sent-167, score-0.731]
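
One way to picture a descriptor with this trade-off is a grid of label histograms. The sketch below divides a label map into n × n cells and concatenates the normalized label histogram of each cell. The exact construction of the descriptor is not given in this text, so the grid-of-histograms form, the function name and the toy label map are all assumptions.

```python
def semantic_label_descriptor(label_map, num_labels, n):
    """Concatenate per-cell normalized label histograms over an n x n grid.

    `label_map` is a list of rows of integer labels in [0, num_labels).
    """
    height, width = len(label_map), len(label_map[0])
    descriptor = []
    for gy in range(n):
        for gx in range(n):
            y0, y1 = gy * height // n, (gy + 1) * height // n
            x0, x1 = gx * width // n, (gx + 1) * width // n
            hist = [0.0] * num_labels
            for y in range(y0, y1):
                for x in range(x0, x1):
                    hist[label_map[y][x]] += 1.0
            total = sum(hist)
            if total > 0:
                hist = [v / total for v in hist]
            descriptor.extend(hist)
    return descriptor


label_map = [[0, 0], [1, 1]]  # label 0 on top, label 1 below
print(semantic_label_descriptor(label_map, 2, 1))
print(semantic_label_descriptor(label_map, 2, 2))
```

With n = 1 the descriptor only records which labels are present and in what proportion; with n = 2 it also records that label 0 sits above label 1, which is more informative but also more sensitive to labelling errors.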

60 This approach of computing a semantic label-based descriptor is similar to [10]. [sent-168, score-0.454]

61 Our method also differs from [4], which computes a superpixel-level semantic context descriptor as a normalized label histogram of neighbouring regions. [sent-171, score-0.734]

62 Semantic Retrieval Set Global image features (GIST, color histograms and spatial pyramid over SIFT) were used to build retrieval set Tg in Section 3. [sent-174, score-0.337]

63 We now use the semantic label descriptor $f_{seman}$ introduced above to help us refine the quality of the retrieval set by exploiting the semantic context. [sent-176, score-1.233]

64 Using the resultant semantic image labelling, we generate its corresponding semantic label descriptor $f^k_{seman}$. [sent-178, score-0.897]

65 Similarly, for the query view Iq, we label it using the WKNN-MRF method and compute the corresponding semantic label descriptor. [sent-179, score-0.821]

66 We generate a new ranking for the images in training set T based on the distance between their semantic label descriptor and that of the query image. [sent-180, score-0.854]

67 The ranking is computed in an ascending order of the semantic label descriptor distances. [sent-181, score-0.636]
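
The re-ranking step can be sketched in a few lines of Python. The text specifies an ascending ranking by descriptor distance; the use of Euclidean distance and the names below are assumptions of this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semantic_retrieval_set(query_desc, train_descs, size):
    """Indices of the `size` training images closest to the query descriptor."""
    order = sorted(
        range(len(train_descs)), key=lambda i: euclidean(query_desc, train_descs[i])
    )
    return order[:size]


query_desc = [1.0, 0.0]
train_descs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(semantic_retrieval_set(query_desc, train_descs, 2))
```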

68 Using the new retrieval set Ts, we once again perform semantic labelling on the image by the process described in Section 3. [sent-184, score-0.888]

69 The WLKNN refers to a weighted k-NN using a retrieval set built using the label descriptor only. [sent-188, score-0.514]

70 We also experiment with using the semantic layout descriptor with all the other three global image features for the building of the retrieval set and denote this method WAKNN-MRF. [sent-189, score-0.768]

71 The evaluation criterion for the methods is the per pixel accuracy (percentage of pixels correctly labelled) and per class accuracy (the average of semantic category accuracies). [sent-192, score-0.369]
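
The two evaluation measures can be written out as a short Python sketch; the function names and the toy labels are illustrative only. Per-class accuracy is averaged over the classes that occur in the ground truth.

```python
def per_pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)


def per_class_accuracy(pred, truth):
    """Mean over ground-truth classes of the per-class pixel accuracy."""
    totals, hits = {}, {}
    for p, t in zip(pred, truth):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if p == t else 0)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


truth = ["sky", "sky", "sky", "car"]
pred = ["sky", "sky", "sky", "sky"]
print(per_pixel_accuracy(pred, truth), per_class_accuracy(pred, truth))
```

In this toy case the per-pixel accuracy is 0.75 while the per-class accuracy is only 0.5, which shows why both numbers are reported: frequent classes dominate the first measure, and rare classes weigh equally in the second.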

72 For Stanford Background and Google Street View datasets, we selected 10% of the training images as the size of our retrieval set. [sent-193, score-0.269]

73 Computation of the feature weights required an average of four minutes for a single query image. [sent-200, score-0.304]

74 To help speed up the computation of the weights, we approximate the neighbourhood construction of [3] through k-d trees [19]. [sent-201, score-0.289]

75 For the query view, we index the individual features from the retrieval set in a k-d tree, constructing one k-d tree per feature channel. [sent-202, score-0.593]

76 The neighbourhood computation is then approximated using the set union of the k-NN from different feature channels. [sent-203, score-0.299]
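
The union-of-neighbours approximation can be sketched as follows. For brevity this example searches each channel by brute force instead of building one k-d tree per channel, and the channel slices and toy data are assumptions.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_union(query, points, channels, k):
    """Union of the k nearest neighbours found independently in each channel.

    `channels` maps a channel name to the slice of the vector it occupies.
    """
    union = set()
    for channel in channels.values():
        nearest = heapq.nsmallest(
            k,
            range(len(points)),
            key=lambda i: euclidean(query[channel], points[i][channel]),
        )
        union.update(nearest)
    return union


channels = {"a": slice(0, 1), "b": slice(1, 2)}
points = [[0.0, 9.0], [9.0, 0.0], [5.0, 5.0]]
print(knn_union([0.0, 0.0], points, channels, 1))
```

The resulting candidate set has at most k times the number of channels entries, so any weighted distance only needs to be evaluated on this small pool rather than on the whole retrieval set.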

77 (11) adaptively changing the nearest neighbours in the weighted neighbourhood space. [sent-205, score-0.373]

78 SiftFlow SiftFlow is a large dataset of 2688 images with 33 semantic categories. [sent-211, score-0.333]

79 When we incorporate semantic context to obtain a refined retrieval set, our system achieves the best performance for both per-pixel and per-class accuracies. [sent-216, score-0.644]

80 The categories which saw an increase of more than 10% after the use of semantic context include field, car, river, plant, sidewalk, bridge, door, crosswalk. [sent-217, score-0.409]

81 These are categories which do not occur very frequently but achieved improved labelling with the context. [sent-218, score-0.32]

82 For example, identifying road and highways helps label cars, sidewalk and crosswalk. [sent-219, score-0.299]

83 We also experimented with replacing the SIFT feature for the superpixel with a HOG feature [2]. [sent-224, score-0.309]

84 The individual HOG cell descriptors were averaged to compute the superpixel feature. [sent-226, score-0.358]

85 SUN09 SUN09 dataset [1] has fully labelled per-pixel ground truth for a set of 107 semantic categories. [sent-229, score-0.428]

86 Using the semantic context helped obtain an improvement of 3. [sent-232, score-0.473]

87 It was observed that the per-pixel labelling accuracy of outdoor scenes was more than 11% better than indoor scenes highlighting the challenge of labelling indoor views. [sent-235, score-0.64]

88 Examples (a)-(c) are instances of semantic context improving the labelling as trees and mountains are predicted in the initial labelling. [sent-251, score-0.772]

89 In comparison to the other methods, our performance was in the top-two for the per-pixel accuracy and for two semantic categories. [sent-261, score-0.333]

90 Stanford-Background This dataset contains 715 images with two separate label sets: semantic and geometric. [sent-262, score-0.443]

91 We conducted our experiments for predicting the semantic category only. [sent-263, score-0.333]

92 The semantic classes include seven background classes and a generic foreground class. [sent-264, score-0.333]

93 The use of semantic context leads to an improvement of only 0. [sent-267, score-0.409]

94 The lack of significant improvement with the use of semantic context here can be explained by the nature of the dataset as more than 90% of the images contain 4 or more of the 8 semantic categories. [sent-269, score-0.742]

95 Conclusions We have presented an approach for nonparametric scene parsing using a k-NN method. [sent-272, score-0.274]

96 A locally adaptive distance metric is learned at query time to compute the relevance of individual feature channels. [sent-274, score-0.645]

97 Using the initial labelling as a contextual cue for presence or absence of objects in the scene, we proposed a semantic context descriptor which helped refine the quality of the retrieval set, which is a key component of nonparametric methods. [sent-275, score-1.42]

98 For future work, we would like to explore better methods for incorporating spatial information at the patch level and also explore learning semantic concepts for scene understanding. [sent-278, score-0.333]

99 Partial similarity based nonparametric scene parsing in certain environment. [sent-484, score-0.274]

100 Supervised label transfer for semantic segmentation of street scenes. [sent-490, score-0.59]

