Author: Subhransu Maji, Gregory Shakhnarovich
Abstract: We study the problem of part discovery when partial correspondence between instances of a category is available. For visual categories that exhibit high diversity in structure such as buildings, our approach can be used to discover parts that are hard to name, but can be easily expressed as a correspondence between pairs of images. Parts naturally emerge from point-wise landmark matches across many instances within a category. We propose a learning framework for automatic discovery of parts in such weakly supervised settings, and show the utility of the rich part library learned in this way for three tasks: object detection, category-specific saliency estimation, and fine-grained image parsing.
1 Part Discovery from Partial Correspondence. Subhransu Maji, Gregory Shakhnarovich. Toyota Technological Institute at Chicago, IL, USA. Abstract: We study the problem of part discovery when partial correspondence between instances of a category is available. [sent-1, score-0.528]
2 For visual categories that exhibit high diversity in structure such as buildings, our approach can be used to discover parts that are hard to name, but can be easily expressed as a correspondence between pairs of images. [sent-2, score-0.555]
3 Parts naturally emerge from point-wise landmark matches across many instances within a category. [sent-3, score-0.394]
4 We propose a learning framework for automatic discovery of parts in such weakly supervised settings, and show the utility of the rich part library learned in this way for three tasks: object detection, category-specific saliency estimation, and fine-grained image parsing. [sent-4, score-1.155]
5 Introduction. Many visual categories have inherent structure: body parts of animals, architectural elements in a building, components of mechanical devices, etc. [sent-6, score-0.412]
6 In this paper, we study the problem of discovering such structure with only a weak form of supervision: partial correspondence between pairs of instances within an object category. [sent-9, score-0.296]
7 The notion of parts is important to computer vision because much recent work on visual recognition relies on the idea of representing a category as a composition of smaller fragments (or parts) arranged in a variety of layouts. [sent-10, score-0.419]
8 The parts act as diagnostic elements for the category; their presence and arrangement provide rich information regarding the presence and location of the object, its pose, size and fine-grained properties, e. [sent-11, score-0.468]
9 , a church building may or may not have a spire, an airplane may have four, two or no visible engines. [sent-20, score-0.294]
10 Furthermore, instances of these parts could differ drastically in their appearance, e. [sent-21, score-0.391]
11 We leverage this ability through a recently introduced annotation paradigm that relies on people marking such correspondences, and propose a novel approach to construction of a library of parts driven by such annotations. [sent-25, score-0.488]
12 Such annotations can enable discovery of parts that are aligned to human-semantics for categories that are otherwise hard to annotate using traditional methods of named keypoints, and part bounding boxes. [sent-26, score-0.85]
13 We show the utility of the rich part library learned in this way for three tasks: object detection, category-specific saliency estimation, and fine-grained image parsing. [sent-27, score-0.685]
14 Example annotations collected on Amazon’s Mechanical Turk (left), which are much more semantic in nature than matches obtained using SIFT descriptors (right). [sent-32, score-0.401]
15 A large library of parts (poselets) is then formed by finding repeatable and detectable configurations of these keypoints. [sent-35, score-0.45]
16 In contrast, in many models the parts are learned automatically. [sent-36, score-0.376]
17 This idea goes back to constellation models [22, 21] where the parts were learned via clustering of patches. [sent-37, score-0.376]
18 In such models parts are learned as a byproduct of optimizing the discriminative objective, involving reasoning about part appearance as well as their joint location relative to the object. [sent-39, score-0.568]
19 A very different approach is taken in [19, 4], where parts are learned and selected in an iterative framework, with the objective to optimize specificity/sensitivity tradeoff. [sent-41, score-0.376]
20 Our work differs in its use of the correspondence annotations, which are used very efficiently via the semantic graph defined in the next section. [sent-42, score-0.387]
21 The idea of using pairwise correspondences as a source for learning parts was introduced in [15], along with an intuitive interface for collecting such correspondences. [sent-44, score-0.448]
22 However, in [15] parts were learned in a rather naïve fashion, and no framework for selecting the parts was proposed, nor was the utility of the learned parts demonstrated on any task. [sent-45, score-1.171]
23 The latter work describes learning correspondence between patches that describe the same element (part) of an urban scene. [sent-50, score-0.234]
24 As we show in Section 5, using generic interest point operators is inferior to using category-specific parts learned using our proposed approach. [sent-52, score-0.421]
25 Finally, a relevant body of work [5, 18, 7] addresses learning a good set of parts or attributes, which are often parts in disguise. [sent-53, score-0.65]
26 The focus there is usually either on unsupervised learning, or on learning nameable parts; our work, in contrast, occupies the middle ground in which we rely on semantic meaning of parts perceived by humans without forcing a potentially contrived nameable nomenclature. [sent-54, score-0.727]
27 In the sections below we describe the procedure for learning a basic library of parts for a category using pairwise correspondences, and then proceed with a description of applying the part library to three tasks: object detection, landmark prediction and fine-grained image parsing. [sent-55, score-1.01]
28 From partial correspondence to parts. In this section we describe the framework for learning a library of parts using the correspondence annotations. [sent-57, score-1.109]
29 1; how these can be used to define a “semantic graph” between images that enables part discovery in Section 2. [sent-59, score-0.249]
30 Obtaining correspondence annotations. Following [15], we obtain correspondence annotations by presenting subjects with pairs of images, and asking them to click on pairs of matching points in the two instances of the category. [sent-64, score-0.91]
31 They were given concise instructions, asking them to annotate “landmarks”, defined as “any interesting feature of a church building”. [sent-66, score-0.348]
32 Then, each person was presented with a sequence of image pairs, each containing a prominent church building. [sent-68, score-0.248]
33 They can click on any number of landmark pairs that they deem corresponding between the two images. [sent-69, score-0.328]
34 Using this interface, we have collected annotations for 1000 pairs among 288 images of church buildings downloaded from Flickr. [sent-70, score-0.603]
35 Landmark pairs, a few examples of which are shown in Figure 2 (left), include a variety of semantic matches: identical structural elements of buildings (windows, spires, corners, and gables), and vaguely defined yet consistent matches, the likes of “the mid-point of roof slope”. [sent-71, score-0.341]
36 Semantic graph of correspondence. Figure 3 illustrates how landmark correspondences between instances can be used to estimate the corresponding bounding boxes of parts in the two images. [sent-75, score-0.983]
37 We estimate the similarity transform (translation and scaling) that maps the landmarks within the box from one image to another. [sent-76, score-0.284]
38 If there are fewer than two landmarks within the box, we set the scale to the relative scale of the two objects (determined by the bounding box of the entire set of landmarks in each image). [sent-77, score-0.647]
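The box transfer described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the least-squares fit, the use of bounding-box diagonals to measure relative object scale, and the choice of translation in the fallback case are all assumptions, and the names `fit_scale_translation` and `transfer_box` are hypothetical.

```python
from math import hypot

def bbox(points):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def fit_scale_translation(pairs):
    """Least-squares scale s and translation t such that q ~ s * p + t."""
    n = len(pairs)
    px = sum(p[0] for p, _ in pairs) / n
    py = sum(p[1] for p, _ in pairs) / n
    qx = sum(q[0] for _, q in pairs) / n
    qy = sum(q[1] for _, q in pairs) / n
    num = sum((p[0] - px) * (q[0] - qx) + (p[1] - py) * (q[1] - qy) for p, q in pairs)
    den = sum((p[0] - px) ** 2 + (p[1] - py) ** 2 for p, _ in pairs)
    if den == 0:  # all source landmarks coincide, scale is undetermined
        return None
    s = num / den
    return s, (qx - s * px, qy - s * py)

def transfer_box(box, src, dst):
    """Map `box` from the source image to the target image.

    `src` and `dst` are equally long lists of matched landmarks (x, y).
    """
    x0, y0, x1, y1 = box
    pairs = [(p, q) for p, q in zip(src, dst)
             if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
    fit = fit_scale_translation(pairs) if len(pairs) >= 2 else None
    if fit is None:
        # Fallback: relative scale of the two objects, measured here
        # (an assumption) by the diagonals of the landmark bounding boxes.
        sb, db = bbox(src), bbox(dst)
        src_diag = hypot(sb[2] - sb[0], sb[3] - sb[1])
        dst_diag = hypot(db[2] - db[0], db[3] - db[1])
        s = dst_diag / src_diag if src_diag > 0 else 1.0
        if pairs:  # a single landmark in the box fixes the translation
            p, q = pairs[0]
        else:      # assumption: align the centers of the landmark sets
            p = ((sb[0] + sb[2]) / 2, (sb[1] + sb[3]) / 2)
            q = ((db[0] + db[2]) / 2, (db[1] + db[3]) / 2)
        fit = (s, (q[0] - s * p[0], q[1] - s * p[1]))
    s, (tx, ty) = fit
    return (s * x0 + tx, s * y0 + ty, s * x1 + tx, s * y1 + ty)
```

With landmarks `[(0, 0), (10, 0), (0, 10)]` matched to `[(5, 5), (25, 5), (5, 25)]`, the box `(0, 0, 10, 10)` maps to approximately `(5, 5, 25, 25)`, i.e. a scale of 2 and a shift of 5 pixels in each direction.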
39 The correspondence can be propagated beyond explicitly clicked landmark pairs using the semantic graph [15]. [sent-78, score-0.729]
40 In this way, we can “trace” a part along a path in the semantic graph from an image in the left column to an image in the right column, even though we do not have an explicit annotation for that pair of images. [sent-80, score-0.399]
41 Figure 4 shows various parts found from the source image by propagating the correspondence in the semantic graph in a breadth-first manner. [sent-81, score-0.749]
42 There are multiple ways to reach the same image by traversing different intermediate images and landmark pairs, and we maintain a set of non-overlapping windows for each image. [sent-82, score-0.367]
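The breadth-first propagation can be sketched as below. This is an illustrative sketch under stated assumptions: the graph is given as an adjacency list whose edges carry a box-transfer function, windows are treated as duplicates when their intersection over union exceeds a hypothetical threshold `dup_iou`, and `max_depth` is a hypothetical cap on path length. None of these names come from the paper.

```python
from collections import deque

def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def propagate(graph, seed_image, seed_box, max_depth=3, dup_iou=0.5):
    """Trace a seed window through the semantic graph in breadth-first order.

    `graph[image]` is a list of (neighbor, transfer) pairs, where
    `transfer(box)` maps a box in `image` to a box in `neighbor`.
    Returns {image: [(box, depth), ...]} with near-duplicate windows dropped.
    """
    windows = {seed_image: [(seed_box, 0)]}
    queue = deque([(seed_image, seed_box, 0)])
    while queue:
        image, box, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor, transfer in graph.get(image, []):
            new_box = transfer(box)
            found = windows.setdefault(neighbor, [])
            # Several paths may reach the same image; keep a window only
            # if it does not overlap one that was found earlier.
            if any(iou(new_box, old) > dup_iou for old, _ in found):
                continue
            found.append((new_box, depth + 1))
            queue.append((neighbor, new_box, depth + 1))
    return windows

def shift(dx, dy):
    """Toy transfer function: a pure translation of the box."""
    return lambda b: (b[0] + dx, b[1] + dy, b[2] + dx, b[3] + dy)

# Toy graph: A and C are never annotated together, but both are paired with B.
toy_graph = {
    "A": [("B", shift(10, 0))],
    "B": [("A", shift(-10, 0)), ("C", shift(0, 5))],
    "C": [("B", shift(0, -5))],
}
toy_windows = propagate(toy_graph, "A", (0, 0, 20, 20))
```

In the toy graph, the window seeded in image A reaches image C at depth 2 through B, and the paths that lead back to A and B are discarded as duplicates of windows already found there.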
43 We sample parts around the clicked landmarks in each image. [sent-88, score-0.608]
44 The landmarks represent parts of the whole that are partially matched across instances. [sent-89, score-0.545]
45 , parts that are matched frequently across each image are likely to be sampled frequently. [sent-94, score-0.362]
46 Correspondence propagation in the semantic graph from the image on the left to the image on the right in each row. [sent-97, score-0.257]
47 Next, we propagate the correspondence from the seed window using breadth-first search in the semantic graph as shown in Figure 4. [sent-104, score-0.524]
48 Since the correspondence is sparse, the estimated location and scale of these initial hypothesized matches is likely noisy. [sent-112, score-0.396]
49 location and scale near the initial estimate obtained using the semantic graph that maximize the response of $w^{(t-1)}$: $(L_i^{(t)}, s_i^{(t)}) = \arg\max_{(L, s) \in N(L_i^{(0)}, s_i^{(0)})}$ ? [sent-118, score-0.267]
50 For each of three parts shown, the top row contains the initial hypothesized matches found using semantic graph (ordered by depth at which they were found). [sent-126, score-0.727]
51 During training we only use the semantic graph edges entirely contained in the training set (church-corr-train), resulting in 617 correspondence pairs, each labelled with an average of five landmarks. [sent-138, score-0.465]
52 The test set (church-corr-test) is used to evaluate the utility of parts for predicting the location of the human-clicked landmarks, a “semantic saliency” prediction task described in Section 5. [sent-139, score-0.502]
53 Since the church-corr dataset contains church buildings that occupy most of the image, we collected an additional set of 127 images where the church building occupies a small portion of the image to test the utility of parts for localizing them (Section 4). [sent-140, score-1.194]
54 For these images we also obtained bounding box annotations, and the set is further divided into a training set of 64 images and a test set of 63 images. [sent-142, score-0.326]
55 We compare various methods of learning parts: (1) Exemplar LDA (random seeds): randomly sampled seeds w/o graph (2) Exemplar LDA (landmark seeds): seeds sampled on landmarks w/o graph (3) Latent LDA: seeds sampled on landmarks w/ graph (4) Discriminative patches [19]. [sent-145, score-1.372]
56 The second simply uses the landmarks to bias the seed sampling step, hopefully resulting in fewer “wasted” seeds. [sent-148, score-0.357]
57 The third (our proposed method) additionally uses the correspondence annotations to find “similar” patches in the training set using the procedure described in Section 2. [sent-149, score-0.417]
58 In comparison to [19], this step is computationally much more efficient since the search for “similar” patches is restricted, using the semantic graph, to a small fraction of the windows in the entire set. [sent-151, score-0.3]
59 We trained a set of 200 parts for various methods on the church-corr-train subset. [sent-152, score-0.398]
60 (Left) Learned HOG filter along with the top 10 locations of each part found using the semantic graph (top row for each part) and the latent search procedure (bottom row for each part) described in Section 2. [sent-155, score-0.429]
61 Detecting church buildings. The parts learned in the previous step can be utilized for localizing objects. [sent-160, score-0.813]
62 Specifically, we use the top 10 detections on the church-loc-train set to estimate the mean offsets in scale and location of the object bounding box relative to the part bounding box. [sent-162, score-0.564]
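One way to realize this offset estimate is sketched below. The parametrization is an assumption: offsets are expressed in units of the part box (center displacement divided by part width and height, plus width and height ratios), which is a common choice but is not stated in the text. The function names are hypothetical.

```python
def box_offset(part, obj):
    """Offset of the object box relative to the part box, in part-box units."""
    pw, ph = part[2] - part[0], part[3] - part[1]
    pcx, pcy = (part[0] + part[2]) / 2, (part[1] + part[3]) / 2
    ow, oh = obj[2] - obj[0], obj[3] - obj[1]
    ocx, ocy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    return ((ocx - pcx) / pw, (ocy - pcy) / ph, ow / pw, oh / ph)

def mean_offset(detections, top_k=10):
    """Average the offsets of the `top_k` highest scoring training detections.

    `detections` is a list of (score, part_box, object_box) triples.
    """
    top = sorted(detections, key=lambda d: d[0], reverse=True)[:top_k]
    offsets = [box_offset(part, obj) for _, part, obj in top]
    return tuple(sum(o[i] for o in offsets) / len(offsets) for i in range(4))

def predict_object_box(part, offset):
    """Apply a learned offset to a new part detection to vote for an object box."""
    dx, dy, sw, sh = offset
    pw, ph = part[2] - part[0], part[3] - part[1]
    cx = (part[0] + part[2]) / 2 + dx * pw
    cy = (part[1] + part[3]) / 2 + dy * ph
    w, h = sw * pw, sh * ph
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Normalizing by the part size makes the learned offset independent of the scale at which the part is later detected, so a part found at twice the size votes for an object box twice as large.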
63 Votes from multiple part detections are combined in a greedy manner. [sent-165, score-0.257]
64 For each image, part detections are sorted by their detection score (after normalizing to [0, 1] using the sigmoid function) and considered one by one to find clusters of parts that belong together (based on the overlap of their predicted bounding boxes being greater than τ=0. [sent-166, score-0.815]
65 Each cluster represents a detection, from which we predict the overall bounding box as the weighted average of the predictions of each member and score as the sum of their detection scores. [sent-169, score-0.244]
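The greedy combination of votes can be sketched as follows. This is a simplified illustration: each vote is compared against the current weighted-average box of a cluster (an assumption, since the text does not say which box is used), and the overlap threshold `tau` is left as a required argument because its value is not recoverable from the text above.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def weighted_box(members):
    """Score-weighted average of the boxes in a cluster of (score, box) pairs."""
    total = sum(score for score, _ in members)
    return tuple(sum(score * box[i] for score, box in members) / total
                 for i in range(4))

def cluster_votes(votes, tau):
    """Greedily cluster part votes into object detections.

    `votes` is a list of (raw_score, predicted_object_box) pairs.
    Returns a list of (box, score) detections, one per cluster.
    """
    scored = sorted(((sigmoid(s), b) for s, b in votes),
                    key=lambda v: v[0], reverse=True)
    clusters = []
    for score, box in scored:
        for members in clusters:
            if iou(box, weighted_box(members)) > tau:
                members.append((score, box))
                break
        else:  # no existing cluster overlaps enough: start a new one
            clusters.append([(score, box)])
    return [(weighted_box(m), sum(s for s, _ in m)) for m in clusters]
```

For example, two votes for boxes `(0, 0, 100, 100)` and `(10, 10, 110, 110)` have an overlap of about 0.68, so with an illustrative `tau` of 0.5 they merge into one detection whose score is the sum of their normalized scores, while a distant vote forms its own detection.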
66 Bounding box predictions that overlap the ground truth bounding box (defined by the intersection over union) greater than τ are considered correct detections, while multiple detections of the same object are considered false positives. [sent-178, score-0.446]
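The evaluation rule above can be sketched as a labeling pass over the detections of one image. This follows the usual detection benchmark convention of processing detections in descending score order and matching each to its best overlapping ground truth box. That convention is an assumption here, and `label_detections` is a hypothetical name.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_detections(detections, gt_boxes, tau):
    """Mark each (score, box) detection as correct (True) or false positive.

    A detection is correct if its best ground truth match has IoU > tau and
    that ground truth box has not been claimed by a higher scoring detection.
    """
    claimed = set()
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        correct = best_j is not None and best_iou > tau and best_j not in claimed
        if correct:
            claimed.add(best_j)
        labels.append((score, correct))
    return labels
```

The resulting list of (score, correct) pairs is what a precision-recall curve and average precision would be computed from.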
67 We compare various methods for training parts individually and as a combination for localizing church buildings on the church-loc-test set. [sent-180, score-0.838]
68 This can be seen in Figure 6 (left) which plots the performance of various parts sorted by the detection AP. [sent-182, score-0.505]
69 Moreover, the performance is better than the parts obtained using [19]. [sent-186, score-0.325]
70 We used the same seeds for the “exemplar LDA” and “latent LDA” parts during training, hence we can compare the performance of each part individually for both these methods. [sent-187, score-0.593]
71 This can be seen in Figure 6 (middle) which plots the performance of the 200 parts individually. [sent-188, score-0.362]
72 We combine the predictions of the top 30 parts using the method described in Section 3 and evaluate it on church-loc-test. [sent-192, score-0.363]
73 Out of the various DPM detectors we found that the single “root only” detector performed the best, hinting that a simple tree model of the parts is inadequate for capturing the variety in part layouts. [sent-207, score-0.559]
74 We believe that better modeling of the part layouts can help with the bounding box prediction task. [sent-212, score-0.327]
75 Figure 8 shows high scoring detections on the church-loc-test set along with the locations of parts shown in different colors. [sent-213, score-0.513]
76 In addition to using the parts as a building block for a detector, we are interested in exploring their role in other scene parsing tasks. [sent-215, score-0.423]
77 Landmark saliency prediction. A landmark saliency map is a function s(x, y) → [0, 1] , ? [sent-218, score-0.78]
78 of a given set of ground truth landmark locations under the saliency map as a measure of its predictive quality. [sent-223, score-0.515]
79 $\mathrm{MAL}(s, S) = m \left( \frac{1}{|S|} \sum_{(x_k, y_k) \in S} s(x_k, y_k) \right)$ (2). According to this definition, the uniform saliency map has MAL = 1, since $s(x, y) = 1/m$, $\forall x, y$. [sent-231, score-0.264]
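As a sanity check on the MAL measure, the sketch below assumes that MAL is m times the average saliency at the ground truth landmark locations, with m taken to be the number of pixels in the map. This reading is an assumption chosen to be consistent with the statement that a uniform map (s = 1/m everywhere) scores exactly 1. The function name `mal` is hypothetical.

```python
def mal(saliency, landmarks):
    """MAL of a saliency map under the assumed definition.

    `saliency` is a list of rows that sums to one over all pixels;
    `landmarks` is a list of integer (x, y) ground truth locations.
    """
    m = sum(len(row) for row in saliency)  # assumed: m = number of pixels
    mean_saliency = sum(saliency[y][x] for x, y in landmarks) / len(landmarks)
    return m * mean_saliency

# A uniform 5 x 4 map: every pixel has saliency 1/m with m = 20.
uniform_map = [[1.0 / 20] * 5 for _ in range(4)]

# A map with all of its mass on the pixel (x=2, y=1).
peaked_map = [[0.0] * 5 for _ in range(4)]
peaked_map[1][2] = 1.0
```

Under this reading, a value above 1 means the landmarks fall on pixels that the map considers more likely than chance, and a map that puts all of its mass on a single landmark pixel reaches the maximum value m for that landmark.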
80 Our saliency detector uses the top 30 parts sorted according to their part detection accuracy on the training set. Given an image, the highest scoring detections above [sent-232, score-0.991]
81 Each detection contributes saliency proportional to the detection score to the center of the detection window. [sent-234, score-0.453]
82 The contributions are accumulated across all detections to obtain the initial saliency map. [sent-235, score-0.417]
83 01d, where d is the length of the image diagonal, and normalized to sum to one, to obtain the final saliency map. [sent-237, score-0.264]
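The construction of the saliency map can be sketched in plain Python as below. Two points are assumptions: the smoothing step is taken to be a Gaussian blur whose standard deviation is a fraction of the image diagonal d, and that fraction is exposed as the hypothetical parameter `sigma_frac` rather than fixed to a value. Detection scores are assumed to be non-negative.

```python
from math import exp, hypot

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(grid, kernel):
    """Convolve every row with the kernel, treating outside pixels as zero."""
    r = len(kernel) // 2
    width = len(grid[0])
    out = []
    for row in grid:
        new_row = []
        for x in range(width):
            acc = 0.0
            for j, k in enumerate(kernel):
                xx = x + j - r
                if 0 <= xx < width:
                    acc += k * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def saliency_map(width, height, detections, sigma_frac):
    """Build a saliency map from (score, box) part detections."""
    grid = [[0.0] * width for _ in range(height)]
    for score, (x0, y0, x1, y1) in detections:
        # Each detection adds its score at the center of its window.
        cx = min(width - 1, max(0, int(round((x0 + x1) / 2))))
        cy = min(height - 1, max(0, int(round((y0 + y1) / 2))))
        grid[cy][cx] += score
    kernel = gaussian_kernel(sigma_frac * hypot(width, height))
    grid = blur_rows(grid, kernel)                        # horizontal pass
    grid = transpose(blur_rows(transpose(grid), kernel))  # vertical pass
    total = sum(sum(row) for row in grid)
    if total == 0:  # no detections: fall back to the uniform map
        return [[1.0 / (width * height)] * width for _ in range(height)]
    return [[v / total for v in row] for row in grid]
```

The blur is applied separably (rows, then columns), which gives the same result as a 2-D Gaussian at lower cost. The final division makes the map sum to one so that it can be scored as a distribution over pixels.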
84 Our approach can be seen as “category-specific interest points”, and we compare this approach to a baseline that uses standard unsupervised scale-space interest point detectors based on Differences of Gaussians (DoG) and the Itti and Koch saliency model [12]. [sent-239, score-0.393]
85 According to our saliency maps, the landmarks are 6. [sent-241, score-0.484]
86 ... 2× more ... Itti and Koch. “Latent LDA” parts outperform both the “exemplar LDA” ... [sent-245, score-0.325]
87 Figure 9 shows example saliency maps for a few images for a variety of methods. [sent-247, score-0.312]
88 As one might expect, our part-based saliency tends to be sharply localized near doors, windows, and towers. [sent-248, score-0.297]
89 Fine-grained image parsing. Beyond the standard classification and detection tasks, the rich library of correspondence-driven parts allows us to reason about fine-grained structure of visual categories. [sent-253, score-0.612]
90 For instance, we can attach semantic meaning to a set of parts at almost no cost by simply showing a human a few high-scoring detections. [sent-254, score-0.47]
91 If the parts appear to correspond to a coherent visual concept with a name, say, “window” or “tower”, the name for the concept is recorded. [sent-255, score-0.376]
92 (Middle) Comparison of parts using “latent LDA” and “exemplar LDA” using the same seeds. [sent-258, score-0.325]
93 These semantic labels can be visualized on new images by pooling the part detections across models that correspond to the same label. [sent-268, score-0.435]
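The pooling step can be sketched as a simple grouping of detections by the name attached to each part. The data layout (part identifiers, a `part_labels` dictionary, and (part_id, score, box) triples) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def pool_by_label(detections, part_labels):
    """Group part detections by the semantic label of the detecting part.

    `detections` is a list of (part_id, score, box) triples and
    `part_labels` maps a part_id to a name such as "window" or "tower".
    Parts without a recorded name are skipped.
    """
    pooled = defaultdict(list)
    for part_id, score, box in detections:
        label = part_labels.get(part_id)
        if label is None:
            continue
        pooled[label].append((score, box))
    # Within each label, list the strongest detections first.
    return {label: sorted(items, key=lambda it: it[0], reverse=True)
            for label, items in pooled.items()}
```

Because several part models may share one name, pooling lets a single label such as "window" be supported by detectors that respond to visually different windows.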
94 Conclusions and discussion. We have described a method for semi-supervised discovery of semantically meaningful parts from pairwise correspondence annotations: pairs of landmarks in images that are deemed matching. [sent-272, score-0.949]
95 A library of parts can be discovered from such annotations by a discriminative algorithm that learns an appearance model for each part. [sent-273, score-0.635]
96 On a category of church buildings, these parts are useful in a variety of ways: as building blocks for a part-based object detector, as category-specific interest point operators, and as a tool for fine-grained visual parsing for applications such as retrieval by attributes. [sent-274, score-0.81]
97 To exploit the rich part library discovered with the proposed framework for detection and segmentation, one likely needs an appropriate layout model connecting many parts into a coherent category model, beyond the simplistic star-graph model used in our experiments. [sent-275, score-0.71]
99 Figure: fine-grained parsing results with part labels such as arch, window, window on tower, and tower; labels obtained by pooling the corresponding part detections on images. [sent-279, score-0.257]
100 From left to right – images shown with the landmarks; saliency maps from our parts, Difference of Gaussian (DoG) interest point operator, and the Itti and Koch model. [sent-281, score-0.346]
