nips nips2012 nips2012-357 knowledge-graph by maker-knowledge-mining

357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition


Source: pdf

Author: Shulin Yang, Liefeng Bo, Jue Wang, Linda G. Shapiro

Abstract: Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of animals and plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared across different categories, and the differences are in the details of object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the co-occurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared across different categories, and the differences are in the details of object parts. [sent-9, score-0.386]

2 We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. [sent-10, score-0.38]

3 We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the cooccurrence relation of the shape patterns. [sent-11, score-1.039]

4 Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms. [sent-13, score-0.79]

5 Cognitive research has suggested that basic-level recognition is based on comparing the shape of the objects and their parts, whereas subordinate-level recognition is based on comparing appearance details of certain object parts [1]. [sent-19, score-0.595]

6 For basic-level recognition tasks, spatial pyramid matching [2] is a popular choice that aligns object parts by partitioning the whole image into multiple-level spatial cells. [sent-21, score-0.828]

7 However, spatial pyramid matching may not be the best choice for fine-grained object recognition, since falsely aligned object parts can lead to inaccurate comparisons, as shown in Figure 1. [sent-22, score-0.693]

8 Our key observation is that in a fine-grained task, different object categories share commonality in their shape or structure, and the alignment of object parts can be greatly improved by discovering such commonality. Figure 1: Region alignment by spatial pyramid matching and our approach. [sent-24, score-0.814]

9 Spatial pyramid matching partitions the whole image into regions, without considering visual appearance. [sent-25, score-0.296]

10 Our approach aims to align the image regions containing the same object parts (red squares). [sent-29, score-0.486]

11 This motivates us to decompose a fine-grained object recognition problem into two sub-problems: 1) aligning image regions that contain the same object part and 2) extracting image features within the aligned image regions. [sent-33, score-1.031]

12 To this end, we propose a template model to align object parts. [sent-34, score-0.85]

13 In our model, a template represents a shape pattern, and the relationship between two shape patterns is captured by the relationship between templates, which reflects the probability of their co-occurrence in the same image. [sent-35, score-0.839]

14 This model is learned using an alternating algorithm, which iterates between detecting aligned image regions and updating the template model. [sent-36, score-0.918]

15 Kernel descriptor features [3, 4] are then extracted from image regions aligned by the learned templates. [sent-37, score-0.369]

16 Our experimental results suggest that the proposed template model is capable of detecting image regions that correspond to meaningful object parts, and our template-based algorithm outperforms the state-of-the-art fine-grained object recognition algorithms in terms of accuracy. [sent-39, score-1.321]

17 In [9], a random forest is proposed for fine-grained object recognition that uses different depths of the tree to capture dense spatial information. [sent-42, score-0.394]

18 We discuss the framework of our template-based object recognition, describe our template model, and propose an alternating algorithm for learning the model parameters. [sent-49, score-1.475]

19 Template-Based Fine-Grained Object Recognition: Over the past few decades, computer vision researchers have done a great deal of work designing effective and efficient patch-level features for object recognition [16, 17, 18, 19, 20, 21, 22]. [sent-51, score-0.332]

20 In the training stage, a template model is learned from training images using Algorithm 1. [sent-53, score-0.772]

21 In the recognition stage, the learned templates are applied to each test image, resulting in aligned image regions. [sent-54, score-0.899]

22 SIFT is one of the most successful features, allowing an image or object to be represented as a bag of SIFT features [16]. [sent-57, score-0.341]

23 This is even more important for a fine-grained recognition task since common features can be shared by instances from both the same and different object classes. [sent-62, score-0.332]

24 Here, we use a template model to find correctly-aligned regions from different images, so that comparisons between them are more meaningful. [sent-65, score-0.744]

25 A template represents one type of common shape pattern of an object part, while an object part can be represented by several different templates. [sent-66, score-1.051]

26 Certain shape patterns of two object parts (for instance, a head facing the left and a tail pointing to the right) can frequently be observed in the same image. [sent-67, score-0.375]

27 Our template model is designed to capture both the properties of individual templates and the relationships among them. [sent-68, score-1.207]

28 Once the templates and their relationships are learned, image regions containing the fine-grained differences can be aligned based on these quantities. [sent-73, score-0.593]

29 The framework of our template-based fine-grained object recognition is illustrated in Figure 2. [sent-74, score-0.947]

30 In the recognition stage (from left to right in Figure 2), aligned image regions are extracted from each image using our template detection algorithm. [sent-76, score-1.302]

31 Template Model: We start by defining a template model that represents the common shape patterns of object parts and their relationships. [sent-81, score-1.016]

32 A template is an entity that contains features that will match image features for region detection. [sent-82, score-0.887]

33 Let $M = \{T, W\}$ be a model that contains a group of templates $T = \{T_1, T_2, \ldots, T_K\}$ and the relation parameters $W = \{w_{ij}\}$ between them. [sent-83, score-0.528]

34 When wij = 0, the two templates Ti and Tj have no co-occurrence relationship. [sent-90, score-0.623]

35 When a template model is matched to a given image, not all templates within the model are necessarily used. [sent-91, score-1.187]

36 This is because different templates can be associated with the same object part, but one part occurs at most once in an image. [sent-92, score-0.685]

37 Our model captures this intuition by making inactive the templates that do not match an image well. [sent-93, score-0.606]

38 Fitness: We define a matching score $s_f(T_i, x_i^I)$ to measure the similarity between a template $T_i$ and an image region at location $x_i^I$ in image $I$: $s_f(T_i, x_i^I) = 1 - \|T_i - R(x_i^I)\|^2$ subject to $|x_i^I - \bar{x}_i| \le \alpha$ (1). [sent-95, score-1.217]

39 Here $R(x_i^I)$ represents the features of the sub-image in $I$ centered at location $x_i^I$, $\bar{x}_i$ is an initial location associated with the template $T_i$, and $\alpha$ is an upper bound on the location variation. [sent-97, score-0.823]

40 We first run the Berkeley edge detector [25] to compute the edge map of an image, and then treat it as a grayscale image and extract color kernel descriptors [3] over it. [sent-102, score-0.318]

41 Summing up the matching scores $s_f(T_i, x_i^I)$ for all templates that are used for image $I$, we obtain a fitness term $S_f(T, X^I, V^I) = \sum_{i=1}^{K} v_i^I \, s_f(T_i, x_i^I)$ (2). [sent-104, score-0.958]

42 Here $V^I = \{v_1^I, \ldots, v_K^I\}$ represents the selected template subset for image $I$; $v_i^I = 1$ means that template $T_i$ is used for image $I$. [sent-107, score-1.67]

43 $X^I = \{x_1^I, \ldots, x_K^I\}$ represents the locations of all templates on image $I$. [sent-110, score-0.708]

44 The more templates that are used, the higher the score is. [sent-111, score-0.562]
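
A minimal sketch of how the fitness terms in equations (1)-(2) could be computed for one image. The feature extractor `extract_region` (standing in for R) and the variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def matching_score(template, region_features):
    """s_f(T_i, x_i^I) = 1 - ||T_i - R(x_i^I)||^2."""
    diff = np.asarray(template) - np.asarray(region_features)
    return 1.0 - float(np.sum(diff ** 2))

def fitness_term(templates, locations, selected, extract_region):
    """S_f(T, X^I, V^I) = sum_i v_i^I * s_f(T_i, x_i^I) for a single image.

    templates      : list of K template feature vectors
    locations      : list of K candidate locations on the image
    selected       : list of K booleans (the indicators v_i^I)
    extract_region : callable location -> feature vector, i.e. R(x)
    """
    score = 0.0
    for T_i, x_i, v_i in zip(templates, locations, selected):
        if v_i:  # only active templates contribute
            score += matching_score(T_i, extract_region(x_i))
    return score
```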

45 Co-occurrence: Observing that certain shape patterns of two or more object parts frequently coexist in the same image, we want templates that have a high chance of co-occurring to be selected together. [sent-112, score-0.905]

46 For a given image, the co-occurrence term encourages selecting together two templates that have a large relation parameter wij. [sent-113, score-0.666]

47 Meanwhile, an L1 penalty term is used to ensure sparsity of the template relations. [sent-114, score-0.697]

48 In particular, their locations should not be too close to each other, because we want the learned templates to be diverse, so that they can cover a large range of image shape patterns. [sent-118, score-0.821]

49 $V = \{V^1, V^2, \ldots, V^{|D|}\}$ are the template indicators and $X = \{X^1, X^2, \ldots, X^{|D|}\}$ are the template locations. [sent-125, score-0.659]

50 $|D|$ is the number of images in the set $D$. [sent-128, score-0.737]

51 The templates and their relations are learned by maximizing the score function S(T, W, X , V, D) on an image collection D. [sent-129, score-0.766]
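
The exact forms of the co-occurrence and diversity terms (equations 3-5) are not reproduced in this summary, so the sketch below only mirrors the structure described in the text: a fitness term, a pairwise co-occurrence reward for jointly selected templates, and a diversity penalty when selected templates sit too close together. The constants `beta` and `min_dist` and the functional forms are assumptions for illustration.

```python
import numpy as np

def image_score(templates, W, locations, selected, extract_region,
                beta=1.0, min_dist=20.0):
    """Illustrative per-image score: fitness + co-occurrence - diversity penalty."""
    K = len(templates)
    score = 0.0
    for i in range(K):
        if selected[i]:
            diff = np.asarray(templates[i]) - np.asarray(extract_region(locations[i]))
            score += 1.0 - float(np.sum(diff ** 2))      # fitness, as in Eqs. (1)-(2)
    for i in range(K):
        for j in range(i + 1, K):
            if selected[i] and selected[j]:
                score += W[i, j]                          # reward frequently co-occurring templates
                d = np.linalg.norm(np.asarray(locations[i]) - np.asarray(locations[j]))
                if d < min_dist:
                    score -= beta                         # keep selected templates spatially diverse
    return score
```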

52 Algorithm 1 (template learning): initialize the templates $\{T_1, \ldots, T_K\}$ with training data; initialize $w_{ij} = 0$ and iter = 0. For iter < maxiter do: update $X^I, V^I$ for all $I \in D$ based on equation (6); update $T$ by $T_i = \sum_{I \in D} v_i^I R(x_i^I) / \sum_{I \in D} v_i^I$ (as in (8)); update $W$ to optimize (9); if $\sum_i |\Delta T_i| < \epsilon$ then break; iter ← iter + 1; end for. [sent-134, score-0.398]
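
A schematic of this alternating loop in plain Python, assuming helper routines `detect` (template detection for one image, returning the indicators, locations, and extracted regions) and `update_relations` (the relation update for W). Both helpers are placeholders; this is a sketch of the alternating structure, not the released code.

```python
import numpy as np

def learn_template_model(templates, images, detect, update_relations,
                         max_iter=50, eps=1e-3):
    """templates: list of K numpy feature vectors; images: training image collection D."""
    K = len(templates)
    W = np.zeros((K, K))                                   # initialize w_ij = 0
    for _ in range(max_iter):
        # template detection: (V^I, X^I, regions R(x_i^I)) for every image I
        detections = [detect(templates, W, img) for img in images]
        new_templates = []
        for i in range(K):
            used = [regions[i] for (V, X, regions) in detections if V[i]]
            # closed-form update: T_i = sum_I v_i^I R(x_i^I) / sum_I v_i^I
            new_templates.append(np.mean(used, axis=0) if used else templates[i])
        W = update_relations(detections, W)                # template relation learning
        change = sum(float(np.sum(np.abs(np.asarray(n) - np.asarray(o))))
                     for n, o in zip(new_templates, templates))
        templates = new_templates
        if change < eps:                                   # stop once templates stabilise
            break
    return templates, W
```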

53 Template detection: Given a template model {T, W}, the goal of template detection is to find the template subset V and their locations X for all images to maximize equation (5). [sent-137, score-2.148]

54 Fixing the locations of all previously selected templates, the next template and its location can be chosen in a similar manner. [sent-140, score-0.759]
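
One way to read this greedy detection step, sketched under assumptions: templates are added one at a time, each placed at the admissible location that most increases the current image score, until no remaining template improves it. `candidate_locations` and `score_fn` are placeholders (for example, `image_score` above).

```python
def greedy_detect(templates, W, candidate_locations, score_fn):
    """candidate_locations[i]: admissible positions for template i (within alpha of
    its anchor); score_fn(selected, locations): per-image score to maximize."""
    K = len(templates)
    selected = [False] * K
    locations = [None] * K
    while True:
        base = score_fn(selected, locations)
        best = None                                        # (gain, template index, location)
        for i in range(K):
            if selected[i]:
                continue
            for loc in candidate_locations[i]:
                trial_sel = selected.copy(); trial_sel[i] = True
                trial_loc = locations.copy(); trial_loc[i] = loc
                gain = score_fn(trial_sel, trial_loc) - base
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, loc)
        if best is None:                                   # no template improves the score
            break
        _, i, loc = best
        selected[i], locations[i] = True, loc
    return selected, locations
```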

55 Template feature learning: The goal of template feature learning is to optimize the templates T given the relation parameters W and current template detection results V, X . [sent-142, score-1.942]

56 The algorithm starts by initializing K templates with various sizes and initial locations that are evenly spaced in an image. [sent-153, score-0.568]

57 In each iteration, template detection, template feature learning, and template relation learning are alternated. [sent-154, score-2.02]

58 The iteration continues until the total change of the templates $\{T_i\}_{i=1}^{K}$ is smaller than a threshold $\epsilon$. [sent-155, score-0.659]

59 Our experiments suggest that the proposed template model is able to detect meaningful parts and outperforms previous work in terms of accuracy. [sent-158, score-0.777]

60 Features and Settings: We use kernel descriptors (KDES) to capture low-level image statistics: color, shape, and texture [3]. [sent-160, score-0.358]

61 Color and normalized color kernel descriptors are extracted over RGB images, and gradient and shape kernel descriptors are extracted over grayscale images transformed from the original RGB images. [sent-162, score-0.427]

62 For template relation learning, we use a publicly available L1 regularization solver. [sent-164, score-0.702]
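
The L1-regularized problem (the paper's equation 9) and the solver it uses are not reproduced in this summary. As a loosely analogous stand-in, the sketch below soft-thresholds the empirical co-occurrence frequencies of the detected templates, which yields a sparse, symmetric W; this is an assumption for illustration, not the paper's update.

```python
import numpy as np

def update_relations_soft_threshold(detections, lam=0.1):
    """detections: list of (V^I, X^I, regions) tuples, one per training image."""
    K = len(detections[0][0])
    counts = np.zeros((K, K))
    for V, _, _ in detections:
        v = np.asarray(V, dtype=float)
        counts += np.outer(v, v)                 # how often templates i and j are selected together
    freq = counts / max(len(detections), 1)
    np.fill_diagonal(freq, 0.0)
    return np.maximum(freq - lam, 0.0)           # soft-threshold: small weights shrink to zero
```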

63 To learn the template model, we use 34 templates with different sizes. [sent-166, score-1.187]

64 The template size is measured by its ratio to the original image size, such as 1/2 or 1/3. [sent-167, score-0.799]

65 Our model has 9 templates with size 1/2 and 25 with size 1/3. [sent-168, score-0.528]

66 The initial locations of templates with each template size are evenly spaced grid points in an image. [sent-169, score-1.227]
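
A sketch of this initialization, assuming the 9 templates of size 1/2 sit on a 3x3 grid and the 25 templates of size 1/3 on a 5x5 grid; the grid shapes are inferred from the stated template counts, not given explicitly in the text.

```python
import numpy as np

def initial_grid_locations(image_shape, size_ratio, grid):
    """Evenly spaced (row, col) anchor centres for templates of a given size ratio."""
    H, W = image_shape
    th, tw = int(H * size_ratio), int(W * size_ratio)
    rows = np.linspace(th / 2, H - th / 2, grid)
    cols = np.linspace(tw / 2, W - tw / 2, grid)
    return [(float(r), float(c)) for r in rows for c in cols]

# Example: 9 anchors for size-1/2 templates plus 25 anchors for size-1/3 templates.
anchors = (initial_grid_locations((300, 300), 1 / 2, 3) +
           initial_grid_locations((300, 300), 1 / 3, 5))
```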

67 Table 1: The table on the left shows the classification accuracies (%) obtained by templates with different sizes and numbers on a subset of the full dataset. [sent-182, score-0.528]

68 The accuracy improves as the number of templates increases at first, and becomes saturated when enough templates are used. [sent-183, score-1.207]

69 With the best template number choices, combinations of templates with different sizes are tested. [sent-184, score-1.187]

70 The combination of 9 templates with size 1/2 and 25 templates with size 1/3 performs best (selected using the training set). [sent-186, score-1.056]

71 Notice that there is a slight difference between template detection in the learning phase and in the recognition phase. [sent-231, score-0.843]

72 In the learning phase, only a subset of templates are detected for each image. [sent-232, score-0.577]

73 This is because not all templates can be observed in all images, and each image usually contains only a subset of all possible templates. [sent-233, score-0.668]

74 But in the recognition phase, all templates are selected for detection in order to avoid missing features. [sent-234, score-0.732]

75 Bird Recognition: Caltech-UCSD Bird-200 [8] is a commonly used dataset for evaluating fine-grained object recognition algorithms. [sent-236, score-0.307]

76 The feature in each template consists of a vector of real numbers. [sent-241, score-0.659]

77 As can be seen, the learned templates successfully find the meaningful parts of birds, though the appearances of these parts are very different. [sent-242, score-0.779]

78 For example, the head parts detected by T1 have quite different colors and textures, suggesting the robustness of the proposed template model. [sent-243, score-0.824]

79 If λ = 0, there is no penalty on the relation parameters W, thus all weights wij are set to 1 when the template model is learned. [sent-245, score-0.817]

80 In both these cases, the template models are equivalent to a simplified model without the co-occurrence term in (3). [sent-247, score-0.659]

81 Template size and number choices: We tested the effect of the number and size of the templates on the recognition accuracy. [sent-250, score-0.659]

82 When the template size is 1, the accuracy is the same for any template number, because template detection will return the same results. [sent-252, score-2.05]

83 For templates whose size is smaller than 1, the results obtained with different numbers of templates are shown in Table 1 (left). [sent-253, score-1.056]

84 Based on these results, we selected a template number for each template size for further experiments: one template with size 1, 9 templates with size 1/2, 25 templates with size 1/3, and 25 templates with size 1/4. [sent-254, score-3.581]

85 The results obtained by the combinations of templates with different sizes (each with its optimal template number) on the full dataset are shown in Table 1 (right). [sent-255, score-1.206]

86 Our template model is compared to the recently proposed fine-grained recognition algorithms. [sent-257, score-0.79]

87 We give the results of the proposed template model with two types of templates: edge templates and texture templates. [sent-267, score-1.24]

88 This result uses the combination of 9 templates with size 1/2 and 25 templates with size 1/3. [sent-272, score-1.056]

89 Our further experiments suggest that adding more templates only slightly improves the recognition accuracy. [sent-273, score-0.659]

90 In the test stage, it takes 3 ∼ 5 seconds to process each image, including template detection, feature extraction and classification. [sent-276, score-0.659]
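
A rough sketch of the test-time pipeline mentioned here (template detection, feature extraction from the aligned regions, classification). The detector, feature extractor, and sklearn-style classifier are placeholders, and concatenating the region features into one vector is an assumption about how the aligned features are combined, not a detail stated in the summary.

```python
import numpy as np

def recognize(image, templates, W, detect_all, extract_region_features, classifier):
    """Classify one test image from features of its template-aligned regions."""
    # In the recognition phase all templates are applied, to avoid missing features.
    locations = detect_all(templates, W, image)
    feats = [extract_region_features(image, loc) for loc in locations]
    x = np.concatenate(feats)                    # one long descriptor per image
    return classifier.predict(x.reshape(1, -1))[0]
```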

91 We observe that KDES with spatial pyramid works well on this dataset, and the proposed template model works even better. [sent-280, score-0.835]

92 This accuracy is comparable with the recently proposed pose pooling approach [12] where labeled parts are used to train and test models; this is not required for our template model. [sent-283, score-0.828]

93 For the dog datasets, we also tried using the local binary pattern KDES to learn templates instead of the edge KDES, due to the relatively consistent textures in dog images. [sent-291, score-0.672]

94 Our experiments show that template learning with the edge KDES works better than with the local binary pattern KDES, suggesting that edge information is a stable cue for learning templates. [sent-292, score-0.719]

95 Notice that the accuracy achieved by our template model is 16 percent higher than the best published results so far. [sent-293, score-0.679]

96 Conclusion: We have proposed a template model for fine-grained object recognition. [sent-294, score-0.816]

97 The template model learns a group of templates by jointly considering fitness, co-occurrence and diversity between the templates and images, and the learned templates are used to align image regions that contain the same object parts. [sent-295, score-2.696]

98 Our experiments show that the proposed template model has achieved higher accuracy than the state-of-the-art fine-grained object recognition algorithms on the two standard benchmarks: Caltech-UCSD Bird-200 and Stanford Dogs. [sent-296, score-0.967]

99 In the future, we plan to learn features that are suitable for detecting object parts and to incorporate geometric information into the template relationships. [sent-297, score-0.958]

100 Linear spatial pyramid matching using sparse coding for image classification. [sent-383, score-0.375]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('template', 0.659), ('templates', 0.528), ('kdes', 0.164), ('object', 0.157), ('image', 0.14), ('recognition', 0.131), ('ti', 0.12), ('parts', 0.098), ('wij', 0.095), ('pyramid', 0.09), ('spatial', 0.086), ('sf', 0.082), ('images', 0.078), ('shape', 0.078), ('descriptors', 0.065), ('aligned', 0.065), ('bird', 0.058), ('regions', 0.057), ('cvpr', 0.053), ('detection', 0.053), ('xi', 0.052), ('vi', 0.052), ('tness', 0.05), ('detected', 0.049), ('birds', 0.047), ('dog', 0.046), ('dogs', 0.045), ('features', 0.044), ('iter', 0.043), ('species', 0.043), ('relation', 0.043), ('wah', 0.041), ('location', 0.04), ('locations', 0.04), ('seattle', 0.04), ('matching', 0.04), ('learned', 0.035), ('align', 0.034), ('categories', 0.034), ('score', 0.034), ('branson', 0.034), ('wa', 0.033), ('kernel', 0.032), ('perona', 0.031), ('negrained', 0.031), ('percents', 0.031), ('shapiro', 0.031), ('diversity', 0.03), ('edge', 0.03), ('stage', 0.029), ('ponce', 0.029), ('maximizing', 0.029), ('pooling', 0.029), ('comparisons', 0.028), ('extracted', 0.028), ('vj', 0.028), ('belongie', 0.028), ('yao', 0.028), ('emk', 0.027), ('farrell', 0.027), ('subordinate', 0.027), ('maxiter', 0.027), ('schroff', 0.027), ('recognizing', 0.026), ('alignment', 0.026), ('visual', 0.026), ('stanford', 0.026), ('patterns', 0.024), ('bo', 0.024), ('poselets', 0.024), ('texture', 0.023), ('malik', 0.022), ('khosla', 0.022), ('welinder', 0.022), ('categorization', 0.022), ('attributes', 0.022), ('humans', 0.022), ('pose', 0.022), ('commonality', 0.022), ('llc', 0.022), ('boureau', 0.022), ('textures', 0.022), ('sift', 0.021), ('color', 0.021), ('acc', 0.021), ('penalty', 0.02), ('capture', 0.02), ('cross', 0.02), ('meaningful', 0.02), ('accuracy', 0.02), ('selected', 0.02), ('coding', 0.019), ('rgb', 0.019), ('dataset', 0.019), ('updating', 0.019), ('labs', 0.018), ('darrell', 0.018), ('yang', 0.018), ('sparsity', 0.018), ('head', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

Author: Shulin Yang, Liefeng Bo, Jue Wang, Linda G. Shapiro

Abstract: Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of animals and plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared across different categories, and the differences are in the details of object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the co-occurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.

2 0.14321464 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

3 0.12552498 344 nips-2012-Timely Object Recognition

Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell

Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1

4 0.11743027 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D [1] and 3D object detection [2].

5 0.10828745 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.

6 0.10664184 201 nips-2012-Localizing 3D cuboids in single-view images

7 0.10621398 168 nips-2012-Kernel Latent SVM for Visual Recognition

8 0.10493435 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

9 0.093641721 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

10 0.090606719 210 nips-2012-Memorability of Image Regions

11 0.083984397 303 nips-2012-Searching for objects driven by context

12 0.0829935 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

13 0.080163933 185 nips-2012-Learning about Canonical Views from Internet Image Collections

14 0.078476898 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

15 0.077240244 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

16 0.075573236 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

17 0.075379297 193 nips-2012-Learning to Align from Scratch

18 0.074168682 62 nips-2012-Burn-in, bias, and the rationality of anchoring

19 0.074168682 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

20 0.068533115 81 nips-2012-Context-Sensitive Decision Forests for Object Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.137), (1, 0.028), (2, -0.191), (3, -0.019), (4, 0.126), (5, -0.076), (6, 0.007), (7, -0.075), (8, 0.022), (9, -0.022), (10, -0.038), (11, 0.043), (12, 0.087), (13, -0.113), (14, 0.023), (15, 0.136), (16, 0.013), (17, -0.085), (18, -0.054), (19, -0.058), (20, 0.064), (21, -0.078), (22, 0.001), (23, -0.004), (24, -0.044), (25, 0.038), (26, 0.008), (27, 0.032), (28, -0.031), (29, 0.039), (30, 0.006), (31, -0.041), (32, 0.018), (33, -0.041), (34, 0.025), (35, 0.051), (36, -0.004), (37, 0.006), (38, 0.041), (39, -0.027), (40, -0.048), (41, -0.008), (42, -0.006), (43, 0.027), (44, 0.063), (45, -0.05), (46, -0.004), (47, -0.026), (48, -0.012), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96324897 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

Author: Shulin Yang, Liefeng Bo, Jue Wang, Linda G. Shapiro

Abstract: Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of animals and plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared across different categories, and the differences are in the details of object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the co-occurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.

2 0.84113747 201 nips-2012-Localizing 3D cuboids in single-view images

Author: Jianxiong Xiao, Bryan Russell, Antonio Torralba

Abstract: In this paper we seek to detect rectangular cuboids and localize their corners in uncalibrated single-view images depicting everyday scenes. In contrast to recent approaches that rely on detecting vanishing points of the scene and grouping line segments to form cuboids, we build a discriminative parts-based detector that models the appearance of the cuboid corners and internal edges while enforcing consistency to a 3D cuboid model. Our model copes with different 3D viewpoints and aspect ratios and is able to detect cuboids across many different object categories. We introduce a database of images with cuboid annotations that spans a variety of indoor and outdoor scenes and show qualitative and quantitative results on our collected database. Our model out-performs baseline detectors that use 2D constraints alone on the task of localizing cuboid corners. 1

3 0.78959745 210 nips-2012-Memorability of Image Regions

Author: Aditya Khosla, Jianxiong Xiao, Antonio Torralba, Aude Oliva

Abstract: While long term human visual memory can store a remarkable amount of visual information, it tends to degrade over time. Recent works have shown that image memorability is an intrinsic property of an image that can be reliably estimated using state-of-the-art image features and machine learning algorithms. However, the class of features and image information that is forgotten has not been explored yet. In this work, we propose a probabilistic framework that models how and which local regions from an image may be forgotten using a data-driven approach that combines local and global images features. The model automatically discovers memorability maps of individual images without any human annotation. We incorporate multiple image region attributes in our algorithm, leading to improved memorability prediction of images as compared to previous works. 1

4 0.78022498 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D [1] and 3D object detection [2]. 1

5 0.77471727 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

6 0.74780971 185 nips-2012-Learning about Canonical Views from Internet Image Collections

7 0.7158736 8 nips-2012-A Generative Model for Parts-based Object Segmentation

8 0.71119392 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

9 0.67945397 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

10 0.65655339 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

11 0.63875955 344 nips-2012-Timely Object Recognition

12 0.63029867 303 nips-2012-Searching for objects driven by context

13 0.61546868 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

14 0.6071552 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

15 0.60236639 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

16 0.59344876 146 nips-2012-Graphical Gaussian Vector for Image Categorization

17 0.58649462 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

18 0.58500385 202 nips-2012-Locally Uniform Comparison Image Descriptor

19 0.55685496 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

20 0.55193698 168 nips-2012-Kernel Latent SVM for Visual Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.034), (21, 0.043), (38, 0.074), (39, 0.022), (42, 0.02), (44, 0.022), (54, 0.029), (55, 0.031), (74, 0.171), (76, 0.136), (80, 0.047), (92, 0.052), (94, 0.197)]
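
The LDA weights above are given sparsely as (topicId, topicWeight) pairs, so a natural first step before any comparison is to expand them into a dense vector. The sketch below does that and then applies the same cosine comparison used in the LSI sketch; the 100-topic dimensionality and the second paper's weights are assumptions made purely for illustration.

import numpy as np

def to_dense(sparse_pairs, num_topics=100):
    # Expand [(topic_id, weight), ...] into a dense weight vector of length num_topics.
    vec = np.zeros(num_topics)
    for topic_id, weight in sparse_pairs:
        vec[topic_id] = weight
    return vec

# Sparse LDA weights for this paper, copied from the listing above.
this_paper = to_dense([(0, 0.034), (21, 0.043), (38, 0.074), (39, 0.022), (42, 0.02),
                       (44, 0.022), (54, 0.029), (55, 0.031), (74, 0.171), (76, 0.136),
                       (80, 0.047), (92, 0.052), (94, 0.197)])

# Invented sparse weights for a second, hypothetical paper.
other_paper = to_dense([(38, 0.060), (74, 0.150), (94, 0.180)])

# The same cosine comparison used in the LSI sketch applies to the densified vectors.
sim = float(this_paper @ other_paper /
            (np.linalg.norm(this_paper) * np.linalg.norm(other_paper)))
print(f"cosine similarity: {sim:.8f}")

Because the sparse listing only stores non-zero topics, densifying first keeps the comparison code identical across the lsi and lda sections.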

similar papers list:

simIndex simValue paperId paperTitle

1 0.86507523 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation

Author: Christian Mayr, Paul Stärke, Johannes Partzsch, Love Cederstroem, Rene Schüffny, Yao Shuai, Nan Du, Heidemarie Schmidt

Abstract: Memristive devices have recently been proposed as efficient implementations of plastic synapses in neuromorphic systems. The plasticity in these memristive devices, i.e. their resistance change, is defined by the applied waveforms. This behavior resembles biological synapses, whose plasticity is also triggered by mechanisms that are determined by local waveforms. However, learning in memristive devices has so far been approached mostly on a pragmatic technological level. The focus seems to be on finding any waveform that achieves spike-timing-dependent plasticity (STDP), without regard to the biological veracity of said waveforms or to further important forms of plasticity. Bridging this gap, we make use of a plasticity model driven by neuron waveforms that explains a large number of experimental observations and adapt it to the characteristics of the recently introduced BiFeO3 memristive material. Based on this approach, we show STDP for the first time for this material, with learning window replication superior to previous memristor-based STDP implementations. We also demonstrate in measurements that it is possible to overlay short and long term plasticity at a memristive device in the form of the well-known triplet plasticity. To the best of our knowledge, this is the first implementation of triplet plasticity on any physical memristive device. 1

same-paper 2 0.86501396 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

Author: Shulin Yang, Liefeng Bo, Jue Wang, Linda G. Shapiro

Abstract: Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of animals and plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape and structure shared across different categories, and the differences are in the details of object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the cooccurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms. 1

3 0.75820184 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

Author: Angela Eigenstetter, Bjorn Ommer

Abstract: Category-level object detection has a crucial need for informative object representations. This demand has led to feature descriptors of ever increasing dimensionality like co-occurrence statistics and self-similarity. In this paper we propose a new object representation based on curvature self-similarity that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second order statistics, ours also exhibits a high dimensionality. Although improving discriminability, the high dimensionality becomes a critical issue due to lack of generalization ability and curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for SVMs that reduces complexity and improves generalization capability of object models. By successfully integrating the proposed curvature self-similarity representation together with the embedded feature selection in a widely used state-of-the-art object detection framework we show the general pertinence of the approach. 1

4 0.75701988 202 nips-2012-Locally Uniform Comparison Image Descriptor

Author: Andrew Ziegler, Eric Christiansen, David Kriegman, Serge J. Belongie

Abstract: Keypoint matching between pairs of images using popular descriptors like SIFT or a faster variant called SURF is at the heart of many computer vision algorithms including recognition, mosaicing, and structure from motion. However, SIFT and SURF do not perform well for real-time or mobile applications. As an alternative very fast binary descriptors like BRIEF and related methods use pairwise comparisons of pixel intensities in an image patch. We present an analysis of BRIEF and related approaches revealing that they are hashing schemes on the ordinal correlation metric Kendall’s tau. Here, we introduce Locally Uniform Comparison Image Descriptor (LUCID), a simple description method based on linear time permutation distances between the ordering of RGB values of two image patches. LUCID is computable in linear time with respect to the number of pixels and does not require floating point computation. 1

5 0.74785411 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease

Author: Jonathan Huang, Daniel Alexander

Abstract: Accurate and detailed models of neurodegenerative disease progression are crucially important for reliable early diagnosis and the determination of effective treatments. We introduce the ALPACA (Alzheimer’s disease Probabilistic Cascades) model, a generative model linking latent Alzheimer’s progression dynamics to observable biomarker data. In contrast with previous works which model disease progression as a fixed event ordering, we explicitly model the variability over such orderings among patients which is more realistic, particularly for highly detailed progression models. We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer’s patients from the Alzheimer’s Disease Neuroimaging Initiative. 1

6 0.7477445 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.7465353 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

8 0.74384713 40 nips-2012-Analyzing 3D Objects in Cluttered Images

9 0.73775762 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

10 0.72677881 201 nips-2012-Localizing 3D cuboids in single-view images

11 0.72391135 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

12 0.71969944 185 nips-2012-Learning about Canonical Views from Internet Image Collections

13 0.71856999 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

14 0.71819961 210 nips-2012-Memorability of Image Regions

15 0.70998132 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

16 0.70739192 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

17 0.70579672 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

18 0.69583285 8 nips-2012-A Generative Model for Parts-based Object Segmentation

19 0.6929394 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

20 0.69218725 303 nips-2012-Searching for objects driven by context