nips nips2012 nips2012-137 knowledge-graph by maker-knowledge-mining

137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects


Source: pdf

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. [sent-7, score-0.845]

2 Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods. [sent-11, score-0.362]

3 1 Introduction Mesh segmentation methods decompose a three-dimensional (3D) mesh, or a collection of aligned meshes, into their constituent parts. [sent-12, score-0.242]

4 This well-studied problem has numerous applications in computational graphics and vision, including texture mapping, skeleton extraction, morphing, and mesh registration and simplification. [sent-13, score-0.648]

5 We focus in particular on the problem of segmenting an articulated object, given aligned 3D meshes capturing various object poses. [sent-14, score-0.636]

6 The meshes we consider are complete surfaces described by a set of triangular faces, and we seek a segmentation into spatially coherent parts whose spatial transformations capture object articulations. [sent-15, score-1.117]

7 Applied to various poses of human bodies as in Figure 1, our approach identifies regions of the mesh that deform together, and thus provides information which could inform applications such as the design of protective clothing. [sent-16, score-0.838]

8 Mesh segmentation has been most widely studied as a static clustering problem, where a single mesh is segmented into “semantic” parts using low-level geometric cues such as distance and curvature [1, 2]. [sent-17, score-1.045]

9 While supervised training data can sometimes lead to improved results [3], there are many applications where such data is unavailable, and the proper way to partition a single mesh is inherently ambiguous. [sent-18, score-0.573]

10 By searching for parts which deform consistently across many meshes, we create a better-posed problem whose solution is directly useful for modeling objects in motion. [sent-19, score-0.259]

11 First, the number of parts comprising an articulated object is unknown a priori, and must be inferred from the observed deformations. [sent-21, score-0.428]

12 Second, mesh faces exhibit strong spatial correlations, and the inferred parts must be contiguous. [sent-22, score-0.986]

13 This spatial connectivity is needed to discover parts which correspond with physical object structure, and required by target applications such as skeleton extraction. [sent-23, score-0.319]

14 A segmentation of the human body should take into account this range of variability in the population. Figure 1: Human body segmentation. [sent-26, score-0.309]

15 Left: Reference poses for two female bodies, and those bodies captured in five other poses. [sent-27, score-0.207]

16 Right: A manual segmentation used to align these meshes [6], and the segmentation inferred by our ddCRP model from 56 poses. [sent-28, score-0.757]

17 The ddCRP segmentation discovers parts whose motion is nearly rigid, and includes small parts such as elbows and knees absent from the manual segmentation. [sent-29, score-0.691]

18 To our knowledge, no previous methods for segmenting meshes combine information about deformation from multiple bodies to address this corpus segmentation problem. [sent-31, score-0.753]

19 We adapt the distance dependent Chinese restaurant process (ddCRP) [4] to model spatial dependencies among mesh triangles, and enforce spatial contiguity of the inferred parts [5]. [sent-33, score-1.025]

20 Unlike most previous mesh segmentation methods, our Bayesian nonparametric approach allows data-driven inference of an appropriate number of parts, and uses an affine transformation-based likelihood to accommodate object instances of varying shape. [sent-34, score-0.789]

21 After we develop our model in Section 2, Section 3 develops a Gibbs sampler which efficiently marginalizes the latent affine transformations defining part deformation. [sent-35, score-0.22]

22 We conclude in Section 4 with results examining meshes of humans and other articulated objects, where we introduce a metric for quantitative evaluation of deformation-based segmentations. [sent-36, score-0.509]

23 For some input mesh j, we let yjn ∈ R3 denote the 3D location of the center of triangular face n, and Yj = [yj1, . . . , yjN]. [sent-38, score-0.693]

24 Each mesh j has an associated N -triangle reference mesh, indexed by bj . [sent-42, score-0.675]

25 We let xbn ∈ R4 denote the location of triangle n in reference mesh b, expressed in homogeneous coordinates (xbn (4) = 1). [sent-43, score-0.798]

26 In our later experiments, Yj encodes the 3D mesh for a person in pose j, and Xbj is the reference pose for the same individual. [sent-48, score-0.871]

27 We estimate aligned correspondences between the triangular faces of the input pose meshes Yj , and the reference meshes Xb , using a recently developed method [6]. [sent-49, score-1.17]

28 This approach robustly handles 3D data capturing varying shapes and poses, and outputs meshes which have equal numbers of faces in one-to-one alignment. [sent-50, score-0.499]

29 Our segmentation model does not depend on the details of this alignment method, and could be applied to data produced by other correspondence algorithms. [sent-51, score-0.235]

30 By placing prior probability mass on partitions with arbitrary numbers of parts, it allows data-driven inference of the true number of mostly-rigid parts underlying the observed data. [sent-54, score-0.276]

31 In addition, by choosing an appropriate distance function we can encourage spatially adjacent triangles to lie in the same part, and guarantee that all inferred parts are spatially contiguous [5]. [sent-55, score-0.554]

32 The Chinese restaurant process (CRP) is a distribution on all possible partitions of a set of objects (in our case, mesh triangles). [sent-56, score-0.673]

33 Figure 2: Left: A reference mesh in which links (yellow arrows) currently define three parts (connected components). [sent-60, score-0.939]

34 Although described sequentially, the CRP induces an exchangeable distribution on partitions, for which the segmentation probability is invariant to the order in which triangle allocations are sampled. [sent-63, score-0.236]

35 This is inappropriate for mesh data, in which nearby triangles are far more likely to lie in the same part. [sent-64, score-0.645]

36 We define the distance between two triangles as the minimal number of hops, between adjacent faces, required to reach one triangle from the other. [sent-70, score-0.195]

37 However, it does guarantee that only spatially contiguous parts have non-zero probability under the prior. [sent-73, score-0.314]
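To make the construction concrete, here is a minimal Python sketch of the prior, assuming a face-adjacency list and a hard window f(d) = 1 if d ≤ a; the helper names, window choice, and data layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from collections import deque

def hop_distance(neighbors, src, max_hops):
    """BFS distance (face-to-face hops) from triangle src, up to max_hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist  # triangles farther than max_hops are simply absent

def sample_ddcrp_links(neighbors, alpha, a, rng):
    """Draw customer links c_n ~ ddCRP(alpha, f, D) with window radius a."""
    N = len(neighbors)
    links = np.empty(N, dtype=int)
    for n in range(N):
        cands = list(hop_distance(neighbors, n, a))
        # f(d) = 1 within the window; the self-link gets weight alpha
        weights = np.array([alpha if m == n else 1.0 for m in cands])
        links[n] = cands[rng.choice(len(cands), p=weights / weights.sum())]
    return links

def parts_from_links(links):
    """Parts z(c) are connected components of the undirected link graph."""
    parent = list(range(len(links)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for n, c in enumerate(links):
        parent[find(n)] = find(c)
    roots = [find(n) for n in range(len(links))]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([ids[r] for r in roots])
```

Because each customer links only to a face within a hops, following links can never leave a connected region of the mesh, which is why the components of the link graph are guaranteed to be spatially contiguous.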

38 2 Modeling Part Deformation via Affine Transformations Articulated object deformation is naturally described via the spatial transformations of its constituent parts. [sent-76, score-0.309]

39 We expect the triangular faces within a part to deform according to a coherent part-specific transformation, up to independent face-specific noise. [sent-77, score-0.285]

40 We concisely denote the transformation from a reference triangle to an observed triangle via a matrix A ∈ R3×4 . [sent-79, score-0.283]

41 The fourth column of A encodes translation of the corresponding reference triangle via homogeneous coordinates xbn , and the other entries encode rotation, scaling, and shearing. [sent-80, score-0.225]

42 Our construction allows transformations to be analytically marginalized when learning our part-based segmentation, but retains the flexibility to later estimate transformations if desired. [sent-83, score-0.236]

43 Applied to mesh data, these parameters have physical interpretations and can be estimated from the data collection process. [sent-88, score-0.573]

44 Allocating a different affine transformation for the motion of each part in each pose (Figure 2), the overall generative model can be summarized as follows: [sent-90, score-0.219]

45 For each triangle n, sample an associated link cn ∼ ddCRP (α, f, D). [sent-91, score-0.216]

46 For each pose j of each part k, sample an affine transformation Ajk and residual noise covariance Σjk from the matrix normal-inverse-Wishart prior of Equation (2). [sent-97, score-0.216]

47 Given these pose-specific affine transformations and assignments of mesh faces to parts, independently sample the observed location of each pose triangle relative to its corresponding reference triangle, y_jn ∼ N(A_{j z_n} x_{b_j n}, Σ_{j z_n}). [sent-99, score-1.166]
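A hedged sketch of generative steps 2–3, using SciPy's matrix-normal and inverse-Wishart distributions; the hyperparameter names (M, K_col, S0, n0) stand in for the prior of Equation (2), which is not reproduced on this page, and the data layout is assumed:

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def sample_part_pose(X_part, M, K_col, S0, n0, seed=None):
    """Sample (A_jk, Sigma_jk) from an MNIW prior, then the observed faces.

    X_part: (4, Nk) homogeneous reference coordinates of part k's faces.
    """
    rng = np.random.default_rng(seed)
    Sigma = invwishart.rvs(df=n0, scale=S0, random_state=seed)     # (3, 3)
    A = matrix_normal.rvs(mean=M, rowcov=Sigma, colcov=K_col,
                          random_state=seed)                       # (3, 4)
    noise = rng.multivariate_normal(np.zeros(3), Sigma,
                                    size=X_part.shape[1]).T        # (3, Nk)
    Y_part = A @ X_part + noise  # y_jn ~ N(A_{j z_n} x_{b_j n}, Sigma_jk)
    return A, Sigma, Y_part
```

Coupling the row covariance of A to the residual covariance Σ is what makes the matrix normal-inverse-Wishart prior conjugate, so the transformations can later be marginalized in closed form.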

48 Note that Σjk governs the degree of non-rigid deformation of part k in pose j. [sent-100, score-0.205]

49 There is a single reference mesh Xb for each object instance b, and Yj captures a single deformed pose of Xbj . [sent-112, score-0.817]

50 3 Previous Work Previous work has also sought to segment a mesh into parts based on observed articulations [8, 12, 13, 14]. [sent-114, score-0.813]

51 Several other segmentation procedures [12, 14] lack coherent probabilistic models, and thus have difficulty quantifying uncertainty and determining appropriate segmentation resolutions. [sent-117, score-0.365]

52 Anguelov et al. [8] define a global probabilistic model, and use the EM algorithm to jointly estimate parts and their transformations. [sent-119, score-0.214]

53 They explicitly model spatial dependencies among mesh faces, but their Markov random field cannot ensure that parts are spatially connected; a separate connected components process is required. [sent-120, score-0.923]

54 Ambitious recent work has considered a model for joint mesh alignment and segmentation [9]. [sent-122, score-0.778]

55 However, this approach suffers from many of the issues noted above: the number of parts must be specified a priori, parts may not be contiguous, and their EM inference appears prone to local optima. [sent-123, score-0.428]

56 3 Inference We seek the constituent parts of an articulated model, given observed data (X, Y, and b). [sent-124, score-0.383]

57 These parts are characterized by the posterior distribution of the customer links c. [sent-125, score-0.308]

58 Here, z(c) is the clustering into parts defined by the customer links c in Equation (4). [sent-127, score-0.337]

59 Instead of explicitly sampling from Equation (4), a more efficient sampler [4] can be derived by observing that different realizations of the link cn only make a small change to the partition structure. [sent-130, score-0.208]

60 Sampling new realizations c_n^(new) of the link will give rise to new partitions z(c−n ∪ c_n^(new)), which may either be identical to z(c−n) or contain one less part, due to a merge of two existing parts. [sent-132, score-0.289]
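A sketch of the resulting collapsed Gibbs update for a single link, reusing parts_from_links from the earlier sketch; log_ml is a stand-in for the log marginal likelihood of a part's faces with the affine transformations integrated out, which is assumed rather than derived here:

```python
import numpy as np

def resample_link(n, links, cands, alpha, log_ml, rng):
    links[n] = n                      # sever the old link (may split a part)
    base = parts_from_links(links)    # partition z(c_{-n})
    logp = np.zeros(len(cands))
    for i, m in enumerate(cands):     # candidate links within the window
        if m == n:
            logp[i] = np.log(alpha)   # self-link: partition unchanged
        elif base[m] != base[n]:      # linking to m merges two parts
            a = np.flatnonzero(base == base[n])
            b = np.flatnonzero(base == base[m])
            logp[i] = log_ml(np.concatenate([a, b])) - log_ml(a) - log_ml(b)
        # else: m already in n's part, so the likelihood ratio is one
    p = np.exp(logp - logp.max())
    links[n] = cands[rng.choice(len(cands), p=p / p.sum())]
    return links
```

Only merge candidates require new likelihood evaluations, which is what makes this update cheap relative to scoring full partitions from scratch.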

61 Note that if the mesh segmentation c is the only quantity of interest, the analytically marginalized affine transformations Ajk need not be directly estimated. [sent-135, score-0.863]

62 Because “ground truth” parts are unavailable for the real body pose datasets of primary interest, we propose an alternative evaluation metric based on the prediction of held-out object poses, and show that the mesh-ddcrp performs favorably against competing approaches. [sent-139, score-0.404]

63 For quantitative tests, we employ 12 meshes of each of six different female subjects [15] (Figure 4). [sent-141, score-0.435]

64 For each subject, a mesh in a canonical pose is chosen as the reference mesh (Figure 1). [sent-142, score-1.346]

65 1 Hyperparameter Specification and MCMC Learning The hyperparameters that regularize our mesh-ddcrp prior have intuitive interpretations, and can be specified based on properties of the mesh data under consideration. [sent-145, score-0.603]

66 The mean affine transformation M is set to the identity transformation, because on average we expect mesh faces to undergo small deformations. [sent-150, score-0.773]

67 The expected part variance S0 captures the degree of non-rigidity which we expect parts to demonstrate, as well as noise from the mesh alignment process. [sent-152, score-0.855]

68 The correspondence error in our human meshes is approximately 0. [sent-153, score-0.417]

69 Our settings make this nearly the identity for most components, but the translation components of A have variance which is an order of magnitude larger, so that the expected scale of the translation parameters matches that of the mesh coordinates. [sent-159, score-0.573]
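For illustration, the hyperparameters described above might be assembled as follows; the identity mean is from the text, while the specific variances are placeholders chosen only to show the order-of-magnitude gap between the linear block and the translation column:

```python
import numpy as np

M = np.hstack([np.eye(3), np.zeros((3, 1))])  # mean map: identity, no shift
K_col = np.diag([0.01, 0.01, 0.01, 0.1])      # looser translation column
```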

70 The first is a modified agglomerative clustering technique [16] which enforces spatial contiguity of the faces within each part. [sent-164, score-0.256]

71 Adjacent parts on the mesh are then merged based on the squared error in describing their motion by affine transformations. [sent-166, score-0.82]

72 Only adjacent parts are considered in these merge steps, so that parts remain spatially connected. [sent-167, score-0.533]

73 Our second baseline is based on a publicly available implementation of spectral clustering methods [17], a popular approach which has been previously used for mesh segmentation [18]. [sent-168, score-0.799]

74 The affinity between two mesh faces u, v is defined as C_uv = exp{−(σ_uv + m_uv)/S²}, where m_uv = (1/J) Σ_j δ_uvj, δ_uvj is the Euclidean distance between u and v in pose j, σ_uv = sqrt((1/J) Σ_j (δ_uvj − δ̄_uv)²) is the corresponding standard deviation, and S = (1/M) Σ_{u,v} sqrt(σ_uv + m_uv) over all M pairs of faces u, v. [sent-170, score-1.099]
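A best-effort NumPy reconstruction of this affinity (the extracted equation is ambiguous, so the exact normalization may differ from the paper); coords is assumed to hold face centers for N faces in J poses:

```python
import numpy as np

def affinity_matrix(coords):
    """coords: (J, N, 3) array of face centers across J poses."""
    d = np.linalg.norm(coords[:, :, None, :] - coords[:, None, :, :],
                       axis=-1)            # delta_uvj for every pose j
    m = d.mean(axis=0)                     # m_uv: mean pairwise distance
    sigma = d.std(axis=0)                  # sigma_uv: its std over poses
    S = np.sqrt(sigma + m).mean()          # global scale over all pairs
    return np.exp(-(sigma + m) / S ** 2)   # C_uv
```

Intuitively, pairs of faces whose distance stays stable across poses (small σ_uv) behave rigidly and receive high affinity, so spectral clustering groups them into the same part.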

75 For the agglomerative and spectral clustering approaches, the number of parts must be externally specified; we experimented with K = 5, 10, 15, 20, 25, 30 parts. [sent-171, score-0.328]

76 We also consider a Bayesian nonparametric baseline which replaces the ddCRP prior over mesh partitions with a standard CRP prior. [sent-172, score-0.635]

77 The resulting mesh-crp model may estimate the number of parts, but doesn’t model mesh structure or enforce part contiguity. [sent-173, score-0.608]

78 The expected number of parts under the CRP prior is roughly α log N ; we set α = 2 so that the expected number of mesh-crp parts is similar to the number of parts discovered by the mesh-ddcrp. [sent-174, score-0.672]
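A quick check of this choice of α (the face count is illustrative):

```python
import numpy as np

# Under a CRP prior the expected number of parts grows as alpha * log(N).
N = 30000                    # roughly the number of mesh faces
alpha = 2.0
print(alpha * np.log(N))     # ~20.6 expected parts
```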

79 These meshes contain about 31,000 and 38,000 triangular faces, respectively. [sent-179, score-0.437]

80 Figure 3 displays the segmentations of the Tosca meshes inferred by mesh-ddcrp. [sent-180, score-0.578]

81 The inferred parts largely correspond to groups of mesh faces which undergo similar transformations. [sent-181, score-0.971]

82 Figure 4 displays the results produced by the ddCRP, as well as our baseline methods, on the human mesh data. [sent-182, score-0.681]

83 Note that in addition to capturing the head and limbs, the segmentation successfully segregates distinctly moving small regions such as knees, elbows, shoulders, biceps, and triceps. [sent-184, score-0.198]

84 In all, the mesh-ddcrp detects 20 distinctly moving parts for one half of the body. [sent-185, score-0.24]

85 We now introduce a quantitative measure of segmentation quality: segmentations are evaluated by their ability to explain the articulations of test meshes with novel shapes and poses. [sent-186, score-0.702]

86 Given a collection of T test meshes Yt with corresponding reference meshes Xbt, and a candidate segmentation into K parts, we compute E = (1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} ||Y_tk − A*_tk X_{b_t k}||². (12) [sent-187, score-1.026]

87 Figure 3: Segmentations produced by mesh-ddcrp on synthetic Tosca meshes [20]. [sent-188, score-0.406]

88 The first mesh in each row displays the chosen reference mesh. [sent-189, score-0.712]

89 Note that Equation (12) is trivially zero for a degenerate solution wherein each mesh face is assigned to its own part. [sent-192, score-0.573]
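A sketch of this evaluation, assuming homogeneous reference coordinates X (4 × N) and observed test coordinates Y (3 × N), with A*_tk fit per part by least squares; the names and data layout are assumptions:

```python
import numpy as np

def lstsq_affine(X, Y):
    """Solve min_A ||Y - A X||^2 for A in R^{3x4}."""
    return np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

def deformation_error(test_pairs, labels):
    """test_pairs: list of (X_bt, Y_t) meshes; labels: part id per face."""
    E = 0.0
    for X, Y in test_pairs:
        for k in range(labels.max() + 1):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:
                continue
            A = lstsq_affine(X[:, idx], Y[:, idx])
            E += np.sum((Y[:, idx] - A @ X[:, idx]) ** 2)
    return E / len(test_pairs)
```

Fitting the best affine map per part on the held-out pose means a segmentation scores well only if its parts genuinely move rigidly, which avoids rewarding the degenerate one-face-per-part solution noted above only because that solution changes K.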

90 Mesh-ddcrp is significantly better than all other methods, including for settings of K which allocate 50% more parts to competing approaches, according to a Wilcoxon signed-rank test (5% significance level). [sent-197, score-0.214]

91 We selected an illustrative articulated pose for each of the two training subjects in addition to their respective reference poses (Figure 4). [sent-199, score-0.459]

92 The meshes were then segmented both independently for the two subjects and jointly, sharing information across subjects. [sent-201, score-0.461]

93 However, sharing information among subjects results in parts which correspond well with physical human bodies. [sent-203, score-0.309]

94 Note that with only two articulated poses, we are able to generate meaningful segmentations in about an hour of computation. [sent-204, score-0.261]

95 This data-limited scenario also demonstrates the benefits of the ddCRP prior: as shown in Figure 5, the parts extracted by mesh-crp are “patchy”, spatially disconnected, and physically implausible. [sent-205, score-0.286]

96 5 Discussion Adapting the ddCRP to collections of 3D meshes, we have developed an effective approach for discovering an unknown number of parts underlying articulated object motion. [sent-206, score-0.391]

97 Unlike previous methods, our model guarantees that parts are spatially connected, and uses transformations to model instances with potentially varying body shapes. [sent-207, score-0.452]

98 Experiments with dozens of real human body poses provide strong quantitative evidence that our approach produces state-of-the-art segmentations with many potential applications. [sent-210, score-0.338]

99 Each row displays the reference pose, an illustrative articulated pose, mesh-crp and meshddcrp segmentations produced by independently segmenting the pair of poses of each individual, and mesh-crp and mesh-ddcrp segmentations produced by jointly segmenting the chosen poses from both subjects. [sent-227, score-0.876]

100 Group-valued regularization framework for motion segmentation of dynamic non-rigid shapes. [sent-306, score-0.205]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mesh', 0.573), ('meshes', 0.376), ('ddcrp', 0.25), ('parts', 0.214), ('segmentation', 0.172), ('xbj', 0.162), ('sxx', 0.147), ('articulated', 0.133), ('segmentations', 0.128), ('faces', 0.123), ('af', 0.12), ('transformations', 0.118), ('yjk', 0.118), ('cn', 0.117), ('ajk', 0.112), ('syx', 0.103), ('reference', 0.102), ('pose', 0.098), ('poses', 0.095), ('jk', 0.092), ('bodies', 0.084), ('spatially', 0.072), ('deformation', 0.072), ('crp', 0.072), ('triangles', 0.072), ('restaurant', 0.068), ('triangle', 0.064), ('triangular', 0.061), ('xbn', 0.059), ('yjn', 0.059), ('graphics', 0.053), ('transformation', 0.053), ('sy', 0.051), ('links', 0.05), ('uv', 0.049), ('segmenting', 0.049), ('body', 0.048), ('bronstein', 0.048), ('chinese', 0.047), ('deform', 0.045), ('object', 0.044), ('nk', 0.044), ('iw', 0.044), ('tosca', 0.044), ('uvj', 0.044), ('customer', 0.044), ('xb', 0.043), ('human', 0.041), ('spatial', 0.039), ('inferred', 0.037), ('displays', 0.037), ('xbt', 0.036), ('constituent', 0.036), ('agglomerative', 0.036), ('link', 0.035), ('part', 0.035), ('aligned', 0.034), ('marginalizes', 0.034), ('adjacent', 0.033), ('alignment', 0.033), ('motion', 0.033), ('sampler', 0.033), ('partitions', 0.032), ('subjects', 0.031), ('segmented', 0.031), ('produced', 0.03), ('prior', 0.03), ('ajzn', 0.029), ('contiguity', 0.029), ('dmn', 0.029), ('elbows', 0.029), ('hirshberg', 0.029), ('jzn', 0.029), ('knees', 0.029), ('muv', 0.029), ('rachlin', 0.029), ('rosman', 0.029), ('xbjn', 0.029), ('clustering', 0.029), ('female', 0.028), ('contiguous', 0.028), ('yj', 0.027), ('distance', 0.026), ('anguelov', 0.026), ('distinctly', 0.026), ('articulations', 0.026), ('meters', 0.026), ('dozens', 0.026), ('ytk', 0.026), ('customers', 0.026), ('connected', 0.025), ('spectral', 0.025), ('undergo', 0.024), ('externally', 0.024), ('realizations', 0.023), ('sharing', 0.023), ('skeleton', 0.022), ('ron', 0.021), ('coherent', 0.021), ('quantitatively', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.

2 0.10215084 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a stateof-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. 1 Train labels Train images Test image Appearance model Joint Model Shape model Parsing Figure 1: Overview. Using annotated images separate models of shape and appearance are trained. 
Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5. 1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd }, d = 1...n of such images X, each consisting of P pixels {xi }, i = 1...P , we wish to infer a segmentation S for the image. S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli ), P l = 0...L, sli 2 {0, 1} and l sli = 1. Note that the background is also treated as a ‘part’ (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1 , h2 }), and defines a P Boltzmann distribution over segmentations p(S) = h1 ,h2 exp{ E(S, h1 , h2 |✓s )}/Z(✓s ) where X X X X X 1 2 E(S, h1 , h2 |✓s ) = bli sli + wlij sli h1 + c 1 h1 + wjk h1 h2 + c2 h2 , (1) j j j j k k k i,l j i,j,l j,k k where j and k range over the first and second layer hidden variables, and ✓s = {W 1 , W 2 , b, c1 , c2 } are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2 . 
These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. 2 h2 1 1 h S S (a) Mean h S (b) MRF h2 h2 h1 S S (c) DBM b (d) SBM (e) 2D SBM Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1]. k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 ⇡ l=0 l=1 l=2 Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color. Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means {µw } and covariances {⌃w } of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when ‘coloring in’ the part. The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable al ). Given al , the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means {µw } and covariances {⌃w } and mixing proportions { lkw }. The prior on A = {al } specifies the probability ⇡lk of appearance class k being chosen for part l. Therefore appearance parameters ✓a = {⇡lk , lkw } (see Fig. 3) and: a p(xi |A, si , ✓ ) = p(A|✓a ) = Y l Y l a sli p(xi |al , ✓ ) p(al |✓a ) = = Y Y X YY l l k w lkw N (xi |µw , ⌃w ) !alk !sli (⇡lk )alk . , (2) (3) k Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model’s active parameters ✓ include shape parameters ✓s and appearance parameters ✓a , so that p(X, A, S, H|✓) = Y 1 p(A|✓a )p(S, H|✓s ) p(xi |A, si , ✓a ) , Z( ) i (4) where the parameter adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of ✓ that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S|Xtest , ✓). Note that Z( ) is constant throughout the execution of the algorithms. We set via trial and error in our experiments. 3 n H ✓a si al H xi L+1 ✓s S X A P Figure 4: A model of shape and appearance. Left: The joint model. Pixels xi are modeled via appearance variables al . The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables si assign each pixel to a layer. Right: Schematic for an image X. 2 Inference and learning Inference: We approximate p(A, S, H|X, ✓) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S|X, ✓) can then be obtained by considering only the samples for S (see Algorithm 1). 
In order to sample p(A|S, H, X, ✓) we consider the conditional distribution of appearance class k being chosen for part l which is given by: Q P ·s ⇡lk i ( w lkw N (xi |µw , ⌃w )) li h Q P i. p(alk = 1|S, X, ✓) = P (5) K ·sli r=1 ⇡lr i( w lrw N (xi |µw , ⌃w )) Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are: X X 1 2 p(h1 = 1|s, h2 , ✓) = ( wlij sli + wjk h2 + c1 ), (6) j k j i,l p(h2 k 1 = 1|h , ✓) = ( X k 2 wjk h1 j + c2 ), j (7) j where (y) = 1/(1 + exp( y)) is the sigmoid function. To sample from p(H|S, X, ✓) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h1 and h2 . Finally, we draw samples for the pixels in p(S|A, H, X, ✓) independently: P 1 exp( j wlij h1 + bli ) p(xi |A, sli = 1, ✓) j p(sli = 1|A, H, X, ✓) = PL . (8) P 1 1 m=1 exp( j wmij hj + bmi ) p(xi |A, smi = 1, ✓) Seeding: Since the latent-space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S(1) to a different value. The ‘best’ inference is retained and the others are discarded. The computation of the likelihood p(X|✓) of image X is intractable, so we approximate the quality of each inference using a scoring function: 1X Score(X|✓) = p(X, A(t) , S(t) , H(t) |✓), (9) T t where {A(t) , S(t) , H(t) }, t = 1...T are the samples obtained from the posterior p(A, S, H|X, ✓). If the samples were drawn from the prior p(A, S, H|✓) the scoring function would be an unbiased estimator of p(X|✓), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue). Learning: Learning of the model involves maximizing the log likelihood log p(D|✓a , ✓s ) of the training dataset D with respect to the model parameters ✓a and ✓s . Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep 4 Algorithm 1 MCMC inference algorithm. 1: procedure I NFER(X, ✓) 2: Initialize S(1) , H(1) 3: for t 2 : chain length do 4: A(t) ⇠ p(A|S(t 1) , H(t 1) , X, ✓) 5: S(t) ⇠ p(S|A(t) , H(t 1) , X, ✓) 6: H(t) ⇠ p(H|S(t) , ✓) 7: return {S(t) }t=burnin:chain length learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as ‘persistent CD’; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to finetune the parameters in an EM-like scheme as described in [13]. 
3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models. 4 Experiments Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 ⇥ 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs. 5 (c) Completion (a) Sampling (b) Diffs ! ! ! Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink). 
To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object’s details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, e.g. the length of the person’s trousers on one leg affects the model’s predictions for the other, demonstrating the model’s knowledge about long-range dependencies. An interactive M ATLAB GUI for sampling from this MSBM has been included in the supplementary material. The Penn-Fudan dataset (at 200 ⇥ 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components1 . We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time. During inference, the shape and appearance models (which are defined on images of different sizes), were combined at 200 ⇥ 100 pixels via M ATLAB’s imresize function, and we set = 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized M ATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy obtained, had the top scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrianspecific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c). 
1 We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact. 6 Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]). FG BG Upper Body Lower Body Head Average Bo and Fowlkes [21] 73.3% 81.1% 73.6% 71.6% 51.8% 69.5% MSBM MSBM + SP 70.7% 71.6% 72.8% 73.8% 68.6% 69.9% 66.7% 68.5% 53.0% 54.1% 65.3% 66.6% Top seed Top seed + SP 59.0% 61.6% 61.8% 67.3% 56.8% 60.8% 49.8% 54.1% 45.5% 43.5% 53.5% 56.4% Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label. BG Body Wheel Window Bumper License Light Average ISM [22] 93.2% 72.2% 63.6% 80.5% 73.8% 56.2% 34.8% 86.8% MSBM 94.6% 72.7% 36.8% 74.4% 64.9% 17.9% 19.9% 86.0% Top seed 92.2% 68.4% 28.3% 63.8% 45.4% 11.2% 15.1% 81.8% ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semiprofile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 ⇥ 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components and we set = 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images). Qualitative and quantitative results that show the performance of model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the ‘license’ and ‘light’ labels to mainly be due to ISM’s use of interest-points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease. 5 Conclusions and future work In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours, with a discriminative, bottom-up segmentation algorithm could be highly effective. 
We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model. Acknowledgments Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886). 7 (a) Test (c) MSBM (b) Bo and Fowlkes (d) Ground truth Background Hair Face Upper Shoes Legs Lower Arms (d) Ground truth (c) MSBM (b) Thomas et al. (a) Test Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. Background Body Wheel Window Bumper License Headlight Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. 8 References [1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012. [2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010. [3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973. [4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982. [5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987. [6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006. [7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006. [8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004. [9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005. [10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044–2051, 2009. [11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Partsbased Object Understanding. In BMVC, pages 18.1–18.12, 2011. [12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011. [13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009. [14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008. [15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004. [16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004. [17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. 
Unsupervised Segmentation of Objects using Efficient Learning. IEEE CVPR, pages 1–7, 2007. [18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010. [19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of ForegroundBackground Segmentation using Masked RBMs. In ICANN, 2011. [20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011. [21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011. [22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot’s Attention. IJRR, 28(8):976–998, 2009. [23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008. [24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010. [25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009. 9

3 0.095275797 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patters called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the stateof-the-art in both 2D [1] and 3D object detection [2]. 1

4 0.09297739 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

Author: Won H. Kim, Deepti Pachauri, Charles Hatt, Moo. K. Chung, Sterling Johnson, Vikas Singh

Abstract: Hypothesis testing on signals defined on surfaces (such as the cortical surface) is a fundamental component of a variety of studies in Neuroscience. The goal here is to identify regions that exhibit changes as a function of the clinical condition under study. As the clinical questions of interest move towards identifying very early signs of diseases, the corresponding statistical differences at the group level invariably become weaker and increasingly hard to identify. Indeed, after a multiple comparisons correction is adopted (to account for correlated statistical tests over all surface points), very few regions may survive. In contrast to hypothesis tests on point-wise measurements, in this paper, we make the case for performing statistical analysis on multi-scale shape descriptors that characterize the local topological context of the signal around each surface vertex. Our descriptors are based on recent results from harmonic analysis that show how wavelet theory extends to non-Euclidean settings (i.e., irregular weighted graphs). We provide strong evidence that these descriptors successfully pick up group-wise differences, where traditional methods either fail or yield unsatisfactory results. Beyond this primary application, we show how the framework allows performing cortical surface smoothing in the native space without mapping to a unit sphere.

5 0.087048523 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Author: Dan Ciresan, Alessandro Giusti, Luca M. Gambardella, Jürgen Schmidhuber

Abstract: We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or nonmembrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific postprocessing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer. 1

6 0.05422258 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

7 0.052395862 193 nips-2012-Learning to Align from Scratch

8 0.052369885 62 nips-2012-Burn-in, bias, and the rationality of anchoring

9 0.052369885 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

10 0.049379807 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

11 0.048510395 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

12 0.047557004 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

13 0.044483937 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

14 0.041085262 344 nips-2012-Timely Object Recognition

15 0.039952643 40 nips-2012-Analyzing 3D Objects in Cluttered Images

16 0.039683435 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

17 0.039626386 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.038121153 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

19 0.037570711 26 nips-2012-A nonparametric variable clustering model

20 0.037103571 201 nips-2012-Localizing 3D cuboids in single-view images


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.109), (1, 0.033), (2, -0.079), (3, -0.002), (4, -0.013), (5, -0.039), (6, 0.007), (7, -0.047), (8, 0.02), (9, -0.009), (10, -0.014), (11, -0.025), (12, -0.012), (13, -0.087), (14, 0.028), (15, 0.017), (16, -0.012), (17, 0.017), (18, -0.011), (19, -0.003), (20, -0.002), (21, -0.017), (22, -0.024), (23, 0.042), (24, 0.015), (25, 0.003), (26, -0.064), (27, 0.052), (28, -0.043), (29, -0.002), (30, 0.011), (31, -0.014), (32, -0.033), (33, 0.024), (34, 0.016), (35, -0.04), (36, 0.006), (37, -0.018), (38, -0.059), (39, 0.01), (40, -0.059), (41, -0.036), (42, -0.004), (43, -0.099), (44, 0.079), (45, 0.049), (46, 0.025), (47, -0.032), (48, 0.038), (49, -0.097)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90910369 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.
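The ddCRP at the heart of this paper replaces the CRP's table assignments with customer-to-customer links drawn from pairwise distances; clusters are the connected components of the link graph. Below is a minimal sketch of sampling from this prior, with an illustrative decay function f and concentration alpha. In the paper's mesh setting, f would be supported only on adjacent faces, which is what guarantees spatially connected parts.

```python
import numpy as np

def ddcrp_sample(D, alpha=1.0, f=lambda d: np.exp(-d)):
    """Draw one partition from a distance dependent CRP.

    Customer i links to customer j with probability proportional to
    f(D[i, j]) (and to itself with probability proportional to alpha);
    clusters are the connected components of the resulting link graph.
    """
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        w = f(D[i])
        w[i] = alpha
        links[i] = np.random.choice(n, p=w / w.sum())
    # Union-find over the customer links to recover the clusters.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        parent[find(i)] = find(links[i])
    return np.array([find(i) for i in range(n)])

D = np.abs(np.subtract.outer(np.arange(6), np.arange(6))).astype(float)
print(ddcrp_sample(D, alpha=2.0))  # nearby customers tend to share a label
```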

2 0.57356942 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.
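A sketch of the "multinomial visible units" idea named in this abstract: one block-Gibbs sweep in an RBM whose pixels are softmax (one-of-L) part labels rather than binary foreground/background bits. This is a single-layer toy with illustrative sizes, not the paper's full MSBM.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def msbm_gibbs_step(S, W, b, c, rng):
    """One block-Gibbs sweep in an RBM with multinomial visible units.

    S : (P, L) one-hot part labels for P pixels, W : (P, L, J) weights,
    b : (P, L) visible biases, c : (J,) hidden biases.
    """
    # Hidden units are conditionally independent Bernoullis given S.
    h = (rng.random(c.shape) < sigmoid(np.einsum('pl,plj->j', S, W) + c)) * 1.0
    # Each pixel's label is then a softmax over the L parts given h.
    probs = softmax(np.einsum('plj,j->pl', W, h) + b, axis=1)
    S_new = np.stack([rng.multinomial(1, p) for p in probs])
    return S_new, h

rng = np.random.default_rng(0)
P, L, J = 12, 3, 8                 # pixels, parts, hidden units
W = rng.normal(0, 0.1, (P, L, J))
S = np.eye(L)[rng.integers(0, L, P)]
S, h = msbm_gibbs_step(S, W, np.zeros((P, L)), np.zeros(J), rng)
```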

3 0.54233706 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset.
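The two second-stage ingredients named here, a morphable 3D shape model and a weak-perspective camera, compose as sketched below. This is a generic sketch under their standard definitions, not the authors' implementation; all shapes and parameters are illustrative.

```python
import numpy as np

def weak_perspective(points3d, R, scale, t):
    """Weak-perspective projection: x = s * R[:2] @ X + t, per point."""
    return scale * (points3d @ R[:2].T) + t

def morphable_shape(basis, coeffs):
    """Morphable 3D shape: mean shape plus a linear combination of bases."""
    return basis[0] + np.tensordot(coeffs, basis[1:], axes=1)

basis = np.random.randn(4, 10, 3)      # mean + 3 deformation modes, 10 points
shape = morphable_shape(basis, np.array([0.5, -0.2, 0.1]))
theta = np.pi / 6                       # rotate about the vertical axis
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
xy = weak_perspective(shape, R, scale=2.0, t=np.array([120.0, 80.0]))
print(xy.shape)                         # (10, 2) image coordinates
```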

4 0.54179138 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

Author: Qirong Ho, Junming Yin, Eric P. Xing

Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N²) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is Θ(∑ᵢ Dᵢ²) (where Dᵢ is the degree of vertex i), which is much smaller than N² for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280,000-node network, which is infeasible for network models with Ω(N²) inference cost.
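Counting the triangular motifs is elementary given degrees and triangle counts, which is where the Θ(∑ᵢ Dᵢ²) bound comes from. A small sketch (dense adjacency for brevity; this is the counting identity, not the paper's mixed-membership model):

```python
import numpy as np

def triangular_motifs(A):
    """Count vertex triples carrying 2 or 3 edges (the 'triangular motifs').

    Wedges (2-paths) number sum_i C(D_i, 2); each triangle contains three
    wedges, so #(2-edge triples) = wedges - 3*T and #(3-edge triples) = T.
    """
    deg = A.sum(axis=1)
    wedges = int((deg * (deg - 1) / 2).sum())
    T = int(round(np.trace(A @ A @ A) / 6))   # closed triangles
    return (wedges - 3 * T) + T               # Theta(sum_i D_i^2) total

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(triangular_motifs(A))   # 2 open wedges + 1 triangle = 3
```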

5 0.53179163 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

Author: Won H. Kim, Deepti Pachauri, Charles Hatt, Moo. K. Chung, Sterling Johnson, Vikas Singh

Abstract: Hypothesis testing on signals defined on surfaces (such as the cortical surface) is a fundamental component of a variety of studies in Neuroscience. The goal here is to identify regions that exhibit changes as a function of the clinical condition under study. As the clinical questions of interest move towards identifying very early signs of diseases, the corresponding statistical differences at the group level invariably become weaker and increasingly hard to identify. Indeed, after a multiple comparisons correction is adopted (to account for correlated statistical tests over all surface points), very few regions may survive. In contrast to hypothesis tests on point-wise measurements, in this paper, we make the case for performing statistical analysis on multi-scale shape descriptors that characterize the local topological context of the signal around each surface vertex. Our descriptors are based on recent results from harmonic analysis, that show how wavelet theory extends to non-Euclidean settings (i.e., irregular weighted graphs). We provide strong evidence that these descriptors successfully pick up group-wise differences, where traditional methods either fail or yield unsatisfactory results. Beyond this primary application, we show how the framework allows performing cortical surface smoothing in the native space without mapping to a unit sphere.

6 0.51455498 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

7 0.51345974 201 nips-2012-Localizing 3D cuboids in single-view images

8 0.47434688 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

9 0.46891946 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

10 0.46343186 2 nips-2012-3D Social Saliency from Head-mounted Cameras

11 0.44195887 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

12 0.43152547 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

13 0.4249635 303 nips-2012-Searching for objects driven by context

14 0.42189652 146 nips-2012-Graphical Gaussian Vector for Image Categorization

15 0.42004091 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

16 0.4123058 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

17 0.41162694 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

18 0.40871546 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

19 0.40717244 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

20 0.40345135 298 nips-2012-Scalable Inference of Overlapping Communities


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.049), (21, 0.07), (38, 0.052), (39, 0.023), (42, 0.038), (44, 0.012), (54, 0.025), (55, 0.039), (59, 0.016), (74, 0.056), (76, 0.118), (80, 0.071), (86, 0.291), (92, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74579442 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

Author: Soumya Ghosh, Matthew Loper, Erik B. Sudderth, Michael J. Black

Abstract: We develop a method for discovering the parts of an articulated object from aligned meshes of the object in various three-dimensional poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying 3D shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better deformation predictions than conventional clustering methods.

2 0.64706743 303 nips-2012-Searching for objects driven by context

Author: Bogdan Alexe, Nicolas Heess, Yee W. Teh, Vittorio Ferrari

Abstract: The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image, exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while achieving higher object detection performance.

3 0.6463576 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon's entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions.
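The analytic tractability mentioned here includes a closed form for the posterior mean of H under a fixed Dirichlet prior on a finite alphabet — exactly the quantity whose prior sensitivity motivates the paper's mixtures. A sketch of that building block (the paper itself works with Dirichlet and Pitman-Yor processes over countably infinite alphabets):

```python
import numpy as np
from scipy.special import digamma

def dirichlet_entropy_mean(counts, alpha=1.0):
    """Posterior mean of Shannon entropy H (in nats) under a symmetric
    Dirichlet(alpha) prior: E[H | n] = psi(A+1) - sum_i (a_i/A) psi(a_i+1),
    with a_i = alpha + n_i and A = sum_i a_i."""
    a = np.asarray(counts, dtype=float) + alpha
    A = a.sum()
    return digamma(A + 1) - np.sum((a / A) * digamma(a + 1))

counts = np.array([8, 3, 1, 0, 0])           # an under-sampled histogram
print(dirichlet_entropy_mean(counts))         # estimate with alpha = 1
print(dirichlet_entropy_mean(counts, 0.1))    # a sparser prior pulls H down
```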

4 0.57048106 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.
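The preference kernel referred to in this abstract has a standard closed form: a pairwise comparison (item i preferred to item j) is embedded so that the induced GP is antisymmetric under swapping the items in a pair. A sketch with an illustrative RBF base kernel:

```python
import numpy as np

def rbf(x, y, ell=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * ell ** 2))

def preference_kernel(pair_a, pair_b, k=rbf):
    """Kernel between two preference pairs (first item preferred to second):
    k_pref((xi,xj),(xk,xl)) = k(xi,xk) + k(xj,xl) - k(xi,xl) - k(xj,xk)."""
    (xi, xj), (xk, xl) = pair_a, pair_b
    return k(xi, xk) + k(xj, xl) - k(xi, xl) - k(xj, xk)

a = (np.array([0.0, 1.0]), np.array([1.0, 0.0]))   # item 1 preferred to item 2
b = (np.array([0.1, 0.9]), np.array([0.9, 0.1]))
print(preference_kernel(a, b))
# Swapping the items in one pair flips the sign (antisymmetry):
print(preference_kernel(a, (b[1], b[0])))
```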

5 0.52552897 275 nips-2012-Privacy Aware Learning

Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.
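A toy instance of the local-privacy setting studied here: each individual perturbs their own datum before the learner ever sees it, and the noise scale 1/ε makes the privacy-utility tradeoff visible empirically. The Laplace mechanism below is a standard example of such a channel, not the paper's estimator or its minimax analysis.

```python
import numpy as np

def local_laplace_release(x, eps, rng):
    """Each individual releases x_i + Laplace(sensitivity/eps) noise, so the
    learner never sees raw data; here x_i lies in [0, 1], sensitivity 1."""
    return x + rng.laplace(scale=1.0 / eps, size=x.shape)

rng = np.random.default_rng(0)
x = rng.random(10000)            # private data, held by individuals
for eps in [0.1, 1.0, 10.0]:
    z = local_laplace_release(x, eps, rng)
    # Less privacy (larger eps) -> a more accurate mean estimate.
    print(f"eps={eps:5.1f}  |mean error|={abs(z.mean() - x.mean()):.4f}")
```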

6 0.50643629 193 nips-2012-Learning to Align from Scratch

7 0.50103068 210 nips-2012-Memorability of Image Regions

8 0.50045407 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

9 0.50014925 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

10 0.49455467 60 nips-2012-Bayesian nonparametric models for ranked data

11 0.49371922 188 nips-2012-Learning from Distributions via Support Measure Machines

12 0.49286562 197 nips-2012-Learning with Recursive Perceptual Representations

13 0.49178383 168 nips-2012-Kernel Latent SVM for Visual Recognition

14 0.4915753 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

15 0.49053591 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

16 0.48905176 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

17 0.48802048 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.48731989 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

19 0.48693028 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

20 0.48596621 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies