nips2012-230 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Abner Guzmán-Rivera, Dhruv Batra, Pushmeet Kohli
Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M-Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper bound on this loss function. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy.
Reference: text
1 Introduction
A number of problems in Computer Vision, Natural Language Processing and Computational Biology involve predictions over complex but structured, interdependent outputs; this is known as structured-output prediction. Notice that such an algorithm always makes a single prediction $\hat{y}_i$ and pays a penalty $\ell(y_i, \hat{y}_i)$ for that prediction. The goal of interactive machine-learning algorithms is to produce an output for an expert or a user in the loop. Popular examples include tools for interactive image segmentation (where the system produces a cutout of an object from a picture [5, 25]) and systems for image processing/manipulation tasks such as image denoising and deblurring. These problems are typically modeled using structured probabilistic models and involve computing the Maximum a Posteriori (MAP) solution. In order to minimize user interactions, the interface could show not just a single prediction but a small set of diverse predictions, and simply let the user pick the best one. In such a setting, at the initial stages it is not necessary to make the perfect prediction; rather, the goal is to make a set of plausible predictions, which may then be re-ranked or combined by a secondary mechanism.
For instance, in Computer Vision this is the case for state-of-the-art methods for human-pose estimation, which produce multiple predictions that are then refined by employing a temporal model [23, 3]. In Natural Language Processing, this is the case for sentence parsing [8] and machine translation [26], where an initial system produces a list of M-Best hypotheses [12, 24] (also called k-best lists in the NLP literature), which are then re-ranked. The common principle in both scenarios is that we need to generate a set of plausible hypotheses for an algorithm/expert downstream to evaluate. Traditionally, this is accomplished by learning a single-output model and then producing M-Best hypotheses from it (also called the M-Best MAP problem [20, 11, 29] or the Diverse M-Best problem [3] in the context of graphical models). The key motivating question for this paper is: can we learn to produce a set of plausible hypotheses? We refer to such a setting as Multiple Choice Learning (MCL), because the learner must learn to produce multiple choices for an expert or other algorithm.
This paper presents an algorithm for MCL, formulated as multiple-output structured-output learning: given an input sample $x_i$, the algorithm produces a set of M hypotheses $\{\hat{y}_i^1, \ldots, \hat{y}_i^M\}$. We first present a meaningful loss function for this task that effectively captures the setup of the problem. Next, we present a max-margin formulation for training this M-tuple predictor that minimizes an upper bound on the loss function. Despite the popularity of M-Best approaches, to the best of our knowledge this is the first attempt to directly model the M-Best prediction problem. Experimental results on the problems of image segmentation and protein side-chain prediction show that our method outperforms conventional M-Best prediction approaches used for this scenario and leads to substantial improvements in prediction accuracy.
2 Preliminaries: (Single-Output) Structured-Output Prediction
We begin by reviewing classical (single-output) structured-output prediction and establishing the notation used in the paper. Given a training dataset of input-output pairs $\{(x_i, y_i) \mid i \in [n],\ x_i \in \mathcal{X},\ y_i \in \mathcal{Y}\}$, we are interested in learning a mapping $f: \mathcal{X} \to \mathcal{Y}$ from an input space $\mathcal{X}$ to a structured output space $\mathcal{Y}$ that is finite but typically exponentially large.
The quality of a prediction $\hat{y}_i = f(x_i)$ is measured by a task-specific loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, where $\ell(y_i, \hat{y}_i)$ denotes the cost of predicting $\hat{y}_i$ when the correct label is $y_i$. Some examples of loss functions are the intersection/union criterion used by the PASCAL Visual Object Category Segmentation Challenge [10] and the BLEU score used to evaluate machine translations [22].
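For concreteness, here is a minimal sketch of two such task losses for label-map outputs: the pixel error used later in the segmentation experiments and an intersection/union-style loss. The array layout and function names are our own illustration, not code from the paper.

```python
import numpy as np

def pixel_error(y_true, y_pred):
    """Fraction of incorrectly labeled pixels (the task loss used in the segmentation experiments)."""
    return float(np.mean(y_true != y_pred))

def intersection_over_union_loss(y_true, y_pred, label=1):
    """1 - IoU of a chosen foreground label, in the spirit of the PASCAL criterion."""
    t, p = (y_true == label), (y_pred == label)
    union = np.logical_or(t, p).sum()
    if union == 0:            # both empty: treat as perfect agreement
        return 0.0
    inter = np.logical_and(t, p).sum()
    return 1.0 - inter / union
```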
Tsochantaridis et al. [28] proposed to optimize a regularized surrogate loss function:
$$\min_{w}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i\in[n]} \ell_i(w) \qquad (1)$$
where $C$ is a positive multiplier and $\ell_i(w) = \max_{y}\ \ell(y_i, y) + w^T\phi(x_i, y) - w^T\phi(x_i, y_i)$ is the structured hinge-loss. Equivalently, the problem can be written with slack variables and margin constraints:
$$w^T\phi(x_i, y_i) - w^T\phi(x_i, y)\ \ge\ \ell(y_i, y) - \xi_i \quad \forall y \in \mathcal{Y}\setminus y_i \qquad (3b)$$
$$\xi_i\ \ge\ 0 \qquad (3c)$$
This formulation is known as the margin-rescaled n-slack SSVM [28]. Intuitively, we can see that it minimizes the squared norm of $w$ subject to constraints that enforce a soft margin between the score of the ground-truth $y_i$ and the scores of all other predictions.
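A hedged sketch of the structured hinge-loss above, computed by brute-force enumeration over a small output space; in practice the inner maximization is solved by loss-augmented inference rather than enumeration. The `phi`, `loss` and `output_space` arguments are assumed to be supplied by the caller and are not names from the paper.

```python
import numpy as np

def structured_hinge_loss(w, phi, x_i, y_i, loss, output_space):
    """max_y [ loss(y_i, y) + w^T phi(x_i, y) ] - w^T phi(x_i, y_i),
    enumerated over a small output space Y (toy setting only)."""
    score_gt = w @ phi(x_i, y_i)
    best = 0.0  # the hinge is non-negative: y = y_i contributes exactly 0
    for y in output_space:
        val = loss(y_i, y) + w @ phi(x_i, y) - score_gt
        best = max(best, val)
    return best
```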
A multiple-output SSVM is a mapping from the input space $\mathcal{X}$ to an M-tuple of structured outputs $Y_i = \{\hat{y}_i^1, \ldots, \hat{y}_i^M \mid \hat{y}_i^m \in \mathcal{Y}\}$, given by $g: \mathcal{X} \to \mathcal{Y}^M$, where $g(x) = \operatorname{argmax}_{Y \in \mathcal{Y}^M} W^T\Phi(x, Y)$.
We make a mean-field-like simplifying assumption that the set score factors into independent predictor scores, i.e., $g(x) = \big(f^1(x), \ldots, f^M(x)\big)$, where $f^m(x) = \operatorname{argmax}_{y\in\mathcal{Y}} w_m^T \phi^m(x, y)$.
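Under this factorization, set prediction reduces to M independent argmax problems. A small illustrative sketch follows (again enumerating a toy output space; sharing a single feature map across predictors is a simplification of ours, not the paper's setup):

```python
import numpy as np

def predict_set(ws, phi, x, output_space):
    """g(x) = (f^1(x), ..., f^M(x)) with f^m(x) = argmax_y w_m^T phi(x, y).
    `ws` is a list of M weight vectors; the argmax is brute-forced over a small output space."""
    predictions = []
    for w_m in ws:
        scores = [w_m @ phi(x, y) for y in output_space]
        predictions.append(output_space[int(np.argmax(scores))])
    return predictions
```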
Let $\hat{Y}_i = \{\hat{y}_i^1, \ldots, \hat{y}_i^M\}$ be the set of predicted outputs for input $x_i$. In the single-output SSVM there typically exists a ground-truth output $y_i$ for each datapoint, and the quality of $\hat{y}_i$ is measured with respect to $y_i$ via the loss $\ell$. For our multiple-output predictor, we need to define a task-specific loss function that can measure the quality of any set of predictions $\hat{Y}_i \in \mathcal{Y}^M$.
Ideally, the quality of these predictions should be evaluated by the secondary mechanism that uses them. For instance, in an interactive setting where they are shown to a user, the quality of $\hat{Y}_i$ could be measured by how much it reduces the user-interaction time. In the M-Best hypotheses re-ranking scenario, the accuracy of the top single output after re-ranking could be used as the quality measure for $\hat{Y}_i$. While multiple options exist, in order to provide a general formulation and to isolate our approach, we propose the “oracle” or “hindsight” set-loss as a surrogate:
$$L(\hat{Y}_i) = \min_{\hat{y}_i \in \hat{Y}_i} \ell(y_i, \hat{y}_i) \qquad (4)$$
(Our formulation is described with a nominal ordering of the predictions.)
That is, the set of predictions $\hat{Y}_i$ only pays a loss for the most accurate prediction contained in the set (e.g., the best segmentation of an image, or the best translation of a sentence).
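In code, the oracle set-loss of Eq. (4) is simply a min over the per-hypothesis losses; a one-function sketch:

```python
def oracle_set_loss(y_true, predictions, loss):
    """Eq. (4): the set only pays for its most accurate member."""
    return min(loss(y_true, y_hat) for y_hat in predictions)
```

For example, `oracle_set_loss(y, [y_hat1, y_hat2], pixel_error)` returns the pixel error of whichever hypothesis is closer to the ground truth.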
This loss has the desirable behaviour that predicting a set that contains even a single accurate output is better than predicting a set that has none. Moreover, only being penalized for the most accurate prediction allows an ensemble to hedge its bets without having to pay for being too diverse (the opposite of the effect that replacing the min with a max or an average would have). However, this also makes the set-loss rather poorly conditioned: if even a single prediction in the ensemble is the ground-truth, the set-loss is 0, no matter what else is predicted.
Coordinate Descent for Learning Multiple Predictors
We now present our algorithm for learning a multiple-output SSVM by minimizing the regularized min-hinge loss:
$$\min_{W}\ \tfrac{1}{2}\|W\|^2 + C \sum_{i\in[n]} \tilde{H}_i(W) \qquad (7)$$
We begin by rewriting the min-hinge loss in terms of indicator “flag” variables:
$$\min_{W,\rho}\ \tfrac{1}{2}\|W\|^2 + C \sum_{i\in[n]} \sum_{m\in[M]} \rho_{i,m}\, \ell_i(w_m) \qquad (8a)$$
$$\text{s.t.}\ \sum_{m\in[M]} \rho_{i,m} = 1 \quad \forall i \in [n] \qquad (8b)$$
$$\rho_{i,m} \in \{0,1\} \quad \forall i \in [n],\ m \in [M] \qquad (8c)$$
where $\rho_{i,m}$ is a flag variable that indicates which predictor produces the smallest hinge-loss.
Given the current predictors, the minimization over the flag variables decomposes into n independent problems, which simply identify the best predictor for each datapoint according to the current hinge-losses. Overall, the block-coordinate descent algorithm iteratively assigns each datapoint to a particular predictor (Step 1) and then independently trains each predictor with just the points that were assigned to it (Step 2).
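A schematic of this alternation, assuming hypothetical `train_ssvm(examples)` and `hinge_loss(w, x, y)` routines (neither name comes from the paper); the initial assignment mirrors the k-means initialization described in the experiments:

```python
def mcl_coordinate_descent(data, M, train_ssvm, hinge_loss, init_assignment, n_iters=10):
    """data: list of (x_i, y_i) pairs; init_assignment: predictor index in [0, M) for each datapoint."""
    assignment = list(init_assignment)   # e.g., from k-means clustering of the inputs
    ws = [None] * M
    for _ in range(n_iters):
        # Step 2: train each predictor on the points currently assigned to it
        for m in range(M):
            subset = [ex for ex, a in zip(data, assignment) if a == m]
            if subset:                   # sketch: a predictor with no assigned points keeps its old weights
                ws[m] = train_ssvm(subset)
        # Step 1: reassign each datapoint to the predictor with the smallest hinge-loss
        assignment = [min(range(M), key=lambda m: hinge_loss(ws[m], x, y)) for (x, y) in data]
    return ws
```

This sketch assumes every predictor receives at least one point under the initial assignment, which holds for a k-means-based initialization.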
Notice that at K = M, all predictors learn the same mapping. On the other hand, in our setting there is a single ground-truth label for each instance and the learner makes multiple guesses, all of which are evaluated against that single ground-truth.
To the best of our knowledge, the only other work that explicitly addresses the task of predicting multiple structured outputs is multi-label structured prediction (MLSP) [19]. This work may be seen as a technique for outputting predictions in the power-set of $\mathcal{Y}$ ($2^{\mathcal{Y}}$) with a learning cost comparable to algorithms for prediction over $\mathcal{Y}$. (For example, the algorithm makes multiple guesses, where each guess is a set of bounding boxes of objects in an image.) Note that in this case there is a set of ground-truth labels and the model's prediction is a single label (evaluated against the closest ground-truth).
Latent variable models typically maximize or marginalize the model score across the latent variables, while MCL uses the flag variables as a representation of the oracle loss. Ensemble methods are also related; however, the key difference is that ensemble methods attempt to combine outputs from multiple weak predictors to ultimately make a single prediction. In contrast, we are interested in making multiple predictions.
Figure 1: Each row shows (a) the input image, (b) the ground-truth segmentation, and (c–h) the set of predictions produced by MCL (M = 6). The predictors produce different plausible foreground hypotheses.
We tested the MCL algorithm on two problems: (i) foreground-background segmentation in image collections and (ii) protein side-chain prediction. In both problems, making a single perfect prediction is difficult due to inherent ambiguity in the tasks. The goal of our experiments is to study how much predicting a set of plausible hypotheses helps. Our experiments show that MCL is able to produce sets of hypotheses which contain more accurate predictions than other algorithms and baselines aimed at producing multiple hypotheses. The segmentation task is modeled as a binary pairwise MRF where each node corresponds to a superpixel [1] in the image.
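As a rough illustration of the kind of score such a model assigns to a labeling, here is a sketch of a binary pairwise energy over superpixels; the Potts-style pairwise term and the array layout are our assumptions rather than details taken from the paper.

```python
import numpy as np

def mrf_energy(labels, unary, edges, pairwise_weight):
    """Energy of a binary labeling over superpixels:
    sum of unary costs plus a Potts-style penalty on disagreeing neighbors.
    labels: int array of {0,1} per superpixel; unary: (n_superpixels, 2) cost array;
    edges: list of (i, j) neighbor pairs; pairwise_weight: per-edge weights."""
    e = unary[np.arange(len(labels)), labels].sum()
    for (i, j), w_ij in zip(edges, pairwise_weight):
        if labels[i] != labels[j]:
            e += w_ij
    return float(e)
```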
We compare our algorithm against three alternatives for producing multiple predictions: (i) a single SSVM + M-Best MAP [29], (ii) a single SSVM + Diverse M-Best MAP [3], and (iii) Clustering + Multiple SSVMs. For the first two baselines, we used all training images to learn a single SSVM and then produced multiple segmentations via M-Best MAP and Diverse M-Best MAP [3]. The third baseline, Clustering + Multiple SSVMs (C-SSVM), involves first clustering the training images into M clusters and then training M SSVMs independently, one per cluster. For each algorithm we varied the number of predictors $M \in \{1, 2, \ldots\}$. We used the output of k-means clustering as the initial assignment of images to predictors, so MCL's first coordinate descent iteration produces the same results as C-SSVM.
Figure 2: In each column, the first row shows the input image; the second shows the ground-truth; the third shows the segmentation produced by the single-SSVM baseline; and the last two rows show the best MCL predictions (M = 6) at the end of the first and last coordinate descent iterations.
The task loss $\ell(\cdot,\cdot)$ in this experiment is the percentage of incorrectly labeled pixels, and the evaluation metric is the set-loss $L = \min_{\hat{y}_i \in \hat{Y}_i} \ell(y_i, \hat{y}_i)$, i.e., the pixel error of the best segmentation among all predictions.
Figure 3a shows the performance of the various algorithms as a function of the number of predictors M. We observed that M-Best MAP produces nearly identical predictions, and thus the error drops negligibly as M is increased. On the other hand, the diverse M-Best predictions output by DivMBest [3] lead to a substantial drop in the set-loss. MCL outperforms both DivMBest and C-SSVM, confirming our hypothesis that it is beneficial to learn a collection of predictors, rather than learning a single predictor and making diverse predictions from it.
Figure 3b shows the MCL objective and train/test errors as a function of the coordinate descent steps. We verify that the objective function improves at every iteration and notice a clear correlation between the objective and the train/test errors. We observe a monotonic reduction in error as K decreases, which suggests there is a natural clustering of the data and thus learning a single SSVM is detrimental. Figure 1 shows example images, ground-truth segmentations, and the predictions made by M = 6 predictors.
We observe that the M hypotheses are both diverse and plausible. The evolution of the best prediction across coordinate descent iterations can be seen in Figure 2. For protein side-chain prediction, given a protein backbone structure, the task is to predict the amino acid side-chain configurations.
[Figures 3 and 4 (plots): task-loss (pixel error %) vs. M, vs. C, and vs. K, and the MCL objective and pixel error across coordinate descent iterations; curves for MCL, MCL (train), C-SSVM, M-Best and DivM-Best.]
For this application there is no natural analogue to the C-SSVM baseline, so we used a boosting-like baseline: we first train an SSVM on the entire training data, use the training instances with high error to train a second SSVM, and so on.
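A sketch of this boosting-like baseline under our own assumptions (in particular, the fraction of high-error instances kept at each round and the hypothetical `train_ssvm`, `predict` and `task_loss` routines):

```python
def boosting_like_baseline(data, M, train_ssvm, task_loss, predict, frac=0.5):
    """Train predictor 1 on all data, then repeatedly retrain on the hardest
    examples under the previous predictor. `frac` (fraction kept per round) is our assumption."""
    models, subset = [], list(data)
    for _ in range(M):
        w = train_ssvm(subset)
        models.append(w)
        # rank the current subset by error and keep only the hardest examples
        errors = [task_loss(y, predict(w, x)) for (x, y) in subset]
        order = sorted(range(len(subset)), key=lambda i: errors[i], reverse=True)
        subset = [subset[i] for i in order[:max(1, int(frac * len(subset)))]]
    return models
```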
Figure 4a confirms that multiple predictors are beneficial, and that MCL is able to outperform the boosting-like baseline. Figure 4b shows the progress of the MCL objective and test loss across coordinate descent iterations; we again observe a positive correlation between the objective and the loss.
6 Discussion and Conclusions
We presented an algorithm for producing a set of structured outputs and argued that in a number of problems it is beneficial to generate a set of plausible and diverse hypotheses. Typically, this is accomplished by learning a single-output model and then producing M-Best hypotheses from it. This causes a disparity between the way the model is trained (to produce a single output) and the way it is used (to produce multiple outputs). Our proposed algorithm (MCL) provides a principled way to directly optimize the multiple-prediction min-set-loss.
While we evaluated the performance of all algorithms in terms of oracle set-loss, it would be interesting to measure the impact of MCL and other baselines on user experience or on final-stage performance in cascaded algorithms. Further, our model assumes a modular scoring function $S(Y) = W^T\Phi(x, Y) = \sum_{m\in[M]} w_m^T \phi^m(x, y^m)$, i.e., one that decomposes over the individual predictors. In a number of situations, the score $S(Y)$ might be a submodular function. Such scoring functions often arise when we want the model to explicitly reward diverse subsets.
Acknowledgments: We thank David Sontag for his assistance with the protein data.
simIndex simValue paperId paperTitle
2 0.095380515 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
Author: Dan Ciresan, Alessandro Giusti, Luca M. Gambardella, Jürgen Schmidhuber
Abstract: We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or nonmembrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific postprocessing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer. 1
3 0.094750471 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic
Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1
4 0.084819496 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
Author: Xianghang Liu, James Petterson, Tibério S. Caetano
Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1
5 0.080648996 8 nips-2012-A Generative Model for Parts-based Object Segmentation
Author: S. Eslami, Christopher Williams
Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.
6 0.078216709 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
7 0.073390096 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
8 0.067686066 200 nips-2012-Local Supervised Learning through Space Partitioning
9 0.066350758 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
10 0.064681344 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
11 0.063184224 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
12 0.062170941 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
13 0.061252505 204 nips-2012-MAP Inference in Chains using Column Generation
14 0.061020024 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
15 0.061019093 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
16 0.059316512 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
17 0.058451157 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
18 0.056165203 38 nips-2012-Algorithms for Learning Markov Field Policies
19 0.055006623 361 nips-2012-Volume Regularization for Binary Classification
20 0.055006068 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
topicId topicWeight
[(0, 0.176), (1, 0.009), (2, -0.046), (3, -0.034), (4, 0.074), (5, -0.053), (6, 0.004), (7, 0.008), (8, -0.035), (9, 0.024), (10, -0.031), (11, 0.01), (12, 0.04), (13, 0.063), (14, 0.097), (15, -0.011), (16, -0.052), (17, -0.013), (18, -0.002), (19, -0.018), (20, 0.008), (21, -0.016), (22, -0.023), (23, 0.057), (24, 0.029), (25, -0.021), (26, -0.045), (27, -0.017), (28, 0.034), (29, 0.03), (30, 0.075), (31, -0.014), (32, -0.011), (33, 0.016), (34, 0.01), (35, -0.007), (36, -0.046), (37, -0.018), (38, 0.034), (39, -0.02), (40, 0.033), (41, -0.082), (42, -0.056), (43, -0.018), (44, 0.087), (45, 0.057), (46, 0.019), (47, -0.028), (48, -0.057), (49, -0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.92137223 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
Author: Abner Guzmán-rivera, Dhruv Batra, Pushmeet Kohli
Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M -Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss-function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper-bound on this lossfunction. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy. 1
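A minimal sketch of the evaluation setup described in the abstract above: with M hypotheses available, a downstream user or component picks the best one, so only the minimum loss over the set is paid. The `task_loss` helper below (a simple Hamming loss) is a stand-in for illustration, not the paper's actual loss function or training procedure.

```python
import numpy as np

def task_loss(y_hat, y_true):
    # Hypothetical per-example loss: Hamming distance between two labelings
    # (fraction of positions where the prediction disagrees with ground truth).
    y_hat, y_true = np.asarray(y_hat), np.asarray(y_true)
    return float(np.mean(y_hat != y_true))

def oracle_loss(hypotheses, y_true):
    # "Multiple choice" evaluation: only the best of the M hypotheses counts.
    return min(task_loss(y_hat, y_true) for y_hat in hypotheses)

# Toy example: three candidate labelings of a 6-variable structured output.
y_true = [0, 1, 1, 0, 2, 2]
hypotheses = [
    [0, 1, 1, 0, 0, 0],   # partially correct
    [0, 1, 1, 0, 2, 0],   # best of the set (one error)
    [1, 1, 1, 1, 1, 1],   # poor
]
print(oracle_loss(hypotheses, y_true))  # 1/6 ≈ 0.167
```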
2 0.64003319 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
Author: Dan Ciresan, Alessandro Giusti, Luca M. Gambardella, Jürgen Schmidhuber
Abstract: We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or non-membrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific postprocessing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. Rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer. 1
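The window-based pixel classification described above can be sketched generically. The snippet below only illustrates the sliding-window bookkeeping (mirror padding, one window per pixel); `classify_window` is a hypothetical stand-in for the trained convolutional network, and the window size is an arbitrary assumption.

```python
import numpy as np

def predict_membrane_map(image, classify_window, w=15):
    # Sliding-window pixel classification: the label of each pixel is predicted
    # from raw values in a w x w window centred on it, with mirror padding at
    # the borders. `classify_window` maps a (w, w) patch to a probability.
    r = w // 2
    padded = np.pad(image, r, mode="reflect")
    H, W = image.shape
    probs = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            probs[i, j] = classify_window(padded[i:i + w, j:j + w])
    return probs

# Toy usage with a dummy "classifier" (mean window intensity).
img = np.random.rand(32, 32)
prob_map = predict_membrane_map(img, classify_window=lambda p: float(p.mean()))
```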
3 0.56807411 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem. 1
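For reference, a small sketch of evaluating the Cheeger cut objective for a fixed partition of a weighted graph, assuming the common form Cut(S, Sᶜ)/min(|S|, |Sᶜ|) (other normalizations, e.g. by volume, also appear in the literature); it does not implement the ℓ1 minimization algorithms the paper analyzes.

```python
import numpy as np

def cheeger_cut_value(W, in_S):
    # W: symmetric (n, n) matrix of non-negative edge weights.
    # in_S: boolean array of length n marking membership in the set S.
    # Returns Cut(S, S^c) / min(|S|, |S^c|).
    in_S = np.asarray(in_S, dtype=bool)
    cut = W[in_S][:, ~in_S].sum()               # weight crossing the partition
    balance = min(in_S.sum(), (~in_S).sum())    # size of the smaller side
    return float(cut) / balance if balance > 0 else float("inf")

# Toy example: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.1
print(cheeger_cut_value(W, [True, True, True, False, False, False]))  # ≈ 0.033
```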
4 0.56765801 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
Author: Ashwini Shukla, Aude Billard
Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits the region-partitioning ability of the well-known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multi-stable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.
5 0.56732881 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction
Author: Christoph H. Lampert
Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1
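A rough sketch of the idea above: estimate per-variable marginals from samples and fix a variable's decision once it looks certain. The confidence rule below is a naive frequency threshold rather than the paper's posterior estimate, `draw_sample` is a hypothetical sampler, and all constants are assumptions.

```python
import numpy as np

def mmp_with_pruning(draw_sample, n_vars, n_labels,
                     batches=50, batch_size=20, confidence=0.95):
    # Sampling-based maximum marginal prediction with a simple adaptive rule:
    # once the empirical marginal of a variable's leading label exceeds
    # `confidence`, its decision is fixed (in a full implementation the variable
    # would then be pruned from the factor graph and no longer sampled).
    counts = np.zeros((n_vars, n_labels))
    decided = np.full(n_vars, -1)
    for _ in range(batches):
        for _ in range(batch_size):
            s = draw_sample()                      # one joint sample (int array)
            counts[np.arange(n_vars), s] += 1
        freqs = counts / counts.sum(axis=1, keepdims=True)
        confident = (freqs.max(axis=1) >= confidence) & (decided < 0)
        decided[confident] = freqs[confident].argmax(axis=1)
        if (decided >= 0).all():
            break
    still_open = decided < 0
    decided[still_open] = counts[still_open].argmax(axis=1)  # fallback: plain argmax
    return decided

# Toy usage: 4 independent variables with 3 labels; the sampler is confident
# about variables 0-2 and uncertain about variable 3.
rng = np.random.default_rng(0)
p = np.array([[0.98, 0.01, 0.01], [0.02, 0.97, 0.01],
              [0.01, 0.01, 0.98], [0.40, 0.35, 0.25]])
print(mmp_with_pruning(lambda: np.array([rng.choice(3, p=pi) for pi in p]),
                       n_vars=4, n_labels=3))
```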
6 0.56713641 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
7 0.55180728 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
8 0.5448907 361 nips-2012-Volume Regularization for Binary Classification
9 0.53785521 279 nips-2012-Projection Retrieval for Classification
10 0.53571546 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
11 0.53482705 303 nips-2012-Searching for objects driven by context
12 0.5347088 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
13 0.52987361 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
14 0.52370745 202 nips-2012-Locally Uniform Comparison Image Descriptor
15 0.52273971 8 nips-2012-A Generative Model for Parts-based Object Segmentation
16 0.52237767 72 nips-2012-Cocktail Party Processing via Structured Prediction
17 0.51939648 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
18 0.51896286 39 nips-2012-Analog readout for optical reservoir computers
19 0.51862437 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover
20 0.51859051 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
topicId topicWeight
[(0, 0.033), (9, 0.246), (21, 0.027), (38, 0.17), (39, 0.011), (42, 0.031), (53, 0.011), (54, 0.032), (55, 0.025), (74, 0.077), (76, 0.083), (80, 0.129), (92, 0.039)]
simIndex simValue paperId paperTitle
1 0.8108362 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process
Author: Lloyd Elliott, Yee W. Teh
Abstract: We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by the splitting and merging of their clusters. Our model can be thought of as a discrete analogue of the continuous fragmentation-coagulation process [Teh et al 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining accuracies comparable to other state-of-the-art genotype imputation methods. 1
same-paper 2 0.80404502 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
Author: Abner Guzmán-rivera, Dhruv Batra, Pushmeet Kohli
Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M -Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss-function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper-bound on this lossfunction. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy. 1
3 0.80074358 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees
Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade
Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1
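The numerical identifiability check mentioned above can be illustrated generically: compute a finite-difference Jacobian of a parameter-to-output map and test whether it has full column rank. The toy map below is an assumption for illustration, not one of the parsing models in the paper.

```python
import numpy as np

def numerical_jacobian(f, theta, eps=1e-6):
    # Finite-difference Jacobian of f: R^d -> R^m at the point theta.
    theta = np.asarray(theta, dtype=float)
    f0 = np.asarray(f(theta))
    J = np.zeros((f0.size, theta.size))
    for k in range(theta.size):
        t = theta.copy()
        t[k] += eps
        J[:, k] = (np.asarray(f(t)) - f0) / eps
    return J

def locally_identifiable(f, theta, tol=1e-8):
    # Parameters are locally identifiable (up to the usual caveats, e.g. discrete
    # symmetries such as label swapping) when the Jacobian has full column rank.
    J = numerical_jacobian(f, theta)
    return np.linalg.matrix_rank(J, tol=tol) == len(theta)

# Toy map where only the sum of the two parameters is observable,
# so they are not identifiable from the output.
f = lambda th: np.array([th[0] + th[1], 2.0 * (th[0] + th[1])])
print(locally_identifiable(f, [0.3, 0.7]))  # False
```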
4 0.79159874 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
Author: Minjie Xu, Jun Zhu, Bo Zhang
Abstract: We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages. 1
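As background for the abstract above, here is a deliberately simplified, non-Bayesian sketch of max-margin matrix factorization for binary preferences (hinge loss plus an L2 penalty, optimized by batch subgradient descent). The paper's probabilistic, nonparametric model goes well beyond this; the function name and all constants here are assumptions.

```python
import numpy as np

def mmmf_binary(observations, n_users, n_items, rank=2,
                lam=0.05, lr=0.05, epochs=200, seed=0):
    # Minimize the sum of hinge losses on observed +/-1 entries plus an L2
    # penalty on the factors, by batch subgradient descent.
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        gU, gV = 2 * lam * U, 2 * lam * V
        for i, j, y in observations:               # y is +1 or -1
            if y * U[i].dot(V[j]) < 1.0:           # hinge loss is active
                gU[i] -= y * V[j]
                gV[j] -= y * U[i]
        U -= lr * gU
        V -= lr * gV
    return U, V

# Toy usage: 3 users x 4 items with a few observed binary preferences.
obs = [(0, 0, +1), (0, 1, -1), (1, 1, +1), (2, 2, -1), (2, 3, +1)]
U, V = mmmf_binary(obs, n_users=3, n_items=4)
scores = U @ V.T     # sign(scores[i, j]) is the predicted preference
```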
5 0.72563338 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
Author: Argyris Kalogeratos, Aristidis Likas
Abstract: Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that can be used as a wrapper around any iterative clustering algorithm of the k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: that each cluster admits a unimodal distribution. The proposed algorithm considers each cluster member as an individual ‘viewer’ and applies a univariate statistical hypothesis test for unimodality (dip-test) on the distribution of distances between the viewer and the cluster members. Important advantages are: i) the unimodality test is applied on univariate distance vectors, ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations. Experimental results on artificial and real datasets indicate the effectiveness of our method and its superiority over analogous approaches.
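A sketch of the wrapper logic described above, with `dip_pvalue` as a hypothetical stand-in for Hartigan's dip test and the voting threshold chosen arbitrarily; this is not the paper's exact procedure, and using it would require supplying a real dip-test implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def cluster_looks_multimodal(X_cluster, dip_pvalue, alpha=0.05, viewer_frac=0.01):
    # Each member is a "viewer": run a unimodality test on its vector of
    # distances to the other members, and flag the cluster for splitting when
    # enough viewers vote that their distance distribution is multimodal.
    # `dip_pvalue` is a hypothetical callable: 1-D array -> dip-test p-value.
    D = cdist(X_cluster, X_cluster)
    votes = sum(dip_pvalue(row) < alpha for row in D)
    return votes / len(X_cluster) > viewer_frac

def dip_means(X, dip_pvalue, max_k=10):
    # Incrementally grow k and stop as soon as no cluster looks multimodal.
    for k in range(1, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        if not any(cluster_looks_multimodal(X[labels == c], dip_pvalue)
                   for c in range(k)):
            return k, labels
    return max_k, labels
```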
6 0.68297827 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
7 0.67900163 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
8 0.67874449 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
9 0.67859137 227 nips-2012-Multiclass Learning with Simplex Coding
10 0.6783309 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
11 0.67627198 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
12 0.67511922 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
13 0.67323244 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
14 0.67136449 65 nips-2012-Cardinality Restricted Boltzmann Machines
15 0.67082036 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
16 0.67036623 358 nips-2012-Value Pursuit Iteration
17 0.67026597 293 nips-2012-Relax and Randomize : From Value to Algorithms
18 0.67019123 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
19 0.66924053 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
20 0.66890991 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models