nips nips2012 nips2012-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.
Reference: text
sentIndex sentText sentNum sentScore
1 We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. [sent-5, score-1.856]
2 Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. [sent-6, score-0.783]
3 Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models. [sent-7, score-0.612]
4 1 Introduction A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov Random Field with multiple layers of hidden random variables. [sent-8, score-0.272]
5 Multiple layers of hidden units make learning in DBMs far more difficult [13]. [sent-10, score-0.364]
6 Learning meaningful DBM models, particularly when modelling high-dimensional data, relies on the heuristic greedy pretraining procedure introduced by [7], which is based on learning a stack of modified Restricted Boltzmann Machines (RBMs). [sent-11, score-0.807]
7 Unfortunately, unlike the pretraining algorithm for Deep Belief Networks (DBNs), the existing procedure lacks a proof that adding additional layers improves the variational bound on the log-probability that the model assigns to the training data. [sent-12, score-0.946]
8 In this paper, we first show that under certain conditions, the pretraining algorithm improves a variational lower bound of a two-layer DBM. [sent-13, score-0.704]
9 This result gives a much deeper understanding of the relationship between the pretraining algorithms for Deep Boltzmann Machines and Deep Belief Networks. [sent-14, score-0.577]
10 Using this understanding, we introduce a new pretraining procedure for DBMs and show that it allows us to learn better generative models of handwritten digits and 3D objects. [sent-15, score-0.661]
11 It contains a set of visible units $v \in \{0,1\}^{D}$, and a series of layers of hidden units $h^{(1)} \in \{0,1\}^{F_1}$, $h^{(2)} \in \{0,1\}^{F_2}$, ... [sent-17, score-0.491]
12 The top two layers of a DBN form an undirected graph and the remaining layers form a belief net with directed, top-down connections. [sent-25, score-0.383]
13 Right Pretraining a DBM with three hidden layers consists of learning a stack of RBMs that are then composed to create a DBM. [sent-27, score-0.419]
14 The first and last RBMs in the stack need to be modified by using asymmetric weights. [sent-28, score-0.139]
15 Given the variational parameters µ, the model parameters θ are then updated to maximize the variational bound using stochastic approximation (for details see [7, 11, 14, 15]). [sent-35, score-0.179]
16 Hidden units in higher layers are very under-constrained so there is no consistent learning signal for their weights. [sent-37, score-0.253]
17 To alleviate this problem, [7] introduced a layer-wise pretraining algorithm based on learning a stack of “modified” Restricted Boltzmann Machines (RBMs). [sent-38, score-0.677]
18 When learning parameters of the first layer “RBM”, the bottom-up weights are constrained to be twice the top-down weights (see Fig. [sent-40, score-0.264]
19 Intuitively, using twice the weights when inferring the states of the hidden units h(1) compensates for the initial lack of top-down feedback. [sent-42, score-0.283]
20 Conversely, when pretraining the last “RBM” in the stack, the top-down weights are constrained to be twice the bottom-up weights. [sent-43, score-0.659]
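The weight-doubling heuristic described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: biases are omitted, `W[i][j]` is a hypothetical weight matrix linking lower unit `i` to upper unit `j`, and the function and parameter names (`hidden_probs`, `visible_probs`, `bottom_up_scale`, `top_down_scale`) are invented for this example rather than taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, bottom_up_scale=2.0):
    """P(h_j = 1 | v) when the bottom-up weights are bottom_up_scale * W.

    Biases are omitted (an assumption of this sketch).
    """
    n_hidden = len(W[0])
    return [
        sigmoid(bottom_up_scale * sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(n_hidden)
    ]


def visible_probs(h, W, top_down_scale=1.0):
    """P(v_i = 1 | h) when the top-down weights are top_down_scale * W."""
    n_visible = len(W)
    return [
        sigmoid(top_down_scale * sum(W[i][j] * h[j] for j in range(len(h))))
        for i in range(n_visible)
    ]
```

For the first "RBM" in the stack one would call these with `bottom_up_scale=2.0` and `top_down_scale=1.0`; for the last "RBM" the roles are swapped (`bottom_up_scale=1.0`, `top_down_scale=2.0`).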
21 This heuristic pretraining algorithm works surprisingly well in practice. [sent-46, score-0.564]
22 2 useful insights into what is happening during the pretraining stage. [sent-48, score-0.564]
23 Furthermore, unlike the pretraining algorithm for Deep Belief Networks (DBNs), it lacks a proof that each time a layer is added to the DBM, the variational bound improves. [sent-49, score-0.798]
24 1 Pretraining Algorithm for Deep Belief Networks We first briefly review the pretraining algorithm for Deep Belief Networks [2], which will form the basis for developing a new pretraining algorithm for Deep Boltzmann Machines. [sent-51, score-1.128]
25 Consider pretraining a two-layer DBN using a stack of RBMs. [sent-52, score-0.677]
26 The second RBM in the stack attempts to replace the prior $p(h^{(1)}; W^{(1)})$ by a better model $p(h^{(1)}; W^{(2)}) = \sum_{h^{(2)}} p(h^{(1)}, h^{(2)}; W^{(2)})$, thus improving the fit to the training data. [sent-54, score-0.181]
27 More formally, for any approximating distribution Q(h(1) |v), the DBN’s log-likelihood has the following variational lower bound on the log probability of the training data {v1 , . [sent-55, score-0.154]
28 (3) This is equivalent to training the second layer RBM with vectors drawn from $Q(h^{(1)}|v; W^{(1)})$ as data. [sent-62, score-0.138]
29 Hence, the second RBM in the stack learns a better model of the mixture over all N training cases: $\frac{1}{N}\sum_{n} Q(h^{(1)}|v_n; W^{(1)})$, called the “aggregated posterior”. [sent-63, score-0.142]
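Drawing training vectors for the second RBM from the aggregated posterior amounts to picking a training case uniformly at random and then sampling hidden states from that case's posterior. A minimal sketch, assuming a factorial posterior stored as `q_probs[n][j] = Q(h_j = 1 | v_n)` (the function name and data layout are hypothetical):

```python
import random


def sample_aggregated_posterior(q_probs, rng):
    """Draw one binary vector from the mixture (1/N) * sum_n Q(h | v_n).

    q_probs[n][j] holds Q(h_j = 1 | v_n); Q is assumed to be factorial.
    """
    n = rng.randrange(len(q_probs))  # choose a training case uniformly
    return [1 if rng.random() < p else 0 for p in q_probs[n]]
```

Repeated calls yield a stream of hidden vectors that can serve as the "data" for the next RBM in the stack.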
30 Observe that during the pretraining stage the whole prior of the lower-layer RBM is replaced by the next RBM in the stack. [sent-65, score-0.603]
31 This leads to the hybrid Deep Belief Network model, with the top two layers forming a Restricted Boltzmann Machine, and the lower layers forming a directed sigmoid belief network (see Fig. [sent-66, score-0.439]
32 2 A Variational Bound for Pretraining a Two-layer Deep Boltzmann Machine Consider a simple two-layer DBM with tied weights W(2) = W(1) , as shown in Fig. [sent-69, score-0.151]
33 Similar to DBNs, for any approximate posterior $Q(h^{(1)}|v)$, we can write a variational lower bound on the log probability that this DBM assigns to the training data: $\sum_{n=1}^{N} \log P(v_n) \geq \sum_{n} \mathbb{E}_{Q(h^{(1)}|v_n)}\left[\log P(v_n|h^{(1)}; W^{(1)})\right] - \sum_{n} \mathrm{KL}\left(Q(h^{(1)}|v_n)\,\|\,P(h^{(1)}; W^{(1)})\right)$. [sent-71, score-0.186]
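The bound can be checked numerically on a toy model. The sketch below uses a made-up joint table over one binary visible variable and one binary hidden variable (the numbers are illustrative only, not from the paper) and verifies that the expected log-likelihood minus the KL term never exceeds the true log-probability, with equality when Q is the exact posterior.

```python
import math


def log_marginal(joint, v):
    """log P(v) for a joint table joint[(v, h)] = P(v, h)."""
    return math.log(sum(p for (vv, _), p in joint.items() if vv == v))


def variational_bound(joint, v, q):
    """E_Q[log P(v | h)] - KL(Q(h | v) || P(h)) for q[h] = Q(h | v)."""
    prior = {}
    for (_, h), p in joint.items():
        prior[h] = prior.get(h, 0.0) + p  # P(h) = sum_v P(v, h)
    expected_loglik = sum(
        q[h] * math.log(joint[(v, h)] / prior[h]) for h in q if q[h] > 0
    )
    kl = sum(q[h] * math.log(q[h] / prior[h]) for h in q if q[h] > 0)
    return expected_loglik - kl


# Toy joint distribution (made-up numbers that sum to 1).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
loose = variational_bound(joint, 1, {0: 0.5, 1: 0.5})
tight = variational_bound(joint, 1, {0: 3.0 / 7.0, 1: 4.0 / 7.0})
```

Here `loose` is strictly below `log_marginal(joint, 1)`, while `tight` matches it up to rounding because the second Q is the exact posterior.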
34 b) The second RBM with two sets of replicated hidden units, which will replace half of the 1st RBM’s prior. [sent-75, score-0.2]
35 Right: The DBM with tied weights is trained to model the data using one-step contrastive divergence. [sent-77, score-0.214]
36 In particular, another RBM with two sets of replicated hidden units and tied weights $P(h^{(1)}; W^{(2)}) = \sum_{h^{(2a)},h^{(2b)}} P(h^{(1)}, h^{(2a)}, h^{(2b)}; W^{(2)})$ is trained to be a better model of the aggregated variational posterior $\frac{1}{N}\sum_{n} Q(h^{(1)}|v_n; W^{(1)})$ of the first model (see Fig. [sent-79, score-0.527]
37 The bound of Eq. 5 improves because replacing half of the prior by a better model reduces the KL divergence from the variational posterior: $\sum_{n} \mathrm{KL}\left(Q(h^{(1)}|v_n)\,\|\,P(h^{(1)}; W^{(1)}, W^{(2)})\right) \leq \sum_{n} \mathrm{KL}\left(Q(h^{(1)}|v_n)\,\|\,P(h^{(1)}; W^{(1)})\right)$. (10) [sent-89, score-0.159]
38 Due to the convexity of asymmetric divergence, this is guaranteed to improve the variational bound of the training data by at least half as much as fully replacing the original prior. [sent-90, score-0.235]
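One way to see the "at least half as much" claim is a small numeric check. The sketch assumes that the half-replaced prior can be written as the normalized geometric mean of the original prior and the fully replaced prior; that reading, and the three-state toy distributions below, are assumptions of this example rather than statements from the paper.

```python
import math


def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def geometric_mix(p_old, p_new):
    """Normalized geometric mean of two discrete distributions."""
    unnorm = [math.sqrt(a * b) for a, b in zip(p_old, p_new)]
    z = sum(unnorm)  # z <= 1 by the Cauchy-Schwarz inequality
    return [u / z for u in unnorm]


# Made-up toy distributions over three states.
q = [0.7, 0.2, 0.1]        # stands in for the aggregated posterior
p_old = [0.2, 0.3, 0.5]    # original prior
p_new = [0.6, 0.25, 0.15]  # fully replaced (better) prior
p_half = geometric_mix(p_old, p_new)

gain_half = kl(q, p_old) - kl(q, p_half)
gain_full = kl(q, p_old) - kl(q, p_new)
```

Because the normalizer `z` is at most 1, `kl(q, p_half)` is at most the average of the two KL terms, so `gain_half` is never less than `0.5 * gain_full`.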
39 The procedure for adding an extra layer to a DBN replaces the full prior over the previous top layer, whereas the procedure for adding an extra layer to a DBM only replaces half of the prior. [sent-92, score-0.392]
40 So in a DBM, the weights of the bottom level RBM perform much more of the work than in a DBN, where the weights are only used to define the last stage of the generative process P (v|h(1) ; W(1) ). [sent-93, score-0.185]
41 This result also suggests that adding layers to a DBM will give diminishing improvements in the variational bound, compared to adding layers to a DBN. [sent-94, score-0.422]
42 This may explain why DBMs with three hidden layers typically perform worse than the DBMs with two hidden layers [7, 8]. [sent-95, score-0.56]
43 On the other hand, the disadvantage of the pretraining procedure for Deep Belief Networks is that the top-layer RBM is forced to do most of the modelling work. [sent-96, score-0.697]
44 This may also explain the need to use a large number of hidden units in the top-layer RBM [2]. [sent-97, score-0.203]
45 There is, however, a way to design a new pretraining algorithm that would spread the modelling work more equally across all layers, hence bypassing shortcomings of the existing pretraining algorithms for DBNs and DBMs. [sent-98, score-1.246]
46 Right: The corresponding practical implementation of the pretraining algorithm that uses asymmetric weights. [sent-103, score-0.59]
47 3 Controlling the Amount of Modelling Work done by Each Layer Consider a slightly modified two-layer DBM with two groups of replicated 2nd -layer units, h(2a) and h(2b) , and tied weights (see Fig. [sent-105, score-0.199]
48 The model’s marginal distribution over $h^{(1)}$ is the product of three identical RBM distributions, defined by $h^{(1)}$ and $v$, $h^{(1)}$ and $h^{(2a)}$, and $h^{(1)}$ and $h^{(2b)}$: $P(h^{(1)}; W^{(1)}) = \frac{1}{Z(W^{(1)})} \left(\sum_{v} e^{v^\top W^{(1)} h^{(1)}}\right) \left(\sum_{h^{(2a)}} e^{h^{(2a)\top} W^{(1)} h^{(1)}}\right) \left(\sum_{h^{(2b)}} e^{h^{(2b)\top} W^{(1)} h^{(1)}}\right)$. [sent-107, score-0.174]
49 During the pretraining stage, we keep one of these RBMs and replace the other two by a better prior $P(h^{(1)}; W^{(2)})$. [sent-108, score-0.603]
50 2, we train another RBM, but with three sets of hidden units and tied weights (see Fig. [sent-111, score-0.407]
51 The variational bound on the training data is guaranteed to improve by at least 2/3 as much as fully replacing the original prior. [sent-114, score-0.185]
52 Hence in this slightly modified DBM model, the second layer performs 2/3 of the modelling work compared to the first layer. [sent-115, score-0.203]
53 Clearly, controlling the number of replicated hidden groups allows us to easily control the amount of modelling work left to the higher layers in the stack. [sent-116, score-0.425]
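The arithmetic behind this control knob is simple: if the prior over the first hidden layer is a product of several identical factors and only the factor tied to the visible units is kept, the remaining share is handed to the next RBM. The helper below is a hypothetical illustration; the source states the two-factor (1/2) and three-factor (2/3) cases, and treating larger counts the same way is an extrapolation.

```python
from fractions import Fraction


def fraction_of_prior_replaced(num_factors):
    """Share of the h(1) prior replaced when one of num_factors identical
    RBM factors is kept and the rest are replaced by the next RBM."""
    if num_factors < 2:
        raise ValueError("need at least two factors to replace any of them")
    return Fraction(num_factors - 1, num_factors)


half = fraction_of_prior_replaced(2)        # standard DBM pretraining
two_thirds = fraction_of_prior_replaced(3)  # two replicated 2nd-layer groups
```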
54 We now specify how one would train this initial set of tied weights W(1) . [sent-119, score-0.188]
55 If we knew the initial state vector h(2) , we could train this DBM using one-step contrastive divergence (CD) with mean field reconstructions of both the visible states v and the top-layer states h(2) , as shown in Fig. [sent-122, score-0.165]
56 Using mean-field reconstructions for v and h(2) , one-step CD is exactly equivalent to training a modified “RBM” with only one hidden layer but with bottom-up weights that are twice the top-down weights, as defined in the original pretraining algorithm (see Fig. [sent-125, score-0.944]
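A rough sketch of such a one-step CD update for the modified "RBM" follows. Several details are assumptions made for illustration and are not given in the source: biases are omitted, hidden states are sampled once for the reconstruction, probabilities are used for the sufficient statistics, and the names `cd1_update`, `lr` and `bottom_up_scale` are invented.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cd1_update(v0, W, lr, rng, bottom_up_scale=2.0):
    """One CD-1 step for a modified RBM whose bottom-up weights are
    bottom_up_scale * W and whose top-down weights are W (no biases)."""
    n_vis, n_hid = len(W), len(W[0])

    def up(v):
        return [
            sigmoid(bottom_up_scale * sum(v[i] * W[i][j] for i in range(n_vis)))
            for j in range(n_hid)
        ]

    h0_prob = up(v0)                                   # positive phase
    h0 = [1.0 if rng.random() < p else 0.0 for p in h0_prob]
    # Mean-field reconstruction of the visibles through the top-down weights.
    v1 = [sigmoid(sum(W[i][j] * h0[j] for j in range(n_hid)))
          for i in range(n_vis)]
    h1_prob = up(v1)                                   # negative phase
    return [
        [W[i][j] + lr * (v0[i] * h0_prob[j] - v1[i] * h1_prob[j])
         for j in range(n_hid)]
        for i in range(n_vis)
    ]
```

The function returns a new weight matrix instead of mutating `W`, which keeps the sketch easy to test; a practical implementation would average the update over mini-batches.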
57 This way of training the simple DBM with tied weights is unlikely to maximize the likelihood objective, but in practice it produces surprisingly good models that reconstruct the training data well. [sent-127, score-0.209]
58 When learning the second RBM in the stack, instead of maintaining a set of replicated hidden groups, it will often be convenient to approximate CD learning by training a modified RBM with one hidden layer but with asymmetric bottom-up and top-down weights. [sent-128, score-0.434]
59 For example, consider pretraining a two-layer DBM, in which we would like to split the modelling work between the 1st and 2nd -layer RBMs as 1/3 and 2/3. [sent-130, score-0.658]
60 In this case, we train the first layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. [sent-131, score-0.297]
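61 $p(h^{(2)}_l = 1 | h^{(1)}) = \frac{1}{1 + \exp\left(-\sum_j 2W^{(2)}_{jl} h^{(1)}_j\right)}$, $p(h^{(1)}_j = 1 | h^{(2)}) = \frac{1}{1 + \exp\left(-\sum_l 3W^{(2)}_{jl} h^{(2)}_l\right)}$. Note that this second-layer modified RBM simply approximates the proper RBM with three sets of replicated $h^{(2)}$ groups. [sent-136, score-0.151]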
62 2 4 Pretraining a Three Layer Deep Boltzmann Machine In the previous section, we showed that provided we start with a two-layer DBM with tied weights, we can train the second-layer RBM in a way that is guaranteed to improve the variational bound. [sent-140, score-0.223]
63 For the DBM with more than two layers, we have not been able to develop a pretraining algorithm that is guaranteed to improve a variational bound. [sent-141, score-0.659]
64 3 suggest that using simple modifications when pretraining a stack of RBMs would allow us to approximately control the amount of modelling work done by each layer. [sent-143, score-0.771]
65 Consider learning a 3-layer DBM, in which each layer is forced to perform approximately 1/3 of the modelling work. [sent-144, score-0.218]
66 Similar to the two-layer model, we train the first layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. [sent-146, score-0.297]
67 Figure 4: Layer-wise pretraining of a 3-layer Deep Boltzmann Machine. [sent-154, score-0.564]
68 When combining the three RBMs into a three-layer DBM, we end up with symmetric weights $W^{(1)}$, $2W^{(2)}$, and $2W^{(3)}$ in the first, second, and third layers, with each layer performing 1/3 of the modelling work: $P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp\left(v^\top W^{(1)} h^{(1)} + h^{(1)\top} 2W^{(2)} h^{(2)} + h^{(2)\top} 2W^{(3)} h^{(3)}\right)$. (12) [sent-156, score-0.292]
69 Algorithm 1 Greedy Pretraining Algorithm for a 3-layer Deep Boltzmann Machine 1: Train the 1st layer “RBM” using one-step CD learning with mean field reconstructions of the visible vectors. [sent-157, score-0.183]
70 2: Freeze 3W(1) that defines the 1st layer of features, and use samples h(1) from P (h(1) |v; 3W(1) ) as the data for training the second RBM. [sent-159, score-0.138]
71 3: Train the 2nd layer “RBM” using one-step CD learning with mean field reconstructions of the visible vectors. [sent-160, score-0.183]
72 4: Freeze 4W(2) that defines the 2nd layer of features and use the samples h(2) from P (h(2) |h(1) ; 4W(2) ) as the data for training the next RBM. [sent-162, score-0.138]
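The two "freeze and sample" steps of the algorithm can be sketched as an upward pass through the frozen layers, using bottom-up weights 3W(1) and then 4W(2) as stated in the steps above. Biases are omitted and the names `sample_layer` and `propagate_up` are hypothetical; the training of each "RBM" itself is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_layer(inputs, W, scale, rng):
    """Sample binary units from P(h_j = 1 | inputs) with weights scale * W."""
    n_out = len(W[0])
    probs = [
        sigmoid(scale * sum(inputs[i] * W[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]
    return [1 if rng.random() < p else 0 for p in probs]


def propagate_up(v, W1, W2, rng):
    """Produce the data for the 2nd and 3rd RBMs from a visible vector v."""
    h1 = sample_layer(v, W1, 3.0, rng)   # samples from P(h1 | v; 3 W1)
    h2 = sample_layer(h1, W2, 4.0, rng)  # samples from P(h2 | h1; 4 W2)
    return h1, h2
```

In a full pipeline `h1` would be collected over the training set to fit the second "RBM", and `h2` to fit the third.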
73 The new pretraining procedure for a 3-layer DBM is shown in Alg. [sent-166, score-0.588]
74 Extending the procedure to DBMs with more layers is trivial. [sent-169, score-0.19]
75 As we show in our experimental results, this pretraining can improve the generative performance of Deep Boltzmann Machines. [sent-170, score-0.623]
76 During greedy pretraining, each layer was trained for 100 epochs using one-step contrastive divergence. [sent-172, score-0.184]
77 In order to estimate the variational lower-bounds achieved by different pretraining algorithms, we need to estimate the global normalization constant. [sent-174, score-0.649]
78 Together with variational inference this will allow us to obtain good estimates of the lower bound on the log-probability of the training and test data. [sent-177, score-0.169]
79 In our first experiment, we considered a standard two-layer DBM with 500 and 1000 hidden units, and used two different algorithms for pretraining it. [sent-180, score-0.675]
80 The first pretraining algorithm, which we call DBM-1/2-1/2, is the original algorithm for pretraining DBMs, as introduced by [7] (see Fig. [sent-181, score-1.14]
81 The second algorithm, DBM-1/3-2/3, uses a modified pretraining procedure of Sec. [sent-184, score-0.588]
82 4, so that the second RBM in the stack ends up doing 2/3 of the modelling work compared to the 1st -layer RBM. [sent-186, score-0.207]
83 The large difference of about 7 nats shows that leaving more of the modelling work to the second layer, which has a larger number of hidden units, substantially improves the variational bound. [sent-191, score-0.312]
84 The existing pretraining algorithm, DBM-1/2-1/4-1/4, approximately splits the modelling between three RBMs in the stack as 1/2, 1/4, 1/4, so the weights in the 1st -layer RBM perform half of the work compared to the higher-level RBMs. [sent-198, score-0.882]
85 On the other hand, the new pretraining procedure (see Alg. [sent-199, score-0.588]
86 Table 1: MNIST: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (500 and 1000 hidden units), and the other one with three layers (500, 500, and 1000 hidden units). [sent-202, score-0.657]
87 Results are shown for various pretraining algorithms, followed by generative fine-tuning. [sent-203, score-0.612]
88 Columns: Pretraining (Train, Test), Generative Fine-Tuning (Train, Test). Rows: 2 layers (DBM-1/2-1/2, DBM-1/3-2/3), 3 layers (DBM-1/2-1/4-1/4, DBM-1/3-1/3-1/3). −113. [sent-204, score-0.322]
89 02 Table 2: NORB: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (1000 and 2000 hidden units), and the other one with three layers (1000, 1000, and 2000 hidden units). [sent-220, score-0.657]
90 Results are shown for various pretraining algorithms, followed by generative fine-tuning. [sent-221, score-0.612]
91 Columns: Pretraining (Train, Test), Generative Fine-Tuning (Train, Test). Rows: 2 layers (DBM-1/2-1/2, DBM-1/3-2/3), 3 layers (DBM-1/2-1/4-1/4, DBM-1/3-1/3-1/3). −640. [sent-222, score-0.322]
92 The difference of about 10 nats further demonstrates that during the pretraining stage, it is rather crucial to push more of the modelling work to the higher layers. [sent-241, score-0.678]
93 02, so with a new pretraining procedure, the three-hidden-layer DBM performs slightly better than the two-hidden-layer DBM. [sent-243, score-0.564]
94 With the original pretraining procedure, the 3-layer DBM achieves a bound of −85. [sent-244, score-0.611]
95 To deal with raw pixel data, we followed the approach of [5] by first learning a Gaussian-binary RBM with 4000 hidden units, and then treating the activities of its hidden layer as preprocessed binary data. [sent-252, score-0.331]
96 Similar to the MNIST experiments, we trained two Deep Boltzmann Machines: one with two layers (1000 and 2000 hidden units), and the other one with three layers (1000, 1000, and 2000 hidden units). [sent-253, score-0.581]
97 Table 2 reveals that for both DBMs, the new pretraining achieves much better variational bounds on the average test log-probability. [sent-254, score-0.651]
98 6 Conclusion In this paper we provided a better understanding of how the pretraining algorithms for Deep Belief Networks and Deep Boltzmann Machines are related, and used this understanding to develop a different method of pretraining. [sent-256, score-0.59]
99 Unlike many of the existing pretraining algorithms for DBNs and DBMs, the new procedure can distribute the modelling work more evenly over the hidden layers. [sent-257, score-0.807]
100 Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn much better generative models. [sent-258, score-0.612]
wordName wordTfidf (topN-words)
[('pretraining', 0.564), ('dbm', 0.457), ('rbm', 0.403), ('deep', 0.19), ('boltzmann', 0.182), ('layers', 0.161), ('dbms', 0.16), ('rbms', 0.155), ('vn', 0.139), ('stack', 0.113), ('hidden', 0.111), ('layer', 0.109), ('modelling', 0.094), ('units', 0.092), ('tied', 0.091), ('variational', 0.072), ('norb', 0.064), ('belief', 0.061), ('cd', 0.061), ('weights', 0.06), ('dbn', 0.059), ('eh', 0.055), ('dbns', 0.053), ('generative', 0.048), ('replicated', 0.048), ('hj', 0.048), ('mnist', 0.045), ('contrastive', 0.042), ('machines', 0.04), ('hl', 0.039), ('reconstructions', 0.039), ('modi', 0.038), ('train', 0.037), ('ev', 0.035), ('visible', 0.035), ('bound', 0.035), ('kl', 0.034), ('eld', 0.033), ('training', 0.029), ('salakhutdinov', 0.028), ('asymmetric', 0.026), ('freeze', 0.026), ('procedure', 0.024), ('half', 0.024), ('pretrain', 0.023), ('panel', 0.022), ('prior', 0.022), ('ais', 0.021), ('trained', 0.021), ('twice', 0.02), ('nats', 0.02), ('pretrained', 0.02), ('posterior', 0.018), ('lower', 0.018), ('composed', 0.018), ('lacks', 0.018), ('stereo', 0.018), ('replace', 0.017), ('stage', 0.017), ('hinton', 0.017), ('three', 0.016), ('restricted', 0.016), ('welling', 0.016), ('networks', 0.016), ('constrained', 0.015), ('improves', 0.015), ('forced', 0.015), ('test', 0.015), ('evenly', 0.014), ('replacing', 0.014), ('adding', 0.014), ('network', 0.014), ('assigns', 0.014), ('aggregated', 0.014), ('exp', 0.013), ('extra', 0.013), ('conversely', 0.013), ('replaces', 0.013), ('wij', 0.013), ('digits', 0.013), ('architectures', 0.013), ('achieved', 0.013), ('understanding', 0.013), ('identical', 0.013), ('eq', 0.012), ('forming', 0.012), ('greedy', 0.012), ('handwritten', 0.012), ('guaranteed', 0.012), ('equally', 0.012), ('original', 0.012), ('divergence', 0.012), ('bypassing', 0.012), ('gifts', 0.012), ('halved', 0.012), ('louradour', 0.012), ('wjl', 0.012), ('improve', 0.011), ('splits', 0.011), ('controlling', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
2 0.50319105 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.
3 0.22302584 65 nips-2012-Cardinality Restricted Boltzmann Machines
Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams
Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers.
4 0.1156229 8 nips-2012-A Generative Model for Parts-based Object Segmentation
Author: S. Eslami, Christopher Williams
Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability.
In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. Figure 1: Overview. Using annotated images, separate models of shape and appearance are trained. Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5.
1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset $D = \{X_d\}$, $d = 1 \ldots n$ of such images $X$, each consisting of $P$ pixels $\{x_i\}$, $i = 1 \ldots P$, we wish to infer a segmentation $S$ for the image. $S$ consists of a labeling $s_i$ for every pixel, where $s_i$ is a 1-of-(L+1) encoded variable, and $L$ is the fixed number of parts that combine to generate the foreground. In other words, $s_i = (s_{li})$, $l = 0 \ldots L$, $s_{li} \in \{0, 1\}$ and $\sum_l s_{li} = 1$. Note that the background is also treated as a ‘part’ ($l = 0$). Accurate inference of $S$ is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations $S$. The simplest approach is to model each pixel $s_i$ independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1].
We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: $h^1$ and $h^2$ (collectively $H = \{h^1, h^2\}$), and defines a Boltzmann distribution over segmentations $p(S) = \sum_{h^1,h^2} \exp\{-E(S, h^1, h^2|\theta^s)\}/Z(\theta^s)$ where $-E(S, h^1, h^2|\theta^s) = \sum_{i,l} b_{li} s_{li} + \sum_{i,j,l} w^1_{lij} s_{li} h^1_j + \sum_{j} c^1_j h^1_j + \sum_{j,k} w^2_{jk} h^1_j h^2_k + \sum_{k} c^2_k h^2_k$, (1) where $j$ and $k$ range over the first and second layer hidden variables, and $\theta^s = \{W^1, W^2, b, c^1, c^2\}$ are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in $h^1$ only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by $b$ pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in $h^2$. These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units $h$ are binary and visible units $S$ are multinomial random variables. Based on Fig. 2 of [1]. Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background ($l = 0$) and two foreground ($l = 1$, non-body; $l = 2$, body) parts. Right: The corresponding appearance model. In this example, $L = 2$, $K = 3$ and $W = 6$. Best viewed in color.
Part appearances: Pixels in a given image are assumed to have been generated by $W$ fixed Gaussians in RGB space. During pre-training, the means $\{\mu_w\}$ and covariances $\{\Sigma_w\}$ of these Gaussians are extracted by training a mixture model with $W$ components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the $L$ parts can have different appearances in different images, and that these appearances can be clustered into $K$ classes. The classes differ in how likely they are to use each of the $W$ components when ‘coloring in’ the part. The generative process is as follows. For part $l$ in an image, one of the $K$ classes is chosen (represented by a 1-of-K indicator variable $a_l$). Given $a_l$, the probability distribution defined on pixels associated with part $l$ is given by a Gaussian mixture model with means $\{\mu_w\}$ and covariances $\{\Sigma_w\}$ and mixing proportions $\{\lambda_{lkw}\}$. The prior on $A = \{a_l\}$ specifies the probability $\pi_{lk}$ of appearance class $k$ being chosen for part $l$. Therefore appearance parameters $\theta^a = \{\pi_{lk}, \lambda_{lkw}\}$ (see Fig. 3) and: $p(x_i|A, s_i, \theta^a) = \prod_l p(x_i|a_l, \theta^a)^{s_{li}} = \prod_l \left( \prod_k \left( \sum_w \lambda_{lkw} \mathcal{N}(x_i|\mu_w, \Sigma_w) \right)^{a_{lk}} \right)^{s_{li}}$, (2) $p(A|\theta^a) = \prod_l p(a_l|\theta^a) = \prod_l \prod_k (\pi_{lk})^{a_{lk}}$. (3) Combining shapes and appearances: To summarize, the latent variables for $X$ are $A$, $S$, $H$, and the model’s active parameters $\theta$ include shape parameters $\theta^s$ and appearance parameters $\theta^a$, so that $p(X, A, S, H|\theta) = \frac{1}{Z(\lambda)} p(A|\theta^a) p(S, H|\theta^s) \prod_i p(x_i|A, s_i, \theta^a)^{\lambda}$, (4) where the parameter $\lambda$ adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of $\theta$ that maximize the likelihood of the training data $D$, and segmentation is performed on a previously-unseen image by querying the marginal distribution $p(S|X_{test}, \theta)$. Note that $Z(\lambda)$ is constant throughout the execution of the algorithms. We set $\lambda$ via trial and error in our experiments.
Figure 4: A model of shape and appearance. Left: The joint model. Pixels $x_i$ are modeled via appearance variables $a_l$. The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables $s_i$ assign each pixel to a layer. Right: Schematic for an image X.

2 Inference and learning

Inference: We approximate $p(A, S, H | X, \theta)$ by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution $p(S | X, \theta)$ can then be obtained by considering only the samples for S (see Algorithm 1). In order to sample $p(A | S, H, X, \theta)$ we consider the conditional distribution of appearance class k being chosen for part l, which is given by:

$$p(a_{lk} = 1 | S, X, \theta) = \frac{\pi_{lk} \prod_i \big( \sum_w \phi_{lkw} \mathcal{N}(x_i | \mu_w, \Sigma_w) \big)^{\lambda \cdot s_{li}}}{\sum_{r=1}^{K} \pi_{lr} \prod_i \big( \sum_w \phi_{lrw} \mathcal{N}(x_i | \mu_w, \Sigma_w) \big)^{\lambda \cdot s_{li}}}. \qquad (5)$$

Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are:

$$p(h^1_j = 1 | s, h^2, \theta) = \sigma\Big( \sum_{i,l} w^1_{lij} s_{li} + \sum_k w^2_{jk} h^2_k + c^1_j \Big), \qquad (6)$$

$$p(h^2_k = 1 | h^1, \theta) = \sigma\Big( \sum_j w^2_{jk} h^1_j + c^2_k \Big), \qquad (7)$$

where $\sigma(y) = 1/(1 + \exp(-y))$ is the sigmoid function. To sample from $p(H | S, X, \theta)$ we iterate between Eqns. 6 and 7 multiple times and keep only the final values of $h^1$ and $h^2$. Finally, we draw samples for the pixels in $p(S | A, H, X, \theta)$ independently:

$$p(s_{li} = 1 | A, H, X, \theta) = \frac{\exp\big( \sum_j w^1_{lij} h^1_j + b_{li} \big) \, p(x_i | A, s_{li} = 1, \theta)^{\lambda}}{\sum_{m=1}^{L} \exp\big( \sum_j w^1_{mij} h^1_j + b_{mi} \big) \, p(x_i | A, s_{mi} = 1, \theta)^{\lambda}}. \qquad (8)$$

Seeding: Since the latent space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing $S^{(1)}$ to a different value. The ‘best’ inference is retained and the others are discarded.
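The alternating updates of the two hidden layers described above (Eqns. 6 and 7) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nested-list layout of the weights, the function names and the default of five alternations are all hypothetical choices made for readability.

```python
import math
import random


def sigmoid(y):
    """Logistic function sigma(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def h1_probs(s, h2, w1, w2, c1):
    """Eq. 6: p(h1_j = 1 | s, h2) for every first-layer unit j.

    s[l][i]     one-hot segmentation (part l, pixel i)
    w1[l][i][j] visible-to-h1 weights
    w2[j][k]    h1-to-h2 weights
    """
    probs = []
    for j in range(len(c1)):
        total = c1[j]
        for l in range(len(s)):
            for i in range(len(s[l])):
                total += w1[l][i][j] * s[l][i]
        for k in range(len(h2)):
            total += w2[j][k] * h2[k]
        probs.append(sigmoid(total))
    return probs


def h2_probs(h1, w2, c2):
    """Eq. 7: p(h2_k = 1 | h1) for every second-layer unit k."""
    return [sigmoid(c2[k] + sum(w2[j][k] * h1[j] for j in range(len(h1))))
            for k in range(len(c2))]


def sample_bernoulli(probs, rng):
    """Draw one binary value per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def sample_hidden(s, h2, w1, w2, c1, c2, n_iters=5, rng=None):
    """Alternate Eqs. 6 and 7, keeping only the final h1 and h2."""
    rng = rng or random.Random(0)
    h1 = None
    for _ in range(n_iters):
        h1 = sample_bernoulli(h1_probs(s, h2, w1, w2, c1), rng)
        h2 = sample_bernoulli(h2_probs(h1, w2, c2), rng)
    return h1, h2
```

Because units within a layer are conditionally independent given the adjacent layers, each layer can be sampled in one pass. A vectorized version would replace the explicit loops with matrix products.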
The computation of the likelihood $p(X | \theta)$ of image X is intractable, so we approximate the quality of each inference using a scoring function:

$$\mathrm{Score}(X | \theta) = \frac{1}{T} \sum_t p(X, A^{(t)}, S^{(t)}, H^{(t)} | \theta), \qquad (9)$$

where $\{A^{(t)}, S^{(t)}, H^{(t)}\}$, $t = 1 \ldots T$, are the samples obtained from the posterior $p(A, S, H | X, \theta)$. If the samples were drawn from the prior $p(A, S, H | \theta)$ the scoring function would be an unbiased estimator of $p(X | \theta)$, but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue).

Algorithm 1 MCMC inference algorithm.
1: procedure INFER(X, θ)
2:   Initialize S(1), H(1)
3:   for t ← 2 : chain length do
4:     A(t) ∼ p(A | S(t−1), H(t−1), X, θ)
5:     S(t) ∼ p(S | A(t), H(t−1), X, θ)
6:     H(t) ∼ p(H | S(t), θ)
7:   return {S(t)}, t = burnin : chain length

Learning: Learning of the model involves maximizing the log likelihood $\log p(D | \theta^a, \theta^s)$ of the training dataset D with respect to the model parameters $\theta^a$ and $\theta^s$. Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as ‘persistent CD’; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model.
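The control flow of the block-Gibbs procedure in Algorithm 1 can be sketched as a generic loop. The three sampler arguments are hypothetical callables standing in for the conditionals p(A | S, H, X, θ), p(S | A, H, X, θ) and p(H | S, θ), and the default chain length and burn-in are illustrative assumptions rather than values taken from the algorithm listing.

```python
def infer(x, theta, init_s, init_h, sample_a, sample_s, sample_h,
          chain_length=20, burn_in=10):
    """Block-Gibbs loop following the structure of Algorithm 1.

    sample_a(s, h, x, theta) -> a   draws A given the previous S and H
    sample_s(a, h, x, theta) -> s   draws S given the new A and previous H
    sample_h(s, theta)       -> h   draws H given the new S
    Returns the S samples from iteration `burn_in` onwards.
    """
    s, h = init_s, init_h          # S(1), H(1)
    kept = []
    for t in range(2, chain_length + 1):
        a = sample_a(s, h, x, theta)
        s = sample_s(a, h, x, theta)
        h = sample_h(s, theta)
        if t >= burn_in:
            kept.append(s)
    return kept
```

Passing the samplers in as arguments keeps the loop independent of how each conditional is implemented, which also makes it easy to run several chains from different seed segmentations.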
In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to finetune the parameters in an EM-like scheme as described in [13]. 3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. 
The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, so it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models.

4 Experiments

Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 × 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs.

Figure 5: Learned shape model. (a) Sampling: a chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Diffs: differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink).

To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames).
The model captures highly non-linear correlations in the data whilst preserving the object’s details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, for example, the length of the person’s trousers on one leg affects the model’s predictions for the other, demonstrating the model’s knowledge about long-range dependencies. An interactive MATLAB GUI for sampling from this MSBM has been included in the supplementary material.

The Penn-Fudan dataset (at 200 × 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components (footnote 1). We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time. During inference, the shape and appearance models (which are defined on images of different sizes) were combined at 200 × 100 pixels via MATLAB’s imresize function, and we set $\lambda = 0.8$ (Eq. 8) via trial and error.
Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized MATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy that would have been obtained had the top scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrian-specific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c).

Footnote 1: We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact.

Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]).
                      FG      BG      Upper Body  Lower Body  Head    Average
Bo and Fowlkes [21]   73.3%   81.1%   73.6%       71.6%       51.8%   69.5%
MSBM                  70.7%   72.8%   68.6%       66.7%       53.0%   65.3%
MSBM + SP             71.6%   73.8%   69.9%       68.5%       54.1%   66.6%
Top seed              59.0%   61.8%   56.8%       49.8%       45.5%   53.5%
Top seed + SP         61.6%   67.3%   60.8%       54.1%       43.5%   56.4%

Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label.

            BG      Body    Wheel   Window  Bumper  License  Light   Average
ISM [22]    93.2%   72.2%   63.6%   80.5%   73.8%   56.2%    34.8%   86.8%
MSBM        94.6%   72.7%   36.8%   74.4%   64.9%   17.9%    19.9%   86.0%
Top seed    92.2%   68.4%   28.3%   63.8%   45.4%   11.2%    15.1%   81.8%

ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semi-profile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 × 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components and we set $\lambda = 0.7$. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images).
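The superpixel post-processing used in these experiments, where every pixel in a superpixel takes the most common label within it, can be sketched as a simple majority vote. The flat-list representation and the function name are hypothetical. The superpixel ids are assumed to come from an external segmenter such as the gPb-OWT-UCM method cited in the text.

```python
from collections import Counter


def superpixel_vote(labels, superpixels):
    """Assign each pixel the most common label within its superpixel.

    labels      flat list of per-pixel part labels
    superpixels flat list of superpixel ids, same length as labels
    """
    if len(labels) != len(superpixels):
        raise ValueError("labels and superpixels must have the same length")
    counts = {}
    for lab, sp in zip(labels, superpixels):
        counts.setdefault(sp, Counter())[lab] += 1
    # Pick the majority label for each superpixel id.
    winner = {sp: c.most_common(1)[0][0] for sp, c in counts.items()}
    return [winner[sp] for sp in superpixels]
```

For example, `superpixel_vote([1, 1, 2, 0, 0, 0], [0, 0, 0, 1, 1, 1])` relabels the single outlier pixel in the first superpixel. How well this works depends on the superpixels respecting part boundaries, which is consistent with the observation that it did not help on the cars dataset.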
Qualitative and quantitative results that show the performance of the model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the ‘license’ and ‘light’ labels to be mainly due to ISM’s use of interest points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease.

5 Conclusions and future work

In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours with a discriminative, bottom-up segmentation algorithm could be highly effective. We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model.

Acknowledgments

Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886).

Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. Legend: Background, Hair, Face, Upper, Shoes, Legs, Lower, Arms.
Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. Legend: Background, Body, Wheel, Window, Bumper, License, Headlight.

References
[1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012.
[2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010.
[3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982.
[5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006.
[7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006.
[8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004.
[9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005.
[10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044–2051, 2009.
[11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Parts-based Object Understanding. In BMVC, pages 18.1–18.12, 2011.
[12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009.
[14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.
[15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004.
[16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004.
[17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. IEEE CVPR, pages 1–7, 2007.
[18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010.
[19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. In ICANN, 2011.
[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011.
[21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011.
[22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot’s Attention. IJRR, 28(8):976–998, 2009.
[23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008.
[24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010.
[25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009.
5 0.10639473 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
6 0.0954329 197 nips-2012-Learning with Recursive Perceptual Representations
7 0.090671711 193 nips-2012-Learning to Align from Scratch
8 0.084280498 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
9 0.07379505 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
10 0.070411958 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
11 0.06940756 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
12 0.06575419 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
13 0.065249398 170 nips-2012-Large Scale Distributed Deep Networks
14 0.063898534 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
15 0.060535911 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
16 0.058932953 100 nips-2012-Discriminative Learning of Sum-Product Networks
17 0.056868143 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
18 0.055184115 12 nips-2012-A Neural Autoregressive Topic Model
19 0.050168578 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
20 0.048202239 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities
topicId topicWeight
[(0, 0.104), (1, 0.054), (2, -0.155), (3, 0.02), (4, -0.008), (5, -0.01), (6, -0.005), (7, -0.059), (8, -0.021), (9, -0.054), (10, 0.043), (11, 0.166), (12, -0.059), (13, 0.218), (14, -0.124), (15, -0.118), (16, 0.031), (17, -0.088), (18, 0.156), (19, -0.203), (20, -0.003), (21, -0.158), (22, -0.053), (23, 0.135), (24, 0.046), (25, 0.099), (26, 0.116), (27, 0.037), (28, 0.129), (29, -0.204), (30, -0.016), (31, -0.031), (32, -0.165), (33, 0.021), (34, -0.12), (35, -0.053), (36, -0.042), (37, -0.094), (38, -0.134), (39, 0.046), (40, -0.071), (41, 0.126), (42, 0.213), (43, 0.016), (44, -0.011), (45, -0.122), (46, -0.023), (47, -0.02), (48, -0.026), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.96730441 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models. 1
2 0.8662945 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
3 0.75206977 65 nips-2012-Cardinality Restricted Boltzmann Machines
Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams
Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1
4 0.4839015 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi
Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. However, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks $NN^k_{ij}$, where i and j index the spatial coordinates of the contact map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components.
5 0.46444491 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
6 0.42244682 170 nips-2012-Large Scale Distributed Deep Networks
7 0.41478926 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
8 0.39603329 8 nips-2012-A Generative Model for Parts-based Object Segmentation
9 0.37095451 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
10 0.32893556 12 nips-2012-A Neural Autoregressive Topic Model
11 0.31702736 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
12 0.28591961 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
13 0.28529325 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
14 0.28192496 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
15 0.24369919 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.23697335 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
17 0.23355369 37 nips-2012-Affine Independent Variational Inference
18 0.2311767 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
19 0.22795501 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition
20 0.21454051 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity
topicId topicWeight
[(0, 0.015), (15, 0.271), (21, 0.027), (35, 0.017), (38, 0.089), (42, 0.023), (53, 0.016), (54, 0.031), (55, 0.071), (74, 0.05), (76, 0.09), (80, 0.096), (92, 0.096)]
simIndex simValue paperId paperTitle
1 0.91250366 308 nips-2012-Semi-Supervised Domain Adaptation with Non-Parametric Copulas
Author: David Lopez-paz, Jose M. Hernández-lobato, Bernhard Schölkopf
Abstract: A new framework based on the theory of copulas is proposed to address semi-supervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model across different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.
2 0.79868418 31 nips-2012-Action-Model Based Multi-agent Plan Recognition
Author: Hankz H. Zhuo, Qiang Yang, Subbarao Kambhampati
Abstract: Multi-Agent Plan Recognition (MAPR) aims to recognize dynamic team structures and team behaviors from the observed team traces (activity sequences) of a set of intelligent agents. Previous MAPR approaches required a library of team activity sequences (team plans) to be given as input. However, collecting a library of team plans to ensure adequate coverage is often difficult and costly. In this paper, we relax this constraint, so that team plans are not required to be provided beforehand. We assume instead that a set of action models is available. Such models are often already created to describe domain physics, i.e., the preconditions and effects of actions. We propose a novel approach for recognizing multi-agent team plans based on such action models rather than libraries of team plans. We encode the resulting MAPR problem as a satisfiability problem and solve the problem using a state-of-the-art weighted MAX-SAT solver. Our approach also allows for incompleteness in the observed plan traces. Our empirical studies demonstrate that our algorithm is both effective and efficient in comparison to state-of-the-art MAPR methods based on plan libraries.
same-paper 3 0.74076062 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.
4 0.61891264 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
Author: Odalric-Ambrym Maillard
Abstract: This paper aims to take a step towards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis.
5 0.59755361 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.
6 0.56466502 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
7 0.55767494 197 nips-2012-Learning with Recursive Perceptual Representations
8 0.55480337 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
9 0.55346787 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search
10 0.55293894 65 nips-2012-Cardinality Restricted Boltzmann Machines
11 0.55259931 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
12 0.54717952 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
13 0.54706705 211 nips-2012-Meta-Gaussian Information Bottleneck
14 0.54581535 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
15 0.54473907 193 nips-2012-Learning to Align from Scratch
16 0.54318869 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
17 0.53960532 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
18 0.5392338 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
19 0.53895074 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
20 0.53800672 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video