nips nips2012 nips2012-229 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The model can be used to extract a unified representation that fuses modalities together. [sent-6, score-0.348]
2 The model works by learning a probability density over the space of multimodal inputs. [sent-8, score-0.38]
3 The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. [sent-10, score-0.344]
4 Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. [sent-11, score-1.147]
5 Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. [sent-13, score-0.311]
6 Useful representations can be learned about such data by fusing the modalities into a joint representation that captures the real-world ‘concept’ that the data corresponds to. [sent-17, score-0.377]
7 For example, we would like a probabilistic model to correlate the occurrence of the words ‘beautiful sunset’ and the visual properties of an image of a beautiful sunset and represent them jointly, so that the model assigns high probability to one conditioned on the other. [sent-18, score-0.223]
8 Unless we do multimodal learning, it would not be possible to discover a lot of useful information about the world (for example, ‘what do beautiful sunsets look like?’). [sent-22, score-0.439]
9 In a multimodal setting, data consists of multiple input modalities, each modality having a different kind of representation and correlational structure. [sent-25, score-0.547]
10 For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented using pixel intensities or outputs of feature extractors which are real-valued and dense. [sent-26, score-0.314]
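To make this contrast concrete, here is a small illustrative sketch (Python/NumPy, not from the paper) of the two input types: a sparse, discrete word-count vector for text and a dense, real-valued feature vector for an image. The vocabulary, caption, and feature values are all made up for illustration.

```python
import numpy as np

# Toy illustration (not from the paper) of the two input representations:
# a sparse, discrete word-count vector for text and a dense, real-valued
# feature vector for an image. Vocabulary, caption, and values are made up.
vocabulary = ["sunset", "beach", "beautiful", "dog", "car"]
caption = ["beautiful", "sunset", "beautiful"]

# Text: word counts over the vocabulary (mostly zeros for realistic vocabularies).
v_text = np.zeros(len(vocabulary), dtype=np.int64)
for word in caption:
    v_text[vocabulary.index(word)] += 1
print(v_text)           # [1 0 2 0 0]

# Image: a dense real-valued vector, standing in for feature-extractor outputs.
rng = np.random.default_rng(0)
v_image = rng.normal(size=8)
print(v_image)
```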
11 This makes it much harder to discover relationships across modalities than relationships among features in the same modality. [sent-27, score-0.304]
12 There is a lot of structure in the input, but it is difficult to discover the highly non-linear relationships that exist across modalities. (Figure 1, Left: Examples of text generated from a DBM by sampling from P(v_txt | v_img, θ).) [sent-28, score-0.181]
13 A good multimodal learning model must satisfy certain properties. [sent-32, score-0.38]
14 It should also be possible to fill-in missing modalities given the observed ones. [sent-35, score-0.334]
15 Our proposed multimodal Deep Boltzmann Machine (DBM) model satisfies the above desiderata. [sent-37, score-0.38]
16 DBMs are undirected graphical models with bipartite connections between adjacent layers of hidden units [1]. [sent-38, score-0.393]
17 The key idea is to learn a joint density model over the space of multimodal inputs. [sent-39, score-0.436]
18 Missing modalities can then be filled-in by sampling from the conditional distributions over them given the observed ones. [sent-40, score-0.241]
19 For example, we use a large collection of user-tagged images to learn a joint distribution over images and text P (vimg , vtxt |θ). [sent-41, score-0.391]
20 By drawing samples from P (vtxt |vimg , θ) and from P (vimg |vtxt , θ) we can fill-in missing data, thereby doing image annotation and image retrieval respectively, as shown in Fig. [sent-42, score-0.332]
21 There have been several approaches to learning from multimodal data. [sent-44, score-0.35]
22 [3], based on the multiple kernel learning framework, further demonstrated that an additional text modality can improve the accuracy of SVMs on various object recognition tasks. [sent-48, score-0.248]
23 However, all of these approaches are discriminative by nature and cannot make use of large amounts of unlabeled data or deal easily with missing input modalities. [sent-49, score-0.226]
24 [4] used dual-wing harmoniums to build a joint model of images and text, which can be viewed as a linear RBM model with Gaussian hidden units together with Gaussian and Poisson visible units. [sent-51, score-0.545]
25 However, various data modalities will typically have very different statistical properties which makes it difficult to model them using shallow models. [sent-52, score-0.271]
26 [5] used a deep autoencoder for speech and vision fusion. [sent-54, score-0.227]
27 First, in this work we focus on integrating together very different data modalities: sparse word count vectors and real-valued dense image features. [sent-56, score-0.187]
28 While both approaches have led to interesting results in several domains, using a generative model is important for applications we consider in this paper, as it allows our model to naturally handle missing data modalities. [sent-58, score-0.179]
29 In particular, the Replicated Softmax model [6] has been shown to be effective in modeling sparse word count vectors, whereas Gaussian RBMs have been used for modeling real-valued inputs for speech and vision tasks. [sent-61, score-0.281]
30 In this section we briefly review these models, as they will serve as our building blocks for the multimodal model. [sent-62, score-0.35]
31 Restricted Boltzmann Machines: A Restricted Boltzmann Machine is an undirected graphical model with stochastic visible units v ∈ {0, 1}^D and stochastic hidden units h ∈ {0, 1}^F, with each visible unit connected to each hidden unit. [sent-64, score-0.833]
32 The model defines the following energy function E : {0, 1}^(D+F) → R: E(v, h; θ) = − Σ_{i=1}^{D} Σ_{j=1}^{F} v_i W_{ij} h_j − Σ_{i=1}^{D} b_i v_i − Σ_{j=1}^{F} a_j h_j, where θ = {a, b, W} are the model parameters. [sent-65, score-0.292]
33 The joint distribution over the visible and hidden units is defined by: P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)). [sent-66, score-0.425]
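As a quick sanity check of these definitions, the following minimal NumPy sketch evaluates the binary RBM energy and the corresponding unnormalized probability exp(−E(v, h; θ)). The weights and states are random placeholders, and the partition function Z(θ) is deliberately omitted because computing it requires summing over all 2^(D+F) joint states.

```python
import numpy as np

def rbm_energy(v, h, W, b, a):
    """Binary RBM energy: E(v, h) = -v^T W h - b^T v - a^T h."""
    return -(v @ W @ h) - (b @ v) - (a @ h)

def unnormalized_prob(v, h, W, b, a):
    """exp(-E(v, h)); the partition function Z(theta) is omitted because it
    would require summing over all 2^(D+F) joint states."""
    return np.exp(-rbm_energy(v, h, W, b, a))

rng = np.random.default_rng(0)
D, F = 6, 4
W = 0.01 * rng.normal(size=(D, F))
b, a = np.zeros(D), np.zeros(F)
v = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=F).astype(float)
print(rbm_energy(v, h, W, b, a), unnormalized_prob(v, h, W, b, a))
```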
34 Gaussian RBM: Consider modeling visible real-valued units v ∈ R^D, and let h ∈ {0, 1}^F be binary stochastic hidden units. [sent-68, score-0.418]
35 The energy of the state {v, h} of the Gaussian RBM is defined as follows: E(v, h; θ) = Σ_{i=1}^{D} (v_i − b_i)^2 / (2σ_i^2) − Σ_{i=1}^{D} Σ_{j=1}^{F} (v_i/σ_i) W_{ij} h_j − Σ_{j=1}^{F} a_j h_j, (2) where θ = {a, b, W, σ} are the model parameters. [sent-69, score-0.242]
36 Replicated Softmax Model: The Replicated Softmax Model is useful for modeling sparse count data, such as word count vectors in a document. [sent-71, score-0.215]
37 Let v ∈ N^K be a vector of visible units, where v_k is the number of times word k occurs in the document, with a vocabulary of size K. [sent-72, score-0.303]
38 The energy of the state {v, h} is defined as follows: E(v, h; θ) = − Σ_{k=1}^{K} Σ_{j=1}^{F} v_k W_{kj} h_j − Σ_{k=1}^{K} b_k v_k − M Σ_{j=1}^{F} a_j h_j, (3) where θ = {a, b, W} are the model parameters and M = Σ_k v_k is the total number of words in a document. [sent-74, score-0.27]
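A hedged sketch of this energy function follows; note the factor M multiplying the hidden bias term, which accounts for the document length. All sizes and values are illustrative, not taken from the paper's experiments.

```python
import numpy as np

def replicated_softmax_energy(v, h, W, b, a):
    """Replicated Softmax energy (Eq. 3): v holds word counts and the hidden
    bias term is scaled by M, the number of words in the document."""
    M = v.sum()
    return -(v @ W @ h) - (b @ v) - M * (a @ h)

rng = np.random.default_rng(0)
K, F = 5, 3                                  # vocabulary size, hidden units
W = 0.01 * rng.normal(size=(K, F))
b, a = np.zeros(K), np.zeros(F)
v = np.array([1.0, 0.0, 2.0, 0.0, 0.0])      # word counts, so M = 3
h = rng.integers(0, 2, size=F).astype(float)
print(replicated_softmax_energy(v, h, W, b, a))
```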
39 We note that this replicated softmax model can also be interpreted as an RBM model that uses a single visible multinomial unit with support {1, . . . , K}. [sent-75, score-0.296]
40 It contains a set of visible units v ∈ {0, 1}^D, and a sequence of layers of hidden units h^(1) ∈ {0, 1}^(F_1), h^(2) ∈ {0, 1}^(F_2), . . . [sent-82, score-0.599]
41 There are connections only between hidden units in adjacent layers. [sent-86, score-0.291]
42 Right: A Multimodal DBM that models the joint distribution over image and text inputs. [sent-92, score-0.264]
43 We illustrate the construction of a multimodal DBM using an image-text bi-modal DBM as our running example. [sent-93, score-0.35]
44 Let v_m ∈ R^D denote an image input and v_t ∈ N^K denote a text input. [sent-94, score-0.479]
45 To form a multimodal DBM, we combine the two models by adding an additional layer of binary hidden units on top of them. [sent-104, score-0.747]
46 The joint distribution over the multi-modal input can be written as: P(v_m, v_t; θ) = Σ_{h_m^(2), h_t^(2), h^(3)} P(h_m^(2), h_t^(2), h^(3)) (Σ_{h_m^(1)} P(v_m, h_m^(1) | h_m^(2))) (Σ_{h_t^(1)} P(v_t, h_t^(1) | h_t^(2))). [sent-107, score-0.289]
47 Figure 3: Different ways of combining multimodal inputs: (a) RBM, (b) Multimodal DBN, (c) Multimodal DBM. [sent-114, score-0.415]
48 Each pathway can be pretrained separately in a completely unsupervised fashion, which allows us to leverage a large supply of unlabeled data. [sent-116, score-0.189]
49 The type of the lower-level RBMs in each pathway could be different, accounting for different input distributions, as long as the final hidden representations at the end of each pathway are of the same type. [sent-118, score-0.37]
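The sketch below illustrates one plausible form of such pathway-wise pretraining: a stack of binary RBMs trained greedily with one step of contrastive divergence (CD-1), where each layer's mean activations become the next layer's training data. It is a simplification for illustration only; in the actual model the bottom layers are a Gaussian RBM (images) and a Replicated Softmax model (text), and the learning rates, epochs, and layer sizes here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, b, a, lr=0.1, rng=None):
    """One full-batch CD-1 update for a binary RBM.
    V: (batch, D) visible data; returns updated (W, b, a)."""
    rng = rng or np.random.default_rng(0)
    ph = sigmoid(V @ W + a)                        # P(h = 1 | v) on the data
    h = (rng.random(ph.shape) < ph).astype(float)  # sampled hidden states
    pv = sigmoid(h @ W.T + b)                      # one-step reconstruction
    ph2 = sigmoid(pv @ W + a)                      # hidden probabilities on it
    W += lr * ((V.T @ ph) - (pv.T @ ph2)) / len(V)
    b += lr * (V - pv).mean(axis=0)
    a += lr * (ph - ph2).mean(axis=0)
    return W, b, a

def pretrain_pathway(data, layer_sizes, epochs=5):
    """Greedy layer-wise pretraining: train an RBM on the current data, then
    feed its mean hidden activations upward as data for the next RBM."""
    rng = np.random.default_rng(0)
    weights, X = [], data
    for n_hidden in layer_sizes:
        D = X.shape[1]
        W = 0.01 * rng.normal(size=(D, n_hidden))
        b, a = np.zeros(D), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b, a = cd1_step(X, W, b, a, rng=rng)
        weights.append((W, b, a))
        X = sigmoid(X @ W + a)
    return weights

toy_data = (np.random.default_rng(1).random((100, 20)) < 0.3).astype(float)
stack = pretrain_pathway(toy_data, layer_sizes=[16, 8])
```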
50 Each data modality has very different statistical properties which make it difficult for a single hidden layer model (such as Fig. [sent-120, score-0.395]
51 In our model, this difference is bridged by putting layers of hidden units between the modalities. [sent-122, score-0.368]
52 3a), where the hidden layer h directly models the distribution over v_t and v_m, the first layer of hidden units h_m^(1) in a DBM has an easier task to perform - that of modeling the distribution over v_m and h_m^(2). [sent-127, score-1.195]
53 Each layer of hidden units in the DBM contributes a small part to the overall task of modeling the distribution over vm and vt . [sent-128, score-0.663]
54 Therefore, the middle layer in the network can be seen as a (relatively) “modality-free” representation of the input as opposed to the input layers which were “modality-full”. [sent-130, score-0.314]
55 Another way of using a deep model to combine multimodal inputs is to use a Multimodal Deep Belief Network (DBN) (Fig. [sent-131, score-0.564]
56 In a DBN model the responsibility of the multimodal modeling falls entirely on the joint layer. [sent-135, score-0.489]
57 The modality fusion process is distributed across all hidden units in all layers. [sent-137, score-0.412]
58 From the generative perspective, states of low-level hidden units in one pathway can influence the states of hidden units in other pathways through the higher-level layers, which is not the case for DBNs. [sent-138, score-0.721]
59 Modeling Tasks - Generating Missing Modalities: As argued in the introduction, many real-world applications will often have one or more modalities missing. [sent-140, score-0.241]
60 The Multimodal DBM can be used to generate such missing data modalities by clamping the observed modalities at the inputs and sampling the hidden modalities from the conditional distribution by running the standard alternating Gibbs sampler [1]. [sent-141, score-1.05]
61 For example, consider generating text conditioned on a given image v_m. [sent-142, score-0.283]
62 The observed modality vm is clamped at the inputs and all hidden units are initialized randomly. [sent-143, score-0.676]
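A minimal sketch of this clamped Gibbs procedure is shown below, assuming (for brevity) a single shared binary hidden layer joining a binary image vector and a binary text vector. The real model has multiple hidden layers per pathway and Gaussian / Replicated Softmax visible units, so this is only meant to show the clamping pattern: the observed modality stays fixed while the hidden units and the missing modality are resampled in alternation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fill_in_text(v_img, W_img, W_txt, a, b_txt, n_steps=50, rng=None):
    """Alternating Gibbs sampling with the image modality clamped.
    Simplification: one shared binary hidden layer h joins a binary image
    vector and a binary text vector; the full model stacks several layers
    per pathway with Gaussian / Replicated Softmax visible units."""
    rng = rng or np.random.default_rng(0)
    v_txt = (rng.random(W_txt.shape[0]) < 0.5).astype(float)  # random init
    for _ in range(n_steps):
        # Sample hidden units given both modalities (image stays clamped).
        ph = sigmoid(v_img @ W_img + v_txt @ W_txt + a)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Resample only the missing modality.
        pv_txt = sigmoid(h @ W_txt.T + b_txt)
        v_txt = (rng.random(pv_txt.shape) < pv_txt).astype(float)
    return v_txt

rng = np.random.default_rng(0)
D_img, D_txt, F = 10, 8, 6
W_img = 0.01 * rng.normal(size=(D_img, F))
W_txt = 0.01 * rng.normal(size=(D_txt, F))
a, b_txt = np.zeros(F), np.zeros(D_txt)
v_img = (rng.random(D_img) < 0.5).astype(float)
print(fill_in_text(v_img, W_img, W_txt, a, b_txt))
```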
63 This fused representation is inferred by clamping the observed modalities and doing alternating Gibbs sampling to sample from P (h(3) |vm , vt ) (if both modalities are present) or from P (h(3) |vm ) (if text is missing). [sent-149, score-0.822]
64 The activation probabilities of hidden units h(3) constitute the joint representation of the inputs. [sent-153, score-0.389]
65 (Footnote: Generating image features conditioned on text can be done in a similar way.) [sent-154, score-0.251]
66 This representation can then be used to do information retrieval for multimodal or unimodal queries. [sent-155, score-0.595]
67 Each data point in the database (whether missing some modalities or not) can be mapped to this latent space. [sent-156, score-0.356]
68 PHOW features are bags of image words obtained by extracting dense SIFT features over multiple scales and clustering them. [sent-181, score-0.167]
69 Model Architecture and Learning: The image pathway consists of a Gaussian RBM with 3857 visible units followed by 2 layers of 1024 hidden units. [sent-184, score-0.607]
70 The text pathway consists of a Replicated Softmax Model with 2000 visible units followed by 2 layers of 1024 hidden units. [sent-185, score-0.653]
71 For discriminative tasks, we perform 1-vs-all classification using logistic regression on the joint hidden layer representation. [sent-192, score-0.366]
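A sketch of this classification step follows, assuming the fused joint-layer activation probabilities have already been extracted into a feature matrix. scikit-learn's one-vs-rest logistic regression is used here as an assumed stand-in (the text does not name a specific implementation), and all dimensions and labels are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all classification on the fused joint-layer representation.
# H_joint stands in for the h^(3) activation probabilities; Y is a
# multi-label indicator matrix with one column per topic.
rng = np.random.default_rng(0)
n_examples, joint_dim, n_topics = 200, 64, 38     # illustrative sizes
H_joint = rng.random((n_examples, joint_dim))
Y = (rng.random((n_examples, n_topics)) < 0.1).astype(int)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(H_joint, Y)
topic_scores = clf.predict_proba(H_joint)         # one score per topic per example
print(topic_scores.shape)                         # (200, 38)
```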
72 Classification Tasks. Multimodal Inputs: Our first set of experiments evaluates the DBM as a discriminative model for multimodal data. [sent-195, score-0.424]
73 For each model that we trained, the fused representation of the data was extracted and fed to a separate logistic regression for each of the 38 topics. [sent-196, score-0.175]
74 The text input layer in the DBM was left unclamped when the text was missing. [sent-197, score-0.394]
75 Linear Discriminant Analysis (LDA) and Support Vector Machines (SVMs) [2] were trained using the labeled data on concatenated image and text features that did not include SIFT-based features. [sent-200, score-0.283]
76 Right: MAP using representations from different layers of multimodal DBMs and DBNs. [sent-238, score-0.465]
77 To measure the effect of using unlabeled data, a DBM was trained using all the unlabeled examples that had both modalities present. [sent-239, score-0.383]
78 We compared our model to two other deep learning models: Multimodal Deep Belief Network (DBN) and a deep Autoencoder model [5]. [sent-248, score-0.298]
79 These models were trained with the same number of layers and hidden units as the DBM. [sent-249, score-0.4]
80 Unimodal Inputs: Next, we evaluate the ability of the model to improve classification of unimodal inputs by filling in other modalities. [sent-260, score-0.221]
81 For multimodal models, the text input was only used during training. [sent-261, score-0.511]
82 4 compares the Multimodal DBM model with an SVM over image features alone (Image-SVM) [2], a DBN over image features (Image-DBN) and a DBM over image features (Image-DBM). [sent-264, score-0.402]
83 All deep models had the same depth and same number of hidden units in each layer. [sent-265, score-0.41]
84 In one case (DBM-ZeroText), the state of the joint hidden layer was inferred keeping the missing text input clamped at zero. [sent-267, score-0.597]
85 In the other case (DBM-GenText), the text input was not clamped and the model was allowed to update the state of the text input layer when performing mean-field updates. [sent-268, score-0.501]
86 In doing so, the model effectively filled-in the missing text modality (some examples of which are shown in Fig. [sent-269, score-0.371]
87 The DBM-GenText model performs better than all other models, showing that the DBM is able to generate meaningful text that serves as a plausible proxy for missing data. [sent-272, score-0.25]
88 This suggests that learning multimodal features helps even when some modalities are absent at test time. [sent-274, score-0.665]
89 Having multiple modalities probably regularizes the model and makes it learn much better features. [sent-275, score-0.271]
90 As we go deeper into the model from either input layer towards the middle, the internal representations get better. [sent-282, score-0.208]
91 The joint layer in the middle serves as the most useful feature representation. [sent-283, score-0.205]
92 For each model, all queries and all points in the database were mapped to the joint hidden representation under that model. [sent-293, score-0.346]
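The following sketch shows one way this retrieval step could look once queries and database items have been mapped to the joint representation. Cosine similarity is used as an assumed ranking function (the text does not specify the similarity measure), and the representations are random placeholders.

```python
import numpy as np

def retrieve(query_repr, database_repr, top_k=4):
    """Rank database items by cosine similarity to a query, where both have
    been mapped to the joint hidden representation. A minimal sketch; the
    choice of cosine similarity is an assumption, not taken from the text."""
    q = query_repr / (np.linalg.norm(query_repr) + 1e-12)
    D = database_repr / (np.linalg.norm(database_repr, axis=1, keepdims=True) + 1e-12)
    scores = D @ q
    return np.argsort(-scores)[:top_k], scores

rng = np.random.default_rng(0)
database = rng.random((1000, 64))   # joint representations of database items
query = rng.random(64)              # joint representation of a (multimodal) query
idx, scores = retrieve(query, database)
print(idx, scores[idx])
```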
93 6 shows some examples of multimodal queries and the top 4 retrieved results. [sent-301, score-0.467]
94 Unimodal Queries: The DBM model can also be used to query for unimodal inputs by filling in the missing modality. [sent-303, score-0.314]
95 5b shows the precision-recall curves for the DBM model along with other unimodal models, where each model received the same image queries as input. [sent-305, score-0.355]
96 By effectively inferring the missing text, the DBM model was able to achieve far better results than any unimodal method (MAP of 0. [sent-306, score-0.249]
97 Conclusion: We proposed a Deep Boltzmann Machine model for learning multimodal data representations. [sent-310, score-0.38]
98 Pathways for each modality can be pretrained independently and “plugged in” together for doing joint training. [sent-312, score-0.231]
99 The model fuses multiple data modalities into a unified representation. [sent-313, score-0.306]
100 It also works nicely when some modalities are absent and improves upon models trained on only the observed modalities. [sent-315, score-0.304]
wordName wordTfidf (topN-words)
[('dbm', 0.631), ('multimodal', 0.35), ('modalities', 0.241), ('vm', 0.156), ('units', 0.153), ('hidden', 0.138), ('text', 0.127), ('unimodal', 0.126), ('modality', 0.121), ('deep', 0.119), ('rbm', 0.115), ('dbn', 0.114), ('layer', 0.106), ('boltzmann', 0.1), ('missing', 0.093), ('vimg', 0.088), ('vtxt', 0.088), ('queries', 0.088), ('replicated', 0.086), ('autoencoder', 0.086), ('image', 0.081), ('vt', 0.081), ('pathway', 0.08), ('visible', 0.078), ('layers', 0.077), ('retrieval', 0.077), ('rbms', 0.074), ('hj', 0.073), ('softmax', 0.072), ('huiskes', 0.071), ('hm', 0.066), ('inputs', 0.065), ('images', 0.06), ('fused', 0.059), ('count', 0.058), ('mir', 0.057), ('joint', 0.056), ('unlabeled', 0.055), ('dbms', 0.054), ('pretrained', 0.054), ('ht', 0.052), ('word', 0.048), ('map', 0.047), ('guillaumin', 0.047), ('beautiful', 0.047), ('lling', 0.046), ('lda', 0.045), ('discriminative', 0.044), ('clamped', 0.043), ('features', 0.043), ('representation', 0.042), ('tags', 0.041), ('flickr', 0.04), ('representations', 0.038), ('dbmzerotext', 0.035), ('fuses', 0.035), ('nitish', 0.035), ('phow', 0.035), ('prec', 0.035), ('sunset', 0.035), ('vmi', 0.035), ('input', 0.034), ('pathways', 0.033), ('trained', 0.032), ('precision', 0.032), ('clamping', 0.031), ('absent', 0.031), ('multimedia', 0.031), ('model', 0.03), ('machines', 0.029), ('captions', 0.029), ('modeling', 0.029), ('retrieved', 0.029), ('wij', 0.027), ('svms', 0.026), ('generative', 0.026), ('undirected', 0.025), ('svm', 0.025), ('pretraining', 0.025), ('aj', 0.024), ('vk', 0.024), ('ngiam', 0.024), ('responsibility', 0.024), ('contrastive', 0.023), ('useful', 0.022), ('energy', 0.022), ('salakhutdinov', 0.022), ('autoencoders', 0.022), ('feed', 0.022), ('mapped', 0.022), ('logistic', 0.022), ('vision', 0.022), ('ruslan', 0.021), ('middle', 0.021), ('tasks', 0.021), ('belief', 0.021), ('stochastic', 0.02), ('vi', 0.02), ('discover', 0.02), ('classi', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
2 0.50319105 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models. 1
3 0.159141 65 nips-2012-Cardinality Restricted Boltzmann Machines
Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams
Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1
4 0.12609282 12 nips-2012-A Neural Autoregressive Topic Model
Author: Hugo Larochelle, Stanislas Lauly
Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1
5 0.11253599 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
6 0.11028014 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
7 0.10891981 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
8 0.10795651 8 nips-2012-A Generative Model for Parts-based Object Segmentation
9 0.1037283 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
10 0.099758364 71 nips-2012-Co-Regularized Hashing for Multimodal Data
11 0.098148517 193 nips-2012-Learning to Align from Scratch
12 0.091410227 75 nips-2012-Collaborative Ranking With 17 Parameters
13 0.087712206 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
14 0.086729996 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
15 0.077911623 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
16 0.074196629 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
17 0.066126332 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex
18 0.064841412 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio
19 0.064006627 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
20 0.062760748 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
topicId topicWeight
[(0, 0.161), (1, 0.067), (2, -0.209), (3, 0.016), (4, 0.01), (5, -0.04), (6, -0.004), (7, -0.028), (8, -0.006), (9, -0.035), (10, 0.055), (11, 0.168), (12, -0.043), (13, 0.221), (14, -0.107), (15, -0.079), (16, 0.045), (17, -0.07), (18, 0.139), (19, -0.181), (20, 0.004), (21, -0.133), (22, -0.063), (23, 0.086), (24, 0.057), (25, 0.1), (26, 0.113), (27, 0.032), (28, 0.15), (29, -0.196), (30, -0.044), (31, -0.046), (32, -0.138), (33, 0.02), (34, -0.11), (35, -0.069), (36, -0.073), (37, -0.073), (38, -0.081), (39, 0.01), (40, -0.056), (41, 0.161), (42, 0.21), (43, 0.063), (44, -0.001), (45, -0.103), (46, -0.042), (47, 0.014), (48, -0.028), (49, -0.02)]
simIndex simValue paperId paperTitle
1 0.95068181 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models. 1
same-paper 2 0.93437678 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
3 0.78486419 65 nips-2012-Cardinality Restricted Boltzmann Machines
Author: Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S. Zemel, Ruslan Salakhutdinov, Ryan P. Adams
Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM’s hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers. 1
4 0.53140748 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
5 0.51503772 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction
Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi
Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. However, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NN^k_{ij}, where i and j index the spatial coordinates of the contact map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components. 1
6 0.46676263 8 nips-2012-A Generative Model for Parts-based Object Segmentation
7 0.46303859 170 nips-2012-Large Scale Distributed Deep Networks
8 0.45048767 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
9 0.42975023 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
10 0.42886013 12 nips-2012-A Neural Autoregressive Topic Model
11 0.39697114 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
12 0.372895 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
13 0.36434823 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks
14 0.35392025 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
15 0.33781123 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.30386958 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
17 0.30185011 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition
18 0.27974749 71 nips-2012-Co-Regularized Hashing for Multimodal Data
19 0.27864283 198 nips-2012-Learning with Target Prior
20 0.27830598 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
topicId topicWeight
[(0, 0.041), (15, 0.023), (21, 0.028), (35, 0.198), (38, 0.078), (42, 0.028), (53, 0.024), (54, 0.024), (55, 0.052), (74, 0.067), (76, 0.11), (80, 0.12), (92, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.80675894 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
2 0.74541378 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation
Author: Jason Pacheco, Erik B. Sudderth
Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation. 1
3 0.71888506 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
Author: Michael Bryant, Erik B. Sudderth
Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.
4 0.71646547 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
Author: Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state of the art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs. 1
5 0.70930743 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
6 0.70814663 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
7 0.69656652 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space
8 0.69498879 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
9 0.69463146 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search
10 0.68934858 65 nips-2012-Cardinality Restricted Boltzmann Machines
11 0.68830913 168 nips-2012-Kernel Latent SVM for Visual Recognition
12 0.68611932 291 nips-2012-Reducing statistical time-series problems to binary classification
13 0.68596083 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection
14 0.68572068 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
15 0.68451709 193 nips-2012-Learning to Align from Scratch
16 0.68217194 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
17 0.68108994 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines
18 0.68072867 329 nips-2012-Super-Bit Locality-Sensitive Hashing
19 0.67947036 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
20 0.67645293 260 nips-2012-Online Sum-Product Computation Over Trees