nips nips2008 nips2008-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). [sent-3, score-0.449]
2 This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. [sent-4, score-0.413]
3 Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. [sent-5, score-0.52]
4 The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. [sent-6, score-0.563]
5 The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. [sent-7, score-0.546]
6 We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. [sent-9, score-0.6]
7 1 Introduction A typical mixture model is composed of a number of separately parameterized density models, each of which has two important properties: 1. [sent-10, score-0.418]
8 The mixture is created by assigning a mixing proportion to each of the component models, and it is typically fitted using the EM algorithm, which alternates between two steps. [sent-14, score-0.578]
9 The M-step also changes the mixing proportions of the component models to match the proportion of the training data that they are responsible for. [sent-18, score-0.316]
10 An RBM with N hidden units can be viewed as a mixture of 2^N Bernoulli models, one per latent state vector, with a lot of parameter sharing between the 2^N component models and with the 2^N mixing proportions being implicitly determined by the same parameters. [sent-21, score-0.96]
11 1 A multivariate Bernoulli model consists of a set of probabilities, one per component of the binary data vector. [sent-22, score-0.187]
12 It can also be viewed as a product of N “uni-Bernoulli” models (plus one Bernoulli model that is implemented by the visible biases). [sent-24, score-0.224]
13 A uni-Bernoulli model is a mixture of a uniform and a Bernoulli. [sent-25, score-0.385]
14 The weights of a hidden unit define the i-th probability in its Bernoulli model as p_i = σ(w_i), and the bias, b, of a hidden unit defines the mixing proportion of the Bernoulli in its uni-Bernoulli as σ(b), where σ(x) = 1/(1 + exp(−x)). [sent-26, score-0.459]
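As a concrete illustration of the uni-Bernoulli reading above, here is a minimal numpy sketch; the weight and bias values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden unit of an RBM, with hypothetical weights and bias.
w = np.array([1.5, -0.7, 0.2])   # weights to three visible units
b = -0.5                         # bias of the hidden unit

p_i = sigmoid(w)   # Bernoulli probabilities p_i = sigma(w_i) defined by the weights
mix = sigmoid(b)   # mixing proportion sigma(b) of the Bernoulli in its uni-Bernoulli
print(p_i, mix)    # with probability 1 - mix the uni-Bernoulli falls back to the uniform
```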
15 The mixture could be used to model the raw data or some preprocessed representation that has already extracted features that are shared by different classes. [sent-28, score-0.447]
16 This paper describes a way of fitting a mixture of RBMs without explicitly computing the partition function of each RBM. [sent-31, score-0.404]
17 2 The model We start with the energy function for a Restricted Boltzmann Machine (RBM) and then modify it to define the implicit mixture of RBMs. [sent-32, score-0.626]
18 To simplify the description, we assume that the visible and hidden variables of the RBM are binary. [sent-33, score-0.32]
19 Defining the energy function in terms of three-way interactions among the components of v, h, and z gives E(v, h, z) = − Σ_{i,j,k} W^I_{ijk} v_i h_j z_k, (2) where W^I is a 3D tensor of parameters. [sent-39, score-0.485]
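The three-way energy of equation 2 is a simple tensor contraction. Below is a minimal numpy sketch; the tensor sizes and the random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def energy(W, v, h, z):
    # E(v, h, z) = -sum_{i,j,k} W[i, j, k] * v_i * h_j * z_k  (equation 2)
    return -np.einsum('ijk,i,j,k->', W, v, h, z)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 3, 2))  # W^I: 4 visible units, 3 hidden units, K = 2 components
v = np.array([1, 0, 1, 1])                 # binary visible vector
h = np.array([0, 1, 1])                    # binary hidden vector
z = np.array([1, 0])                       # 1-of-K cluster indicator
print(energy(W, v, h, z))
```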
20 The joint distribution for the mixture model is P(v, h, z) = exp(−E(v, h, z)) / Z_I, (3) where Z_I = Σ_{u,g,y} exp(−E(u, g, y)) (4) is the partition function of the implicit mixture model. [sent-41, score-0.952]
21 Re-writing the joint distribution in the usual mixture model form gives P(v) = Σ_{h,z} P(v, h, z) = Σ_{k=1}^{K} Σ_h P(v, h | z_k = 1) P(z_k = 1). (5) [sent-42, score-0.385]
22 Equation 5 defines the implicit mixture of RBMs. [sent-43, score-0.535]
23 P(v, h | z_k = 1) is the k-th component RBM’s distribution, with W^R being the k-th slice of W^I. [sent-44, score-0.254]
24 Unlike in a typical mixture model, the mixing proportion P(z_k = 1) is not a separate parameter in our model. [sent-45, score-0.472]
25 Changing the bias of the k-th unit in z changes the mixing proportion of the k-th RBM, but all of the weights of all the RBMs also influence it. [sent-47, score-0.268]
26 Figure 1 gives a visual description of the implicit mixture model’s structure. [sent-48, score-0.535]
27 Given a training set {v_1, ..., v_N}, we want to learn the parameters of the implicit mixture model by maximizing the log likelihood L = Σ_{n=1}^{N} log P(v_n) with respect to W^I. [sent-52, score-0.22]
28 The second case is easy: given z_k = 1, we select the k-th component RBM of the mixture model and then sample from its conditional distribution P_k(v|h). [sent-61, score-0.788]
29 So the i-th visible unit is drawn independently of the other units from the Bernoulli distribution P(v_i = 1 | h, z_k = 1) = 1 / (1 + exp(− Σ_j W^I_{ijk} h_j)). [sent-63, score-0.666]
30 Then, given z_k = 1, we select the k-th component RBM and sample from its conditional distribution P_k(h|v). [sent-66, score-0.403]
31 Again, this distribution is factorial, and the j-th hidden unit is drawn from the Bernoulli distribution P(h_j = 1 | v, z_k = 1) = 1 / (1 + exp(− Σ_i W^I_{ijk} v_i)). (8) [sent-67, score-0.486]
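The two factorial conditionals above translate directly into component-wise Bernoulli sampling on the k-th slice of W^I. The following sketch assumes the W[i, j, k] tensor layout used in the energy example earlier; it is an illustration, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_v_given_h_z(W, h, k, rng):
    # P(v_i = 1 | h, z_k = 1) = sigmoid(sum_j W[i, j, k] h_j), sampled independently per unit
    p = sigmoid(W[:, :, k] @ h)
    return (rng.random(p.shape) < p).astype(np.int8)

def sample_h_given_v_z(W, v, k, rng):
    # P(h_j = 1 | v, z_k = 1) = sigmoid(sum_i W[i, j, k] v_i)  (equation 8)
    p = sigmoid(v @ W[:, :, k])
    return (rng.random(p.shape) < p).astype(np.int8)
```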
32 To compute P(z|v) we first note that P(z_k = 1|v) ∝ exp(−F(v, z_k = 1)), (9) where the free energy F(v, z_k = 1) is given by F(v, z_k = 1) = − Σ_j log(1 + exp(Σ_i W^I_{ijk} v_i)). (10) [sent-68, score-0.592]
33 If the number of possible states of z is small enough, then it is practical to compute the quantity F(v, z_k = 1) for every k by brute force. [sent-69, score-0.468]
34 So we can compute P(z_k = 1|v) = exp(−F(v, z_k = 1)) / Σ_l exp(−F(v, z_l = 1)). (11) [sent-70, score-0.234]
35 Equation 11 defines the responsibility of the k-th component RBM for the data vector v. [sent-71, score-0.227]
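Equations 9-11 amount to a softmax over negative free energies, which is cheap when K is small. A minimal sketch (same W[i, j, k] layout as above) follows; the temperature argument T anticipates the scaling trick described later in this section and defaults to 1.

```python
import numpy as np

def free_energy(W, v):
    # F(v, z_k = 1) = -sum_j log(1 + exp(sum_i W[i, j, k] v_i))  (equation 10), for every k at once
    act = np.einsum('ijk,i->jk', W, v)
    return -np.logaddexp(0.0, act).sum(axis=0)   # shape (K,)

def responsibilities(W, v, T=1.0):
    # P(z_k = 1 | v) = exp(-F(v, z_k=1)/T) / sum_l exp(-F(v, z_l=1)/T)  (equation 11, with temperature)
    f = free_energy(W, v) / T
    f = f - f.min()                              # shift by a constant for numerical stability
    p = np.exp(-f)
    return p / p.sum()
```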
36 Contrastive divergence learning: Below is a summary of the steps of CD learning for the implicit mixture model. [sent-72, score-0.535]
37 For a training vector v^+, pick a component RBM by sampling the responsibilities P(z_k = 1|v^+). [sent-74, score-0.404]
38 Pick a component RBM by sampling the responsibilities P(z_k = 1|v^−). [sent-83, score-0.37]
39 Repeating the above steps for a mini-batch of N_b training cases results in two sets of outer products for each component k in the mixture model: S_k^+ = {D_{ki}^+} and S_k^− = {D_{kj}^−}. [sent-89, score-0.551]
40 The resulting CD approximation to the gradient (equation 12) is ∂L/∂W_k^I ≈ (1/N_b)(Σ_i D_{ki}^+ − Σ_j D_{kj}^−). Note that to compute the outer products D^+ and D^− for a given training vector, the component RBMs are selected through two separate stochastic picks. [sent-97, score-0.201]
41 Therefore the sets S_k^+ and S_k^− need not be of the same size because the choice of the mixture component can be different for v^+ and v^−. [sent-98, score-0.456]
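Putting the pieces together, one CD-1 sweep over a mini-batch can be sketched as below, reusing the `responsibilities` and sampling helpers from the earlier sketches. The learning rate and the exact bookkeeping are illustrative assumptions; the summary above is the authoritative description of the procedure.

```python
import numpy as np

def cd1_update(W, batch, lr, T, rng):
    # One contrastive-divergence step for the implicit mixture (a sketch, not the authors' code).
    grad = np.zeros_like(W)
    for v_pos in batch:
        # Positive phase: pick a component for v+ by sampling its responsibilities, then sample h+.
        k_pos = rng.choice(W.shape[2], p=responsibilities(W, v_pos, T))
        h_pos = sample_h_given_v_z(W, v_pos, k_pos, rng)
        grad[:, :, k_pos] += np.outer(v_pos, h_pos)        # D+ goes to the picked component

        # Negative phase: reconstruct v-, re-pick a component for v-, then sample h-.
        v_neg = sample_v_given_h_z(W, h_pos, k_pos, rng)
        k_neg = rng.choice(W.shape[2], p=responsibilities(W, v_neg, T))
        h_neg = sample_h_given_v_z(W, v_neg, k_neg, rng)
        grad[:, :, k_neg] -= np.outer(v_neg, h_neg)        # D- may go to a different component

    W += lr * grad / len(batch)
    return W
```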
42 Scaling free energies with a temperature parameter: In practice, the above learning algorithm causes all the training cases to be captured by a single component RBM, and the other components to be left unused. [sent-99, score-0.297]
43 This is because free energy is an unnormalized quantity that can have very different numerical scales across the RBMs. [sent-100, score-0.201]
44 One RBM may happen to produce much smaller free energies than the rest because of random differences in the initial parameter values, and thus end up with high responsibilities for most training cases. [sent-101, score-0.346]
45 The solution is to use a temperature parameter T when computing the responsibilities: P(z_k = 1|v) = exp(−F(v, z_k = 1)/T) / Σ_l exp(−F(v, z_l = 1)/T). [sent-103, score-0.256]
46 4 Results We apply the implicit mixture of RBMs to two datasets, MNIST [1] and NORB [7]. [sent-107, score-0.535]
47 NORB contains stereo-pair images of 3D toy objects taken under different lighting conditions and viewpoints. [sent-109, score-0.225]
48 Evaluation method: Since computing the exact partition function of an RBM is intractable, it is not possible to directly evaluate the quality of our mixture model’s fit to the data, e.g., by computing the log probability it assigns to test data. [sent-112, score-0.382]
49 [Figure 2: Features of the mixture model with five component RBMs trained on all ten classes of MNIST images.] [sent-114, score-0.611]
50 A reasonable evaluation criterion for a mixture modelling algorithm is that it should be able to find clusters that are mostly ‘pure’ with respect to class labels. [sent-119, score-0.439]
51 That is, the set of data vectors that a particular mixture component has high responsibilities for should have the same class label. [sent-120, score-0.725]
52 So it should be possible to accurately predict the class label of a given data vector from the responsibilities of the different mixture components for that vector. [sent-121, score-0.704]
53 Once a mixture model is fully trained, we evaluate it by training a classifier that takes as input the responsibilities of the mixture components for a data vector and predicts its class label. [sent-122, score-1.102]
54 The goodness of the mixture model is measured by the test set prediction accuracy of this classifier. [sent-123, score-0.426]
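The evaluation protocol described above can be sketched with scikit-learn's logistic regression. The feature choice here (negative free energies, i.e. unnormalized responsibilities, which the text later reports to work better than normalized posteriors) and the solver settings are assumptions for illustration; `free_energy` is the helper from the earlier sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_clustering(W, train_x, train_y, test_x, test_y):
    # Freeze the mixture, compute one feature per component for each data vector,
    # fit a softmax classifier on the training labels, and report test accuracy.
    feats_train = np.stack([-free_energy(W, v) for v in train_x])
    feats_test = np.stack([-free_energy(W, v) for v in test_x])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feats_train, train_y)
    return clf.score(feats_test, test_y)
```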
55 4.1 Results for MNIST Before attempting to learn a good mixture model of the whole MNIST dataset, we tried two simpler modeling tasks. [sent-125, score-0.415]
56 First, we fitted an implicit mixture of two RBMs with 100 hidden units each to an unlabelled dataset consisting of 4,000 twos and 4,000 threes. [sent-126, score-0.864]
57 This compares very favorably with logistic regression, which needs 8000 labels in addition to the images and gives 36 errors on the test set, even when using a penalty on the squared weights whose magnitude is set using a validation set. [sent-129, score-0.192]
58 Logistic regression also gives a good indication of the performance that could be expected from fitting a mixture of two Gaussians with a shared covariance matrix, because logistic regression is equivalent to fitting such a mixture discriminatively. [sent-130, score-0.833]
59 We then tried fitting an implicit mixture model with only five component RBMs, each with 25 hidden units, to the entire training set. [sent-131, score-0.871]
60 We purposely make the model very small so that it is possible to visually inspect the features and the responsibilities of the component RBMs and understand what each component is modelling. [sent-132, score-0.518]
61 The learning algorithm has produced a sensible mixture model in that visually similar digit classes are combined under the same mixture component. [sent-137, score-0.853]
62 We have also trained larger models with many more mixture components. [sent-142, score-0.405]
63 As the number of components increases, we expect the model to partition the image space more finely, with the different components specializing on various sub-classes of digits. [sent-143, score-0.223]
64 If they specialize in a way that respects the class boundaries, then their responsibilities for a data vector will become a better predictor of its class label. [sent-144, score-0.297]
65 The component RBMs use binary units both in the visible and hidden layers. [sent-145, score-0.648]
66 We have tried various settings for the number of mixture components (from 20 to 120 in steps of 20) and a component’s hidden layer size (50, 100, 200, 500). [sent-147, score-0.626]
67 The hidden layer size is set to 100, but 200 and 500 also produce similar accuracies. [sent-150, score-0.182]
68 Out of the 60,000 training images in MNIST, we use 50,000 to train the mixture model and the classifier, and the remaining 10,000 as a validation set for early stopping. [sent-151, score-0.527]
69 Once the mixture model is trained, we train a logistic regression classifier to predict the class label from the responsibilities (see footnote 2). [sent-153, score-0.566]
70 It has as many inputs as there are mixture components, and a ten-way softmax over the class labels at the output. [sent-154, score-0.378]
71 In our experiments, classification accuracy is consistently and significantly higher when unnormalized responsibilities are used as the classifier input, instead of the actual posterior probabilities of the mixture components given a data vector. [sent-156, score-0.826]
72 As a simple baseline comparison, we train a logistic regression classifier that predicts the class label from the raw pixels. [sent-159, score-0.243]
73 4.2 Results for NORB NORB is a much more difficult dataset than MNIST because the images are of very different classes of 3D objects (instead of 2D patterns) shown from different viewpoints and under various lighting conditions. [sent-166, score-0.291]
74 So binary units are no longer appropriate for the visible layer of the component RBMs. [sent-168, score-0.568]
75 Gaussian visible units have previously been shown to be effective for modelling grayscale images [6], and therefore we use them here. [sent-169, score-0.508]
76 As in that paper, the variance of the units is fixed to 1, and only their means are learned. [sent-171, score-0.176]
77 Learning an RBM with Gaussian visible units can be slow, as it may require a much greater number of weight updates than an equivalent RBM with binary visible units. [sent-172, score-0.6]
78 We avoid this problem by first training a single RBM with Gaussian visible units and binary hidden units on the raw pixel data, and then treating the activities of its hidden layer as pre-processed data to which the implicit mixture model is applied. [sent-174, score-1.598]
79 Since the hidden layer activities of the pre-processing RBM are binary, the mixture model can now be trained efficiently with binary units in the visible layer (see footnote 3). [sent-175, score-1.033]
80 Once trained, the low-level RBM acts as a fixed pre-processing step that converts the raw grayscale images into the binary input representation used by the mixture model. (Footnote 2: the mixture model parameters are kept fixed when training the classifier, so the learning of the mixture model is entirely unsupervised.) [sent-176, score-0.985]
81 (Footnote 3: We actually use the real-valued probabilities of the hidden units as the data, and we also use real-valued probabilities for the reconstructions.) [sent-177, score-0.377]
82 [Figure 3: Implicit mixture model used for MNORB. The diagram shows the 1-of-K activation z, the hidden units with weights W_jmk, the binary data layer produced by the pre-processing transformation W_ij, and the Gaussian visible units carrying the raw pixel data.] [sent-179, score-0.995]
83 Its parameters are not modified further when training the mixture model. [sent-181, score-0.384]
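A sketch of the two-stage pipeline just described is shown next. The pre-processing RBM's weights W_pre and hidden biases b_pre are assumed to have been trained already (the Gaussian-visible RBM training itself is not shown), and `sigmoid` and `cd1_update` refer to the earlier sketches; the remaining names are hypothetical.

```python
import numpy as np

def preprocess(pixels, W_pre, b_pre):
    # Fixed pre-processing: hidden probabilities of the already-trained Gaussian-visible RBM.
    # Per footnote 3, real-valued probabilities are used as the mixture model's data.
    return sigmoid(pixels @ W_pre + b_pre)

# Illustrative usage (names and hyperparameters hypothetical):
# binary_data = preprocess(raw_pixel_vectors, W_pre, b_pre)   # shape (N, number of hidden units)
# for batch in minibatches(binary_data):
#     W_mix = cd1_update(W_mix, batch, lr=0.01, T=temperature, rng=rng)
```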
84 A difficulty with training the implicit mixture model (or any other mixture model) on NORB is that the ‘natural’ clusters in the dataset correspond to the six lighting conditions instead of the five object classes. [sent-183, score-1.141]
85 This significantly reduces the interference of the lighting on the mixture learning (see footnote 4). [sent-188, score-0.478]
86 We use 2000 binary hidden units for the preprocessing RBM, so the input dimensionality of the implicit mixture model is 2000. [sent-193, score-0.923]
87 We have tried many different settings for the number of mixture components and the hidden layer size of the components. [sent-194, score-0.626]
88 Table 2 shows the test set error rates for a logistic regression classifier trained on various input representations. [sent-197, score-0.173]
89 Mixture of Factor Analyzers (MFA) [3] is similar to the implicit mixture of RBMs in that it also learns a clustering while simultaneously learning a latent representation per cluster component. [sent-198, score-0.573]
90 A mixture of multivariate Bernoullis (see [2]) is similar to an implicit mixture model whose component RBMs have no hidden units and only visible biases as trainable parameters. [sent-208, score-1.2]
91 The differences are that a Bernoulli mixture is a directed model, it has explicitly parameterized mixing proportions, and maximum likelihood learning with EM is tractable. [sent-209, score-0.49]
92 We train this model with 100 components on the activation probabilities of the preprocessing RBM’s hidden units. [sent-210, score-0.336]
93 A logistic regression classifier can still predict the lighting label with 18% test set error when trained and tested on normalized images, compared to 8% error for unnormalized images. [sent-213, score-0.437]
94 [Table 2: Test set error rates for logistic regression classifiers trained on various inputs, including the unnormalized responsibilities computed by the implicit mixture of RBMs and the probabilities computed by the transformation W_ij in Fig. 3.] [sent-215, score-0.811]
95 These results show that the implicit mixture of RBMs has learned clusters that reflect the class structure in the data. [sent-223, score-0.6]
96 By the classification accuracy criterion, the implicit mixture is also better than MFA. [sent-224, score-0.556]
97 The results also confirm that the lack of explicitly parameterized mixing proportions does not prevent the implicit mixture model from discovering interesting cluster structure in the data. [sent-225, score-0.764]
98 5 Conclusions We have presented a tractable formulation of a mixture of RBMs. [sent-226, score-0.35]
99 The key insight here is that the mixture model can be cast as a third-order Boltzmann machine, provided we are willing to abandon explicitly parameterized mixing proportions. [sent-228, score-0.525]
100 Representational power of restricted Boltzmann machines and deep belief networks. [sent-289, score-0.205]
wordName wordTfidf (topN-words)
[('rbm', 0.527), ('mixture', 0.35), ('rbms', 0.271), ('responsibilities', 0.241), ('zk', 0.234), ('visible', 0.189), ('implicit', 0.185), ('units', 0.176), ('mnist', 0.144), ('norb', 0.141), ('hidden', 0.131), ('lighting', 0.128), ('wijk', 0.121), ('boltzmann', 0.117), ('unnormalized', 0.115), ('component', 0.106), ('mixing', 0.085), ('bernoulli', 0.082), ('er', 0.077), ('mfa', 0.075), ('images', 0.074), ('contrastive', 0.071), ('classi', 0.069), ('tractably', 0.064), ('components', 0.064), ('logistic', 0.063), ('th', 0.063), ('raw', 0.062), ('energy', 0.056), ('trained', 0.055), ('proportions', 0.054), ('layer', 0.051), ('hj', 0.047), ('binary', 0.046), ('nb', 0.045), ('grayscale', 0.045), ('classes', 0.044), ('vn', 0.043), ('datapoint', 0.043), ('sk', 0.042), ('energies', 0.041), ('mnorb', 0.04), ('outer', 0.039), ('wij', 0.038), ('vi', 0.038), ('latent', 0.038), ('pixels', 0.037), ('proportion', 0.037), ('clusters', 0.037), ('activation', 0.037), ('regression', 0.035), ('probabilities', 0.035), ('model', 0.035), ('training', 0.034), ('train', 0.034), ('cd', 0.033), ('parameterized', 0.033), ('restricted', 0.032), ('partition', 0.032), ('pixel', 0.032), ('exp', 0.032), ('deep', 0.031), ('intractable', 0.031), ('tried', 0.03), ('responsibility', 0.03), ('pl', 0.03), ('visually', 0.03), ('free', 0.03), ('analyzers', 0.028), ('zl', 0.028), ('biases', 0.028), ('image', 0.028), ('class', 0.028), ('toronto', 0.028), ('roth', 0.027), ('tensor', 0.025), ('machines', 0.025), ('sensible', 0.024), ('modelling', 0.024), ('cvpr', 0.024), ('sampling', 0.023), ('objects', 0.023), ('assigns', 0.022), ('dataset', 0.022), ('temperature', 0.022), ('slice', 0.022), ('products', 0.022), ('explicitly', 0.022), ('ht', 0.022), ('interactions', 0.021), ('label', 0.021), ('ten', 0.021), ('tting', 0.021), ('modelled', 0.021), ('accuracy', 0.021), ('unit', 0.02), ('implicitly', 0.02), ('test', 0.02), ('pk', 0.02), ('digit', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
2 0.32852384 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2- regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training. 1
3 0.194592 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
4 0.16747773 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
5 0.13875821 219 nips-2008-Spectral Hashing
Author: Yair Weiss, Antonio Torralba, Rob Fergus
Abstract: Semantic hashing[1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state-of-the art. 1
6 0.11695637 65 nips-2008-Domain Adaptation with Multiple Sources
7 0.096009776 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
8 0.094989449 62 nips-2008-Differentiable Sparse Coding
9 0.082669564 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
10 0.080085687 148 nips-2008-Natural Image Denoising with Convolutional Networks
11 0.079160191 249 nips-2008-Variational Mixture of Gaussian Process Experts
12 0.077003203 118 nips-2008-Learning Transformational Invariants from Natural Movies
13 0.065668702 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
14 0.065136544 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
15 0.064807966 119 nips-2008-Learning a discriminative hidden part model for human action recognition
16 0.062814742 226 nips-2008-Supervised Dictionary Learning
17 0.062744975 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
18 0.06068765 4 nips-2008-A Scalable Hierarchical Distributed Language Model
19 0.057614144 196 nips-2008-Relative Margin Machines
20 0.057068728 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
topicId topicWeight
[(0, -0.193), (1, -0.075), (2, 0.098), (3, -0.069), (4, 0.005), (5, -0.007), (6, -0.053), (7, 0.039), (8, 0.018), (9, 0.039), (10, 0.099), (11, 0.118), (12, 0.05), (13, 0.081), (14, -0.283), (15, -0.243), (16, -0.136), (17, -0.28), (18, 0.057), (19, 0.093), (20, -0.002), (21, -0.103), (22, 0.064), (23, 0.02), (24, 0.02), (25, 0.109), (26, -0.074), (27, 0.217), (28, 0.081), (29, 0.107), (30, -0.023), (31, 0.016), (32, -0.006), (33, -0.085), (34, -0.035), (35, 0.046), (36, -0.06), (37, -0.13), (38, -0.145), (39, 0.057), (40, 0.002), (41, -0.088), (42, -0.04), (43, 0.042), (44, 0.026), (45, -0.026), (46, 0.017), (47, -0.093), (48, 0.023), (49, -0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.95252371 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
2 0.84695989 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2- regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training. 1
3 0.75768828 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
Author: Ilya Sutskever, Geoffrey E. Hinton, Graham W. Taylor
Abstract: The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1
4 0.55422497 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
5 0.42847934 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. 
For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
6 0.41937968 219 nips-2008-Spectral Hashing
7 0.394795 249 nips-2008-Variational Mixture of Gaussian Process Experts
8 0.36307472 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
9 0.336162 226 nips-2008-Supervised Dictionary Learning
10 0.33407739 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
11 0.33022112 234 nips-2008-The Infinite Factorial Hidden Markov Model
12 0.31403431 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
13 0.30859202 3 nips-2008-A Massively Parallel Digital Learning Processor
14 0.30443043 124 nips-2008-Load and Attentional Bayes
15 0.29698172 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
16 0.29481658 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
17 0.2889494 185 nips-2008-Privacy-preserving logistic regression
18 0.286957 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
19 0.28393868 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
20 0.28327009 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
topicId topicWeight
[(6, 0.087), (7, 0.072), (12, 0.021), (28, 0.18), (32, 0.247), (57, 0.128), (59, 0.015), (63, 0.031), (77, 0.041), (78, 0.011), (83, 0.064)]
simIndex simValue paperId paperTitle
1 0.92677522 28 nips-2008-Asynchronous Distributed Learning of Topic Models
Author: Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors, and processors independently perform Gibbs sampling on their local data and communicate their information in a local asynchronous manner with other processors. We demonstrate that our asynchronous algorithms are able to learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors. As a stepping stone in the development of asynchronous HDP, a parallel HDP sampler is also introduced. 1
same-paper 2 0.83795589 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
3 0.83714718 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
Author: Kate Saenko, Trevor Darrell
Abstract: Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionarybased approach outperforms baseline methods. 1
4 0.83170056 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
5 0.74372494 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
Author: Tanya Schmah, Geoffrey E. Hinton, Steven L. Small, Stephen Strother, Richard S. Zemel
Abstract: Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2- regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training. 1
6 0.71021342 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
7 0.70842886 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
8 0.70502234 62 nips-2008-Differentiable Sparse Coding
9 0.70367432 4 nips-2008-A Scalable Hierarchical Distributed Language Model
10 0.70293546 219 nips-2008-Spectral Hashing
11 0.70239002 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
12 0.7021594 200 nips-2008-Robust Kernel Principal Component Analysis
13 0.70129204 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
14 0.69848788 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
15 0.69747579 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
16 0.69624442 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
17 0.69337213 118 nips-2008-Learning Transformational Invariants from Natural Movies
18 0.69144094 226 nips-2008-Supervised Dictionary Learning
19 0.6911366 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
20 0.69093943 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning