nips nips2013 nips2013-200 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
Reference: text
sentIndex sentText sentNum sentScore
1 The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. [sent-7, score-0.571]
2 Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. [sent-8, score-0.271]
3 The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks. [sent-9, score-0.368]
4 1 Introduction: A deep Boltzmann machine (DBM) [18] is a structured probabilistic model consisting of many layers of random variables, most of which are latent. [sent-10, score-0.142]
5 DBMs are usually used as feature learners, where the mean field expectations of the hidden units are used as input features to a separate classifier, such as an MLP or logistic regression. [sent-13, score-0.213]
6 Another drawback to the DBM is the complexity of training it. [sent-15, score-0.152]
7 Typically it is trained in a greedy, layerwise fashion, by training a stack of RBMs. [sent-16, score-0.452]
8 It can be difficult for practitioners to tell whether a given lower layer RBM is a good starting point to build a larger model. [sent-19, score-0.14]
9 We propose a new way of training deep Boltzmann machines called multi-prediction training (MPT). [sent-20, score-0.411]
10 MPT uses the mean field equations for the DBM to induce recurrent nets that are then trained to solve different inference tasks. [sent-21, score-0.485]
11 The resulting trained MP-DBM model can be viewed either as a single probabilistic model trained with a variational criterion, or as a family of recurrent nets that solve related inference tasks. [sent-22, score-0.628]
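To make the recurrent-net view concrete, the following is a minimal sketch (not the authors' code) of unrolled mean field updates for a DBM with two hidden layers and a softmax label attached to the top layer; parameter names, shapes, and the uniform initialization of the mean field estimates are assumptions made for illustration.

import jax
import jax.numpy as jnp

def mean_field(params, v, n_steps=10):
    # Unrolled mean field fixed-point updates for a 2-hidden-layer DBM with a
    # softmax label y attached to the top layer. Each sweep over the layers is
    # one "time step" of the induced recurrent net; v stays clamped throughout.
    batch = v.shape[0]
    h1 = jnp.full((batch, params["W1"].shape[1]), 0.5)
    h2 = jnp.full((batch, params["W2"].shape[1]), 0.5)
    y = jnp.full((batch, params["Wy"].shape[1]), 1.0 / params["Wy"].shape[1])
    for _ in range(n_steps):
        h1 = jax.nn.sigmoid(v @ params["W1"] + h2 @ params["W2"].T + params["b1"])
        h2 = jax.nn.sigmoid(h1 @ params["W2"] + y @ params["Wy"].T + params["b2"])
        y = jax.nn.softmax(h2 @ params["Wy"] + params["by"])
    return h1, h2, y

Backpropagating through these updates, for different choices of which variables are clamped and which are predicted, yields the family of recurrent nets described above.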
12 We find empirically that the MP-DBM does not require greedy layerwise training, so its performance on the final task can be monitored from the start. [sent-23, score-0.245]
13 This makes the approach accessible to practitioners who do not have extensive experience with layerwise pretraining techniques or Markov chains. [sent-28, score-0.328]
14 Anyone with experience minimizing non-convex functions should find MP-DBM training familiar and straightforward. [sent-29, score-0.178]
15 Moreover, we show that inference in the MP-DBM is useful: the MP-DBM does not need an extra classifier built on top of its learned features to obtain good inference accuracy. [sent-30, score-0.17]
16 We show that it outperforms the DBM at solving a variety of inference tasks including classification, classification with missing inputs, and prediction of randomly selected subsets of variables. [sent-31, score-0.188]
17 2 Review of deep Boltzmann machines: Typically, a DBM contains a set of D input features v that are called the visible units because they are always observed during both training and evaluation. [sent-33, score-0.41]
18 These hidden units are usually organized into L layers h(i) of size Ni, i ∈ {1, . . . , L}, with each unit in a layer conditionally independent of the other units in the layer given the neighboring layers. [sent-37, score-0.211] [sent-40, score-0.328]
20 The DBM is trained to maximize the mean field lower bound on log P(v, y). [sent-41, score-0.149]
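For reference, one standard way to write the factorial mean field bound being maximized (our notation, not quoted from the paper):

\log P(v, y) \;\geq\; \mathbb{E}_{h \sim Q}\!\left[\log P(v, y, h)\right] + \mathcal{H}(Q),
\qquad Q(h) = \prod_{i=1}^{L} \prod_{j=1}^{N_i} Q\big(h^{(i)}_j\big).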
21 Unfortunately, training the entire model simultaneously does not seem to be feasible. [sent-42, score-0.152]
22 See [8] for an example of a DBM that has failed to learn using the naive training algorithm. [sent-43, score-0.152]
23 Salakhutdinov and Hinton [18] found that for their joint training procedure to work, the DBM must first be initialized by training one layer at a time. [sent-44, score-0.433]
24 After each layer is trained as an RBM, the RBMs can be modified slightly, assembled into a DBM, and the DBM may be trained with PCD [22, 21] and mean field. [sent-45, score-0.297]
25 Simply running mean field inference to predict y given v in the DBM model does not work nearly as well. [sent-47, score-0.16]
26 See figure 1 for a graphical description of the training procedure used by [18]. [sent-48, score-0.19]
27 The standard approach to training a DBM requires training L + 2 different models using L + 2 different objective functions, and does not yield a single model that excels at answering all queries. [sent-49, score-0.368]
28 Optimization: As a greedy optimization procedure, layerwise training may be suboptimal. [sent-52, score-0.397]
29 In general, for layerwise training to be optimal, the training procedure for each layer must take into account the influence that the deeper layers will provide. [sent-54, score-0.694]
30 The layerwise initialization procedure simply does not attempt to be optimal. [sent-55, score-0.257]
31 Moreover, model architectures incorporating design features such as sparse connections, pooling, or factored multilinear interactions make it difficult to predict how best to structure one layer’s hidden units in order for the next layer to make good use of them. [sent-59, score-0.291]
32 If we have one model that excels at all tasks, we can use inference in this model to answer arbitrary queries, perform classification with missing inputs, and so on. [sent-62, score-0.192]
33 The standard DBM training procedure gives this up by training a rich probabilistic model and then using it as just a feature extractor for an MLP. [sent-63, score-0.369]
34 Simplicity: Needing to implement multiple models and training stages makes the cost of developing software with DBMs greater, and makes using them more cumbersome. [sent-65, score-0.152]
35 Beyond the software engineering considerations, it can be difficult to monitor training and tell what kind of results during layerwise RBM pretraining will correspond to good DBM classification accuracy later. [sent-66, score-0.477]
36 Our joint training procedure allows the user to monitor the model’s ability of interest (usually ability to classify y given v) from the very start of training. [sent-67, score-0.282]
37 1 Multi-prediction Training: Our proposed approach is to directly train the DBM to be good at solving all possible variational inference problems. [sent-70, score-0.213]
38 We call this multi-prediction training because the procedure involves training the model to predict any subset of variables given the complement of that subset of variables. [sent-71, score-0.373]
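Schematically, the multi-prediction criterion can be written as follows, where S is a randomly drawn subset of the variables (v, y) and Q_S is the mean field distribution over the variables in S produced by the corresponding recurrent net (this is our shorthand, not the paper's exact notation):

J(v, y, \theta) \;=\; \mathbb{E}_{S}\Big[ -\log Q_S\big( (v, y)_S \,\big|\, (v, y)_{-S} \big) \Big].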
39 Note that y won’t be observed at test time, only training time. [sent-75, score-0.152]
40 Because each fixed point update uses the output of a previous fixed point update as input, this optimization procedure can be viewed as a recurrent neural network. [sent-87, score-0.22]
41 Sampling O just means drawing an example from the training set. [sent-90, score-0.152]
42 Sampling the subset just means flipping a coin (with probability 0.5) for each variable, to determine whether that variable should be an input to the inference procedure or a prediction target. [sent-92, score-0.123]
43 To compute the gradient, we simply backprop the error derivatives of J through the recurrent net defining Q. [sent-93, score-0.303]
44 See Fig. 2 for a graphical description of this training procedure, and Fig. 3 for an example of the inference procedure run on MNIST digits. [sent-95, score-0.152] [sent-96, score-0.151]
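A compact sketch of one multi-prediction training step is given below. To keep it short it uses a single hidden layer (so the unrolled net is an RBM-style mean field loop rather than a full DBM), treats the label as extra visible dimensions, and samples the prediction mask with independent coin flips as described above; all names and the learning-rate choice are illustrative assumptions.

import jax
import jax.numpy as jnp

def mp_loss(params, x, mask, n_steps=10):
    # x: minibatch of visible variables (features plus one-hot label), shape (B, D).
    # mask: same shape; 1 marks an observed input, 0 marks a prediction target.
    x_obs = x * mask                       # clamp the observed variables
    r = jnp.full_like(x, 0.5)              # mean field estimate of every visible variable
    h = jnp.full((x.shape[0], params["W"].shape[1]), 0.5)
    for _ in range(n_steps):               # unrolled fixed-point updates define the recurrent net
        inp = x_obs + (1.0 - mask) * r     # observed values where available, estimates elsewhere
        h = jax.nn.sigmoid(inp @ params["W"] + params["c"])
        r = jax.nn.sigmoid(h @ params["W"].T + params["b"])
    # cross-entropy measured only on the variables the net was asked to predict
    ce = -(x * jnp.log(r + 1e-7) + (1.0 - x) * jnp.log(1.0 - r + 1e-7))
    return jnp.sum((1.0 - mask) * ce) / x.shape[0]

def mp_step(params, x, key, lr=0.01):
    # Flip a fair coin per variable to pick inputs vs. targets, then backprop J
    # through the unrolled mean field net and take one SGD step.
    mask = jax.random.bernoulli(key, 0.5, x.shape).astype(x.dtype)
    loss, grads = jax.value_and_grad(mp_loss)(params, x, mask)
    return {k: p - lr * grads[k] for k, p in params.items()}, loss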
46 Figure 1: The training procedure used by Salakhutdinov and Hinton [18] on MNIST. [sent-97, score-0.19]
47 Figure 2: Multi-prediction training: This diagram shows the neural nets instantiated to do multi-prediction training on one minibatch of data. [sent-117, score-0.291]
48 Here we show only one iteration to save space, but in a real application MP training should be run with 5-15 iterations. [sent-124, score-0.18]
49 This training procedure is similar to one introduced by Brakel et al. [sent-126, score-0.19]
50 2 The Multi-Inference Trick: Mean field inference can be expensive due to needing to run the fixed point equations several times in order to reach convergence. [sent-131, score-0.137]
51 In this case, we are no longer necessarily minimizing J as written, but rather doing partial training of a large number of fixed-iteration recurrent nets that solve related problems. [sent-133, score-0.427]
52 We can approximately take the geometric mean over all predicted distributions Q (for different subsets Si ) and renormalize in order to combine the predictions of all of these recurrent nets. [sent-134, score-0.25]
53 This way, imperfections in the training procedure are averaged out, and we are able to solve inference tasks even if the corresponding recurrent net was never sampled during MP training. [sent-135, score-0.573]
54 In order to approximate this average efficiently, we simply take the geometric mean at each step of inference, instead of attempting to take the correct geometric mean of the entire inference process. [sent-136, score-0.173]
55 To take the geometric mean over a unit h_j that receives input from v_i, we average together the contribution v_i W_ij from the model that contains v_i and the contribution 0 from the model that does not. [sent-142, score-0.283]
56 For the multi-inference trick, each recurrent net we average over solves a different inference problem. [sent-145, score-0.356]
57 In half of the problems, v_i is observed, and contributes v_i W_ij to h_j's total input. [sent-146, score-0.149]
58 In the other half, v_i is not observed; if we represent the mean field estimate of v_i with r_i, then in this case that unit contributes r_i W_ij to h_j's total input. [sent-149, score-0.163]
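Following the description above, a hidden unit's input under the multi-inference trick is the average of what it receives when v_i is observed (v_i W_ij) and when it is replaced by its mean field estimate (r_i W_ij); for sigmoid units this input averaging is exactly the renormalized geometric mean of the two predicted Bernoulli distributions. A short sketch for one hidden layer (names are ours):

import jax
import jax.numpy as jnp

def multi_inference_hidden(v, r, W, c):
    # v: observed visible values; r: current mean field reconstruction of v.
    # Averaging the two total inputs, 0.5 * (v @ W + r @ W), combines the
    # recurrent nets that do and do not observe each visible unit.
    return jax.nn.sigmoid(0.5 * (v + r) @ W + c)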
59 The main benefit to this approach is that it gives a good way to incorporate information from many recurrent nets trained in slightly different ways. [sent-152, score-0.356]
60 If the recurrent net corresponding to the desired inference task is somewhat suboptimal due to not having been sampled enough during training, its defects can often be remedied by averaging its predictions with those of other similar recurrent nets. [sent-153, score-0.538]
61 In practice, multi-inference mostly seems to be beneficial if the network was trained without letting mean field run to convergence. [sent-155, score-0.153]
62 When the model was trained with converged mean field, each recurrent net is just solving an optimization problem in a graphical model, and it doesn’t matter whether every recurrent net has been individually trained. [sent-156, score-0.667]
63 The multi-inference trick is mostly useful as a cheap alternative when getting the absolute best possible test set accuracy is not as important as fast training and evaluation. [sent-157, score-0.236]
64 3 Justification and advantages: In the case where we run the recurrent net for predicting Q to convergence, the multi-prediction training algorithm follows the gradient of the objective function J. [sent-159, score-0.473]
65 Maximum likelihood should be better if the overall goal is to draw realistic samples from the model, but generalized pseudolikelihood can often be better for training a model to answer queries conditioning on sets similar to the Si used during training. [sent-162, score-0.275]
66 Note that our variational approximation is not quite the same as the way variational approximations are usually applied. [sent-163, score-0.158]
67 We use variational inference to ensure that the distributions we shape using backprop are as close as possible to the true conditionals. [sent-164, score-0.196]
68 This is different from the usual approach to variational learning, where Q is used to define a lower bound on the log likelihood and variational inference is used to make the bound as tight as possible. [sent-165, score-0.243]
69 In the case where the recurrent net is not trained to convergence, there is an alternate way to justify MP training. [sent-166, score-0.352]
70 Rather than doing variational learning on a single probabilistic model, the MP procedure trains a family of recurrent nets to solve related prediction problems by running for some fixed number of iterations. [sent-167, score-0.447]
71 Each recurrent net is trained only on a subset of the data (and most recurrent nets are never trained at all, but only work because they share parameters with the others). [sent-168, score-0.735]
72 In this case, the multi-inference trick allows us to justify MP training as approximately training an ensemble of recurrent nets using bagging. [sent-169, score-0.663]
73 [20] have observed that a training strategy similar to MPT (but lacking the multi-inference trick) is useful because it trains the model to work well with the inference approximations it will be evaluated with at test time. [sent-171, score-0.265]
74 The choice of this type of variational learning combined with the underlying generalized pseudolikelihood objective makes an MP-DBM very well suited for solving approximate inference problems but not very well suited for sampling. [sent-173, score-0.22]
75 Our primary design consideration when developing multi-prediction training was ensuring that the learning rule was state-free. [sent-174, score-0.152]
76 PCD training uses persistent Markov chains to estimate the gradient. [sent-175, score-0.174]
77 The MP training rule does not make any reference to earlier training steps, and can be computed with no burn in. [sent-177, score-0.304]
78 This means that the accuracy of the MP gradient is not dependent on properties of the training algorithm such as the learning rate, which can easily break PCD for many choices of the hyperparameters. [sent-178, score-0.152]
79 We find that for joint training, it is critically important to not do this (on the MNIST dataset, we were not able to find any MP-DBM hyperparameter configuration involving weight decay that performs as well as layerwise DBMs, but without weight decay MP-DBMs outperform DBMs). [sent-186, score-0.296]
80 When the second layer weights are not trained well enough for them to be useful for modeling the data, the weight decay term will drive them to become very small, and they will never have an opportunity to recover. [sent-187, score-0.222]
81 Salakhutdinov and Hinton [18] regularize the activities of the hidden units with a somewhat complicated sparsity penalty. [sent-189, score-0.169]
82 5 Related work: centering. Montavon and Müller [15] showed that an alternative, “centered” representation of the DBM results in successful generative training without a greedy layerwise pretraining step. [sent-197, score-0.643]
83 We therefore evaluate the classification performance of centering in this work. [sent-199, score-0.13]
84 (b) Generic inference tasks: When classifying with missing inputs, the MP-DBM outperforms the other DBMs for most amounts of missing inputs. [sent-236, score-0.243]
85 (c) When using approximate inference to resolve general queries, the standard DBM, centered DBM, and MP-DBM all perform about the same when asked to predict a small number of variables. [sent-237, score-0.166]
86 1 MNIST experiments: In order to compare MP training and centering to standard DBM performance, we cross-validated each of the new methods by running 25 training experiments for each of three conditions: centered DBMs, centered DBMs with the special negative phase (“Centering+”), and MP training. [sent-251, score-0.585]
87 The centered DBMs also required one additional hyperparameter, the number of Gibbs steps to run for variational PCD. [sent-253, score-0.157]
88 We use the same size of model, minibatch and negative chain collection as Salakhutdinov and Hinton [18], with 500 hidden units in the first layer, 1,000 hidden units in the second, 100 examples per minibatch, and 100 negative chains. [sent-255, score-0.432]
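For concreteness, parameter shapes at this scale would look like the following (a hypothetical instantiation of the earlier mean field sketch; the Gaussian initialization scale is an assumption, not taken from the paper):

import jax
import jax.numpy as jnp

D, N1, N2, C = 784, 500, 1000, 10            # MNIST pixels, two hidden layers, 10 classes
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "W1": 0.01 * jax.random.normal(k1, (D, N1)),
    "W2": 0.01 * jax.random.normal(k2, (N1, N2)),
    "Wy": 0.01 * jax.random.normal(k3, (N2, C)),
    "b1": jnp.zeros(N1), "b2": jnp.zeros(N2), "by": jnp.zeros(C),
}
# minibatches of 100 examples, as in the setting described above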
89 On the validation set, MP training consistently performs better and is much less sensitive to hyperparameters than the other methods. [sent-259, score-0.193]
90 If instead of adding an MLP to the model, we simply train a larger MP-DBM with twice as many hidden units in each layer, and apply the multi-inference trick, we obtain a classification error rate of 0. [sent-268, score-0.218]
91 Fig. 6b shows an evaluation of various DBMs' ability to classify with missing inputs. [sent-274, score-0.125]
92 Salakhutdinov and Hinton [18] then trained an RBM with 4,000 binary hidden units and Gaussian visible units to preprocess the data into an all-binary representation, and trained a DBM with two hidden layers of 4,000 units each on this representation. [sent-282, score-0.693]
93 Since the goal of this work is to provide a single unified model and training algorithm, we do not train a separate Gaussian RBM. [sent-283, score-0.201]
94 Instead we train a single MP-DBM with Gaussian visible units and three hidden layers of 4,000 units each. [sent-284, score-0.411]
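With Gaussian rather than binary visible units, the only change to the mean field updates sketched earlier is that the reconstruction of v becomes linear instead of a sigmoid; for the common unit-variance parameterization this is (a standard form, not quoted from the paper):

\hat{v} \;=\; b + W^{(1)} \hat{h}^{(1)} .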
95 It is possible to do better on NORB using convolution or synthetic transformations of the training data. [sent-293, score-0.152]
96 We did not evaluate the effect of these techniques on the MP-DBM because our present goal is not to obtain state-of-the-art object recognition performance but only to verify that our joint training procedure works as well as the layerwise training procedure for DBMs. [sent-294, score-0.599]
97 There is no public demo code available for the standard DBM on this dataset, and we were not able to reproduce the standard DBM results (layerwise DBM training requires significant experience and intuition). [sent-295, score-0.178]
98 We therefore can’t compare the MP-DBM to the original DBM in terms of answering general queries or classification with missing inputs on this dataset. [sent-296, score-0.235]
99 We have verified that MP training outperforms the standard training procedure at classification on the MNIST and NORB datasets where the original DBM was first applied. [sent-298, score-0.342]
100 We have shown that MP training works well with binary, Gaussian, and softmax units, as well as architectures with either two or three hidden layers. [sent-299, score-0.205]
wordName wordTfidf (topN-words)
[('dbm', 0.684), ('layerwise', 0.219), ('dbms', 0.193), ('recurrent', 0.182), ('mp', 0.167), ('training', 0.152), ('mlp', 0.137), ('centering', 0.13), ('salakhutdinov', 0.121), ('units', 0.116), ('hinton', 0.11), ('osi', 0.109), ('eld', 0.103), ('norb', 0.096), ('rbm', 0.095), ('nets', 0.093), ('layer', 0.091), ('net', 0.089), ('inference', 0.085), ('trick', 0.084), ('boltzmann', 0.081), ('trained', 0.081), ('mpt', 0.08), ('missing', 0.079), ('variational', 0.079), ('deep', 0.073), ('bengio', 0.069), ('mnist', 0.068), ('queries', 0.067), ('classi', 0.064), ('vi', 0.06), ('si', 0.06), ('pcd', 0.059), ('pretraining', 0.059), ('pseudolikelihood', 0.056), ('dropout', 0.054), ('inputs', 0.053), ('hidden', 0.053), ('centered', 0.05), ('train', 0.049), ('goodfellow', 0.047), ('minibatch', 0.046), ('brakel', 0.044), ('mean', 0.044), ('layers', 0.042), ('qi', 0.041), ('hyperparameters', 0.041), ('theano', 0.04), ('procedure', 0.038), ('pascanu', 0.037), ('gsns', 0.036), ('montavon', 0.036), ('mpdbm', 0.036), ('ollivier', 0.036), ('stoyanov', 0.036), ('answering', 0.036), ('visible', 0.035), ('generative', 0.035), ('machines', 0.034), ('bergstra', 0.033), ('wij', 0.033), ('bastien', 0.032), ('lamblin', 0.032), ('backprop', 0.032), ('shraibman', 0.032), ('arnold', 0.032), ('hyperparameter', 0.031), ('predict', 0.031), ('unit', 0.03), ('hj', 0.029), ('cation', 0.029), ('ais', 0.028), ('excels', 0.028), ('montr', 0.028), ('trains', 0.028), ('run', 0.028), ('never', 0.027), ('phase', 0.027), ('probabilistic', 0.027), ('mirza', 0.026), ('greedy', 0.026), ('experience', 0.026), ('pixels', 0.025), ('tell', 0.025), ('stage', 0.025), ('maximize', 0.024), ('subsets', 0.024), ('ability', 0.024), ('needing', 0.024), ('practitioners', 0.024), ('negative', 0.024), ('decay', 0.023), ('momentum', 0.023), ('rbms', 0.023), ('predicting', 0.022), ('ller', 0.022), ('monitor', 0.022), ('courville', 0.022), ('classify', 0.022), ('chains', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.25986087 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
3 0.16598836 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
4 0.15353245 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
5 0.14703515 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
6 0.14003336 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
7 0.12809476 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
8 0.1138764 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
9 0.11378286 75 nips-2013-Convex Two-Layer Modeling
10 0.10687508 339 nips-2013-Understanding Dropout
11 0.10555501 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
12 0.10369998 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
13 0.10094737 160 nips-2013-Learning Stochastic Feedforward Neural Networks
14 0.09170568 226 nips-2013-One-shot learning by inverting a compositional causal process
15 0.086786918 64 nips-2013-Compete to Compute
16 0.084461704 318 nips-2013-Structured Learning via Logistic Regression
17 0.08024279 36 nips-2013-Annealing between distributions by averaging moments
18 0.073037207 99 nips-2013-Dropout Training as Adaptive Regularization
19 0.072244622 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
20 0.067566857 5 nips-2013-A Deep Architecture for Matching Short Texts
topicId topicWeight
[(0, 0.179), (1, 0.065), (2, -0.143), (3, -0.091), (4, 0.105), (5, -0.079), (6, -0.021), (7, 0.041), (8, 0.042), (9, -0.149), (10, 0.191), (11, 0.02), (12, -0.092), (13, 0.028), (14, 0.019), (15, 0.045), (16, -0.079), (17, 0.014), (18, 0.004), (19, -0.011), (20, 0.048), (21, -0.026), (22, -0.012), (23, 0.001), (24, -0.013), (25, 0.037), (26, -0.067), (27, 0.001), (28, 0.004), (29, 0.053), (30, -0.035), (31, 0.026), (32, -0.083), (33, -0.024), (34, 0.114), (35, -0.063), (36, -0.055), (37, 0.042), (38, -0.093), (39, 0.032), (40, -0.01), (41, 0.118), (42, -0.024), (43, 0.031), (44, 0.006), (45, 0.105), (46, -0.032), (47, 0.013), (48, 0.085), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.91315329 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.751872 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.73118454 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
4 0.67248017 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
5 0.67058671 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
6 0.62729847 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
7 0.59128428 160 nips-2013-Learning Stochastic Feedforward Neural Networks
8 0.58878577 36 nips-2013-Annealing between distributions by averaging moments
9 0.58120543 251 nips-2013-Predicting Parameters in Deep Learning
10 0.57183683 64 nips-2013-Compete to Compute
11 0.53145987 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
12 0.52565503 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
13 0.52521861 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
14 0.4893426 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
15 0.47840917 226 nips-2013-One-shot learning by inverting a compositional causal process
16 0.46821705 339 nips-2013-Understanding Dropout
17 0.45530623 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
18 0.45080176 75 nips-2013-Convex Two-Layer Modeling
19 0.43551511 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
20 0.41885105 168 nips-2013-Learning to Pass Expectation Propagation Messages
topicId topicWeight
[(2, 0.015), (16, 0.042), (23, 0.185), (33, 0.218), (34, 0.099), (41, 0.022), (49, 0.079), (56, 0.09), (70, 0.034), (85, 0.015), (89, 0.017), (93, 0.086), (95, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.87699652 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.82261664 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.81449538 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
4 0.8142873 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
5 0.81200767 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
6 0.8098737 301 nips-2013-Sparse Additive Text Models with Low Rank Background
7 0.80924356 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
8 0.80760527 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
9 0.80605859 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
10 0.80506659 251 nips-2013-Predicting Parameters in Deep Learning
11 0.80407256 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
12 0.80403584 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
13 0.80367428 183 nips-2013-Mapping paradigm ontologies to and from the brain
14 0.80338985 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.80263382 335 nips-2013-Transfer Learning in a Transductive Setting
16 0.80235696 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
17 0.80120385 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
18 0.80112386 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
19 0.80064911 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
20 0.80060768 36 nips-2013-Annealing between distributions by averaging moments