nips nips2013 nips2013-200 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
Reference: text
sentIndex sentText sentNum sentScore
1 The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. [sent-7, score-0.571]
2 Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. [sent-8, score-0.271]
3 The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks. [sent-9, score-0.368]
4 1 Introduction: A deep Boltzmann machine (DBM) [18] is a structured probabilistic model consisting of many layers of random variables, most of which are latent. [sent-10, score-0.142]
5 DBMs are usually used as feature learners, where the mean field expectations of the hidden units are used as input features to a separate classifier, such as an MLP or logistic regression. [sent-13, score-0.213]
6 Another drawback to the DBM is the complexity of training it. [sent-15, score-0.152]
7 Typically it is trained in a greedy, layerwise fashion, by training a stack of RBMs. [sent-16, score-0.452]
8 It can be difficult for practitioners to tell whether a given lower layer RBM is a good starting point to build a larger model. [sent-19, score-0.14]
9 We propose a new way of training deep Boltzmann machines called multi-prediction training (MPT). [sent-20, score-0.411]
10 MPT uses the mean field equations for the DBM to induce recurrent nets that are then trained to solve different inference tasks. [sent-21, score-0.485]
11 The resulting trained MP-DBM model can be viewed either as a single probabilistic model trained with a variational criterion, or as a family of recurrent nets that solve related inference tasks. [sent-22, score-0.628]
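To make the recurrent-net view concrete, the following is a minimal sketch (not the authors' code) of unrolled mean field updates for a DBM with two hidden layers and a softmax label attached to the top layer; parameter names, shapes, and the uniform initialization of the mean field estimates are assumptions made for illustration.

import jax
import jax.numpy as jnp

def mean_field(params, v, n_steps=10):
    # Unrolled mean field fixed-point updates for a 2-hidden-layer DBM with a
    # softmax label y attached to the top layer. Each sweep over the layers is
    # one "time step" of the induced recurrent net; v stays clamped throughout.
    batch = v.shape[0]
    h1 = jnp.full((batch, params["W1"].shape[1]), 0.5)
    h2 = jnp.full((batch, params["W2"].shape[1]), 0.5)
    y = jnp.full((batch, params["Wy"].shape[1]), 1.0 / params["Wy"].shape[1])
    for _ in range(n_steps):
        h1 = jax.nn.sigmoid(v @ params["W1"] + h2 @ params["W2"].T + params["b1"])
        h2 = jax.nn.sigmoid(h1 @ params["W2"] + y @ params["Wy"].T + params["b2"])
        y = jax.nn.softmax(h2 @ params["Wy"] + params["by"])
    return h1, h2, y

Backpropagating through these updates, for different choices of which variables are clamped and which are predicted, yields the family of recurrent nets described above.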
12 We find empirically that the MP-DBM does not require greedy layerwise training, so its performance on the final task can be monitored from the start. [sent-23, score-0.245]
13 This makes the approach accessible to practitioners who do not have extensive experience with layerwise pretraining techniques or Markov chains. [sent-28, score-0.328]
14 Anyone with experience minimizing non-convex functions should find MP-DBM training familiar and straightforward. [sent-29, score-0.178]
15 Moreover, we show that inference in the MP-DBM is useful: the MP-DBM does not need an extra classifier built on top of its learned features to obtain good inference accuracy. [sent-30, score-0.17]
16 We show that it outperforms the DBM at solving a variety of inference tasks including classification, classification with missing inputs, and prediction of randomly selected subsets of variables. [sent-31, score-0.188]
17 2 Review of deep Boltzmann machines: Typically, a DBM contains a set of D input features v that are called the visible units because they are always observed during both training and evaluation. [sent-33, score-0.41]
18 These hidden units are usually organized into L layers h(i) of size Ni, i ∈ {1, . . . , L}, with each unit in a layer conditionally independent of the other units in the layer given the neighboring layers. [sent-37, score-0.211] [sent-40, score-0.328]
20 The DBM is trained to maximize the mean field lower bound on log P(v, y). [sent-41, score-0.149]
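For reference, one standard way to write the factorial mean field bound being maximized (our notation, not quoted from the paper):

\log P(v, y) \;\geq\; \mathbb{E}_{h \sim Q}\!\left[\log P(v, y, h)\right] + \mathcal{H}(Q),
\qquad Q(h) = \prod_{i=1}^{L} \prod_{j=1}^{N_i} Q\big(h^{(i)}_j\big).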
21 Unfortunately, training the entire model simultaneously does not seem to be feasible. [sent-42, score-0.152]
22 See [8] for an example of a DBM that has failed to learn using the naive training algorithm. [sent-43, score-0.152]
23 Salakhutdinov and Hinton [18] found that for their joint training procedure to work, the DBM must first be initialized by training one layer at a time. [sent-44, score-0.433]
24 After each layer is trained as an RBM, the RBMs can be modified slightly, assembled into a DBM, and the DBM may be trained with PCD [22, 21] and mean field. [sent-45, score-0.297]
25 Simply running mean field inference to predict y given v in the DBM model does not work nearly as well. [sent-47, score-0.16]
26 See figure 1 for a graphical description of the training procedure used by [18]. [sent-48, score-0.19]
27 The standard approach to training a DBM requires training L + 2 different models using L + 2 different objective functions, and does not yield a single model that excels at answering all queries. [sent-49, score-0.368]
28 Optimization: As a greedy optimization procedure, layerwise training may be suboptimal. [sent-52, score-0.397]
29 In general, for layerwise training to be optimal, the training procedure for each layer must take into account the influence that the deeper layers will provide. [sent-54, score-0.694]
30 The layerwise initialization procedure simply does not attempt to be optimal. [sent-55, score-0.257]
31 Moreover, model architectures incorporating design features such as sparse connections, pooling, or factored multilinear interactions make it difficult to predict how best to structure one layer’s hidden units in order for the next layer to make good use of them. [sent-59, score-0.291]
32 If we have one model that excels at all tasks, we can use inference in this model to answer arbitrary queries, perform classification with missing inputs, and so on. [sent-62, score-0.192]
33 The standard DBM training procedure gives this up by training a rich probabilistic model and then using it as just a feature extractor for an MLP. [sent-63, score-0.369]
34 Simplicity: Needing to implement multiple models and training stages makes the cost of developing software with DBMs greater, and makes using them more cumbersome. [sent-65, score-0.152]
35 Beyond the software engineering considerations, it can be difficult to monitor training and tell what kind of results during layerwise RBM pretraining will correspond to good DBM classification accuracy later. [sent-66, score-0.477]
36 Our joint training procedure allows the user to monitor the model’s ability of interest (usually ability to classify y given v) from the very start of training. [sent-67, score-0.282]
37 1 Multi-prediction Training: Our proposed approach is to directly train the DBM to be good at solving all possible variational inference problems. [sent-70, score-0.213]
38 We call this multi-prediction training because the procedure involves training the model to predict any subset of variables given the complement of that subset of variables. [sent-71, score-0.373]
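Schematically, the multi-prediction criterion can be written as follows, where S is a randomly drawn subset of the variables (v, y) and Q_S is the mean field distribution over the variables in S produced by the corresponding recurrent net (this is our shorthand, not the paper's exact notation):

J(v, y, \theta) \;=\; \mathbb{E}_{S}\Big[ -\log Q_S\big( (v, y)_S \,\big|\, (v, y)_{-S} \big) \Big].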
39 Note that y won’t be observed at test time, only training time. [sent-75, score-0.152]
40 Because each fixed point update uses the output of a previous fixed point update as input, this optimization procedure can be viewed as a recurrent neural network. [sent-87, score-0.22]
41 Sampling O just means drawing an example from the training set. [sent-90, score-0.152]
42 Sampling the subset just means flipping a coin (with probability 0.5) for each variable, to determine whether that variable should be an input to the inference procedure or a prediction target. [sent-92, score-0.123]
43 To compute the gradient, we simply backprop the error derivatives of J through the recurrent net defining Q. [sent-93, score-0.303]
44 See Fig. 2 for a graphical description of this training procedure, and Fig. 3 for an example of the inference procedure run on MNIST digits. [sent-95, score-0.152] [sent-96, score-0.151]
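A compact sketch of one multi-prediction training step is given below. To keep it short it uses a single hidden layer (so the unrolled net is an RBM-style mean field loop rather than a full DBM), treats the label as extra visible dimensions, and samples the prediction mask with independent coin flips as described above; all names and the learning-rate choice are illustrative assumptions.

import jax
import jax.numpy as jnp

def mp_loss(params, x, mask, n_steps=10):
    # x: minibatch of visible variables (features plus one-hot label), shape (B, D).
    # mask: same shape; 1 marks an observed input, 0 marks a prediction target.
    x_obs = x * mask                       # clamp the observed variables
    r = jnp.full_like(x, 0.5)              # mean field estimate of every visible variable
    h = jnp.full((x.shape[0], params["W"].shape[1]), 0.5)
    for _ in range(n_steps):               # unrolled fixed-point updates define the recurrent net
        inp = x_obs + (1.0 - mask) * r     # observed values where available, estimates elsewhere
        h = jax.nn.sigmoid(inp @ params["W"] + params["c"])
        r = jax.nn.sigmoid(h @ params["W"].T + params["b"])
    # cross-entropy measured only on the variables the net was asked to predict
    ce = -(x * jnp.log(r + 1e-7) + (1.0 - x) * jnp.log(1.0 - r + 1e-7))
    return jnp.sum((1.0 - mask) * ce) / x.shape[0]

def mp_step(params, x, key, lr=0.01):
    # Flip a fair coin per variable to pick inputs vs. targets, then backprop J
    # through the unrolled mean field net and take one SGD step.
    mask = jax.random.bernoulli(key, 0.5, x.shape).astype(x.dtype)
    loss, grads = jax.value_and_grad(mp_loss)(params, x, mask)
    return {k: p - lr * grads[k] for k, p in params.items()}, loss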
46 Figure 1: The training procedure used by Salakhutdinov and Hinton [18] on MNIST. [sent-97, score-0.19]
47 Figure 2: Multi-prediction training: This diagram shows the neural nets instantiated to do multi-prediction training on one minibatch of data. [sent-117, score-0.291]
48 Here we show only one iteration to save space, but in a real application MP training should be run with 5-15 iterations. [sent-124, score-0.18]
49 This training procedure is similar to one introduced by Brakel et al. [sent-126, score-0.19]
50 2 The Multi-Inference Trick: Mean field inference can be expensive due to needing to run the fixed point equations several times in order to reach convergence. [sent-131, score-0.137]
51 In this case, we are no longer necessarily minimizing J as written, but rather doing partial training of a large number of fixed-iteration recurrent nets that solve related problems. [sent-133, score-0.427]
52 We can approximately take the geometric mean over all predicted distributions Q (for different subsets Si ) and renormalize in order to combine the predictions of all of these recurrent nets. [sent-134, score-0.25]
53 This way, imperfections in the training procedure are averaged out, and we are able to solve inference tasks even if the corresponding recurrent net was never sampled during MP training. [sent-135, score-0.573]
54 In order to approximate this average efficiently, we simply take the geometric mean at each step of inference, instead of attempting to take the correct geometric mean of the entire inference process. [sent-136, score-0.173]
55 To take the geometric mean over a unit h_j that receives input from v_i, we average together the contribution v_i W_ij from the model that contains v_i and the contribution 0 from the model that does not. [sent-142, score-0.283]
56 For the multi-inference trick, each recurrent net we average over solves a different inference problem. [sent-145, score-0.356]
57 In half of the problems, v_i is observed, and contributes v_i W_ij to h_j's total input. [sent-146, score-0.149]
58 In the other half, v_i is not observed; if we represent the mean field estimate of v_i with r_i, then in this case that unit contributes r_i W_ij to h_j's total input. [sent-149, score-0.163]
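Following the description above, a hidden unit's input under the multi-inference trick is the average of what it receives when v_i is observed (v_i W_ij) and when it is replaced by its mean field estimate (r_i W_ij); for sigmoid units this input averaging is exactly the renormalized geometric mean of the two predicted Bernoulli distributions. A short sketch for one hidden layer (names are ours):

import jax
import jax.numpy as jnp

def multi_inference_hidden(v, r, W, c):
    # v: observed visible values; r: current mean field reconstruction of v.
    # Averaging the two total inputs, 0.5 * (v @ W + r @ W), combines the
    # recurrent nets that do and do not observe each visible unit.
    return jax.nn.sigmoid(0.5 * (v + r) @ W + c)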
59 The main benefit to this approach is that it gives a good way to incorporate information from many recurrent nets trained in slightly different ways. [sent-152, score-0.356]
60 If the recurrent net corresponding to the desired inference task is somewhat suboptimal due to not having been sampled enough during training, its defects can often be remedied by averaging its predictions with those of other similar recurrent nets. [sent-153, score-0.538]
61 In practice, multi-inference mostly seems to be beneficial if the network was trained without letting mean field run to convergence. [sent-155, score-0.153]
62 When the model was trained with converged mean field, each recurrent net is just solving an optimization problem in a graphical model, and it doesn’t matter whether every recurrent net has been individually trained. [sent-156, score-0.667]
63 The multi-inference trick is mostly useful as a cheap alternative when getting the absolute best possible test set accuracy is not as important as fast training and evaluation. [sent-157, score-0.236]
64 3 Justification and advantages: In the case where we run the recurrent net for predicting Q to convergence, the multi-prediction training algorithm follows the gradient of the objective function J. [sent-159, score-0.473]
65 Maximum likelihood should be better if the overall goal is to draw realistic samples from the model, but generalized pseudolikelihood can often be better for training a model to answer queries conditioning on sets similar to the Si used during training. [sent-162, score-0.275]
66 Note that our variational approximation is not quite the same as the way variational approximations are usually applied. [sent-163, score-0.158]
67 We use variational inference to ensure that the distributions we shape using backprop are as close as possible to the true conditionals. [sent-164, score-0.196]
68 This is different from the usual approach to variational learning, where Q is used to define a lower bound on the log likelihood and variational inference is used to make the bound as tight as possible. [sent-165, score-0.243]
69 In the case where the recurrent net is not trained to convergence, there is an alternate way to justify MP training. [sent-166, score-0.352]
70 Rather than doing variational learning on a single probabilistic model, the MP procedure trains a family of recurrent nets to solve related prediction problems by running for some fixed number of iterations. [sent-167, score-0.447]
71 Each recurrent net is trained only on a subset of the data (and most recurrent nets are never trained at all, but only work because they share parameters with the others). [sent-168, score-0.735]
72 In this case, the multi-inference trick allows us to justify MP training as approximately training an ensemble of recurrent nets using bagging. [sent-169, score-0.663]
73 [20] have observed that a training strategy similar to MPT (but lacking the multi-inference trick) is useful because it trains the model to work well with the inference approximations it will be evaluated with at test time. [sent-171, score-0.265]
74 The choice of this type of variational learning combined with the underlying generalized pseudolikelihood objective makes an MP-DBM very well suited for solving approximate inference problems but not very well suited for sampling. [sent-173, score-0.22]
75 Our primary design consideration when developing multi-prediction training was ensuring that the learning rule was state-free. [sent-174, score-0.152]
76 PCD training uses persistent Markov chains to estimate the gradient. [sent-175, score-0.174]
77 The MP training rule does not make any reference to earlier training steps, and can be computed with no burn in. [sent-177, score-0.304]
78 This means that the accuracy of the MP gradient is not dependent on properties of the training algorithm such as the learning rate, which can easily break PCD for many choices of the hyperparameters. [sent-178, score-0.152]
79 We find that for joint training, it is critically important to not do this (on the MNIST dataset, we were not able to find any MP-DBM hyperparameter configuration involving weight decay that performs as well as layerwise DBMs, but without weight decay MP-DBMs outperform DBMs). [sent-186, score-0.296]
80 When the second layer weights are not trained well enough for them to be useful for modeling the data, the weight decay term will drive them to become very small, and they will never have an opportunity to recover. [sent-187, score-0.222]
81 Salakhutdinov and Hinton [18] regularize the activities of the hidden units with a somewhat complicated sparsity penalty. [sent-189, score-0.169]
82 5 Related work: centering. Montavon and Müller [15] showed that an alternative, “centered” representation of the DBM results in successful generative training without a greedy layerwise pretraining step. [sent-197, score-0.643]
83 We therefore evaluate the classification performance of centering in this work. [sent-199, score-0.13]
84 (b) Generic inference tasks: When classifying with missing inputs, the MP-DBM outperforms the other DBMs for most amounts of missing inputs. [sent-236, score-0.243]
85 (c) When using approximate inference to resolve general queries, the standard DBM, centered DBM, and MP-DBM all perform about the same when asked to predict a small number of variables. [sent-237, score-0.166]
86 1 MNIST experiments: In order to compare MP training and centering to standard DBM performance, we cross-validated each of the new methods by running 25 training experiments for each of three conditions: centered DBMs, centered DBMs with the special negative phase (“Centering+”), and MP training. [sent-251, score-0.585]
87 The centered DBMs also required one additional hyperparameter, the number of Gibbs steps to run for variational PCD. [sent-253, score-0.157]
88 We use the same size of model, minibatch and negative chain collection as Salakhutdinov and Hinton [18], with 500 hidden units in the first layer, 1,000 hidden units in the second, 100 examples per minibatch, and 100 negative chains. [sent-255, score-0.432]
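For concreteness, parameter shapes at this scale would look like the following (a hypothetical instantiation of the earlier mean field sketch; the Gaussian initialization scale is an assumption, not taken from the paper):

import jax
import jax.numpy as jnp

D, N1, N2, C = 784, 500, 1000, 10            # MNIST pixels, two hidden layers, 10 classes
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "W1": 0.01 * jax.random.normal(k1, (D, N1)),
    "W2": 0.01 * jax.random.normal(k2, (N1, N2)),
    "Wy": 0.01 * jax.random.normal(k3, (N2, C)),
    "b1": jnp.zeros(N1), "b2": jnp.zeros(N2), "by": jnp.zeros(C),
}
# minibatches of 100 examples, as in the setting described above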
89 On the validation set, MP training consistently performs better and is much less sensitive to hyperparameters than the other methods. [sent-259, score-0.193]
90 If instead of adding an MLP to the model, we simply train a larger MP-DBM with twice as many hidden units in each layer, and apply the multi-inference trick, we obtain a classification error rate of 0. [sent-268, score-0.218]
91 Fig. 6b shows an evaluation of various DBMs' ability to classify with missing inputs. [sent-274, score-0.125]
92 Salakhutdinov and Hinton [18] then trained an RBM with 4,000 binary hidden units and Gaussian visible units to preprocess the data into an all-binary representation, and trained a DBM with two hidden layers of 4,000 units each on this representation. [sent-282, score-0.693]
93 Since the goal of this work is to provide a single unified model and training algorithm, we do not train a separate Gaussian RBM. [sent-283, score-0.201]
94 Instead we train a single MP-DBM with Gaussian visible units and three hidden layers of 4,000 units each. [sent-284, score-0.411]
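With Gaussian rather than binary visible units, the only change to the mean field updates sketched earlier is that the reconstruction of v becomes linear instead of a sigmoid; for the common unit-variance parameterization this is (a standard form, not quoted from the paper):

\hat{v} \;=\; b + W^{(1)} \hat{h}^{(1)} .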
95 It is possible to do better on NORB using convolution or synthetic transformations of the training data. [sent-293, score-0.152]
96 We did not evaluate the effect of these techniques on the MP-DBM because our present goal is not to obtain state-of-the-art object recognition performance but only to verify that our joint training procedure works as well as the layerwise training procedure for DBMs. [sent-294, score-0.599]
97 There is no public demo code available for the standard DBM on this dataset, and we were not able to reproduce the standard DBM results (layerwise DBM training requires significant experience and intuition). [sent-295, score-0.178]
98 We therefore can’t compare the MP-DBM to the original DBM in terms of answering general queries or classification with missing inputs on this dataset. [sent-296, score-0.235]
99 We have verified that MP training outperforms the standard training procedure at classification on the MNIST and NORB datasets where the original DBM was first applied. [sent-298, score-0.342]
100 We have shown that MP training works well with binary, Gaussian, and softmax units, as well as architectures with either two or three hidden layers. [sent-299, score-0.205]
wordName wordTfidf (topN-words)
[('dbm', 0.684), ('layerwise', 0.219), ('dbms', 0.193), ('recurrent', 0.182), ('mp', 0.167), ('training', 0.152), ('mlp', 0.137), ('centering', 0.13), ('salakhutdinov', 0.121), ('units', 0.116), ('hinton', 0.11), ('osi', 0.109), ('eld', 0.103), ('norb', 0.096), ('rbm', 0.095), ('nets', 0.093), ('layer', 0.091), ('net', 0.089), ('inference', 0.085), ('trick', 0.084), ('boltzmann', 0.081), ('trained', 0.081), ('mpt', 0.08), ('missing', 0.079), ('variational', 0.079), ('deep', 0.073), ('bengio', 0.069), ('mnist', 0.068), ('queries', 0.067), ('classi', 0.064), ('vi', 0.06), ('si', 0.06), ('pcd', 0.059), ('pretraining', 0.059), ('pseudolikelihood', 0.056), ('dropout', 0.054), ('inputs', 0.053), ('hidden', 0.053), ('centered', 0.05), ('train', 0.049), ('goodfellow', 0.047), ('minibatch', 0.046), ('brakel', 0.044), ('mean', 0.044), ('layers', 0.042), ('qi', 0.041), ('hyperparameters', 0.041), ('theano', 0.04), ('procedure', 0.038), ('pascanu', 0.037), ('gsns', 0.036), ('montavon', 0.036), ('mpdbm', 0.036), ('ollivier', 0.036), ('stoyanov', 0.036), ('answering', 0.036), ('visible', 0.035), ('generative', 0.035), ('machines', 0.034), ('bergstra', 0.033), ('wij', 0.033), ('bastien', 0.032), ('lamblin', 0.032), ('backprop', 0.032), ('shraibman', 0.032), ('arnold', 0.032), ('hyperparameter', 0.031), ('predict', 0.031), ('unit', 0.03), ('hj', 0.029), ('cation', 0.029), ('ais', 0.028), ('excels', 0.028), ('montr', 0.028), ('trains', 0.028), ('run', 0.028), ('never', 0.027), ('phase', 0.027), ('probabilistic', 0.027), ('mirza', 0.026), ('greedy', 0.026), ('experience', 0.026), ('pixels', 0.025), ('tell', 0.025), ('stage', 0.025), ('maximize', 0.024), ('subsets', 0.024), ('ability', 0.024), ('needing', 0.024), ('practitioners', 0.024), ('negative', 0.024), ('decay', 0.023), ('momentum', 0.023), ('rbms', 0.023), ('predicting', 0.022), ('ller', 0.022), ('monitor', 0.022), ('courville', 0.022), ('classify', 0.022), ('chains', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.25986087 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects’ choices, on a trial-totrial basis, are best captured by a “forgetful” Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects’ trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
3 0.16598836 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
4 0.15353245 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
5 0.14703515 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
6 0.14003336 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
7 0.12809476 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
8 0.1138764 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
9 0.11378286 75 nips-2013-Convex Two-Layer Modeling
10 0.10687508 339 nips-2013-Understanding Dropout
11 0.10555501 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
12 0.10369998 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
13 0.10094737 160 nips-2013-Learning Stochastic Feedforward Neural Networks
14 0.09170568 226 nips-2013-One-shot learning by inverting a compositional causal process
15 0.086786918 64 nips-2013-Compete to Compute
16 0.084461704 318 nips-2013-Structured Learning via Logistic Regression
17 0.08024279 36 nips-2013-Annealing between distributions by averaging moments
18 0.073037207 99 nips-2013-Dropout Training as Adaptive Regularization
19 0.072244622 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
20 0.067566857 5 nips-2013-A Deep Architecture for Matching Short Texts
topicId topicWeight
[(0, 0.179), (1, 0.065), (2, -0.143), (3, -0.091), (4, 0.105), (5, -0.079), (6, -0.021), (7, 0.041), (8, 0.042), (9, -0.149), (10, 0.191), (11, 0.02), (12, -0.092), (13, 0.028), (14, 0.019), (15, 0.045), (16, -0.079), (17, 0.014), (18, 0.004), (19, -0.011), (20, 0.048), (21, -0.026), (22, -0.012), (23, 0.001), (24, -0.013), (25, 0.037), (26, -0.067), (27, 0.001), (28, 0.004), (29, 0.053), (30, -0.035), (31, 0.026), (32, -0.083), (33, -0.024), (34, 0.114), (35, -0.063), (36, -0.055), (37, 0.042), (38, -0.093), (39, 0.032), (40, -0.01), (41, 0.118), (42, -0.024), (43, 0.031), (44, 0.006), (45, 0.105), (46, -0.032), (47, 0.013), (48, 0.085), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.91315329 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.751872 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.73118454 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
4 0.67248017 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
5 0.67058671 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
6 0.62729847 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
7 0.59128428 160 nips-2013-Learning Stochastic Feedforward Neural Networks
8 0.58878577 36 nips-2013-Annealing between distributions by averaging moments
9 0.58120543 251 nips-2013-Predicting Parameters in Deep Learning
10 0.57183683 64 nips-2013-Compete to Compute
11 0.53145987 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
12 0.52565503 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
13 0.52521861 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
14 0.4893426 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
15 0.47840917 226 nips-2013-One-shot learning by inverting a compositional causal process
16 0.46821705 339 nips-2013-Understanding Dropout
17 0.45530623 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
18 0.45080176 75 nips-2013-Convex Two-Layer Modeling
19 0.43551511 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
20 0.41885105 168 nips-2013-Learning to Pass Expectation Propagation Messages
topicId topicWeight
[(2, 0.015), (16, 0.042), (23, 0.185), (33, 0.218), (34, 0.099), (41, 0.022), (49, 0.079), (56, 0.09), (70, 0.034), (85, 0.015), (89, 0.017), (93, 0.086), (95, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.87699652 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1
2 0.82261664 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.81449538 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
4 0.8142873 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
5 0.81200767 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
6 0.8098737 301 nips-2013-Sparse Additive Text Models with Low Rank Background
7 0.80924356 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
8 0.80760527 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
9 0.80605859 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
10 0.80506659 251 nips-2013-Predicting Parameters in Deep Learning
11 0.80407256 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
12 0.80403584 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
13 0.80367428 183 nips-2013-Mapping paradigm ontologies to and from the brain
14 0.80338985 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.80263382 335 nips-2013-Transfer Learning in a Transductive Setting
16 0.80235696 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
17 0.80120385 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
18 0.80112386 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
19 0.80064911 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
20 0.80060768 36 nips-2013-Annealing between distributions by averaging moments