nips nips2013 nips2013-160 knowledge-graph by maker-knowledge-mining

160 nips-2013-Learning Stochastic Feedforward Neural Networks


Source: pdf

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expression datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improve classification and can learn to generate colorful textures of objects. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. [sent-13, score-0.499]

2 In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. [sent-15, score-0.649]

3 A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. [sent-16, score-0.277]

4 Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. [sent-17, score-0.247]

5 Since the nonlinear activations are all deterministic, MLPs model the conditional distribution p(Y |X) with a unimodal assumption (e.g. Gaussian). [sent-21, score-0.147]

6 For many structured prediction problems, we are interested in a conditional distribution p(Y |X) that is multimodal and may have complicated structure2 . [sent-24, score-0.176]

7 One way to model the multi-modality is to make the hidden variables stochastic. [sent-25, score-0.224]

8 Conditioned on a particular input X, different hidden configurations lead to different Y . [sent-26, score-0.224]

9 With binary input, hidden, and output variables, they can be viewed as directed graphical models where the sigmoid function is used to compute the degrees of “belief” of a child variable given the parent nodes. [sent-28, score-0.141]

10 The original paper by Neal [2] proposed a Gibbs sampler which cycles through the hidden nodes one at a time. [sent-30, score-0.278]

11 A variational learning algorithm based on … (Footnote 2: For example, in an MLP with one input, one output and one hidden layer: p(y|x) ∼ N(y | µ_y, σ_y²), where µ_y = σ(W_2 σ(W_1 x)) and σ(a) = 1/(1 + exp(−a)) is the sigmoid function.) [sent-33, score-0.397]
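
To make the unimodality concrete, here is a minimal NumPy sketch (not the authors' code; layer sizes are illustrative assumptions) of the one-hidden-layer MLP regressor from the footnote: whatever x is, the network outputs a single Gaussian mean, so p(y|x) has exactly one mode.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_gaussian_mean(x, W1, W2):
    """Deterministic MLP from the footnote: mu_y = sigmoid(W2 @ sigmoid(W1 @ x)).

    Every unit is deterministic, so for a fixed x the predictive distribution
    N(y | mu_y, sigma_y^2) has exactly one mode, however nonlinear the network is.
    """
    return sigmoid(W2 @ sigmoid(W1 @ x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # 2 inputs -> 3 hidden units (illustrative sizes)
W2 = rng.normal(size=(1, 3))   # 3 hidden units -> 1 output
mu_y = mlp_gaussian_mean(np.array([0.5, -1.0]), W1, W2)
```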

12 Figure 1: Stochastic Feedforward Neural Networks. [sent-36, score-0.174]

13 Red nodes are stochastic and binary, while the rest of the hiddens are deterministic sigmoid nodes. [sent-38, score-0.37]

14 A drawback of the variational approach is that, similar to Gibbs, it has to cycle through the hidden nodes one at a time. [sent-42, score-0.337]

15 In this paper, we introduce the Stochastic Feedforward Neural Network (SFNN) for modeling conditional distributions p(y|x) over a continuous, real-valued output space Y. [sent-45, score-0.175]

16 Unlike SBNs, to better model continuous data, SFNNs have hidden layers with both stochastic and deterministic units. [sent-46, score-0.474]

17 1 shows a diagram of SFNNs with multiple hidden layers. [sent-48, score-0.224]

18 Given an input vector x, different states of the stochastic units can generate different modes in Y. [sent-49, score-0.204]

19 • Stochastic units form a distributed code to represent an exponential number of mixture components in output space. [sent-53, score-0.219]

20 • Combination of stochastic and deterministic hidden units can be jointly trained using the backpropagation algorithm, as in standard feed-forward neural networks. [sent-55, score-0.556]

21 Note that Gaussian Processes [7] and Gaussian Random Fields [8] are unimodal and therefore incapable of modeling a multimodal Y . [sent-57, score-0.145]

22 MDNs use a mixture of Gaussians to represent the output Y . [sent-62, score-0.132]

23 However, the number of mixture components in the output Y space must be pre-specified and the number of parameters is linear in the number of mixture components. [sent-65, score-0.24]

24 In contrast, with Nh stochastic hidden nodes, SFNNs can use their distributed representation to model up to 2^Nh mixture components in the output Y. [sent-66, score-0.47]
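
A rough, illustrative parameter count (my own example, not from the paper) makes the contrast concrete: an MDN pays for every component it can ever use, while Nh stochastic binary units index up to 2^Nh components with a parameter count that grows only linearly in Nh.

```python
# Illustrative counting only; real architectures add biases and extra hidden layers.
dim_y, n_stochastic, k_mdn = 100, 20, 1024

# MDN output head: each of K components needs a mean, a variance and a mixing weight.
mdn_params = k_mdn * (2 * dim_y + 1)

# SFNN-style output head: one weight matrix from the Nh binary units to the mean,
# yet the 2**Nh binary configurations give up to that many distinct component means.
sfnn_params = n_stochastic * dim_y + dim_y
max_components = 2 ** n_stochastic            # about 1e6 for Nh = 20

print(mdn_params, sfnn_params, max_components)
```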

25 2 Stochastic Feedforward Neural Networks: SFNNs contain binary stochastic hidden variables h ∈ {0, 1}^Nh, where Nh is the number of hidden nodes. [sent-67, score-0.535]

26 For clarity of presentation, we construct a SFNN from a one-hidden-layer MLP by replacing the sigmoid nodes with stochastic binary ones. [sent-68, score-0.231]

27 Note that other types of stochastic units can also be used. [sent-69, score-0.147]

28 The conditional distribution of interest, p(y|x), is obtained by marginalizing out the latent stochastic hidden variables: p(y|x) = Σ_h p(y, h|x). [sent-70, score-0.403]
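
Enumerating all 2^Nh terms in that sum is intractable for realistic Nh, so in practice one samples. The sketch below (a single stochastic layer with assumed shapes, not the paper's full architecture) draws y ~ p(y|x) by ancestral sampling: first h ~ p(h|x), then y ~ p(y|h, x).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_y_given_x(x, W1, W2, b2, sigma_y, rng):
    """Ancestral sample: h ~ Bernoulli(sigmoid(W1 @ x)), then y ~ N(W2 @ h + b2, sigma_y^2).

    Different binary configurations h select different Gaussian components, so
    repeated calls with the same x can land in different modes of p(y|x).
    """
    p_h = sigmoid(W1 @ x)                              # conditional prior p(h|x)
    h = (rng.random(p_h.shape) < p_h).astype(float)    # sample the binary hiddens
    mu = W2 @ h + b2                                   # mean of the selected component
    return mu + sigma_y * rng.normal(size=mu.shape)

rng = np.random.default_rng(1)
W1, W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=(1, 8)), np.zeros(1)
samples = [sample_y_given_x(np.array([0.3, 0.7]), W1, W2, b2, 0.1, rng) for _ in range(5)]
```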

29 Since we have a mixture model with potentially 2^Nh components conditioned on any x, p(y|x) does not have a closed-form expression. [sent-77, score-0.171]

30 However, using only discrete hiddens is suboptimal when modeling real-valued output Y . [sent-82, score-0.143]

31 This is due to the fact that while y is continuous, there are only a finite number of discrete hidden states, each one (e.g. a particular configuration h′) defining its own Gaussian component. [sent-83, score-0.224]

32 The mean of a Gaussian component is a function of the hidden state: µ(h′) = W_2^T h′ + b_2. [sent-86, score-0.224]

33 When x varies, only the probability of choosing a specific hidden state h′ changes via p(h′|x), not µ(h′). [sent-87, score-0.224]

34 However, if we allow µ(h′) to be a deterministic function of x as well, we can learn a smoother p(y|x), even when it is desirable to learn small residual variances σ_y². [sent-88, score-0.186]

35 This can be accomplished by allowing for both stochastic and deterministic units in a single SFNN hidden layer, allowing the mean µ(h′, x) to have contributions from two components: one from the hidden state h′, and another from a deterministic mapping of x. [sent-89, score-0.753]
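
A sketch of this hybrid layer under assumed shapes (not the authors' implementation): the mean receives one additive contribution from the sampled binary units and another from a purely deterministic path, so µ(h′, x) can vary smoothly with x even for a fixed hidden configuration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hybrid_layer_mean(x, W_stoch, W_det, V_stoch, V_det, b, rng):
    """mu(h, x) = V_stoch @ h + V_det @ d(x) + b, with h binary/stochastic and d(x) deterministic."""
    h = (rng.random(W_stoch.shape[0]) < sigmoid(W_stoch @ x)).astype(float)  # stochastic units
    d = sigmoid(W_det @ x)                                                   # deterministic units
    return V_stoch @ h + V_det @ d + b, h

rng = np.random.default_rng(2)
x = np.array([0.1, -0.4])
W_stoch, W_det = rng.normal(size=(4, 2)), rng.normal(size=(16, 2))
V_stoch, V_det, b = rng.normal(size=(1, 4)), rng.normal(size=(1, 16)), np.zeros(1)
mu, h = hybrid_layer_mean(x, W_stoch, W_det, V_stoch, V_det, b, rng)
```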

36 In SFNNs with only one hidden layer, p(h|x) is a factorial Bernoulli distribution. [sent-91, score-0.251]

37 We can increase the entropy over the stochastic hidden variables by adding a second hidden layer. [sent-93, score-0.535]

38 The second hidden layer takes the stochastic and any deterministic hidden nodes of the first layer as its input. [sent-94, score-0.768]

39 In our SFNNs, we assume a conditional diagonal Gaussian distribution for the output Y: log p(y|h, x) ∝ −(1/2) Σ_i log σ_i² − (1/2) Σ_i (y_i − µ_i(h, x))² / σ_i². [sent-97, score-0.143]
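
In code, that diagonal Gaussian log-density (up to the additive −(D/2) log 2π constant) might look like the sketch below; `mu` would come from the network given (h, x) and `sigma2` holds the per-dimension variances.

```python
import numpy as np

def log_p_y_given_h_x(y, mu, sigma2):
    """-(1/2) * sum_i log sigma_i^2 - (1/2) * sum_i (y_i - mu_i)^2 / sigma_i^2."""
    return -0.5 * np.sum(np.log(sigma2)) - 0.5 * np.sum((y - mu) ** 2 / sigma2)

print(log_p_y_given_h_x(np.array([0.2, 0.9]), np.array([0.25, 0.8]), np.array([0.01, 0.01])))
```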

40 Specifically, importance sampling is used during the E-step to approximate the posterior p(h|y, x), while the Backprop algorithm is used during the M-step to calculate the derivatives of the parameters of both the stochastic and deterministic nodes. [sent-104, score-0.331]
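
A minimal sketch of that E-step for the toy single-stochastic-layer parameterization used above (an assumption, not the paper's multi-layer network): samples are drawn from the conditional prior p(h|x) and re-weighted by how well they explain y, and the normalized weights stand in for the intractable posterior p(h|y, x) when forming M-step gradients.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def e_step_importance_weights(x, y, W1, W2, b2, sigma2, M, rng):
    """Draw M samples h^(m) ~ p(h|x) and weight each one by p(y | h^(m), x)."""
    p_h = sigmoid(W1 @ x)
    H = (rng.random((M, p_h.shape[0])) < p_h).astype(float)       # M proposal samples
    mus = H @ W2.T + b2                                           # (M, dim_y) component means
    log_w = -0.5 * np.sum(np.log(sigma2) + (y - mus) ** 2 / sigma2, axis=1)
    w = np.exp(log_w - log_w.max())                               # subtract max for stability
    return H, w / w.sum()                                         # normalized importance weights

rng = np.random.default_rng(3)
W1, W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=(1, 8)), np.zeros(1)
H, w = e_step_importance_weights(np.array([0.3, 0.7]), np.array([0.5]),
                                 W1, W2, b2, np.array([0.05]), M=30, rng=rng)
```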

41 The drawback of our learning algorithm is the requirement of sampling the stochastic nodes M times for every weight update. [sent-106, score-0.227]

42 As a comparison, energy based models, such as conditional Restricted Boltzmann Machines, require MCMC sampling per weight update to estimate the gradient of the log-partition function. [sent-109, score-0.177]

43 For clarity, we provide the following derivations for SFNNs with one hidden layer containing only stochastic nodes. [sent-111, score-0.361]

44 It is straightforward to extend the model to SFNNs with multiple and hybrid hidden layers. [sent-113, score-0.224]

45 While the posterior p(h|y, x) is hard to compute, the “conditional prior” p(h|x) is easy (corresponds to a simple feedforward pass). [sent-116, score-0.131]

46 Instead, it is critical to use importance sampling with the conditional prior as the proposal distribution. [sent-119, score-0.266]

47 In SFNNs with a mixture of deterministic and stochastic units, backprop will additionally propagate error information from the first part to the second part. [sent-129, score-0.282]

48 2 Cooperation during learning We note that for importance sampling to work well in general, a key requirement is that the proposal distribution is not small where the true distribution has significant mass. [sent-135, score-0.174]

49 Let us hypothesize that at a particular learning iteration, the conditional prior p(h|x) is small in certain regions where p(h|y, x) is large, which is undesirable for importance sampling. [sent-139, score-0.184]

50 While all samples h(m) will have very low log-likelihood due to the bad conditional prior, there will be a certain preferred state ĥ with the largest weight. [sent-142, score-0.136]

51 Maximizing Eq. 5 will accomplish two things: (1) it will adjust the generative weights to allow preferred states to better generate the observed y; (2) it will make the conditional prior better by making it more likely to predict ĥ given x. [sent-144, score-0.145]
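
In the generalized EM view, these two effects come from the M-step increasing the importance-weighted complete-data log-likelihood Σ_m w_m log p(y, h^(m)|x). A hedged sketch of one such gradient step for the output weights of the toy model above (a full implementation would backpropagate through every deterministic layer and also update the p(h|x) weights):

```python
import numpy as np

def m_step_update_W2(y, H, w, W2, b2, sigma2, lr=0.01):
    """One ascent step on sum_m w_m * log p(y | h^(m), x) with respect to W2."""
    mus = H @ W2.T + b2                     # (M, dim_y) means, one per sampled h
    err = (y - mus) / sigma2                # Gaussian residual scaled by the variance
    grad_W2 = (w[:, None] * err).T @ H      # importance-weighted sum of outer products
    return W2 + lr * grad_W2

# Reusing y, H, w, W2, b2, sigma2 from the E-step sketch above:
# W2 = m_step_update_W2(np.array([0.5]), H, w, W2, b2, np.array([0.05]))
```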

52 Blue stars are the training data, red pluses are exact samples from SFNNs. [sent-147, score-0.136]

53 The cooperative interaction between the conditional prior and posterior during learning provides some robustness to the importance sampler. [sent-151, score-0.253]

54 30 importance samples are used during learning with 2 hidden layers of 5 stochastic nodes. [sent-157, score-0.552]

55 We then use SFNNs to model face images with varying facial expressions and emotions. [sent-168, score-0.28]

56 By drawing samples from these trained SFNNs, we obtain qualitative results and insights into the modeling capacity of SFNNs. [sent-172, score-0.187]

57 Dataset B has a large number of tight modes conditioned on any given x, which is useful for testing a model’s ability to learn many modes and a small residual variance. [sent-179, score-0.252]

58 SBN is a Sigmoid Belief Net with three hidden stochastic binary layers between the input and the output layer. [sent-186, score-0.446]

59 It is trained in the same way as SFNN, but there are no deterministic units. [sent-187, score-0.138]

60 Finally, SFNN has four hidden layers, with the inner two being hybrid stochastic/deterministic layers (See Fig. [sent-188, score-0.224]

61 We used 30 importance samples to approximate the posterior during the E-step. [sent-190, score-0.2]

62 Comparing SBNs to SFNNs, it is clear that having deterministic hidden nodes is a big win for modeling continuous y. [sent-196, score-0.417]

63 2 Modeling Facial Expression Conditioned on a subject’s face with neutral expression, the distribution of all possible emotions or expressions of this particular individual is multimodal in pixel space. [sent-229, score-0.208]

64 We learn SFNNs to model facial expressions in the Toronto Face Database [16]. [sent-230, score-0.187]

65 We randomly selected 100 subjects with 1385 total images for training, while 24 subjects with a total of 344 images were selected as the test set. [sent-233, score-0.208]

66 For each subject, we take the average of their face images as x (mean face), and learn to model this subject’s varying expressions y. [sent-234, score-0.209]

67 We trained a SFNN with 4 hidden layers of size 128 on these facial expression images. [sent-236, score-0.505]

68 The second and third “hybrid” hidden layers contained 32 stochastic binary and 96 deterministic hidden nodes, while the first and the fourth hidden layers consisted of only deterministic sigmoids. [sent-237, score-1.085]

69 We also tested the same model but with only one hybrid hidden layer, which we call SFNN1. [sent-239, score-0.286]

70 We used mini-batches of size 100 and 30 importance samples for the E-step. [sent-240, score-0.157]

71 For the Mixture of Factor Analyzers model, we trained a mixture with 100 components, one for each training individual. [sent-245, score-0.171]

72 Given a new test face x_test, we first find the training x̂ which is closest in Euclidean distance. [sent-246, score-0.175]
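
This nearest-mean-face lookup is just a Euclidean nearest-neighbour search in pixel space; a small sketch (array names are assumptions):

```python
import numpy as np

def closest_mean_face(x_test, train_mean_faces):
    """Return the training mean face closest to x_test in Euclidean distance."""
    dists = np.linalg.norm(train_mean_faces - x_test, axis=1)
    return train_mean_faces[np.argmin(dists)]

# train_mean_faces: (n_subjects, n_pixels) array of per-subject mean faces (assumed layout).
```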

73 The number of Gaussian mixture components and the number of hidden nodes were selected using a validation set. [sent-249, score-0.386]

74 The optimal number of hidden units, selected via validation, was 1024. [sent-252, score-0.224]

75 A population sparsity objective on the hidden activations was also part of the objective [19]. [sent-253, score-0.224]

76 Optimization used stochastic gradient descent with mini-batches of 100 samples each. [sent-255, score-0.178]

77 We also recorded the total training time of each algorithm, although this depends on the … (Table 2: Average test log-probability and total training time on facial expression images.) [sent-257, score-0.232]

78 Having two hybrid hidden layers (SFNN2) improves model performance over SFNN1, which has only one hybrid hidden layer. [sent-264, score-0.656]

79 The leftmost column are the mean faces of 3 test subjects, followed by 7 samples from the distribution p(y|x). [sent-272, score-0.125]

80 We can see that having about 500 samples is reasonable, but more samples provide a slightly better estimate. [sent-293, score-0.13]

81 While 500 or more samples are needed for accurate model evaluation, only 20 or 30 samples are sufficient for learning good models (as shown in Fig. [sent-296, score-0.13]
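
The quantity being estimated here is the test log-probability, log p(y|x) ≈ log[(1/M) Σ_m p(y|h^(m), x)] with h^(m) ~ p(h|x). A sketch with a hand-rolled log-sum-exp for numerical stability, reusing the toy single-layer parameterization from earlier:

```python
import numpy as np

def estimate_log_p_y_given_x(x, y, W1, W2, b2, sigma2, M, rng):
    """Importance-sampling estimate of log p(y|x) with p(h|x) as the proposal."""
    p_h = 1.0 / (1.0 + np.exp(-(W1 @ x)))
    H = (rng.random((M, p_h.shape[0])) < p_h).astype(float)
    mus = H @ W2.T + b2
    log_py_h = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (y - mus) ** 2 / sigma2, axis=1)
    m = log_py_h.max()                                   # log-sum-exp trick
    return m + np.log(np.mean(np.exp(log_py_h - m)))     # log of the sample average
```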

82 In Fig. 5(c), we varied the number of binary stochastic hidden variables in the 2 inner hybrid layers. [sent-301, score-0.373]

83 With more hidden nodes, over-fitting can also be a problem. [sent-303, score-0.224]

84 Expression Classification: The internal hidden representations learned by SFNNs are also useful for classification of facial expressions. [sent-306, score-0.327]

85 We then append the learned hidden features of SFNNs and C-GRBMs to the image pixels and re-train the same classifiers. [sent-311, score-0.259]

86 Adding hidden features from the SFNN trained in an unsupervised manner (without expression labels) improves accuracy for both linear and nonlinear classifiers. [sent-313, score-0.318]

87 … statistically significant compared to the competitor models. Right: generated y images from the expected hidden activations. [sent-355, score-0.224]

88 Conditioned on a given foreground mask, the appearance is multimodal (different color and texture). [sent-358, score-0.141]

89 … we can compute the expected values of the binary stochastic hidden variables given the corrupted test y images. [sent-366, score-0.343]

90 In Fig. 6, we show the corresponding generated y from the inferred average hidden states. [sent-368, score-0.224]
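
The occlusion experiment can reuse the same importance machinery: the expected binary hidden state is the weight-averaged sample, and pushing that average back through the generative mean gives the reconstructed y. A sketch under the same toy parameterization (the paper's multi-layer network would instead propagate through all of its layers):

```python
import numpy as np

def expected_hiddens_and_reconstruction(y_corrupted, x, W1, W2, b2, sigma2, M, rng):
    """E[h | y, x] ~= sum_m w_m h^(m); then regenerate y from the averaged hidden state."""
    p_h = 1.0 / (1.0 + np.exp(-(W1 @ x)))
    H = (rng.random((M, p_h.shape[0])) < p_h).astype(float)
    mus = H @ W2.T + b2
    log_w = -0.5 * np.sum(np.log(sigma2) + (y_corrupted - mus) ** 2 / sigma2, axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    h_expected = w @ H                        # importance-weighted average hidden state
    y_reconstructed = W2 @ h_expected + b2    # generated y from the averaged activations
    return h_expected, y_reconstructed
```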

91 Additional Qualitative Experiments: Not only are SFNNs capable of modeling facial expressions of aligned face images, they can also model complex real-valued conditional distributions. [sent-376, score-0.351]

92 Here, we present some qualitative samples drawn from SFNNs trained on more complicated distributions (an additional example on rotated faces is presented in the Supplementary Materials). [sent-377, score-0.183]

93 We trained SFNNs to generate colorful images of common objects from the Amsterdam Library of Objects database [20], conditioned on the foreground masks. [sent-378, score-0.372]

94 For every object, we selected the image under frontal lighting without any rotations, and trained a SFNN conditioned on the foreground mask. [sent-381, score-0.205]

95 Of the 1000 objects, there are many objects with similar foreground masks (e. [sent-383, score-0.15]

96 We also tested on the Weizmann segmentation database [21] of horses, learning a conditional distribution of horse appearances conditioned on the segmentation mask. [sent-388, score-0.281]

97 4 Discussions: In this paper we introduced a novel model with hybrid stochastic and deterministic hidden nodes. [sent-391, score-0.452]

98 We have also proposed an efficient learning algorithm that allows us to learn rich multi-modal conditional distributions, supported by quantitative and qualitative empirical results. [sent-392, score-0.155]

99 The major drawback of SFNNs is that inference is not trivial and M samples are needed for the importance sampler. [sent-393, score-0.184]

100 Gaussian fields for approximate inference in layered sigmoid belief networks. [sent-418, score-0.143]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sfnns', 0.638), ('sfnn', 0.439), ('hidden', 0.224), ('old', 0.154), ('facial', 0.103), ('sbns', 0.1), ('conditional', 0.092), ('importance', 0.092), ('sigmoid', 0.09), ('feedforward', 0.088), ('stochastic', 0.087), ('toronto', 0.084), ('layers', 0.084), ('foreground', 0.083), ('mixture', 0.081), ('mdn', 0.08), ('mdns', 0.08), ('deterministic', 0.079), ('mlp', 0.075), ('face', 0.072), ('samples', 0.065), ('boltzmann', 0.065), ('conditioned', 0.063), ('hybrid', 0.062), ('nh', 0.061), ('hiddens', 0.06), ('mfa', 0.06), ('units', 0.06), ('trained', 0.059), ('mlps', 0.058), ('multimodal', 0.058), ('modes', 0.057), ('unimodal', 0.055), ('nodes', 0.054), ('images', 0.053), ('belief', 0.053), ('expressions', 0.052), ('proposal', 0.052), ('output', 0.051), ('layer', 0.05), ('backpropagation', 0.047), ('database', 0.044), ('residual', 0.043), ('posterior', 0.043), ('segmentation', 0.041), ('gaussian', 0.04), ('analyzers', 0.04), ('braces', 0.04), ('cgrbms', 0.04), ('netlab', 0.04), ('pluses', 0.04), ('xtest', 0.04), ('density', 0.039), ('rotations', 0.039), ('objects', 0.038), ('gibbs', 0.036), ('sbn', 0.035), ('backprop', 0.035), ('horses', 0.035), ('ontario', 0.035), ('subjects', 0.035), ('pixels', 0.035), ('expression', 0.035), ('fields', 0.034), ('multilayer', 0.034), ('colorful', 0.032), ('variational', 0.032), ('test', 0.032), ('learn', 0.032), ('modeling', 0.032), ('training', 0.031), ('carlo', 0.031), ('qualitative', 0.031), ('monte', 0.031), ('machines', 0.031), ('classi', 0.031), ('sampling', 0.03), ('em', 0.029), ('weight', 0.029), ('masks', 0.029), ('faces', 0.028), ('salakhutdinov', 0.028), ('progresses', 0.028), ('nats', 0.028), ('win', 0.028), ('occlusion', 0.028), ('components', 0.027), ('drawback', 0.027), ('generative', 0.027), ('perceptrons', 0.027), ('factorial', 0.027), ('kl', 0.026), ('structured', 0.026), ('preferred', 0.026), ('fa', 0.026), ('cooperative', 0.026), ('annealed', 0.026), ('amsterdam', 0.026), ('neutral', 0.026), ('gradient', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 160 nips-2013-Learning Stochastic Feedforward Neural Networks

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1

2 0.11708864 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

3 0.10094737 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

4 0.09309613 331 nips-2013-Top-Down Regularization of Deep Belief Networks

Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim

Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1

5 0.08690197 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

6 0.085817635 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

7 0.085709333 70 nips-2013-Contrastive Learning Using Spectral Methods

8 0.083246998 318 nips-2013-Structured Learning via Logistic Regression

9 0.078342862 75 nips-2013-Convex Two-Layer Modeling

10 0.072012149 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

11 0.070125282 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

12 0.068321556 161 nips-2013-Learning Stochastic Inverses

13 0.067778081 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

14 0.067485683 84 nips-2013-Deep Neural Networks for Object Detection

15 0.067413837 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

16 0.067150809 5 nips-2013-A Deep Architecture for Matching Short Texts

17 0.066424496 166 nips-2013-Learning invariant representations and applications to face verification

18 0.06625919 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

19 0.065917373 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

20 0.065738492 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.178), (1, 0.071), (2, -0.107), (3, -0.05), (4, 0.077), (5, -0.003), (6, 0.044), (7, 0.052), (8, 0.066), (9, -0.086), (10, 0.063), (11, 0.046), (12, -0.028), (13, 0.015), (14, -0.011), (15, 0.029), (16, -0.038), (17, -0.076), (18, -0.03), (19, -0.031), (20, 0.023), (21, 0.012), (22, 0.02), (23, -0.034), (24, -0.01), (25, 0.051), (26, -0.04), (27, -0.006), (28, -0.004), (29, 0.05), (30, 0.036), (31, 0.019), (32, -0.026), (33, -0.007), (34, 0.021), (35, 0.032), (36, 0.017), (37, 0.023), (38, 0.065), (39, -0.035), (40, -0.008), (41, -0.02), (42, -0.029), (43, -0.01), (44, 0.021), (45, 0.034), (46, 0.068), (47, 0.027), (48, -0.004), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92972255 160 nips-2013-Learning Stochastic Feedforward Neural Networks

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1

2 0.72944862 331 nips-2013-Top-Down Regularization of Deep Belief Networks

Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim

Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1

3 0.72078425 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1

4 0.68639159 251 nips-2013-Predicting Parameters in Deep Learning

Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas

Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1

5 0.65641475 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

Author: Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum

Abstract: The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs (GPGP) consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood. Representations and algorithms from computer graphics are used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on generalpurpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and yields accurate, approximately Bayesian inferences about real-world images. 1

6 0.64786381 168 nips-2013-Learning to Pass Expectation Propagation Messages

7 0.64561874 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

8 0.63586861 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

9 0.63295054 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

10 0.63018936 84 nips-2013-Deep Neural Networks for Object Detection

11 0.62907219 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

12 0.62796819 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

13 0.6192013 36 nips-2013-Annealing between distributions by averaging moments

14 0.60931098 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

15 0.59709591 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

16 0.59666157 161 nips-2013-Learning Stochastic Inverses

17 0.59358406 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

18 0.59225076 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars

19 0.58133745 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

20 0.57781237 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.04), (33, 0.565), (34, 0.093), (41, 0.019), (49, 0.022), (56, 0.058), (70, 0.033), (85, 0.027), (89, 0.014), (93, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99592769 88 nips-2013-Designed Measurements for Vector Count Data

Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin

Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ Rn , and the + observed data are a vector of counts, Y ∈ Zm . The projection matrix is designed + by maximizing mutual information between Y and X, I(Y ; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y ; C). New analytic expressions for the gradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting). 1

2 0.99389869 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

Author: Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, Tomas Mikolov

Abstract: Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. 1

3 0.99337661 217 nips-2013-On Poisson Graphical Models

Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu

Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1

same-paper 4 0.98912781 160 nips-2013-Learning Stochastic Feedforward Neural Networks

Author: Yichuan Tang, Ruslan Salakhutdinov

Abstract: Multilayer perceptrons (MLPs) or neural networks are popular models used for nonlinear regression and classification tasks. As regressors, MLPs model the conditional distribution of the predictor variables Y given the input variables X. However, this predictive distribution is assumed to be unimodal (e.g. Gaussian). For tasks involving structured prediction, the conditional distribution should be multi-modal, resulting in one-to-many mappings. By using stochastic hidden variables rather than deterministic ones, Sigmoid Belief Nets (SBNs) can induce a rich multimodal distribution in the output space. However, previously proposed learning algorithms for SBNs are not efficient and unsuitable for modeling real-valued data. In this paper, we propose a stochastic feedforward network with hidden layers composed of both deterministic and stochastic variables. A new Generalized EM training procedure using importance sampling allows us to efficiently learn complicated conditional distributions. Our model achieves superior performance on synthetic and facial expressions datasets compared to conditional Restricted Boltzmann Machines and Mixture Density Networks. In addition, the latent features of our model improves classification and can learn to generate colorful textures of objects. 1

5 0.98797262 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh

Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub–sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable. 1

6 0.98086053 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

7 0.97667658 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

8 0.962237 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

9 0.95870823 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

10 0.94497275 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

11 0.93262446 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

12 0.9128089 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

13 0.90916556 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

14 0.90909261 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks

15 0.90490836 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

16 0.90412521 67 nips-2013-Conditional Random Fields via Univariate Exponential Families

17 0.90403622 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

18 0.89946735 226 nips-2013-One-shot learning by inverting a compositional causal process

19 0.8975482 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

20 0.89719331 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes