nips nips2008 nips2008-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. [sent-6, score-0.189]
2 In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. [sent-8, score-0.19]
3 In some models the latent variables associated with a test input can be easily summed out: P (v) = h P (v, h). [sent-15, score-0.2]
4 As an example, standard mixture models have a single discrete mixture component indicator for each data point; the joint probability P (v, h) can be explicitly evaluated for each setting of the latent variable. [sent-16, score-0.175]
5 In particular, marginalizing out many latent variables can require complex integrals or exponentially large sums. [sent-19, score-0.164]
6 One popular latent variable model, the Restricted Boltzmann Machine (RBM), is unusual in that the posterior over hiddens P (h|v) is fully-factored, which allows efficient evaluation of P (v) up to a constant. [sent-20, score-0.189]
7 Almost all other latent variable models have posterior dependencies amongst latent variables, even if they are independent a priori. [sent-21, score-0.357]
8 For DBNs, test probabilities were lower-bounded through a variational technique. [sent-25, score-0.206]
9 A more accurate method for summing over latent variables would enable better and broader evaluation of DBNs. [sent-28, score-0.164]
10 We then develop a new cheap Monte Carlo procedure for evaluating latent variable models in section 3. [sent-31, score-0.237]
11 2 Probability of observations as a normalizing constant The probability of a data vector, P (v), is the normalizing constant relating the posterior over hidden variables to the joint distribution in Bayes rule, P (h|v) = P (h, v)/P (v). [sent-35, score-0.362]
12 In principle, there are many methods that could be applied to evaluating the probability assigned to data by a latent variable model. [sent-37, score-0.233]
13 In what follows, all auxiliary distributions Q and transition operators T are conditioned on the current test case v, this is not shown in the notation to reduce clutter. [sent-39, score-0.247]
14 1 Importance sampling Importance sampling can in principle find the normalizing constant of any distribution. [sent-43, score-0.217]
15 Specifically, the variance of the estimator is an α-divergence between the distributions [3]. [sent-47, score-0.316]
16 As an alternative, the harmonic mean method, also called the reciprocal method, gives an unbiased estimate of 1/P (v): 1 = P (v) h P (h) = P (v) h P (h|v) 1 ≈ P (v|h) S S s=1 1 , P v|h(s) h(s) ∼ P h(s) |v). [sent-51, score-0.275]
17 (2) In practice correlated samples from MCMC are used; then the estimator is asymptotically unbiased. [sent-52, score-0.316]
18 It was clear from the original paper and its discussion that the harmonic mean estimator can behave very poorly [4]. [sent-53, score-0.382]
19 Samples in the tails of the posterior have large weights, which makes it easy to construct distributions where the estimator has infinite variance. [sent-54, score-0.374]
20 In many problems, the estimate of 1/P (v) will be an underestimate with high probability. [sent-56, score-0.198]
21 Also, more expensive versions of the estimator exist with lower variance. [sent-59, score-0.363]
22 However, it is still prone to overestimate test probabiliˆ ties: If 1/PHME (v) is the Harmonic Mean Estimator in (2), Jensen’s inequality gives P (v) = ˆ ˆHME (v) ≤ E PHME (v) . [sent-60, score-0.164]
23 Similarly log P (v) will be overestimated in expectation. [sent-61, score-0.148]
24 1 E 1/P Hence the average of a large number of test log probabilities is highly likely to be an overestimate. [sent-62, score-0.146]
25 Despite these problems the estimator has received significant attention in statistics, and has been used for evaluating latent variable models in recent machine learning literature [5, 6]. [sent-63, score-0.505]
26 3 Importance sampling based on Markov chains Paradoxically, introducing auxiliary variables and making a distribution much higher-dimensional than it was before, can help find an approximating Q distribution that closely matches the target distribution. [sent-67, score-0.146]
27 The T operators are the reverse operators, of those used to define QAIS . [sent-83, score-0.203]
28 For any transition operator T that leaves a distribution P (h|v) stationary, there is a unique corresponding “reverse operator” T , which is defined for any point h in the support of P : T (h ← h) P (h|v) T (h ← h) P (h|v) . [sent-84, score-0.185]
29 Operators that are their own reverse operator are said to satisfy “detailed balance” and are also known as “reversible”. [sent-86, score-0.167]
30 Many transition operators used in practice, such as Metropolis–Hastings, are reversible. [sent-87, score-0.178]
31 Non-reversible operators are usually composed from a sequence of reversible operations, such as the component updates in a Gibbs sampler. [sent-88, score-0.193]
32 The reverse of these (so-called) non-reversible operators is constructed from the same reversible base operations, but applied in reverse order. [sent-89, score-0.359]
33 ≡ ∗ (h(s−1) ) (S) , v w(H) P P h s=2 s (6) ∗ We can usually evaluate the Ps , which are unnormalized versions of the stationary distributions of the Markov chain operators. [sent-91, score-0.198]
34 The AIS importance weight provides an unbiased estimate: PAIS (H) = P (v). [sent-93, score-0.151]
35 EQAIS (H) w(H) = P (v) (7) H As with standard importance sampling, the variance of the estimator depends on a divergence between PAIS and QAIS . [sent-94, score-0.435]
36 4 Chib-style estimators Bayes rule implies that for any special hidden state h∗ , P (v) = P (h∗ , v)/P (h∗ |v). [sent-97, score-0.165]
37 First, we choose a particular hidden state h∗ , usually one with high posterior probability, and then estimate P (h∗ |v). [sent-99, score-0.256]
38 We would like to obtain an estimator that is based on a sequence of states H = {h(1) , h(2) , . [sent-100, score-0.316]
39 , h(S) } generated by a Markov chain that explores the posterior distribution P (h|v). [sent-103, score-0.183]
40 Obviously this estimator is impractical as it equals zero with high probability when applied to highdimensional problems. [sent-105, score-0.36]
41 A “Rao–Blackwellized” version of this estimator, p(H), replaces the indiˆ cator function with the probability of transitioning from h(s) to the special state under a Markov chain transition operator that leaves the posterior stationary. [sent-106, score-0.457]
42 If the chain has stationary distribution P (h|v) and could be initialized at equilibrium so that S P(H) = P h(1) v T h(s) ← h(s−1) , (10) s=2 ∗ then p(H) would be an unbiased estimate of P (h |v). [sent-108, score-0.393]
43 For ergodic chains the stationary distribution ˆ is achieved asymptotically and the estimator is consistent regardless of how it is initialized. [sent-109, score-0.47]
44 If T is a Gibbs sampling transition operator, the only way of moving from h to h∗ is to draw each element of h∗ in turn. [sent-110, score-0.179]
45 3 A new estimator for evaluating latent-variable models We start with the simplest Chib-inspired estimator based on equations (8,9,11). [sent-122, score-0.69]
46 (12) P (v) = ∗ |v) P (h E[ˆ(H)] p p(H) ˆ That is, we will overestimate the probability of a visible vector in expectation. [sent-126, score-0.182]
47 Jensen’s inequality also says that we will overestimate log P (v) in expectation. [sent-127, score-0.172]
48 Ideally we would like an accurate estimate of log P (v). [sent-128, score-0.144]
49 However, if we must suffer some bias, then a lower bound that does not overstate performance will usually be preferred. [sent-129, score-0.144]
50 The probability of the special state h∗ will often be overestimated in practice if we initialize our Markov chain at h∗ . [sent-131, score-0.285]
51 (14) Inputs: v, observed test vector h∗ , a (preferably high posterior probability) hidden state S, number of Markov chain steps T , Markov chain operator that leaves P (h|v) stationary 1. [sent-138, score-0.746]
52 The quantity in square brackets is the estimator for P (h∗ |v) given in (9). [sent-153, score-0.357]
53 P (h∗ |v) (15) Although we are using the simple estimator from (9), by drawing H from a carefully constructed Markov chain procedure, the estimator is now unbiased in P (v). [sent-155, score-0.834]
54 As long as no division by zero has occurred in the above equations, the estimator is unbiased in P (v) for finite runs of the Markov chain. [sent-157, score-0.427]
55 Neal noted that Chibs method will return incorrect answers in cases where the Markov chain does not mix well amongst modes [14]. [sent-159, score-0.162]
56 Poor mixing may cause P (h∗ |v) to be overestimated with high probability, which would result in an underestimate of P (v), i. [sent-162, score-0.202]
57 The variance of the estimator is generally unknown, as it depends on the (generally unavailable) auto-covariance structure of the Markov chain. [sent-165, score-0.316]
58 We can note one positive property: for the ideal Markov chain operator that mixes in one step, the estimator has zero variance and gives the correct answer immediately. [sent-166, score-0.561]
59 DBNs are probabilistic generative models, that can contain many layers of hidden variables. [sent-169, score-0.154]
60 Each layer captures strong high-order correlations between the activities of hidden features in the layer below. [sent-170, score-0.248]
61 The top two layers of the DBN model form a Restricted Boltzmann Machine (RBM) which is an undirected graphical model, but the lower layers form a directed generative model. [sent-171, score-0.183]
62 Consider a DBN model with two layers of hidden features. [sent-173, score-0.154]
63 The model’s joint distribution is: P (v, h1 , h2 ) = P (v|h1 ) P (h2 , h1 ), (16) where P (v|h1 ) represents a sigmoid belief network, and P (h1 , h2 ) is the joint distribution defined by the second layer RBM. [sent-174, score-0.147]
64 Using an approximating factorial posterior distribution Q(h|v), Image Patches Estimated Test Log−probability Estimated Test Log−probability MNIST digits −85 AIS Estimator −85. [sent-176, score-0.157]
65 obtained as a byproduct of the greedy learning procedure, and an AIS estimate of the model’s partition function Z, [1] proposed obtaining an estimate of a variational lower bound: Q(h1 |v) log P ∗ (v, h1 ) − log Z + H(Q(h1 |v)). [sent-180, score-0.513]
66 log P (v) ≥ (17) h1 The entropy term H(·) can be computed analytically, since Q is factorial, and the expectation term was estimated by a simple Monte Carlo approximation: 1 log P ∗ (v, h1(s) ), where h1(s) ∼ Q(h1 |v). [sent-181, score-0.154]
67 S h Instead of the variational approach, we could also adopt AIS to estimate h1 P ∗ (v, h1 ). [sent-184, score-0.204]
68 In the next section we show that variational lower bounds can be quite loose. [sent-186, score-0.184]
69 Our proposed estimator requires the same single AIS estimate of Z as the variational method, so that we can evaluate P (v, h1 ). [sent-188, score-0.52]
70 It then provides better estimates of log P (v) by approximately summing over h1 for each test case in a reasonable amount of computer time. [sent-189, score-0.179]
71 Gibbs sampling was used as a Markov chain transition operator throughout. [sent-197, score-0.332]
72 1 MNIST digits In our first experiment we used a deep belief network (DBN) taken from [1]. [sent-200, score-0.219]
73 The network had two hidden layers with 500 and 2000 hidden units, and was greedily trained by learning a stack of two RBMs one layer at a time. [sent-201, score-0.393]
74 The estimate of the lower bound on the average test log probability, using (17), was −86. [sent-203, score-0.323]
75 To estimate how loose the variational bound is, we randomly sampled 50 test cases, 5 of each class, and ran AIS for each test case to estimate the true test log probability. [sent-205, score-0.618]
76 05 per test case, whereas the estimate of the true test log probability using AIS was −85. [sent-209, score-0.326]
77 The special state h∗ for each test example v was obtained by first sampling from the approximating distribution Q(h|v), and then performing deterministic hill-climbing in log p(v, h) to get to a local mode. [sent-213, score-0.256]
78 00GHz machine, whereas for our proposed estimator we only used S = 40, which took about 50 minutes. [sent-217, score-0.349]
79 59, which is lower than the lower bound and tells us nothing. [sent-220, score-0.157]
80 This is higher than the lower bound, but still worse than our estimator at S = 40, or even S = 5. [sent-223, score-0.363]
81 Finally, using our proposed estimator, the average test log probability on the entire MNIST test data was −84. [sent-224, score-0.259]
82 The difference of about 2 nats shows that the variational bound in [1] was rather tight, although a very small improvement of the DBN over the RBM is now revealed. [sent-226, score-0.248]
83 2 Image Patches In our second experiment we trained a two-layer DBN model on the image patches of natural scenes. [sent-228, score-0.161]
84 The first layer RBM had 2000 hidden units and 400 Gaussian visible units. [sent-229, score-0.21]
85 The second layer represented a semi-restricted Boltzmann machine (SRBM) with 500 hidden and 2000 visible units. [sent-230, score-0.21]
86 To estimate the model’s partition function, we used AIS with 15,000 intermediate distributions and 100 annealing runs. [sent-234, score-0.142]
87 The estimated lower bound on the average test log probability (see Eq. [sent-235, score-0.3]
88 17), using a factorial approximate posterior distribution Q(h1 |v), which we also get as a byproduct of the greedy learning algorithm, was −583. [sent-236, score-0.144]
89 The estimate of the true test log probability, using our proposed estimator, was −563. [sent-238, score-0.213]
90 In contrast to the model trained on MNIST, the difference of over 20 nats shows that, for model comparison purposes, the variational lower bound is quite loose. [sent-240, score-0.332]
91 Square ICA achieves a test log probability of −551. [sent-242, score-0.19]
92 In practice it is likely to underestimate the (log-)probability of a test set. [sent-246, score-0.2]
93 Although the algorithm involves Markov chains, importance sampling underlies the estimator. [sent-247, score-0.139]
94 Therefore the methods discussed in [18] could be used to bound the probability of accidentally over-estimating a test set probability. [sent-248, score-0.176]
95 P (h(s) , v) (19) As before the reciprocal of the quantity in square brackets would give an estimate of P (v). [sent-254, score-0.173]
96 If an approximation q(h) is available that captures more mass than T (h ← h∗ ), this generalized estimator could perform better. [sent-255, score-0.316]
97 We thank Geoffrey Hinton and Radford Neal for useful discussions, Simon Osindero for providing preprocessed image patches of natural scenes, and the reviewers for useful comments. [sent-259, score-0.159]
98 We do the same: P (h∗ |v) = dh T (h∗ ← h)P (h|v) dh T (h ← h∗ ). [sent-337, score-0.142]
99 will contain an infinite term giving a trivial underestimate P (v) = 0. [sent-346, score-0.176]
100 ) On repeated runs, the average estimate is still unbiased, or an underestimate for chains that can’t mix. [sent-348, score-0.279]
wordName wordTfidf (topN-words)
[('ais', 0.451), ('estimator', 0.316), ('pais', 0.244), ('hastings', 0.152), ('chib', 0.143), ('variational', 0.137), ('qais', 0.136), ('latent', 0.131), ('underestimate', 0.131), ('markov', 0.127), ('chain', 0.125), ('metropolis', 0.123), ('operators', 0.12), ('dbn', 0.115), ('mnist', 0.105), ('dbns', 0.102), ('carlo', 0.1), ('deep', 0.099), ('monte', 0.097), ('rbm', 0.096), ('rbms', 0.096), ('overestimate', 0.095), ('normalizing', 0.087), ('hidden', 0.086), ('operator', 0.084), ('reverse', 0.083), ('chains', 0.081), ('patches', 0.081), ('layer', 0.081), ('log', 0.077), ('unbiased', 0.077), ('importance', 0.074), ('reversible', 0.073), ('stationary', 0.073), ('gibbs', 0.072), ('dh', 0.071), ('overestimated', 0.071), ('radford', 0.071), ('ps', 0.07), ('test', 0.069), ('geoffrey', 0.068), ('layers', 0.068), ('estimate', 0.067), ('harmonic', 0.066), ('belief', 0.066), ('sampling', 0.065), ('reciprocal', 0.065), ('osindero', 0.065), ('bound', 0.063), ('posterior', 0.058), ('evaluating', 0.058), ('transition', 0.058), ('draw', 0.056), ('iain', 0.055), ('simon', 0.055), ('antonietta', 0.054), ('ruslan', 0.054), ('siddhartha', 0.054), ('srbm', 0.054), ('underestimated', 0.054), ('digits', 0.054), ('jensen', 0.053), ('boltzmann', 0.053), ('equilibrium', 0.051), ('cheap', 0.048), ('nats', 0.048), ('lower', 0.047), ('divergence', 0.045), ('state', 0.045), ('factorial', 0.045), ('giving', 0.045), ('probability', 0.044), ('leaves', 0.043), ('image', 0.043), ('ts', 0.043), ('visible', 0.043), ('annealed', 0.041), ('byproduct', 0.041), ('brackets', 0.041), ('mfa', 0.041), ('contrastive', 0.038), ('annealing', 0.038), ('tractable', 0.038), ('steps', 0.038), ('toronto', 0.037), ('trained', 0.037), ('amongst', 0.037), ('murray', 0.037), ('intermediate', 0.037), ('answer', 0.036), ('neal', 0.035), ('preprocessed', 0.035), ('salakhutdinov', 0.035), ('stack', 0.035), ('runs', 0.034), ('estimators', 0.034), ('suffer', 0.034), ('took', 0.033), ('summing', 0.033), ('integrals', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
2 0.194592 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
3 0.14651932 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
4 0.13088799 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
Author: Ralf M. Haefner, Bruce G. Cumming
Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1
5 0.13068435 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-ofthe-art techniques. 1
6 0.11625 234 nips-2008-The Infinite Factorial Hidden Markov Model
7 0.097845465 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
8 0.094192035 152 nips-2008-Non-stationary dynamic Bayesian networks
9 0.090480916 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
10 0.086538412 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
11 0.081875078 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
12 0.080873512 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
13 0.08073622 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
14 0.080044538 118 nips-2008-Learning Transformational Invariants from Natural Movies
15 0.077328667 62 nips-2008-Differentiable Sparse Coding
16 0.074871406 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
17 0.074180692 148 nips-2008-Natural Image Denoising with Convolutional Networks
18 0.070494048 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
19 0.069290809 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
20 0.06928768 134 nips-2008-Mixed Membership Stochastic Blockmodels
topicId topicWeight
[(0, -0.222), (1, -0.009), (2, 0.076), (3, -0.017), (4, 0.06), (5, -0.029), (6, -0.006), (7, 0.183), (8, -0.017), (9, 0.013), (10, -0.009), (11, 0.041), (12, 0.184), (13, 0.039), (14, -0.246), (15, -0.165), (16, 0.01), (17, -0.109), (18, 0.06), (19, 0.077), (20, -0.104), (21, 0.06), (22, 0.056), (23, 0.015), (24, 0.039), (25, 0.08), (26, -0.003), (27, 0.046), (28, 0.071), (29, 0.117), (30, -0.079), (31, 0.027), (32, -0.015), (33, -0.102), (34, -0.019), (35, 0.11), (36, -0.043), (37, -0.04), (38, 0.036), (39, -0.01), (40, -0.054), (41, -0.017), (42, 0.004), (43, -0.123), (44, -0.034), (45, -0.013), (46, -0.009), (47, 0.074), (48, -0.063), (49, -0.153)]
simIndex simValue paperId paperTitle
same-paper 1 0.9530682 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
2 0.66580313 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
3 0.65372723 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
4 0.5846594 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-ofthe-art techniques. 1
5 0.57732117 234 nips-2008-The Infinite Factorial Hidden Markov Model
Author: Jurgen V. Gael, Yee W. Teh, Zoubin Ghahramani
Abstract: We introduce a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After constructing an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation. 1
6 0.56358922 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
7 0.55279088 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
8 0.54361302 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
9 0.53427202 134 nips-2008-Mixed Membership Stochastic Blockmodels
10 0.52126449 236 nips-2008-The Mondrian Process
11 0.51442903 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
12 0.51348841 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
13 0.49603021 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
14 0.46915495 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
15 0.44880328 148 nips-2008-Natural Image Denoising with Convolutional Networks
16 0.44858751 152 nips-2008-Non-stationary dynamic Bayesian networks
17 0.44613418 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
18 0.43595546 249 nips-2008-Variational Mixture of Gaussian Process Experts
19 0.43593842 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
20 0.42867729 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
topicId topicWeight
[(6, 0.046), (7, 0.061), (12, 0.017), (28, 0.485), (32, 0.029), (57, 0.134), (59, 0.013), (63, 0.025), (71, 0.012), (77, 0.031), (83, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.99676687 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
2 0.98612875 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
3 0.98307157 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while also allowing for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. 1
4 0.98160654 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
Author: Lavi Shpigelman, Hagai Lalazar, Eilon Vaadia
Abstract: Using machine learning algorithms to decode intended behavior from neural activity serves a dual purpose. First, these tools allow patients to interact with their environment through a Brain-Machine Interface (BMI). Second, analyzing the characteristics of such methods can reveal the relative significance of various features of neural activity, task stimuli, and behavior. In this study we adapted, implemented and tested a machine learning method called Kernel Auto-Regressive Moving Average (KARMA), for the task of inferring movements from neural activity in primary motor cortex. Our version of this algorithm is used in an online learning setting and is updated after a sequence of inferred movements is completed. We first used it to track real hand movements executed by a monkey in a standard 3D reaching task. We then applied it in a closed-loop BMI setting to infer intended movement, while the monkey’s arms were comfortably restrained, thus performing the task using the BMI alone. KARMA is a recurrent method that learns a nonlinear model of output dynamics. It uses similarity functions (termed kernels) to compare between inputs. These kernels can be structured to incorporate domain knowledge into the method. We compare KARMA to various state-of-the-art methods by evaluating tracking performance and present results from the KARMA based BMI experiments. 1
5 0.98110527 126 nips-2008-Localized Sliced Inverse Regression
Author: Qiang Wu, Sayan Mukherjee, Feng Liang
Abstract: We developed localized sliced inverse regression for supervised dimension reduction. It has the advantages of preventing degeneracy, increasing estimation accuracy, and automatic subclass discovery in classification problems. A semisupervised version is proposed for the use of unlabeled data. The utility is illustrated on simulated as well as real data sets.
7 0.97667587 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
8 0.97579539 206 nips-2008-Sequential effects: Superstition or rational behavior?
9 0.97320026 174 nips-2008-Overlaying classifiers: a practical approach for optimal ranking
10 0.97151971 72 nips-2008-Empirical performance maximization for linear rank statistics
11 0.97037327 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
12 0.97033656 211 nips-2008-Simple Local Models for Complex Dynamical Systems
13 0.96734512 152 nips-2008-Non-stationary dynamic Bayesian networks
14 0.96679908 107 nips-2008-Influence of graph construction on graph-based clustering measures
15 0.9664014 101 nips-2008-Human Active Learning
16 0.96521384 159 nips-2008-On Bootstrapping the ROC Curve
17 0.96375871 112 nips-2008-Kernel Measures of Independence for non-iid Data
18 0.96006858 104 nips-2008-Improved Moves for Truncated Convex Models
19 0.95913219 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
20 0.95710582 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries