nips nips2011 nips2011-197 knowledge-graph by maker-knowledge-mining

197 nips-2011-On Tracking The Partition Function


Source: pdf

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, their use is often limited by an inability to estimate the partition function Z. [sent-4, score-0.265]

2 In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. [sent-5, score-0.513]

3 Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. [sent-6, score-0.349]

4 Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. [sent-8, score-0.163]

5 Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. [sent-9, score-0.597]

6 p(x) = exp(−β E(x)) / Z(β), with Z(β) = Σ_x exp(−β E(x))   (1). E(x) is referred to as the “energy” of configuration x, β is a free parameter known as the inverse temperature, and Z(β) is the normalization factor commonly referred to as the partition function. [sent-12, score-0.384]

7 MRFs with latent variables – in particular restricted Boltzmann machines (RBMs) [9] – are among the most popular building blocks for deep architectures [1], being used in the unsupervised initialization of both Deep Belief Networks [9] and Deep Boltzmann Machines [22]. [sent-14, score-0.204]

8 As shown in Eq. 1, the partition function is computed by summing over all variable configurations. [sent-16, score-0.213]

9 Since the number of configurations scales exponentially with the number of variables, exact calculation of the partition function is generally computationally intractable. [sent-17, score-0.239]
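
To make this scaling concrete, here is a minimal brute-force computation of Z(β) for a toy binary RBM; a sketch only, where the energy parameterization matches the one quoted later in this summary, and the sizes, variable names and random seed are illustrative assumptions.

```python
import itertools
import numpy as np

def exact_log_z(W, b, c, beta=1.0):
    """Brute-force log Z(beta) for a small binary RBM with energy
    E(v, h) = -h^T W v - c^T h - b^T v.  The double loop visits all
    2^(nv + nh) configurations, which is exactly why this is hopeless
    beyond toy sizes."""
    nh, nv = W.shape
    log_terms = []
    for v_bits in itertools.product([0, 1], repeat=nv):
        for h_bits in itertools.product([0, 1], repeat=nh):
            v = np.asarray(v_bits, dtype=float)
            h = np.asarray(h_bits, dtype=float)
            energy = -(h @ W @ v + c @ h + b @ v)
            log_terms.append(-beta * energy)
    m = max(log_terms)                       # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(np.asarray(log_terms) - m)))

# 3 visible and 2 hidden units: 2^5 = 32 terms; at nv = nh = 25 it is already 2^50.
rng = np.random.default_rng(0)
print(exact_log_z(0.1 * rng.standard_normal((2, 3)), np.zeros(3), np.zeros(2)))
```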

10 Without the partition function, probabilities under the model can only be determined up to a multiplicative constant, which seriously limits the model’s utility. [sent-18, score-0.213]

11 One method recently proposed for estimating Z(β) is annealed importance sampling (AIS) [18, 23]. [sent-19, score-0.22]

12 With a large number of variables, drawing a set of importance-weighted samples is generally subject to extreme variance in the importance weights. [sent-21, score-0.147]

13 AIS alleviates this issue by annealing the model distribution through a series of slowly changing distributions that link the target model distribution to one where the log partition function is tractable. [sent-22, score-0.407]

14 This computationally intensive requirement renders AIS inappropriate as a means of maintaining a running estimate of the log partition function throughout training. [sent-24, score-0.363]
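
For reference, a compact sketch of the AIS estimator described above. The helper names (free_energy, gibbs_step, v0) are assumptions standing in for model-specific code, and the annealing schedule is left to the caller.

```python
import numpy as np

def ais_log_z(log_z_base, free_energy, betas, gibbs_step, v0, rng, n_runs=100):
    """Annealed importance sampling estimate of log Z at beta = 1.
    free_energy(v, beta) returns F_beta(v), so q~_beta(v) = exp(-F_beta(v));
    betas is an increasing schedule starting at 0 (where log Z is log_z_base)
    and ending at 1; gibbs_step(v, beta, rng) is one MCMC transition that
    leaves q_beta invariant; v0(rng) samples from the beta = 0 model."""
    log_w = np.zeros(n_runs)
    for r in range(n_runs):
        v = v0(rng)
        for b_prev, b_next in zip(betas[:-1], betas[1:]):
            # accumulate log q~_{b_next}(v) - log q~_{b_prev}(v)
            log_w[r] += free_energy(v, b_prev) - free_energy(v, b_next)
            v = gibbs_step(v, b_next, rng)
    m = log_w.max()                           # log-mean-exp of the weights
    return log_z_base + m + np.log(np.mean(np.exp(log_w - m)))
```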

15 Likelihood could be used as a basis for model comparison throughout training; early-stopping could be accomplished by monitoring an estimate of the likelihood of a validation set. [sent-26, score-0.15]

16 Another important application is in Bayesian inference in MRFs [17] where we require the partition function for each value of the parameters in the region of support. [sent-27, score-0.213]

17 Tracking the log partition function would also enable simultaneous estimation of all the parameters of a heterogeneous model, for example an extended directed graphical model with Gibbs distributions forming some of the model components. [sent-28, score-0.347]

18 In this work, we consider a method of tracking the log partition function during training, which builds upon the parallel tempering (PT) framework [7, 10, 15]. [sent-29, score-0.688]

19 First, when using stochastic gradient descent, parameters tend to change slowly during training; consequently, the partition function Z(β) also tends to evolve slowly. [sent-31, score-0.363]

20 We exploit this property of the learning process by using importance sampling to estimate changes in the log partition function from one learning iteration to the next. [sent-32, score-0.479]

21 If the changes in the distribution from time-step t to t + 1 are small, the importance sampling estimate can be very accurate, even with relatively few samples. [sent-33, score-0.23]

22 Second, parallel tempering (PT) relies on simulating an extended system, consisting of multiple models each running at their own temperature. [sent-35, score-0.236]

23 These temperatures are chosen such that neighboring models overlap sufficiently as to allow for frequent cross-temperature state swaps. [sent-36, score-0.171]

24 This is an ideal operating regime for bridge sampling [2, 19], which can thus serve to estimate the difference in log partition functions between neighboring models. [sent-37, score-0.535]

25 While each method on its own tends not to provide reliable estimates with relatively few samples, we propose to combine these measurements using a variation of the well-known Kalman filter (KF), allowing us to accurately track the evolution of the log partition function throughout learning. [sent-38, score-0.433]

26 In Section 2, we provide a brief overview of RBMs and the SML-PT training algorithm, which serves as the basis of our tracking algorithm. [sent-41, score-0.217]

27 Sections 3.2 and 3.3 cover the details of the importance and bridge sampling estimates. [sent-44, score-0.254]

28 Section 3.4 provides a comprehensive look at our filtering procedure and the tracking algorithm as a whole. [sent-45, score-0.172]

29 2 Stochastic Maximum Likelihood with Parallel Tempering Our proposed log partition function tracking strategy is applicable to any Gibbs distribution model that is undergoing relatively smooth changes in the partition function. [sent-47, score-0.696]

30 RBMs are bipartite graphical models where visible units v ∈ {0, 1}^{n_v} interact with hidden units h ∈ {0, 1}^{n_h} through the energy function E(v, h) = −h^T W v − c^T h − b^T v. [sent-49, score-0.177]

31 RBMs can be trained through a stochastic approximation to the negative log-likelihood gradient ∂F(v)/∂θ − E_p[∂F(v)/∂θ], where F(v) is the free-energy function defined as F(v) = − log Σ_h exp(−E(v, h)). [sent-51, score-0.152]
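
A direct transcription of this free energy for a binary RBM, extended with the inverse temperature β used by the tempered models below; a sketch, where the closed form follows from summing out the binary hidden units and the argument order is my own convention.

```python
import numpy as np

def free_energy(v, W, b, c, beta=1.0):
    """F_beta(v) = -log sum_h exp(-beta * E(v, h)) for E(v, h) =
    -h^T W v - c^T h - b^T v with binary h, which factorizes as
    F_beta(v) = -beta * b.v - sum_j log(1 + exp(beta * (c_j + W_j . v))).
    The unnormalized marginal is then q~_beta(v) = exp(-F_beta(v))."""
    pre = beta * (c + W @ v)                  # hidden unit pre-activations
    return -(beta * (b @ v) + np.logaddexp(0.0, pre).sum())
```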

32 However, from the perspective of tracking the log partition function, we will see in Section 3 that the SML-PT scheme [7] presents a rather unique advantage. [sent-56, score-0.452]

33 Throughout training, parallel tempering draws samples from an extended system Mt = {qi,t ; i ∈ [1, M ]}, where qi,t denotes the model with inverse temperature βi ∈ [0, 1] obtained after t steps of gradient descent. [sent-57, score-0.424]

34 Each model qi,t (associated with a unique partition function Zi,t ) represents a smoothed version of the target distribution: q1,t (with β1 = 1). [sent-58, score-0.213]

35 The gradient is then estimated by using the sample which was last swapped into temperature β1 . [sent-64, score-0.185]

36 To reduce the variance on our estimate, we run multiple Markov chains per temperature, yielding a mini-batch of model samples X_{i,t} = {x_{i,t}^{(n)} ∼ q_{i,t}(x); 1 ≤ n ≤ N} at each time-step and temperature. [sent-65, score-0.194]
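
The cross-temperature state swaps used by this extended system follow the usual parallel tempering Metropolis rule; a minimal sketch over joint (v, h) states, where the function and variable names are assumptions rather than the paper's code.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -h^T W v - c^T h - b^T v for a binary RBM."""
    return -(h @ W @ v + c @ h + b @ v)

def maybe_swap(state_i, state_j, beta_i, beta_j, W, b, c, rng):
    """Propose exchanging the states of two neighboring tempered chains.
    The Metropolis acceptance probability is
        min(1, exp((beta_i - beta_j) * (E(x_i) - E(x_j)))),
    so chains at nearby temperatures (good overlap) swap frequently."""
    e_i = rbm_energy(*state_i, W, b, c)
    e_j = rbm_energy(*state_j, W, b, c)
    if np.log(rng.random()) < (beta_i - beta_j) * (e_i - e_j):
        return state_j, state_i               # swap accepted
    return state_i, state_j
```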

37 SML with Adaptive parallel tempering (SML-APT) [6], further improves upon SML-PT by automating the choice of temperatures. [sent-66, score-0.236]

38 As previously mentioned, gradient descent learning causes qi,t , the model with inverse temperature βi obtained at time-step t, to be close to qi,t−1 . [sent-69, score-0.187]

39 We can thus apply importance sampling between adjacent temporal models to obtain an estimate of ζ_{i,t} − ζ_{i,t−1}, denoted O_{i,t}^{(∆t)}. [sent-70, score-0.199]

40 This motivates using bridge sampling [2] to provide an estimate O_{i,t}^{(∆β)} of ζ_{i+1,t} − ζ_{i,t}, the difference in log partition functions between temperatures β_{i+1} and β_i. [sent-74, score-0.4]

41 Repeated application of bridge sampling alone could in principle arrive at an accurate estimate of {ζ_{i,t}; i ∈ [1, M], t ∈ [1, T]}. [sent-78, score-0.222]

42 However, reducing the variance sufficiently to provide useful estimates of the log partition function would require using a relatively large number of samples at each temperature. [sent-79, score-0.409]

43 Within the context of RBM training, the required number of samples at each of the parallel chains would have an excessive computational cost. [sent-80, score-0.198]

44 Nonetheless even with relatively few samples, the bridge sampling estimate provides an additional source of information regarding the log partition function. [sent-81, score-0.533]

45 Our strategy is to combine these two high-variance estimates O_{i,t}^{(∆t)} and O_{i,t}^{(∆β)} by treating the unknown log partition functions as a latent state to be tracked by a Kalman filter. [sent-82, score-0.389]

46 The latent state is ζ_t = [ζ_{1,t}, . . . , ζ_{M,t}, b_t], where b_t is an extra term to account for a systematic bias in O_{1,t}^{(∆β)} (see Sec. [sent-87, score-0.248]

47 Its log partition function is thus given as Σ_i log(1 + exp(b_i)) + n_h · log(2). [sent-93, score-0.305]
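
This closed form holds because, with zero weights and zero hidden biases, all units decouple and the partition function factorizes. A one-line check of the formula above; the bias values here are arbitrary placeholders.

```python
import numpy as np

def base_rate_log_z(visible_biases, n_hidden):
    """log Z of an RBM with W = 0 and zero hidden biases:
    sum_i log(1 + exp(b_i)) + n_hidden * log(2)."""
    return np.logaddexp(0.0, visible_biases).sum() + n_hidden * np.log(2.0)

print(base_rate_log_z(np.array([0.3, -1.2, 0.0]), n_hidden=5))
```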

48 Figure 1: A directed graphical model for log partition function tracking. [sent-106, score-0.304]

49 3.1 Model Dynamics. The first step is to specify how we expect the log partition function to change over training iterations, i.e. [sent-110, score-0.349]

50 SML training of the RBM model parameters is a stochastic gradient descent algorithm (typically over a mini-batch of N examples) where the parameters change by small increments specified by an approximation to the likelihood gradient. [sent-113, score-0.224]

51 This implies that both the model distribution and the partition function change relatively slowly over learning increments, with the rate of change being a function of the SML learning rate; i.e. [sent-114, score-0.33]

52 Our model dynamics are thus simple and capture the fact that the log partition function is slowly changing. [sent-117, score-0.318]

53 Characterizing the evolution of the log partition functions as independent Gaussian processes, we model the probability of ζ_t conditioned on ζ_{t−1} as p(ζ_t | ζ_{t−1}) = N(ζ_{t−1}, Σ_ζ), a normal distribution with mean ζ_{t−1} and fixed diagonal covariance Σ_ζ = Diag[σ_Z^2, . . . , σ_Z^2, σ_b^2]. [sent-118, score-0.31]

54 σ_Z and σ_b are hyper-parameters controlling how quickly the latent states ζ_{i,t} and b_t are expected to change between learning iterations. [sent-122, score-0.174]

55 3.2 Importance Sampling Between Learning Iterations. The observation distribution p(O_t^{(∆t)} | ζ_t, ζ_{t−1}) = N(C [ζ_t, ζ_{t−1}]^T, Σ_{∆t}) models the relationship between the evolution of the latent log partitions and the statistical measurements O_t^{(∆t)} = [O_{1,t}^{(∆t)}, . . . [sent-124, score-0.162]

56 , O_{M,t}^{(∆t)}] given by importance sampling, with O_{i,t}^{(∆t)} defined as: O_{i,t}^{(∆t)} = log (1/N) Σ_{n=1}^{N} w_{i,t}^{(n)}, with w_{i,t}^{(n)} = q̃_{i,t}(x_{i,t−1}^{(n)}) / q̃_{i,t−1}(x_{i,t−1}^{(n)}). (3) [sent-127, score-0.151]
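
Written in terms of free energies (q̃_{i,t}(v) = exp(−F_{i,t}(v))), this estimate only needs the previous mini-batch and the two parameter settings. A minimal log-space sketch, assuming a free_energy(v, W, b, c, beta) helper like the one above.

```python
import numpy as np

def delta_t_estimate(samples_prev, params_new, params_old, beta, free_energy):
    """O^{(dt)}_{i,t} = log (1/N) sum_n w^{(n)}, with importance weights
    w^{(n)} = q~_{i,t}(x^{(n)}) / q~_{i,t-1}(x^{(n)})
            = exp(F_{i,t-1}(x^{(n)}) - F_{i,t}(x^{(n)})),
    evaluated on the samples x^{(n)} drawn from q_{i,t-1}.  Computed with
    log-sum-exp so that a few large weights do not overflow."""
    log_w = np.array([
        free_energy(v, *params_old, beta=beta) - free_energy(v, *params_new, beta=beta)
        for v in samples_prev
    ])
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```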

57 In the above distribution, the matrix C encodes the fact that the average importance weights estimate ζ_{i,t} − ζ_{i,t−1} + b_t · I_{i=1}, where I is the indicator function. [sent-128, score-0.246]

58 the gradient applied between q_{i,t−1} and q_{i,t}), and second, to compute the importance weights of Eq. 3. [sent-136, score-0.144]

59 3.3 Bridging the Parallel Tempering Temperature Gaps. Consider now the other dimension of our parallel tempered lattice of RBMs: temperature. [sent-140, score-0.149]

60 However, the intermediate distributions qi,t (v, h) are not so close to one another that we can use them as the intermediate distributions of AIS. [sent-142, score-0.174]

61 AIS typically requires thousands of intermediate chains, and maintaining that number of parallel chains would carry a prohibitive computational burden. [sent-143, score-0.213]

62 On the other hand, the parallel tempering strategy of spacing the temperature to ensure moderately frequent swapping nicely matches the ideal operating regime of bridge sampling [2]. [sent-144, score-0.558]

63 The measurements O_t^{(∆β)} = [O_{1,t}^{(∆β)}, . . . , O_{M−1,t}^{(∆β)}] are obtained via bridge sampling as estimates of ζ_{i+1,t} − ζ_{i,t}. [sent-150, score-0.205]
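
For concreteness, a minimal bridge-sampling sketch using the simple geometric bridge q* ∝ sqrt(q̃_i · q̃_{i+1}); the paper follows the bridge of [2], so treat the bridge choice and helper names here as illustrative assumptions rather than the authors' estimator.

```python
import numpy as np

def log_mean_exp(values):
    a = np.asarray(values, dtype=float)
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def delta_beta_estimate(samples_i, samples_ip1, F_i, F_ip1):
    """Estimate zeta_{i+1,t} - zeta_{i,t} = log Z_{i+1}/Z_i from samples of
    the two neighboring tempered models, using the identity
      Z_{i+1}/Z_i = E_{q_i}[sqrt(q~_{i+1}/q~_i)] / E_{q_{i+1}}[sqrt(q~_i/q~_{i+1})]
    with q~(v) = exp(-F(v)).  F_i / F_ip1 map a visible sample to its free
    energy under the models at beta_i and beta_{i+1}."""
    num = log_mean_exp([0.5 * (F_i(v) - F_ip1(v)) for v in samples_i])
    den = log_mean_exp([0.5 * (F_ip1(v) - F_i(v)) for v in samples_ip1])
    return num - den
```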

64 However it is possible to start with a coarse estimate of si,1 and refine it in subsequent iterations by using the output of our tracking algorithm. [sent-155, score-0.224]

65 3.4 Kalman Filtering of the Log-Partition Function. In the above we have described two sources of information regarding the log partition function for each of the RBMs in the lattice. [sent-159, score-0.308]

66 In this section we describe a method to fuse all available information to improve the overall accuracy of the estimate of every log partition function. [sent-160, score-0.332]

67 Having incorporated the importance sampling estimate into the model, we can then marginalize over ζ_{t−1} (which is no longer required), to yield p(ζ_t | O_{t:0}^{(∆t)}, O_{t−1:0}^{(∆β)}). [sent-175, score-0.199]

68 Finally, it remains only to incorporate the bridge sampler estimate O_t^{(∆β)} by a second application of Bayes' rule, which gives us p(ζ_t | O_{t:0}^{(∆t)}, O_{t:0}^{(∆β)}), the updated posterior over the latent state at time-step t. [sent-176, score-0.199]
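
The fusion step itself is a standard linear-Gaussian computation. The sketch below compresses the paper's two-stage update (which carries ζ_{t−1} explicitly and then marginalizes it out) into a plain random-walk Kalman filter with two sequential measurement updates, so it should be read as an illustration of the information fusion rather than the exact filter of Figure 1; all matrix names are assumptions.

```python
import numpy as np

def kf_step(mu, P, Q, measurements):
    """One tracking step for a random-walk latent state zeta_t:
    predict with zeta_t = zeta_{t-1} + noise (covariance Q), then apply a
    standard Kalman update for each (z, H, R) measurement in turn --
    here the importance-sampling estimates and the bridge-sampling
    estimates, each with its own observation matrix and noise covariance."""
    mu, P = mu.copy(), P + Q                  # predict
    for z, H, R in measurements:              # sequential measurement updates
        S = H @ P @ H.T + R                   # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
        mu = mu + K @ (z - H @ mu)
        P = (np.eye(len(mu)) - K @ H) @ P
    return mu, P
```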

69 We used the decreasing schedule ε_t = min(α · ε_init / (t+1), ε_init), where ε_t is the learning rate at time-step t, ε_init is the initial or base learning rate, and α is the decrease constant. [sent-182, score-0.493]
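
As reconstructed above, the schedule keeps the base rate for roughly the first α updates and then decays as 1/t; a one-liner sketch, with symbol names of my own choosing.

```python
def learning_rate(t, eps_init, alpha):
    """eps_t = min(alpha * eps_init / (t + 1), eps_init): constant at
    eps_init until t is of the order of alpha, then roughly 1/t decay."""
    return min(alpha * eps_init / (t + 1.0), eps_init)

# e.g. learning_rate(0, 1e-3, 1e5) -> 1e-3; learning_rate(10**6, 1e-3, 1e5) ~ 1e-4
```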

70 Comparing to Exact Likelihood. We start by comparing the performance of our tracking algorithm to the exact likelihood, obtained by marginalizing over both visible and hidden units. [sent-191, score-0.261]

71 In Figure 3(a), we can see that our tracker provides a very good fit to the likelihood with init = 0. [sent-194, score-0.272]

72 Our tracker fails, however, to capture the oscillatory behavior engendered by too high a learning rate (ε_init = 0. [sent-198, score-0.205]

73 Comparing to AIS for Large-Scale Models. In evaluating the performance of our tracking algorithm on larger models, exact computation of the likelihood is no longer possible, so we use AIS as our baseline. [sent-201, score-0.265]

74 We performed 200k updates, with learning rate parameters ε_init ∈ {. [sent-203, score-0.15]

75 08 (where ± indicates the 3σ confidence interval), while our tracking algorithm reported a value −89. [sent-208, score-0.172]

76 14 according to AIS, while our tracker reported . . . Each b_k is initialized to log(x̄_k / (1 − x̄_k)), where x̄_k is the mean of the k-th dimension on the training set. [sent-212, score-0.167]

77 Estimates of log Z are averaged over 100 annealed importance weights. [sent-220, score-0.224]

78 Figure 3: Comparison of exact test-set likelihood and estimated likelihood as given by AIS and our tracking algorithm. [sent-227, score-0.332]

79 We trained a 25-hidden-unit RBM for 300k updates using SML, with a learning rate schedule ε_t = min(α · ε_init / (t+1), ε_init), with (left) ε_init = 0. [sent-228, score-0.555]

80 Also of interest, the variance on O_t^{(∆β)} increases with t but is compensated by a decreasing variance on O_t^{(∆t)}, yielding a relatively smooth estimate ζ_{1,t}. [sent-234, score-0.175]

81 As the model converges and the learning rate decreases, qi,t−1 and qi,t become progressively closer and the importance sampling estimates become more robust. [sent-248, score-0.182]

82 An important point to note is that a naive linear-spacing of temperatures yielded low exchange rates between neighboring temperatures, with adverse effects on the quality of our bridge sampling estimates. [sent-250, score-0.314]

83 As a result, we observed a drop in performance, both in likelihood as well as tracking performance. [sent-251, score-0.239]

84 Adaptive tempering [6] (with a fixed number of chains M ) proved crucial in getting good tracking for these experiments. [sent-252, score-0.453]

85 In these experiments, we use our estimate of the log [results-table residue: datasets adult, connect4, dna, mushrooms, nips, ocr letters, rcv1, web; columns RBM, Kalman] [sent-254, score-0.145]

86 Early-stopping was performed by monitoring likelihood on a hold-out validation set, using our KF estimate of the log partition function. [sent-296, score-0.399]

87 partition to monitor model performance on a held-out validation set. [sent-302, score-0.213]

88 When the onset of over-fitting is detected, we store the model parameters and report the associated test-set likelihood, as estimated by both AIS and our tracking algorithm. [sent-303, score-0.172]

89 Detecting over-fitting without tracking the log partition would require a dense grid of AIS runs which would prove computationally prohibitive. [sent-305, score-0.452]

90 We tested parameters in the following range: number of hidden units in {100, 200, 500, 1000} (depending on dataset size), learning rates in {10^{−2}, 10^{−3}, 10^{−4}} either held constant during training or annealed with constants α ∈ {10^3, 10^4, 10^5}. [sent-306, score-0.194]

91 SGD was performed using mini-batches of size {10, 100} when estimating the gradient, and mini-batches of size {10, 20} for our set of tempered-chains (we thus simulate 10 × {10, 20} tempered chains in total). [sent-308, score-0.194]

92 5 Discussion In this paper, we have shown that while exact calculation of the partition function of RBMs may be intractable, one can exploit the smoothness of gradient descent learning in order to approximately track the evolution of the log partition function during learning. [sent-311, score-0.673]

93 Treating the ζi,t ’s as latent variables, the graphical model of Figure 1 allowed us to combine multiple sources of information to achieve good tracking of the log partition function throughout training, on a variety of datasets. [sent-312, score-0.575]

94 We note however that good tracking performance is contingent on the ergodicity of the negative phase sampler. [sent-313, score-0.198]

95 The added computational cost lies in the computation of the importance weights for importance sampling and bridge sampling. [sent-316, score-0.338]

96 However, this boils down to computing free-energies which are mostly pre-computed in the course of gradient updates, with the sole exception being the computation of q̃_{i,t}(x_{i,t−1}^{(n)}) in the importance sampling step. [sent-317, score-0.244]

97 In comparison to AIS, our method allows us to fairly accurately track the log partition function, and at a per-point estimate cost well below that of AIS. [sent-318, score-0.368]

98 Having a reliable and accurate online estimate of the log partition function opens the door to a wide range of new research directions. [sent-319, score-0.358]

99 Enhanced gradient and adaptive learning rate for training restricted boltzmann machines. [sent-355, score-0.233]

100 Adaptive parallel tempering for stochastic maximum likelihood learning of rbms. [sent-364, score-0.303]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ais', 0.523), ('ot', 0.506), ('partition', 0.213), ('tempering', 0.174), ('tracking', 0.172), ('init', 0.15), ('sml', 0.144), ('rbms', 0.131), ('temperatures', 0.111), ('bt', 0.11), ('chains', 0.107), ('bridge', 0.107), ('kalman', 0.102), ('temperature', 0.099), ('boltzmann', 0.096), ('tempered', 0.087), ('importance', 0.084), ('rbm', 0.08), ('pt', 0.078), ('annealed', 0.073), ('deep', 0.072), ('likelihood', 0.067), ('log', 0.067), ('mrfs', 0.065), ('desjardins', 0.063), ('nats', 0.063), ('sampling', 0.063), ('parallel', 0.062), ('gradient', 0.06), ('tracker', 0.055), ('estimate', 0.052), ('silhouettes', 0.048), ('hinton', 0.047), ('annealing', 0.046), ('units', 0.045), ('training', 0.045), ('intermediate', 0.044), ('bengio', 0.043), ('murray', 0.043), ('schedule', 0.043), ('gibbs', 0.043), ('distributions', 0.043), ('salakhutdinov', 0.041), ('mnist', 0.04), ('latent', 0.04), ('courville', 0.039), ('slowly', 0.038), ('updates', 0.037), ('caltech', 0.037), ('track', 0.036), ('nade', 0.036), ('refered', 0.036), ('estimates', 0.035), ('om', 0.034), ('neal', 0.034), ('qi', 0.034), ('variance', 0.034), ('neighboring', 0.033), ('restricted', 0.032), ('machines', 0.032), ('theano', 0.032), ('bridging', 0.032), ('tieleman', 0.032), ('visible', 0.032), ('littman', 0.031), ('larochelle', 0.031), ('relatively', 0.031), ('throughout', 0.031), ('hidden', 0.031), ('evolution', 0.03), ('vt', 0.03), ('plotted', 0.029), ('delalleau', 0.029), ('bottou', 0.029), ('samples', 0.029), ('var', 0.028), ('descent', 0.028), ('sources', 0.028), ('architectures', 0.028), ('bias', 0.028), ('editors', 0.027), ('turian', 0.027), ('marlin', 0.027), ('swaps', 0.027), ('frequent', 0.027), ('phase', 0.026), ('swapped', 0.026), ('door', 0.026), ('hp', 0.026), ('dna', 0.026), ('swapping', 0.026), ('thirteenth', 0.026), ('exact', 0.026), ('measurements', 0.025), ('nh', 0.025), ('aistats', 0.025), ('trained', 0.025), ('yielding', 0.024), ('change', 0.024), ('graphical', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 197 nips-2011-On Tracking The Partition Function

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

2 0.16859666 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

3 0.10331215 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus

Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1

4 0.087784544 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

Author: Angela Yao, Juergen Gall, Luc V. Gool, Raquel Urtasun

Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from “simple data”, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. 1

5 0.087690115 303 nips-2011-Video Annotation and Tracking with Active Learning

Author: Carl Vondrick, Deva Ramanan

Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1

6 0.086249784 250 nips-2011-Shallow vs. Deep Sum-Product Networks

7 0.085776493 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

8 0.080577999 275 nips-2011-Structured Learning for Cell Tracking

9 0.078907043 180 nips-2011-Multiple Instance Filtering

10 0.073503628 249 nips-2011-Sequence learning with hidden units in spiking neural networks

11 0.06971404 156 nips-2011-Learning to Learn with Compound HD Models

12 0.068314925 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

13 0.067193426 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

14 0.060959253 218 nips-2011-Predicting Dynamic Difficulty

15 0.055517063 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

16 0.055488717 220 nips-2011-Prediction strategies without loss

17 0.0539492 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

18 0.053655919 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

19 0.053118043 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

20 0.052115425 217 nips-2011-Practical Variational Inference for Neural Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.175), (1, -0.001), (2, 0.03), (3, -0.01), (4, -0.003), (5, -0.054), (6, 0.025), (7, -0.066), (8, -0.013), (9, -0.067), (10, 0.017), (11, -0.18), (12, 0.028), (13, -0.086), (14, -0.004), (15, -0.018), (16, -0.019), (17, 0.043), (18, -0.089), (19, 0.101), (20, 0.081), (21, 0.073), (22, -0.064), (23, -0.074), (24, -0.057), (25, -0.052), (26, -0.097), (27, 0.029), (28, 0.065), (29, 0.011), (30, 0.025), (31, -0.021), (32, 0.046), (33, 0.054), (34, 0.035), (35, 0.073), (36, 0.06), (37, -0.064), (38, 0.033), (39, 0.027), (40, -0.058), (41, -0.057), (42, -0.013), (43, 0.001), (44, 0.049), (45, -0.061), (46, 0.01), (47, -0.163), (48, -0.019), (49, -0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92048579 197 nips-2011-On Tracking The Partition Function

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

2 0.72532564 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

3 0.68438387 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus

Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1

4 0.58888644 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman

Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1

5 0.56252152 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter

Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods. 1

6 0.55575418 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

7 0.54025418 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

8 0.5400061 55 nips-2011-Collective Graphical Models

9 0.51805878 250 nips-2011-Shallow vs. Deep Sum-Product Networks

10 0.49815643 275 nips-2011-Structured Learning for Cell Tracking

11 0.48198086 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

12 0.48171416 156 nips-2011-Learning to Learn with Compound HD Models

13 0.47446457 221 nips-2011-Priors over Recurrent Continuous Time Processes

14 0.46142998 249 nips-2011-Sequence learning with hidden units in spiking neural networks

15 0.44224259 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

16 0.43911266 180 nips-2011-Multiple Instance Filtering

17 0.43870464 8 nips-2011-A Model for Temporal Dependencies in Event Streams

18 0.42715439 69 nips-2011-Differentially Private M-Estimators

19 0.42446285 303 nips-2011-Video Annotation and Tracking with Active Learning

20 0.42183989 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.053), (4, 0.042), (12, 0.222), (20, 0.038), (26, 0.038), (31, 0.134), (33, 0.042), (43, 0.05), (45, 0.089), (57, 0.036), (65, 0.013), (74, 0.068), (83, 0.038), (99, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79827106 197 nips-2011-On Tracking The Partition Function

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

2 0.72518051 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei

Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.

3 0.66335768 229 nips-2011-Query-Aware MCMC

Author: Michael L. Wick, Andrew McCallum

Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1

4 0.6619699 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

Author: Armen Allahverdyan, Aram Galstyan

Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1

5 0.65389639 241 nips-2011-Scalable Training of Mixture Models via Coresets

Author: Dan Feldman, Matthew Faulkner, Andreas Krause

Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1

6 0.65289813 66 nips-2011-Crowdclustering

7 0.65082753 180 nips-2011-Multiple Instance Filtering

8 0.64764786 75 nips-2011-Dynamical segmentation of single trials from population neural data

9 0.64735597 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

10 0.64687502 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

11 0.64686364 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

12 0.64682335 156 nips-2011-Learning to Learn with Compound HD Models

13 0.64585066 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

14 0.64552373 221 nips-2011-Priors over Recurrent Continuous Time Processes

15 0.64524209 301 nips-2011-Variational Gaussian Process Dynamical Systems

16 0.64335412 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

17 0.64210224 158 nips-2011-Learning unbelievable probabilities

18 0.64130521 249 nips-2011-Sequence learning with hidden units in spiking neural networks

19 0.64024413 55 nips-2011-Collective Graphical Models

20 0.64015043 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing