nips nips2003 nips2003-196 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
Reference: text
sentIndex sentText sentNum sentScore
1 If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. [sent-4, score-0.384]
2 But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. [sent-5, score-0.757]
3 We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. [sent-6, score-0.131]
4 If the model is approximately correct, these long-range moves have a reasonable acceptance rate. [sent-7, score-0.196]
5 The main problem with this approach is the time that it takes for the Markov chain to approach its stationary distribution. [sent-11, score-0.154]
6 Fortunately, in [3] it was shown that if the chain is started at the data distribution, running the chain for just a few steps is often sufficient to provide a signal for learning. [sent-12, score-0.412]
7 ∆θj ∝ −⟨∂E(x|Θ)/∂θj⟩_data + ⟨∂E(x|Θ)/∂θj⟩_confabulations   (3). Figure 1: a) shows a two-dimensional data distribution that has four well-separated modes. [sent-16, score-0.246]
8 b) shows a feedforward neural network that is used to assign an energy to a two-dimensional input vector. [sent-17, score-0.232]
9 Each hidden unit takes a weighted sum of its inputs, adds a learned bias, and puts this sum through a logistic non-linearity to produce an output that is sent to the next layer. [sent-18, score-0.132]
10 Each hidden unit makes a contribution to the global energy that is equal to its output times a learned scale factor. [sent-19, score-0.301]
11 There are 20 units in the first hidden layer and 3 in the top layer. [sent-20, score-0.191]
12 Unfortunately, real training sets may have modes that are separated by regions of very low density, and running the Markov chain for only a few steps may not allow it to move between these modes even when there is a lot of data. [sent-23, score-0.912]
13 As a result, the relative energies of data points in different modes can be completely wrong without affecting the learning signal given by equation 3. [sent-24, score-0.339]
14 The point of this paper is to show that, in the context of model-fitting, there are ways to use the known training data to introduce extra mode-hopping moves into the Markov chain. [sent-25, score-0.173]
15 We rely on the observation that after some initial training, the training data itself provides useful suggestions about where the modes of the model are and how much probability mass there is in each mode. [sent-26, score-0.405]
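The learning signal of equation 3 can be illustrated with a short sketch (not the authors' code): the parameters are nudged to lower the average energy of the data and to raise the average energy of the confabulations. The learning rate and the energy_param_grad callable below are hypothetical placeholders.

```python
import numpy as np

def cd_update(theta, data, confabulations, energy_param_grad, lr=0.01):
    """One contrastive-divergence-style parameter update in the spirit of equation 3.

    theta             : 1-D array of model parameters
    data              : (N, D) array of training points
    confabulations    : (N, D) array of points produced by brief MCMC started at the data
    energy_param_grad : callable(theta, x) -> dE(x)/dtheta as a 1-D array (assumed available)
    lr                : step size, an arbitrary illustrative choice
    """
    grad_data = np.mean([energy_param_grad(theta, x) for x in data], axis=0)
    grad_conf = np.mean([energy_param_grad(theta, x) for x in confabulations], axis=0)
    # Lower the energy at the data points and raise it at the confabulations.
    return theta - lr * (grad_data - grad_conf)
```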
16 2 A simple example of wormholes: Figure 1a shows some two-dimensional training data and a model that was used to model the density of the training data. [sent-27, score-0.584]
17 The model is an unsupervised deterministic feedforward neural network with two hidden layers of logistic units. [sent-28, score-0.176]
18 The parameters of the model are the weights and biases of the hidden units and one additional scale parameter per hidden unit which is used to convert the output of the hidden unit into an additive contribution to the global energy. [sent-29, score-0.287]
19 By using backpropagation through the model, it is easy to compute the derivatives of the global energy assigned to an input vector w.r.t. the parameters (needed in equation 3), and it is also easy to compute the gradient of the energy w.r.t. the input vector (i.e. the slope of the energy surface at that point in dataspace). [sent-30, score-0.197]
22 The latter gradient is needed for the ’Hybrid Monte Carlo’ sampler that we discuss next. [sent-38, score-0.162]
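A minimal sketch of such an energy network, together with the gradient of the energy with respect to the input that the Hybrid Monte Carlo sampler needs. The layer sizes (20 logistic units, then 3) follow the description above; the random initialisation is an illustrative assumption, not the paper's.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class EnergyNet:
    """Two hidden layers of logistic units; each hidden unit adds its output
    times a learned scale factor to the global energy."""

    def __init__(self, d_in=2, n_h1=20, n_h2=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_h1, d_in))
        self.b1 = np.zeros(n_h1)
        self.W2 = 0.1 * rng.standard_normal((n_h2, n_h1))
        self.b2 = np.zeros(n_h2)
        self.lam1 = 0.1 * rng.standard_normal(n_h1)  # scale factor per first-layer unit
        self.lam2 = 0.1 * rng.standard_normal(n_h2)  # scale factor per second-layer unit

    def energy(self, x):
        h1 = sigmoid(self.W1 @ x + self.b1)
        h2 = sigmoid(self.W2 @ h1 + self.b2)
        return self.lam1 @ h1 + self.lam2 @ h2

    def grad_x(self, x):
        """dE/dx, the slope of the energy surface at x, needed for HMC."""
        h1 = sigmoid(self.W1 @ x + self.b1)
        h2 = sigmoid(self.W2 @ h1 + self.b2)
        dE_dh1 = self.lam1 + self.W2.T @ (self.lam2 * h2 * (1.0 - h2))
        return self.W1.T @ (dE_dh1 * h1 * (1.0 - h1))
```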
23 To produce the confabulations we start at the datapoints and use a Markov chain that is a simplified version of the Hybrid Monte Carlo sampler. Figure 2: (a) shows the probabilities learned by the network without using wormholes, displayed on a 32 × 32 grid in the dataspace. [sent-40, score-0.407]
24 (b) shows that the probability mass in the different minima matches the data distribution after 10 parameter updates using point-to-point wormholes defined by the vector differences between pairs of training points. [sent-42, score-0.687]
25 The mode-hopping allowed by the wormholes increases the number of confabulations that end up in the deeper minima, which causes the learning algorithm to raise the energy of these minima. [sent-43, score-0.883]
26 Each datapoint is treated as a particle on the energy surface. [sent-45, score-0.234]
27 The particle is given a random initial momentum chosen from a unit-variance isotropic Gaussian and its deterministic trajectory along the energy surface is then simulated for 10 time steps. [sent-46, score-0.303]
28 If this simulation has no numerical errors, the increase, ∆E, in the combined potential and kinetic energy will be zero. [sent-47, score-0.197]
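A sketch of one such brief HMC move, assuming the standard leapfrog integrator and the usual Metropolis test min(1, exp(−∆E)); the step size is an arbitrary placeholder rather than a value from the paper.

```python
import numpy as np

def hmc_step(x0, energy, grad_x, n_steps=10, step_size=0.05, rng=None):
    """One brief Hybrid Monte Carlo move started at a data point x0."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.standard_normal(x0.shape)      # unit-variance isotropic momentum
    x = np.array(x0, dtype=float)
    H0 = energy(x) + 0.5 * p @ p           # combined potential and kinetic energy

    p = p - 0.5 * step_size * grad_x(x)    # leapfrog: initial half momentum step
    for i in range(n_steps):
        x = x + step_size * p
        if i < n_steps - 1:
            p = p - step_size * grad_x(x)
    p = p - 0.5 * step_size * grad_x(x)    # final half momentum step

    dE = energy(x) + 0.5 * p @ p - H0      # would be zero if the simulation were exact
    if rng.random() < min(1.0, np.exp(-dE)):
        return x                            # accept the end point of the trajectory
    return x0                               # reject: keep the starting point
```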
29 Notice that the model assigns much more probability mass to some minima than to others. [sent-52, score-0.154]
30 Figure 2b shows how the probability density is corrected by 10 parameter updates using a Markov chain that has been modified by adding an optional long-range jump at the end of each accepted trajectory. [sent-54, score-0.708]
31 The candidate jump is simply the vector difference between two randomly selected training points. [sent-55, score-0.386]
32 The jump is always accepted if it lowers the energy. [sent-56, score-0.427]
33 If it raises the energy it is accepted with a probability of exp(−∆E). [sent-57, score-0.328]
34 Since the probability that point A in the space will be offered a jump to point B is the same as the probability that B will be offered a jump to A, the jumps do not affect detailed balance. [sent-58, score-1.046]
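A minimal sketch of the point-to-point wormhole move described above, assuming an energy function and the training set are available:

```python
import numpy as np

def point_wormhole_jump(x, energy, train_data, rng=None):
    """Offer x a jump by the vector difference between two randomly selected
    training points; accept it if it lowers the energy, otherwise accept with
    probability exp(-dE).  The proposal is symmetric, so detailed balance holds."""
    rng = np.random.default_rng() if rng is None else rng
    i, j = rng.choice(len(train_data), size=2, replace=False)
    y = x + (train_data[i] - train_data[j])
    dE = energy(y) - energy(x)
    if dE <= 0 or rng.random() < np.exp(-dE):
        return y
    return x
```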
35 One way to think about the jumps is to imagine that every point in the dataspace is connected by wormholes to n(n − 1) other points so that it can move to any of these points in a single step. [sent-59, score-0.753]
36 To understand how the long-range moves deal with the trade-off between energy and entropy, consider a proposed move that is based on the vector offset between a training point that lies in a deep narrow energy minimum and a training point that lies in a broad shallow minimum. [sent-60, score-0.895] [sent-61, score-0.469]
37 (Footnote 1: Note that, depending on the height of the energy barrier between the modes, this may take too long for practical purposes.)
38 If the move is applied to a random point in the deep minimum, it stands a good chance of moving to a point within the broad shallow minimum, but it will probably be rejected because the energy has increased. [sent-62, score-0.476]
39 If the opposite move is applied to a random point in the broad minimum, the resulting point is unlikely to fall within the narrow minimum, though if it does it is very likely to be accepted. [sent-63, score-0.24]
40 Jumps generated by random pairs of datapoints work well if the minima are all the same shape, but in a high-dimensional space it is very unlikely that such a jump will be accepted if different energy minima are strongly elongated in different directions. [sent-65, score-0.792]
41 3 A local optimization-based method: In high dimensions the simple wormhole method will have a low acceptance rate because most jumps will land in high-energy regions. [sent-66, score-0.681]
42 One way to avoid that is to use local optimization: after a jump has been made, descend into a nearby low-energy region. [sent-67, score-0.339]
43 It fits Gaussians to the detected low energy regions in order to account for their volume. [sent-70, score-0.269]
44 Given a point x, let mx be the point found by running a minimization algorithm on E(x) for a few steps (or until convergence) starting at x. [sent-72, score-0.217]
45 A Gaussian density gx (y) is then defined by the mean mx and the covariance matrix Σx . [sent-75, score-0.179]
46 To generate a jump proposal, we make a forward jump by adding the vector difference d between two randomly selected data points to the initial point x0 , obtaining x. [sent-76, score-0.741]
47 Then we compute mx and Σx , and sample a proposed jump destination y from gx (y). [sent-77, score-0.474]
48 Then we make a backward jump by adding −d to y to obtain z, and compute mz and Σz , specifying gz (x). [sent-78, score-0.417]
49 The proposed jump from x0 to y is accepted with probability min(1, exp(−E(y)) gz(x0) / (exp(−E(x0)) gx(y))). Our implementation of the algorithm executes 20 steps of steepest descent to find mx and mz. [sent-80, score-0.25]
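A sketch of the optimization-based jump under stated assumptions: the covariance of each fitted Gaussian is built here from a finite-difference diagonal Hessian (the paper's exact construction is not visible in the extracted text), the descent step size is arbitrary, and the acceptance test uses the Metropolis-Hastings ratio given above.

```python
import numpy as np

def descend(x, grad_x, n_steps=20, lr=0.05):
    """A few steps of steepest descent on E; the step size is an assumption."""
    x = np.array(x, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_x(x)
    return x

def fit_gaussian(m, energy, eps=1e-3, jitter=1e-4):
    """Diagonal Gaussian around a local minimum m, using a finite-difference
    diagonal Hessian as a guessed stand-in for the paper's covariance."""
    d = len(m)
    hdiag = np.empty(d)
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        hdiag[k] = (energy(m + e) - 2.0 * energy(m) + energy(m - e)) / eps ** 2
    var = 1.0 / np.maximum(hdiag, jitter)   # keep the variances positive
    return m, var

def gauss_logpdf(x, mean, var):
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))

def optimization_jump(x0, energy, grad_x, train_data, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    i, j = rng.choice(len(train_data), size=2, replace=False)
    d = train_data[i] - train_data[j]

    # Forward: jump by d, descend to mx, fit g_x and sample the destination y.
    mx, vx = fit_gaussian(descend(x0 + d, grad_x), energy)
    y = mx + np.sqrt(vx) * rng.standard_normal(len(mx))

    # Backward: jump by -d from y, descend to mz, fit g_z.
    mz, vz = fit_gaussian(descend(y - d, grad_x), energy)

    # Metropolis-Hastings ratio: exp(-E(y)) g_z(x0) / (exp(-E(x0)) g_x(y)).
    log_ratio = (-energy(y) + gauss_logpdf(x0, mz, vz)) - (-energy(x0) + gauss_logpdf(y, mx, vx))
    if np.log(rng.random()) < min(0.0, log_ratio):
        return y
    return x0
```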
50 4 Gaping wormholes: In this section we describe a third method, based on “darting MCMC” [8], to jump between the modes of a distribution. [sent-82, score-1.018]
51 The idea of this technique is to define spherical regions on the modes of the distribution and to jump only between corresponding points in those regions. [sent-83, score-0.674]
52 When we consider a long-range move we check whether or not we are inside a wormhole. [sent-84, score-0.133]
53 When inside a wormhole we initiate a jump to some other wormhole. [sent-85, score-1.308]
54 If we make a jump we must also use the usual Metropolis rejection rule to decide whether to accept the jump. [sent-88, score-0.417]
55 In high dimensional spaces this procedure may still lead to unacceptably high rejection rates because the modes will likely decay sharply in at least a few directions. [sent-89, score-0.311]
56 Since these ridges of probability are likely to be uncorrelated across the modes, the proposed target location of the jump will most of the time have very low probability, resulting in almost certain rejection. [sent-90, score-0.473]
57 To deal with this problem, we propose a generalization of the described method, where the wormholes can have arbitrary shapes and volumes. [sent-91, score-0.475]
58 As before, when we are considering a long-range move we first check our position: if we are located inside a wormhole we initiate a jump (which may be rejected), while if we are located outside a wormhole we stay put. [sent-92, score-1.403]
59 To maintain detailed balance between wormholes we need to compensate for their potentially different volume factors. [sent-93, score-0.608]
60 To that end, we impose the constraint Vi Pi→j = Vj Pj→i (4) on all pairs of wormholes, where Pi→j is a transition probability and Vi and Vj are the volumes of the wormholes i and j respectively. [sent-94, score-0.524]
61 This in fact defines a separate Markov chain between the wormholes with equilibrium distribution P_i^EQ = V_i / (Σ_j V_j)   (5). The simplest method (footnote 2) to compensate for the different volume factors is therefore to sample a target wormhole from this distribution P^EQ. [sent-95, score-1.114]
62 When the target wormhole has been determined we can either sample a point uniformly within its volume or design some deterministic mapping (see also [4]). [sent-96, score-0.581]
63 Finally, once the arrival point has been determined we need to compensate for the fact that the probability of the point of departure is likely to be different than the probability of the point of arrival. [sent-97, score-0.22]
64 The usual Metropolis rule applies in this case: P_accept = min(1, P_arrive / P_depart)   (6). This combined set of rules ensures that detailed balance holds and that the samples will eventually come from the correct probability distribution. [sent-98, score-0.14]
65 One way of employing this sampler in conjunction with contrastive divergence learning is to fit a “mixture of Gaussians” model to the data distribution in a preprocessing step. [sent-99, score-0.26]
66 These regions provide good jump points during CD-learning because it is expected that the valleys in the energy landscape correspond to the regions where the data cluster. [sent-101, score-0.71]
67 To minimize the rejection rate we map points in one ellipse to “corresponding” points in another ellipse as follows. [sent-102, score-0.237]
68 Let Σdepart and Σarrive be the covariance matrices of the wormholes in question. [sent-103, score-0.446]
69 The following transformation maps iso-probability contours in one wormhole to iso-probability contours in another: x_arrive − µ_arrive = −U_arrive S_arrive^(1/2) S_depart^(−1/2) U_depart^T (x_depart − µ_depart)   (8), with µ the center location of the ellipse. [sent-105, score-0.491]
70 The negative sign in front of the transformation is to promote better exploration when the target wormhole turns out to be the same as the wormhole from which the jump is initiated. [sent-106, score-1.264]
71 Thus, wormholes are sampled from P^EQ and proposed moves are accepted according to equation 6. [sent-108, score-0.656]
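A sketch of the region-to-region move under several assumptions not spelled out in the extracted text: ellipsoidal regions scaled by a factor alpha, membership tested by a unit Mahalanobis radius, region volumes taken proportional to the square root of the determinant of the scaled covariance, and P_arrive/P_depart taken proportional to exp(−E). The mapping implements equation 8 via the eigendecomposition Σ = U S U^T; overlapping regions are handled by the adjusted rule sketched further below.

```python
import numpy as np

class Wormhole:
    """Ellipsoidal region placed on a mode: centre mu, covariance Sigma, scale alpha."""

    def __init__(self, mu, Sigma, alpha=10.0):
        self.mu = np.asarray(mu, dtype=float)
        self.alpha = alpha
        self.S, self.U = np.linalg.eigh(np.asarray(Sigma, dtype=float))  # Sigma = U diag(S) U^T
        self.volume = float(np.sqrt(np.prod(alpha * self.S)))            # ~ ellipsoid volume

    def contains(self, x):
        z = self.U.T @ (x - self.mu)
        return float(np.sum(z ** 2 / (self.alpha * self.S))) <= 1.0

def map_between(depart, arrive, x):
    """Deterministic mapping of equation 8 between iso-probability contours."""
    z = depart.U.T @ (x - depart.mu) / np.sqrt(depart.S)     # whiten w.r.t. the departure ellipse
    return arrive.mu - arrive.U @ (np.sqrt(arrive.S) * z)    # note the minus sign

def region_jump(x, energy, wormholes, rng=None):
    """One region-to-region move: stay put outside every region; otherwise pick a
    target region with probability proportional to its volume (equation 5), map
    through equation 8, and accept with min(1, P_arrive/P_depart) (equation 6),
    taking P proportional to exp(-E)."""
    rng = np.random.default_rng() if rng is None else rng
    inside = [w for w in wormholes if w.contains(x)]
    if not inside:
        return x
    vols = np.array([w.volume for w in wormholes])
    arrive = wormholes[rng.choice(len(wormholes), p=vols / vols.sum())]
    y = map_between(inside[0], arrive, x)    # assumes non-overlapping regions here
    if rng.random() < min(1.0, np.exp(energy(x) - energy(y))):
        return y
    return x
```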
72 For both the deterministic and the stochastic moves we may also want to consider regions that overlap. [sent-109, score-0.205]
73 For instance, if we generate wormholes by fitting a mixture of Gaussians, it is likely that some of the fitted regions will overlap. (Footnote 2: Other methods that respect the constraint (4) are possible but are suboptimal in the sense that they mix more slowly to the equilibrium distribution.) [sent-110, score-0.485]
74 Figure 3(b): probability density of the model learned with contrastive divergence. [sent-112, score-0.174]
75 First define narrive as the total number of regions that contain point xarrive and similarly for ndepart . [sent-116, score-0.24]
76 Detailed balance can still be maintained for both deterministic and stochastic moves if we adapt the Metropolis acceptance rule as follows: P_accept = min(1, (n_depart P_arrive) / (n_arrive P_depart))   (9). Further details can be found in [6]. [sent-117, score-0.372]
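A sketch of the overlap-corrected acceptance test of equation 9, reusing the Wormhole class from the previous sketch; taking P proportional to exp(−E) is again an assumption.

```python
import numpy as np

def accept_with_overlap(x_depart, x_arrive, energy, wormholes, rng=None):
    """Equation 9: multiply the usual ratio by n_depart / n_arrive, the numbers of
    regions containing the departure and arrival points (n_arrive >= 1 whenever the
    arrival point was produced by mapping into a target region)."""
    rng = np.random.default_rng() if rng is None else rng
    n_depart = sum(w.contains(x_depart) for w in wormholes)
    n_arrive = sum(w.contains(x_arrive) for w in wormholes)
    ratio = (n_depart / n_arrive) * np.exp(energy(x_depart) - energy(x_arrive))
    return rng.random() < min(1.0, ratio)
```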
77 5 An experimental comparison of the three methods: To highlight the difference between the point and the region wormhole sampler, we sampled 1024 data points along two very narrow orthogonal ridges (see figure 3a), with half of the cases in each mode. [sent-118, score-0.691]
78 A model with the same architecture as depicted in figure 1 was learned using contrastive divergence, but with “Cauchy” nonlinearities of the form f(x) = log(1 + x^2) instead of the logistic function. [sent-119, score-0.158]
79 Clearly, the lack of mixing between the modes has resulted in one mode being much stronger than the other one. [sent-121, score-0.263]
80 Subsequently, learning was resumed using a Markov chain that proposed a long-range jump for all confabulations after each brief HMC run. [sent-122, score-0.709]
81 The regions in the region wormhole sampler were generated by fitting a mixture of two Gaussians to the data using EM, and setting α = 10. [sent-123, score-0.688]
82 Both the point wormhole method and the region wormhole method were able to correct the asymmetry in the solution, but the region method did so much faster, as shown in figure 4b. [sent-124, score-1.017]
83 The reason is that, with the point-to-point method, a much smaller fraction of the confabulations succeed in making a long-range jump, as shown in figure 4a. [sent-125, score-0.517]
84 We then compared all three wormhole algorithms on a family of datasets of varying dimensionality. [sent-126, score-0.446]
85 The first two components of each point were sampled uniformly from two axis-aligned narrow orthogonal ridges and then rotated by 45◦ around the origin to ensure that the diagonal approximation to the Hessian, used by the local optimization-based algorithm, was not unfairly accurate. [sent-128, score-0.198]
86 Figure 4(b): log-odds of the probability masses contained in small volumes surrounding the two modes for the point wormhole method (dashed line) and the region wormhole method (solid line). [sent-133, score-1.327]
87 The second hidden layer consisted of Cauchy units, while the first hidden layer consisted of some Cauchy and some sigmoid units. [sent-136, score-0.246]
88 This prevented HMC from being slowed down by the narrow energy ravines resulting from the tight constraints on the last n − 2 components. [sent-147, score-0.275]
89 After the model was trained (without wormholes), we compared the performance of the three jump samplers by allowing each sampler to make a proposal for each training case and then comparing the acceptance rates. [sent-148, score-0.65]
90 The average number of successful jumps between modes per iteration is shown in the table below. [sent-151, score-0.336]
91 Dimensionality       Network architecture   Simple wormholes   Optimization-based   Region wormholes
   2                    10+10, 2               10                 15                   372
   4                    20+10, 4               6                  17                   407
   8                    20+10, 6               3                  19                   397
   16                   40+10, 8               1                  13                   338
   32                   50+10, 10              1                  9                    295
   Relative run time                           1                  2.6                  1
   [sent-152, score-0.446]
92 The network architecture column shows the number of units in the hidden layers, with each entry giving the number of Cauchy units plus the number of sigmoid units in the first hidden layer and the number of Cauchy units in the second hidden layer. [sent-153, score-0.987]
93 Minimizing contrastive divergence is much easier than maximizing likelihood, but the brief Markov chain does not have time to mix between separated modes in the distribution (see footnote 3). [sent-155, score-0.638]
94 Their success relies on the fact that the data distribution provides valuable suggestions about the location of the modes of a good model. [sent-158, score-0.266]
95 Since the probability of the model distribution is expected to be substantial in these regions, they can be successfully used as target locations for long-range moves in an MCMC sampler. [sent-159, score-0.241]
96 The MCMC sampler with point-to-point wormholes is simple but has a high rejection rate when the modes are not aligned. [sent-160, score-0.91]
97 Performing local gradient descent after a jump significantly increases the acceptance rate, but only leads to a modest improvement in efficiency because of the extra computations required to maintain detailed balance. [sent-161, score-0.561]
98 The MCMC sampler with region-to-region wormholes targets its moves to regions that are likely to have high probability under the model and therefore has a much better acceptance rate, provided the distribution can be modelled well by a mixture. [sent-162, score-0.881]
99 None of the methods we have proposed will work well for high-dimensional, approximately factorial distributions that have exponentially many modes formed by the cross-product of multiple lower-dimensional distributions. [sent-163, score-0.233]
100 (Footnote 3) However, note that in cases where the modes are well separated, even Markov chains that run for an extraordinarily long time will not mix properly between those modes, and the results of this paper become relevant. [sent-217, score-0.272]
wordName wordTfidf (topN-words)
[('wormhole', 0.446), ('wormholes', 0.446), ('jump', 0.339), ('modes', 0.233), ('energy', 0.197), ('confabulations', 0.178), ('chain', 0.154), ('sampler', 0.124), ('mcmc', 0.104), ('acceptance', 0.103), ('jumps', 0.103), ('contrastive', 0.099), ('cauchy', 0.097), ('moves', 0.093), ('hmc', 0.089), ('accepted', 0.088), ('mx', 0.082), ('narrow', 0.078), ('rejection', 0.078), ('markov', 0.073), ('hidden', 0.073), ('regions', 0.072), ('units', 0.068), ('move', 0.066), ('minima', 0.062), ('ridges', 0.058), ('metropolis', 0.058), ('hessian', 0.054), ('gx', 0.053), ('monte', 0.052), ('detailed', 0.051), ('layer', 0.05), ('mass', 0.049), ('training', 0.047), ('energies', 0.047), ('region', 0.046), ('balance', 0.046), ('darting', 0.045), ('dataspace', 0.045), ('deep', 0.045), ('masses', 0.045), ('narrive', 0.045), ('ndepart', 0.045), ('paccept', 0.045), ('parrive', 0.045), ('pdepart', 0.045), ('xarrive', 0.045), ('datapoints', 0.044), ('density', 0.044), ('probability', 0.043), ('toronto', 0.042), ('steps', 0.041), ('gaussians', 0.04), ('updates', 0.04), ('deterministic', 0.04), ('mix', 0.039), ('gz', 0.039), ('depart', 0.039), ('initiate', 0.039), ('mz', 0.039), ('shallow', 0.039), ('separated', 0.038), ('inside', 0.038), ('brief', 0.038), ('vj', 0.038), ('gradient', 0.038), ('divergence', 0.037), ('particle', 0.037), ('proposal', 0.037), ('compensate', 0.035), ('started', 0.035), ('feedforward', 0.035), ('ellipse', 0.035), ('volumes', 0.035), ('steepest', 0.035), ('hx', 0.035), ('carlo', 0.034), ('target', 0.033), ('increment', 0.033), ('rejected', 0.033), ('suggestions', 0.033), ('point', 0.033), ('learned', 0.031), ('fortunately', 0.031), ('offered', 0.031), ('eq', 0.031), ('points', 0.03), ('maintain', 0.03), ('vi', 0.03), ('broad', 0.03), ('mode', 0.03), ('momentum', 0.029), ('dy', 0.029), ('equation', 0.029), ('deal', 0.029), ('uniformly', 0.029), ('rate', 0.029), ('check', 0.029), ('running', 0.028), ('logistic', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 196 nips-2003-Wormholes Improve Contrastive Divergence
Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
2 0.096853852 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
3 0.083177693 32 nips-2003-Approximate Expectation Maximization
Author: Tom Heskes, Onno Zoeter, Wim Wiegerinck
Abstract: We discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This then yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach. 1
4 0.076882951 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
5 0.071911268 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis
Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1
6 0.070549622 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
7 0.069531038 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
8 0.05948982 37 nips-2003-Automatic Annotation of Everyday Movements
9 0.059427053 169 nips-2003-Sample Propagation
10 0.052846983 130 nips-2003-Model Uncertainty in Classical Conditioning
11 0.051827088 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
12 0.049974255 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
13 0.046914384 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
14 0.046686526 155 nips-2003-Perspectives on Sparse Bayesian Learning
15 0.045244005 102 nips-2003-Large Scale Online Learning
16 0.043737046 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
17 0.043297864 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
18 0.041262604 111 nips-2003-Learning the k in k-means
19 0.041116692 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
20 0.041088395 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
topicId topicWeight
[(0, -0.162), (1, 0.014), (2, -0.003), (3, 0.043), (4, -0.005), (5, -0.001), (6, 0.095), (7, 0.044), (8, 0.023), (9, 0.012), (10, -0.033), (11, -0.009), (12, 0.008), (13, -0.006), (14, 0.049), (15, -0.052), (16, -0.034), (17, 0.009), (18, 0.097), (19, -0.017), (20, -0.048), (21, 0.028), (22, -0.034), (23, 0.052), (24, -0.081), (25, -0.063), (26, -0.003), (27, 0.011), (28, 0.028), (29, -0.038), (30, -0.076), (31, -0.011), (32, 0.019), (33, 0.018), (34, 0.081), (35, -0.025), (36, -0.159), (37, -0.225), (38, -0.005), (39, 0.184), (40, -0.043), (41, 0.022), (42, -0.095), (43, 0.091), (44, -0.069), (45, 0.006), (46, -0.034), (47, 0.068), (48, -0.149), (49, 0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.92629439 196 nips-2003-Wormholes Improve Contrastive Divergence
Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
2 0.62962538 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
Author: Daniel B. Neill, Andrew W. Moore
Abstract: Given an N ×N grid of squares, where each square has a count and an underlying population, our goal is to find the square region with the highest density, and to calculate its significance by randomization. Any density measure D, dependent on the total count and total population of a region, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff’s spatial scan statistic DK to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N 3 ) time, and is generally computationally infeasible. We present a novel algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(N 2 ) time, in practice resulting in significant (10-200x) speedups. 1
3 0.60652769 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
4 0.60276961 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
Author: Woojae Kim, Daniel J. Navarro, Mark A. Pitt, In J. Myung
Abstract: Despite the popularity of connectionist models in cognitive science, their performance can often be difficult to evaluate. Inspired by the geometric approach to statistical model selection, we introduce a conceptually similar method to examine the global behavior of a connectionist model, by counting the number and types of response patterns it can simulate. The Markov Chain Monte Carlo-based algorithm that we constructed Þnds these patterns efficiently. We demonstrate the approach using two localist network models of speech perception. 1
5 0.55045259 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
6 0.46593833 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
7 0.44377196 187 nips-2003-Training a Quantum Neural Network
8 0.43884328 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
9 0.41196579 175 nips-2003-Sensory Modality Segregation
10 0.40527219 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
11 0.38649625 51 nips-2003-Design of Experiments via Information Theory
12 0.38492936 169 nips-2003-Sample Propagation
13 0.382898 123 nips-2003-Markov Models for Automated ECG Interval Analysis
14 0.37113485 102 nips-2003-Large Scale Online Learning
15 0.35272795 181 nips-2003-Statistical Debugging of Sampled Programs
16 0.35159087 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
17 0.35031056 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
18 0.32086891 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
19 0.31951338 130 nips-2003-Model Uncertainty in Classical Conditioning
20 0.31848735 126 nips-2003-Measure Based Regularization
topicId topicWeight
[(0, 0.055), (11, 0.016), (29, 0.012), (30, 0.017), (33, 0.02), (35, 0.064), (53, 0.084), (55, 0.013), (69, 0.386), (71, 0.053), (76, 0.032), (85, 0.059), (91, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.83155406 196 nips-2003-Wormholes Improve Contrastive Divergence
Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
2 0.79096472 51 nips-2003-Design of Experiments via Information Theory
Author: Liam Paninski
Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1
3 0.52215141 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
4 0.48731506 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
5 0.47124499 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.46601048 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
7 0.45426214 126 nips-2003-Measure Based Regularization
8 0.45374668 172 nips-2003-Semi-Supervised Learning with Trees
9 0.44447958 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
10 0.44265923 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
11 0.44062281 143 nips-2003-On the Dynamics of Boosting
12 0.43863669 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
13 0.43366244 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
14 0.43355107 176 nips-2003-Sequential Bayesian Kernel Regression
15 0.43181974 115 nips-2003-Linear Dependent Dimensionality Reduction
16 0.42868379 55 nips-2003-Distributed Optimization in Adaptive Networks
17 0.42823017 102 nips-2003-Large Scale Online Learning
18 0.42727467 169 nips-2003-Sample Propagation
19 0.4266603 78 nips-2003-Gaussian Processes in Reinforcement Learning
20 0.4258765 194 nips-2003-Warped Gaussian Processes