nips nips2013 nips2013-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. [sent-9, score-0.428]
2 We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. [sent-10, score-0.692]
3 We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. [sent-11, score-0.551]
4 AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. [sent-12, score-0.339]
5 AIS requires defining a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. [sent-20, score-0.497]
6 Typically, one uses geometric averages of the initial and target distributions. [sent-21, score-0.324]
7 Tantalizingly, [12] derived the optimal paths for some toy models in the context of path sampling and showed that they vastly outperformed geometric averages. [sent-22, score-0.461]
8 However, as choosing an optimal path is generally intractable, geometric averages still predominate. [sent-23, score-0.437]
9 We propose a novel path defined by averaging moments of the initial and target distributions. [sent-25, score-0.55]
10 We show that geometric averages and moment averages optimize different variational objectives, derive an asymptotically optimal piecewise linear schedule, and analyze the asymptotic performance of both paths. [sent-26, score-0.598]
11 Our proposed path often outperforms geometric averages at estimating partition functions of restricted Boltzmann machines (RBMs). [sent-27, score-0.518]
12 2 Estimating Partition Functions: Suppose we have a probability distribution pb(x) = fb(x)/Zb defined on a space X, where fb(x) can be computed efficiently for a given x ∈ X, and we are interested in estimating the partition function Zb. [sent-28, score-0.315]
13 In particular, one must specify a sequence of K + 1 intermediate distributions pk(x) = fk(x)/Zk for k = 0, . . . , K. [sent-30, score-0.398]
14 Here pa(x) = p0(x) is a tractable initial distribution, and pb(x) = pK(x) is the intractable target distribution. [sent-33, score-0.406]
15 AIS alternates between MCMC transitions and importance sampling updates, as shown in Alg 1. [sent-38, score-0.2]
16 However, unless the intermediate distributions and transition operators are carefully chosen, the estimate Ẑb may have high variance and be far from Zb with high probability. (In Alg. 1, each weight is initialized as w(i) ← Za, multiplied by fk(xk−1)/fk−1(xk−1) for k = 1 to K, and the estimate returned is Ẑb = (1/M) Σ_{i=1}^M w(i).) [sent-41, score-0.311]
17 However, one typically defines a path γ : [0, 1] → P through some family P of distributions. [sent-43, score-0.221]
18 The intermediate distributions pk are chosen to be points along this path corresponding to a schedule 0 = β0 < β1 < · · · < βK = 1. [sent-44, score-0.701]
19 One typically uses the geometric path γGA, defined in terms of geometric averages of pa and pb: pβ(x) = fβ(x)/Z(β) = fa(x)^(1−β) fb(x)^β / Z(β).  (1) [sent-48, score-0.783]
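To make the procedure concrete, here is a minimal NumPy sketch of AIS run along the geometric path of (1). This is an illustration written for this summary, not the authors' code; the unnormalized log densities log_fa and log_fb, the constant log_Za, the exact sampler sample_p0, the MCMC operator transition, and the schedule betas are all assumed, user-supplied pieces.

```python
import numpy as np

def ais_log_weights(log_fa, log_fb, log_Za, sample_p0, transition, betas, num_chains=100):
    """Minimal AIS sketch along the geometric path f_beta = fa^(1-beta) * fb^beta.

    log_fa, log_fb : unnormalized log densities of the initial and target distributions
    log_Za         : log partition function of the tractable initial distribution
    sample_p0      : function returning one exact sample from the initial distribution
    transition     : MCMC operator leaving f_beta invariant, signature (x, beta) -> x'
    betas          : schedule 0 = beta_0 < ... < beta_K = 1
    """
    def log_f(x, beta):
        return (1.0 - beta) * log_fa(x) + beta * log_fb(x)  # geometric averages, Eq. (1)

    log_w = np.full(num_chains, log_Za)
    for i in range(num_chains):
        x = sample_p0()
        for k in range(1, len(betas)):
            # importance-weight update, then an MCMC move targeting f at beta_k
            log_w[i] += log_f(x, betas[k]) - log_f(x, betas[k - 1])
            x = transition(x, betas[k])
    return log_w

def log_Zb_estimate(log_w):
    # log of the averaged weights: log( (1/M) * sum_i w_i )
    return np.logaddexp.reduce(log_w) - np.log(len(log_w))
```

With a linear schedule, betas = np.linspace(0, 1, K + 1); swapping in a different log_f gives the alternative paths discussed below.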
20 As in simulated annealing, the “hotter” distributions often allow faster mixing between modes which are isolated in pb . [sent-51, score-0.208]
21 AIS is closely related to a broader family of techniques for posterior inference and partition function estimation, all based on the following identity from statistical physics: log Zb − log Za = ∫_0^1 E_{x∼pβ}[ (d/dβ) log fβ(x) ] dβ.  (2) [sent-52, score-0.297]
22 Thermodynamic integration [14] estimates (2) using numerical quadrature, and path sampling [12] does so with Monte Carlo integration. [sent-53, score-0.271]
23 The choices of a path and a schedule are central to all of these methods. [sent-56, score-0.329]
24 Most work on adapting paths has focused on tuning schedules along a geometric path [15, 16, 17]. [sent-57, score-0.531]
25 [15] showed that the geometric schedule was optimal for annealing the scale parameter of a Gaussian, and [16] extended this result more broadly. [sent-58, score-0.364]
26 The aim of this paper is to propose, analyze, and evaluate a novel alternative to γGA based on averaging moments of the initial and target distributions. [sent-59, score-0.366]
27 Assuming perfect transitions, the expected log weights are given by: E[log w(i)] = log Za + Σ_{k=0}^{K−1} E_{pk}[log fk+1(x) − log fk(x)] = log Zb − Σ_{k=0}^{K−1} DKL(pk ‖ pk+1).  (3) [sent-67, score-0.407]
28 In other words, each log w(i) can be seen as a biased estimator of log Zb, where the bias δ = log Zb − E[log w(i)] is given by the sum of KL divergences Σ_{k=0}^{K−1} DKL(pk ‖ pk+1). [sent-68, score-0.206]
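As a small numerical illustration of (3) (not from the paper), the sketch below sums the consecutive KL divergences along a discretized path. The 1-D Gaussian endpoints, the equal variances, and the linear schedule are assumptions chosen only so the bias is computable in closed form.

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    # KL( N(mu0, var0) || N(mu1, var1) ) in closed form for 1-D Gaussians
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def ais_bias(path, betas):
    """Sum of KL(p_k || p_{k+1}) along a schedule -- the bias delta in Eq. (3)."""
    params = [path(b) for b in betas]
    return sum(kl_gauss(*params[k], *params[k + 1]) for k in range(len(params) - 1))

mu_a, mu_b, var = 0.0, 10.0, 1.0                  # assumed, widely separated endpoints
betas = np.linspace(0.0, 1.0, 101)                # linear schedule with K = 100

# Geometric path: with equal variances it keeps the variance fixed and moves the mean linearly.
geometric = lambda b: ((1 - b) * mu_a + b * mu_b, var)
# Moment path: averaging E[x] and E[x^2] inflates the variance in the middle of the path.
moments = lambda b: ((1 - b) * mu_a + b * mu_b, var + b * (1 - b) * (mu_a - mu_b) ** 2)

print(ais_bias(geometric, betas), ais_bias(moments, betas))
```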
29 Suppose P is a family of probability distributions parameterized by θ ∈ Θ, and the K + 1 distributions p0, . . . , [sent-69, score-0.201]
30 pK are chosen to be linearly spaced along a path γ : [0, 1] → P. [sent-72, score-0.298]
31 Suppose K + 1 distributions pk are linearly spaced along a path γ. [sent-76, score-0.493]
32 This result reveals a relationship with path sampling, as [12] showed that the variance of the path sampling estimator is proportional to the same functional. [sent-79, score-0.409]
33 In particular, the value of F(γ) using the optimal schedule is given by ℓ(γ)²/2, where ℓ is the Riemannian path length defined by ℓ(γ) = ∫_0^1 sqrt( θ̇(β)ᵀ Gθ(β) θ̇(β) ) dβ.  (5) [sent-81, score-0.354]
34 Intuitively, the optimal schedule allocates more distributions to regions where pβ changes quickly. [sent-82, score-0.305]
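One way to act on this intuition in code (a sketch, not the authors' implementation): accumulate the path length of (5) numerically on a fine grid and then place the βk so that consecutive points are equally spaced in arc length. The Fisher metric function and the parametrized path below are assumed, user-supplied pieces.

```python
import numpy as np

def arclength_schedule(theta_of_beta, fisher_metric, K, grid=2000):
    """Place K+1 betas so they are equally spaced in Riemannian path length.

    theta_of_beta : function beta -> parameter vector theta(beta)
    fisher_metric : function theta -> Fisher information matrix G(theta)
    """
    betas = np.linspace(0.0, 1.0, grid + 1)
    thetas = np.array([theta_of_beta(b) for b in betas])
    lengths = np.zeros(grid + 1)
    for i in range(1, grid + 1):
        d = thetas[i] - thetas[i - 1]
        G = fisher_metric(thetas[i - 1])
        # local length element sqrt(dtheta^T G dtheta), as in Eq. (5)
        lengths[i] = lengths[i - 1] + np.sqrt(max(float(d @ G @ d), 0.0))
    targets = np.linspace(0.0, lengths[-1], K + 1)
    return np.interp(targets, lengths, betas)   # invert the cumulative length
```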
35 While [12] derived the optimal paths and schedules for some simple examples, they observed that this is intractable in most cases and recommended using geometric paths in practice. [sent-83, score-0.536]
36 The former source of variance is well modeled by the perfect transitions analysis and can be made small by adding more intermediate distributions. [sent-86, score-0.367]
37 4 Moment Averaging: As discussed in Section 2, the typical choice of intermediate distributions for AIS is the geometric averages path γGA given by (1). [sent-89, score-0.647]
38 In this section, we propose an alternative path for exponential family models. [sent-90, score-0.266]
39 In exponential families, geometric averages correspond to averaging the natural parameters: η(β) = (1 − β)η(0) + βη(1).  (7) [sent-93, score-0.342]
40 Exponential families can also be parameterized in terms of their moments s = E[g(x)]. [sent-94, score-0.247]
41 For a minimal exponential family (i.e., one whose sufficient statistics are linearly independent), there is a one-to-one mapping between moments and natural parameters [18]. [sent-97, score-0.235]
42 We propose an alternative to γGA called the moment averages path, denoted γMA, and defined by averaging the moments of the initial and target distributions: s(β) = (1 − β)s(0) + βs(1).  (8) [sent-99, score-0.611]
43 This path exists for any exponential family model, since the set of realizable moments is convex [18]. [sent-100, score-0.467]
44 The moments are E[x] = µ and −(1/2) E[xxᵀ] = −(1/2)(Σ + µµᵀ). [sent-103, score-0.201]
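To make the two paths concrete for Gaussians (a sketch under the assumption of multivariate normal endpoints, not the paper's code): the geometric path averages the natural parameters (precision and precision-times-mean), while the moment path averages E[x] and E[xxᵀ] and converts back to a mean and covariance.

```python
import numpy as np

def geometric_path(mu_a, Sigma_a, mu_b, Sigma_b, beta):
    # average the natural parameters: precision matrix and precision * mean
    Lam_a, Lam_b = np.linalg.inv(Sigma_a), np.linalg.inv(Sigma_b)
    Lam = (1 - beta) * Lam_a + beta * Lam_b
    h = (1 - beta) * Lam_a @ mu_a + beta * Lam_b @ mu_b
    Sigma = np.linalg.inv(Lam)
    return Sigma @ h, Sigma

def moment_path(mu_a, Sigma_a, mu_b, Sigma_b, beta):
    # average the moments E[x] and E[x x^T], then convert back to (mu, Sigma)
    mu = (1 - beta) * mu_a + beta * mu_b
    second = ((1 - beta) * (Sigma_a + np.outer(mu_a, mu_a))
              + beta * (Sigma_b + np.outer(mu_b, mu_b)))
    return mu, second - np.outer(mu, mu)

# Assumed example: two well-separated unit-variance Gaussians. At beta = 0.5 the
# moment path is broad (variance 26), covering both endpoints, while the
# geometric path stays narrow (variance 1).
mu_a, mu_b, S = np.array([0.0]), np.array([10.0]), np.array([[1.0]])
print(moment_path(mu_a, S, mu_b, S, 0.5))
print(geometric_path(mu_a, S, mu_b, S, 0.5))
```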
45 The parameters of the model are the visible biases a, the hidden biases b, and the weights W. [sent-110, score-0.22]
46 Since these parameters are also the natural parameters in the exponential family representation, γGA reduces to linearly averaging the biases and the weights. [sent-111, score-0.233]
47 Since it would be infeasible to moment match every βk even approximately, we introduce the moment averages spline (MAS) path, denoted γMAS. [sent-117, score-0.409]
48 We choose R values β1, . . . , βR called knots, and solve for the natural parameters η(βj) to match the moments s(βj) for each knot. [sent-121, score-0.201]
49 We then interpolate between the knots using geometric averages. [sent-122, score-0.209]
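A minimal sketch of that construction (illustrative, not the paper's implementation): moment_to_natural is an assumed, model-specific solver that returns natural parameters whose moments match a given target (for RBMs this step itself requires approximate optimization); the knots are then connected by averaging natural parameters, i.e. geometric segments.

```python
import numpy as np

def build_mas_path(eta_a, eta_b, s_a, s_b, knot_betas, moment_to_natural):
    """Moment averages spline (MAS): moment-match at a few knots, then interpolate
    natural parameters linearly (geometric averages) between consecutive knots."""
    knots = [0.0] + list(knot_betas) + [1.0]
    etas = [np.asarray(eta_a)]
    for b in knot_betas:
        s_target = (1 - b) * np.asarray(s_a) + b * np.asarray(s_b)  # moment average at the knot
        etas.append(np.asarray(moment_to_natural(s_target)))        # approximate moment matching
    etas.append(np.asarray(eta_b))

    def eta_of_beta(beta):
        j = int(np.searchsorted(knots, beta, side="right")) - 1
        j = min(j, len(knots) - 2)
        t = (beta - knots[j]) / (knots[j + 1] - knots[j])
        return (1 - t) * etas[j] + t * etas[j + 1]   # geometric segment between knots

    return eta_of_beta
```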
50 For geometric averages, the intermediate distribution γGA(β) minimizes a weighted sum of KL divergences to the initial and target distributions: pβ^(GA) = argmin_q (1 − β) DKL(q ‖ pa) + β DKL(q ‖ pb). [sent-127, score-0.572]
51 By contrast, (16) encourages the distribution to be spread out in order to capture all high probability regions of both pa and pb . [sent-133, score-0.212]
52 This interpretation helps explain why the intermediate distributions in the Gaussian example of Figure 1 take the shape that they do. [sent-134, score-0.235]
53 In our experiments, we found that γM A often gave more accurate results than γGA because the intermediate distributions captured regions of the target distribution which were missed by γGA . [sent-135, score-0.378]
54 Intriguingly, both paths always result in the same value of the cost functional F(γ) of Theorem 1 for any exponential family model. [sent-138, score-0.246]
55 For any exponential family model with natural parameters η and moments s, all three paths share the same value of the cost functional: F(γGA) = F(γMA) = F(γMAS) = (1/2)(η(1) − η(0))ᵀ(s(1) − s(0)). [sent-140, score-0.409]
56 Because γGA and γMA linearly interpolate the natural parameters and moments respectively, F(γGA) = (1/2)(η(1) − η(0))ᵀ ∫_0^1 ṡ(β) dβ = (1/2)(η(1) − η(0))ᵀ(s(1) − s(0))  (18), and F(γMA) = (1/2)(s(1) − s(0))ᵀ ∫_0^1 η̇(β) dβ = (1/2)(s(1) − s(0))ᵀ(η(1) − η(0))  (19). [sent-146, score-0.278]
57 Finally, to show that F(γMAS) = F(γMA), observe that γMAS uses the geometric path between each pair of knots γ(βj) and γ(βj+1), while γMA uses the moments path. [sent-147, score-0.551]
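A quick numerical check of this identity (a sketch assuming 1-D Gaussian endpoints, chosen only for convenience): approximate F(γ) = (1/2)∫ η̇(β)ᵀ ṡ(β) dβ by finite differences along each path and compare it with (1/2)(η(1) − η(0))ᵀ(s(1) − s(0)).

```python
import numpy as np

def gauss_eta(mu, var):
    return np.array([mu / var, -0.5 / var])       # natural parameters for stats (x, x^2)

def gauss_s(mu, var):
    return np.array([mu, var + mu ** 2])          # moments of the stats (x, x^2)

def cost_functional(eta_of_beta, s_of_beta, grid=20000):
    betas = np.linspace(0.0, 1.0, grid + 1)
    d_eta = np.diff([eta_of_beta(b) for b in betas], axis=0)
    d_s = np.diff([s_of_beta(b) for b in betas], axis=0)
    # F = 1/2 * integral of eta_dot . s_dot d(beta), with d(beta) = 1/grid
    return 0.5 * grid * float(np.einsum("ij,ij->", d_eta, d_s))

mu_a, var_a, mu_b, var_b = 0.0, 1.0, 10.0, 1.0    # assumed endpoint Gaussians
eta_a, eta_b = gauss_eta(mu_a, var_a), gauss_eta(mu_b, var_b)
s_a, s_b = gauss_s(mu_a, var_a), gauss_s(mu_b, var_b)

def params_from_eta(eta):
    var = -0.5 / eta[1]
    return eta[0] * var, var

def params_from_s(s):
    return s[0], s[1] - s[0] ** 2

ga_eta = lambda b: (1 - b) * eta_a + b * eta_b            # GA: linear in eta
ga_s = lambda b: gauss_s(*params_from_eta(ga_eta(b)))
ma_s = lambda b: (1 - b) * s_a + b * s_b                  # MA: linear in s
ma_eta = lambda b: gauss_eta(*params_from_s(ma_s(b)))

closed_form = 0.5 * float(np.dot(eta_b - eta_a, s_b - s_a))
print(cost_functional(ga_eta, ga_s), cost_functional(ma_eta, ma_s), closed_form)
# the three numbers agree (50.0 here) up to discretization error
```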
58 This analysis shows that all three paths result in the same expected log weights asymptotically, assuming perfect transitions. [sent-149, score-0.284]
59 For instance, consider annealing between two Gaussians pa = N (µa , σ) and pb = N (µb , σ). [sent-154, score-0.274]
60 The optimal schedule for γGA is a linear schedule with cost F(γGA ) = O(d2 ), where d = |µb − µa |/σ. [sent-155, score-0.337]
61 Using a linear schedule, the moment path also has cost O(d2 ), consistent with Theorem 2. [sent-156, score-0.33]
62 However, most of the cost of the path results from instability near the endpoints, where the variance changes suddenly. [sent-157, score-0.206]
63 Using an optimal schedule, which allocates more distributions near the endpoints, the cost functional falls to O((log d)2 ), which is within a constant factor of the optimal path derived by [12]. [sent-158, score-0.404]
64 In other words, while F(γGA) = F(γMA), they achieve this value for different reasons: γGA follows an optimal schedule along a bad path, while γMA follows a bad schedule along a near-optimal path. [sent-160, score-0.363]
65 Combined with a suitable procedure for choosing a schedule (such as the binned schedule of Theorem 3 below), moment averages may result in large reductions in the cost functional for some models. [sent-162, score-0.305]
66 However, consider the set of binned schedules, where the path is divided into segments, some number Kj of intermediate distributions are allocated to each segment, and the distributions are spaced linearly within each segment. [sent-165, score-0.699]
67 Under the assumption of perfect transitions, there is a simple formula for an asymptotically optimal binned schedule which requires only the parameters obtained through moment averaging: Theorem 3. [sent-166, score-0.511]
68 Let γ be any path for an exponential family model defined by a set of knots βj, each with natural parameters ηj and moments sj, connected by segments of either γGA or γMA paths. [sent-167, score-0.605]
69 Then, under the assumption of perfect transitions, an asymptotically optimal allocation of intermediate distributions to segments is given by: Kj ∝ √((ηj+1 − ηj)ᵀ(sj+1 − sj)). [sent-168, score-0.47]
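A sketch of how Theorem 3 could be used in practice (an illustration, not the authors' code): given natural parameters and moments at the knots, give each segment a number of intermediate distributions proportional to the square root of its (Δη)ᵀ(Δs), then space β linearly within the segment. The inputs are assumed to come from the moment-matching step described above.

```python
import numpy as np

def binned_schedule(knot_betas, etas, moments, K):
    """Allocate roughly K intermediate distributions across segments (Theorem 3 style).

    knot_betas : knot locations, including the endpoints 0 and 1
    etas, moments : natural parameters and moments at each knot
    """
    etas, moments = np.asarray(etas), np.asarray(moments)
    seg_cost = np.array([
        float(np.dot(etas[j + 1] - etas[j], moments[j + 1] - moments[j]))
        for j in range(len(knot_betas) - 1)
    ])
    weights = np.sqrt(np.maximum(seg_cost, 0.0))
    alloc = np.maximum(1, np.round(K * weights / weights.sum()).astype(int))

    betas = [knot_betas[0]]
    for j, k_j in enumerate(alloc):
        # linearly spaced within the segment, excluding its left endpoint
        betas.extend(np.linspace(knot_betas[j], knot_betas[j + 1], k_j + 1)[1:])
    return np.array(betas)
```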
70 Figure 2: Estimates of log Zb for a normalized Gaussian as K, the number of intermediate distributions, is varied. [sent-173, score-0.212]
71 5 Experimental Results: In order to compare our proposed path with geometric averages, we ran AIS using each path to estimate partition functions of several probability distributions. [sent-177, score-0.558]
72 First, we show the estimates of log Z as a function of K, the number of intermediate distributions, in order to visualize the amount of computation necessary to obtain reasonable accuracy. [sent-179, score-0.258]
73 We also compared linear schedules with the optimal binned schedules of Section 4. [sent-192, score-0.335]
74 Observe that with 1,000 intermediate distributions, all paths yielded accurate estimates of log Zb . [sent-195, score-0.388]
75 However, γM A needed fewer intermediate distributions to achieve accurate estimates. [sent-196, score-0.261]
76 However, because the γM A intermediate distributions were broader, enough samples landed in high probability regions to yield reasonable estimates of log Zb . [sent-204, score-0.365]
77 Since it is hard to compute γMA exactly for RBMs, we used the moments spline path γMAS of Section 4 with 9 knot locations. [sent-210, score-0.425]
78 Figure 3: Estimates of log Zb for the tractable PCD(20) RBM as K, the number of intermediate distributions, is varied. [sent-222, score-0.251]
79 ) CD1(20) PCD(20) pa (v) path & schedule log Zb ˆ log Zb ESS log Zb ˆ log Zb ESS uniform uniform uniform uniform GA linear GA optimal binned MAS linear MAS optimal binned 279. [sent-225, score-1.016]
80 (left) base rate RBM, β = 0 (top) geometric path (bottom) MAS path (right) target RBM, β = 1. [sent-241, score-0.572]
81 partition function and moments and to generate exact samples by exhaustively summing over all 2^20 hidden configurations. [sent-242, score-0.329]
82 The moments of the target RBMs were computed exactly, and moment matching was performed with conjugate gradient using the exact gradients. [sent-243, score-0.384]
83 Under perfect transitions, γGA and γM AS were both able to accurately estimate log Zb using as few as 100 intermediate distributions. [sent-245, score-0.307]
84 However, using the Gibbs transition operator, γM AS gave accurate estimates using fewer intermediate distributions and achieved a higher ESS at K = 100,000. [sent-246, score-0.366]
85 To check that the improved performance didn’t rely on accurate moments of pb , we repeated the experiment with highly biased moments. [sent-247, score-0.353]
86 2 The differences ˆ in log Zb and ESS compared to the exact moments condition were not statistically significant. [sent-248, score-0.26]
87 Full-size, Intractable RBMs: For intractable RBMs, moment averaging required approximately solving two intractable problems: moment estimation for the target RBM, and moment matching. [sent-249, score-0.668]
88 We estimated the moments from 1,000 independent Gibbs chains, using 10,000 Gibbs steps with 1,000 steps of burn-in. [sent-250, score-0.201]
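A sketch of how such moment estimates could be produced (illustrative only; the block-Gibbs updates below assume a binary RBM with the usual energy parameterization, and the defaults mirror the chain counts quoted above rather than the authors' code): run persistent Gibbs chains on the target RBM and average the sufficient statistics v, h, and vhᵀ after burn-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_rbm_moments(W, a, b, num_chains=1000, steps=10000, burn_in=1000, rng=None):
    """Estimate E[v], E[h], and E[v h^T] of a binary RBM by block Gibbs sampling.

    W : weights (num_visible x num_hidden); a : visible biases; b : hidden biases
    """
    rng = np.random.default_rng() if rng is None else rng
    nv, nh = W.shape
    v = (rng.random((num_chains, nv)) < 0.5).astype(float)   # arbitrary initialization
    sums = {"v": np.zeros(nv), "h": np.zeros(nh), "vh": np.zeros((nv, nh))}
    kept = 0
    for t in range(steps):
        ph = sigmoid(v @ W + b)                               # P(h = 1 | v)
        if t >= burn_in:
            sums["v"] += v.mean(axis=0)
            sums["h"] += ph.mean(axis=0)                      # Rao-Blackwellized hidden moment
            sums["vh"] += v.T @ ph / num_chains
            kept += 1
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + a)                             # P(v = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
    return {key: s / kept for key, s in sums.items()}
```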
89 ) In addition, we ran a cheap, inaccurate moment matching scheme (denoted MAS cheap) where visible moments were estimated from the empirical MNIST base rate and the hidden moments from the conditional distributions of the hidden units given the MNIST digits. [sent-254, score-0.816]
90 Results of both methods are 2 In particular, we computed the biased moments from the conditional distributions of the hidden units given the MNIST training examples, where each example of digit class i was counted i + 1 times. [sent-256, score-0.355]
91 ) CD1(500) PCD(500) CD25(500) pa (v) path ˆ log Zb ESS ˆ log Zb ESS ˆ log Zb ESS uniform uniform uniform GA linear MAS linear MAS cheap linear 341. [sent-260, score-0.594]
92 As with the tractable RBMs, we found that optimal binned schedules made little difference in performance, so we focus here on linear schedules. [sent-286, score-0.262]
93 The most serious failure was γGA for CD1(500) with uniform initialization, which underestimated our best estimates of the log partition function (and hence overestimated held-out likelihood) by nearly 20 nats. [sent-287, score-0.23]
94 The geometric path from uniform to PCD(500) and the moments path from uniform to CD1(500) also resulted in underestimates, though less drastic. [sent-288, score-0.76]
95 In this case, the hidden activations were closer to uniform conditioned on a blank image than on a digit, so γGA preferred blank images. [sent-295, score-0.21]
96 6 Conclusion We presented a theoretical analysis of the performance of AIS paths and proposed a novel path for exponential families based on averaging moments. [sent-297, score-0.448]
97 We gave a variational interpretation of this path and derived an asymptotically optimal piecewise linear schedule. [sent-298, score-0.342]
98 Moment averages performed well empirically at estimating partition functions of RBMs. [sent-299, score-0.204]
99 We hope moment averaging can also improve other path-based sampling algorithms which typically use geometric averages, such as path sampling [12], parallel tempering [23], and tempered transitions [15]. [sent-300, score-0.758]
100 Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. [sent-350, score-0.306]
wordName wordTfidf (topN-words)
[('zb', 0.407), ('ga', 0.36), ('ais', 0.279), ('ess', 0.266), ('rbms', 0.206), ('moments', 0.201), ('path', 0.184), ('mas', 0.177), ('intermediate', 0.153), ('schedule', 0.145), ('pb', 0.126), ('moment', 0.124), ('averages', 0.121), ('pcd', 0.119), ('transitions', 0.119), ('pk', 0.113), ('schedules', 0.112), ('geometric', 0.107), ('rbm', 0.106), ('paths', 0.104), ('perfect', 0.095), ('annealing', 0.087), ('binned', 0.086), ('intractable', 0.084), ('partition', 0.083), ('distributions', 0.082), ('dkl', 0.078), ('tempered', 0.073), ('vht', 0.073), ('annealed', 0.071), ('kj', 0.071), ('averaging', 0.069), ('bootstrap', 0.064), ('pa', 0.061), ('knots', 0.059), ('target', 0.059), ('log', 0.059), ('spaced', 0.056), ('fb', 0.053), ('visible', 0.053), ('fk', 0.05), ('biases', 0.048), ('za', 0.047), ('families', 0.046), ('estimates', 0.046), ('cheap', 0.046), ('hidden', 0.045), ('exponential', 0.045), ('interpolate', 0.043), ('blank', 0.042), ('uniform', 0.042), ('segments', 0.041), ('sampling', 0.041), ('fj', 0.04), ('spline', 0.04), ('deep', 0.04), ('importance', 0.04), ('activations', 0.039), ('tractable', 0.039), ('gibbs', 0.038), ('base', 0.038), ('mnist', 0.038), ('sj', 0.038), ('functional', 0.038), ('initial', 0.037), ('boltzmann', 0.037), ('family', 0.037), ('thermodynamic', 0.037), ('asymptotically', 0.036), ('piecewise', 0.034), ('toronto', 0.034), ('linearly', 0.034), ('gave', 0.033), ('bolded', 0.032), ('endpoints', 0.032), ('contrastive', 0.032), ('variational', 0.03), ('chains', 0.03), ('tk', 0.029), ('geoffrey', 0.029), ('divergences', 0.029), ('allocates', 0.028), ('interpolated', 0.028), ('xk', 0.028), ('segment', 0.027), ('gaussians', 0.027), ('units', 0.027), ('accurate', 0.026), ('transition', 0.026), ('weights', 0.026), ('radford', 0.025), ('regions', 0.025), ('optimal', 0.025), ('along', 0.024), ('fa', 0.024), ('tommi', 0.023), ('monte', 0.023), ('machines', 0.023), ('allocated', 0.022), ('cost', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
2 0.13853455 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
3 0.13271904 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
4 0.10691851 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
5 0.089913003 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
6 0.08024279 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
7 0.076325588 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
8 0.075591981 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
9 0.07458546 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
10 0.072202422 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
11 0.071791664 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
12 0.066014774 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
13 0.061672557 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
14 0.061306767 75 nips-2013-Convex Two-Layer Modeling
15 0.061291352 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
16 0.060414229 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
17 0.058914669 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
18 0.058429606 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
19 0.057907995 160 nips-2013-Learning Stochastic Feedforward Neural Networks
20 0.057491653 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
topicId topicWeight
[(0, 0.154), (1, 0.039), (2, -0.034), (3, -0.008), (4, 0.036), (5, 0.033), (6, 0.06), (7, 0.004), (8, 0.043), (9, -0.101), (10, 0.129), (11, -0.028), (12, -0.041), (13, 0.003), (14, 0.061), (15, 0.051), (16, 0.016), (17, -0.064), (18, -0.059), (19, 0.026), (20, -0.036), (21, -0.011), (22, 0.053), (23, -0.026), (24, -0.017), (25, -0.026), (26, -0.13), (27, 0.006), (28, -0.11), (29, 0.076), (30, 0.054), (31, 0.07), (32, -0.058), (33, -0.057), (34, 0.061), (35, -0.107), (36, -0.003), (37, 0.067), (38, -0.045), (39, 0.006), (40, -0.048), (41, 0.096), (42, -0.021), (43, 0.045), (44, 0.052), (45, -0.034), (46, -0.031), (47, 0.05), (48, 0.086), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.92925018 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
2 0.77927989 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
3 0.68200952 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
4 0.66087264 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent
Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying datagenerating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty). 1
5 0.61141336 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
Author: Benigno Uria, Iain Murray, Hugo Larochelle
Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of onedimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradientbased optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case. 1
6 0.57799935 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
7 0.54432964 331 nips-2013-Top-Down Regularization of Deep Belief Networks
8 0.52942055 160 nips-2013-Learning Stochastic Feedforward Neural Networks
9 0.47282529 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
10 0.46986967 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
11 0.45628348 161 nips-2013-Learning Stochastic Inverses
12 0.44882143 299 nips-2013-Solving inverse problem of Markov chain with partial observations
13 0.44573316 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
14 0.44532874 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
15 0.44427735 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
16 0.42637226 134 nips-2013-Graphical Models for Inference with Missing Data
17 0.41643098 5 nips-2013-A Deep Architecture for Matching Short Texts
18 0.40953597 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
19 0.40465215 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
20 0.39625886 70 nips-2013-Contrastive Learning Using Spectral Methods
topicId topicWeight
[(11, 0.191), (16, 0.042), (33, 0.176), (34, 0.094), (41, 0.051), (49, 0.08), (56, 0.102), (70, 0.059), (85, 0.014), (89, 0.03), (93, 0.053), (95, 0.019)]
simIndex simValue paperId paperTitle
1 0.95598376 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
Author: Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, Eric Xing
Abstract: We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes. 1
2 0.9322933 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
Author: Nathaniel J. Smith, Noah Goodman, Michael Frank
Abstract: Language users are remarkably good at making inferences about speakers’ intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings. 1
same-paper 3 0.86221933 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
4 0.78320938 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
5 0.7808755 121 nips-2013-Firing rate predictions in optimal balanced networks
Author: David G. Barrett, Sophie Denève, Christian K. Machens
Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1
6 0.78008109 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
7 0.77641845 331 nips-2013-Top-Down Regularization of Deep Belief Networks
8 0.77630353 64 nips-2013-Compete to Compute
9 0.77619487 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
10 0.77512378 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
11 0.7744863 301 nips-2013-Sparse Additive Text Models with Low Rank Background
12 0.77438408 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
13 0.7724359 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
14 0.7697655 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.76815122 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
16 0.76638561 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
17 0.76589847 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
18 0.76520151 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
19 0.76225442 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
20 0.76212829 30 nips-2013-Adaptive dropout for training deep neural networks