nips nips2013 nips2013-315 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. [sent-6, score-0.054]
2 Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. [sent-7, score-0.204]
3 An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. [sent-8, score-0.325]
4 To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. [sent-9, score-0.18]
5 We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. [sent-10, score-0.356]
6 In particular, unsupervised feature learning is an important component of many Deep Learning algorithms (Bengio, 2009), such as those based on auto-encoders (Bengio et al. [sent-12, score-0.104]
7 Unfortunately, auto-encoders and RBMs are computationally inconvenient when it comes to handling such high-dimensional sparse input vectors, because they require a form of reconstruction of the input vector, for all the elements of the input vector, even the ones that were zero. [sent-18, score-0.243]
8 In Section 2, we recapitulate the Reconstruction Sampling algorithm (Dauphin et al. [sent-19, score-0.066]
9 The basic idea is to use an importance sampling scheme to stochastically select a subset of the input elements to reconstruct, and importance weights to obtain an unbiased estimator of the reconstruction error gradient. [sent-21, score-0.518]
10 In Section 3 we briefly review the basics of RBMs and the Gibbs chain involved in training them. [sent-23, score-0.06]
11 Ratio matching (Hyvärinen, 2007) is an inductive principle and training criterion that can be applied to train RBMs but does not require a Gibbs chain. [sent-24, score-0.255]
12 In Section 4, we present and justify a novel algorithm based on ratio matching in order to achieve our objective of taking advantage of highly sparse inputs. [sent-25, score-0.329]
13 An interesting and unexpected result is that we find the biased version of the algorithm (without reweighting) to yield more discriminant features. [sent-28, score-0.065]
14 2 Reconstruction Sampling An auto-encoder learns an encoder function f mapping inputs x to features h = f (x), and a decoding or reconstruction function g such that g(f (x)) ≈ x for training examples x. [sent-29, score-0.194]
15 by flipping some bits) and trained to make g(f(x̃)) ≈ x. [sent-34, score-0.076]
16 To avoid the expensive reconstruction g(h) when the input is very high-dimensional, Dauphin et al. [sent-35, score-0.083]
17 (2011) propose that for each example, a small random subset of the input elements be selected for which gi (h) and the associated reconstruction error is computed. [sent-36, score-0.124]
18 To make the corresponding estimator of reconstruction error (and its gradient) unbiased, they propose to use an importance weighting scheme whereby the loss on the i-th input is weighted by the inverse of the probability that it be selected. [sent-37, score-0.302]
19 To reduce the variance of the estimator, they propose to always reconstruct the i-th input if it was one of the non-zeros in x or in x̃, and to choose uniformly at random an equal number of zero elements. [sent-38, score-0.065]
20 They show that the unbiased estimator yields the expected linear speedup in training time compared to the deterministic gradient computation, while maintaining good performance for unsupervised feature learning. [sent-39, score-0.295]
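To make the scheme concrete, here is a minimal NumPy sketch of reconstruction sampling for a denoising auto-encoder; the shapes, the corruption step, and the cross-entropy loss are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_sampling_loss(x, x_tilde, h, W_dec, b_dec, k):
    """Importance-sampled reconstruction loss for one sparse example.

    Reconstructs only the units that are non-zero in x or x_tilde, plus k
    uniformly chosen zero units, and reweights the sampled zeros by the
    inverse of their selection probability so the estimate of the full
    cross-entropy reconstruction loss stays unbiased.
    """
    d = x.shape[0]
    nonzero = np.flatnonzero((x > 0) | (x_tilde > 0))
    zero = np.setdiff1d(np.arange(d), nonzero)
    sampled_zero = rng.choice(zero, size=min(k, zero.size), replace=False)
    idx = np.concatenate([nonzero, sampled_zero])

    # Importance weights: always-selected units get weight 1,
    # sampled zero units get weight |zero| / (number actually sampled).
    w = np.ones(idx.size)
    w[nonzero.size:] = zero.size / max(sampled_zero.size, 1)

    # Decode only the selected output units: O(|idx| * n) instead of O(d * n).
    logits = W_dec[idx] @ h + b_dec[idx]
    recon = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    ce = -(x[idx] * np.log(recon + eps) + (1 - x[idx]) * np.log(1 - recon + eps))
    return np.sum(w * ce)

# Toy usage with hypothetical shapes.
d, n, k = 1000, 50, 20
x = (rng.random(d) < 0.01).astype(float)        # sparse binary input
x_tilde = x * (rng.random(d) > 0.1)             # corrupted copy (some bits dropped)
W_enc = rng.normal(0, 0.01, (n, d))
W_dec = rng.normal(0, 0.01, (d, n))
b_enc, b_dec = np.zeros(n), np.zeros(d)
h = 1.0 / (1.0 + np.exp(-(W_enc @ x_tilde + b_enc)))
print(reconstruction_sampling_loss(x, x_tilde, h, W_dec, b_dec, k))
```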
21 3 Restricted Boltzmann Machines A restricted Boltzmann machine (RBM) is an undirected graphical model with binary variables (Hinton et al. [sent-41, score-0.076]
22 The RBM can be trained by following the gradient of the negative log-likelihood, $-\frac{\partial \log P(x)}{\partial \theta} = E_{\mathrm{data}}\left[\frac{\partial F(x)}{\partial \theta}\right] - E_{\mathrm{model}}\left[\frac{\partial F(x)}{\partial \theta}\right]$, where $F(x)$ is the free energy (the negative unnormalized log-probability associated with $P(x)$, i.e. $P(x) \propto e^{-F(x)}$). [sent-45, score-0.118]
23 Starting from $x_i$, a step in this chain is taken by sampling $h_i \sim P(h|x_i)$, then $x_{i+1} \sim P(x|h_i)$. [sent-48, score-0.148]
24 Training the RBM using SML-1 is on the order of O(dn) where d is the dimension of the input variables and n is the number of hidden variables. [sent-50, score-0.066]
25 More precisely, sampling $P(h|x)$ (inference) can take advantage of sparsity and costs O(pn) computations while “reconstruction”, i. [sent-52, score-0.085]
26 Thus scaling to larger input sizes d yields a linear increase in training time even if the number of non-zeros p in the input remains constant. [sent-55, score-0.132]
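The cost asymmetry described above can be illustrated with a small NumPy sketch of a binary RBM (an illustrative toy under assumed notation, not the paper's code): inference touches only the p non-zero columns of W, while a full reconstruction touches all d of them.

```python
import numpy as np

def free_energy(x, W, b, c):
    """Binary RBM free energy: F(x) = -c^T x - sum_j softplus(b_j + W_j . x)."""
    pre = b + W @ x                        # (n,) hidden pre-activations
    return -c @ x - np.sum(np.logaddexp(0.0, pre))

def hidden_probs_sparse(x_idx, x_val, W, b):
    """P(h = 1 | x) using only the p non-zero entries of x: O(p * n) work."""
    pre = b + W[:, x_idx] @ x_val          # touches only p columns of W
    return 1.0 / (1.0 + np.exp(-pre))

def reconstruct_all(h, W, c):
    """P(x = 1 | h) for every visible unit: O(d * n) work, the bottleneck for large d."""
    return 1.0 / (1.0 + np.exp(-(c + W.T @ h)))

rng = np.random.default_rng(0)
d, n = 10000, 100
W = rng.normal(0, 0.01, (n, d))
b, c = np.zeros(n), np.zeros(d)
x = np.zeros(d)
nz = rng.choice(d, 30, replace=False)
x[nz] = 1.0

p_h = hidden_probs_sparse(nz, x[nz], W, b)   # cheap: 30 * 100 multiplies
p_x = reconstruct_all(p_h, W, c)             # expensive: 10000 * 100 multiplies
print(free_energy(x, W, b, c), p_x.shape)
```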
27 4 Ratio Matching Ratio matching (Hyvärinen, 2007) is an estimation method for statistical models where the normalization constant is not known. [sent-56, score-0.143]
28 It is similar to score matching (Hyvärinen, 2005) but applied to discrete data whereas score matching is limited to continuous inputs, and both are computationally simple and yield consistent estimators. [sent-57, score-0.352]
29 The core idea of ratio matching is to match ratios of probabilities between the data and the model. [sent-59, score-0.262]
30 In this form, we can see the similarity between score matching and ratio matching. [sent-67, score-0.274]
31 The normalization constant is canceled because $\frac{P(x)}{P(\bar{x}_i)} = \frac{e^{-F(x)}}{e^{-F(\bar{x}_i)}}$; however, this objective requires access to the true distribution $P_x$, which is rarely available. [sent-68, score-0.072]
32 Hyvärinen (2007) shows that the Ratio Matching (RM) objective can be simplified to $J_{RM}(x) = \sum_{i=1}^{d} g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right)$ (2), which does not require knowledge of the true distribution $P_x$. [sent-69, score-0.138]
33 This objective can be described as ensuring that the training example x has the highest probability in the neighborhood of points at Hamming distance 1. [sent-70, score-0.098]
34 The gradients of this objective are $\frac{\partial J_{RM}(x)}{\partial \theta} = \sum_{i=1}^{d} 2\,\sigma^2(F(x) - F(\bar{x}_i))\left(1 - \sigma(F(x) - F(\bar{x}_i))\right)\left(\frac{\partial F(x)}{\partial \theta} - \frac{\partial F(\bar{x}_i)}{\partial \theta}\right)$ (4), where $\sigma$ is the logistic sigmoid and $g(P(x)/P(\bar{x}_i)) = \sigma(F(x) - F(\bar{x}_i))$. [sent-76, score-0.055]
35 x A naive implementation of this objective is O(d2 n) because it requires d computations of the free energy per example. [sent-77, score-0.059]
36 This can be done by saving the results of the computations $\alpha = c^T x$ and $\beta_j = \sum_i W_{ji} x_i + b_j$ when computing $F(x)$. [sent-81, score-0.085]
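A possible implementation of this caching trick is sketched below in NumPy, assuming Hyvärinen's choice g(u) = 1/(1+u), which gives g(P(x)/P(x̄_i)) = σ(F(x) − F(x̄_i)); the code is illustrative, not the authors' implementation. Flipping bit i only shifts the cached α and β terms, so all d neighbour free energies cost O(dn) overall instead of O(d²n).

```python
import numpy as np

def rm_objective(x, W, b, c):
    """Ratio matching objective J_RM(x) = sum_i sigmoid(F(x) - F(x_bar_i))^2.

    Caches alpha = c^T x and beta_j = b_j + sum_i W_ji x_i, so each of the d
    neighbours x_bar_i (x with bit i flipped) costs O(n) rather than O(dn).
    """
    sign = 1.0 - 2.0 * x                  # +1 where x_i = 0, -1 where x_i = 1
    alpha = c @ x                         # scalar visible-bias term of F(x)
    beta = b + W @ x                      # (n,) hidden pre-activations for x
    F_x = -alpha - np.sum(np.logaddexp(0.0, beta))

    # Flipping bit i changes alpha by sign_i * c_i and beta by sign_i * W[:, i].
    alpha_bar = alpha + sign * c                      # (d,)
    beta_bar = beta[:, None] + W * sign[None, :]      # (n, d)
    F_bar = -alpha_bar - np.sum(np.logaddexp(0.0, beta_bar), axis=0)

    g = 1.0 / (1.0 + np.exp(-(F_x - F_bar)))          # sigmoid(F(x) - F(x_bar_i))
    return np.sum(g ** 2)                              # overall O(d * n) per example

rng = np.random.default_rng(0)
d, n = 200, 50
W = rng.normal(0, 0.1, (n, d))
b, c = np.zeros(n), np.zeros(d)
x = (rng.random(d) < 0.05).astype(float)
print(rm_objective(x, W, b, c))
```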
37 5 Stochastic Ratio Matching We propose Stochastic Ratio Matching (SRM) as a more efficient form of ratio matching for high-dimensional sparse distributions. [sent-85, score-0.269]
38 The ratio matching objective requires the summation of d terms, each computed in O(n). [sent-86, score-0.279]
39 If we rewrite the ratio matching objective as an expectation over a discrete distribution, $J_{RM}(x) = d \sum_{i=1}^{d} \frac{1}{d}\, g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right) = d\, E\left[g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right)\right]$ (5), we can use Monte Carlo methods to estimate $J_{RM}$ without computing all the terms in Equation 2. [sent-88, score-0.279]
40 However, in practice this estimator has a high variance. [sent-89, score-0.075]
41 The solution proposed for SRM is to use an Importance Sampling scheme to obtain a lower variance estimator of JRM . [sent-91, score-0.1]
42 Combining Monte Carlo with importance sampling, we obtain the SRM objective $J_{SRM}(x) = \sum_{i=1}^{d} \frac{\gamma_i}{E[\gamma_i]}\, g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right)$ (6), where $\gamma_i \sim P(\gamma_i = 1 | x)$ is the so-called proposal distribution of our importance sampling scheme. [sent-92, score-0.314]
43 The proposal distribution determines which terms will be used to estimate the objective since only the terms where γi = 1 are non-zero. [sent-93, score-0.095]
44 The estimator is unbiased, since $E[J_{SRM}(x)] = \sum_{i=1}^{d} \frac{E[\gamma_i]}{E[\gamma_i]}\, g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right) = J_{RM}(x)$. The intuition behind importance sampling is that the variance of the estimator can be reduced by focusing sampling on the largest terms of the expectation. [sent-96, score-0.279]
45 More precisely, it is possible to show that the variance of the estimator is minimized when $P(\gamma_i = 1|x) \propto g^2(P(x)/P(\bar{x}_i))$. [sent-97, score-0.075]
46 The challenge is finding a good approximation for $(x_i - P(x_i = 1|x_{-i}))^2$ and defining a proposal distribution that is efficient to sample from. [sent-99, score-0.057]
47 If $x_i = 0$ then the error $(0 - P(x_i = 1|x_{-i}))^2$ is likely small because the model has a high bias towards $P(x_i = 0)$. [sent-105, score-0.085]
48 In other words, the model will mostly make errors for terms where xi = 1 and a small number of dimensions where xi = 0. [sent-107, score-0.17]
49 We can use this to define the heuristic proposal distribution $P(\gamma_i = 1|x) = 1$ if $x_i = 1$, and $P(\gamma_i = 1|x) = p / (d - \sum_j 1_{x_j > 0})$ otherwise (7), where p is the average number of non-zeros in the data. [sent-108, score-0.142]
50 The idea is to always sample the terms where $x_i = 1$ and a subset of k of the $(d - \sum_j 1_{x_j > 0})$ remaining terms where $x_i = 0$. [sent-109, score-0.17]
51 However, instead of sampling those γi bits independently, we find that much smaller variance is obtained by sampling a number of zeros k that is constant for all examples, i. [sent-111, score-0.149]
52 A random k can cause very significant variance in the gradients and this makes stochastic gradient descent more difficult. [sent-114, score-0.115]
53 The computational cost of SRM per training example is O(pn), as opposed to O(dn) for RM. [sent-116, score-0.06]
54 While simple, we find that this heuristic proposal distribution works well in practice, as shown below. [sent-117, score-0.057]
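Putting Equations 6 and 7 together, a minimal NumPy sketch of the SRM estimator with a fixed number k of sampled zero dimensions might look as follows; the variable names and the handling of edge cases are assumptions, and k plays the role of the average number of non-zeros p in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def srm_objective(x, W, b, c, k):
    """Stochastic ratio matching: estimate J_RM using only the non-zero
    dimensions of x plus k randomly sampled zero dimensions, reweighting the
    sampled zeros by 1 / E[gamma_i] to keep the estimator unbiased (Eqs. 6-7)."""
    d = x.shape[0]
    nonzero = np.flatnonzero(x)
    zero = np.setdiff1d(np.arange(d), nonzero)
    sampled_zero = rng.choice(zero, size=min(k, zero.size), replace=False)
    idx = np.concatenate([nonzero, sampled_zero])

    # Importance weights: gamma_i = 1 with probability 1 for non-zeros,
    # with probability k / |zero| for zeros (heuristic proposal, Equation 7).
    w = np.ones(idx.size)
    w[nonzero.size:] = zero.size / max(sampled_zero.size, 1)

    # Free energy of x from the non-zeros only: O(p * n).
    alpha = c[nonzero] @ x[nonzero]
    beta = b + W[:, nonzero] @ x[nonzero]
    F_x = -alpha - np.sum(np.logaddexp(0.0, beta))

    # Free energies of the selected neighbours x_bar_i only: O((p + k) * n).
    sign = 1.0 - 2.0 * x[idx]
    alpha_bar = alpha + sign * c[idx]
    beta_bar = beta[:, None] + W[:, idx] * sign[None, :]
    F_bar = -alpha_bar - np.sum(np.logaddexp(0.0, beta_bar), axis=0)

    g = 1.0 / (1.0 + np.exp(-(F_x - F_bar)))
    return np.sum(w * g ** 2)

d, n, k = 10000, 100, 50
W = rng.normal(0, 0.1, (n, d))
b, c = np.zeros(n), np.zeros(d)
x = np.zeros(d)
x[rng.choice(d, 25, replace=False)] = 1.0
print(srm_objective(x, W, b, c, k))   # O(pn) work instead of O(dn)
```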
55 For comparison, we also perform experiments with a biased version of Equation 6, $J_{BiasedSRM}(x) = \sum_{i=1}^{d} \gamma_i\, g^2\left(\frac{P(x)}{P(\bar{x}_i)}\right)$ (8). [sent-118, score-0.065]
56 This will allow us to gauge the effectiveness of our importance weights for unbiasing the objective. [sent-119, score-0.099]
57 The biased objective can be thought of as down-weighting the ratios where $x_i = 0$ by a factor of $E[\gamma_i]$. [sent-120, score-0.209]
58 6 Experimental Results In this section, we demonstrate the effectiveness of SRM for training RBMs. [sent-127, score-0.081]
59 Additionally, we show that RBMs are useful feature extractors for topic classification. [sent-128, score-0.086]
60 The training set contains 23,149 documents and the test set has 781,265. [sent-133, score-0.086]
61 20 Newsgroups is a collection of Usenet posts comprising a training set of 11,269 examples and 7,505 test examples. [sent-138, score-0.06]
62 We use the by-date train/test split which ensures that the training set contains postings preceding the test examples in time. [sent-141, score-0.096]
63 We compare RBMs trained with ratio matching, stochastic ratio matching and biased stochastic ratio matching. [sent-147, score-0.656]
64 We include experiments with RBMs trained with SML-1 for comparison. [sent-148, score-0.076]
65 We use the RBM to pretrain the hidden layers of a feed-forward neural network (Hinton et al. [sent-150, score-0.077]
66 The hyper-parameters are cross-validated on a validation set consisting of 5% of the training set. [sent-154, score-0.093]
67 The learning rate for the RBMs is sampled from $10^{-[0,3]}$, the number of hidden units from [500, 2000], and the number of training epochs from [5, 20]. [sent-158, score-0.134]
68 It is trained for 32 epochs using early-stopping based on the validation set. [sent-160, score-0.153]
69 We regularize the MLP by dropping out 50% of the hidden units during training (Hinton et al. [sent-161, score-0.137]
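For illustration, here is a small sketch of a random hyper-parameter search over the ranges quoted above; the uniform and log-uniform sampling choices are assumptions, not necessarily the authors' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """Draw one random configuration from the ranges quoted in the text."""
    return {
        "rbm_learning_rate": 10.0 ** -rng.uniform(0.0, 3.0),  # 10^-[0,3]
        "n_hidden": int(rng.integers(500, 2001)),              # [500, 2000]
        "rbm_epochs": int(rng.integers(5, 21)),                 # [5, 20]
    }

print([sample_hyperparameters() for _ in range(3)])
```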
70 Table 1: Log-probabilities estimated by AIS for the RBMs trained with the different estimation methods. [sent-167, score-0.076]
71 As seen in Table 1, SRM is a good estimator for training RBMs and is a good approximation of RM. [sent-218, score-0.135]
72 We see that with the same budget of epochs SRM achieves log-likelihoods comparable with RM on both datasets. [sent-219, score-0.068]
73 The striking difference of more than 500 nats with Biased SRM shows that the importance weights successfully unbias the estimator. [sent-220, score-0.078]
74 We observe this is an optimization problem since the training log-likelihood is also higher than RM. [sent-224, score-0.06]
75 Figure 1: Average speedup in the calculation of gradients by using the SRM objective compared to RM. [sent-227, score-0.134]
76 Figure 1 shows that, as expected, SRM achieves a linear speedup compared to RM, reaching speedups of 2 orders of magnitude. [sent-229, score-0.057]
77 In fact, we observed that the computation time of the gradients for RM scales linearly with the size of the input while the computation time of SRM remains fairly constant because the number of non-zeros varies little. [sent-230, score-0.091]
78 Figure 2: Average norm of the gradients for the terms in Equation 2 where $x_i = 1$ and $x_i = 0$. [sent-234, score-0.225]
79 Confirming the hypothesis behind the proposal distribution, the terms where $x_i = 1$ are 2 orders of magnitude larger. [sent-235, score-0.195]
80 The importance sampling scheme of SRM (Equation 7) relies on the hypothesis that terms where xi = 1 produce a larger gradient than terms where xi = 0. [sent-236, score-0.377]
81 We can verify this by monitoring the average gradients during learning on RCV1. [sent-237, score-0.055]
82 Figure 2 demonstrates that the average gradients for the terms where $x_i = 1$ are 2 orders of magnitude larger than those where $x_i = 0$. [sent-238, score-0.258]
83 This confirms the hypothesis underlying the sampling scheme of SRM. [sent-239, score-0.108]
84 2 Using RBMs as feature extractors for NLP Having established that SRM is an efficient unbiased estimator of RM, we turn to the task of using RBMs not as generative models but as feature extractors. [sent-241, score-0.228]
85 This is similar to the known result that contrastive divergence, which is biased, yields better classification results than persistent contrastive divergence, which is unbiased. [sent-243, score-0.077]
86 The superior performance of the biased objective suggests that the non-zero features contain more information about the classification task. [sent-245, score-0.123]
87 The RBM trained with SRM improves on the state-of-the-art (Lewis et al. [sent-258, score-0.123]
88 The total training time for this RBM using SRM is 57 minutes. [sent-260, score-0.06]
89 We also train a Deep Belief Net (DBN) by stacking an RBM trained with SML on top of the RBMs learned with SRM. [sent-261, score-0.103]
90 This type of 2-layer deep architecture is able to significantly improve the performance on that task (Table 2). [sent-262, score-0.075]
91 In particular the DBN does significantly better than a stack of denoising auto-encoders we trained using biased reconstruction sampling (Dauphin et al. [sent-263, score-0.339]
92 We apply RBMs trained with SRM on 20 Newsgroups with all 61,188 dimensions. [sent-266, score-0.153]
93 This result is closely followed by the DAE trained with reconstruction sampling, which in our experiments reaches 20.5%. [sent-269, score-0.227]
94 The simpler RBM trained by SRM is able to beat the more powerful HD-RBM model because it uses all the 61,188 dimensions. [sent-282, score-0.101]
95 7 Conclusion We have proposed a very simple algorithm called Stochastic Ratio Matching (SRM) to take advantage of sparsity in high-dimensional data when training discrete RBMs. [sent-283, score-0.082]
96 It can be used to estimate gradients in O(np) computation where p is the number of non-zeros, yielding linear speedup against the O(nd) of Ratio Matching (RM) where d is the input size. [sent-284, score-0.132]
97 It does so while providing an unbiased estimator of the ratio matching gradient. [sent-285, score-0.357]
98 Using this efficient estimator, we train RBMs as feature extractors and achieve state-of-the-art results on 2 text classification benchmarks. [sent-286, score-0.217]
99 Rcv1: A new benchmark collection for text categorization research. [sent-385, score-0.053]
100 Training restricted Boltzmann machines using approximations to the likelihood gradient. [sent-401, score-0.056]
wordName wordTfidf (topN-words)
[('srm', 0.701), ('rbms', 0.304), ('jrm', 0.199), ('rbm', 0.166), ('sml', 0.154), ('dauphin', 0.144), ('matching', 0.143), ('px', 0.126), ('bengio', 0.125), ('iased', 0.11), ('rinen', 0.1), ('ratio', 0.098), ('hyv', 0.089), ('reconstruction', 0.088), ('xi', 0.085), ('importance', 0.078), ('newsgroups', 0.077), ('trained', 0.076), ('estimator', 0.075), ('deep', 0.075), ('larochelle', 0.071), ('ais', 0.067), ('extractors', 0.066), ('jsrm', 0.066), ('biased', 0.065), ('sampling', 0.063), ('boltzmann', 0.063), ('rm', 0.062), ('training', 0.06), ('proposal', 0.057), ('gradients', 0.055), ('tieleman', 0.054), ('hinton', 0.053), ('dbn', 0.048), ('et', 0.047), ('est', 0.046), ('marlin', 0.046), ('loglikelihoods', 0.044), ('epochs', 0.044), ('classi', 0.043), ('mlp', 0.041), ('lewis', 0.041), ('unbiased', 0.041), ('speedup', 0.041), ('dn', 0.04), ('erhan', 0.039), ('odel', 0.039), ('younes', 0.039), ('stochastic', 0.039), ('objective', 0.038), ('postings', 0.036), ('input', 0.036), ('stochastically', 0.034), ('unsupervised', 0.034), ('wji', 0.034), ('montr', 0.034), ('rarely', 0.034), ('orders', 0.033), ('score', 0.033), ('validation', 0.033), ('ec', 0.032), ('hidden', 0.03), ('text', 0.029), ('reconstruct', 0.029), ('restricted', 0.029), ('sparse', 0.028), ('machines', 0.027), ('train', 0.027), ('dahl', 0.027), ('courville', 0.027), ('persistent', 0.027), ('bergstra', 0.027), ('documents', 0.026), ('inputs', 0.026), ('beat', 0.025), ('contrastive', 0.025), ('scheme', 0.025), ('murray', 0.025), ('inductive', 0.025), ('categorization', 0.024), ('achieves', 0.024), ('bits', 0.023), ('feature', 0.023), ('benchmarks', 0.022), ('advantage', 0.022), ('gradient', 0.021), ('table', 0.021), ('energy', 0.021), ('ratios', 0.021), ('salakhutdinov', 0.021), ('effectiveness', 0.021), ('carlo', 0.021), ('vincent', 0.021), ('monte', 0.021), ('features', 0.02), ('gibbs', 0.02), ('hypothesis', 0.02), ('inconvenient', 0.019), ('rocchio', 0.019), ('recapitulate', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
2 0.21237992 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.15007827 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
4 0.14737378 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent
Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying datagenerating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty). 1
5 0.13853455 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
6 0.1138764 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
7 0.10319473 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
8 0.09937606 5 nips-2013-A Deep Architecture for Matching Short Texts
9 0.090191774 75 nips-2013-Convex Two-Layer Modeling
10 0.08669804 251 nips-2013-Predicting Parameters in Deep Learning
11 0.078918457 30 nips-2013-Adaptive dropout for training deep neural networks
12 0.068155065 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
13 0.067261763 65 nips-2013-Compressive Feature Learning
14 0.066489607 139 nips-2013-How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal
15 0.06625919 160 nips-2013-Learning Stochastic Feedforward Neural Networks
16 0.054851383 161 nips-2013-Learning Stochastic Inverses
17 0.052460842 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
18 0.049956679 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
19 0.049924873 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
20 0.049878772 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
topicId topicWeight
[(0, 0.151), (1, 0.071), (2, -0.078), (3, -0.05), (4, 0.091), (5, -0.028), (6, -0.022), (7, 0.067), (8, 0.043), (9, -0.112), (10, 0.179), (11, 0.002), (12, -0.056), (13, -0.009), (14, 0.067), (15, 0.073), (16, 0.04), (17, -0.009), (18, -0.042), (19, 0.044), (20, -0.026), (21, -0.035), (22, 0.075), (23, -0.005), (24, 0.044), (25, 0.006), (26, -0.137), (27, -0.01), (28, -0.043), (29, 0.041), (30, 0.058), (31, 0.058), (32, -0.018), (33, -0.059), (34, 0.111), (35, -0.089), (36, -0.0), (37, 0.036), (38, -0.088), (39, 0.034), (40, -0.053), (41, 0.11), (42, 0.014), (43, 0.069), (44, -0.0), (45, -0.048), (46, 0.017), (47, 0.064), (48, 0.121), (49, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.90741712 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
2 0.77149171 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
3 0.76536876 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
4 0.74275661 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
5 0.70949012 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1
6 0.61890185 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
7 0.56219244 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
8 0.52994162 160 nips-2013-Learning Stochastic Feedforward Neural Networks
9 0.49709401 5 nips-2013-A Deep Architecture for Matching Short Texts
10 0.49620828 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
11 0.46453717 75 nips-2013-Convex Two-Layer Modeling
12 0.44931141 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
13 0.44057932 251 nips-2013-Predicting Parameters in Deep Learning
14 0.43897432 30 nips-2013-Adaptive dropout for training deep neural networks
15 0.40733802 64 nips-2013-Compete to Compute
16 0.40666226 161 nips-2013-Learning Stochastic Inverses
17 0.36130211 244 nips-2013-Parametric Task Learning
18 0.36066815 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
19 0.35919806 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
20 0.35453585 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
topicId topicWeight
[(2, 0.027), (16, 0.028), (33, 0.191), (34, 0.084), (38, 0.224), (41, 0.034), (49, 0.056), (56, 0.096), (70, 0.025), (85, 0.026), (89, 0.032), (93, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.8144393 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
2 0.79945046 299 nips-2013-Solving inverse problem of Markov chain with partial observations
Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide
Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.
3 0.78193253 247 nips-2013-Phase Retrieval using Alternating Minimization
Author: Praneeth Netrapalli, Prateek Jain, Sujay Sanghavi
Abstract: Phase retrieval problems involve solving linear equations, but with missing sign (or phase, for complex numbers). Over the last two decades, a popular generic empirical approach to the many variants of this problem has been one of alternating minimization; i.e. alternating between estimating the missing phase information, and the candidate solution. In this paper, we show that a simple alternating minimization algorithm geometrically converges to the solution of one such problem – finding a vector x from y, A, where y = |AT x| and |z| denotes a vector of element-wise magnitudes of z – under the assumption that A is Gaussian. Empirically, our algorithm performs similar to recently proposed convex techniques for this variant (which are based on “lifting” to a convex matrix problem) in sample complexity and robustness to noise. However, our algorithm is much more efficient and can scale to large problems. Analytically, we show geometric convergence to the solution, and sample complexity that is off by log factors from obvious lower bounds. We also establish close to optimal scaling for the case when the unknown vector is sparse. Our work represents the only known theoretical guarantee for alternating minimization for any variant of phase retrieval problems in the non-convex setting. 1
4 0.74259156 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
5 0.74171722 30 nips-2013-Adaptive dropout for training deep neural networks
Author: Jimmy Ba, Brendan Frey
Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1
6 0.73541564 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
7 0.73471916 251 nips-2013-Predicting Parameters in Deep Learning
8 0.73356235 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
9 0.73348826 301 nips-2013-Sparse Additive Text Models with Low Rank Background
10 0.73114204 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
11 0.72891676 99 nips-2013-Dropout Training as Adaptive Regularization
12 0.72888839 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
13 0.72841954 335 nips-2013-Transfer Learning in a Transductive Setting
14 0.72838879 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
15 0.7281031 64 nips-2013-Compete to Compute
16 0.72693825 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
17 0.72588295 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
18 0.72515142 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
19 0.72506988 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.7235269 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning