nips nips2013 nips2013-221 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we first develop some tools and simulation results which relate RBMs to certain easier-to-analyze approximations, and to neural networks with 1 hidden layer of threshold units, for which many results about representational efficiency are already known (Maass, 1992; Maass et al. [sent-40, score-0.41]
2 Section 2 characterizes the unnormalized log-likelihood as a type of neural network (called an “RBM network”) and shows how this type is related to single hidden layer neural networks of threshold neurons, and to an easier-to-analyze approximation (which we call a “hardplus RBM network”). [sent-54, score-0.505]
3 [Figure 1, plot panel: the softplus and hardplus functions, function value versus y.] Figure 1: Left: An illustration of a basic RBM network with n = 3 and m = 5. [sent-57, score-0.793]
4 Definition 1 An RBM network with parameters W, b is defined as a neural network with one hidden layer containing m softplus neurons and weights and biases given by W and b, so that each neuron j’s output is soft([Wx]_j + b_j). [sent-75, score-1.03]
5 The output layer contains one neuron whose weights and bias are given by 1 ≡ [1 1 · · · 1]⊤ (the all-ones vector) and B, respectively. [sent-76, score-0.375]
6 For convenience, we include the bias constant B so that RBM networks shift their output by an additive constant (which does not affect the probability distribution implied by the RBM network since any additive constant is canceled out by log Z in the full log probability). [sent-80, score-0.323]
7 We define a hardplus RBM network in the obvious way: as an RBM network with the softplus activation functions of the hidden neurons replaced with hardplus functions. [sent-83, score-1.655]
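As a concrete illustration of Definition 1 and of the hardplus variant just defined, here is a minimal Python sketch that evaluates both kinds of network on a binary input. The weights below are arbitrary placeholders (not values from the paper), and any visible-bias term is omitted.

import numpy as np

def softplus(y):
    # soft(y) = log(1 + exp(y)), the activation of each hidden neuron in an RBM network
    return np.log1p(np.exp(y))

def hardplus(y):
    # hard(y) = max(0, y), the activation used by a hardplus RBM network
    return np.maximum(0.0, y)

def rbm_network(x, W, b, B, activation=softplus):
    # Output bias B plus the unweighted (weight-1) sum of the m hidden activations
    # of the net-inputs [Wx]_j + b_j, as in Definition 1.
    return B + np.sum(activation(W @ x + b))

# Example with n = 3 visible units and m = 5 hidden neurons (arbitrary parameters).
rng = np.random.default_rng(0)
W, b, B = rng.normal(size=(5, 3)), rng.normal(size=5), 0.0
x = np.array([1, 0, 1])
print(rbm_network(x, W, b, B), rbm_network(x, W, b, B, activation=hardplus))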
8 The strategy we use to prove many of the results in this paper is to first establish them for hardplus RBM networks, and then show how they can be adapted to the standard softplus case via simulation results given in the following section. [sent-84, score-0.722]
9 4 Hardplus RBM networks versus (Softplus) RBM networks In this section we present some approximate simulation results which relate hardplus and standard (softplus) RBM networks. [sent-86, score-0.634]
10 The first result formalizes the simple observation that for large input magnitudes, the softplus and hardplus functions behave very similarly (see Figure 1, and Proposition 11 in the Appendix). [sent-87, score-0.729]
11 Suppose we have a softplus and a hardplus RBM network with identical sizes and parameters. [sent-89, score-0.753]
12 If, for each possible input x ∈ {0, 1}n , the magnitude of the input to each neuron is bounded from below by C, then the two networks compute the same real-valued function, up to an error (measured by | · |) which is bounded by m exp(−C). [sent-90, score-0.397]
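A quick numerical check of this closeness claim (a sketch, not the paper's proof): for a single neuron |soft(y) − hard(y)| = log(1 + exp(−|y|)) ≤ exp(−|y|), so if every net-input has magnitude at least C the two networks can differ by at most m·exp(−C).

import numpy as np

def softplus(y): return np.log1p(np.exp(y))
def hardplus(y): return np.maximum(0.0, y)

m = 100  # hypothetical number of hidden neurons
for C in [1.0, 5.0, 10.0]:
    ys = np.array([-2 * C, -C, C, 2 * C])            # net-inputs with magnitude >= C
    per_neuron_gap = np.max(np.abs(softplus(ys) - hardplus(ys)))
    print(C, per_neuron_gap, "<=", np.exp(-C), " network-level bound:", m * np.exp(-C))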
13 The next result demonstrates how to approximately simulate an RBM network with a hardplus RBM network while incurring an approximation error which shrinks as the number of neurons increases. [sent-91, score-0.796]
14 The basic idea is to simulate individual softplus neurons with groups of hardplus neurons that compute what amounts to a piece-wise linear approximation of the smooth region of a softplus function. [sent-92, score-1.113]
15 Suppose we have a (softplus) RBM network with m hidden neurons with parameters bounded in magnitude by C. [sent-94, score-0.379]
16 Then there exists a hardplus RBM network with ≤ 2m²p log(mp) + m hidden neurons and with parameters bounded in magnitude by C which computes the same function, up to an approximation error of 1/p. [sent-96, score-0.886]
17 Note that if p and m are polynomial functions of n, then the simulation produces hardplus RBM networks whose size is also polynomial in n. [sent-97, score-0.648]
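For a feel of how this size bound grows, the following tiny sketch evaluates 2m²p log(mp) + m for a few arbitrary values of m and p (natural log assumed here).

import math

def hardplus_size_bound(m, p):
    # Theorem's bound on the number of hardplus neurons needed to approximate
    # an m-neuron softplus RBM network to within error 1/p.
    return 2 * m**2 * p * math.log(m * p) + m

for m, p in [(10, 10), (100, 100), (1000, 1000)]:
    print(m, p, int(hardplus_size_bound(m, p)))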
18 While the output of a thresholded RBM network does not correspond to the log probability of an RBM, the following observation spells out how we can use thresholded RBM networks to establish lower bounds on the size of an RBM network required to compute certain simple functions (i. [sent-106, score-0.632]
19 If an RBM network of size m can compute a real-valued function g which represents f with margin δ, then there exists a thresholded RBM network that computes f with margin δ. [sent-109, score-0.515]
20 Using Theorem 3 we can prove a more interesting result which states that any lower bound result for thresholded hardplus RBMs implies a somewhat weaker lower bound result for standard RBM networks: Proposition 5. [sent-111, score-0.624]
21 RBM networks are distinguished from these, primarily because they have a single hidden layer and because the upper level weights are constrained to be 1. [sent-122, score-0.294]
22 Intuitively, the j-th softplus neuron acts as a “feature detector”, which when “activated” by an input s. [sent-124, score-0.383]
23 By comparison, standard threshold networks only require 1 hidden neuron to compute such a function. [sent-130, score-0.409]
24 In fact, it is easy to show1 that without the constraint on upper level weights, an RBM network would be, up to a linear factor, at least as efficient at representing real-valued functions as a neural network with 1 hidden layer of threshold neurons. [sent-131, score-0.537]
25 (1994), it follows that a thresholded RBM network is, up to a polynomial increase in size, at least as efficient at computing Boolean functions as 1-hidden layer neural networks with any “sigmoid-like” activation function2 , and polynomially bounded weights. [sent-134, score-0.553]
26 1 To see this, note that we could use 2 softplus neurons to simulate a single neuron with a “sigmoid-like” activation function (i. [sent-135, score-0.533]
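For intuition only (a sketch of the footnote's observation, with the steepness c chosen arbitrarily): the difference of two shifted softplus units, which requires an output weight of −1 and is therefore not available inside an RBM network, behaves like a saturating "sigmoid-like" unit.

import numpy as np

def softplus(y): return np.log1p(np.exp(y))

def sigmoid_like(y, c=10.0):
    # soft(c*y) - soft(c*y - c) rises monotonically from 0 to c; dividing by c
    # gives a function that approaches the ramp min(max(y, 0), 1) as c grows,
    # i.e. a "sigmoid-like" saturating nonlinearity built from 2 softplus units.
    return (softplus(c * y) - softplus(c * y - c)) / c

print(np.round(sigmoid_like(np.linspace(-1.0, 2.0, 13)), 3))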
27 Right: The total output of the hardplus RBM network constructed in Theorem 7. [sent-142, score-0.626]
28 7 Simulating hardplus RBM networks by a one-hidden-layer threshold network Here we provide a natural simulation of hardplus RBM networks by threshold networks with one hidden layer. [sent-146, score-1.536]
29 Because this is an efficient (polynomial) and exact simulation, it implies that a hardplus RBM network can be no more powerful than a threshold network with one hidden layer, for which several lower bound results are already known. [sent-147, score-0.879]
30 Let f be a real-valued function computed by a hardplus RBM network of size m. [sent-149, score-0.578]
31 Furthermore, if the weights of the RBM network have magnitude at most C, then the weights of the corresponding threshold network have magnitude at most (n + 1)C. [sent-151, score-0.489]
32 3 n² + 1-sized RBM networks can compute any symmetric function In this section we present perhaps the most surprising results of this paper: a construction of an n²-sized RBM network (or hardplus RBM network) for computing any given symmetric function of x. [sent-152, score-0.753]
33 Symmetric functions are already known3 to be computable by single hidden layer threshold networks (Hajnal et al. [sent-156, score-0.405]
34 Given that hardplus RBM networks appear to be strictly less expressive than such threshold networks (as discussed in Section 2. [sent-160, score-0.728]
35 Then (i) there exists a hardplus RBM network, of size n² + 1, and with weights polynomial in n and t1, . . . [sent-164, score-0.529]
36 . . . , tk, that computes f exactly, and (ii) for every ε > 0 there is a softplus RBM network of size n² + 1, and with weights polynomial in n, t0, . . . [sent-167, score-0.489]
37 Our hardplus RBM network consists of n “building blocks”, each composed of n hardplus neurons, plus one additional hardplus neuron, for a total size of m = n2 + 1. [sent-172, score-1.502]
38 The main technical challenge is then to choose the parameters of these building blocks so that the sum of n of these “rectified quadratics”, plus the output of the extra hardplus neuron (which handles x = 0), produces the desired output. (Footnote 3: The construction in Hajnal et al. [sent-175, score-0.817]
39 Note that this construction is considerably more complex than the well-known construction used for computing symmetric functions with 1 hidden layer threshold networks Hajnal et al. [sent-182, score-0.466]
40 While we cannot prove that ours is the most efficient possible construction for RBM networks, we can prove that a construction directly analogous to the one used for 1 hidden layer threshold networks—where each individual neuron computes a symmetric function—cannot possibly work for RBM networks. [sent-184, score-0.585]
41 To see this, first observe that any neuron that computes a symmetric function must compute a function of the form g(βX + b), where g is the activation function and β is some scalar. [sent-185, score-0.304]
42 Then because the positive sum of convex functions is convex, the output of the RBM network (which is the unweighted sum of the output of its neurons, plus a constant) is itself convex in X. [sent-187, score-0.281]
43 1 Lower bounds on the size of RBM networks for certain functions Existential results In this section we prove a result which establishes the existence of functions which cannot be computed by RBM networks that are not exponentially large. [sent-190, score-0.272]
44 Let F_{m,δ,n} represent the set of those Boolean functions on {0, 1}^n that can be computed by a thresholded RBM network of size m with margin δ. [sent-195, score-0.31]
45 In this sub-section we present strong evidence that this is not the case by exhibiting a simple Boolean function that provably requires exponentially many neurons to be computed by a thresholded RBM network, provided that the margin is not allowed to be exponentially smaller than the weights. [sent-203, score-0.311]
46 Using the simulation of hardplus RBM networks by 1 hidden layer threshold networks (Theorem 6), and Proposition 5, and an existing result about the hardness of computing IP by 1 hidden layer thresholded networks of bounded weights due to Hajnal et al. [sent-213, score-1.351]
47 If m < min{ 2^{n/3}/C, 2^{n/6}·√(δ/(4C log(2/δ))), 2^{n/9}·(δ/(4C))^{1/3} }, then no RBM network of size m, whose weights are bounded in magnitude by C, can compute a function which represents n-dimensional IP with margin δ. [sent-215, score-0.339]
48 If m < δ/(2·max{log 2, nC + log 2}) · 2^{n/4}, then no RBM network of size m, whose weights are upper bounded in value by C, can compute a function which represents n-dimensional IP with margin δ. [sent-223, score-0.3]
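For reference, a minimal sketch of the hard function: IP is assumed here to be the standard inner-product-mod-2 function, with the n-bit input split into two halves (the extracted text does not restate the exact convention).

def inner_product_mod2(x):
    # Assumed convention: split the n-bit input into two halves u, v and
    # return their inner product modulo 2.
    n = len(x)
    assert n % 2 == 0, "even input length assumed"
    u, v = x[:n // 2], x[n // 2:]
    return sum(ui * vi for ui, vi in zip(u, v)) % 2

print(inner_product_mod2([1, 0, 1, 1]))  # u=(1,0), v=(1,1): 1*1 + 0*1 = 1 (mod 2)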
49 We treated the RBM’s unnormalized log likelihood as a neural network which allowed us to relate an RBM’s representational efficiency to that of threshold networks, which are much better understood. [sent-228, score-0.3]
50 Consider a single neuron in the RBM network and the corresponding neuron in the hardplus RBM network, whose net-inputs are both given by y = w⊤x + b. [sent-331, score-0.897]
51 Suppose we have a softplus RBM network with a number of hidden neurons given by m. [sent-336, score-0.515]
52 To simulate this with a hardplus RBM network we will replace each neuron with a group of hardplus neurons with weights and biases chosen so that the sum of their outputs approximates the output of the original softplus neuron, to within a maximum error of 1/p where p is some constant > 0. [sent-337, score-1.69]
53 First we describe the construction for the simulation of a single softplus neuron by a group of hardplus neurons. [sent-338, score-0.853]
54 At a high level, this construction works by approximating soft(y), where y is the input to the neuron, by a piece-wise linear function expressed as the sum of a number of hardplus functions, whose “corners” all lie inside [−a, a]. [sent-341, score-0.543]
55 Outside this range of values, we use the fact that soft(y) converges exponentially fast (in a) to 0 on the left, and y on the right (which can both be trivially computed by hardplus functions). [sent-342, score-0.487]
56 For each i = 1, . . . , g, g + 1, we will set the weight vector wi and bias bi of the i-th hardplus neuron in our group so that the neuron outputs hard(ηi (y − qi )). [sent-358, score-0.914]
57 This is accomplished by taking wi = ηi w and bi = ηi (b − qi ), where w and b (without the subscripts), are the weight vector and bias of the original softplus neuron. [sent-359, score-0.317]
58 Note that since |ηi | ≤ 1 we have that the weights of these hard neurons are smaller in magnitude than the weights of the original soft neuron and thus bounded by C as required. [sent-360, score-0.629]
59 The total output (sum) for this group is T(y) = Σ_{i=1}^{g+1} hard(ηi (y − qi )). We will now bound the approximation error |T(y) − soft(y)| of our single-neuron simulation. [sent-361, score-0.322]
60 Note that for a given y we have that the i-th hardplus neuron in the group has a non-negative input iff y ≥ qi . [sent-362, score-0.733]
61 Letting j be the largest index i such that qi ≤ y, each neuron from i = 1 to i = j will have positive input and each neuron from i = j + 1 to i = g + 1 will have negative input. [sent-366, score-0.398]
62 Thus the approximation error at such y’s may be bounded as: |T(y) − soft(y)| = |(νj (y − qj ) + soft(qj ) − soft(q1 )) − soft(y)| ≤ max{|νj (y − qj ) + soft(qj ) − soft(y)|, soft(q1 )} ≤ max{qj+1 − qj , exp(−a)} = max{2a/g, exp(−a)}, where we have also used soft(q1 ) = soft(−a) ≤ exp(−a). [sent-373, score-0.376]
63 This gives: m · max{2a/(2mpa), 1/(mp)} = m · (1/(mp)) = 1/p. Thus we see that with m(g + 1) = m(2mp log(mp) + 1) ≤ 2m²p log(mp) + m neurons we can produce a hardplus RBM network which approximates the output of our softplus RBM network with error bounded by 1/p. [sent-377, score-1.205]
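The sketch below illustrates this single-neuron simulation: corners qi are placed on an even grid in [−a, a], the slopes νj are the secants of soft between consecutive corners, and the group output is T(y) = Σ_i hard(ηi(y − qi)). The specific values of a and g are arbitrary here, and the paper's exact choice of ηi and qi may differ slightly.

import numpy as np

def softplus(y): return np.log1p(np.exp(y))
def hardplus(y): return np.maximum(0.0, y)

def hardplus_group(y, a=6.0, g=50):
    # Approximate soft(y) by g+1 hardplus units whose "corners" lie in [-a, a].
    q = np.linspace(-a, a, g + 1)                  # corners q_1 = -a, ..., q_{g+1} = a
    nu = np.diff(softplus(q)) / np.diff(q)         # secant slope on each segment [q_i, q_{i+1}]
    nu = np.append(nu, 1.0)                        # slope ~1 to the right of q_{g+1} = a
    eta = np.diff(np.concatenate(([0.0], nu)))     # slope increment added at corner q_i
    return sum(hardplus(e * (y - qi)) for e, qi in zip(eta, q))

ys = np.linspace(-10.0, 10.0, 9)
errors = np.abs(hardplus_group(ys) - softplus(ys))
print(np.round(errors, 4))  # stays well within the analysis bound max{2a/g, exp(-a)}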
64 Suppose that there is an RBM network of size m with weights bounded in magnitude by C that computes a function g which represents f with margin δ. [sent-385, score-0.35]
65 Then taking p = 2/δ and applying Theorem 3 we have that there exists a hardplus RBM network of size 4m² log(2m/δ)/δ + m which computes a function g s. [sent-386, score-0.623]
66 Let f be a Boolean function on n variables computed by a size s hardplus RBM network, with parameters (W, b, d) . [sent-394, score-0.462]
67 We will first construct a three layer hybrid Boolean/threshold circuit/network where the output gate is a simple weighted sum, the middle layer consists of AND gates, and the bottom hidden layer consists of threshold neurons. [sent-395, score-0.523]
68 It is not hard to see that our three-layer network computes the same Boolean function as the original hardplus RBM network. [sent-399, score-0.617]
69 In order to obtain a single hidden layer threshold network, we replace each sub-network rooted at an AND gate of the middle layer by a single threshold neuron. [sent-400, score-0.47]
70 Consider a general sub-network consisting of an AND of: (1) a variable xj and (2) a threshold neuron computing (Σ_{i=1}^{n} ai xi ≥ b). [sent-401, score-0.283]
71 Note that if the input x is such that Σi ai xi ≥ b and xj = 1, then Σi ai xi + Qαj will be at least b + Q, so the threshold gate will output 1. [sent-404, score-0.292]
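A small sketch of this gate replacement, with Q taken to be any value exceeding the largest achievable value of Σi ai xi minus b (the paper's exact choice of Q, and the precise meaning of the αj term in the extracted text, are assumptions here; Qαj is read as Q·xj):

from itertools import product

def and_of_threshold(x, a, b, j):
    # Middle-layer gate: AND of the variable x[j] with the threshold sum_i a_i*x_i >= b.
    return int(x[j] == 1 and sum(ai * xi for ai, xi in zip(a, x)) >= b)

def merged_threshold(x, a, b, j):
    # Single threshold neuron replacing the AND sub-network:
    # sum_i a_i*x_i + Q*x_j >= b + Q, with Q large enough that x_j = 0 forces output 0.
    Q = sum(abs(ai) for ai in a) + abs(b) + 1
    return int(sum(ai * xi for ai, xi in zip(a, x)) + Q * x[j] >= b + Q)

# Exhaustive check on a small example with arbitrary integer weights.
a, b, j = [2, -1, 3], 2, 1
assert all(and_of_threshold(list(x), a, b, j) == merged_threshold(list(x), a, b, j)
           for x in product([0, 1], repeat=3))
print("AND-of-threshold and merged threshold agree on all inputs.")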
72 We will first describe how to construct a hardplus RBM network which satisfies the properties required for part (i). [sent-410, score-0.578]
73 It will be composed of n special groups of hardplus neurons (which are defined and discussed below), and one additional neuron we call the “zero-neuron”, which will be defined later. [sent-411, score-0.564]
74 Next we show how the parameters of the n building blocks used in our construction can be set to produce a hardplus RBM network with the desired output. [sent-416, score-0.701]
75 In addition to the n building blocks, our hardplus RBM will include an additional unit that we will call the zero-neuron, which handles x = 0. [sent-432, score-0.504]
76 Finally, the output bias B of our hardplus RBM network will be set to −d. [sent-434, score-0.649]
77 The total output of the network is simply the sum of the outputs of the n different building blocks, the zero neuron, and constant bias −d. [sent-435, score-0.27]
78 ak = −(tk + d)/k², bk = 2(tk + d)/k, and hence bk = −2k·ak. This claim is self-evidently true by examining the basic definitions of γj and ej and realizing that ak and bk are telescoping sums. [sent-438, score-0.28]
79 For all other building blocks (γj , ej ), j < n, since ej ≤ j + 1, each such block outputs zero because γj X(ej − X) is less than or equal to zero. [sent-444, score-0.343]
80 Thus the sum of all of the building blocks when X = n is just the output of the (γn , en )-block, which is γn · n(en − n) = −γn · n² + γn en · n = −(tn + d) + 2(tn + d) = tn + d, as desired. [sent-445, score-0.282]
81 Plugging in the above expressions for ak and bk from Claim 2, we see that the value of this polynomial at X = k is: ak k² + bk k = (−(tk + d)/k²)·k² + (2(tk + d)/k)·k = −(tk + d) + 2(tk + d) = tk + d. Finally, it remains to ensure that our hardplus RBM network outputs t0 for X = 0. [sent-451, score-0.779]
82 When X = 0, this neuron will output t0 + d, making the total output of the network −d + t0 + d = t0 as needed. [sent-454, score-0.363]
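A quick sanity check of Claim 2 and the evaluation at X = k above, using only the formulas visible in the extract (the offset d and the target values t_k below are arbitrary illustrative numbers):

def block_coefficients(t_k, d, k):
    # Claim 2: a_k = -(t_k + d)/k^2 and b_k = 2(t_k + d)/k, hence b_k = -2*k*a_k.
    a_k = -(t_k + d) / k**2
    b_k = 2 * (t_k + d) / k
    assert abs(b_k + 2 * k * a_k) < 1e-12
    return a_k, b_k

d = 3.0
for k, t_k in [(1, 0.5), (2, -1.0), (5, 4.0)]:
    a_k, b_k = block_coefficients(t_k, d, k)
    # The quadratic a_k*X^2 + b_k*X evaluated at X = k gives t_k + d, as in the text.
    assert abs(a_k * k**2 + b_k * k - (t_k + d)) < 1e-12
print("Claim 2 and the X = k evaluation check out on the sample values.")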
83 And note that the magnitude of the input to each neuron is lower bounded by a positive linear function of M and d (a non-trivial fact which we will prove below). [sent-458, score-0.272]
84 From these two observations it follows that to achieve the condition that the magnitude of the input to each neuron is greater than C(n) for some function C of n, the weights need to grow linearly with C. [sent-459, score-0.271]
85 There are two cases where a hardplus neuron in building block j has a negative input. [sent-461, score-0.671]
86 Let f : {0, 1}^n → {0, 1} be a Boolean function computed by a threshold neuron with arbitrary real incoming weights and bias. [sent-476, score-0.312]
87 There exists a constant K and another threshold neuron computing f , all of whose incoming weights and bias are integers with magnitude at most 2^{Kn log n}. [sent-477, score-0.421]
88 For each 0 < α < 1, let fα,n be the subset of such Boolean functions that are computable by threshold networks with one hidden layer with at most s neurons. [sent-481, score-0.389]
89 By using Fact 14 repeatedly for each of the hidden neurons, we obtain another threshold network having still s hidden units computing the same Boolean function such that the incoming weights and biases of all hidden neurons are bounded by 2^{Kn log n}. [sent-485, score-0.752]
90 Clearly, there are 2 at most 2Kn log n ways to choose the incoming weights of a given neuron in the hidden layer. [sent-489, score-0.341]
91 Consider any thresholded RBM network with m hidden units that is computing a n-dimensional Boolean function with margin δ. [sent-495, score-0.393]
92 Using Proposition 5, we can obtain a thresholded hardplus RBM network of size 4m²/δ · log(2m/δ) + m that computes the same Boolean function as the original thresholded RBM network. [sent-496, score-0.817]
93 Applying Theorem 6 and thresholding the output, we obtain a thresholded network with 1 hidden layer of thresholds which is the same size and computes the same Boolean function. [sent-497, score-0.428]
94 This argument shows that the set of Boolean functions computed by thresholded RBM networks of m hidden units and margin δ is a subset of the Boolean functions computed by 1-hidden-layer threshold networks of size 4m²n/δ · log(2m/δ) + mn. [sent-498, score-0.582]
95 3 implies that a threshold neuron in a thresholded network of size m is a 2δ/α-discriminator (α is the sum of the absolute values of the 2nd-level weights in their notation). [sent-510, score-0.512]
96 For a neural network of size m with a single hidden layer of threshold neurons and weights bounded by C that computes a function that represents IP with margin δ, we have m ≥ δ·2^{n/3}/(6C). [sent-517, score-0.683]
97 By Proposition 5 it suffices to show that no thresholded hardplus RBM network of size ≤ 4m² log(2m/δ)/δ + m with parameters bounded by C can compute IP with margin δ/2. [sent-519, score-0.794]
98 Then by Theorem 6 there exists a single hidden layer threshold network of size ≤ 4m²n log(2m/δ)/δ + mn with weights bounded in magnitude by (n + 1)C that computes the same function, i. [sent-521, score-0.541]
99 Finally, given a set of activation functions H and a set of inner functions I, an (H, I)-network is one each of whose hidden units is a neuron formed by applying some activation h ∈ H to some inner function from I. [sent-541, score-0.519]
100 Then, every (H, I) network with one layer of m hidden units computing IP with a margin of δ must satisfy the following: m≥ δ 2 (H, I) 2n/4 . [sent-545, score-0.384]
wordName wordTfidf (topN-words)
[('rbm', 0.552), ('hardplus', 0.462), ('tj', 0.364), ('softplus', 0.215), ('soft', 0.179), ('boolean', 0.153), ('neuron', 0.151), ('network', 0.116), ('qj', 0.106), ('ej', 0.105), ('rbms', 0.105), ('neurons', 0.102), ('thresholded', 0.097), ('hajnal', 0.096), ('layer', 0.088), ('ip', 0.088), ('threshold', 0.083), ('hidden', 0.082), ('qi', 0.079), ('networks', 0.076), ('activation', 0.065), ('gh', 0.063), ('margin', 0.062), ('maass', 0.058), ('parity', 0.058), ('aj', 0.056), ('blocks', 0.051), ('exp', 0.051), ('boltzmann', 0.048), ('weights', 0.048), ('thresh', 0.048), ('output', 0.048), ('tk', 0.046), ('gate', 0.046), ('computes', 0.045), ('en', 0.043), ('building', 0.042), ('bounded', 0.04), ('jaj', 0.04), ('magnitude', 0.039), ('tn', 0.038), ('units', 0.036), ('functions', 0.035), ('claim', 0.032), ('montufar', 0.032), ('expressive', 0.031), ('bk', 0.031), ('construction', 0.03), ('incoming', 0.03), ('log', 0.03), ('xi', 0.03), ('representational', 0.028), ('proposition', 0.027), ('yj', 0.027), ('bj', 0.026), ('mp', 0.026), ('unnormalized', 0.026), ('symmetric', 0.026), ('hj', 0.025), ('prove', 0.025), ('exponentially', 0.025), ('computable', 0.025), ('theorem', 0.025), ('ak', 0.025), ('outputs', 0.024), ('existential', 0.024), ('group', 0.024), ('bias', 0.023), ('energy', 0.023), ('hard', 0.022), ('evaluates', 0.021), ('biases', 0.021), ('mn', 0.02), ('bound', 0.02), ('machines', 0.02), ('simulation', 0.02), ('quali', 0.019), ('polynomial', 0.019), ('hinton', 0.019), ('kinds', 0.019), ('salakhutdinov', 0.019), ('ai', 0.019), ('inner', 0.018), ('lhs', 0.018), ('distributions', 0.018), ('max', 0.018), ('deep', 0.017), ('mass', 0.017), ('sum', 0.017), ('whose', 0.017), ('input', 0.017), ('free', 0.017), ('arbitrarily', 0.017), ('neural', 0.017), ('compute', 0.017), ('hardness', 0.017), ('monotone', 0.017), ('grow', 0.016), ('et', 0.016), ('block', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
2 0.28165838 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.15007827 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
4 0.12544033 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
5 0.11381154 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.10691851 36 nips-2013-Annealing between distributions by averaging moments
7 0.10555501 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
8 0.10206158 30 nips-2013-Adaptive dropout for training deep neural networks
9 0.095454812 64 nips-2013-Compete to Compute
10 0.086438417 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
11 0.086170167 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
12 0.084389292 121 nips-2013-Firing rate predictions in optimal balanced networks
13 0.082644068 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
14 0.080969848 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
15 0.077042341 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
16 0.075995311 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
17 0.072705559 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
18 0.071386993 210 nips-2013-Noise-Enhanced Associative Memories
19 0.069800839 5 nips-2013-A Deep Architecture for Matching Short Texts
20 0.068634652 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
topicId topicWeight
[(0, 0.147), (1, 0.078), (2, -0.096), (3, -0.084), (4, -0.039), (5, -0.064), (6, -0.038), (7, -0.011), (8, 0.046), (9, -0.122), (10, 0.241), (11, 0.006), (12, -0.009), (13, 0.026), (14, 0.061), (15, 0.061), (16, -0.009), (17, -0.023), (18, -0.028), (19, 0.015), (20, 0.005), (21, 0.012), (22, 0.14), (23, 0.023), (24, 0.063), (25, -0.015), (26, -0.21), (27, 0.005), (28, -0.009), (29, 0.003), (30, -0.02), (31, -0.056), (32, -0.029), (33, -0.09), (34, 0.07), (35, -0.081), (36, 0.006), (37, 0.043), (38, -0.076), (39, 0.022), (40, -0.093), (41, 0.079), (42, 0.044), (43, 0.041), (44, -0.034), (45, -0.055), (46, -0.028), (47, 0.078), (48, 0.147), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.93342376 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
Author: James Martens, Arkadev Chattopadhya, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM’s unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1’s in their input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones. 1
2 0.73179525 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.71356326 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Author: Yann Dauphin, Yoshua Bengio
Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1
4 0.61906368 36 nips-2013-Annealing between distributions by averaging moments
Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov
Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1
5 0.6105395 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
Author: Ian Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio
Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MPDBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1 1
6 0.54976416 64 nips-2013-Compete to Compute
7 0.54904431 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
8 0.49386337 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
9 0.46900386 75 nips-2013-Convex Two-Layer Modeling
10 0.46328396 251 nips-2013-Predicting Parameters in Deep Learning
11 0.45809314 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
12 0.45433292 5 nips-2013-A Deep Architecture for Matching Short Texts
13 0.43013716 267 nips-2013-Recurrent networks of coupled Winner-Take-All oscillators for solving constraint satisfaction problems
14 0.42952704 264 nips-2013-Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively
15 0.42198527 121 nips-2013-Firing rate predictions in optimal balanced networks
16 0.42096779 30 nips-2013-Adaptive dropout for training deep neural networks
17 0.4103424 160 nips-2013-Learning Stochastic Feedforward Neural Networks
18 0.3551172 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
19 0.34460363 210 nips-2013-Noise-Enhanced Associative Memories
20 0.34292334 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
topicId topicWeight
[(16, 0.029), (33, 0.137), (34, 0.069), (41, 0.028), (49, 0.371), (56, 0.105), (70, 0.041), (85, 0.028), (89, 0.02), (93, 0.034), (95, 0.034)]
simIndex simValue paperId paperTitle
1 0.94823557 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do
Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, thus real world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem, and present experimental results with Probabilistic-FF planner. 1
2 0.93615192 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
Author: Jasper Snoek, Richard Zemel, Ryan P. Adams
Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data. 1
3 0.91003895 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
4 0.89153332 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani
Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a lowrank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allow both to scale to very high-dimensional neural data. 1
5 0.85575545 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
Author: Suvrit Sra, Reshad Hosseini
Abstract: Hermitian positive definite (hpd) matrices recur throughout machine learning, statistics, and optimisation. This paper develops (conic) geometric optimisation on the cone of hpd matrices, which allows us to globally optimise a large class of nonconvex functions of hpd matrices. Specifically, we first use the Riemannian manifold structure of the hpd cone for studying functions that are nonconvex in the Euclidean sense but are geodesically convex (g-convex), hence globally optimisable. We then go beyond g-convexity, and exploit the conic geometry of hpd matrices to identify another class of functions that remain amenable to global optimisation without requiring g-convexity. We present key results that help recognise g-convexity and also the additional structure alluded to above. We illustrate our ideas by applying them to likelihood maximisation for a broad family of elliptically contoured distributions: for this maximisation, we derive novel, parameter free fixed-point algorithms. To our knowledge, ours are the most general results on geometric optimisation of hpd matrices known so far. Experiments show that advantages of using our fixed-point algorithms. 1
6 0.83366221 70 nips-2013-Contrastive Learning Using Spectral Methods
same-paper 7 0.82406777 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
8 0.76606417 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
9 0.75458777 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
10 0.73533767 121 nips-2013-Firing rate predictions in optimal balanced networks
11 0.6825251 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
12 0.64530224 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
13 0.63779533 64 nips-2013-Compete to Compute
14 0.63253856 246 nips-2013-Perfect Associative Learning with Spike-Timing-Dependent Plasticity
15 0.61175138 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
16 0.60800862 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
18 0.6008029 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
19 0.59978533 301 nips-2013-Sparse Additive Text Models with Low Rank Background
20 0.59897482 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization