Author: James Martens, Arkadev Chattopadhyay, Toni Pitassi, Richard Zemel
Abstract: This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM's unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1's in its input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM, our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones.
References

Aaron Courville, James Bergstra, and Yoshua Bengio. Unsupervised models of images by spike-and-slab RBMs. In Proceedings of the 28th International Conference on Machine Learning, pages 952–960, 2011.

María Angélica Cueto, Jason Morton, and Bernd Sturmfels. Geometry of the Restricted Boltzmann Machine. arXiv:0908.4425v1, 2009.

J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. J. Comput. Syst. Sci., 65(4):612–625, 2002.

Yoav Freund and David Haussler. Unsupervised learning of distributions on binary vectors using two layer networks, 1994.

A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. J. Comput. System Sci., 46:129–154, 1993.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Geoffrey E. Hinton, Simon Osindero, Max Welling, and Yee Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.

Nicolas Le Roux and Yoshua Bengio. Representational power of Restricted Boltzmann Machines and deep belief networks. Neural Computation, 20(6):1631–1649, 2008.

Philip Long and Rocco Servedio. Restricted Boltzmann Machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning, pages 952–960, 2010.

Wolfgang Maass. Bounds for the computational power and learning complexity of analog neural nets (extended abstract). In Proc. of the 25th ACM Symp. on Theory of Computing, pages 335–344, 1992.

Wolfgang Maass, Georg Schnitger, and Eduardo D. Sontag. A comparison of the computational power of sigmoid and boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning, pages 127–151. Kluwer, 1994.

Benjamin M. Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for Restricted Boltzmann Machine learning. Journal of Machine Learning Research - Proceedings Track, 9:509–516, 2010.

G. Montufar, J. Rauh, and N. Ay. Expressive power and approximation errors of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems, 2011.

Guido Montufar and Nihat Ay. Refinements of universal approximation results for deep belief networks and Restricted Boltzmann Machines. Neural Comput., 23(5):1306–1319, May 2011.

Saburo Muroga. Threshold logic and its applications. Wiley, 1971.

Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. Journal of Machine Learning Research - Proceedings Track, 5:448–455, 2009.

Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 872–879. Omnipress, 2008.

Yichuan Tang and Ilya Sutskever. Data normalization in the learning of Restricted Boltzmann Machines. Technical Report UTML-TR-11-2, Department of Computer Science, University of Toronto, 2011.

A Appendix

A.1 Free-energy derivation

The following is a derivation of the well-known formula for the free-energy of an RBM.
This tractable form is made possible by the bipartite interaction structure of the RBM's units:

$$
\begin{aligned}
p(x) &= \frac{1}{Z_\theta} \sum_h \exp\!\big(x^\top W h + c^\top x + b^\top h\big) \\
&= \frac{1}{Z_\theta} \exp(c^\top x) \prod_j \sum_{h_j \in \{0,1\}} \exp\!\big(x^\top [W]_j h_j + b_j h_j\big) \\
&= \frac{1}{Z_\theta} \exp(c^\top x) \exp\!\Big(\sum_j \log \sum_{h_j \in \{0,1\}} \exp\!\big(x^\top [W]_j h_j + b_j h_j\big)\Big) \\
&= \frac{1}{Z_\theta} \exp\!\Big(c^\top x + \sum_j \log\big[1 + \exp(x^\top [W]_j + b_j)\big]\Big) \\
&= \frac{1}{Z_\theta} \exp(-F_\theta(x)).
\end{aligned}
$$
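As a quick numerical sanity check of this identity (a sketch added here, not part of the original derivation), the following snippet compares the brute-force sum over all hidden configurations with exp(−F_θ(x)) for a small, randomly initialized RBM. The sizes and variable names are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
W = rng.normal(scale=0.5, size=(n_vis, n_hid))  # visible-to-hidden weights
b = rng.normal(scale=0.5, size=n_hid)           # hidden biases
c = rng.normal(scale=0.5, size=n_vis)           # visible biases

def neg_free_energy(x):
    # -F_theta(x) = c^T x + sum_j log(1 + exp(x^T [W]_j + b_j))
    return c @ x + np.sum(np.log1p(np.exp(x @ W + b)))

def summed_out_hidden(x):
    # sum_h exp(x^T W h + c^T x + b^T h), by brute force over all h
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n_hid):
        h = np.array(bits)
        total += np.exp(x @ W @ h + c @ x + b @ h)
    return total

for bits in itertools.product([0, 1], repeat=n_vis):
    x = np.array(bits)
    assert np.isclose(summed_out_hidden(x), np.exp(neg_free_energy(x)))
print("exp(-F(x)) matches the brute-force sum over hidden configurations")
```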
A.2 Proofs for Section 2.4

We begin with a useful technical result:

Proposition 11. For arbitrary y ∈ R the following basic facts about the softplus function hold:
$$ y - \mathrm{soft}(y) = -\mathrm{soft}(-y), \qquad \mathrm{soft}(y) \le \exp(y). $$

Proof. The first fact follows from:
$$
y - \mathrm{soft}(y) = \log(\exp(y)) - \log(1 + \exp(y)) = \log\frac{\exp(y)}{1 + \exp(y)} = \log\frac{1}{1 + \exp(-y)} = -\log(1 + \exp(-y)) = -\mathrm{soft}(-y).
$$
To prove the second fact, we will show that the function f(y) = exp(y) − soft(y) is positive. Note that f tends to 0 as y goes to −∞, since both exp(y) and soft(y) do. It remains to show that f is monotonically increasing, which we establish by showing that its derivative is positive:
$$
f'(y) = \exp(y) - \frac{1}{1 + \exp(-y)} > 0 \iff \exp(y)\big(1 + \exp(-y)\big) - 1 > 0 \iff \exp(y) + 1 - 1 > 0 \iff \exp(y) > 0.
$$

Proof of Lemma 2. Consider a single neuron in the RBM network and the corresponding neuron in the hardplus RBM network, whose net inputs are both given by y = w^⊤x + b. For each x, there are two cases for y. If y ≥ 0, we have by hypothesis that y ≥ C, and so:
$$ |\mathrm{hard}(y) - \mathrm{soft}(y)| = |y - \mathrm{soft}(y)| = |-\mathrm{soft}(-y)| = \mathrm{soft}(-y) \le \exp(-y) \le \exp(-C). $$
And if y < 0, we have by hypothesis that y ≤ −C, and so:
$$ |\mathrm{hard}(y) - \mathrm{soft}(y)| = |0 - \mathrm{soft}(y)| = \mathrm{soft}(y) \le \exp(y) \le \exp(-C). $$
Thus, each corresponding pair of neurons computes the same function up to an error bounded by exp(−C). From this it is easy to show that the entire circuits compute the same function, up to an error bounded by m exp(−C), as required.

Proof of Theorem 3. Suppose we have a softplus RBM network with m hidden neurons. To simulate this with a hardplus RBM network we will replace each neuron with a group of hardplus neurons whose weights and biases are chosen so that the sum of their outputs approximates the output of the original softplus neuron to within a maximum error of 1/p, where p is some constant > 0.

First we describe the construction for the simulation of a single softplus neuron by a group of hardplus neurons. Let g be a positive integer and a > 0. We will define these more precisely later, but for what follows their precise values are not important. At a high level, this construction works by approximating soft(y), where y is the input to the neuron, by a piecewise-linear function expressed as the sum of a number of hardplus functions whose "corners" all lie inside [−a, a]. Outside this range of values, we use the fact that soft(y) converges exponentially fast (in a) to 0 on the left, and to y on the right (both of which can be trivially computed by hardplus functions).

Formally, for i = 1, 2, ..., g, g + 1, let:
$$ q_i = (i-1)\frac{2a}{g} - a. $$
For i = 1, 2, ..., g, let:
$$ \nu_i = \frac{\mathrm{soft}(q_{i+1}) - \mathrm{soft}(q_i)}{q_{i+1} - q_i}, $$
and also let ν_0 = 0 and ν_{g+1} = 1. Finally, for i = 1, 2, ..., g, g + 1, let:
$$ \eta_i = \nu_i - \nu_{i-1}. $$
With these definitions it is straightforward to show that 1 ≥ ν_i > 0 and ν_i > ν_{i−1}, and consequently 0 < η_i < 1 for each i. It is also easy to show that q_i > q_{i−1}, q_1 = −a and q_{g+1} = a.

For i = 1, 2, ..., g, g + 1, we will set the weight vector w_i and bias b_i of the i-th hardplus neuron in our group so that the neuron outputs hard(η_i(y − q_i)). This is accomplished by taking w_i = η_i w and b_i = η_i(b − q_i), where w and b (without the subscripts) are the weight vector and bias of the original softplus neuron. Note that since |η_i| ≤ 1, the weights of these hardplus neurons are smaller in magnitude than the weights of the original softplus neuron, and thus bounded by C as required.

The total output (sum) for this group is:
$$ T(y) = \sum_{i=1}^{g+1} \mathrm{hard}\big(\eta_i (y - q_i)\big). $$

We will now bound the approximation error |T(y) − soft(y)| of our single-neuron simulation. Note that for a given y, the i-th hardplus neuron in the group has a non-negative input iff y ≥ q_i. Thus for y < −a all of the neurons have a negative input. And for y ≥ −a, if we take j to be the largest index i s.t. q_i ≤ y, then each neuron from i = 1 to i = j will have non-negative input and each neuron from i = j + 1 to i = g + 1 will have negative input.

Consider the case that y < −a. Since the input to each neuron is negative, they each output 0 and thus T(y) = 0. This results in an approximation error ≤ exp(−a):
$$ |T(y) - \mathrm{soft}(y)| = |0 - \mathrm{soft}(y)| = \mathrm{soft}(y) < \mathrm{soft}(-a) \le \exp(-a). $$

Next, consider the case that y ≥ −a, and let j be as given above. In such a case we have:
$$
\begin{aligned}
T(y) = \sum_{i=1}^{g+1} \mathrm{hard}(\eta_i(y - q_i)) &= \sum_{i=1}^{j} \eta_i (y - q_i) + 0 \\
&= \sum_{i=1}^{j} (\nu_i - \nu_{i-1})(y - q_i) \\
&= y\sum_{i=1}^{j} (\nu_i - \nu_{i-1}) - \sum_{i=1}^{j} (\nu_i - \nu_{i-1}) q_i \\
&= y\nu_j - y\nu_0 - \nu_j q_j + \sum_{i=1}^{j-1} \nu_i (q_{i+1} - q_i) + \nu_0 q_1 \\
&= \nu_j (y - q_j) + \sum_{i=1}^{j-1} \big(\mathrm{soft}(q_{i+1}) - \mathrm{soft}(q_i)\big) \\
&= \nu_j (y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(q_1).
\end{aligned}
$$

For y ≤ a we note that ν_j(y − q_j) + soft(q_j) is a secant approximation to soft(y), generated by the secant from q_j to q_{j+1}, and upper bounds soft(y) for y ∈ [q_j, q_{j+1}]. Thus a crude bound on the error is soft(q_{j+1}) − soft(q_j), which only makes use of the fact that soft(y) is monotonic. Then because the slope (derivative) of soft(y) is σ(y) = 1/(1 + exp(−y)) < 1, we can further (crudely) bound this by q_{j+1} − q_j. Thus the approximation error at such y's may be bounded as:
$$
\begin{aligned}
|T(y) - \mathrm{soft}(y)| &= \big|\big(\nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(q_1)\big) - \mathrm{soft}(y)\big| \\
&\le \max\big\{|\nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(y)|,\ \mathrm{soft}(q_1)\big\} \\
&\le \max\{q_{j+1} - q_j,\ \exp(-a)\} = \max\Big\{\frac{2a}{g},\ \exp(-a)\Big\},
\end{aligned}
$$
where we have also used soft(q_1) = soft(−a) ≤ exp(−a).

For the case y > a, every q_i satisfies q_i ≤ a < y, so the largest index j such that q_j ≤ y is j = g + 1. Hence ν_j(y − q_j) + soft(q_j) − soft(q_1) = y − a + soft(a) − soft(−a) = y. Thus the approximation error at such y's is:
$$ |y - \mathrm{soft}(y)| = |-\mathrm{soft}(-y)| = \mathrm{soft}(-y) \le \mathrm{soft}(-a) \le \exp(-a). $$

Having covered all cases for y, we conclude that the general approximation error for a single softplus neuron satisfies the following bound:
$$ |T(y) - \mathrm{soft}(y)| \le \max\Big\{\frac{2a}{g},\ \exp(-a)\Big\}. $$

For a softplus RBM network with m neurons, the hardplus RBM network constructed by replacing each neuron with a group of hardplus neurons as described above will require a total of m(g + 1) neurons, and will have an approximation error bounded by the sum of the individual approximation errors, which is itself bounded by:
$$ m \max\Big\{\frac{2a}{g},\ \exp(-a)\Big\}. $$
Taking a = log(mp) and g = 2mpa, this gives:
$$ m \max\Big\{\frac{2a}{g},\ \exp(-a)\Big\} \le m \max\Big\{\frac{2a}{2mpa},\ \frac{1}{mp}\Big\} = \max\Big\{\frac{1}{p},\ \frac{1}{p}\Big\} = \frac{1}{p}. $$
Thus we see that with m(g + 1) = m(2mp log(mp) + 1) ≤ 2m²p log(mp) + m neurons we can produce a hardplus RBM network which approximates the output of our softplus RBM network with error bounded by 1/p.

Remark 12. Note that the construction used in the above lemma is likely far from optimal, as the placement of the q_i's could be done more carefully. Also, the error bound we proved is crude and does not make strong use of the properties of the softplus function. Nonetheless, it seems good enough for our purposes.
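The single-neuron construction in the proof of Theorem 3 can be checked numerically. The sketch below (illustrative, not from the paper) builds T(y) = Σ_i hard(η_i(y − q_i)) for particular values of a and g and confirms that the worst-case deviation from soft(y) stays below max{2a/g, exp(−a)}; the chosen a, g, and evaluation grid are arbitrary.

```python
import numpy as np

def softplus(y):
    return np.log1p(np.exp(y))

def hardplus(y):
    return np.maximum(0.0, y)

def piecewise_T(y, a, g):
    # knots q_1..q_{g+1} spanning [-a, a], secant slopes nu, increments eta
    q = (np.arange(1, g + 2) - 1) * (2 * a / g) - a
    nu = np.empty(g + 2)
    nu[0] = 0.0                                       # nu_0
    nu[1:g + 1] = np.diff(softplus(q)) / np.diff(q)   # nu_1..nu_g (secant slopes)
    nu[g + 1] = 1.0                                   # nu_{g+1}
    eta = np.diff(nu)                                 # eta_i = nu_i - nu_{i-1}
    # T(y) = sum_i hard(eta_i * (y - q_i))
    return hardplus(eta[None, :] * (y[:, None] - q[None, :])).sum(axis=1)

a, g = 5.0, 50
ys = np.linspace(-3 * a, 3 * a, 4001)
err = np.max(np.abs(piecewise_T(ys, a, g) - softplus(ys)))
bound = max(2 * a / g, np.exp(-a))
print(err, "<=", bound)
assert err <= bound + 1e-12
```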
A.3 Proofs for Section 2.5

Proof of Proposition 5. Suppose that there is an RBM network of size m, with weights bounded in magnitude by C, that computes a function g which represents f with margin δ. Then taking p = 2/δ and applying Theorem 3, there exists a hardplus RBM network of size 4m² log(2m/δ)/δ + m which computes a function g′ s.t. |g(x) − g′(x)| ≤ 1/p = δ/2 for all x. Note that f(x) = 1 ⇒ thresh(g(x)) = 1 ⇒ g(x) ≥ δ ⇒ g′(x) ≥ δ − δ/2 = δ/2, and similarly f(x) = 0 ⇒ thresh(g(x)) = 0 ⇒ g(x) ≤ −δ ⇒ g′(x) ≤ −δ + δ/2 = −δ/2. Thus we conclude that g′ represents f with margin δ/2.

A.4 Proofs for Section 2.7

Proof of Theorem 6. Let f be a Boolean function on n variables computed by a hardplus RBM network of size m with parameters (W, b, d). We will first construct a three-layer hybrid Boolean/threshold circuit where the output gate is a simple weighted sum, the middle layer consists of AND gates, and the bottom hidden layer consists of threshold neurons. There will be n·m AND gates, one for every i ∈ [n] and j ∈ [m]. The (i, j)-th AND gate will have inputs: (1) x_i and (2) (x^⊤[W]_j ≥ b_j). The weight going from the (i, j)-th AND gate to the output will be given by [W]_{i,j}. It is not hard to see that this three-layer network computes the same Boolean function as the original hardplus RBM network.

In order to obtain a single-hidden-layer threshold network, we replace each sub-network rooted at an AND gate of the middle layer by a single threshold neuron. Consider a general sub-network consisting of an AND of: (1) a variable x_j and (2) a threshold neuron computing (Σ_{i=1}^n a_i x_i ≥ b). Let Q be some number greater than the sum of all the a_i's. We replace this sub-network by a single threshold gate that computes (Σ_{i=1}^n a_i x_i + Q x_j ≥ b + Q). Note that if the input x is such that Σ_i a_i x_i ≥ b and x_j = 1, then Σ_i a_i x_i + Q x_j will be at least b + Q, so the threshold gate will output 1. In all other cases, the threshold gate will output zero. (If Σ_i a_i x_i < b, then even if x_j = 1, the sum will still be less than Q + b. Similarly, if x_j = 0, then since Σ_i a_i x_i is never greater than Σ_i a_i, the total sum will be less than Q.) Since the original weights are bounded in magnitude by C, Q can be taken to be at most (n + 1)C, so the weights of the resulting threshold network are bounded in magnitude by (n + 1)C.
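The gate-merging step used in the proof of Theorem 6 can be illustrated with a small exhaustive check (a sketch, not from the paper): the single threshold gate Σ_i a_i x_i + Q x_j ≥ b + Q agrees with AND(x_j, [Σ_i a_i x_i ≥ b]) on every input. Here Q = Σ_i |a_i| + |b| + 1 is a slightly more conservative choice than in the proof, and the weights a, offset b, and index j are arbitrary illustrative values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=n)                    # lower-level threshold weights
b = rng.normal()                          # lower-level threshold offset
j = 2                                     # index of the AND-ed variable x_j
Q = np.sum(np.abs(a)) + abs(b) + 1.0      # conservative choice of Q

for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits)
    and_gate = int(x[j] == 1 and a @ x >= b)     # AND(x_j, [a.x >= b])
    merged = int(a @ x + Q * x[j] >= b + Q)      # single merged threshold gate
    assert and_gate == merged
print("merged threshold gate matches AND(variable, threshold) on all inputs")
```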
A.5 Proof of Theorem 7

Proof. We will first describe how to construct a hardplus RBM network which satisfies the properties required for part (i). It will be composed of n special groups of hardplus neurons (which are defined and discussed below), and one additional neuron we call the "zero-neuron", which will be defined later.

Definition 13. A "building block" is a group of n hardplus neurons, parameterized by the scalars γ and e, where the weight vector w ∈ R^n between the i-th neuron in the group and the input layer is given by w_i = M − γ and w_j = −γ for j ≠ i, and the bias is given by b = γe − M, where M is a constant chosen so that M > γe.

For a given x, the input to the i-th neuron of a particular building block is given by:
$$
\sum_{j=1}^{n} w_j x_j + b = w_i x_i + \sum_{j \ne i} w_j x_j + b = (M - \gamma)x_i - \gamma(X - x_i) + \gamma e - M = \gamma(e - X) - M(1 - x_i),
$$
where X denotes the number of 1's in x. When x_i = 0, this is γ(e − X) − M < 0, and so the neuron will output 0 (by definition of the hardplus function). On the other hand, when x_i = 1, the input to the neuron will be γ(e − X), and thus the output will be max(0, γ(e − X)). In general, the output is given by:
$$ x_i \max(0, \gamma(e - X)). $$
From this it follows that the combined output from the neurons in the building block is:
$$ \sum_{i=1}^{n} x_i \max(0, \gamma(e - X)) = \max(0, \gamma(e - X)) \sum_{i=1}^{n} x_i = \max(0, \gamma(e - X))\, X = \max\big(0, \gamma X(e - X)\big). $$
Note that whenever X is positive, the output is a concave quadratic function in X, with zeros at X = 0 and X = e, maximized at X = e/2 with value γe²/4.

Next we show how the parameters of the n building blocks used in our construction can be set to produce a hardplus RBM network with the desired output. First, define d to be any number greater than or equal to 2n² Σ_j |t_j|. Indexing the building blocks by j for 1 ≤ j ≤ n, we define their respective parameters γ_j, e_j as follows:
$$ \gamma_n = \frac{t_n + d}{n^2}, \qquad e_n = 2n, $$
$$ \gamma_j = \frac{t_j + d}{j^2} - \frac{t_{j+1} + d}{(j+1)^2}, \qquad e_j = \frac{2}{\gamma_j}\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \quad \text{for } 1 \le j \le n - 1, $$
where we have assumed that γ_j ≠ 0 (which will be established, along with some other properties of these definitions, in the next claim).

Claim 1. (i) For all j, 1 ≤ j ≤ n, γ_j > 0, and (ii) for all j, 1 ≤ j ≤ n − 1, j ≤ e_j ≤ j + 1.

Proof of Claim 1. Part (i): For j = n, by definition we know that γ_n = (t_n + d)/n². For d ≥ 2n² Σ_j |t_j| > |t_n|, the numerator is positive and therefore γ_n is positive. For j < n, we have:
$$
\gamma_j > 0 \iff \frac{t_j + d}{j^2} > \frac{t_{j+1} + d}{(j+1)^2} \iff (j+1)^2 (t_j + d) > j^2 (t_{j+1} + d) \iff d\big((j+1)^2 - j^2\big) > j^2 t_{j+1} - (j+1)^2 t_j \iff d > \frac{j^2 t_{j+1} - (j+1)^2 t_j}{2j + 1}.
$$
The right side of the above inequality is less than or equal to (j+1)²(|t_{j+1}| + |t_j|)/(2j+1) ≤ (j+1)(|t_{j+1}| + |t_j|), which is strictly upper bounded by 2n² Σ_j |t_j|, and thus by d. So it follows that γ_j > 0 as needed.

Part (ii): For the lower bound we have:
$$
\begin{aligned}
j \le e_j = \frac{2}{\gamma_j}\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right)
&\iff j\gamma_j \le 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \\
&\iff \frac{t_j + d}{j} - \frac{j(t_{j+1} + d)}{(j+1)^2} \le 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \\
&\iff -\frac{j(t_{j+1} + d)}{(j+1)^2} \le \frac{t_j + d}{j} - \frac{2(t_{j+1} + d)}{j+1} \\
&\iff -(t_{j+1} + d)j^2 \le (t_j + d)(j+1)^2 - 2(t_{j+1} + d)j(j+1) \\
&\iff d\big(j^2 - 2j(j+1) + (j+1)^2\big) \ge -j^2 t_{j+1} + 2j(j+1)t_{j+1} - (j+1)^2 t_j \\
&\iff d \ge -j^2 t_{j+1} + 2j(j+1)t_{j+1} - (j+1)^2 t_j,
\end{aligned}
$$
where we have used j² − 2j(j+1) + (j+1)² = (j − (j+1))² = 1 at the last line. Thus it suffices to make d large enough to ensure that j ≤ e_j, and for our choice of d this is true.

For the upper bound we have:
$$
\begin{aligned}
e_j = \frac{2}{\gamma_j}\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \le j + 1
&\iff 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \le (j+1)\gamma_j = \frac{(j+1)(t_j + d)}{j^2} - \frac{t_{j+1} + d}{j+1} \\
&\iff \frac{2(t_j + d)}{j} - \frac{t_{j+1} + d}{j+1} \le \frac{(j+1)(t_j + d)}{j^2} \\
&\iff 2(t_j + d)j(j+1) - (t_{j+1} + d)j^2 \le (j+1)^2 (t_j + d) \\
&\iff d\big(j^2 - 2j(j+1) + (j+1)^2\big) \ge -j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j \\
&\iff d \ge -j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j,
\end{aligned}
$$
where we have again used j² − 2j(j+1) + (j+1)² = 1 at the last line. Again, for our choice of d the above inequality is satisfied.
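Claim 1 can be spot-checked numerically (a sketch, not part of the proof): for a random target sequence t_0, ..., t_n and d = 2n² Σ_j |t_j|, the resulting γ_j and e_j satisfy γ_j > 0 and j ≤ e_j ≤ j + 1. Names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
t = rng.normal(scale=3.0, size=n + 1)        # target values t_0, ..., t_n
d = 2 * n**2 * np.sum(np.abs(t))             # d >= 2 n^2 sum_j |t_j|

gamma = np.zeros(n + 1)                      # gamma_1..gamma_n (index 0 unused)
e = np.zeros(n + 1)                          # e_1..e_n
gamma[n] = (t[n] + d) / n**2
e[n] = 2 * n
for j in range(1, n):
    gamma[j] = (t[j] + d) / j**2 - (t[j + 1] + d) / (j + 1)**2
    e[j] = (2 / gamma[j]) * ((t[j] + d) / j - (t[j + 1] + d) / (j + 1))

assert np.all(gamma[1:] > 0)                              # Claim 1 (i)
assert all(j <= e[j] <= j + 1 for j in range(1, n))       # Claim 1 (ii)
print("Claim 1 holds for this random instance")
```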
Finally, define M to be any number greater than max(t_0 + d, max_i{γ_i e_i}). In addition to the n building blocks, our hardplus RBM will include an additional unit that we call the zero-neuron, which handles x = 0. The zero-neuron will have weights w defined by w_i = −M for each i, and bias b = t_0 + d. Finally, the output bias B of our hardplus RBM network will be set to −d. The total output of the network is simply the sum of the outputs of the n different building blocks, the zero-neuron, and the constant bias −d.

To show part (i) of the theorem we want to prove that for all k, whenever X = k, our circuit outputs the value t_k. We make the following definitions:
$$ a_k \equiv -\sum_{j=k}^{n} \gamma_j, \qquad b_k \equiv \sum_{j=k}^{n} \gamma_j e_j. $$

Claim 2.
$$ a_k = \frac{-(t_k + d)}{k^2}, \qquad b_k = \frac{2(t_k + d)}{k}, \qquad b_k = -2k\, a_k. $$
This claim is self-evidently true upon examining the definitions of γ_j and e_j and realizing that a_k and b_k are telescoping sums. Given these facts, we can prove the following:

Claim 3. For all k, 1 ≤ k ≤ n, when X = k the sum of the outputs of all the n building blocks is given by t_k + d.

Proof of Claim 3. For X = n, the (γ_n, e_n)-block computes max(0, γ_n X(e_n − X)) = max(0, −γ_n X² + γ_n e_n X). By the definition of e_n, n ≤ e_n, and thus when X ≤ n, γ_n X(e_n − X) ≥ 0. For all other building blocks (γ_j, e_j), j < n, since e_j ≤ j + 1, the block outputs zero, since γ_j X(e_j − X) is less than or equal to zero. Thus the sum of all of the building blocks when X = n is just the output of the (γ_n, e_n)-block, which is
$$ \gamma_n \cdot n(e_n - n) = -\gamma_n n^2 + \gamma_n e_n n = -(t_n + d) + 2(t_n + d) = t_n + d $$
as desired.

For X = k, 1 ≤ k < n, the argument is similar. For all building blocks j ≥ k, by Claim 1 we know that e_j ≥ j, and therefore the block's output on X = k is non-negative and therefore contributes to the sum. On the other hand, for all building blocks j < k, by Claim 1 we know that e_j ≤ j + 1, and therefore the block outputs 0 and so does not contribute to the sum. Thus the sum of all of the building blocks is equal to the sum of the non-zero regions of the building blocks j for j ≥ k. Since each of these is a quadratic function of X, their sum can be written as a single quadratic polynomial of the form a_k X² + b_k X, where a_k and b_k are defined as before. Plugging in the above expressions for a_k and b_k from Claim 2, we see that the value of this polynomial at X = k is:
$$ a_k k^2 + b_k k = \frac{-(t_k + d)}{k^2}\, k^2 + \frac{2(t_k + d)}{k}\, k = -(t_k + d) + 2(t_k + d) = t_k + d. $$

Finally, it remains to ensure that our hardplus RBM network outputs t_0 for X = 0. Note that the sum of the outputs of all n building blocks and the output bias is −d at X = 0. To correct this, we set the incoming weights and the bias of the zero-neuron according to w_i = −M for each i, and b = t_0 + d. When X = 0, this neuron will output t_0 + d, making the total output of the network −d + t_0 + d = t_0 as needed. Furthermore, note that the addition of the zero-neuron does not affect the output of the network when X = k > 0, because the zero-neuron outputs 0 on all of these inputs as long as M ≥ t_0 + d.

This completes the proof of part (i) of the theorem, and it remains to prove part (ii).
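Before turning to part (ii), here is an end-to-end numerical check of the part (i) construction (a sketch, not part of the proof). It implements the n building blocks, the zero-neuron, and the output bias −d at the level of individual hardplus neurons, and verifies that the network outputs t_X on every input x ∈ {0,1}^n for random targets t_k. The slack of 1 added to M is an arbitrary choice satisfying M > max(t_0 + d, max_i γ_i e_i).

```python
import itertools
import numpy as np

def hardplus(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(3)
n = 6
t = rng.normal(scale=3.0, size=n + 1)        # desired outputs: t_k when X = k
d = 2 * n**2 * np.sum(np.abs(t))

# building-block parameters as defined in the proof
gamma = np.zeros(n + 1)
e = np.zeros(n + 1)
gamma[n] = (t[n] + d) / n**2
e[n] = 2 * n
for j in range(1, n):
    gamma[j] = (t[j] + d) / j**2 - (t[j + 1] + d) / (j + 1)**2
    e[j] = (2 / gamma[j]) * ((t[j] + d) / j - (t[j + 1] + d) / (j + 1))
M = max(t[0] + d, np.max(gamma[1:] * e[1:])) + 1.0   # any M > max(t_0 + d, max_i gamma_i e_i)

def network_output(x):
    total = -d                                           # output bias B = -d
    for j in range(1, n + 1):                            # n building blocks ...
        for i in range(n):                               # ... of n hardplus neurons each
            w = np.full(n, -gamma[j])
            w[i] = M - gamma[j]                          # weights of the i-th neuron in block j
            total += hardplus(w @ x + gamma[j] * e[j] - M)   # bias = gamma_j e_j - M
    total += hardplus(np.full(n, -M) @ x + t[0] + d)     # the zero-neuron
    return total

for bits in itertools.product([0, 1], repeat=n):
    x = np.array(bits, dtype=float)
    assert np.isclose(network_output(x), t[int(x.sum())])
print("construction outputs t_X on every input")
```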
Observe that the size of the weights grows linearly in M and d, which follows directly from their definitions. And note that the magnitude of the input to each neuron is lower bounded by a positive linear function of M and d (a non-trivial fact which we will prove below). From these two observations it follows that to achieve the condition that the magnitude of the input to each neuron is greater than C(n) for some function C of n, the weights need only grow linearly with C. Noting that the error-bound condition ε ≤ (n² + 1) exp(−C) in Lemma 2 can be rewritten as C ≤ log(n² + 1) + log(1/ε), part (ii) of the theorem then follows.

There are two cases where a hardplus neuron in building block j has a negative input: either the input is γ_j(e_j − X) − M, or it is γ_j(e_j − X) for X ≥ j + 1. In the first case it is clear that as M grows the net input becomes more negative, since e_j does not depend on M at all.

The second case requires more work. First note that from its definition, e_j can be rewritten as 2((j+1)a_{j+1} − j a_j)/γ_j. Then for any X ≥ j + 1 and j ≤ n − 1 we have (using γ_j = a_{j+1} − a_j):
$$
\begin{aligned}
\gamma_j(e_j - X) &\le \gamma_j\big(e_j - (j+1)\big) \\
&= \gamma_j\left(\frac{2\big((j+1)a_{j+1} - j a_j\big)}{\gamma_j} - (j+1)\right) \\
&= 2(j+1)a_{j+1} - 2j a_j - (j+1)\gamma_j \\
&= 2(j+1)a_{j+1} - 2j a_j - (j+1)(a_{j+1} - a_j) \\
&= (j+1)a_{j+1} - 2j a_j + (j+1)a_j \\
&= -\frac{d + t_{j+1}}{j+1} + \frac{2(d + t_j)}{j} - \frac{(j+1)(d + t_j)}{j^2} \\
&= \frac{-j^2(d + t_{j+1}) + 2j(j+1)(d + t_j) - (j+1)^2(d + t_j)}{j^2(j+1)} \\
&= \frac{-\big(j^2 - 2j(j+1) + (j+1)^2\big)d - j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j}{j^2(j+1)} \\
&= \frac{-d - j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j}{j^2(j+1)} \\
&= \frac{-d}{j^2(j+1)} + \frac{-j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j}{j^2(j+1)}.
\end{aligned}
$$
So we see that as d increases, this bound guarantees that γ_j(e_j − X) becomes more negative for each X ≥ j + 1. Also note that for the special zero-neuron, for X ≥ 1 the net input will be −MX + t_0 + d ≤ −M + t_0 + d, which will shrink as M grows.

For neurons belonging to building block j which have a positive-valued input, we have that X < e_j. Note that for any X ≤ j and j < n we have:
$$
\begin{aligned}
\gamma_j(e_j - X) &\ge \gamma_j(e_j - j) \\
&= \gamma_j\left(\frac{2\big((j+1)a_{j+1} - j a_j\big)}{\gamma_j} - j\right) \\
&= 2(j+1)a_{j+1} - 2j a_j - j\gamma_j \\
&= 2(j+1)a_{j+1} - 2j a_j - j(a_{j+1} - a_j) \\
&= 2(j+1)a_{j+1} - j a_j - j a_{j+1} \\
&= -\frac{2(d + t_{j+1})}{j+1} + \frac{d + t_j}{j} + \frac{j(d + t_{j+1})}{(j+1)^2} \\
&= \frac{-2j(j+1)(d + t_{j+1}) + (j+1)^2(d + t_j) + j^2(d + t_{j+1})}{j(j+1)^2} \\
&= \frac{\big((j+1)^2 - 2j(j+1) + j^2\big)d + (j+1)^2 t_j - 2j(j+1)t_{j+1} + j^2 t_{j+1}}{j(j+1)^2} \\
&= \frac{d + (j+1)^2 t_j - 2j(j+1)t_{j+1} + j^2 t_{j+1}}{j(j+1)^2} \\
&= \frac{d}{j(j+1)^2} + \frac{(j+1)^2 t_j - 2j(j+1)t_{j+1} + j^2 t_{j+1}}{j(j+1)^2}.
\end{aligned}
$$
And for the case j = n, we have for X ≤ j that:
$$ \gamma_n(e_n - X) \ge \gamma_n(e_n - n) = \frac{t_n + d}{n^2}(2n - n) = \frac{t_n + d}{n} = \frac{d}{n} + \frac{t_n}{n}. $$
So in all cases we see that as d increases, this bound guarantees that γ_j(e_j − X) grows linearly. Also note that for the special zero-neuron, the net input will be t_0 + d for X = 0, which will grow linearly as d increases.
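The linear growth of the net-input magnitudes in d can also be observed numerically. The sketch below (illustrative only, not part of the proof) rebuilds the construction for several scalings of d, giving M a slack of d as well (consistent with part (ii), where both d and M grow), and reports the minimum absolute net input over all neurons and all inputs, which should increase with the scale.

```python
import itertools
import numpy as np

def min_abs_net_input(t, d):
    """Smallest |net input| over all hidden neurons of the Theorem 7 construction."""
    n = len(t) - 1
    gamma = np.zeros(n + 1)
    e = np.zeros(n + 1)
    gamma[n] = (t[n] + d) / n**2
    e[n] = 2 * n
    for j in range(1, n):
        gamma[j] = (t[j] + d) / j**2 - (t[j + 1] + d) / (j + 1)**2
        e[j] = (2 / gamma[j]) * ((t[j] + d) / j - (t[j + 1] + d) / (j + 1))
    M = max(t[0] + d, np.max(gamma[1:] * e[1:])) + d     # slack in M also grows with d
    best = np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        X = x.sum()
        for j in range(1, n + 1):
            for i in range(n):
                net = gamma[j] * (e[j] - X) - M * (1.0 - x[i])   # block-j, neuron-i input
                best = min(best, abs(net))
        best = min(best, abs(-M * X + t[0] + d))                 # zero-neuron input
    return best

rng = np.random.default_rng(4)
n = 5
t = rng.normal(scale=3.0, size=n + 1)
d0 = 2 * n**2 * np.sum(np.abs(t))
mins = [min_abs_net_input(t, s * d0) for s in (1, 2, 4, 8)]
print(mins)                       # grows roughly in proportion to the scale of d
assert mins[0] < mins[1] < mins[2] < mins[3]
```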
A.6 Proofs for Section 4

A.6.1 Proof of Theorem 8

We first state some basic facts which we need.

Fact 14 (Muroga (1971)). Let f : {0, 1}^n → {0, 1} be a Boolean function computed by a threshold neuron with arbitrary real incoming weights and bias. There exists a constant K and another threshold neuron computing f, all of whose incoming weights and bias are integers with magnitude at most 2^{Kn log n}.

A direct consequence of the above fact is the following fact, by now folklore, whose simple proof we present for the sake of completeness.

Fact 15. Let f_n denote the set of all Boolean functions on {0, 1}^n, and for s ≥ 1 let f_{s,n} ⊆ f_n be the subset of such Boolean functions that are computable by threshold networks with one hidden layer of at most s neurons. Then there exists a constant K such that
$$ |f_{s,n}| \le 2^{K(n^2 s \log n + s^2 \log s)}. $$

Proof. Let s be the number of hidden neurons in our threshold network. By applying Fact 14 repeatedly to each of the hidden neurons, we obtain another threshold network, still with s hidden units, computing the same Boolean function, such that the incoming weights and biases of all hidden neurons are bounded by 2^{Kn log n}. Finally, applying Fact 14 to the output neuron, we convert it to a threshold gate with parameters bounded by 2^{Ks log s}. Henceforth, we count only the total number of Boolean functions that can be computed by such threshold networks with integer weights. We do this by establishing a simple upper bound on the total number of distinct such networks. Clearly, there are at most 2^{Kn² log n} ways to choose the incoming weights of a given neuron in the hidden layer. There are s incoming weights to choose for the output threshold, each of which is an integer of magnitude at most 2^{Ks log s}. Combining these observations, there are at most 2^{Ks·n² log n} × 2^{Ks² log s} distinct networks. Hence, the total number of distinct Boolean functions that can be computed is at most 2^{K(n²s log n + s² log s)}.

With these basic facts in hand, we prove Theorem 8 below using Proposition 5 and Theorem 6.

Proof of Theorem 8. Consider any thresholded RBM network with m hidden units that computes an n-dimensional Boolean function with margin δ. Using Proposition 5, we can obtain a thresholded hardplus RBM network of size 4m²/δ · log(2m/δ) + m that computes the same Boolean function as the thresholded original RBM network. Applying Theorem 6 and thresholding the output, we obtain a network with one hidden layer of threshold units, larger in size by a factor of n, that computes the same Boolean function. This argument shows that the set of Boolean functions computed by thresholded RBM networks with m hidden units and margin δ is a subset of the Boolean functions computed by one-hidden-layer threshold networks of size 4m²n/δ · log(2m/δ) + mn. Hence, invoking Fact 15 establishes our theorem.

A.6.2 Proof of Theorem 9

Note that the theorems from Hajnal et al. (1993) assume integer weights, but this hypothesis can easily be removed from their Theorem 3.6. In particular, Theorem 3.6 assumes nothing about the lower weights, and as we will see, the integrality assumption on the top-level weights can easily be replaced with a margin condition. First note that their Lemma 3.3 only uses the integrality of the upper weights to establish that the margin must be ≥ 1. Otherwise it is easy to see that with a margin δ, Lemma 3.3 implies that a threshold neuron in a thresholded network of size m is a (2δ/α)-discriminator (α is the sum of the absolute values of the 2nd-level weights in their notation). Then Theorem 3.6's proof gives m ≥ δ2^{(1/3−ε)n} for sufficiently large n (instead of just m ≥ 2^{(1/3−ε)n}). A more precise bound that they implicitly prove in Theorem 3.6 is m ≥ 6δ2^{n/3}/C.

Thus we have the following fact adapted from Hajnal et al. (1993):

Fact 16. For a neural network of size m with a single hidden layer of threshold neurons and weights bounded by C that computes a function that represents IP with margin δ, we have m ≥ 6δ2^{n/3}/C.

Proof of Theorem 9. By Proposition 5 it suffices to show that no thresholded hardplus RBM network of size ≤ 4m² log(2m/δ)/δ + m with parameters bounded by C can compute IP with margin δ/2. Suppose by contradiction that such a thresholded RBM network exists. Then by Theorem 6 there exists a single-hidden-layer threshold network of size ≤ 4m²n log(2m/δ)/δ + mn, with weights bounded in magnitude by (n + 1)C, that computes the same function, i.e. one which represents IP with margin δ/2. Applying the above fact we have 4m²n log(2m/δ)/δ + mn ≥ 3δ2^{n/3}/((n+1)C). It is simple to check that this bound is violated if m is bounded as in the statement of the theorem.

A.6.3 Proof of Theorem 10

We prove a more general result here, from which we easily derive Theorem 10 as a special case. To state this general result, we introduce some simple notions. Let h : R → R be an activation function. We say h is monotone if it satisfies the following: either h(x) ≤ h(y) for all x < y, OR h(x) ≥ h(y) for all x < y. Let ℓ : {0, 1}^n → R be an inner function. An (h, ℓ) gate/neuron G_{h,ℓ} is just one that is obtained by composing h and ℓ in the natural way, i.e. G_{h,ℓ}(x) = h(ℓ(x)). We write ‖(h, ℓ)‖_∞ = max_{x ∈ {0,1}^n} |G_{h,ℓ}(x)|.

We assume for the discussion here that the number of input variables (or observables) is even and is divided into two halves, called x and y, each being a Boolean string of n bits. In this language, the inner product Boolean function, denoted by IP(x, y), is just defined as x_1 y_1 + · · · + x_n y_n (mod 2). We call an inner function of a neuron/gate (x, y)-separable if it can be expressed as g(x) + f(y). For instance, all affine inner functions are (x, y)-separable. Finally, given a set of activation functions H and a set of inner functions I, an (H, I)-network is one each of whose hidden units is a neuron of the form G_{h,ℓ} for some h ∈ H and ℓ ∈ I. Let ‖(H, I)‖_∞ = sup{‖(h, ℓ)‖_∞ : h ∈ H, ℓ ∈ I}.

Theorem 17. Let H be any set of monotone activation functions and I be a set of (x, y)-separable inner functions. Then every (H, I)-network with one layer of m hidden units computing IP with a margin of δ must satisfy the following:
$$ m \ge \frac{\delta}{2\,\|(H, I)\|_\infty}\, 2^{n/4}. $$

In order to prove Theorem 17, it will be convenient to consider the following 1/−1 valued function: (−1)^{IP(x,y)} = (−1)^{x_1 y_1 + · · · + x_n y_n}. Note that when IP evaluates to 0, (−1)^{IP} evaluates to 1, and when IP evaluates to 1, (−1)^{IP} evaluates to −1. We also consider a matrix M_n with entries in {1, −1}, which has 2^n rows and 2^n columns. Each row of M_n is indexed by a unique Boolean string in {0, 1}^n, and the columns of the matrix are indexed similarly. The entry M_n[x, y] is just the 1/−1 value of (−1)^{IP(x,y)}. We need the following fact, which is a special case of the classical result of Lindsey.

Lemma 18 (Chor and Goldreich, 1988). The magnitude of the sum of the elements in every r × s submatrix of M_n is at most √(rs·2^n).
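For intuition, Lemma 18 can be verified exhaustively on a tiny instance (a sketch, not part of the proof): the code below builds M_n for n = 3 and checks the bound √(rs·2^n) on many randomly chosen row/column subsets.

```python
import itertools
import numpy as np

n = 3
N = 2**n
strings = list(itertools.product([0, 1], repeat=n))
# M[x, y] = (-1)^{IP(x, y)} with IP(x, y) = sum_i x_i y_i mod 2
M = np.array([[(-1) ** (sum(xi * yi for xi, yi in zip(x, y)) % 2)
               for y in strings] for x in strings])

rng = np.random.default_rng(5)
for _ in range(2000):
    r, s = rng.integers(1, N + 1, size=2)
    rows = rng.choice(N, size=r, replace=False)
    cols = rng.choice(N, size=s, replace=False)
    assert abs(M[np.ix_(rows, cols)].sum()) <= np.sqrt(r * s * 2**n) + 1e-9
print("Lindsey bound holds for all sampled submatrices")
```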
We use Lemma 18 to prove the following key fact about monotone activation functions:

Lemma 19. Let G_{h,ℓ} be any neuron with a monotone activation function h and an inner function ℓ that is (x, y)-separable. Then,
$$ \Big| \mathbb{E}_{x,y}\big[ G_{h,\ell}(x, y)\, (-1)^{\mathrm{IP}(x,y)} \big] \Big| \le \|(h, \ell)\|_\infty \cdot 2^{-\Omega(n)}. \tag{2} $$

Proof. Let ℓ(x, y) = g(x) + f(y), and let 0 < α < 1 be some constant specified later. Define a total order ⪯_g on {0, 1}^n by setting x ⪯_g x′ whenever g(x) ≤ g(x′), breaking ties according to the lexicographic ordering. We divide {0, 1}^n into t = 2^{(1−α)n} groups of equal size as follows: the first group contains the first 2^{αn} elements in the order specified by ⪯_g, the second group has the next 2^{αn} elements, and so on. The i-th such group is denoted by X_i for i ≤ 2^{(1−α)n}. Likewise, we define the total order ⪯_f and use it to define equal-sized blocks Y_1, ..., Y_{2^{(1−α)n}}.

The way we estimate the LHS of (2) is to pair points in the block (X_i, Y_j) with points in (X_{i+1}, Y_{j+1}) in the following manner: wlog assume that the activation function h is non-decreasing. Then G_{h,ℓ}(x, y) ≤ G_{h,ℓ}(x′, y′) for each (x, y) ∈ (X_i, Y_j) and (x′, y′) ∈ (X_{i+1}, Y_{j+1}). Further, applying Lemma 18, we will argue that the total number of points in (X_i, Y_j) at which the product in the LHS evaluates to a negative (positive) value is very close to the number of points in (X_{i+1}, Y_{j+1}) at which the product evaluates to a positive (negative) value. Moreover, by assumption, the composed function (h, ℓ) does not take very large values on our domain. These observations will be used to show that the points in blocks that are diagonally across from each other almost cancel each other's contribution to the LHS. There are too few uncancelled blocks, and hence the sum in the LHS will be small. Forthwith the details.
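As a small illustration of what Lemma 19 asserts (not part of the proof), the brute-force computation below evaluates the left-hand side of (2) exactly for one concrete monotone activation and one (x, y)-separable inner function at a small n; the particular h, g, f are arbitrary choices for this sketch, and the printed correlation comes out noticeably smaller than ‖(h, ℓ)‖_∞.

```python
import itertools
import numpy as np

n = 5
strings = [np.array(s) for s in itertools.product([0, 1], repeat=n)]

# an arbitrary (x, y)-separable inner function l(x, y) = g(x) + f(y)
rng = np.random.default_rng(6)
wg, wf = rng.normal(size=n), rng.normal(size=n)
g = lambda x: wg @ x
f = lambda y: wf @ y
h = np.tanh                       # a monotone (non-decreasing) activation

corr = 0.0
sup_norm = 0.0
for x in strings:
    for y in strings:
        val = h(g(x) + f(y))                     # G_{h,l}(x, y)
        sup_norm = max(sup_norm, abs(val))
        corr += val * (-1) ** (int(x @ y) % 2)   # times (-1)^{IP(x,y)}
corr /= 4**n                                     # expectation over (x, y)

print(f"|correlation| = {abs(corr):.4g}, sup norm = {sup_norm:.4g}")
```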
Let P^+_{i,j} = {(x, y) ∈ (X_i, Y_j) : (−1)^{IP(x,y)} = 1} and P^−_{i,j} = {(x, y) ∈ (X_i, Y_j) : (−1)^{IP(x,y)} = −1}. Let t = 2^{(1−α)n}. Let h_{i,j} be the maximum value that the gate takes on points in (X_i, Y_j). Note that the non-decreasing assumption on h implies that h_{i,j} ≤ h_{i+1,j+1}. Using this observation, we get the following:
$$ \mathbb{E}_{x,y}\big[ G_{h,\ell}(x, y)\, (-1)^{\mathrm{IP}(x,y)} \big] \le \frac{1}{4^n} \sum_{(i,j)} \Big( |P^+_{i,j}| - |P^-_{i+1,j+1}| \Big)\, h_{i,j} + \cdots $$