nips nips2011 nips2011-92 knowledge-graph by maker-knowledge-mining

92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines


Source: pdf

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. [sent-4, score-0.505]

2 We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). [sent-5, score-0.648]

3 In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. [sent-6, score-0.526]

4 1 Introduction A Restricted Boltzmann Machine (RBM) [24, 10] is a learning system consisting of two layers of binary stochastic units, a hidden layer and a visible layer, with a complete bipartite interaction graph. [sent-7, score-0.316]

5 RBMs are used as generative models to simulate input distributions of binary data. [sent-8, score-0.18]

6 In the sequel we denote by RBMn,m the set of all probability distributions on {0, 1}^n which can be approximated arbitrarily well by a visible distribution generated by the RBM with m hidden and n visible units for an appropriate choice of the parameter values. [sent-13, score-0.865]

7 On the other hand, if RBMn,m equals the set P of all probability distributions on {0, 1}^n, then it must have at least dim(P) = 2^n − 1 parameters, and thus at least ⌈2^n/(n + 1)⌉ − 1 hidden units [21]. [sent-15, score-0.523]
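
A quick numeric check of this parameter-counting argument (a minimal sketch; the function name and the printed table are ours, not from the paper):

```python
from math import ceil

def min_hidden_units_by_counting(n):
    """Counting bound: to equal P({0,1}^n) the RBM needs at least
    dim(P) = 2**n - 1 parameters, and RBM_{n,m} has n*m + n + m of them,
    which gives m >= ceil(2**n / (n + 1)) - 1."""
    return ceil(2 ** n / (n + 1)) - 1

for n in range(2, 9):
    m = min_hidden_units_by_counting(n)
    assert n * m + n + m >= 2 ** n - 1   # parameter count covers dim(P)
    print(n, m)
```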

8 However, the geometry of RBMn,m is intricate, and even an RBM of dimension 2^n − 1 is not guaranteed to contain all visible distributions, see [20] for counterexamples. [sent-17, score-0.268]

9 In summary, an RBM that can approximate any distribution arbitrarily well must have a very large number of parameters and hidden units. [sent-18, score-0.17]

10 However, there are at least two reasons why in many cases this is not necessary: • An appropriate approximation of distributions is sufficient for most purposes. [sent-20, score-0.202]

11 • The interesting distributions the system shall simulate belong to a small class of distributions. [sent-21, score-0.154]

12 On the other hand, usually it is very hard to mathematically describe a set containing the optimal solutions to general problems, or a set of interesting probability distributions (for example the class of distributions generating natural images). [sent-24, score-0.34]

13 Due to these difficulties the number of hidden units m is often chosen on the basis of experience [12], or m is considered as a hyperparameter which is optimized by extensive search, depending on the distributions to be simulated by the RBM. [sent-26, score-0.457]

14 In this paper we give an explicit description of classes of distributions that are contained in RBMn,m , and which are representative for the expressive power of this model. [sent-27, score-0.34]

15 Using this description, we estimate the maximal Kullback-Leibler divergence between an arbitrary probability distribution and the best approximation within RBMn,m . [sent-28, score-0.348]

16 The set RBMn,m does not contain every probability distribution, unless the number of hidden units is very large, as we outlined in the introduction. [sent-37, score-0.362]

17 Therefore, we have an approximation error given by the distance of pData to the best approximation pData^RBM contained in the RBM model. [sent-38, score-0.176]

18 In this paper we study the expressive power of the RBM model and the Kullback-Leibler divergence from an arbitrary distribution to its best representation within the RBM model. [sent-44, score-0.288]

19 Then the maximal error when approximating probability distributions with an RBM is upper bounded by the maximal error when approximating with M. [sent-47, score-0.424]

20 The set P = P(X) of all probability distributions on X is a (|X| − 1)-dimensional simplex in R^{|X|}. [sent-50, score-0.186]

21 From the point of view of information theory, a more meaningful distance notion for probability distributions is the Kullback-Leibler divergence: D(p ‖ q) := Σ_x p(x) log( p(x) / q(x) ). [sent-53, score-0.228]
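
A direct implementation of this divergence on a finite state space (a sketch; the base-2 logarithm is our assumption, chosen so that the bounds quoted in this paper come out in bits):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x)); defined as +infinity
    when the support of q does not contain the support of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# the maximal divergence to the uniform distribution is attained by a Dirac delta
n = 3
uniform = np.full(2 ** n, 1.0 / 2 ** n)
delta = np.zeros(2 ** n)
delta[0] = 1.0
print(kl_divergence(delta, uniform))  # equals log2|X| = n = 3
```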

22 If the support of q does not contain the support of p, it is defined as ∞. Figure 1: This figure gives an intuition on what the size of an error means for probability distributions on images with 16 × 16 pixels. [sent-56, score-0.355]

23 (Fragment of the Figure 1 caption: approximation within a partition model with 2 randomly chosen cubical blocks.) [sent-63, score-0.507]

24 Note that an RBM with 1 hidden unit can approximate p with arbitrary accuracy, see Theorem 4. [sent-72, score-0.166]

25 In order to assess an RBM or some other model M we use the maximal approximation error with respect to the KL-divergence when approximating arbitrary probability distributions using M: D_M := max {D(p ‖ M) : p ∈ P}. [sent-80, score-0.395]

26 For example, the maximal KL-divergence to the uniform distribution 1/|X| is attained by any Dirac delta distribution δx, x ∈ X, and amounts to D_{1/|X|} = D(δx ‖ 1/|X|) = log |X|. (1) [sent-81, score-0.316]

27 Model Classes: Exponential families and product measures. In this work we only need a restricted class of exponential families, namely exponential families on a finite set with uniform reference measure. [sent-83, score-0.509]

28 The boundary of discrete exponential families is discussed in [23], which uses a similar notation. [sent-85, score-0.173]

29 The exponential family E_A with sufficient statistics A consists of all probability distributions of the form pλ, λ ∈ R^d, where pλ(x) = exp(λ⊤ A_x) / Σ_{x′} exp(λ⊤ A_{x′}) for all x ∈ X. [sent-89, score-0.3]
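
A minimal sketch of this family on a finite state space, with the sufficient statistics given as one column A_x per state (the function name is ours):

```python
import numpy as np

def exponential_family_point(A, lam):
    """p_lambda(x) = exp(lam^T A_x) / sum_{x'} exp(lam^T A_{x'}),
    where A is a d x |X| matrix whose column A_x is the sufficient
    statistic of state x and lam is a parameter vector in R^d."""
    scores = lam @ A                         # one score per state
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    return weights / weights.sum()

# a one-dimensional family on |X| = 4 states
A = np.array([[0.0, 1.0, 1.0, 2.0]])
print(exponential_family_point(A, np.array([0.7])))
```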

30 The most important exponential families in this work are the independence models. [sent-94, score-0.268]

31 The independence model of n binary random variables consists of all probability distributions on {0, 1}^n that factorize: p(x1, . . . , xn) = Π_{i=1}^n pi(xi) for some pi ∈ P({0, 1}). [sent-95, score-0.407]
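
A sketch of a point of the independence model on {0,1}^n, parameterised by the probabilities p_i(x_i = 1); states are enumerated lexicographically and the names are ours:

```python
import numpy as np
from itertools import product

def product_distribution(p_ones):
    """Joint distribution on {0,1}^n factorising as prod_i p_i(x_i)."""
    p_ones = np.asarray(p_ones, dtype=float)
    joint = []
    for x in product([0, 1], repeat=len(p_ones)):
        x = np.array(x)
        joint.append(float(np.prod(np.where(x == 1, p_ones, 1.0 - p_ones))))
    return np.array(joint)

p = product_distribution([0.2, 0.9, 0.5])
print(p.sum())  # 1.0
```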

32 It is the closure of an n-dimensional exponential family E_n. [sent-99, score-0.176]

33 An element of the independence model is called a product distribution. [sent-101, score-0.172]

34 The global maximizers are the distributions of the form (1/2)(δx + δy), where x, y ∈ {0, 1}^n satisfy xi + yi = 1 for all i. [sent-106, score-0.283]

35 Although the independence model is much larger than the set {1/|X|}, the maximal divergence decreases only by 1. [sent-108, score-0.275]

36 As shown in [22], if E is any exponential family of dimension k, then D_E ≥ log(|X|/(k + 1)). [sent-109, score-0.145]

37 The exponential families satisfying D_E = log(|X|/(k + 1)) are partition models; they will be defined in the following section. [sent-111, score-0.328]

38 Partition models and mixtures of products with disjoint supports: The mixture of m models M1, . . . [sent-113, score-0.382]

39 If each Mi is an exponential family, then the mixture is also an exponential family (this is not true if the supports of the models Mi are not disjoint). [sent-148, score-0.356]

40 If each Mi equals the set containing just the uniform distribution on Xi , then M is called the partition model of ξ, denoted with Pξ . [sent-150, score-0.266]

41 The partition model Pξ consists of all distributions that take a constant value on each block Xi. [sent-151, score-0.309]
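
A sketch of a point of the partition model P_xi on a finite state space, where each block gets a total probability and the mass is spread uniformly inside it (the names and the block encoding are ours):

```python
import numpy as np

def partition_model_point(blocks, block_probs, num_states):
    """Distribution that is constant on each block X_i of the partition;
    `blocks` is a list of index lists partitioning range(num_states)."""
    p = np.zeros(num_states)
    for block, w in zip(blocks, block_probs):
        p[block] = w / len(block)
    return p

# partition of {0,1}^2 (states 0..3) into the two faces x1 = 0 and x1 = 1
print(partition_model_point([[0, 1], [2, 3]], [0.3, 0.7], num_states=4))
```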

42 This is the closure of the exponential family with sufficient statistics Ax = (χ1(x), χ2(x), . . . , χm(x)), where χi denotes the indicator function of the block Xi. [sent-154, score-0.176]

43 The partition models include the set of finite exchangeable distributions (see e.g. [sent-159, score-0.364]

44 [9]), where the blocks of the partition are the sets of binary vectors which have the same number of entries equal to one. [sent-161, score-0.235]

45 Left: The blue line represents the partition model Pξ with partition ξ = {(11), (01)}∪{(00), (10)}. [sent-173, score-0.31]

46 Right: The mixture of the product distributions E1 and E2 with disjoint supports on {(11), (01)} and {(00), (10)} corresponding to the same partition ξ equals the whole simplex P. [sent-175, score-0.658]

47 The vertices of a k-dimensional face of the n-cube are given by fixing the values of x in n − k positions: {x ∈ {0, 1}^n : xi = x̃i for all i ∈ I}, for some I ⊆ {1, . . . , n} with |I| = n − k. [sent-178, score-0.186]

48 We call such a subset Y ⊆ X cubical, or a face of the n-cube. [sent-181, score-0.388]

49 A cubical subset of cardinality 2^k can be naturally identified with {0, 1}^k. [sent-182, score-0.402]

50 This identification allows us to define independence models and product measures on P(Y) ⊆ P(X). [sent-183, score-0.198]

51 Note that product measures on Y are also product measures on X , and the independence model on Y is a subset of the independence model on X . [sent-184, score-0.344]

52 Let ξ = {X1, . . . , Xm} be a partition of X = {0, 1}^n into cubical sets. [sent-189, score-0.507]

53 For any i let Ei be the independence model on Xi, and let M be the mixture of E1, . . . , Em. [sent-190, score-0.197]

54 See Figure 1 for an intuition on the approximation error of partition models, and see Figure 2 for small examples of a partition model and of a mixture of products with disjoint support. [sent-198, score-0.625]

55 Classes of distributions that RBMs can learn: Consider a set ξ = {X1, . . . , Xm} of m disjoint cubical sets Xi in X. [sent-199, score-0.609]

56 Such a ξ is a partition of some subset ∪ξ = ∪i Xi of X into m disjoint cubical sets. [sent-200, score-0.61]

57 Theorem 4.1: RBMn,m contains the following distributions: • Any mixture of one arbitrary product distribution, m − k product distributions with support on arbitrary but disjoint faces of the n-cube, and k arbitrary distributions with support on any edges of the n-cube, for any 0 ≤ k ≤ m. [sent-203, score-0.867]

58 In particular: • Any mixture of m + 1 product distributions with disjoint cubical supports. [sent-204, score-0.788]

59 In consequence, RBMn,m contains the partition model of any partition in Gm+1 . [sent-205, score-0.31]
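
To make the second item concrete, the sketch below builds a mixture of two product distributions supported on disjoint faces of the 3-cube; by the statement above such a mixture lies in RBM_{3,1}. All function names are ours, and the construction only enumerates states, it does not produce RBM parameters:

```python
import numpy as np
from itertools import product

def product_on_face(n, fixed, p_ones):
    """Product distribution on {0,1}^n supported on the face obtained by
    fixing the coordinates in `fixed` (a dict index -> value); the free
    coordinates are independent Bernoulli(p_ones[i])."""
    p = np.zeros(2 ** n)
    for idx, x in enumerate(product([0, 1], repeat=n)):
        if any(x[i] != v for i, v in fixed.items()):
            continue
        prob = 1.0
        for i, xi in enumerate(x):
            if i not in fixed:
                prob *= p_ones[i] if xi == 1 else 1.0 - p_ones[i]
        p[idx] = prob
    return p

n = 3
q0 = product_on_face(n, {0: 0}, [0.5, 0.2, 0.7])  # supported on the face x1 = 0
q1 = product_on_face(n, {0: 1}, [0.5, 0.9, 0.4])  # supported on the face x1 = 1
mixture = 0.6 * q0 + 0.4 * q1                     # m + 1 = 2 components, so m = 1
print(mixture.sum())  # 1.0
```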

60 Restricting the cubical sets of the second item to edges, i.e. [sent-206, score-0.352]

61 Theorem 4.2: RBMn,m contains the following distributions: • Any distribution with a support set that can be covered by m + 1 pairs of vectors differing in one entry. [sent-210, score-0.139]

62 Theorem 4.2 implies that an RBM with m ≥ 2^{n−1} − 1 hidden units is a universal approximator of distributions on {0, 1}^n, i.e. [sent-213, score-0.499]

63 Assume m + 1 = 2^k and let ξ be a partition of X into m + 1 disjoint cubical sets of equal size. [sent-216, score-0.61]

64 Let us denote by Pξ,1 the set of all distributions which can be written as a mixture of m + 1 product distributions with support on the elements of ξ. [sent-217, score-0.524]

65 The dimension of the set of visible distributions represented by an RBM is at most equal to the number of parameters, which is m · n + m + n; see [21]. [sent-219, score-0.242]

66 This means that the class given above has roughly the same dimension as the set of distributions that can be represented. [sent-220, score-0.185]

67 This means that the class of distributions Pξ,1 which by Theorem 4. [sent-222, score-0.154]

68 An RBM with no hidden units can represent precisely the independence model, i.e. [sent-226, score-0.398]

69 all product distributions, and in particular any uniform distribution on a face of the n-cube. [sent-228, score-0.19]

70 For any choice of the parameters W ∈ R^{(m−1)×n}, B ∈ R^n, C ∈ R^{m−1} we can write the resulting distribution on the visible units as: p(v) = Σ_h z(v, h) / Σ_{v′,h′} z(v′, h′),  (3) where z(v, h) = exp(hWv + Bv + Ch). [sent-230, score-0.39]
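
A brute-force evaluation of equation (3) for small n and m (a sketch only; it enumerates all visible and hidden states, so it is exponential in both):

```python
import numpy as np
from itertools import product

def rbm_visible_distribution(W, B, C):
    """Visible marginal p(v) = sum_h z(v,h) / sum_{v',h'} z(v',h') with
    z(v,h) = exp(h W v + B v + C h), as in equation (3); W has one row
    per hidden unit and one column per visible unit."""
    num_hidden, num_visible = W.shape
    visibles = [np.array(v) for v in product([0, 1], repeat=num_visible)]
    hiddens = [np.array(h) for h in product([0, 1], repeat=num_hidden)]
    z = np.array([[np.exp(h @ W @ v + B @ v + C @ h) for h in hiddens]
                  for v in visibles])
    unnormalised = z.sum(axis=1)          # sum over hidden states
    return unnormalised / unnormalised.sum()

rng = np.random.default_rng(0)
p = rbm_visible_distribution(rng.normal(size=(2, 3)),
                             rng.normal(size=3), rng.normal(size=2))
print(p.sum())  # 1.0
```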

71 Appending one additional hidden unit, with connection weights w to the visible units and bias c, produces a new distribution which can be written as follows: p_{w,c}(v) = (1 + exp(wv + c)) Σ_h z(v, h) / Σ_{v′,h′} (1 + exp(wv′ + c)) z(v′, h′).  (4) [sent-231, score-0.514]

72 For this choice (taking λ = β_I − η, with exp(c) chosen accordingly), equation (4) yields p_{w,c} = (1 − α) p + α p̂, where p̂ is a product distribution with support in F and arbitrary natural parameters β_I, and α is an arbitrary mixture weight in [0, 1]. [sent-246, score-0.346]

73 Finally, the product distributions on edges of the cube are arbitrary, see [19] or [21] for details, and hence the restriction of any p to any edge is a product distribution. [sent-247, score-0.308]

74 Figure 3 (panel titles: RBMs with 3 visible units, RBMs with 4 visible units; axes: D(pparity ‖ pRBM) versus the number of hidden units m). [sent-248, score-0.688]

75 Figure 3: This figure demonstrates our results for n = 3 and n = 4 visible units. [sent-254, score-0.771]

76 We fixed pparity as the target distribution: the uniform distribution on binary length-n vectors with an even number of ones. [sent-257, score-0.304]

77 The distribution pparity is not the KL-maximizer from RBMn,m , but it is in general difficult to represent. [sent-258, score-0.247]

78 Qualitatively, samples from pparity look uniformly distributed, and representing pparity requires the maximal number of product mixture components [20, 19]. [sent-259, score-0.666]

79 Randomly chosen distributions in RBMn,m are likely to be very far from the target distribution. [sent-266, score-0.154]

80 By Theorem 4.1, all partition models for partitions of {0, 1}^n into m + 1 cubical sets are contained in RBMn,m. [sent-273, score-0.62]

81 Applying this to such a partition, where the cardinality of all blocks is at most 2^{n−⌊log(m+1)⌋}, yields the bound D_{RBMn,m} ≤ n − ⌊log(m + 1)⌋. [sent-275, score-0.259]

82 Then the maximal Kullback-Leibler divergence from any distribution on {0, 1}^n to RBMn,m is upper bounded by max_p D(p ‖ RBMn,m) ≤ (n − 1) − log(m + 1). [sent-282, score-0.226]

83 For m = 2^{n−1} − 1 the error vanishes, corresponding to the fact that an RBM with that many hidden units is a universal approximator. [sent-284, score-0.379]
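
The bound from the theorem is straightforward to tabulate; the sketch below also inverts it to get the number of hidden units that guarantees a given tolerance. A base-2 logarithm is assumed (which is what makes the bound vanish at m = 2^(n-1) - 1), and the function names are ours:

```python
from math import ceil, log2

def kl_upper_bound(n, m):
    """Upper bound (n - 1) - log2(m + 1) on max_p D(p || RBM_{n,m}),
    clipped at zero because the divergence is non-negative."""
    return max(0.0, (n - 1) - log2(m + 1))

def hidden_units_for_tolerance(n, eps):
    """Smallest m for which the bound guarantees an error of at most eps:
    (n - 1) - log2(m + 1) <= eps  <=>  m >= 2**(n - 1 - eps) - 1."""
    return max(0, ceil(2 ** (n - 1 - eps) - 1))

n = 4
print([kl_upper_bound(n, m) for m in (0, 1, 3, 7)])  # [3.0, 2.0, 1.0, 0.0]
print(hidden_units_for_tolerance(n, 0.5))            # 5
```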

84 Let M be the union of all mixtures of independence models corresponding to all cubical partitions of X into blocks of cardinalities 2^{n1}, . . . , 2^{nm}. [sent-292, score-0.765]

85 If n > 1, then order the ni such that n1 ≥ n2 ≥ · · · ≥ nm ≥ 0. [sent-300, score-0.223]

86 Let p ∈ P(X), and let Y be a cubical subset of X of cardinality 2^{n−1} such that p(Y) ≤ 1/2. [sent-302, score-0.402]

87 Let M′ be the union of all mixtures of independence models corresponding to all cubical partitions ξ = {X1, . [sent-307, score-0.605]

88 In the following, the symbol Σ′_i shall denote summation over all indices i such that ni > 1. [sent-314, score-0.165]

89 By induction, D(p ‖ M) ≤ D(p ‖ M′) ≤ p(Y) Σ′_{i=1}^{k} (ni − 1)/2^{n−1−ni} + p(X \ Y) Σ′_{j=k+1}^{m} (nj − 1)/2^{n−1−nj}.  (5) [sent-315, score-0.285]

90 Note that Σ′_{j=j_i}^{j_{i+1}−1} (nj − 1)/2^{n−1−nj} ≤ (ni − 1)/2^{n−1} (2^{n_{j_i}} + · · · + 2^{n_{j_{i+1}−1}}) = (ni − 1)/2^{n−1−ni}, and therefore (1/2 − p(Y)) (ni − 1)/2^{n−1−ni} + (1/2 − p(X \ Y)) Σ′_{j=j_i}^{j_{i+1}−1} (nj − 1)/2^{n−1−nj} ≥ 0. [sent-317, score-0.683]

91 Adding these terms for i = 1, . . . , k to the right hand side of equation (5) yields D(p ‖ M) ≤ (1/2) Σ′_{i=1}^{k} (ni − 1)/2^{n−1−ni} + (1/2) Σ′_{j=k+1}^{m} (nj − 1)/2^{n−1−nj}, from which the assertions follow. [sent-321, score-0.259]

92 By Theorem 4.1 we know that RBMn,m contains the union M of all mixtures of independence models corresponding to all partitions with up to m + 1 cubical blocks. [sent-324, score-0.51]

93 Conclusion: We studied the expressive power of the Restricted Boltzmann Machine model with n visible and m hidden units. [sent-330, score-0.394]

94 We presented a hierarchy of explicit classes of probability distributions that an RBM can represent. [sent-331, score-0.221]

95 These classes include large collections of mixtures of m + 1 product distributions. [sent-332, score-0.176]

96 In particular, these include any mixture of an arbitrary product distribution and m further product distributions with disjoint supports. [sent-333, score-0.601]

97 The geometry of these submodels is easier to study than that of the RBM models, while these subsets still capture many of the distributions contained in the RBM models. [sent-334, score-0.289]

98 That is, given any target distribution, there is a distribution within the RBM model for which the Kullback-Leibler divergence between the two is not larger than that number. [sent-337, score-0.141]

99 Unsupervised learning of distributions on binary vectors using 2-layer networks. [sent-409, score-0.18]

100 Finding the maximizers of the information divergence from an exponential family. [sent-481, score-0.23]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rbm', 0.528), ('cubical', 0.352), ('rbms', 0.216), ('pparity', 0.201), ('units', 0.179), ('visible', 0.165), ('ni', 0.165), ('partition', 0.155), ('distributions', 0.154), ('pdata', 0.126), ('pxi', 0.126), ('hidden', 0.124), ('boltzmann', 0.112), ('disjoint', 0.103), ('mixture', 0.102), ('mont', 0.101), ('independence', 0.095), ('divergence', 0.095), ('nj', 0.094), ('families', 0.092), ('ay', 0.09), ('ea', 0.09), ('maximal', 0.085), ('exponential', 0.081), ('dm', 0.078), ('product', 0.077), ('xi', 0.075), ('deep', 0.07), ('mi', 0.068), ('expressive', 0.065), ('mm', 0.064), ('dim', 0.064), ('mixtures', 0.064), ('pi', 0.063), ('closure', 0.062), ('corollary', 0.06), ('nm', 0.058), ('xm', 0.058), ('fe', 0.057), ('restricted', 0.055), ('blocks', 0.054), ('maximizers', 0.054), ('ax', 0.051), ('leipzig', 0.05), ('nihat', 0.05), ('prbm', 0.05), ('cardinality', 0.05), ('approximation', 0.048), ('santa', 0.048), ('ji', 0.047), ('distribution', 0.046), ('contained', 0.046), ('geometry', 0.045), ('contrastive', 0.045), ('machines', 0.045), ('submodels', 0.044), ('exp', 0.043), ('pe', 0.043), ('cd', 0.043), ('universal', 0.042), ('log', 0.042), ('arbitrary', 0.042), ('en', 0.042), ('partitions', 0.041), ('power', 0.04), ('support', 0.037), ('face', 0.036), ('wv', 0.036), ('roux', 0.036), ('cardinalities', 0.036), ('classes', 0.035), ('nl', 0.035), ('equals', 0.034), ('theorem', 0.034), ('error', 0.034), ('family', 0.033), ('ams', 0.033), ('representational', 0.033), ('supports', 0.033), ('training', 0.033), ('probability', 0.032), ('dimension', 0.031), ('lemma', 0.031), ('gm', 0.031), ('uniform', 0.031), ('rm', 0.03), ('differing', 0.03), ('vanishes', 0.03), ('belief', 0.03), ('exchangeable', 0.029), ('rn', 0.028), ('products', 0.028), ('pm', 0.028), ('union', 0.027), ('contain', 0.027), ('bipartite', 0.027), ('induction', 0.026), ('vectors', 0.026), ('models', 0.026), ('proof', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

2 0.18312788 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus

Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1

3 0.17883229 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1

4 0.16859666 197 nips-2011-On Tracking The Partition Function

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

5 0.14478052 249 nips-2011-Sequence learning with hidden units in spiking neural networks

Author: Johanni Brea, Walter Senn, Jean-pascal Pfister

Abstract: We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics, we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storage capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postsynaptic spike, potentiation is induced and otherwise depression is elicited.

6 0.094585799 156 nips-2011-Learning to Learn with Compound HD Models

7 0.081574187 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

8 0.079116389 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

9 0.066771708 261 nips-2011-Sparse Filtering

10 0.066479094 276 nips-2011-Structured sparse coding via lateral inhibition

11 0.058954015 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

12 0.05762003 306 nips-2011-t-divergence Based Approximate Inference

13 0.05634122 198 nips-2011-On U-processes and clustering performance

14 0.053410865 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

15 0.05212225 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

16 0.051815402 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

17 0.051812217 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

18 0.051071875 217 nips-2011-Practical Variational Inference for Neural Networks

19 0.050530419 244 nips-2011-Selecting Receptive Fields in Deep Networks

20 0.048630659 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, 0.023), (2, 0.013), (3, -0.026), (4, -0.02), (5, -0.047), (6, 0.029), (7, -0.035), (8, -0.015), (9, -0.168), (10, -0.052), (11, -0.115), (12, 0.046), (13, -0.05), (14, -0.01), (15, -0.084), (16, -0.02), (17, -0.001), (18, -0.078), (19, 0.147), (20, 0.067), (21, 0.14), (22, -0.118), (23, -0.044), (24, 0.012), (25, -0.051), (26, 0.063), (27, -0.115), (28, 0.059), (29, 0.067), (30, 0.088), (31, 0.052), (32, 0.023), (33, 0.089), (34, 0.118), (35, 0.079), (36, 0.104), (37, -0.001), (38, -0.028), (39, 0.022), (40, -0.075), (41, -0.111), (42, 0.106), (43, -0.016), (44, 0.082), (45, 0.005), (46, 0.037), (47, -0.157), (48, 0.041), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94151473 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1
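To make the bound concrete: assuming, as the form of the statement suggests for distributions on {0,1}^n, that the logarithm is base 2, solving (n−1) − log(m+1) ≤ ε for m gives the smallest number of hidden units guaranteeing the stated worst-case error. The helper below is a minimal sketch under that base-2 assumption; the function name is ours.

```python
from math import ceil

def hidden_units_for_tolerance(n, eps):
    """Smallest m with (n - 1) - log2(m + 1) <= eps, i.e. enough hidden units
    to push the worst-case KL bound below the tolerance eps."""
    if eps >= n - 1:
        return 0                      # the bound is already met with no hidden units
    return ceil(2 ** (n - 1 - eps)) - 1

for n in (4, 8, 16):
    print(n, [hidden_units_for_tolerance(n, e) for e in (0.1, 1.0, 2.0)])
```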

2 0.75610298 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. 
It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). 
In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). 
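Definitions 1–4 can be made concrete in a few lines. The sketch below is our own minimal representation of a sum-product network (class names are not from the paper): units are either inputs, products of their children, or weighted sums with strictly positive weights, and the example evaluates the Figure 1 network f(x1,...,x4) = x1x2 + x3x4 introduced in the next section.

```python
class Node:
    def value(self, x):
        raise NotImplementedError

class Input(Node):
    def __init__(self, index):
        self.index = index
    def value(self, x):
        return x[self.index]

class Product(Node):
    def __init__(self, children):
        self.children = children
    def value(self, x):
        out = 1.0
        for c in self.children:
            out *= c.value(x)
        return out

class Sum(Node):
    def __init__(self, children, weights):
        # Definition 1: summation weights are strictly positive.
        assert all(w > 0 for w in weights)
        self.children, self.weights = children, weights
    def value(self, x):
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))

# The i = 1 (shallow) network of Figure 1: f(x1..x4) = x1*x2 + x3*x4.
x1, x2, x3, x4 = (Input(i) for i in range(4))
f = Sum([Product([x1, x2]), Product([x3, x4])], weights=[1.0, 1.0])
print(f.value([1.0, 2.0, 3.0, 4.0]))   # 1*2 + 3*4 = 14
```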
The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. 
Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. 
Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 
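The count in Proposition 1 and the deep-vs-shallow comparison of Section 3.3 can be checked numerically. The script below is our own illustration, not the authors' code: it expands the deep family-F network (all summation weights set to 1), representing each monomial as the set of input indices it uses, and compares the number of distinct products with the 2^(√n − 1) formula and with the deep network's n − 1 units.

```python
from itertools import product as cartesian

def family_f_products(i):
    """Expand the deep family-F network over n = 4**i inputs (all weights 1)
    and return the set of distinct monomials, each a frozenset of indices."""
    n = 4 ** i
    layer = [{frozenset([j])} for j in range(n)]        # layer 0: single variables
    for k in range(1, 2 * i + 1):
        if k % 2 == 1:
            # odd layer: product units; multiply consecutive pairs of units,
            # so the monomial sets combine by pairwise union
            layer = [{a | b for a, b in cartesian(layer[2 * j], layer[2 * j + 1])}
                     for j in range(len(layer) // 2)]
        else:
            # even layer: sum units with unit weights; monomial sets just merge
            layer = [layer[2 * j] | layer[2 * j + 1] for j in range(len(layer) // 2)]
    assert len(layer) == 1
    return layer[0]

for i in (1, 2, 3):
    n = 4 ** i
    monomials = family_f_products(i)
    deep_units = 2 ** (2 * i) - 1          # 1 + 2 + ... + 2^(2i-1) = n - 1
    print(n, len(monomials), 2 ** (2 ** i - 1), deep_units)
    # e.g. n = 64: 128 distinct products (= 2^(sqrt(n)-1)) vs. 63 deep units
```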
4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. 
Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. 
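A quick back-of-the-envelope computation of the quantities in Proposition 3 and the discussion above; this is our own sketch, with the formulas taken directly from the statements in the text (the number of exponent tuples summing to d over n variables written as the usual stars-and-bars binomial coefficient, and the deep network counted as 2i hidden layers of size n plus the output sum).

```python
from math import comb

def family_g_counts(n, i):
    """Return (shallow lower bound, number of monomials, deep network size)
    for g in G_i^n, following the statements above."""
    d = (n - 1) ** i                      # degree of the products in the expansion
    shallow_lower_bound = d               # Proposition 3
    num_monomials = comb(n + d - 1, d)    # tuples (a_1..a_n) with a_1+...+a_n = d
    deep_units = 2 * i * n + 1            # 2i hidden layers of size n, plus the output sum
    return shallow_lower_bound, num_monomials, deep_units

for n, i in [(3, 1), (5, 3), (10, 5)]:
    print(n, i, family_g_counts(n, i))
```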
This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ǫ) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. 
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. 
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.

3 0.69834149 197 nips-2011-On Tracking The Partition Function

Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville

Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1

4 0.68194115 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus

Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1

5 0.58857 156 nips-2011-Learning to Learn with Compound HD Models

Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1

6 0.54574394 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

7 0.53068036 249 nips-2011-Sequence learning with hidden units in spiking neural networks

8 0.48182651 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

9 0.45860058 306 nips-2011-t-divergence Based Approximate Inference

10 0.45305204 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

11 0.44235176 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

12 0.42984039 69 nips-2011-Differentially Private M-Estimators

13 0.41823462 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

14 0.41099888 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

15 0.38237125 287 nips-2011-The Manifold Tangent Classifier

16 0.37892106 60 nips-2011-Confidence Sets for Network Structure

17 0.37166202 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

18 0.37090573 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

19 0.35716647 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

20 0.35463658 276 nips-2011-Structured sparse coding via lateral inhibition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.074), (4, 0.042), (20, 0.034), (26, 0.027), (31, 0.149), (33, 0.031), (38, 0.186), (39, 0.011), (43, 0.089), (45, 0.106), (57, 0.031), (74, 0.051), (83, 0.027), (99, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90186483 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1
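As a hedged illustration of what such a model looks like as a data generator (our reading, not necessarily the paper's exact construction): in the bivariate cyclic case one has x = f(y) + e_x and y = g(x) + e_y simultaneously, so a sample is obtained by fixing the noise values and solving the pair of structural equations jointly, here by fixed-point iteration with made-up contractive mechanisms f and g.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cyclic_anm(n, f, g, noise_scale=0.3, iters=200):
    """Draw n points from a bivariate cyclic additive-noise model
        x = f(y) + e_x,   y = g(x) + e_y,
    by fixing the noises and iterating the two equations to an approximate
    joint solution; f and g should be contractions so the iteration converges."""
    e_x = noise_scale * rng.normal(size=n)
    e_y = noise_scale * rng.normal(size=n)
    x = np.zeros(n)
    y = np.zeros(n)
    for _ in range(iters):
        x = f(y) + e_x
        y = g(x) + e_y
    return x, y

# Made-up nonlinear, contractive mechanisms purely for illustration.
x, y = sample_cyclic_anm(2000, f=lambda y: 0.6 * np.tanh(y), g=lambda x: 0.5 * np.tanh(x))
print(np.corrcoef(x, y)[0, 1])
```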

same-paper 2 0.85888648 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1

3 0.84265679 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. 
As presented here, our LR-MVL embeddings are context-specific, but context-oblivious embeddings (like the ones used by [6, 7]) can be trivially obtained from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM.

LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph.

The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.

2 Brief Review: Canonical Correlation Analysis (CCA)

CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information.

More specifically, given a set of n paired observation vectors {(l_1, r_1), ..., (l_n, r_n)} -- in our case the two matrices are the left (L) and right (R) context matrices of a word -- we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr. This is expressed as

\max_{\Phi_l, \Phi_r} \frac{E[\langle L, \Phi_l\rangle\,\langle R, \Phi_r\rangle]}{\sqrt{E[\langle L, \Phi_l\rangle^2]\, E[\langle R, \Phi_r\rangle^2]}}    (1)

where E denotes the empirical expectation. We use the notation C_{lr} (C_{ll}) to denote the cross (auto) covariance matrices between L and R (i.e. L′R and L′L respectively). The left and right canonical correlates are the solutions Φl, Φr of the following equations:

C_{ll}^{-1} C_{lr} C_{rr}^{-1} C_{rl}\, \Phi_l = \lambda \Phi_l, \qquad C_{rr}^{-1} C_{rl} C_{ll}^{-1} C_{lr}\, \Phi_r = \lambda \Phi_r    (2)

3 Low Rank Multi-View Learning (LR-MVL)

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word
Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks. 3.1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . . ., wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . . . , zn } for each observed token. The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. • Take the CCA between the hidden states and the tokens wt . The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. 3.2 Theoretical Properties of LR-MVL We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof: Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. 
For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10]. Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix and 3 such that E(Li |Hi ) = Hi β T and E(Ri |Hi ) = Hi β T and E(Wi |Hi ) = Hi β T where all β’s are of L R W rank k, where Li , Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1. Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . Assumption 2 is a rank condition similar to that in [9]. Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values. We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). Compute the CCA between two matrices X1 and X2 . Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for ΦX2 . Define the function CCAd (X1 , X2 ) = [ΦX1 , ΦX2 ]. When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. Definition 2. Define the symbol “≈” to mean X1 ≈ X2 ⇐⇒ lim X1 = lim X2 n→∞ n→∞ where n is the sample size. Lemma 1. Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, such that if CCAk (L, R) ≡ [ΦL , ΦR ] then CCAk ([LΦL , RΦR ], W)right ≈ A. Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. ˜ Let Ah denote a matrix formed by stacking h copies of A on top of each other. Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space. Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if ˜ ˜ ˜ ˜ CCAk (LAh , RAh ) ≡ [ΦL , ΦR ] then ˜ ˜ ˜ ˜ CCAk ([LAh ΦL , RAh ΦR ], W)right ≈ A ˜ where Ah is the stacked form of A. See the supplementary material for the Proof 1 . ˆ It is worth noting that our matrix A corresponds to the matrix U used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show ˆ it here (due to limited space), our A provides a more statistically efficient estimate of U than their U , and hence can also be used to estimate the sequence probabilities. 
3.3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (a weighted average of the previous (or next) token's state, i.e. Zt−1 (or Zt+1), and the previous (or next) token's smoothed state, i.e. St−1 (or St+1)), thus giving a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5–7). We use a mixture of both very short and very long contexts, which capture the short- and long-range dependencies required by NLP problems such as NER, Chunking and WSD. Since exponential smooths are linear, we preserve the linearity of our method.

3.4 The LR-MVL Algorithm The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.

Algorithm 1 LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data
1: Input: token sequence Wn×v, state space size k, smoothing rates αj
2: Initialize the eigenfeature dictionary A to random values N(0, 1).
3: repeat
4:   Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word: Zt = (Aw : w = wt)
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj:
       St^(l,j) = (1 − αj) St−1^(l,j) + αj Zt−1   // left view L
       St^(r,j) = (1 − αj) St+1^(r,j) + αj Zt+1   // right view R
     where the t-th rows of L and R are, respectively, concatenations of the smooths St^(l,j) and St^(r,j) for each of the αj's.
6:   Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of
       (L^T L)^−1 L^T R (R^T R)^−1 R^T L Φl = λΦl
       (R^T R)^−1 R^T L (L^T L)^−1 L^T R Φr = λΦr
7:   Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively:
       Xl = L Φl^(k/2) and Xr = R Φr^(k/2)
     where Φl^(k/2), Φr^(k/2) are matrices composed of the singular vectors of Φl, Φr with the k/2 largest-magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl, Xr]
8:   Estimate the eigenfeatures of each word type w as the average of the states estimated for that word: Aw = avg(Zt : wt = w)
9:   Compute the change in A from the previous iteration.
10: until |ΔA| < ε (a small convergence threshold)
11: Output: Φl^k, Φr^k, A.

A few iterations (∼5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.
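The sketch below is an assumption-laden rendering of Algorithm 1 in numpy, not the authors' implementation. It reuses the cca() helper sketched after Definition 1, runs a fixed number of outer iterations in place of the |ΔA| < ε check, and keeps only the k/2-dimensional left and right projections as in Step 7; function names, defaults, and the dense-array representation are assumptions (for a toy corpus one would also pick a much smaller k).

```python
import numpy as np

def lr_mvl_train(tokens, vocab, k=50, alphas=(0.1, 0.5), n_iter=5, seed=0):
    """Illustrative sketch of Algorithm 1: alternate between exponentially
    smoothed left/right views and their CCA. Assumes k is even and that the
    cca() helper sketched earlier is in scope."""
    rng = np.random.default_rng(seed)
    idx = {w: i for i, w in enumerate(vocab)}
    ids = np.array([idx[w] for w in tokens])
    n, v = len(tokens), len(vocab)
    A = rng.standard_normal((v, k))                    # step 2: random eigenfeature dictionary
    for _ in range(n_iter):                            # steps 3, 9-10: fixed iteration count
        Z = A[ids]                                     # step 4: per-token states, n x k
        views_l, views_r = [], []
        for a in alphas:                               # step 5: one smooth per rate
            Sl, Sr = np.zeros((n, k)), np.zeros((n, k))
            for t in range(1, n):
                Sl[t] = (1 - a) * Sl[t - 1] + a * Z[t - 1]
            for t in range(n - 2, -1, -1):
                Sr[t] = (1 - a) * Sr[t + 1] + a * Z[t + 1]
            views_l.append(Sl)
            views_r.append(Sr)
        L, R = np.hstack(views_l), np.hstack(views_r)
        Phi_l, Phi_r = cca(L, R, d=k // 2)             # steps 6-7: CCA of left vs. right views
        Znew = np.hstack([L @ Phi_l, R @ Phi_r])       # per-token state estimate, n x k
        for w_id in range(v):                          # step 8: average states per word type
            mask = ids == w_id
            if mask.any():
                A[w_id] = Znew[mask].mean(axis=0)
    return A, Phi_l, Phi_r
```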
As mentioned earlier, we get further dimensionality reduction in Step 5 by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced-rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don't know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a matrix of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context-specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.

Algorithm 2 LR-MVL Algorithm - Inducing Context-Specific Embeddings for Train/Dev/Test Data
1: Input: model (Φl^k, Φr^k, A) output by Algorithm 1, and token sequences Wtrain, (Wdev, Wtest)
2: Project the left and right views L and R, after smoothing, onto the space spanned by the top k left and right CCAs respectively:
     Xl = L Φl^k and Xr = R Φr^k
   and the words onto the eigenfeature dictionary: Xw = Wtrain A
3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state: Xtrain:embed = [Xl, Xw, Xr]
4: Output: the embedding matrices Xtrain:embed, (Xdev:embed, Xtest:embed) with context-specific representations for the tokens.

These embeddings are augmented with the baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can also get context-“oblivious” embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k) output by Algorithm 1.
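A corresponding sketch of Algorithm 2 is given below (again our own illustration, not the authors' code). It reuses the smoothing loop from the Algorithm 1 sketch and, for simplicity, reuses its k/2-dimensional CCA projections, whereas the paper projects onto the top k correlates in this step; function and variable names are assumptions.

```python
import numpy as np

def induce_embeddings(tokens, vocab, A, Phi_l, Phi_r, alphas=(0.1, 0.5)):
    """Illustrative sketch of Algorithm 2: build the [Xl, Xw, Xr] embedding for
    every token of a (train/dev/test) sequence from the trained model."""
    idx = {w: i for i, w in enumerate(vocab)}
    ids = np.array([idx[w] for w in tokens])
    n, k = len(tokens), A.shape[1]
    Z = A[ids]                                  # word view: eigenfeature of each token's type
    views_l, views_r = [], []
    for a in alphas:                            # same smoothing as in training
        Sl, Sr = np.zeros((n, k)), np.zeros((n, k))
        for t in range(1, n):
            Sl[t] = (1 - a) * Sl[t - 1] + a * Z[t - 1]
        for t in range(n - 2, -1, -1):
            Sr[t] = (1 - a) * Sr[t + 1] + a * Z[t + 1]
        views_l.append(Sl)
        views_r.append(Sr)
    L, R = np.hstack(views_l), np.hstack(views_r)
    Xl, Xr = L @ Phi_l, R @ Phi_r               # context views projected by the learned CCAs
    Xw = Z
    return np.hstack([Xl, Xw, Xr])              # one row of context-specific features per token
```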
4 Experimental Results In this section we present the experimental results of LR-MVL on the Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (a semi-supervised extension of CRFs), as well as to embeddings like C&W, HLBL and Brown Clustering.

4.1 Datasets and Experimental Setup For the NER experiments we used the data from the CoNLL 2003 shared task, and for the Chunking experiments we used the CoNLL 2000 shared task data, with the standard training, development and testing set splits (more details about the data and competitions are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/). The CoNLL '03 and CoNLL '00 datasets had ∼204K/51K/46K and ∼212K/–/47K tokens, respectively, for the Train/Dev./Test sets.

4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as follows:
• Current word wi; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of wi.
• Word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2), and the capitalization pattern in the window.
• Previous two predictions yi−1 and yi−2, and the conjunction of d and yi−1.
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).
Following [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference, as it was shown to give superior performance. We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments.

We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set. We scaled our embedding features to have an ℓ2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl, Xw, Xr in Algorithm 2), i.e. the total size of the embeddings was 50 × 3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking For our chunking experiments we use a similar base set of features as above:
• Current word wi and word tokens in a window of 2 around the current word, i.e. d = (wi−2, wi−1, wi, wi+1, wi+2).
• POS tags ti in a window of 2 around the current word.
• Word conjunction features wi ∩ wi+1, i ∈ {−1, 0}, and tag conjunction features ti ∩ ti+1, i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2, i ∈ {−2, −1, 0}.
• Embedding features in a window of 2 around the current word (when applicable).
Since the CoNLL '00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF (CRFsuite, http://www.chokkan.org/software/crfsuite/) as the supervised classifier. We tuned the size of the embedding and the magnitude of the ℓ2 regularization penalty in the CRF on the development set, and took the log (or −log of the magnitude) of the value of the features (our embeddings are learnt using a linear model whereas the CRF is a log-linear model, so this normalization keeps things on the same scale). The regularization penalty that gave the best performance on the development set was 2, and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data, i.e. 8936 sentences.

4.1.3 Unlabeled Data and Induction of Embeddings For inducing the embeddings we used the RCV1 corpus, containing Reuters newswire from Aug '96 to Aug '97, about 63 million tokens in 3.3 million sentences (we chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data). Case was left intact and we did not do the “cleaning” done by [18, 16], i.e. removing all sentences which are less than 90% lowercase a–z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on a 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, with a vocabulary size of 300K and a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts, α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set, but we found this mixture of long and short term dependencies to work well in practice.
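As a concrete illustration of the feature scaling described in Section 4.1.1, the snippet below is our own sketch (not the authors' code): it ℓ2-normalizes each token's embedding row and rescales it by a constant before concatenation with the sparse categorical features. The constant 0.5 is the value reported above; the function name and `eps` guard are assumptions.

```python
import numpy as np

def scale_embeddings(X_embed, c=0.5, eps=1e-12):
    """Scale each token's embedding row to unit l2 norm, then multiply by a
    cross-validated constant c so the dense features do not overwhelm the
    sparse categorical features in a linear classifier."""
    norms = np.linalg.norm(X_embed, axis=1, keepdims=True)
    return c * X_embed / np.maximum(norms, eps)

# e.g. features for token t = [categorical one-hots, scale_embeddings(X_embed)[t]]
```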
As far as the other embeddings are concerned, i.e. C&W, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was more than 1.5 times faster than that of the other embeddings, even though some of those embeddings were trained on a GPGPU, which makes our method comparatively even faster.

4.2 Results The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and Chunking tasks.

Table 1: NER Results (F1-score on the Dev. and Test sets).

                                        Dev. Set   Test Set
No Gazetteers
  Baseline                              90.03      84.39
  C&W, 200-dim                          92.46      87.46
  HLBL, 100-dim                         92.00      88.13
  Brown, 1000 clusters                  92.32      88.52
  Ando & Zhang '05                      93.15      89.31
  Suzuki & Isozaki '08                  93.66      89.36
  LR-MVL (CO), 50 × 3-dim               93.11      89.55
  LR-MVL, 50 × 3-dim                    93.61      89.91
With Gazetteers
  HLBL, 100-dim                         92.91      89.35
  C&W, 200-dim                          92.98      88.88
  Brown, 1000 clusters                  93.25      89.41
  LR-MVL (CO), 50 × 3-dim               93.91      89.89
  LR-MVL, 50 × 3-dim                    94.41      90.06

Notes: 1) LR-MVL (CO) are context-oblivious embeddings, obtained from A in Algorithm 1. 2) F1-score = harmonic mean of precision and recall. 3) The current state-of-the-art for this NER task is 90.90 (Test Set), but it uses 700 billion tokens of unlabeled data [19].

Table 2: Chunking Results (Test Set F1-score).

  Baseline                              93.79
  HLBL, 50-dim                          94.00
  C&W, 50-dim                           94.10
  Brown, 3200 clusters                  94.11
  Ando & Zhang '05                      94.39
  Suzuki & Isozaki '08                  94.67
  LR-MVL (CO), 50 × 3-dim               95.02
  LR-MVL, 50 × 3-dim                    95.44

It is important to note that in problems like NER, the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate past with future views, it learns better representations for rare words, resulting in better overall accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reductions in error over C&W, HLBL and Brown respectively for NER; on Chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracy on both the NER and Chunking problems. For NER, the polysemous words were mostly ones like Chicago, Wales, Oakland etc., which could be either a location or an organization (sports teams, banks etc.). So when we don't use the gazetteer features (which are known lists of cities, persons, organizations etc.), we get a larger increase in F-score from modeling context; when gazetteer features are already present, they capture most of the information about polysemous words in the NER dataset, and modeling the context doesn't help as much. The polysemous words for the Chunking dataset were ones like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc., and in this case embeddings with context helped significantly, giving a 3.1–6.5% relative improvement in accuracy over context-oblivious embeddings.
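One natural reading of the relative error reductions quoted above is (err_old − err_new)/err_old with error measured as 100 − F1; the toy computation below shows the arithmetic under that assumption, with hypothetical numbers that are not taken from the rare-word results.

```python
def relative_error_reduction(f1_old, f1_new):
    """Relative reduction in error when F1 improves from f1_old to f1_new,
    with error measured as 100 - F1. Returns a percentage."""
    err_old, err_new = 100.0 - f1_old, 100.0 - f1_new
    return 100.0 * (err_old - err_new) / err_old

# hypothetical rare-word F1 of 80.0 vs. 82.3 -> ~11.5% relative error reduction
print(round(relative_error_reduction(80.0, 82.3), 1))
```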
5 Summary and Conclusion In this paper we presented a novel CCA-based multi-view learning method, LR-MVL, for large-scale sequence learning problems such as those that arise in NLP. LR-MVL is a spectral method that works in a low-dimensional state-space, so it is computationally efficient and can be trained using large amounts of unlabeled data; moreover, it does not get stuck in local optima like an EM-trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References
[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL. (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Comput. Linguist. 18 (December 1992) 467–479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL. (1993) 183–190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML '08, New York, NY, USA, ACM (2008) 160–167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML '07, New York, NY, USA, ACM (2007) 641–648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281–285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT. (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS. (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML. (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98. (1998) 92–100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. CoNLL '03 (2003) 204–207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CoNLL. (2009) 147–155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning.
In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038

4 0.75551546 301 nips-2011-Variational Gaussian Process Dynamical Systems

Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou

Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1

5 0.75453925 241 nips-2011-Scalable Training of Mixture Models via Coresets

Author: Dan Feldman, Matthew Faulkner, Andreas Krause

Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1

6 0.7506752 229 nips-2011-Query-Aware MCMC

7 0.74991906 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

8 0.74673748 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

9 0.74653566 221 nips-2011-Priors over Recurrent Continuous Time Processes

10 0.74591267 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

11 0.74457836 66 nips-2011-Crowdclustering

12 0.74396908 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

13 0.74392205 180 nips-2011-Multiple Instance Filtering

14 0.74311477 197 nips-2011-On Tracking The Partition Function

15 0.74257797 75 nips-2011-Dynamical segmentation of single trials from population neural data

16 0.7415815 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

17 0.74094427 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

18 0.74070269 258 nips-2011-Sparse Bayesian Multi-Task Learning

19 0.74066675 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

20 0.73954904 156 nips-2011-Learning to Learn with Compound HD Models