nips nips2011 nips2011-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. [sent-4, score-0.145]
2 We apply our model to a challenging real-world graphics problem: facial expression transfer. [sent-5, score-0.607]
3 Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. [sent-6, score-0.039]
4 1 Introduction Modeling temporal dependence is an important consideration in many learning problems. [sent-7, score-0.185]
5 One can capture temporal structure either explicitly in the model architecture, or implicitly through latent variables which can act as a “memory”. [sent-8, score-0.436]
6 Feedforward neural networks which incorporate fixed delays into their architecture are an example of the former. [sent-9, score-0.164]
7 A limitation of these models is that temporal context is fixed by the architecture instead of inferred from the data. [sent-10, score-0.348]
8 To address this shortcoming, recurrent neural networks incorporate connections between the latent variables at different time steps. [sent-11, score-0.329]
9 This enables them to capture arbitrary dynamics, yet they are more difficult to train [2]. [sent-12, score-0.046]
10 Another family of dynamical models that has received much attention are probabilistic models such as Hidden Markov Models and more general Dynamic Bayes nets. [sent-13, score-0.141]
11 Such models can be separated into two classes [19]: tractable models, which permit an exact and efficient procedure for inferring the posterior distribution over latent variables, and intractable models which require approximate inference. [sent-15, score-0.444]
12 Tractable models such as Linear Dynamical Systems and HMMs are widely applied and well understood. [sent-16, score-0.044]
13 These limitations are exactly what permit simple exact inference. [sent-18, score-0.105]
14 Intractable models, such as Switching LDS, Factorial HMMs, and other more complex variants of DBNs permit more complex regularities to be learned from data. [sent-19, score-0.105]
15 This comes at the cost of using approximate inference schemes, for example, Gibbs sampling or variational inference, which introduce either a computational burden or poorly approximate the true posterior. [sent-20, score-0.105]
16 In this paper we focus on Temporal Restricted Boltzmann Machines [19,20], a family of models that permits tractable inference but allows much more complicated structure to be extracted from time series data. [sent-21, score-0.118]
17 We concentrate on modeling the distribution of an output sequence conditional on an input sequence. [sent-23, score-0.254]
18 Recurrent neural networks address this problem, though in a non-probabilistic sense. [sent-24, score-0.032]
19 The InputOutput HMM [3] extends HMMs by conditioning both their dynamics and emission model on an input sequence. [sent-25, score-0.316]
20 Therefore we extend TRBMs to cope with input-output sequence pairs. [sent-27, score-0.037]
21 Given the conditional nature of a TRBM (its hidden states and observations are conditioned on short histories of these variables), conditioning on an external input is a natural extension to this model. [sent-28, score-0.413]
22 This includes motion-style transfer [9], economic forecasting with external indicators [13], and various tasks in natural language processing [6]. [sent-30, score-0.124]
23 Sequence classification is a special case of this setting, where a scalar target is conditioned on an input sequence. [sent-31, score-0.169]
24 In this paper, we consider facial expression transfer, a well-known problem in computer graphics. [sent-32, score-0.52]
25 Current methods considered by the graphics community are typically linear (e. [sent-33, score-0.087]
26 , methods based on blendshape mapping) and they do not take into account dynamical aspects of the facial motion itself. [sent-35, score-0.606]
27 This makes it difficult to retarget the facial articulations involved in speech. [sent-36, score-0.444]
28 We propose a model that can encode a complex nonlinear mapping from the motion of one individual to another which captures facial geometry and dynamics of both source and target. [sent-37, score-0.779]
29 2 Related work In this section we discuss several latent variable models which can map an input sequence to an output sequence. [sent-38, score-0.353]
30 We also briefly review our application field: facial expression transfer. [sent-39, score-0.52]
31 1 Temporal models Among probabilistic models, the Input-Output HMM [3] is most similar to the architecture we propose. [sent-41, score-0.131]
32 Like the HMM, the IOHMM is a generative model of sequences but it models the distribution of an output sequence conditional on an input, while the HMM simply models the distribution of an output sequence. [sent-42, score-0.362]
33 A similarity between IOHMMs and TRBMs is that in both models, the dynamics and emission distributions are formulated as neural networks. [sent-44, score-0.15]
34 However, the IOHMM state space is a multinomial while TRBMs have binary latent states. [sent-45, score-0.108]
35 A K-state TRBM can thus represent the history of a time series using 2K state configurations while IOHMMs are restricted to K settings. [sent-46, score-0.184]
36 The CPM has a discrete state-space and requires an input sequence. [sent-48, score-0.072]
37 However, unlike the IOHMM and our proposed model, the input is unobserved, making learning completely unsupervised. [sent-50, score-0.072]
38 The problem of simultaneously predicting multiple, correlated variables has received a great deal of recent attention [1]. [sent-52, score-0.059]
39 In Graph Transformer Networks [11] the dependency structure on the outputs is chosen to be sequential, which decouples the graph into pairwise potentials. [sent-54, score-0.033]
40 These models are trained discriminatively, typically with gradient descent, where our model is trained generatively using an approximate algorithm. [sent-56, score-0.111]
41 2 Facial expression transfer Facial expression transfer, also called motion retargeting or cross-mapping, is the act of adapting the motion of an actor to a target character. [sent-58, score-0.674]
42 It, as well as the related fields of facial performance capture and performance-driven animation, have been very active research areas over the last several years. [sent-59, score-0.49]
43 According to a review by Pighin [15], the two most important considerations for this task are facial model parameterization (called “the rig” in the graphics industry) and the nature of the chosen crossmapping. [sent-60, score-0.626]
44 A popular parameterization is “blendshapes” where a rig is a set of linearly combined facial expressions each controlled by a scalar weight. [sent-61, score-0.646]
45 Retargeting amounts to estimating a set of blending weights at each frame of the source data that accurately reconstructs the target frame. [sent-62, score-0.227]
46 There are many different ways of selecting blendshapes, from simply selecting a set of sufficient frames from the data, to creating models based on principal components analysis. [sent-63, score-0.159]
47 Another common parameterization is to simply represent the face by its vertex, polygon or spline geometry. [sent-64, score-0.095]
48 The downside of this approach is that this representation has many more degrees of freedom than are present in an actual facial expression. [sent-65, score-0.444]
49 While it is simple to estimate from data, it cannot produce subtle nonlinear motion required for realistic graphics applications. [sent-67, score-0.279]
50 An 2 example of this approach is [5] which uses a parametric model based on eigen-points to reliably synthesize simple facial expressions but ultimately fails to capture more subtle details. [sent-68, score-0.602]
51 [23] have proposed a multilinear mapping where variation in appearance across the source and target is explicitly separated from the variation in facial expression. [sent-70, score-0.63]
52 None of these models explicitly incorporate dynamics into the mapping, which is a limitation addressed by our approach. [sent-71, score-0.215]
53 [18] have used RBMs for facial expression generation, but not retargeting. [sent-73, score-0.52]
54 Their work is focused on static rather than temporal data. [sent-74, score-0.185]
55 3 Modeling dynamics with Temporal Restricted Boltzmann Machines In this section we review the Temporal Restricted Boltzmann Machine. [sent-75, score-0.094]
56 We then introduce the InputOutput Temporal Restricted Boltzmann Machine which extends the architecture to model an output sequence conditional on an input sequence. [sent-76, score-0.341]
57 1 Temporal Restricted Boltzmann Machines A Restricted Boltzmann Machine [17] is a bipartite Markov Random Field consisting of a layer of stochastic observed variables (“visible units”) connected to a layer of stochastic latent variables (“hidden units”). [sent-78, score-0.226]
58 The absence of connections between hidden units ensures they are conditionally independent given a setting of the visible units, and vice-versa. [sent-79, score-0.505]
59 The RBM can be extended to model temporal data by conditioning its visible units and/or hidden units on a short history of their activations. [sent-81, score-1.006]
60 Conditioning the model on the previous settings of the hidden units complicates inference. [sent-83, score-0.373]
61 Although one can approximate the posterior distribution with the filtering distribution (treating the past setting of the hidden units as fixed), we choose to use a simplified form of the model which conditions only on previous visible states [20]. [sent-84, score-0.464]
62 This model inherits the most important computational properties of the standard RBM: simple, exact inference and efficient approximate learning. [sent-85, score-0.071]
63 RBMs typically have binary observed variables and binary latent variables but to model real-valued data (e. [sent-86, score-0.226]
64 , the parameterization of a face), we can use a modified form of the TRBM with conditionally independent linear-Gaussian observed variables [7]. [sent-88, score-0.187]
65 1 (left), defines a joint probability distribution over a real-valued representation of the current frame of data, vt , and a collection of binary latent variables, ht , hj ∈ {0, 1}: p(vt , ht |v <=t ). [sent-90, score-0.62]
66 p(v(N +1):T |v1:N , s1:T ) = (7) t=N +1 where we have used the shorthand s<=t to describe a vector that concatenates a window over the input at time t, t−1, . [sent-91, score-0.181]
67 Note that in an offline setting, it is simple to generalize the model by conditioning the term inside the product on an arbitrary window of the source (which may include source observations past time t). [sent-95, score-0.292]
68 st-N st-1 Input Frames h Hidden Units j vt-N vt-1 Previous Output Frames j vt vt-N vt-1 st Ws h Hidden Units Wh j Δ Wv B A . [sent-106, score-0.169]
69 vt Predicted Output i k A Previous Output Frames st-1 i k . [sent-111, score-0.169]
70 4 We can easily adapt the TRBM to model p(vt |v <=t ) by modifying its energy function to incorporate the input. [sent-124, score-0.083]
71 The general form of energy function remains the same as Eq. [sent-125, score-0.038]
72 2 but it is now also conditioned on s<=t by redefining the dynamic biases (Eq. [sent-126, score-0.039]
73 3 and 4) as follows: ait = ai + ˆ Aki vk, <=t ˆjt = bj + b (8) Qlj sl,<=t k (9) l Bkj vk, <=t ). [sent-127, score-0.036]
74 It follows the approximate gradient of an objective function that is the difference between two Kullback-Leibler divergences [8]. [sent-131, score-0.065]
75 It is widely used in practice and tends to produce good generative models [4]. [sent-132, score-0.044]
76 The input and output history stay fixed during Gibbs sampling. [sent-134, score-0.242]
77 12 and 13 are alternated M times to arrive at the M -step quantities used in the weight updates. [sent-137, score-0.033]
78 3 Factored Third-order Input-Output Temporal Restricted Boltzmann Machines In an IOTRBM the input and target history can only modify the hidden units and current output through additive biases. [sent-141, score-0.64]
79 There has been recent interest in exploring higher-order RBMs in which variables interact multiplicatively [14, 16, 21]. [sent-142, score-0.104]
80 1 (right) shows an IOTRBM whose parameters W, Q and P have been replaced by a three-way weight tensor defining a multiplicative interaction between the three sets of variables. [sent-144, score-0.078]
81 The introduction of the tensor results in the number of model parameters becoming cubic and therefore we factor the tensor into three matrices: W s , W h , and W v . [sent-145, score-0.156]
82 These parameters connect the input, hidden units, and current target, respectively to a set of deterministic units which modulate the connections between variables. [sent-146, score-0.415]
83 5 The energy function of this model is: E(vt , ht |v <=t ) = i 1 (vi,t −ˆi,t )2 − a 2 hj,tˆj,t − b j v h s Wif Wjf Wlf vi,t hj,t sl,<=t (14) f ijl where f indexes factors and ai,t and ˆj,t are defined by Eq. [sent-148, score-0.209]
84 ˆ (16) l Experiments We evaluate the IOTRBM on two facial expression transfer datasets, one based on 2D motion capture and the other on 3D motion capture. [sent-153, score-0.874]
85 On both datasets we compare our model against three baselines: Linear regression (LR): We perform a regularized linear regression between each frame of the input to each frame of the output. [sent-154, score-0.218]
86 N th -order Autoregressive2 model (AR): This model improves on linear regression by also considering linear dynamics through the history of the input and output. [sent-157, score-0.244]
wordName wordTfidf (topN-words)
[('facial', 0.444), ('boltzmann', 0.219), ('units', 0.219), ('iohmm', 0.205), ('iotrbm', 0.205), ('temporal', 0.185), ('vt', 0.169), ('trbm', 0.164), ('ht', 0.135), ('trbms', 0.123), ('wif', 0.123), ('wjf', 0.123), ('wlf', 0.123), ('hidden', 0.121), ('frames', 0.115), ('motion', 0.109), ('sl', 0.108), ('rbms', 0.108), ('latent', 0.108), ('restricted', 0.106), ('hmms', 0.105), ('permit', 0.105), ('parameterization', 0.095), ('dynamics', 0.094), ('conditioning', 0.094), ('output', 0.092), ('visible', 0.09), ('transfer', 0.09), ('hmm', 0.089), ('architecture', 0.087), ('graphics', 0.087), ('blendshapes', 0.082), ('inputoutput', 0.082), ('iohmms', 0.082), ('retargeting', 0.082), ('tensor', 0.078), ('window', 0.078), ('history', 0.078), ('expression', 0.076), ('frame', 0.073), ('input', 0.072), ('vit', 0.072), ('recon', 0.072), ('rig', 0.072), ('cd', 0.07), ('jt', 0.065), ('machines', 0.061), ('source', 0.06), ('variables', 0.059), ('target', 0.058), ('emission', 0.056), ('conditional', 0.053), ('dynamical', 0.053), ('rbm', 0.052), ('subtle', 0.046), ('capture', 0.046), ('interact', 0.045), ('incorporate', 0.045), ('models', 0.044), ('recurrent', 0.043), ('connections', 0.042), ('gibbs', 0.042), ('factored', 0.041), ('wij', 0.04), ('vk', 0.04), ('conditioned', 0.039), ('intractable', 0.039), ('baselines', 0.039), ('act', 0.038), ('energy', 0.038), ('sequence', 0.037), ('tractable', 0.037), ('nonlinear', 0.037), ('inference', 0.037), ('aki', 0.036), ('ait', 0.036), ('blending', 0.036), ('actor', 0.036), ('ijl', 0.036), ('mapping', 0.035), ('expressions', 0.035), ('approximate', 0.034), ('external', 0.034), ('conditionally', 0.033), ('separated', 0.033), ('modulate', 0.033), ('graham', 0.033), ('animation', 0.033), ('lds', 0.033), ('alternated', 0.033), ('complicates', 0.033), ('decouples', 0.033), ('generatively', 0.033), ('limitation', 0.032), ('networks', 0.032), ('dbns', 0.031), ('concatenates', 0.031), ('synthesize', 0.031), ('divergences', 0.031), ('iain', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
2 0.18423958 249 nips-2011-Sequence learning with hidden units in spiking neural networks
Author: Johanni Brea, Walter Senn, Jean-pascal Pfister
Abstract: We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postynaptic spike, potentiation is induced and otherwise depression is elicited.
3 0.18312788 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
Author: Guido F. Montufar, Johannes Rauh, Nihat Ay
Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1
4 0.15124632 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ǫ) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. [15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. [37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10. 9
5 0.13952011 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
6 0.11241757 156 nips-2011-Learning to Learn with Compound HD Models
7 0.11126408 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.1097886 301 nips-2011-Variational Gaussian Process Dynamical Systems
9 0.10331215 197 nips-2011-On Tracking The Partition Function
10 0.095276989 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
11 0.090929538 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
12 0.083508827 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
13 0.080945417 302 nips-2011-Variational Learning for Recurrent Spiking Networks
14 0.080639206 291 nips-2011-Transfer from Multiple MDPs
15 0.080468528 180 nips-2011-Multiple Instance Filtering
16 0.078665845 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
17 0.076551668 75 nips-2011-Dynamical segmentation of single trials from population neural data
18 0.074406989 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
19 0.071751341 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
20 0.071308419 303 nips-2011-Video Annotation and Tracking with Active Learning
topicId topicWeight
[(0, 0.2), (1, 0.049), (2, 0.094), (3, 0.004), (4, -0.007), (5, -0.075), (6, 0.047), (7, -0.055), (8, 0.02), (9, -0.127), (10, 0.011), (11, -0.222), (12, 0.1), (13, -0.002), (14, -0.047), (15, -0.171), (16, -0.105), (17, 0.045), (18, -0.064), (19, 0.134), (20, 0.018), (21, 0.109), (22, -0.178), (23, 0.013), (24, -0.009), (25, -0.097), (26, 0.031), (27, -0.061), (28, 0.061), (29, -0.058), (30, -0.036), (31, -0.033), (32, 0.093), (33, 0.037), (34, 0.012), (35, 0.097), (36, 0.032), (37, -0.027), (38, 0.039), (39, 0.044), (40, 0.123), (41, -0.068), (42, 0.117), (43, 0.014), (44, -0.03), (45, 0.023), (46, -0.047), (47, -0.166), (48, 0.006), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95267987 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
2 0.65989447 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
Author: Guido F. Montufar, Johannes Rauh, Nihat Ay
Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative for the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance. 1
3 0.6479736 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ǫ) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. [15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. [37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10. 9
4 0.64475548 197 nips-2011-On Tracking The Partition Function
Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville
Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1
5 0.62500274 249 nips-2011-Sequence learning with hidden units in spiking neural networks
Author: Johanni Brea, Walter Senn, Jean-pascal Pfister
Abstract: We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postynaptic spike, potentiation is induced and otherwise depression is elicited.
6 0.60130215 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
7 0.54261374 156 nips-2011-Learning to Learn with Compound HD Models
8 0.52378219 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.44777977 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
10 0.44736764 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
11 0.44468972 75 nips-2011-Dynamical segmentation of single trials from population neural data
12 0.40949333 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
13 0.38216513 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
14 0.38081264 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
15 0.37918657 217 nips-2011-Practical Variational Inference for Neural Networks
16 0.37112516 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
17 0.37011462 221 nips-2011-Priors over Recurrent Continuous Time Processes
18 0.36542088 60 nips-2011-Confidence Sets for Network Structure
19 0.35539883 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
20 0.35375911 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
topicId topicWeight
[(0, 0.01), (4, 0.032), (20, 0.021), (31, 0.618), (43, 0.061), (45, 0.043), (57, 0.019), (74, 0.032), (83, 0.029), (99, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.97897768 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
2 0.97285008 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
3 0.97011995 225 nips-2011-Probabilistic amplitude and frequency demodulation
Author: Richard Turner, Maneesh Sahani
Abstract: A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequencydemodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings. 1
4 0.94384384 23 nips-2011-Active dendrites: adaptation to spike-based communication
Author: Balazs B. Ujfalussy, Máté Lengyel
Abstract: Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. 1
5 0.90742618 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
6 0.90050739 240 nips-2011-Robust Multi-Class Gaussian Process Classification
7 0.8048324 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.800843 249 nips-2011-Sequence learning with hidden units in spiking neural networks
9 0.78299099 131 nips-2011-Inference in continuous-time change-point models
10 0.76747054 75 nips-2011-Dynamical segmentation of single trials from population neural data
11 0.75914299 87 nips-2011-Energetically Optimal Action Potentials
12 0.75425488 221 nips-2011-Priors over Recurrent Continuous Time Processes
13 0.7520045 229 nips-2011-Query-Aware MCMC
14 0.74979746 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
15 0.74264663 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
16 0.74074525 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
17 0.72638339 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
18 0.72443146 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
19 0.7159546 215 nips-2011-Policy Gradient Coagent Networks
20 0.71458966 86 nips-2011-Empirical models of spiking in neural populations