nips nips2011 nips2011-217 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However the approaches proposed so far have only been applicable to a few simple network architectures. [sent-4, score-0.243]
2 This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. [sent-5, score-0.327]
3 Along the way it revisits several common regularisers from a variational perspective. [sent-6, score-0.176]
4 It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. [sent-7, score-0.592]
5 Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. [sent-8, score-0.563]
6 1 Introduction In the eighteen years since variational inference was first proposed for neural networks [10] it has not seen widespread use. [sent-9, score-0.466]
7 We believe this is largely due to the difficulty of deriving analytical solutions to the required integrals over the variational posteriors. [sent-10, score-0.176]
8 Such solutions are complicated for even the simplest network architectures, such as radial basis networks [2] and single layer feedforward networks with linear outputs [10, 1, 14], and are generally unavailable for more complex systems. [sent-11, score-0.573]
9 The approach taken here is to forget about analytical solutions and search instead for variational distributions whose expectation values (and derivatives thereof) can be efficiently approximated with numerical integration. [sent-12, score-0.399]
10 While it may seem perverse to replace one intractable integral (over the true posterior) with another (over the variational posterior), the point is that the variational posterior is far easier to draw probable samples from, and correspondingly more amenable to numerical methods. [sent-13, score-0.506]
11 Another benefit is that recasting inference as optimisation makes it to easier to implement in existing, gradient-descent-based neural network software. [sent-16, score-0.439]
12 2 Neural Networks For the purposes of this paper a neural network is a parametric model that assigns a conditional probability Pr(D|w) to some dataset D, given a set w = {wi }W of real-valued parameters, or i=1 weights. [sent-17, score-0.32]
13 The network loss LN (w, D) is defined as the negative log probability of the data given the weights. [sent-20, score-0.281]
14 LN (w, D) = − ln Pr(D|w) = − ln Pr(y|x, w) (1) (x,y)∈D The logarithm could be taken to any base, but to avoid confusion we will use the natural logarithm ln throughout. [sent-21, score-0.663]
15 We assume that the partial derivatives of LN (w, D) with respect to the network weights can be efficiently calculated (using, for example, backpropagation or backpropagation through time [22]). [sent-22, score-0.607]
16 3 Variational Inference Performing Bayesian inference on a neural network requires the posterior distribution of the network weights given the data. [sent-23, score-0.82]
17 If the weights have a prior probability P (w|α) that depends on some parameters α, the posterior can be written Pr(w|D, α). [sent-24, score-0.287]
18 Unfortunately, for most neural networks Pr(w|D, α) cannot be calculated analytically, or even efficiently sampled from. [sent-25, score-0.278]
19 The approximation is fitted by minimising the variational free energy F with respect to the parameters β, where F = − ln Pr(D|w)P (w|α) Q(w|β) (2) w∼Q(β) and for some function g of a random variable x with distribution p(x), g x∼p denotes the expectation of g over p. [sent-27, score-0.49]
20 A fully Bayesian approach would infer the prior parameters α from a hyperprior; however in this paper they are found by simply minimising F with respect to α as well as β. [sent-28, score-0.171]
21 (3) is a lower bound on the expected amount of information (measured in nats, due to the use of natural logarithms) required to transmit the targets in D to a receiver who knows the inputs, using the outputs of a network whose weights are sampled from Q(β). [sent-33, score-0.504]
22 Since this term measures the cost of ‘describing’ the network weights to the receiver, we identify it as the complexity loss LC (α, β): LC (α, β) = DKL (Q(β)||P (α)) (5) C L (α, β) can be realised with bits-back coding [25, 10]. [sent-37, score-0.518]
23 If it has, we can be fairly certain that the network is learning underlying patterns in the data and not simply memorising the training set. [sent-41, score-0.31]
24 In practice we have found that as long as significant compression is taking place, decreasing L(α, β, D) on the training set does not increase LE (β, D) on the test set, and it is therefore unnecessary to sacrifice any training data for early stopping. [sent-43, score-0.248]
25 One is the cost of transmitting the model with w unspecified (for example software that implements the network architecture, the training algorithm etc. [sent-45, score-0.439]
26 All continuous distributions are implicitly assumed to be quantised at some very fine resolution, W and we will limit ourselves to diagonal posteriors of the form Q(β) = i=1 qi (βi ), meaning that W LC (α, β) = i=1 DKL (qi (βi )||P (α)). [sent-53, score-0.226]
27 1 Delta Posterior Perhaps the simplest nontrivial distribution for Q(β) is a delta distribution that assigns probability 1 to a particular set of weights w and 0 to all other weights. [sent-55, score-0.249]
28 Although C has no effect on the gradient used for training, it is usually large enough to ensure that the network cannot compress the data using the coding scheme described in the previous section3 . [sent-58, score-0.37]
29 If the prior is uniform, and all realisable weight values are equally likely then LC (α, β) is a constant and we recover ordinary maximum likelihood training. [sent-59, score-0.28]
30 If the prior is a Laplace distribution then α = {µ, b}, P (w|α) = 1 L (α, w) = W ln 2b + b W C |wi − µ| + C =⇒ i=1 W 1 i=1 2b −µ| exp − |wib and ∂LC (α, w) sgn(wi − µ) = ∂wi b (7) If µ = 0 and b is fixed, this is equivalent to ordinary L1 regularisation. [sent-60, score-0.336]
31 However we can instead ˆ determine the optimal prior parameters α for w as follows: µ = µ1/2 (w) (the median weight value) ˆ W 1 ˆ and ˆ = W i=1 |wi − µ|. [sent-61, score-0.243]
32 For a general network architecture we cannot compute either LE (β, D) or its derivatives exactly, so we resort to sampling. [sent-67, score-0.404]
33 if the network perfectly models the data); we would therefore expect it to improve as LE (β, D) decreases. [sent-76, score-0.243]
34 For simple networks whose second derivatives can be calculated efficiently the approximation is unnecessary and the diagonal Hessian can be sampled instead. [sent-77, score-0.368]
35 (11) is equivalent to adding zero-mean, fixed-variance Gaussian noise to the network weights during training. [sent-80, score-0.43]
36 In particular, if the prior P (α) is uniform and a single weight sample is taken for each element of D, then minimising L(α, β, D) is identical to minimising LN (w, D) with weight noise or synaptic noise [13]. [sent-81, score-0.742]
37 Note that the quantisation of the uniform prior adds a large constant to LC (α, β), making it unfeasible to compress the data with our MDL coding scheme; in practice early stopping is required to prevent overfitting when training with weight noise. [sent-82, score-0.578]
38 (14) added to the weight gradient and α optimised after every weight update. [sent-85, score-0.386]
39 But because the prior is no longer uniform the network is able to compress the data, making it feasible to dispense with early stopping. [sent-86, score-0.423]
40 (13) are the complexity costs of individual network weights. [sent-88, score-0.295]
41 These costs give valuable insight into the internal structure of the network, since (with a limited budget of bits to spend) the network will assign more bits to more important weights. [sent-89, score-0.401]
42 4 2 6 Optimisation If the derivatives of LE (β, D) are stochastic, we require an optimiser that can tolerate noisy gradient estimates. [sent-91, score-0.169]
43 Although stochastic derivatives should in principle be estimated using the same weight samples for the entire dataset, it is in practice much more efficient to pick different weight samples for each (x, y) ∈ D. [sent-93, score-0.445]
44 This is because the weights (to which the complexity cost applies) are only transmitted once for the entire dataset, whereas the error cost must be transmitted separately for each batch. [sent-97, score-0.343]
45 Ideally a trained network should be evaluated on some previously unseen input x using the expected distribution Pr(. [sent-100, score-0.292]
46 This is equivalent to removing weight noise during testing. [sent-104, score-0.277]
47 7 Pruning Removing weights from a neural network (a process usually referred to as pruning) has been repeatedly proposed as a means of reducing complexity and thereby improving generalisation [15, 7]. [sent-105, score-0.433]
48 This would seem redundant for variational inference, which automatically limits the network complexity. [sent-106, score-0.419]
49 However pruning can reduce the computational cost and memory demands of the network. [sent-107, score-0.294]
50 Furthermore we have found that if the network is retrained after pruning, the final performance can be improved. [sent-108, score-0.314]
51 A possible explanation is that pruning reduces the noise in the gradient estimates (because the pruned weights are not sampled) without increasing network complexity. [sent-109, score-0.769]
52 Weights w that are more probable under Q(β) tend to give lower LN (w, D) and pruning a weight is equivalent to fixing it to zero. [sent-110, score-0.459]
53 These two facts suggest a pruning heuristic where a weight is removed if its probability density at zero is sufficiently high under Q(β). [sent-111, score-0.401]
54 For a diagonal posterior we can define the relative probability of each wi at zero as the density of qi (βi ) at zero divided by the density of qi (βi ) at its mode. [sent-112, score-0.438]
55 We can then define a pruning heuristic by removing all weights whose relative probability at zero exceeds some threshold γ, with 0 ≤ γ ≤ 1. [sent-113, score-0.387]
56 √ where we have used the reparameterisation λ = −2 ln γ, with λ ≥ 0. [sent-117, score-0.221]
57 As λ grows the amount of pruning increases, and the probability of the pruned weight vector under Q(β) (and therefore the likely network performance) decreases. [sent-119, score-0.747]
58 A good rule of thumb for how high λ can safely be set is the point at which the pruned weights become less probable than an average weight sampled from qi (βi ). [sent-120, score-0.511]
59 83 (20) If the network is retrained after pruning, the cost of transmitting which weights have been removed should in principle be added to LC (α, β) (since this information could be used to overfit the training data). [sent-122, score-0.623]
60 However the extra cost does not depend on the network parameters, and can therefore be ignored for the purposes of optimisation. [sent-123, score-0.301]
61 8 Experiments We tested all the combinations of posterior and prior described in Section 5 on a hierarchical multidimensional recurrent neural network [9] trained to do phoneme recognition on the TIMIT speech corpus [4]. [sent-126, score-1.015]
62 We also assessed the pruning heuristic from Section 7 by applying it with various thresholds to a trained network and observing the impact on performance and network size. [sent-127, score-0.771]
63 The audio data was presented to the network in the form of spectrogram images. [sent-132, score-0.345]
64 Hierarchical multidimensional recurrent neural networks containing Long Short-Term Memory [11] hidden layers and a CTC output layer [8] have proven effective for offline handwriting recognition [9]. [sent-135, score-0.544]
65 The same architecture is employed here, with a spectrogram in place of a handwriting image, and phoneme labels in place of characters. [sent-136, score-0.42]
66 Since the network scans through the spectrogram in all directions, both vertical and horizontal correlations can be captured. [sent-137, score-0.433]
67 All network parameters were trained with online steepest descent (weight updates after every sequence) using a learning rate of 10−4 and a momentum of 0. [sent-141, score-0.366]
68 When parameters in the posterior or prior were fixed, the best value was found empirically. [sent-147, score-0.174]
69 All networks were initialised with random weights (or random weight means if the posterior was Gaussian), chosen from a Gaussian 6 Adaptive weight noise Adapt. [sent-148, score-0.825]
70 prior weight noise Weight noise Maximum likelihood Figure 2: Error curves for four networks during training. [sent-149, score-0.556]
71 Adaptive weight noise does not overfit, and normal weight noise overfits much more slowly than maximum likelihood. [sent-151, score-0.478]
72 Adaptive weight noise led to longer training times and noisier error curves. [sent-152, score-0.342]
73 All distribution parameters were learned by the network unless fixed values are specified. [sent-154, score-0.243]
74 ‘Error’ is the phoneme error rate on the core test set (total edit distance between the network transcriptions and the target transcriptions, multiplied by 100). [sent-155, score-0.523]
75 ‘Ratio’ is the compression ratio of the training set transcription targets relative to a uniform code over the 39 phoneme labels (≈ 5. [sent-157, score-0.442]
76 3 bits per phoneme); this could only be calculated for the networks with Gaussian priors and posteriors. [sent-158, score-0.296]
77 Name Posterior Prior Error Epochs Ratio Adaptive L1 Adaptive L2 Adaptive mean L2 L2 Maximum likelihood L1 Adaptive mean L1 Weight noise Adaptive prior weight noise Adaptive weight noise Delta Delta Delta Delta Delta Delta Delta Gauss σi = 0. [sent-159, score-0.63]
78 For the adaptive Gaussian posterior, the standard deviations of the weights were initialised to 0. [sent-177, score-0.246]
79 The networks with Gaussian posteriors and priors did not require early stopping and were trained on all 3696 utterances in the training set; all other networks used the validation set for early stopping and hence were trained on 3512 utterances. [sent-179, score-0.853]
80 These were also the only networks for which the transmission cost of the network weights could be measured (since it did not depend on the quantisation of the posterior or prior). [sent-180, score-0.777]
81 The networks were evaluated on the test set using the parameters giving lowest LE (β, D) on the training set (or validation set if present). [sent-181, score-0.272]
82 All experiments were stopped after 100 training epochs with no improvement in either L(α, β, D), LE (β, D) or the number of transcription errors on the training or validation set. [sent-182, score-0.316]
83 The reason for such conservative stopping criteria was that the error curves of some of the networks were extremely noisy (see Fig. [sent-183, score-0.247]
84 L2 regularisation was no better than unregularised maximum likelihood, while L1 gave a slight improvement; this is consistent with our previous experience of recurrent neural networks. [sent-186, score-0.246]
85 The fully adaptive L1 and L2 networks performed very badly, apparently because the priors became excessively narrow (σ 2 ≈ 0. [sent-187, score-0.293]
86 L1 with fixed variance and adaptive mean was somewhat better than L1 with mean fixed at 0 (although the adaptive mean was very close to zero, settling around 0. [sent-190, score-0.172]
87 The networks with Gaussian posteriors outperformed those with delta posteriors, with the best score obtained using a fully adaptive posterior. [sent-192, score-0.489]
88 Table 2 shows the effect of pruning on the trained ‘adaptive weight noise’ network from Table 1. [sent-193, score-0.693]
89 The pruned networks were retrained using the same optimisation as before, with the error recorded before and after retraining. [sent-194, score-0.446]
90 As well as being highly effective at removing weights, pruning led to improved performance following retraining in some cases. [sent-195, score-0.274]
91 ‘Weights’ is the number of weights left after pruning and ‘Percent’ is the same figure expressed as a percentage of the original weights. [sent-200, score-0.349]
92 ‘Initial Error’ is the test error immediately after pruning and ‘Retrain Error’ is the test error following ‘Retrain Epochs’ of subsequent retraining. [sent-201, score-0.308]
93 55 cells λ input gates H forget gates V forget gates cells output gates Figure 3: Weight costs in an 2D LSTM recurrent connection. [sent-240, score-0.91]
94 Each dot corresponds to a weight; the lighter the colour the more bits the weight costs. [sent-241, score-0.218]
95 The vertical axis shows the LSTM cell the weight comes from; the horizontal axis shows the LSTM unit the weight goes to. [sent-242, score-0.418]
96 Connectionist temporal classification: Laa belling unsegmented sequence data with recurrent neural networks. [sent-307, score-0.205]
97 Keeping the neural networks simple by minimizing the description length of the weights. [sent-318, score-0.278]
98 An analysis of noise in recurrent neural networks: convergence and generalization. [sent-335, score-0.279]
99 Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks. [sent-355, score-0.242]
100 A direst adaptive method for faster backpropagation learning: The rprop algorithm. [sent-379, score-0.19]
wordName wordTfidf (topN-words)
[('lc', 0.296), ('network', 0.243), ('pruning', 0.236), ('ln', 0.221), ('phoneme', 0.19), ('variational', 0.176), ('networks', 0.165), ('weight', 0.165), ('gauss', 0.148), ('le', 0.147), ('wi', 0.146), ('timit', 0.143), ('delta', 0.136), ('recurrent', 0.128), ('mdl', 0.119), ('derivatives', 0.115), ('weights', 0.113), ('graves', 0.109), ('gates', 0.109), ('pr', 0.108), ('forget', 0.108), ('lstm', 0.108), ('pruned', 0.103), ('spectrogram', 0.102), ('posteriors', 0.102), ('posterior', 0.096), ('wk', 0.096), ('minimising', 0.093), ('adaptive', 0.086), ('handwriting', 0.082), ('ctc', 0.081), ('epochs', 0.081), ('prior', 0.078), ('dkl', 0.078), ('neural', 0.077), ('noise', 0.074), ('retrain', 0.074), ('compression', 0.073), ('qi', 0.072), ('optimisation', 0.071), ('transmitting', 0.071), ('nats', 0.071), ('retrained', 0.071), ('training', 0.067), ('coding', 0.066), ('speech', 0.062), ('transcription', 0.061), ('compress', 0.061), ('gaussian', 0.06), ('probable', 0.058), ('cost', 0.058), ('optimised', 0.056), ('giles', 0.054), ('nowlan', 0.054), ('optimiser', 0.054), ('quantisation', 0.054), ('rprop', 0.054), ('transcriptions', 0.054), ('receiver', 0.053), ('bits', 0.053), ('multidimensional', 0.053), ('costs', 0.052), ('diagonal', 0.052), ('targets', 0.051), ('backpropagation', 0.05), ('trained', 0.049), ('laplace', 0.049), ('transmission', 0.048), ('inference', 0.048), ('discretisation', 0.047), ('bright', 0.047), ('initialised', 0.047), ('bj', 0.047), ('architecture', 0.046), ('stopping', 0.046), ('horizontal', 0.045), ('resolution', 0.044), ('transmit', 0.044), ('bands', 0.044), ('barber', 0.044), ('vertical', 0.043), ('priors', 0.042), ('early', 0.041), ('regularisation', 0.041), ('phone', 0.041), ('validation', 0.04), ('recognition', 0.039), ('cells', 0.039), ('transmitted', 0.039), ('loss', 0.038), ('units', 0.038), ('removing', 0.038), ('ordinary', 0.037), ('steepest', 0.037), ('shannon', 0.037), ('momentum', 0.037), ('optimise', 0.037), ('error', 0.036), ('description', 0.036), ('calculated', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
2 0.16099989 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
3 0.12451117 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ǫ) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. [15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. [37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10. 9
4 0.11292853 302 nips-2011-Variational Learning for Recurrent Spiking Networks
Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner
Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns. 1
5 0.10049519 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta
Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer affiliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to refine the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. ∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully defined given the noisy-or parami eters θj . Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to find the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 5.2 CORRECTNESS OF MODEL REFINEMENT Given V∗ , our technique for model adjustment is guaranteed to choose a model M MAP ∈ M∗ by construction. If any constraint V ∗ ∈ V∗ is violated, the rejection sampling step of our technique 6 100 Comparing convergence of Different Techniques 80 70 60 50 40 30 Data Only Constraints Only Data+Constraints 20 10 0 1 2 3 4 5 Number of constraints used 6 −10 −12 −14 −16 −18 7 −20 Figure 7: Mean KLdivergence and one standard deviation for a 3 cause 3 test network on learning with data, constraints and data+constraints. Gibbs Sampling Stochastic Hill Climbing Greedy Sampling −8 Negative Log Likelihood of MAP Estimate Percentage of tests correctly predicted 90 0 1 2 3 10 10 10 10 Elapsed Time (plotted on log scale from 0 to 1500 seconds) Figure 8: Test Consistency for a 3 cause 3 test network on learning with data, constraints and data+constraints. Figure 9: Convergence rate comparison. would reject that set of parameters. To illustrate this, consider the network in Fig. 2. There are six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. Now we can compute the posterior surface over the two variable parameters after discretizing each parameter in small steps and then calculating the posterior probability at each step as shown in Fig. 5. We can compare this surface with that obtained after Gibbs sampling using our technique as shown in Fig. 6. We can see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms. 5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some M ∗ ∈ M∗ that we want to learn given V+ . We use our technique to find M MAP . To evaluate M MAP , we first compute the constraints, V∗ for M ∗ to get the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M MAP (after collecting 200 samples with non-informative priors for the parameters). We compute the KL-divergence of M MAP with respect to each sampled model. We expect KLdivergence to decrease as the number of constraints in V+ increases since the feasible region beMAP comes smaller. Figure 7 confirms this trend and shows that MDV+ has lower mean KL-divergence MAP MAP than MV+ , which has lower mean KL-divergence than MD . The data points in D are limited to the results of the diagnostic sessions needed to obtain V+ . As constraints increase, more data is available and so the results for the data-only approach also improve with increasing constraints. We also compare the test consistency when learning from data only, constraints only or both. Given a fixed number of constraints, we enumerate the unobserved trajectories, and then compute the highest ranked test using the learnt model and the sampled true models, for each trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest ranked tests on every trajectory. Figure 8 presents these percentatges for the greedy sampling technique (the results are similar for the other techniques). It again appears that learning parameters with both constraints and data is better than learning with only constraints, which is most of the times better than learning with only data. Figure 9 compares the convergence rate of each technique to find the MAP estimate. As expected, Stochastic Hill Climbing and Greedy Sampling take less time than Gibbs sampling to find parameter settings with high posterior probability. 5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the 7 KL−divergence of when computing joint over all tests 8 Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint Data Only Constraints Only Data+Constraints 7 6 5 4 3 2 1 6 8 10 12 14 16 Number of constraints used 18 20 22 Figure 11: KL divergence comparison as the number of constraints increases for the real world problem. entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert’s sequence of positing causes and tests. Test-ranking constraints were deduced from the expert’s test query sequences once the network structure is established. The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, and all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints to get a Bayes network with 54 tests, 30 root causes, and 67 parameters divided in 7 sub-networks, as shown in Figure 10, on which we apply our model refinement technique to learn the parameters for each sub-network separately. Since we don’t have the true underlying network and the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and sampled models from M∗ as we increase the number of constraints. With such limited constraints and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only. 6 CONCLUSION AND FUTURE WORK In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any data available. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: First, this is the first approach that exploits implicit constraints based on value of information assessments. Secondly it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem. Since data is generally sparse in diagnostics, this work makes an important advance to mitigate the model acquisition bottleneck, which has prevented the widespread application of diagnostic networks so far. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse, but constraints may be inferred from expert interactions. Acknowledgments This work was supported by a grant from Intel Corporation. 8 References [1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM 10), Helsinki, Finland, September 13–15 2010. [2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of Twenty First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005. [3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005. [4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference in Pattern Recognition (ICPR), Tampa, FL, USA, 2008. [5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995. [6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of Twenty Third International Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007. [7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221 – 229, 1991. [8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using bayesian networks. IEEE Systems, Man, and Cybernetics, 26(6):826–831, November 1996. [9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995. [10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966. [11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of Twenty Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009. [12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009. [13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of Twenty Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009. [14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984. [15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006. [16] Mark A. Peot and Ross D. Shachter. Learning from what you dont observe. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998. [17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990. [18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000. 9
6 0.099843934 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
7 0.09965647 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
8 0.099520504 301 nips-2011-Variational Gaussian Process Dynamical Systems
9 0.098603331 109 nips-2011-Greedy Model Averaging
10 0.098097913 104 nips-2011-Generalized Beta Mixtures of Gaussians
11 0.095946096 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
12 0.092016853 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
13 0.087885141 4 nips-2011-A Convergence Analysis of Log-Linear Training
14 0.08547423 60 nips-2011-Confidence Sets for Network Structure
15 0.083969377 100 nips-2011-Gaussian Process Training with Input Noise
16 0.081925817 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
17 0.080234706 261 nips-2011-Sparse Filtering
18 0.080128543 25 nips-2011-Adaptive Hedge
19 0.077510901 244 nips-2011-Selecting Receptive Fields in Deep Networks
20 0.077239551 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
topicId topicWeight
[(0, 0.236), (1, 0.035), (2, 0.081), (3, -0.028), (4, -0.009), (5, -0.07), (6, 0.045), (7, 0.026), (8, -0.063), (9, 0.0), (10, -0.115), (11, -0.189), (12, 0.032), (13, -0.124), (14, -0.059), (15, -0.031), (16, -0.125), (17, -0.032), (18, 0.011), (19, 0.02), (20, 0.048), (21, 0.037), (22, 0.086), (23, 0.084), (24, 0.064), (25, 0.058), (26, 0.104), (27, 0.065), (28, 0.083), (29, -0.025), (30, -0.037), (31, -0.035), (32, 0.051), (33, 0.016), (34, 0.143), (35, 0.037), (36, -0.032), (37, 0.055), (38, 0.035), (39, -0.041), (40, -0.038), (41, 0.048), (42, 0.01), (43, 0.014), (44, -0.044), (45, -0.013), (46, 0.004), (47, 0.131), (48, -0.006), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96731746 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
2 0.62289643 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1
3 0.60305154 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
Author: Onur Dikmen, Cédric Févotte
Abstract: In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetical datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.
4 0.59913021 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
5 0.59515846 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
Author: Ke Chen, Ahmad Salman
Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1
6 0.58537167 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
7 0.58423895 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
8 0.58295643 258 nips-2011-Sparse Bayesian Multi-Task Learning
9 0.57030922 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
10 0.56780529 241 nips-2011-Scalable Training of Mixture Models via Coresets
11 0.56232125 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
12 0.55714071 104 nips-2011-Generalized Beta Mixtures of Gaussians
13 0.53225511 301 nips-2011-Variational Gaussian Process Dynamical Systems
14 0.52476442 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
15 0.52251923 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
16 0.51689434 250 nips-2011-Shallow vs. Deep Sum-Product Networks
17 0.51511377 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
18 0.50055134 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
19 0.49316218 240 nips-2011-Robust Multi-Class Gaussian Process Classification
20 0.48455042 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
topicId topicWeight
[(0, 0.024), (4, 0.034), (20, 0.042), (26, 0.035), (31, 0.104), (33, 0.021), (43, 0.06), (45, 0.11), (57, 0.043), (65, 0.023), (74, 0.057), (83, 0.07), (89, 0.285), (99, 0.032)]
simIndex simValue paperId paperTitle
1 0.92825091 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions
Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern
Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method out-performs the standard EM approach that assumes mentions are missing at random. 1
2 0.79904163 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
Author: Yusuke Watanabe
Abstract: While loopy Belief Propagation (LBP) has been utilized in a wide variety of applications with empirical success, it comes with few theoretical guarantees. Especially, if the interactions of random variables in a graphical model are strong, the behaviors of the algorithm can be difficult to analyze due to underlying phase transitions. In this paper, we develop a novel approach to the uniqueness problem of the LBP fixed point; our new “necessary and sufficient” condition is stated in terms of graphs and signs, where the sign denotes the types (attractive/repulsive) of the interaction (i.e., compatibility function) on the edge. In all previous works, uniqueness is guaranteed only in the situations where the strength of the interactions are “sufficiently” small in certain senses. In contrast, our condition covers arbitrary strong interactions on the specified class of signed graphs. The result of this paper is based on the recent theoretical advance in the LBP algorithm; the connection with the graph zeta function.
same-paper 3 0.78919578 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
4 0.61196923 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
5 0.56632245 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
6 0.55956995 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.55949485 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data
8 0.55921406 180 nips-2011-Multiple Instance Filtering
9 0.55821365 219 nips-2011-Predicting response time and error rates in visual search
10 0.55810833 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
11 0.55776274 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
12 0.55636388 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
13 0.55527139 276 nips-2011-Structured sparse coding via lateral inhibition
14 0.55441177 229 nips-2011-Query-Aware MCMC
15 0.55245036 156 nips-2011-Learning to Learn with Compound HD Models
16 0.55215603 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
17 0.55205262 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
18 0.55166614 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
19 0.55160677 227 nips-2011-Pylon Model for Semantic Segmentation
20 0.55146307 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation