NIPS 2005, paper 32: Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting the set of variables to deal with conjunctions of causes, similar to the augmented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformation. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates.
1 Introduction

It is important to understand the relationship between the Rescorla-Wagner (RW) algorithm [1,2] and theories of learning based on maximum likelihood (ML) estimation of the parameters of generative models [3,4,5].
But maximum likelihood offers the promise of a sound statistical basis, including the ability to learn sophisticated probabilistic models for causal learning [6,7,8]. Previous work, summarized in section (2), showed a direct relationship between the basic Rescorla-Wagner algorithm and maximum likelihood for the ∆P model of causal learning [4,9]. More recently, a generalization of Rescorla-Wagner was shown to perform maximum likelihood estimation for both the ∆P and the noisy-or models [10]. Throughout the paper, we follow the common practice of studying the convergence of the expected value of the weights and ignoring the fluctuations. The size of these fluctuations can be calculated analytically and the convergence quantified precisely [10].
We show that two classes of generalized Rescorla-Wagner algorithms can perform ML estimation for all generative models provided genericity assumptions on the causes are satisfied. These generalizations include augmenting the set of variables to represent conjunctive causes and are related to the augmented Rescorla-Wagner algorithm [2]. We also analyze the case where the genericity assumption breaks down, paying particular attention to Cheng's causal power model [4,5].
We demonstrate that Rescorla-Wagner can perform ML estimation for this model up to a nonlinear transformation of the model parameters. We sketch how a nonlinear Rescorla-Wagner can estimate the parameters directly. Previous results [10] can be used to give convergence conditions and to put bounds on the convergence rate. The results can also be applied in the piecewise i.i.d. case (such as forward and backward blocking [11]).
2 Summary of Previous Work

We summarize previous work relating maximum likelihood estimation of generative models with the Rescorla-Wagner algorithm [4,9,10]. This work assumes that there is a binary-valued event E which can be caused by one or both of two binary-valued causes C1, C2.
The ∆P and noisy-or theories use generative models of the form:

P_∆P(E = 1|C1, C2; ω1, ω2) = ω1 C1 + ω2 C2,   (1)
P_noisy-or(E = 1|C1, C2; ω1, ω2) = ω1 C1 + ω2 C2 − ω1 ω2 C1 C2,   (2)

where {ω1, ω2} are the model parameters.
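As a concrete illustration, the two models (1)-(2) can be written directly as probability functions (a minimal sketch; the function names and the example parameter values are ours, not the paper's):

```python
# Hedged sketch of the two generative models, eqs. (1)-(2); w1, w2
# play the role of the paper's omega_1, omega_2.
def p_delta_p(C1, C2, w1, w2):
    """Delta-P model: P(E=1|C1,C2) = w1*C1 + w2*C2."""
    return w1 * C1 + w2 * C2

def p_noisy_or(C1, C2, w1, w2):
    """Noisy-or model: P(E=1|C1,C2) = w1*C1 + w2*C2 - w1*w2*C1*C2."""
    return w1 * C1 + w2 * C2 - w1 * w2 * C1 * C2

# With both causes on, noisy-or stays a valid probability:
# 0.6 + 0.3 - 0.18 = 0.72 = 1 - (1-0.6)*(1-0.3)
print(p_noisy_or(1, 1, 0.6, 0.3))
```

The value for C1 = C2 = 1 equals 1 − (1 − ω1)(1 − ω2), the usual "independent causes" reading of the noisy-or.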
It is known that if the basic update rule (5) is used then the weights converge to the ML estimates of the parameters {ω1, ω2} provided the data is generated by the ∆P model (1) [4,9] (but not for the noisy-or model). If the variant update rule (6) is used, then the weights converge to the parameters {ω1, ω2} of the ∆P model (1) or the noisy-or model (2), depending on which model generates the data [10].
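Rules (5) and (6) are cited but not reproduced in this summary. Assuming the standard Rescorla-Wagner form ∆V_i = η C_i(E − V1 C1 − V2 C2), convergence of the expected weights to the ∆P parameters can be checked numerically (the uniform P(C), the learning rate and the parameter values below are our assumptions):

```python
import numpy as np

# Expected-value iteration of the basic RW rule, assumed to have the
# standard form dV_i = eta * C_i * (E - V1*C1 - V2*C2); the paper's
# eq. (5) is not reproduced above. Data come from the Delta-P model (1).
patterns = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)
pC = np.array([0.25, 0.25, 0.25, 0.25])     # uniform, generic P(C)
w = np.array([0.4, 0.3])                    # true (w1, w2)

pE = patterns @ w                           # P(E=1|C) under Delta-P
A = (patterns * pC[:, None]).T @ patterns   # <C_i C_j>
b = (patterns * pC[:, None]).T @ pE         # <C_i E>

V, eta = np.zeros(2), 0.5
for _ in range(2000):
    V += eta * (b - A @ V)                  # expected RW update

print(np.round(V, 6))  # -> [0.4 0.3], the ML estimates
```

The iteration is a deterministic average of the stochastic rule, matching the paper's convention of studying the expected weights.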
3 Basic Ingredients

This section describes three basic ingredients of this work: (i) the generative models, (ii) maximum likelihood, and (iii) the generalized Rescorla-Wagner algorithms.
We represent the distribution P(E|C; α) by the function:

P(E = 1|C; α) = Σ_i α_i h_i(C),   (7)

where the {h_i(C)} are a set of basis functions and the {α_i} are parameters.
Maximum Likelihood (ML) estimates the α by solving:

α* = arg min_α − Σ_{µ∈Λ} log{P(E^µ|C^µ; α) P(C^µ)} = arg min_α − Σ_{µ∈Λ} log P(E^µ|C^µ; α).   (9)

Observe that the estimate of α is independent of P(C) provided the distribution is generic.
The Rescorla-Wagner (RW) algorithm updates weights {V_i : i = 1, ..., n} by rule (10). We assume a generalized form:

∆V_i = Σ_j V_j f_ij(C) + E g_i(C),   i = 1, ..., n.   (11)
4 Theoretical Results

We now give sufficient conditions which ensure that the only fixed points of generalized Rescorla-Wagner correspond to ML estimates of the parameters α of generative models P(E|C; α).
We then obtain two classes of generalized Rescorla-Wagner which satisfy these conditions. For one class, convergence to the fixed points follows directly. For the other class we need to adapt results from [10] to guarantee convergence to the fixed points. Our results assume genericity conditions on the distribution P(C) of causes. The number of weights {V_i} used by the Rescorla-Wagner algorithm is equal to the number of parameters {α_i} that specify the model. But many weights will remain zero unless conjunctions of causes occur; see section (6).
Theorem 1. A sufficient condition for generalized Rescorla-Wagner (11) to have a unique fixed point at the maximum likelihood estimates of the parameters of a generative model P(E|C; α) (7) is that ⟨f_ij(C)⟩_P(C) = −⟨g_i(C) h_j(C)⟩_P(C) for all i, j, and that the matrix ⟨f_ij(C)⟩_P(C) is invertible.
This is zero if, and only if, Σ_j V_j ⟨f_ij(C)⟩_P(C) + Σ_j α_j ⟨g_i(C) h_j(C)⟩_P(C) = 0. For example, ⟨f_ij(C)⟩_P(C) = Σ_C P(C) f_ij(C). Hence the requirement that the matrix ⟨f_ij(C)⟩_P(C) is invertible usually requires that P(C) is generic. Convergence may still occur if the matrix ⟨f_ij(C)⟩_P(C) is non-invertible.
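The fixed-point condition of Theorem 1 can be sanity-checked numerically. The sketch below (with an arbitrary generic P(C) and a noisy-or parameterization, both our choices) picks g_i = h_i and f_ij = −h_i h_j, so that ⟨f_ij⟩ = −⟨g_i h_j⟩ holds, and verifies that the unique fixed point of the averaged update is V* = α:

```python
import numpy as np

# Basis functions h = (C1, C2, C1*C2) over two binary causes.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
pC = np.array([0.1, 0.3, 0.4, 0.2])            # a generic P(C) (our choice)

H = np.array([[c1, c2, c1 * c2] for c1, c2 in patterns], dtype=float)
alpha = np.array([0.6, 0.3, -0.18])            # noisy-or: (w1, w2, -w1*w2)

# Theorem 1 choice: g_i = h_i and f_ij(C) = -h_i(C)*h_j(C), so that
# <f_ij>_P(C) = -<g_i h_j>_P(C) holds automatically.
F = -(H * pC[:, None]).T @ H                   # <f_ij>_P(C)
G = (H * pC[:, None]).T @ H                    # <g_i h_j>_P(C)

# Fixed point: sum_j V_j <f_ij> + sum_j alpha_j <g_i h_j> = 0,
# i.e. F V* + G alpha = 0, so V* = -F^{-1} G alpha = alpha.
V_star = -np.linalg.solve(F, G @ alpha)
print(np.allclose(V_star, alpha))  # True
```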
Linear combinations of the weights will remain fixed (in the directions of the zero eigenvectors of the matrix) and the remaining linear combinations will converge.
Additional conditions to ensure convergence to the fixed point, and to determine the convergence rate, can be found using Theorems 3, 4 and 5 in [10].
4.1 Generalized RW class I

We now prove a corollary of Theorem 1 which will enable us to obtain our first class of generalized RW algorithms.
Corollary 1. A sufficient condition for generalized RW to have fixed points at ML estimates of the model parameters is f_ij(C) = −h_i(C) h_j(C), g_i(C) = h_i(C) for all i, j, and that the matrix ⟨h_i(C) h_j(C)⟩_P(C) is invertible.
These conditions define generalized RW class I (GRW-I), which is a natural extension of basic Rescorla-Wagner (5):

∆V_i = h_i(C){E − Σ_j h_j(C) V_j} = −(1/2) (∂/∂V_i) (E − Σ_j h_j(C) V_j)²,   i = 1, ..., n.   (12)

This GRW-I algorithm is guaranteed to converge to the fixed point because it performs stochastic steepest descent.
To illustrate Corollary 1, we show the relationships between GRW-I and ML for three different generative models: (i) the ∆P model, (ii) the noisy-or model, and (iii) the most general form of P(E|C) for two causes. It is important to realize that these generative models form a hierarchy, and GRW-I algorithms for the later models will also perform ML on the simpler ones.
Then equation (12) reduces to the basic RW algorithm (5) with two weights V1, V2. This rederives the known relationship between basic RW, ML, and the ∆P model [4,9]. This is equivalent to the genericity condition ⟨C1 C2⟩_P(C) ≠ ⟨C1⟩_P(C) ⟨C2⟩_P(C).
Then Corollary 1 proves that the following algorithm, equation (12) with basis h1 = C1, h2 = C2, h3 = C1 C2, will converge to the estimates V1* = ω1, V2* = ω2 and V3* = −ω1 ω2 for the noisy-or model:

∆V_i = h_i(C){E − V1 C1 − V2 C2 − V3 C1 C2},   i = 1, 2, 3.   (13)

This algorithm is a minor variant of basic RW. Observe that it has more weights (n = 3) than the total number of causes. The first two weights V1 and V2 yield ω1, ω2 while the third weight V3 gives a (redundant) estimate of −ω1 ω2.
The matrix ⟨h_i(C) h_j(C)⟩_P(C) has determinant (⟨C1 C2⟩ − ⟨C1⟩)(⟨C1 C2⟩ − ⟨C2⟩)⟨C1 C2⟩ and is invertible provided ⟨C1⟩ ≠ 0, 1, ⟨C2⟩ ≠ 0, 1 and ⟨C1 C2⟩ ≠ ⟨C1⟩⟨C2⟩.
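The convergence claimed for the three-weight algorithm (13) can be illustrated by iterating the expected GRW-I update (the particular generic P(C), the ω values and the learning rate are our assumptions):

```python
import numpy as np

# Expected-value iteration of GRW-I (eq. 12) with the three basis
# functions h = (C1, C2, C1*C2) used for the noisy-or model (eq. 13).
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
pC = np.array([0.1, 0.3, 0.4, 0.2])         # generic P(C), our choice
w1, w2 = 0.6, 0.3

H = np.array([[c1, c2, c1 * c2] for c1, c2 in patterns], dtype=float)
pE = H @ np.array([w1, w2, -w1 * w2])       # noisy-or P(E=1|C)

A = (H * pC[:, None]).T @ H                 # <h_i h_j>, invertible here
b = (H * pC[:, None]).T @ pE                # <h_i E>

V, eta = np.zeros(3), 0.5
for _ in range(20000):
    V += eta * (b - A @ V)                  # expected GRW-I update

print(np.round(V, 4))  # converges to (0.6, 0.3, -0.18) = (w1, w2, -w1*w2)
```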
It is known that basic RW is unable to do ML estimation for the noisy-or model if there are only two weights [4,5,9,10]. The difference here is that three weights are used.
By Corollary 1, this algorithm will converge to V1* = α1, V2* = α2, V3* = α3, V4* = α4, provided the matrix is invertible. The determinant of the matrix ⟨h_i(C) h_j(C)⟩_P(C) is ⟨C1 C2⟩(⟨C1 C2⟩ − ⟨C1⟩)(⟨C1 C2⟩ − ⟨C2⟩)(1 − ⟨C1⟩ − ⟨C2⟩ + ⟨C1 C2⟩). It is important to realize that the most general GRW-I algorithm will converge if P(E|C) is the ∆P or the noisy-or model.
4.2 Generalized RW Class II

We can obtain a second class of generalized RW algorithms which perform ML estimation.
Corollary 2. A sufficient condition for RW to have a unique fixed point at the ML estimate of the generative model P(E|C) is that f_ij(C) = −g_i(C) h_j(C), provided the matrix ⟨g_i(C) h_j(C)⟩_P(C) is invertible.
Corollary 2 defines GRW-II to be of the form:

∆V_i = g_i(C){E − Σ_j h_j(C) V_j}.   (16)

The matrix ⟨g_i(C) h_j(C)⟩_P(C) has determinant ⟨C1 C2⟩(⟨C1⟩ − ⟨C1 C2⟩)(⟨C2⟩ − ⟨C1 C2⟩) and so is invertible for generic P(C). The algorithm will converge to weights V1* = ω1, V2* = ω2, V3* = −ω1 ω2.
If we change the model to ∆P, then we get convergence to V1* = ω1, V2* = ω2, V3* = 0. The remaining update equations for V1 and V2 will converge to ω1, ω2 for both the noisy-or and the ∆P model.
These reduced update equations are identical to those given by equation (6), which were proven to converge to ω1, ω2 [10].
We note that the matrix ⟨g_i(C) h_j(C)⟩_P(C) now has a zero eigenvalue (because g3(C) = 0), but this does not matter because it corresponds to the third weight V3. The matrix remains invertible if we restrict it to i, j = 1, 2.
A limitation of the GRW-II algorithm of equation (16) is that it only updates the weights if only one cause is active. So it would fail to explain effects such as blocking, where both causes are on for part of the stimuli (Dayan, personal communication).
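A sketch of a GRW-II iteration in expectation. The specific functions g_i of equation (16) are not reproduced in this summary; the choice g = (C1(1 − C2), C2(1 − C1), C1 C2) below is our assumption, made because it yields exactly the determinant ⟨C1 C2⟩(⟨C1⟩ − ⟨C1 C2⟩)(⟨C2⟩ − ⟨C1 C2⟩) quoted above:

```python
import numpy as np

# Hedged GRW-II sketch: dV_i = g_i(C) * (E - sum_j h_j(C) V_j), eq. (16).
# The g_i used here are an assumption, not taken from the paper.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
pC = np.array([0.1, 0.3, 0.4, 0.2])
w1, w2 = 0.6, 0.3

H = np.array([[c1, c2, c1 * c2] for c1, c2 in patterns], dtype=float)
G = np.array([[c1 * (1 - c2), c2 * (1 - c1), c1 * c2]
              for c1, c2 in patterns], dtype=float)
alpha = np.array([w1, w2, -w1 * w2])        # noisy-or parameters
pE = H @ alpha

M = (G * pC[:, None]).T @ H                 # <g_i h_j>, invertible here
b = (G * pC[:, None]).T @ pE                # <g_i E>

V, eta = np.zeros(3), 0.5
for _ in range(20000):
    V += eta * (b - M @ V)                  # expected GRW-II update

print(np.round(V, 4))  # converges to (0.6, 0.3, -0.18)
```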
5 Non-generic cases, coordinate transformations, and nonlinear RW

Our results have assumed genericity constraints on the distribution P(C) of causes.
Cheng's PC theory [4,5] uses the noisy-or model for generating the data, but cause C1 is a background cause which is on all the time (i.e., C1 = 1 on every trial). This implies that ⟨C2⟩ = ⟨C1 C2⟩, and so we cannot apply the RW algorithms (13), the most general algorithm, or (16), because the matrix determinant will be zero in all three cases.
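The degeneracy can be seen directly: with C1 ≡ 1, the Gram matrix ⟨h_i h_j⟩ of the three basis functions acquires two identical rows (the value of P(C2 = 1) below is our choice):

```python
import numpy as np

# With the background cause always on (C1 = 1), <C2> = <C1*C2> and the
# GRW-I matrix <h_i h_j> for h = (C1, C2, C1*C2) becomes singular.
pC2 = 0.5
patterns = [(1, 0), (1, 1)]                 # C1 is always 1
pC = np.array([1 - pC2, pC2])

H = np.array([[c1, c2, c1 * c2] for c1, c2 in patterns], dtype=float)
A = (H * pC[:, None]).T @ H
print(np.linalg.det(A))  # ~0: the genericity condition fails
```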
In this case the noisy-or model (2) reduces to:

P(E = 1|C2) = ω1 + ω2(1 − ω1) C2.   (17)

Theorem 1 shows that we can define generalized RW algorithms to find ML estimates of ω1 and ω2(1 − ω1) (assuming ω1 ≠ 1). But, conversely, it is impossible to estimate ω2 directly by any linear generalized RW.
RW estimates the parameters of the generative model in a different coordinate system than the one used to specify the model. So RW can estimate the ML parameters provided we allow for an additional nonlinear transformation.
If we reparameterize the generative model to be P(E = 1|C) = ω1 + ω̂2 C2, where ω̂2 = ω2(1 − ω1), then we can design an RW algorithm to estimate {ω1, ω̂2}. In the case ω1 = 1, the generative model P(E|C) becomes independent of ω2 and so it is impossible to estimate it.
But suppose we want to estimate ω1 and ω2 directly (for Cheng's model, the value of ω2 is the causal power and hence is a meaningful quantity [4,5]). To estimate ω2, we replace the variable V2 by a new variable V3 = V2/(1 − V1) which is updated by a nonlinear equation (V1 is updated as before):

V3^{t+1} = V3^t + ∆V2^t/(1 − V1^t) + V3^t ∆V1^t/(1 − V1^t),   (19)

where we use V3 = V2/(1 − V1) to re-express ∆V1 and ∆V2 in terms of functions of V1 and V3.
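A sketch of the nonlinear scheme in expectation. The linear updates ∆V1, ∆V2 below come from a GRW-I rule on the basis (1, C2), and the learning rate, P(C2) and parameter values are our assumptions:

```python
import numpy as np

# Cheng's setting: background cause C1 = 1 on every trial, so
# P(E=1|C2) = w1 + w2*(1-w1)*C2 is linear in the basis (1, C2).
# A linear rule recovers (w1, w2*(1-w1)); the nonlinear variable
# V3 = V2/(1-V1), updated as in eq. (19), tracks the causal power w2.
w1, w2 = 0.4, 0.7
pC2 = 0.5                                   # P(C2 = 1), our assumption

H = np.array([[1.0, 0.0], [1.0, 1.0]])      # basis (1, C2) for C2 = 0, 1
pC = np.array([1 - pC2, pC2])
pE = np.array([w1, w1 + w2 * (1 - w1)])

A = (H * pC[:, None]).T @ H
b = (H * pC[:, None]).T @ pE

V, V3, eta = np.zeros(2), 0.0, 0.01
for _ in range(20000):
    dV = eta * (b - A @ V)                  # expected linear updates
    # eq. (19): V3 <- V3 + dV2/(1-V1) + V3*dV1/(1-V1), using the old V1
    V3 = V3 + dV[1] / (1 - V[0]) + V3 * dV[0] / (1 - V[0])
    V += dV

print(round(V[0], 3), round(V[1], 3))       # -> 0.4 0.42
print(V3)                                   # approaches the causal power w2 = 0.7
```

Because (19) is a first-order update, V3 tracks V2/(1 − V1) only approximately at a finite learning rate; a small η keeps the accumulated error small.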
6 Conclusion

This paper shows that we can obtain linear generalizations of the Rescorla-Wagner algorithm which can learn the parameters of generative models by Maximum Likelihood. For one class of RW generalizations we have only shown that the fixed points are unique and correspond to ML estimates of the parameters of the generative models. But Theorems 3, 4 and 5 of Yuille (2004) can be applied to determine convergence.
These theorems can also be used to obtain convergence results for piecewise i.i.d. samples, as occur in forward and backward blocking experiments.
These generalizations of Rescorla-Wagner require augmenting the number of weight variables. This was already proposed, on experimental grounds, so that new weights get created if causes occur in conjunction [2]. Note that this happens naturally in the algorithms presented ((13), the most general algorithm, and (16)): weights remain at zero until we get an event C1 C2 = 1.
It is straightforward to extend the analysis to models with conjunctions of many causes. We conjecture that these generalizations converge to good approximations of the ML estimates if we truncate the conjunction of causes at a fixed order.
Finally, many of our results have involved a genericity assumption on the distribution of causes P(C). We have argued that when these assumptions are violated, for example in Cheng's experiments, then generalized RW still performs ML estimation, but with a nonlinear transform. Alternatively, we have shown how to define a nonlinear RW that estimates the parameters directly.
wordName wordTfidf (topN-words)
[('rw', 0.646), ('ml', 0.284), ('fij', 0.235), ('causal', 0.221), ('genericity', 0.188), ('hj', 0.187), ('generative', 0.144), ('hi', 0.141), ('cheng', 0.141), ('corollary', 0.119), ('generalized', 0.119), ('causes', 0.111), ('convergence', 0.102), ('pp', 0.098), ('weights', 0.094), ('generalizations', 0.092), ('converge', 0.087), ('vi', 0.081), ('gi', 0.077), ('blocking', 0.071), ('likelihood', 0.07), ('vit', 0.07), ('vj', 0.068), ('invertible', 0.066), ('determinant', 0.066), ('conditioning', 0.065), ('conjunctions', 0.064), ('basic', 0.063), ('estimates', 0.057), ('theorems', 0.056), ('estimation', 0.054), ('pavlovian', 0.054), ('yuille', 0.054), ('update', 0.051), ('augmenting', 0.051), ('maximum', 0.05), ('piecewise', 0.049), ('coordinate', 0.049), ('uctuations', 0.047), ('estimate', 0.046), ('matrix', 0.045), ('augmented', 0.044), ('parameters', 0.044), ('dayan', 0.043), ('provided', 0.043), ('courville', 0.043), ('cause', 0.04), ('nonlinear', 0.04), ('updates', 0.039), ('ingredients', 0.038), ('xed', 0.036), ('violated', 0.036), ('breaks', 0.036), ('realize', 0.034), ('grif', 0.034), ('variant', 0.034), ('equations', 0.034), ('angeles', 0.033), ('rule', 0.032), ('equation', 0.031), ('power', 0.031), ('model', 0.03), ('algorithm', 0.03), ('theories', 0.03), ('conditions', 0.029), ('los', 0.029), ('induction', 0.029), ('class', 0.029), ('ii', 0.029), ('observe', 0.029), ('weight', 0.028), ('assumptions', 0.027), ('pc', 0.026), ('models', 0.026), ('happens', 0.026), ('correspond', 0.025), ('backward', 0.025), ('matter', 0.025), ('arg', 0.025), ('transformation', 0.025), ('zero', 0.025), ('conjunction', 0.024), ('adapt', 0.024), ('acknowledgement', 0.023), ('truncate', 0.023), ('ire', 0.023), ('keck', 0.023), ('ensure', 0.023), ('unique', 0.023), ('perform', 0.023), ('veri', 0.023), ('occur', 0.023), ('relationship', 0.023), ('condition', 0.023), ('relating', 0.023), ('impossible', 0.023), ('samples', 0.022), ('event', 0.022), ('basis', 0.021), ('special', 
0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
2 0.10832009 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
3 0.10639264 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
4 0.084476151 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
5 0.067560144 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
Author: Kristina Klinkner, Cosma Shalizi, Marcelo Camperi
Abstract: Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. 
Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state. Section 1 argues that coordinated behavior in neural systems is often not captured by exist- ing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a prori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation. 1 Synchrony or Coherence? 
Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity. There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony. The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function γXY (h) yields the cross-spectrum FXY (ν), which in turn gives the 2 spectral coherence cXY (ν) = FXY (ν)/FX (ν)FY (ν), a normalized correlation between the Fourier components of X and Y . Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series. 
([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t — it’s not spiking, but it’s not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the “no spike” of the second neuron preparing to fire with its just-sitting-around “no spike”. Here, mutual information could find the coordination if we used a 50 ms lag, but that won’t work in general. Take two rate-coding neurons with base-line firing rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won’t help. Generalized synchrony is based on the idea of establishing relationships between the states of the various units. 
“State” here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times ≥ t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ statespace reconstruction [8]: if the state x ∈ X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f (x), then the space Y of time-delay vectors [y(t), y(t − τ ), ...y(t − (k − 1)τ )] is diffeomorphic to X if k > 2d, for generic choices of lag τ . The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units. What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call “informational coherence”. 2 States and Informational Coherence There are alternatives to calculating the “surface” mutual information between the sequences of observations themselves (which, as described, fails to capture coordination). 
If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units’ patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns and a more general, common scheme is desirable. The most general notion of “pattern of activity” is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this. Assuming the usual notation for Shannon information [9], the information content of a state variable X is H[X] and the mutual information between X and Y is I[X; Y ]. As is well-known, I[X; Y ] ≤ min H[X], H[Y ]. We use this to normalize the mutual state information to a 0 − 1 scale, and this is the informational coherence (IC). ψ(X, Y ) = I[X; Y ] , with 0/0 = 0 . min H[X], H[Y ] (1) ψ can be interpreted as follows. I[X; Y ] is the Kullback-Leibler divergence between the joint distribution of X and Y , and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y . The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared between the two systems. ψ simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities). 2.1 Reconstruction and Estimation of Effective State Spaces As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system. 
While classical state-space reconstruction won’t work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discrete-valued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process [13].¹ The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12].

One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises.
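The history-clustering idea at the heart of CSSR can be caricatured in a few lines. The sketch below is emphatically not the algorithm of [12]: it fixes a single history length, merges histories by a crude tolerance on their empirical next-symbol distributions rather than by hypothesis tests, and omits the determinization step entirely. All names are ours; it is meant only to show what "states as distinct distributions over the future" looks like operationally:

```python
import numpy as np
from collections import defaultdict

def next_symbol_dists(seq, L):
    """Empirical next-symbol distribution for each length-L history."""
    alphabet = sorted(set(seq))
    counts = defaultdict(lambda: np.zeros(len(alphabet)))
    for t in range(L, len(seq)):
        counts[tuple(seq[t - L:t])][alphabet.index(seq[t])] += 1
    return {h: c / c.sum() for h, c in counts.items()}

def group_histories(seq, L, tol=0.05):
    """Toy state construction: merge histories whose next-symbol
    distributions agree to within tol (CSSR uses proper tests)."""
    states = []  # list of (representative distribution, member histories)
    for hist, p in next_symbol_dists(seq, L).items():
        for rep, members in states:
            if np.abs(rep - p).max() < tol:
                members.append(hist)
                break
        else:
            states.append((p, [hist]))
    return states
```

On the period-2 sequence abab…, the histories "a" and "b" predict different futures and land in two states; an IID constant sequence collapses to one.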
Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes’s rule [17]. Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time, t, at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time. Once the causal state has been established, it can be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can therefore be automatically converted into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out.

2.2 Estimating the Coherence

Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of

¹ Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16].

Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (a), the recurrent connections among the pyramidal cells are suppressed and a gamma rhythm emerges (left). At t = 10 s, those connections become active, leading to a beta rhythm (b, right).
the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of ψ values. Even if two systems are independent, their estimated IC will, on average, be positive, because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences, and from these calculate values of the IC. This procedure will approximate the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space.

2.3 Approximating the Network Multi-Information

There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, I[X_1; X_2; …; X_n] ≡ D(P‖Q), the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information I[X_1; X_2] by its maximum possible value, min{H[X_1], H[X_2]}, we normalize the multi-information by its maximum, which is the smallest sum of n − 1 marginal entropies:

I[X_1; X_2; …; X_n] ≤ min_k ∑_{i≠k} H[X_i]

Unfortunately, P is a distribution over a very high-dimensional space and so is hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution, though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing E_T for the set of pairs (i, j), and abbreviating X_1 = x_1, X_2 = x_2, …, X_n = x_n by X = x, one has

T(X = x) = ∏_{i=1}^{n} T(X_i = x_i) ∏_{(i,j)∈E_T} T(X_i = x_i, X_j = x_j) / [T(X_i = x_i) T(X_j = x_j)]   (2)

where the marginal distributions T(X_i) and the pair distributions T(X_i, X_j) are estimated by the empirical marginal and pair distributions. We must now pick edges E_T so that T best approximates the true global distribution P. A natural approach is to minimize D(P‖T), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge’s weight to be the mutual information between the variables it links. There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum likelihood estimator of the ideal Chow-Liu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim’s algorithm [21, sec. 23.2], which runs in time O(n² + n log n). Thus, the approximation is computationally tractable.
(3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree:

I[X_1; X_2; …; X_n] ≥ D(T‖Q) = ∑_{(i,j)∈E_T} I[X_i; X_j]   (3)

Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P‖Q) directly, which requires evaluating log P(x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P, e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP-hard [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here.

3 Example: A Model of Gamma and Beta Rhythms

We use simulated data as a test case, instead of empirical multiple-electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (30–80 Hz) and beta (12–30 Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs. The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABA_A) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise.
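Before turning to the results, note that the tree estimate reported below (Sec. 2.3, Eqs. 2–3) reduces to computing a pairwise mutual-information matrix over the state series and running a maximum-weight spanning tree on it. A minimal sketch, with names of our own and Prim's algorithm in its simple quadratic array form rather than the heap-based version of [21]:

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """Empirical mutual information (bits) between two discrete sequences."""
    def H(s):
        p = np.array(list(Counter(s).values()), dtype=float)
        p /= p.sum()
        return float(-(p * np.log2(p)).sum())
    return H(x) + H(y) - H(list(zip(x, y)))

def chow_liu_edges(series):
    """Maximum-weight spanning tree with MI edge weights (Prim).
    Returns the tree edges E_T and the Eq. 3 lower bound, the sum
    of mutual informations along those edges."""
    n = len(series)
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info(series[i], series[j])
    in_tree = {0}
    edges, bound = [], 0.0
    while len(in_tree) < n:
        # grow the tree along the heaviest edge leaving it
        i, j = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))
        bound += mi[i, j]
        in_tree.add(j)
    return edges, bound
```

Two perfectly coupled units are linked by a tree edge carrying their full shared information, while an independent unit attaches through a near-zero edge, so the bound is dominated by genuine dependencies.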
The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm (subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connections (potentiated by Hebbian preprocessing as a result of synchrony during the gamma rhythm) and a slow outward after-hyperpolarization (AHP) current (the M-current), suppressed during gamma due to the metabotropic activation used in the generation of the rhythm. During the beta rhythm, pyramidal cells, silent during the gamma rhythm, fire on a subset of interneuron cycles (Fig. 1).

Figure 2: Heat-maps of coordination for the network, as measured by zero-lag cross-correlation (top row) and informational coherence (bottom), contrasting the gamma rhythm (left column) with the beta (right). Colors run from red (no coordination) through yellow to pale cream (maximum).

Fig. 2 compares zero-lag cross-correlation, a second-order method of quantifying coordination, with the informational coherence calculated from the reconstructed states. (In this simulation, we could have calculated the actual states of the model neurons directly, rather than reconstructing them, but for purposes of testing our method we did not.) Cross-correlation finds some of the relationships visible in Fig. 1, but is confused by, for instance, the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives similar results.) Informational coherence, however, has no trouble recognizing the two populations as effectively coordinated blocks. The presence of dynamical noise, problematic for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the inactive, low-numbered neurons are excluded). The tree estimate of the global informational multi-information is 3243.7 bits, with a global coherence of 0.777. The right half of Fig.
2 repeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and the tree estimate of the global multi-information is 7377.7 bits, though the estimated global coherence falls very slightly to 0.742. This is because low-numbered neurons which were quiescent before are now active, contributing to the global information, but the over-all pattern is somewhat weaker and noisier (as can be seen from Fig. 1b). So, as expected, the total information content is higher, but the overall coordination across the network is lower.

4 Conclusion

Informational coherence provides a measure of neural information sharing and coordinated activity which accommodates nonlinear, stochastic relationships between extended patterns of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure of global coordination across networks or regions. Applied to data from multi-electrode recordings, it should be a valuable tool in evaluating hypotheses about distributed neural representation and function.

Acknowledgments

Thanks to R. Haslinger, E. Ionides and S. Page; and for support to the Santa Fe Institute (under grants from Intel, the NSF and the MacArthur Foundation, and DARPA agreement F30602-00-2-0583), the Clare Boothe Luce Foundation (KLK) and the James S. McDonnell Foundation (CRS).

References

[1] L. F. Abbott and T. J. Sejnowski, eds. Neural Codes and Distributed Representations. MIT Press, 1998.
[2] E. N. Brown, R. E. Kass, and P. P. Mitra. Nature Neuroscience, 7:456–461, 2004.
[3] D. H. Ballard, Z. Zhang, and R. P. N. Rao. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, eds., Probabilistic Models of the Brain, pp. 273–284. MIT Press, 2002.
[4] D. R. Brillinger and A. E. P. Villa. In D. R. Brillinger, L. T. Fernholz, and S. Morgenthaler, eds., The Practice of Data Analysis, pp. 77–92. Princeton U.P., 1997.
[5] R. Quian Quiroga et al. Physical Review E, 65:041903, 2002.
[6] R. F. Streater. Statistical Dynamics. Imperial College Press, London.
[7] M. L. Littman, R. S. Sutton, and S. Singh. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1555–1561. MIT Press, 2002.
[8] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge U.P., 1997.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[10] M. Palus et al. Physical Review E, 63:046211, 2001.
[11] F. B. Knight. Annals of Probability, 3:573–596, 1975.
[12] C. R. Shalizi and K. L. Shalizi. In M. Chickering and J. Halpern, eds., Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference, pp. 504–511. AUAI Press, 2004.
[13] C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics, 104:817–879, 2001.
[14] H. Jaeger. Neural Computation, 12:1371–1398, 2000.
[15] D. Ron, Y. Singer, and N. Tishby. Machine Learning, 25:117–149, 1996.
[16] P. Bühlmann and A. J. Wyner. Annals of Statistics, 27:480–513, 1999.
[17] N. U. Ahmed. Linear and Nonlinear Filtering for Scientists and Engineers. World Scientific, 1998.
[18] D. R. Upper. PhD thesis, University of California, Berkeley, 1997.
[19] E. Schneidman, S. Still, M. J. Berry, and W. Bialek. Physical Review Letters, 91:238701, 2003.
[20] C. K. Chow and C. N. Liu. IEEE Transactions on Information Theory, IT-14:462–467, 1968.
[21] T. H. Cormen et al. Introduction to Algorithms. 2nd ed. MIT Press, 2001.
[22] S. Amari. IEEE Transactions on Information Theory, 47:1701–1711, 2001.
[23] S. Kirshner, P. Smyth, and A. Robertson. Tech. Rep. 04-04, UC Irvine, Information and Computer Science, 2004.
[24] M. S. Olufsen et al. Journal of Computational Neuroscience, 14:33–54, 2003.
6 0.066621579 178 nips-2005-Soft Clustering on Graphs
7 0.066341288 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
8 0.063720793 148 nips-2005-Online Discovery and Learning of Predictive State Representations
9 0.053003304 79 nips-2005-Fusion of Similarity Data in Clustering
10 0.051099032 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
11 0.050639745 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
12 0.049756512 46 nips-2005-Consensus Propagation
13 0.049372479 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
14 0.048324849 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
15 0.046892852 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
16 0.046839565 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.045572523 2 nips-2005-A Bayes Rule for Density Matrices
18 0.044910584 167 nips-2005-Robust design of biological experiments
19 0.040669937 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
20 0.039202802 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
topicId topicWeight
[(0, 0.151), (1, 0.02), (2, -0.011), (3, -0.041), (4, 0.027), (5, -0.011), (6, -0.064), (7, -0.028), (8, -0.046), (9, -0.018), (10, -0.005), (11, 0.095), (12, -0.02), (13, -0.027), (14, 0.048), (15, 0.097), (16, -0.044), (17, 0.134), (18, 0.134), (19, 0.068), (20, -0.124), (21, 0.128), (22, -0.144), (23, -0.029), (24, -0.066), (25, -0.042), (26, 0.059), (27, -0.084), (28, 0.022), (29, 0.075), (30, 0.107), (31, -0.082), (32, -0.026), (33, 0.001), (34, 0.092), (35, 0.04), (36, -0.098), (37, -0.145), (38, 0.006), (39, 0.105), (40, 0.069), (41, 0.065), (42, -0.074), (43, 0.117), (44, -0.019), (45, 0.069), (46, 0.011), (47, 0.009), (48, 0.019), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.93175662 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
2 0.65411335 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
3 0.51983368 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
4 0.46496159 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.44851834 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
Author: Michaël Aupetit
Abstract: Given a set of points and a set of prototypes representing them, how to create a graph of the prototypes whose topology accounts for that of the points? This problem had not yet been explored in the framework of statistical learning theory. In this work, we propose a generative model based on the Delaunay graph of the prototypes and the ExpectationMaximization algorithm to learn the parameters. This work is a first step towards the construction of a topological model of a set of points grounded on statistics. 1 1.1
6 0.43641952 148 nips-2005-Online Discovery and Learning of Predictive State Representations
7 0.41884798 62 nips-2005-Efficient Estimation of OOMs
8 0.39043069 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
10 0.38416573 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
11 0.36500078 167 nips-2005-Robust design of biological experiments
12 0.36419046 79 nips-2005-Fusion of Similarity Data in Clustering
13 0.35879153 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
14 0.3534503 178 nips-2005-Soft Clustering on Graphs
15 0.34998107 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
16 0.34978706 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
17 0.3416034 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
18 0.32659051 35 nips-2005-Bayesian model learning in human visual perception
19 0.29434803 108 nips-2005-Layered Dynamic Textures
20 0.28494409 2 nips-2005-A Bayes Rule for Density Matrices
topicId topicWeight
[(3, 0.035), (10, 0.061), (11, 0.013), (27, 0.027), (31, 0.05), (34, 0.093), (55, 0.049), (57, 0.03), (69, 0.064), (73, 0.024), (75, 0.225), (77, 0.016), (88, 0.104), (91, 0.113)]
simIndex simValue paperId paperTitle
same-paper 1 0.7999509 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
2 0.76888132 109 nips-2005-Learning Cue-Invariant Visual Responses
Author: Jarmo Hurri
Abstract: Multiple visual cues are used by the visual system to analyze a scene; achromatic cues include luminance, texture, contrast and motion. Singlecell recordings have shown that the mammalian visual cortex contains neurons that respond similarly to scene structure (e.g., orientation of a boundary), regardless of the cue type conveying this information. This paper shows that cue-invariant response properties of simple- and complex-type cells can be learned from natural image data in an unsupervised manner. In order to do this, we also extend a previous conceptual model of cue invariance so that it can be applied to model simple- and complex-cell responses. Our results relate cue-invariant response properties to natural image statistics, thereby showing how the statistical modeling approach can be used to model processing beyond the elemental response properties visual neurons. This work also demonstrates how to learn, from natural image data, more sophisticated feature detectors than those based on changes in mean luminance, thereby paving the way for new data-driven approaches to image processing and computer vision. 1
3 0.6521998 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
Author: O. P. Kreidl, Alan S. Willsky
Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1
4 0.6425758 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
Author: Firas Hamze, Nando de Freitas
Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.
5 0.64215714 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. 
Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. 
Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). 
Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions.

[Figure 1 here: panel (a) plots likelihood, prior, posterior, and the Laplace and EP approximations q(f|y); panel (b) is a two-dimensional caricature.]

Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N(f|0, 5²) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace's approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace's approximation. In Panel (b) we caricature a high dimensional zero-mean Gaussian prior as an ellipse. The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin.

For a test input x∗ the approximate predictive latent and class probabilities are:

q(f∗|D, θ, x∗) = N(µ∗, σ∗²),  and  q(y∗ = 1|D, x∗) = Φ(µ∗ / √(1 + σ∗²)),

where µ∗ = k∗ᵀ K⁻¹ m and σ∗² = k(x∗, x∗) − k∗ᵀ (K⁻¹ − K⁻¹AK⁻¹) k∗, and where the vector k∗ = [k(x1, x∗), …, k(xm, x∗)]ᵀ collects the covariances between x∗ and the training inputs X.
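The predictive equations above translate directly into code. In the sketch below, m and A are made-up placeholder values standing in for the output of LA or EP, not results of either method:

```python
import numpy as np
from math import erf, sqrt

# Sketch of the approximate GPC predictive equations:
#   mu*  = k*^T K^-1 m
#   var* = k(x*, x*) - k*^T (K^-1 - K^-1 A K^-1) k*
#   q(y* = 1) = Phi(mu* / sqrt(1 + var*))

def predict(k_star, k_ss, K, m, A):
    Kinv = np.linalg.inv(K)
    mu = k_star @ Kinv @ m                                    # predictive latent mean
    var = k_ss - k_star @ (Kinv - Kinv @ A @ Kinv) @ k_star   # predictive latent variance
    p = 0.5 * (1.0 + erf(mu / sqrt(1.0 + var) / sqrt(2.0)))   # Phi(mu / sqrt(1 + var))
    return mu, var, p

K = np.array([[1.0, 0.5], [0.5, 1.0]])   # toy training covariance
m = np.array([1.0, -1.0])                # placeholder approximate mean
A = 0.3 * np.eye(2)                      # placeholder approximate covariance
mu, var, p = predict(np.array([0.5, 0.3]), 1.0, K, m, A)
```

Note how the class probability averages the sigmoid over the latent uncertainty: a larger var* pulls p towards 1/2 even when mu* is unchanged.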
MCMC sampling has the advantage that it becomes exact in the limit of long runs, and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, and thus use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m³), as is common among non-sparse GP models, due to inversions of m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning, and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon.

3 Structural Properties of the Posterior and its Approximations

Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N(f|0, K) centred at the origin. Each likelihood term p(yi|fi) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2].
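To make the AIS idea concrete, here is a toy sketch of our own construction (not the paper's setup): estimating the normaliser Z = ∫ N(f|0, σ²) Φ(f) df, which by symmetry equals exactly 1/2, giving a ground-truth check on the estimator. Intermediate distributions interpolate between the Gaussian prior (β = 0) and the target (β = 1), and each chain accumulates an importance weight:

```python
import numpy as np
from math import erf

# Toy Annealed Importance Sampling: estimate Z = E_prior[Phi(f)] with
# intermediate densities p_beta(f) propto N(f|0, sigma^2) Phi(f)^beta.
# The true value is Z = 1/2 (P(U <= F) with U ~ N(0,1), F ~ N(0, sigma^2)).

rng = np.random.default_rng(0)
sigma = 2.0
Phi = lambda f: 0.5 * (1.0 + np.vectorize(erf)(f / np.sqrt(2.0)))

def log_target(f, beta):   # unnormalised log p_beta
    return -0.5 * f**2 / sigma**2 + beta * np.log(Phi(f))

n_chains, betas = 2000, np.linspace(0.0, 1.0, 51)
f = rng.normal(0.0, sigma, n_chains)          # exact samples from the beta=0 prior
log_w = np.zeros(n_chains)
for b_prev, b in zip(betas[:-1], betas[1:]):
    log_w += (b - b_prev) * np.log(Phi(f))    # weight increment before the move
    prop = f + 0.5 * rng.normal(size=n_chains)                 # Metropolis step at beta=b
    accept = np.log(rng.uniform(size=n_chains)) < log_target(prop, b) - log_target(f, b)
    f = np.where(accept, prop, f)

Z_hat = np.exp(log_w).mean()   # unbiased AIS estimate of Z; true value is 0.5
```

Since each Metropolis kernel leaves its intermediate density invariant, the weight average is unbiased for Z for any number of temperatures; the annealing merely keeps the weight variance small.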
Intuitively this occurs because in high dimensions the volume grows extremely rapidly with the radius. As a consequence the mode becomes less representative (typical) of the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative of the posterior distribution, while the mean moves towards the mass of the posterior, making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware that we are making a Gaussian approximation to a non-Gaussian posterior. From the properties of the posterior it can be expected that Laplace's method places m in the right orthant but too close to the origin, such that the approximation overlaps with regions having practically zero posterior mass. As a consequence the amplitude of the approximate latent posterior GP will be systematically underestimated, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption is examined empirically below.

4 Experiments

In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form:

k(x, x′|θ) = σ² exp(−‖x − x′‖² / (2ℓ²)),   (1)

such that θ = [σ, ℓ]. We refer to σ² as the signal variance and to ℓ as the characteristic length-scale.
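The covariance function (1) is a direct transcription into code:

```python
import numpy as np

# The squared-exponential covariance function (1):
#   k(x, x') = sigma^2 exp(-||x - x'||^2 / (2 l^2))

def se_cov(X1, X2, sigma, ell):
    """Covariance matrix between the rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return sigma**2 * np.exp(-0.5 * sq / ell**2)

X = np.array([[0.0], [1.0]])
K = se_cov(X, X, sigma=1.0, ell=1.0)
# diagonal entries equal the signal variance sigma^2; off-diagonals decay
# with distance at a rate set by the characteristic length-scale ell
```

In the ML-II grid searches of Figure 2, it is exactly log ℓ and log σ (the two arguments above) that are varied.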
Note that for many classification tasks it may be reasonable to use an individual length-scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function, and we believe the conclusions about the accuracy of approximations to be independent of this choice, since they rest on arguments which are independent of the form of the covariance function. As a measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ᵢ = p(yᵢ = 1|D, θ, xᵢ) be the model's prediction; then we average

I(p∗ᵢ, yᵢ) = ((yᵢ + 1)/2) log₂(p∗ᵢ) + ((1 − yᵢ)/2) log₂(1 − p∗ᵢ) + H   (2)

over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3's vs. 5's (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2.
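The information measure (2) can be sketched as follows; the toy labels below are our own illustrative example:

```python
import numpy as np

# The information measure (2): average excess information (in bits) of the
# predictions p* over random guessing, with H the entropy of the training labels.

def avg_information(p_star, y_test, y_train):
    p1 = np.mean(y_train == 1)                          # training class frequency
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))
    I = ((y_test + 1) / 2) * np.log2(p_star) + ((1 - y_test) / 2) * np.log2(1 - p_star)
    return np.mean(I) + H

y_tr = np.array([1, 1, -1, -1])                         # balanced labels: H = 1 bit
perfect = avg_information(np.array([0.999, 0.001]), np.array([1, -1]), y_tr)
guess = avg_information(np.array([0.5, 0.5]), np.array([1, -1]), y_tr)
# near-perfect confident predictions score close to 1 bit; p* = 1/2 scores 0 bits
```

Random guessing scores exactly zero by construction, which is why (2) is a natural baseline-corrected counterpart to the error rate E.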
[Figure 2 here: a 4×3 grid of contour plots over log lengthscale log(ℓ) and log magnitude log(σf).]

Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively.

[Figure 3 here: histograms of posterior marginals with the Laplace and EP approximations overlaid.]

Figure 3: Panels (a) and (b) show two marginal distributions p(fi|D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples.

For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace's method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well¹, given that the marginal likelihood comes as a 767- respectively 200-dimensional integral. The EP predictions contain as much information about the test cases as the MCMC predictions, and significantly more than those of LA. Note that for small signal variances (roughly ln(σ²) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3's vs.
5's we now inspect the marginal distributions p(fi|D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln(ℓ) = 2.85, which lie between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f|D, θ). For LA and EP the approximate marginals are q(fi|D, θ) = N(fi|mi, Aii), where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace's approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1%, whereas LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very well with a Gaussian approximation. This seems to contradict the description given in the previous section where we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767-dimensional Gaussian N(x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues equal to 1. We then truncate this distribution such that all xi ≥ 0.

¹ Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale.
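A low-dimensional version of this synthetic experiment can be sketched as follows (we shrink from 767 to 10 dimensions, an assumption of ours, so that plain rejection sampling remains feasible; the covariance keeps the same structure of one large eigenvalue along the all-ones direction):

```python
import numpy as np

# Sketch of the truncated-Gaussian experiment in d = 10 dimensions:
# covariance with eigenvalue 100 along the all-ones vector and 1 elsewhere,
# truncated to the positive orthant by rejection sampling.

rng = np.random.default_rng(1)
d, big = 10, 100.0
C = np.eye(d) + (big - 1.0) / d * np.ones((d, d))   # eigenvalues: 100 (along 1), else 1

L = np.linalg.cholesky(C)
x = rng.normal(size=(200_000, d)) @ L.T             # samples from N(0, C)
x = x[(x >= 0).all(axis=1)]                         # keep only the positive orthant

marginal = x[:, 0]
# The mode of the truncated density is at the origin, but the marginal mass
# sits well away from zero and looks close to Gaussian (cf. Figure 3c).
skew = np.mean((marginal - marginal.mean())**3) / marginal.std()**3
```

The strong positive correlation induced by the large shared eigenvalue is what makes rejection feasible here, and it is also what makes the truncated marginal look so nearly Gaussian despite the hard constraint.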
Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi. The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian; a Laplace approximation centred on the origin, however, would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well-known benchmark problems. Each data set is randomly split into 10 folds, of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1. Comparing Laplace's method to EP, the latter proves to be more accurate both in terms of error rate and information.
While the error rates are relatively similar, the predictive distribution obtained by EP proves to be more informative about the test targets. Note that for GPC the error rate only depends on the sign of the mean µ∗ of the approximated posterior over latent functions, and not on the entire posterior predictive distribution. As is to be expected, the length of the mean vector m is much larger for the EP approximations. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small an amplitude, although the sign of the mean function will be mostly correct. As a consequence LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations).
                              Laplace                  EP                      SVM
Data Set        m    n   E%     I      ‖m‖     E%     I       ‖m‖      E%     I
Ionosphere     351   34   8.84  0.591   49.96   7.99  0.661    124.94   5.69  0.681
Wisconsin      683    9   3.21  0.804   62.62   3.21  0.805     84.95   3.21  0.795
Pima Indians   768    8  22.77  0.252   29.05  22.63  0.253     47.49  23.01  0.232
Crabs          200    7   2.0   0.682  112.34   2.0   0.908   2552.97   2.0   0.047
Sonar          208   60  15.36  0.439   26.86  13.85  0.537  15678.55  11.14  0.567
USPS 3 vs 5   1540  256   2.27  0.849  163.05   2.21  0.902  22011.70   2.01  0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of the inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.

The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to assess the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].
Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.