nips nips2003 nips2003-100 knowledge-graph by maker-knowledge-mining

100 nips-2003-Laplace Propagation


Source: pdf

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a novel method for approximate inference in Bayesian models and regularized risk functionals. [sent-10, score-0.028]

2 It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. [sent-11, score-0.251]

3 In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. [sent-12, score-0.484]

4 1 Introduction Inference via Bayesian estimation can lead to optimization problems over rather large data sets. [sent-13, score-0.063]

5 Exact computation in these cases is often computationally intractable, which has led to many approximation algorithms, such as variational approximation [5], or loopy belief propagation. [sent-14, score-0.236]

6 However, most of these methods still rely on the propagation of the exact probabilities (upstream and downstream evidence in the case of belief propagation), rather than an approximation. [sent-15, score-0.298]

7 This approach becomes costly if the random variables are real valued or if the graphical model contains large cliques. [sent-16, score-0.039]

8 EP works by propagating the sufficient statistics of an exponential family, that is, mean and variance for the normal distribution, between various factors of the posterior. [sent-18, score-0.088]

9 In the following we develop a cheap version of EP which requires only the Laplace approximation in each step and show how this can be applied to SVM and Gaussian Processes. [sent-21, score-0.09]

10 Hence we approximate $\mathbf{E}_{p(\theta|X)}[\theta] \approx \operatorname{argmax}_\theta p(\theta|X) = \operatorname{argmin}_\theta\,[-\log p(\theta|X)] =: \hat\theta$ (1) and $\operatorname{Var}_{p(\theta|X)}[\theta] \approx \big(\partial_\theta^2\,[-\log p(\theta|X)]\big|_{\theta=\hat\theta}\big)^{-1}$ (2). This is commonly referred to as the Laplace approximation. [sent-25, score-0.149]

11 It is exact for normal distributions and works best if $\theta$ is strongly concentrated around its mean. [sent-26, score-0.115]
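As a concrete illustration (a minimal sketch, not code from the paper), the mode in (1) can be found by numerically minimizing a user-supplied `neg_log_post`, and the covariance in (2) taken as the inverse of a finite-difference Hessian at that mode; the quadratic test case at the end is an assumed example for which the approximation is exact.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(neg_log_post, theta0, eps=1e-4):
    """Laplace approximation of p(theta|X) from -log p(theta|X):
    returns the mode (eq. 1) and an inverse-Hessian covariance (eq. 2)."""
    mode = minimize(neg_log_post, theta0).x          # argmin of -log p(theta|X)
    d = len(mode)
    E = np.eye(d) * eps
    H = np.zeros((d, d))                             # central-difference Hessian
    for i in range(d):
        for j in range(d):
            H[i, j] = (neg_log_post(mode + E[i] + E[j])
                       - neg_log_post(mode + E[i] - E[j])
                       - neg_log_post(mode - E[i] + E[j])
                       + neg_log_post(mode - E[i] - E[j])) / (4 * eps ** 2)
    return mode, np.linalg.inv(H)

# Assumed example: a Gaussian negative log-posterior, for which (1)-(2) are exact.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
mode, cov = laplace_approximation(lambda th: 0.5 * th @ A @ th, np.zeros(2))
# mode is ~0 and cov is ~inv(A).
```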

12 However, if p(θ|X) has special structure, such as being the product of several simple terms, possibly each of them dependent only on a small number of variables at a time, computational savings can be gained. [sent-28, score-0.029]

13 2.1 Approximate Inference For the sake of simplicity in notation we drop the explicit dependency of $\theta$ on $X$ and, as in [6], we assume that $p(\theta) = \prod_{i=1}^{N} t_i(\theta)$. (3) [sent-31, score-0.425]

14 Our strategy relies on the assumption that if we succeed in finding good approximations $\tilde t_i(\theta)$ of each of the terms $t_i(\theta)$, we will obtain an approximate maximizer $\tilde\theta$ of $p(\theta)$ by maximizing $\tilde p(\theta) := \prod_i \tilde t_i(\theta)$. [sent-32, score-0.939]

15 Key is a good approximation of each of the $t_i$ at the maximum of $p(\theta)$. [sent-33, score-0.515]

16 This is done by forming $p_i(\theta) := t_i(\theta) \prod_{j=1,\,j\neq i}^{N} \tilde t_j(\theta)$ (4) and subsequently using the Laplace approximation of $t_i(\theta)$ at $\tilde\theta_i := \operatorname{argmax}_\theta p_i(\theta)$ as the new estimate $\tilde t_i(\theta)$. [sent-35, score-1.547]

17 The following lemma shows that this strategy is valid: Lemma 1 (Fixed Point of Laplace Propagation) For all second-order fixed points the following holds: θ∗ is a fixed point of Laplace propagation if and only if it is a local optimum of p(θ). [sent-37, score-0.19]

18 Then the first-order optimality conditions require $\partial_\theta \log p_i(\theta^*) = 0$ for all $i$, and the Laplace approximation yields $\partial_\theta \log \tilde t_i(\theta^*) = \partial_\theta \log t_i(\theta^*)$ and $\partial_\theta^2 \log \tilde t_i(\theta^*) = \partial_\theta^2 \log t_i(\theta^*)$. [sent-39, score-2.146]

19 Consequently, up to second order, the derivatives of $p$, $p_i$, and $\tilde p$ agree at $\theta^*$, which implies that $\theta^*$ is a local optimum. [sent-40, score-0.118]

20 Then again, the $\partial_\theta \log p_i(\theta^*)$ have to vanish, since the Laplace approximation is exact up to second order. [sent-42, score-0.287]

21 This means that all $\tilde t_i$ will also have an optimum at $\theta^*$, which means that $\theta^*$ is a fixed point. [sent-43, score-0.449]

22 The next step is to establish methods for updating the approximations $\tilde t_i$ of $t_i$. [sent-44, score-0.909]

23 One option is to perform such updates sequentially, thereby improving only one $\tilde t_i$ at a time. [sent-45, score-0.457]

24 This is advantageous if we can process only one approximation at a time. [sent-46, score-0.09]

25 For parallel processing, however, we will perform several operations at a time, that is, recompute several $\tilde t_i(\theta)$ and merge the new approximations subsequently. [sent-47, score-0.603]
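A hedged Python sketch of this scheme (not the authors' implementation): each factor is supplied as a callable returning $-\log t_i(\theta)$, each $\tilde t_i$ is stored as a quadratic $\frac{1}{2}\theta^\top A_i\theta - b_i^\top\theta$ in negative log-space, and the finite-difference derivatives, the solver choice, and the assumption that the summed curvature stays invertible are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize, approx_fprime

def _hessian(f, x, eps=1e-4):
    # Central-difference Hessian of a scalar function f at x.
    d, E = len(x), np.eye(len(x)) * eps
    return np.array([[(f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                       - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps ** 2)
                      for j in range(d)] for i in range(d)])

def laplace_propagation(neg_log_t, theta0, sweeps=5, parallel=False):
    """LP sketch: refit the quadratic model of each -log t_i from a Laplace
    step on p_i = t_i * prod_{j != i} t~_j (eq. 4), sequentially or in parallel."""
    N, d = len(neg_log_t), len(theta0)
    A, b = np.zeros((N, d, d)), np.zeros((N, d))
    theta = np.asarray(theta0, dtype=float)
    for _ in range(sweeps):
        A_new, b_new = A.copy(), b.copy()
        for i, f in enumerate(neg_log_t):
            A_rest, b_rest = A.sum(0) - A[i], b.sum(0) - b[i]
            obj = lambda th, f=f: f(th) + 0.5 * th @ A_rest @ th - b_rest @ th
            theta = minimize(obj, theta).x               # maximizer of p_i
            H = _hessian(f, theta)                       # curvature of -log t_i at theta
            g = approx_fprime(theta, f, 1e-6)            # gradient of -log t_i at theta
            A_new[i], b_new[i] = H, H @ theta - g        # second-order (Laplace) model
            if not parallel:                             # sequential: use it right away
                A[i], b[i] = A_new[i], b_new[i]
        A, b = A_new, b_new                              # parallel: merge after the sweep
    # Maximizer of the combined approximation prod_i t~_i (assumes sum(A) invertible).
    return np.linalg.solve(A.sum(0), b.sum(0))
```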

26 We will see how the BCM is a one-step approximation of LP in the parallel case, whereas SV chunking is an exact implementation of LP in the sequential case. [sent-48, score-0.576]

27 2.2 Message Passing Message passing [7] has been widely successful for inference in graphical models. [sent-50, score-0.131]

28 Assume that we can split $\theta$ into a (not necessarily disjoint) set of coordinates, say $\theta_{C_1}, \ldots$ [sent-51, score-0.041]

29 Then the goal of computing a Laplace approximation of $p_i$ reduces to computing a Laplace approximation for the subset of variables $\theta_{C_i}$, since these are the only coordinates $t_i$ depends on. [sent-55, score-0.696]

30 For directed graphical models, these [sent-61, score-0.039]

31 are the conditional probabilities governing the parents and children. [sent-62, score-0.025]

32 Hence, to carry out calculations we only need to consider local information regarding $\tilde t_i(\theta_{C_i})$. [sent-66, score-0.425]

[Figure: a directed graphical model over $\theta_1, \ldots, \theta_5$, with $\theta_3$ depending on $\theta_1, \theta_2$ and with children $\theta_4, \theta_5$.]

33 In the example above, $\theta_3$ depends on $(\theta_1, \theta_2)$, and $(\theta_4, \theta_5)$ are conditionally independent of $\theta_1$ and $\theta_2$ given $\theta_3$. [sent-67, score-0.048]

34 To find the Laplace approximation corresponding to the terms involving $\theta_3$ we only need to consider $p(\theta_3|\theta_1, \theta_2)$ itself and its neighbors “upstream” and “downstream” of $\theta_3$ containing $\theta_1, \theta_2, \theta_3$ in their functional form. [sent-69, score-0.09]

35 This means that LP can be used as a drop-in replacement of exact inference in message passing algorithms. [sent-70, score-0.223]

36 The main difference is that we now propagate mean and variance from the Laplace approximation rather than true probabilities (as in message passing) or true means and variances (as in expectation propagation). [sent-71, score-0.265]
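As a toy illustration of what gets propagated (not from the paper), combining independent scalar Gaussian messages amounts to adding precisions and precision-weighted means:

```python
def combine_gaussian_messages(means, variances):
    """Multiply independent scalar Gaussian messages N(m_k, v_k):
    precisions add, precision-weighted means add."""
    precision = sum(1.0 / v for v in variances)
    mean = sum(m / v for m, v in zip(means, variances)) / precision
    return mean, 1.0 / precision

# combine_gaussian_messages([0.0, 2.0], [1.0, 4.0]) -> (0.4, 0.8)
```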

37 3 Bayes Committee Machine In this section we show that the Bayes Committee Machine (BCM) [9] corresponds to one step of LP in conjunction with a particular initialization, namely constant $\tilde t_i$. [sent-72, score-0.425]

38 ..., $Z_N$, which are conditionally independent of each other, given a parameter $\theta$, as depicted in the figure on the right. [sent-78, score-0.048]

39 [Figure: the parameter $\theta$ with the conditionally independent observations $Z_1, \ldots, Z_N$ as its children.] [sent-80, score-0.108]

40 Finally, Tresp and coworkers [9] find Laplace approximations for $p(\theta|Z_i) \propto p(Z_i|\theta)\,p(\theta)$ with respect to $\theta$. [sent-84, score-0.059]

41 3.2 Rewriting the BCM The repeated invocation of Bayes’ rule seems wasteful, yet it was necessary in the context of the BCM formulation to explain how estimates from subsets could be combined in a committee-like fashion. [sent-87, score-0.213]

42 In Gaussian processes, we generally assume that $p(\theta)$ is normal, hence $\log t_0(\theta)$ is quadratic. [sent-90, score-0.028]

43 This allows us to state the LP algorithm to find the mode and curvature of $p(\theta|Z)$: Algorithm 1 (Iterated Bayes Committee Machine): initialize $\tilde t_0 \leftarrow c\,p(\theta)$ and $\tilde t_i(\theta) \leftarrow \text{const.}$ [sent-91, score-0.449]

44 Repeat: compute new approximations $\tilde t_i(\theta)$ in parallel by finding Laplace approximations to $p_i$, as defined in (4). [sent-92, score-0.753]

45 For $i = 0$ we obtain $p_0(\theta) = t_0(\theta)\prod_{j=1}^{N}\tilde t_j(\theta)$; the loop runs until convergence and returns $\operatorname{argmax}_\theta\, \tilde t_0(\theta)\prod_{i=1}^{N}\tilde t_i(\theta)$. [sent-94, score-0.425]

46 With $t_0(\theta) = p(\theta)$ and $t_i(\theta) = p(Z_i|\theta)$, this reads $p_i(\theta) = t_i(\theta)\prod_{j=0,\,j\neq i}^{N}\tilde t_j(\theta)$. (9) [sent-95, score-0.425]

47 Note that in the first iteration (9) can be written as $p_i \propto p(\theta)\,p(Z_i|\theta)$, since all remaining terms $\tilde t_j$ are constant. [sent-96, score-0.516]

48 This means that after the first update $\tilde t_i$ is identical to the estimates obtained from the BCM. [sent-97, score-0.425]

49 Whereas the BCM stops at this point, we have the liberty to continue the approximation, and also to choose between a parallel and a sequential update regime, depending on the number of processing units available. [sent-98, score-0.346]

50 As a side-effect, we obtain a simplified proof of the following: Theorem 2 (Exact BCM [9]) For normal distributions the BCM is exact, that is, the Iterated BCM converges in one step. [sent-99, score-0.057]

51 Proof: For normal distributions all $\tilde t_i$ are exact, hence $\tilde p(\theta) = \prod_i \tilde t_i(\theta) = \prod_i t_i(\theta) = p(\theta)$, which shows that $\tilde p = p$. [sent-100, score-1.36]
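A minimal sketch of the jointly normal combination behind Theorem 2 (an illustration, not the authors' code): each partition contributes a Gaussian approximation of $p(\theta|Z_i)$ and the prior is divided out $N-1$ times, which in precision form reads as follows.

```python
import numpy as np

def bcm_combine(means, covs, prior_mean, prior_cov):
    """Combine Gaussian (Laplace) approximations N(means[i], covs[i]) of
    p(theta|Z_i) into one Gaussian for p(theta|Z), dividing out the prior
    N-1 times as in the Bayes Committee Machine."""
    N = len(means)
    P0 = np.linalg.inv(prior_cov)
    precision = sum(np.linalg.inv(C) for C in covs) - (N - 1) * P0
    shift = (sum(np.linalg.solve(C, m) for C, m in zip(covs, means))
             - (N - 1) * P0 @ prior_mean)
    cov = np.linalg.inv(precision)
    return cov @ shift, cov                 # combined mean and covariance
```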

52 Note that [9] formulates the problem as one of classification or regression, that is, $Z = (X, Y)$, where the labels $Y$ are conditionally independent given $X$ and the parameter $\theta$. [sent-101, score-0.048]

53 4 Support Vector Machines The optimization goals in Support Vector Machines (SVM) are very similar to those in Gaussian Processes: essentially the negative log posterior − log p(θ|Z) corresponds to the objective function of the SV optimization problem. [sent-103, score-0.222]

54 In the following we show that SVM chunking [4] and parallel SVM training [2] are special cases of LP. [sent-105, score-0.428]

55 Taking logarithms of (3) and defining $\pi_i(\theta) := -\log t_i(\theta)$ (and $\tilde\pi_i(\theta) := -\log \tilde t_i(\theta)$ analogously), we obtain the following formulation of LP in log-space. [sent-106, score-0.946]

56 For each $i \in \{1, \ldots, N\}$: minimize $\pi_i(\theta) + \sum_{j=1,\,j\neq i}^{N}\tilde\pi_j(\theta)$ and replace $\tilde\pi_i(\theta)$ by a Taylor approximation at the minimum $\tilde\theta_i$ of the above expression. [sent-110, score-0.09]

57 4.1 Chunking To show that SV chunking is equivalent to LP in log-space, we briefly review the basic ideas of chunking. [sent-112, score-0.334]

58 For the rest of the derivation we let $c(x, y, f(x)) = \max(0, 1 - yf(x))$ (the analysis still holds in the general case; however, it becomes considerably more tedious). [sent-115, score-0.053]

59 The dual of (10) becomes: minimize over $\alpha$ the objective $\frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^{m}\alpha_i$ [sent-116, score-0.296]

60 subject to $\sum_{i=1}^{m} y_i\alpha_i = 0$ and $\alpha_i \in [0, C]$. (11) The basic idea of chunking is to optimize only over subsets of the vector $\alpha$ at a time. [sent-118, score-0.305]
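As an illustration of one such step (a sketch under assumptions, not the paper's solver): the dual (11), restricted to a working set with all other variables held fixed, is handed to a generic constrained optimizer; `K` is assumed to be a precomputed kernel matrix, `work` an index array, and the SLSQP choice is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def chunking_step(K, y, alpha, work, C=1.0):
    """One chunking step: optimize the dual (11) over the working set
    `work` only, keeping the remaining alpha_j fixed."""
    fixed = np.setdiff1d(np.arange(len(y)), work)
    yw, yf = y[work], y[fixed]
    aw0, af = alpha[work], alpha[fixed]
    Kww, Kwf = K[np.ix_(work, work)], K[np.ix_(work, fixed)]

    def objective(aw):
        # 0.5 aw'Hww aw + af'Hfw aw - sum(aw), with H_ij = y_i y_j k(x_i, x_j)
        return (0.5 * (aw * yw) @ Kww @ (aw * yw)
                + (af * yf) @ Kwf.T @ (aw * yw) - aw.sum())

    cons = {"type": "eq", "fun": lambda aw: yw @ aw + yf @ af}
    res = minimize(objective, aw0, method="SLSQP",
                   bounds=[(0.0, C)] * len(work), constraints=[cons])
    out = alpha.copy()
    out[work] = res.x
    return out
```

Iterating this over successive working sets until the KKT conditions hold gives the usual chunking loop.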

61 Denote by $S_w$ the set of variables we are using in the current optimization step, let $\alpha_w$ be the corresponding vector, and denote by $\alpha_f$ the variables which remain unchanged. [sent-119, score-0.063]

62 Likewise denote by $y_w, y_f$ the corresponding parts of $y$, and let $H = \begin{pmatrix} H_{ww} & H_{wf} \\ H_{fw} & H_{ff} \end{pmatrix}$ be the quadratic matrix of (11), again split into terms depending on $\alpha_w$ and $\alpha_f$ respectively. [sent-120, score-0.134]

63 Then (11), restricted to $\alpha_w$, can be written as [4]: minimize $\frac{1}{2}\alpha_w^\top H_{ww}\alpha_w + \alpha_f^\top H_{fw}\alpha_w - \sum_{i\in S_w}\alpha_i$ [sent-121, score-0.036]

64 subject to $y_w^\top\alpha_w + y_f^\top\alpha_f = 0$ and $\alpha_i \in [0, C]$ for $i \in S_w$. (12) 4.2 Equivalence to LP We now show that the correction terms arising from chunking are the same as those arising from LP. [sent-124, score-0.45]

65 ... $\{1, \ldots, m\}$, and define $\pi_0(\theta, b) := \frac{1}{2}\|\theta\|^2$ and $\pi_i(\theta, b) := C\sum_{j\in S_i} c(x_j, y_j, f(x_j))$. [sent-131, score-0.151]

66 As for $\tilde\pi_i$ (with $i \neq 0$) we have $\tilde\pi_i = \big\langle \sum_{j\in S_i} y_j\beta_j\Phi(x_j),\, \theta\big\rangle + \sum_{j\in S_i} y_j\beta_j\, b = \langle\theta_i, \theta\rangle + b_i\, b$ (14), where $\beta_j \in C\,\partial c(x_j, y_j, f(x_j))$, $\theta_i := \sum_{j\in S_i} y_j\beta_j\Phi(x_j)$, and $b_i := \sum_{j\in S_i} y_j\beta_j$; in this case minimization over $\pi_i(\theta) + \sum_{j\neq i}\tilde\pi_j(\theta)$ amounts to minimizing $\frac{1}{2}\|\theta\|^2 + C\sum_{j\in S_i} c(x_j, y_j, f(x_j))$ plus the linear contributions of the blocks $j \notin S_i$. [sent-133, score-0.982]

67 Skipping technical details, the dual optimization problem is given by: minimize over $\alpha$ $\frac{1}{2}\sum_{j,l\in S_i}\alpha_j\alpha_l y_j y_l k(x_j, x_l) - \sum_{j\in S_i}\alpha_j - \sum_{j\in S_i,\, l\notin S_i}\alpha_j\beta_l y_j y_l k(x_j, x_l)$ subject to $\alpha_j \in [0, C]$ and $\sum_{j\in S_i} y_j\alpha_j - \sum_{j\notin S_i} y_j\beta_j = 0$. (15) [sent-137, score-0.829]

68 The latter is identical to (12), the optimization problem arising from chunking, provided that we perform the substitution $\alpha_j = -\beta_j$ for all $j \notin S_i$. [sent-138, score-0.128]

69 To show this last step, note that at optimality the null vector has to be an element of the subdifferential of $\pi_i(\theta, b)$ with respect to $(\theta, b)$. [sent-139, score-0.025]

70 Taking derivatives of $\pi_i + \sum_{j\neq i}\tilde\pi_j$ implies $\theta \in -C\sum_{j\in S_i}\partial_\theta\, c(x_j, y_j, f(x_j)) - C\sum_{j\neq i,0}\theta_j$. [sent-140, score-0.151]

71 Finally, to start the approximation scheme we need to consider a proper initialization of $\tilde\pi_i$. [sent-142, score-0.173]

72 In analogy to the BCM setting we use $\tilde\pi_i = 0$, which leads precisely to the SVM chunking method, where one optimizes over one subset at a time (denoted by $S_i$), while the other sets are fixed, taking only their linear contribution into account. [sent-143, score-0.28]

73 LP does not require that all the updates of $\tilde t_i$ (or $\tilde\pi_i$) be carried out sequentially. [sent-144, score-0.457]

74 Instead, we can also consider parallel approximations similar to [2]. [sent-145, score-0.178]

75 There the optimization problem is split into several small parts and each of them is solved independently. [sent-146, score-0.104]

76 This is equivalent to one-step parallel LP: with the initialization $\tilde\pi_i = 0$ for all $i \neq 0$ and $\tilde\pi_0 = \pi_0 = \frac{1}{2}\|\theta\|^2$, we minimize $\pi_i + \sum_{j\neq i}\tilde\pi_j$ in parallel. [sent-148, score-0.238]

77 This is equivalent to solving the SV optimization problem on the corresponding subset $S_i$ (as we saw in the previous section). [sent-149, score-0.063]

78 Hence, the linear terms $\theta_i, b_i$ arising from the approximation $\tilde\pi_i(\theta, b) = C\langle\theta_i, \theta\rangle + C b_i\, b$ lead to the overall approximation $\tilde\pi(\theta, b) = \sum_i \tilde\pi_i(\theta, b) = \frac{1}{2}\|\theta\|^2 + C\sum_i\big(\langle\theta_i, \theta\rangle + b_i\, b\big)$, (17) with the joint minimizer being the average of the individual solutions. [sent-150, score-0.283]
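A sketch of this one-step parallel scheme for a linear kernel (an assumed setup, with scikit-learn's `LinearSVC` as a stand-in chunk solver rather than the paper's own): each chunk is trained independently and the resulting $(w, b)$ pairs are averaged in the spirit of (17).

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_step_parallel_lp(X, y, n_chunks=5, C=1.0, seed=0):
    """Train a linear SVM on each chunk independently and average the
    resulting (w, b) pairs (one-step parallel LP in the linear case)."""
    rng = np.random.default_rng(seed)
    ws, bs = [], []
    for idx in np.array_split(rng.permutation(len(y)), n_chunks):
        clf = LinearSVC(C=C, loss="hinge").fit(X[idx], y[idx])
        ws.append(clf.coef_.ravel())
        bs.append(clf.intercept_[0])
    return np.mean(ws, axis=0), np.mean(bs)
```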

79 5 Experiments To test our ideas we performed a set of experiments with the widely available Web and Adult datasets from the UCI repository [1]. [sent-151, score-0.095]

80 4 MHz Intel Xeon machine with 1 GB RAM using MATLAB R13. [sent-153, score-0.028]

81 We first tested the performance of Gaussian process training with Laplace propagation using a logistic loss function. [sent-155, score-0.054]

82 The data was partitioned into chunks of roughly 500 samples each, and the maximum number of columns in the low-rank approximation [3] was set to 750. [sent-156, score-0.197]
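For illustration, a Nyström-style low-rank factorization capped at a maximum number of columns (a related scheme sketched under assumptions; the paper relies on the factorization of [3], and `kernel` is an assumed callable returning a kernel matrix between two sets of inputs):

```python
import numpy as np

def low_rank_factor(X, kernel, max_columns=750, seed=0):
    """Return V with K ~= V V^T, built from at most `max_columns`
    randomly chosen kernel columns (Nystrom-style sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(max_columns, len(X)), replace=False)
    Knm = kernel(X, X[idx])                     # n x m block of kernel columns
    Kmm = kernel(X[idx], X[idx])                # m x m landmark block
    w, U = np.linalg.eigh(Kmm + 1e-8 * np.eye(len(idx)))   # jittered for stability
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-12, None))) @ U.T
    return Knm @ inv_sqrt                       # K is approximated by V @ V.T
```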

83 (Footnote 1) Note that we had to replace the equality with set inclusion because $c$ is not everywhere differentiable; hence we used sub-differentials instead. [sent-157, score-0.028]

84 TFactor refers to the time (in seconds) for computing the low rank factorization while TTrain denotes the training time for the Gaussian process. [sent-159, score-0.026]

85 We empirically observed that on all datasets the algorithm converges in less than 3 iterations using serial updates and in less than 6 iterations using parallel updates. [sent-160, score-0.294]

86 Table 1: Gaussian process training with serial and parallel Laplace propagation. [sent-203, score-0.226]

87 We conducted another set of experiments to test the speedups obtained by seeding the SMO with values of $\alpha$ from one iteration of Laplace propagation on the dataset. [sent-204, score-0.054]

88 We partitioned the Adult1 and Web1 datasets into 5 chunks each while the Adult4 and Web4 datasets were partitioned into 10 chunks each. [sent-206, score-0.286]
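A sketch of the seeding protocol (illustrative only): solve each chunk's dual separately, paste the pieces into one vector, and hand it to an SMO implementation that accepts a warm start; `subproblem_solver` and `smo_solve` are hypothetical interfaces, since standard library SVM solvers do not expose an $\alpha$ initialization.

```python
import numpy as np

def seed_alpha(K, y, chunk_indices, subproblem_solver, C=1.0):
    """Build a warm-start alpha by solving each chunk's dual separately
    (one parallel LP iteration) and pasting the pieces together."""
    alpha = np.zeros(len(y))
    for idx in chunk_indices:
        Kc = K[np.ix_(idx, idx)]
        alpha[idx] = subproblem_solver(Kc, y[idx], C)   # per-chunk dual solution
    return alpha

# alpha0 = seed_alpha(K, y, chunks, subproblem_solver)   # seed from one LP pass
# alpha_star = smo_solve(K, y, C, alpha0)                # hypothetical warm-started SMO
```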

89 As can be seen, our initialization significantly speeds up the SMO, in many cases achieving up to a 4-fold speedup. [sent-276, score-0.083]

90 In general serial updates seem to perform better than parallel updates. [sent-278, score-0.258]

91 This is to be expected, since the serial algorithm uses information from other blocks as soon as it becomes available, while the parallel algorithm ignores the other blocks completely. [sent-279, score-0.226]

92 6 Summary And Discussion Laplace propagation fills the gap between Expectation Propagation, which requires exact computation of first and second order moments, and message passing algorithms when optimizing structured density functions. [sent-280, score-0.355]

93 Its main advantage is that it only requires the Laplace approximation in each computational step, while being applicable to a wide range of optimization tasks. [sent-281, score-0.153]

94 In this sense, it complements Minka’s Expectation Propagation, whenever exact expressions are not available. [sent-282, score-0.081]

95 As a side effect, we showed that Tresp’s Bayes Committee Machine and Support Vector Chunking methods are special instances of this strategy, which also sheds light on why simple averaging schemes for SVMs, such as the one of Collobert and Bengio, seem to work in practice. [sent-283, score-0.029]

96 The key point in our proofs was that we split the data into disjoint subsets. [sent-284, score-0.068]

97 By the assumption of independent and identically distributed data it followed that the variable assignments are conditionally independent from each other, given the parameter θ, which led to a favorable factorization property in p(θ|Z). [sent-285, score-0.105]

98 It should be noted that LP allows one to perform chunking-style optimization in Gaussian Processes, which effectively puts an upper bound on the amount of memory required for optimization purposes. [sent-286, score-0.126]

99 A parallel mixture of svms for very large scale problems. [sent-298, score-0.119]

100 Sequential minimal optimization: A fast algorithm for training support vector machines. [sent-344, score-0.032]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ti', 0.425), ('laplace', 0.342), ('chunking', 0.28), ('bcm', 0.257), ('lp', 0.222), ('tparallel', 0.188), ('tserial', 0.188), ('committee', 0.187), ('si', 0.165), ('yj', 0.151), ('propagation', 0.136), ('tnomod', 0.134), ('parallel', 0.119), ('xj', 0.109), ('serial', 0.107), ('smo', 0.099), ('svm', 0.093), ('pi', 0.091), ('approximation', 0.09), ('zi', 0.089), ('ep', 0.085), ('initialization', 0.083), ('gfed', 0.081), ('tfactor', 0.081), ('message', 0.073), ('bayes', 0.07), ('ci', 0.068), ('sv', 0.068), ('arising', 0.065), ('passing', 0.064), ('optimization', 0.063), ('approximations', 0.059), ('exact', 0.058), ('normal', 0.057), ('chunks', 0.056), ('downstream', 0.054), ('hf', 0.054), ('hww', 0.054), ('jjjj', 0.054), ('liberty', 0.054), ('propogation', 0.054), ('yf', 0.053), ('argmax', 0.053), ('partitioned', 0.051), ('log', 0.048), ('conditionally', 0.048), ('adult', 0.047), ('abc', 0.047), ('upstream', 0.047), ('expectation', 0.046), ('split', 0.041), ('yw', 0.04), ('graphical', 0.039), ('bi', 0.038), ('yl', 0.037), ('sw', 0.037), ('datasets', 0.036), ('minimize', 0.036), ('iterated', 0.034), ('minka', 0.033), ('support', 0.032), ('updates', 0.032), ('led', 0.031), ('propagating', 0.031), ('smola', 0.031), ('zn', 0.03), ('tresp', 0.03), ('repository', 0.03), ('bengio', 0.03), ('strategy', 0.03), ('jerusalem', 0.029), ('processes', 0.029), ('special', 0.029), ('ideas', 0.029), ('sequential', 0.029), ('uci', 0.028), ('inference', 0.028), ('hence', 0.028), ('machine', 0.028), ('disjoint', 0.027), ('agree', 0.027), ('factorization', 0.026), ('repeated', 0.026), ('gaussian', 0.026), ('kl', 0.026), ('optimality', 0.025), ('basic', 0.025), ('probabilities', 0.025), ('bayesian', 0.025), ('belief', 0.025), ('optimum', 0.024), ('mode', 0.024), ('gap', 0.024), ('mhz', 0.023), ('ttrain', 0.023), ('distri', 0.023), ('ables', 0.023), ('approximative', 0.023), ('complements', 0.023), ('eleazar', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 100 nips-2003-Laplace Propagation


2 0.14891419 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations. A central issue in computational neuroscience is the characterization of the functional relationship between sensory stimuli and neural spike trains. A common model for this relationship consists of linear filtering of the stimulus, followed by a nonlinear, probabilistic spike generation process. The linear filter is typically interpreted as the neuron’s “receptive field,” while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. One widely used model of this type is the Linear-Nonlinear-Poisson (LNP) cascade model, in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous (“memoryless”) nonlinear function of the filtered input. This model has a number of desirable features, including conceptual simplicity and computational tractability. Additionally, reverse correlation analysis provides a simple unbiased estimator for the linear filter [5], and the properties of estimators (for both the linear filter and static nonlinearity) have been thoroughly analyzed, even for the case of highly non-symmetric or “naturalistic” stimuli [12]. One important drawback of the LNP model, * JWP and LP contributed equally to this work. We thank E.J. Chichilnisky for helpful discussions. L−NLIF model LNP model )ekips(P Figure 1: Simulated responses of LNLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. Top: Raster of responses of L-NLIF model, where σnoise /σsignal = 0.5 and g gives a membrane time constant of 15 ms. The top row shows the fixed (deterministic) response of the model with σnoise set to zero. Middle: Raster of responses of LNP model, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. Bottom: (Black line) Post-stimulus time histogram (PSTH) of the simulated L-NLIF response. (Gray line) PSTH of the LNP model. Note that the LNP model fails to preserve the fine temporal structure of the spike trains, relative to the L-NLIF model. 001 05 0 )sm( emit however, is that Poisson processes do not accurately capture the statistics of neural spike trains [2, 9, 16, 1]. 
In particular, the probability of observing a spike is not a functional of the stimulus only; it is also strongly affected by the recent history of spiking. The leaky integrate-and-fire (LIF) model provides a biophysically more realistic spike mechanism with a simple form of spike-history dependence. This model is simple, wellunderstood, and has dynamics that are entirely linear except for a nonlinear “reset” of the membrane potential following a spike. Although this model’s overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and lack of active conductances), the nonlinear reset has significant functional importance for the model’s response properties. In previous work, we have shown that standard reverse correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation; we developed a new estimator for this model, and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells [15]. We and others have explored other “adaptive” properties of the LIF model [17, 13, 19]. In this paper, we consider a model consisting of a linear filter followed by noisy LIF spike generation with a spike-dependent after-current; this is essentially the standard LIF model driven by a noisy, filtered version of the stimulus, with an additional current waveform injected following each spike. We will refer to this as the the “L-NLIF” model. The probabilistic nature of this model provides several important advantages over the deterministic version we have considered previously. First, an explicit noise model allows us to couch the problem in the terms of classical estimation theory. This, in turn, provides a natural “cost function” (likelihood) for model assessment and leads to more efficient estimation of the model parameters. Second, noise allows us to explicitly model neural firing statistics, and could provide a rigorous basis for a metric distance between spike trains, useful in other contexts [18]. Finally, noise influences the behavior of the model itself, giving rise to phenomena not observed in the purely deterministic model [11]. Our main contribution here is to show that the maximum likelihood estimator (MLE) for the L-NLIF model is computationally tractable. Specifically, we describe an algorithm for computing the likelihood function, and prove that this likelihood function contains no non-global maxima, implying that the MLE can be computed efficiently using standard ascent techniques. The desirable statistical properties of this estimator (e.g. consistency, efficiency) are all inherited “for free” from classical estimation theory. Thus, we have a compact and powerful model for the neural code, and a well-motivated, efficient way to estimate the parameters of this model from extracellular data. The Model We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to i−1 dV = − gV (t) + k · x(t) + j=0 h(t − tj ) dt + σNt , (1) and resets to Vr whenever V = 1. Here, g denotes the leak conductance, k · x(t) the projection of the input signal x(t) onto the linear kernel k, h is an “afterpotential,” a current waveform of fixed amplitude and shape whose value depends only on the time since the last spike ti−1 , and Nt is an unobserved (hidden) noise process with scale parameter σ. 
Without loss of generality, the “leak” and “threshold” potential are set at 0 and 1, respectively, so the cell spikes whenever V = 1, and V decays back to 0 with time constant 1/g in the absence of input. Note that the nonlinear behavior of the model is completely determined by only a few parameters, namely {g, σ, Vr }, and h (where the function h is allowed to take values in some low-dimensional vector space). The dynamical properties of this type of “spike response model” have been extensively studied [7]; for example, it is known that this class of models can effectively capture much of the behavior of apparently more biophysically realistic models (e.g. Hodgkin-Huxley). Figures 1 and 2 show several simple comparisons of the L-NLIF and LNP models. In 1, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations [2, 16, 9]). The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful, as demonstrated in Fig. 2: by varying V r or h, for example, we can match a wide variety of dynamical behaviors (e.g. adaptation, bursting, bistability) known to exist in biological neurons. The Estimation Problem Our problem now is to estimate the model parameters {k, σ, g, Vr , h} from a sufficiently rich, dynamic input sequence x(t) together with spike times {ti }. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here. To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing it. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of voltage V (t) during an interspike interval. In the noiseless case [15], the voltage trace during an interspike interval t ∈ [ti−1 , ti ] is given by the solution to equation (1) with σ = 0:   V0 (t) = Vr e−gt + t ti−1 i−1 k · x(s) + j=0 h(s − tj ) e−g(t−s) ds, (2) A stimulus h current responses 0 0 0 1 )ces( t 0 2. 0 t stimulus x 0 B c responses c=1 h current 0 c=2 2. 0 c=5 1 )ces( t t 0 0 stimulus C 0 h current responses Figure 2: Illustration of diverse behaviors of L-NLIF model. A: Firing rate adaptation. A positive DC current (top) was injected into three model cells differing only in their h currents (shown on left: top, h = 0; middle, h depolarizing; bottom, h hyperpolarizing). Voltage traces of each cell’s response (right, with spikes superimposed) exhibit rate facilitation for depolarizing h (middle), and rate adaptation for hyperpolarizing h (bottom). B: Bursting. The response of a model cell with a biphasic h current (left) is shown as a function of the three different levels of DC current. For small current levels (top), the cell responds rhythmically. For larger currents (middle and bottom), the cell responds with regular bursts of spikes. C: Bistability. The stimulus (top) is a positive followed by a negative current pulse. Although a cell with no h current (middle) responds transiently to the positive pulse, a cell with biphasic h (bottom) exhibits a bistable response: the positive pulse puts it into a stable firing regime which persists until the arrival of a negative pulse. 0 0 1 )ces( t 0 5 0. t 0 which is simply a linear convolution of the input current with a negative exponential. 
It is easy to see that adding Gaussian noise to the voltage during each time step induces a Gaussian density over V (t), since linear dynamics preserve Gaussianity [8]. This density is uniquely characterized by its first two moments; the mean is given by (2), and its covariance T is σ 2 Eg Eg , where Eg is the convolution operator corresponding to e−gt . Note that this density is highly correlated for nearby points in time, since noise is integrated by the linear dynamics. Intuitively, smaller leak conductance g leads to stronger correlation in V (t) at nearby time points. We denote this Gaussian density G(xi , k, σ, g, Vr , h), where index i indicates the ith spike and the corresponding stimulus chunk xi (i.e. the stimuli that influence V (t) during the ith interspike interval). Now, on any interspike interval t ∈ [ti−1 , ti ], the only information we have is that V (t) is less than threshold for all times before ti , and exceeds threshold during the time bin containing ti . This translates to a set of linear constraints on V (t), expressed in terms of the set Ci = ti−1 ≤t < 1 ∩ V (ti ) ≥ 1 . Therefore, the likelihood that the neuron first spikes at time ti , given a spike at time ti−1 , is the probability of the event V (t) ∈ Ci , which is given by Lxi ,ti (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), Ci the integral of the Gaussian density G(xi , k, σ, g, Vr , h) over the set Ci . sulumits Figure 3: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current (top). Top middle: Ten simulated voltage traces V (t), evaluated up to the first threshold crossing, conditional on a spike at time zero (Vr = 0). Note the strong correlation between neighboring time points, and the sparsening of the plot as traces are eliminated by spiking. Bottom Middle: Time evolution of P (V ). Each column represents the conditional distribution of V at the corresponding time (i.e. for all traces that have not yet crossed threshold). Bottom: Probability density of the interspike interval (isi) corresponding to this particular input. Note that probability mass is concentrated at the points where input drives V0 (t) close to threshold. rhtV secart V 0 rhtV )V(P 0 )isi(P 002 001 )cesm( t 0 0 Spiking resets V to Vr , meaning that the noise contribution to V in different interspike intervals is independent. This “renewal” property, in turn, implies that the density over V (t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the Gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, the full likelihood is L{xi ,ti } (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), i Ci where the product, again, is over all observed spike times {ti } and corresponding stimulus chunks {xi }. Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result now states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima. Theorem 1. The likelihood L{xi ,ti } (k, σ, g, Vr , h) has no non-global extrema in the parameters (k, σ, g, Vr , h), for any data {xi , ti }. The proof [14] is based on the log-concavity of L{xi ,ti } (k, σ, g, Vr , h) under a certain parametrization of (k, σ, g, Vr , h). 
The classical approach for establishing the nonexistence of non-global maxima of a given function uses concavity, which corresponds roughly to the function having everywhere non-positive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no non-global extrema, neither will g(f ), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than logconcavity, so logconcavity can be a powerful tool even in situations for which concavity is useless (the Gaussian density is logconcave but not concave, for example). Our proof relies on a particular theorem [3] establishing the logconcavity of integrals of logconcave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. We should also note that the proof extends without difficulty to some other noise processes which generate logconcave densities (where white noise has the standard Gaussian density); for example, the proof is nearly identical if Nt is allowed to be colored or nonGaussian noise, with possibly nonzero drift. Computational methods and numerical results Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. Now how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate Gaussian densities G over simple but high-dimensional orthants Ci . We discuss two ways to compute these integrals; each has its own advantages. The first technique can be termed “density evolution” [10, 13]. The method is based on the following well-known fact from the theory of stochastic differential equations [8]: given the data (xi , ti−1 ), the probability density of the voltage process V (t) up to the next spike ti satisfies the following partial differential (Fokker-Planck) equation: ∂P (V, t) σ2 ∂ 2 P ∂[(V − Veq (t))P ] = , +g 2 ∂t 2 ∂V ∂V under the boundary conditions (3) P (V, ti−1 ) = δ(V − Vr ), P (Vth , t) = 0; where Veq (t) is the instantaneous equilibrium potential:   i−1 1 Veq (t) = h(t − tj ) . k · x(t) + g j=0 Moreover, the conditional firing rate f (t) satisfies t ti−1 f (s)ds = 1 − P (V, t)dV. Thus standard techniques for solving the drift-diffusion evolution equation (3) lead to a fast method for computing f (t) (as illustrated in Fig. 2). Finally, the likelihood Lxi ,ti (k, σ, g, Vr , h) is simply f (ti ). While elegant and efficient, this density evolution technique turns out to be slightly more powerful than what we need for the MLE: recall that we do not need to compute the conditional rate function f at all times t, but rather just at the set of spike times {ti }, and thus we can turn to more specialized techniques for faster performance. We employ a rapid technique for computing the likelihood using an algorithm due to Genz [6], designed to compute exactly the kinds of multidimensional Gaussian probability integrals considered here. This algorithm works well when the orthants Ci are defined by fewer than ≈ 10 linear constraints on V (t). 
The number of actual constraints on V (t) during an interspike interval (ti+1 − ti ) grows linearly in the length of the interval: thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model [15], in which we discard all but a small subset of the constraints. The key point is that, due to strong correlations in the noise and the fact that the constraints only figure significantly when the V (t) is driven close to threshold, a small number of constraints often suffice to approximate the true likelihood to a high degree of precision. h mitse h eurt K mitse ATS K eurt 0 0 06 )ekips retfa cesm( t 03 0 0 )ekips erofeb cesm( t 001- 002- Figure 4: Demonstration of the estimator’s performance on simulated data. Dashed lines show the true kernel k and aftercurrent h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell, while h is function specified in a five-dimensional vector space, whose shape induces a slight degree of burstiness in the model’s spike responses. The L-NLIF model was stimulated with parameters g = 0.05 (corresponding to a membrane time constant of 20 time-samples), σ noise = 0.5, and Vr = 0. The stimulus was 30,000 time samples of white Gaussian noise with a standard deviation of 0.5. With only 600 spikes of output, the estimator is able to retrieve an estimate of k (gray curve) which closely matches the true kernel. Note that the spike-triggered average (black curve), which is an unbiased estimator for the kernel of an LNP neuron [5], differs significantly from this true kernel (see also [15]). The accuracy of this approach improves with the number of constraints considered, but performance is fastest with fewer constraints. Therefore, because ascending the likelihood function requires evaluating the likelihood at many different points, we can make this ascent process much quicker by applying a version of the coarse-to-fine idea. Let L k denote the approximation to the likelihood given by allowing only k constraints in the above algorithm. Then we know, by a proof identical to that of Theorem 1, that Lk has no local maxima; in addition, by the above logic, Lk → L as k grows. It takes little additional effort to prove that argmax Lk → argmax L; thus, we can efficiently ascend the true likelihood surface by ascending the “coarse” approximants Lk , then gradually “refining” our approximation by letting k increase. An application of this algorithm to simulated data is shown in Fig. 4. Further applications to both simulated and real data will be presented elsewhere. Discussion We have shown here that the L-NLIF model, which couples a linear filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data using maximum likelihood. Moreover, this model lends itself directly to analysis via tools from the modern theory of point processes. For example, once we have obtained our estimate of the parameters (k, σ, g, Vr , h), how do we verify that the resulting model provides an adequate description of the data? This important “model validation” question has been the focus of some recent elegant research, under the rubric of “time rescaling” techniques [4]. While we lack the room here to review these methods in detail, we can note that they depend essentially on knowledge of the conditional firing rate function f (t). 
Recall that we showed how to efficiently compute this function in the last section and examined some of its qualitative properties in the L-NLIF context in Figs. 2 and 3. We are currently in the process of applying the model to physiological data recorded both in vivo and in vitro, in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity on the spiking patterns of both single cells and multineuronal ensembles. References [1] B. Aguera y Arcas and A. Fairhall. What causes a neuron to spike? 15:1789–1807, 2003. Neral Computation, [2] M. Berry and M. Meister. Refractoriness and neural precision. Journal of Neuroscience, 18:2200–2211, 1998. [3] V. Bogachev. Gaussian Measures. AMS, New York, 1998. [4] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346, 2002. [5] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199–213, 2001. [6] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992. [7] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002. [8] S. Karlin and H. Taylor. A Second Course in Stochastic Processes. Academic Press, New York, 1981. [9] J. Keat, P. Reinagel, R. Reid, and M. Meister. Predicting every spike: a model for the responses of visual neurons. Neuron, 30:803–817, 2001. [10] B. Knight, A. Omurtag, and L. Sirovich. The approach of a neuron population firing rate to a new equilibrium: an exact theoretical result. Neural Computation, 12:1045–1055, 2000. [11] J. Levin and J. Miller. Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380:165–168, 1996. [12] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437–464, 2003. [13] L. Paninski, B. Lau, and A. Reyes. Noise-driven adaptation: in vitro and mathematical analysis. Neurocomputing, 52:877–883, 2003. [14] L. Paninski, J. Pillow, and E. Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. submitted manuscript (cns.nyu.edu/∼liam), 2004. [15] J. Pillow and E. Simoncelli. Biases in white noise analysis due to non-poisson spike generation. Neurocomputing, 52:109–115, 2003. [16] D. Reich, J. Victor, and B. Knight. The power ratio and the interval map: Spiking models and extracellular recordings. The Journal of Neuroscience, 18:10090–10104, 1998. [17] M. Rudd and L. Brown. Noise adaptation in integrate-and-fire neurons. Neural Computation, 9:1047–1069, 1997. [18] J. Victor. How the brain uses time to represent and process visual information. Brain Research, 886:33–46, 2000. [19] Y. Yu and T. Lee. Dynamical mechanisms underlying contrast gain control in sing le neurons. Physical Review E, 68:011901, 2003.

3 0.12843618 193 nips-2003-Variational Linear Response

Author: Manfred Opper, Ole Winther

Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.

4 0.11163627 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

5 0.10652904 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

6 0.10215144 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

7 0.10039344 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

8 0.10013 32 nips-2003-Approximate Expectation Maximization

9 0.092322297 122 nips-2003-Margin Maximizing Loss Functions

10 0.084710397 88 nips-2003-Image Reconstruction by Linear Programming

11 0.083407931 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

12 0.082986675 169 nips-2003-Sample Propagation

13 0.082868129 115 nips-2003-Linear Dependent Dimensionality Reduction

14 0.081023924 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

15 0.079122335 189 nips-2003-Tree-structured Approximations by Expectation Propagation

16 0.078572221 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

17 0.077720508 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

18 0.073754832 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

19 0.07003884 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

20 0.068640836 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.224), (1, -0.074), (2, -0.056), (3, 0.08), (4, 0.147), (5, -0.073), (6, 0.062), (7, -0.106), (8, -0.026), (9, -0.014), (10, 0.006), (11, -0.006), (12, -0.109), (13, -0.003), (14, -0.01), (15, 0.066), (16, 0.103), (17, -0.05), (18, -0.022), (19, 0.019), (20, 0.074), (21, -0.016), (22, -0.034), (23, 0.074), (24, 0.05), (25, 0.026), (26, -0.001), (27, -0.168), (28, -0.041), (29, 0.025), (30, -0.096), (31, -0.027), (32, 0.062), (33, -0.018), (34, -0.182), (35, -0.092), (36, 0.034), (37, -0.086), (38, 0.108), (39, 0.003), (40, -0.104), (41, 0.077), (42, -0.021), (43, 0.034), (44, 0.046), (45, -0.004), (46, -0.084), (47, -0.064), (48, 0.099), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9560039 100 nips-2003-Laplace Propagation


2 0.5822109 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations. A central issue in computational neuroscience is the characterization of the functional relationship between sensory stimuli and neural spike trains. A common model for this relationship consists of linear filtering of the stimulus, followed by a nonlinear, probabilistic spike generation process. The linear filter is typically interpreted as the neuron’s “receptive field,” while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. One widely used model of this type is the Linear-Nonlinear-Poisson (LNP) cascade model, in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous (“memoryless”) nonlinear function of the filtered input. This model has a number of desirable features, including conceptual simplicity and computational tractability. Additionally, reverse correlation analysis provides a simple unbiased estimator for the linear filter [5], and the properties of estimators (for both the linear filter and static nonlinearity) have been thoroughly analyzed, even for the case of highly non-symmetric or “naturalistic” stimuli [12]. One important drawback of the LNP model, * JWP and LP contributed equally to this work. We thank E.J. Chichilnisky for helpful discussions. L−NLIF model LNP model )ekips(P Figure 1: Simulated responses of LNLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. Top: Raster of responses of L-NLIF model, where σnoise /σsignal = 0.5 and g gives a membrane time constant of 15 ms. The top row shows the fixed (deterministic) response of the model with σnoise set to zero. Middle: Raster of responses of LNP model, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. Bottom: (Black line) Post-stimulus time histogram (PSTH) of the simulated L-NLIF response. (Gray line) PSTH of the LNP model. Note that the LNP model fails to preserve the fine temporal structure of the spike trains, relative to the L-NLIF model. 001 05 0 )sm( emit however, is that Poisson processes do not accurately capture the statistics of neural spike trains [2, 9, 16, 1]. 
In particular, the probability of observing a spike is not a functional of the stimulus only; it is also strongly affected by the recent history of spiking. The leaky integrate-and-fire (LIF) model provides a biophysically more realistic spike mechanism with a simple form of spike-history dependence. This model is simple, wellunderstood, and has dynamics that are entirely linear except for a nonlinear “reset” of the membrane potential following a spike. Although this model’s overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and lack of active conductances), the nonlinear reset has significant functional importance for the model’s response properties. In previous work, we have shown that standard reverse correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation; we developed a new estimator for this model, and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells [15]. We and others have explored other “adaptive” properties of the LIF model [17, 13, 19]. In this paper, we consider a model consisting of a linear filter followed by noisy LIF spike generation with a spike-dependent after-current; this is essentially the standard LIF model driven by a noisy, filtered version of the stimulus, with an additional current waveform injected following each spike. We will refer to this as the the “L-NLIF” model. The probabilistic nature of this model provides several important advantages over the deterministic version we have considered previously. First, an explicit noise model allows us to couch the problem in the terms of classical estimation theory. This, in turn, provides a natural “cost function” (likelihood) for model assessment and leads to more efficient estimation of the model parameters. Second, noise allows us to explicitly model neural firing statistics, and could provide a rigorous basis for a metric distance between spike trains, useful in other contexts [18]. Finally, noise influences the behavior of the model itself, giving rise to phenomena not observed in the purely deterministic model [11]. Our main contribution here is to show that the maximum likelihood estimator (MLE) for the L-NLIF model is computationally tractable. Specifically, we describe an algorithm for computing the likelihood function, and prove that this likelihood function contains no non-global maxima, implying that the MLE can be computed efficiently using standard ascent techniques. The desirable statistical properties of this estimator (e.g. consistency, efficiency) are all inherited “for free” from classical estimation theory. Thus, we have a compact and powerful model for the neural code, and a well-motivated, efficient way to estimate the parameters of this model from extracellular data. The Model We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to i−1 dV = − gV (t) + k · x(t) + j=0 h(t − tj ) dt + σNt , (1) and resets to Vr whenever V = 1. Here, g denotes the leak conductance, k · x(t) the projection of the input signal x(t) onto the linear kernel k, h is an “afterpotential,” a current waveform of fixed amplitude and shape whose value depends only on the time since the last spike ti−1 , and Nt is an unobserved (hidden) noise process with scale parameter σ. 
Figures 1 and 2 show several simple comparisons of the L-NLIF and LNP models. In Fig. 1, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations [2, 16, 9]. The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful, as demonstrated in Fig. 2: by varying V_r or h, for example, we can match a wide variety of dynamical behaviors (e.g. adaptation, bursting, bistability) known to exist in biological neurons.

Figure 2: Illustration of diverse behaviors of the L-NLIF model. A: Firing rate adaptation. A positive DC current (top) was injected into three model cells differing only in their h currents (shown on left: top, h = 0; middle, h depolarizing; bottom, h hyperpolarizing). Voltage traces of each cell’s response (right, with spikes superimposed) exhibit rate facilitation for depolarizing h (middle), and rate adaptation for hyperpolarizing h (bottom). B: Bursting. The response of a model cell with a biphasic h current (left) is shown as a function of three different levels of DC current. For small current levels (top), the cell responds rhythmically. For larger currents (middle and bottom), the cell responds with regular bursts of spikes. C: Bistability. The stimulus (top) is a positive followed by a negative current pulse. Although a cell with no h current (middle) responds transiently to the positive pulse, a cell with biphasic h (bottom) exhibits a bistable response: the positive pulse puts it into a stable firing regime which persists until the arrival of a negative pulse.

The Estimation Problem

Our problem now is to estimate the model parameters {k, σ, g, V_r, h} from a sufficiently rich, dynamic input sequence x(t) together with spike times {t_i}. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here. To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing it. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of the voltage V(t) during an interspike interval. In the noiseless case [15], the voltage trace during an interspike interval t ∈ [t_{i-1}, t_i] is given by the solution to equation (1) with σ = 0:

    V_0(t) = V_r e^{-gt} + \int_{t_{i-1}}^{t} \left( \mathbf{k} \cdot \mathbf{x}(s) + \sum_{j=0}^{i-1} h(s - t_j) \right) e^{-g(t-s)} \, ds,   (2)

which is simply a linear convolution of the input current with a negative exponential.
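As a sketch of equation (2), the noiseless voltage can be evaluated on a discrete time grid by exploiting the exponential kernel; the step size, the drive signal, and the midpoint weighting used here are assumptions made for illustration only.

```python
import numpy as np

def noiseless_voltage(drive, g, V_r, dt=1.0):
    """Discrete approximation to eq. (2) on a grid starting at the last spike.
    drive[n] approximates k.x(s) + sum_j h(s - t_j) at s = t_{i-1} + n*dt.
    The exponential kernel gives the recursion
        V0[n+1] = V0[n]*exp(-g*dt) + drive[n]*dt*exp(-g*dt/2)."""
    decay = np.exp(-g * dt)
    V0 = np.empty(len(drive) + 1)
    V0[0] = V_r
    for n in range(len(drive)):
        V0[n + 1] = V0[n] * decay + drive[n] * dt * np.exp(-g * dt / 2.0)
    return V0

# Toy usage: a constant drive relaxes toward the equilibrium value drive/g.
V0 = noiseless_voltage(drive=np.full(200, 0.03), g=0.05, V_r=0.0)
print(V0[-1], 0.03 / 0.05)
```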
It is easy to see that adding Gaussian noise to the voltage during each time step induces a Gaussian density over V(t), since linear dynamics preserve Gaussianity [8]. This density is uniquely characterized by its first two moments; the mean is given by (2), and its covariance is σ² E_g E_g^T, where E_g is the convolution operator corresponding to e^{-gt}. Note that this density is highly correlated for nearby points in time, since noise is integrated by the linear dynamics. Intuitively, smaller leak conductance g leads to stronger correlation in V(t) at nearby time points. We denote this Gaussian density G(x_i, k, σ, g, V_r, h), where the index i indicates the ith spike and the corresponding stimulus chunk x_i (i.e. the stimuli that influence V(t) during the ith interspike interval).

Now, on any interspike interval t ∈ [t_{i-1}, t_i], the only information we have is that V(t) is less than threshold for all times before t_i, and exceeds threshold during the time bin containing t_i. This translates to a set of linear constraints on V(t), expressed in terms of the set

    C_i = \{ V(t) < 1,\ t_{i-1} \le t < t_i \} \cap \{ V(t_i) \ge 1 \}.

Therefore, the likelihood that the neuron first spikes at time t_i, given a spike at time t_{i-1}, is the probability of the event V(t) ∈ C_i, which is given by

    L_{x_i, t_i}(k, σ, g, V_r, h) = \int_{C_i} G(x_i, k, σ, g, V_r, h),

the integral of the Gaussian density G(x_i, k, σ, g, V_r, h) over the set C_i.

Figure 3: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current (top). Top middle: Ten simulated voltage traces V(t), evaluated up to the first threshold crossing, conditional on a spike at time zero (V_r = 0). Note the strong correlation between neighboring time points, and the sparsening of the plot as traces are eliminated by spiking. Bottom middle: Time evolution of P(V). Each column represents the conditional distribution of V at the corresponding time (i.e. for all traces that have not yet crossed threshold). Bottom: Probability density of the interspike interval (isi) corresponding to this particular input. Note that probability mass is concentrated at the points where the input drives V_0(t) close to threshold.

Spiking resets V to V_r, meaning that the noise contribution to V in different interspike intervals is independent. This “renewal” property, in turn, implies that the density over V(t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the Gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, the full likelihood is

    L_{\{x_i, t_i\}}(k, σ, g, V_r, h) = \prod_i \int_{C_i} G(x_i, k, σ, g, V_r, h),

where the product, again, is over all observed spike times {t_i} and corresponding stimulus chunks {x_i}.
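For intuition about what each factor of this product measures, the following crude Monte Carlo sketch samples correlated Gaussian voltage paths around the noiseless mean and counts those consistent with C_i. The time discretization and the sample-based floor on the probability are assumptions; the paper instead uses the deterministic methods described next.

```python
import numpy as np

def interval_loglik_mc(V0, sigma, g, dt=1.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of log P(V < 1 in every bin before the last,
    V >= 1 in the last bin), with V = V0 + exponentially filtered noise.
    V0: noiseless mean voltage on the interspike grid (e.g. from eq. 2)."""
    rng = np.random.default_rng(seed)
    n = len(V0)
    # Causal exponential filter E_g: E[i, j] = exp(-g*dt*(i-j)) for i >= j.
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    E = np.tril(np.exp(-g * dt * lags))
    z = rng.standard_normal((n_samples, n)) * sigma * np.sqrt(dt)
    V = V0[None, :] + z @ E.T                 # correlated Gaussian voltage paths
    below = np.all(V[:, :-1] < 1.0, axis=1)   # subthreshold before the last bin
    crossed = V[:, -1] >= 1.0                 # crosses threshold in the last bin
    p = np.mean(below & crossed)
    return np.log(max(p, 1.0 / n_samples))    # crude floor to avoid log(0)
```

Naive sampling like this scales poorly with the length of the interval, which is one motivation for the specialized techniques that follow.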
Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result now states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima.

Theorem 1. The likelihood L_{\{x_i, t_i\}}(k, σ, g, V_r, h) has no non-global extrema in the parameters (k, σ, g, V_r, h), for any data {x_i, t_i}.

The proof [14] is based on the log-concavity of L_{\{x_i, t_i\}}(k, σ, g, V_r, h) under a certain parametrization of (k, σ, g, V_r, h). The classical approach for establishing the nonexistence of non-global maxima of a given function uses concavity, which corresponds roughly to the function having everywhere non-positive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no non-global extrema, neither will g(f), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than log-concavity, so log-concavity can be a powerful tool even in situations for which concavity is useless (the Gaussian density is log-concave but not concave, for example). Our proof relies on a particular theorem [3] establishing the log-concavity of integrals of log-concave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. We should also note that the proof extends without difficulty to some other noise processes which generate log-concave densities (where white noise has the standard Gaussian density); for example, the proof is nearly identical if N_t is allowed to be colored or non-Gaussian noise, with possibly nonzero drift.

Computational methods and numerical results

Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. Now how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate Gaussian densities G over simple but high-dimensional orthants C_i. We discuss two ways to compute these integrals; each has its own advantages.

The first technique can be termed “density evolution” [10, 13]. The method is based on the following well-known fact from the theory of stochastic differential equations [8]: given the data (x_i, t_{i-1}), the probability density of the voltage process V(t) up to the next spike t_i satisfies the following partial differential (Fokker-Planck) equation:

    \frac{\partial P(V, t)}{\partial t} = \frac{σ^2}{2} \frac{\partial^2 P}{\partial V^2} + g \frac{\partial [(V - V_{eq}(t)) P]}{\partial V},   (3)

under the boundary conditions P(V, t_{i-1}) = δ(V - V_r) and P(V_th, t) = 0, where V_{eq}(t) is the instantaneous equilibrium potential:

    V_{eq}(t) = \frac{1}{g} \left( \mathbf{k} \cdot \mathbf{x}(t) + \sum_{j=0}^{i-1} h(t - t_j) \right).

Moreover, the conditional firing rate f(t) satisfies

    \int_{t_{i-1}}^{t} f(s) \, ds = 1 - \int P(V, t) \, dV.

Thus standard techniques for solving the drift-diffusion evolution equation (3) lead to a fast method for computing f(t) (as illustrated in Fig. 3). Finally, the likelihood L_{x_i, t_i}(k, σ, g, V_r, h) is simply f(t_i).
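A bare-bones finite-difference sketch of this density-evolution computation is given below; the explicit time-stepping, the voltage grid, and the boundary handling are simplifying assumptions rather than the discretization used in the cited work, and the scheme is not guaranteed to be stable for all parameter settings.

```python
import numpy as np

def density_evolution_rate(Veq, g, sigma, V_r=0.0, V_th=1.0, dt=0.1, nV=200):
    """Crude explicit finite-difference solution of the Fokker-Planck eq. (3),
    returning the conditional firing rate f(t) on the same grid as Veq.
    Veq: samples of (1/g)*(k.x(t) + sum_j h(t - t_j)), one value per time step."""
    V = np.linspace(V_r - 2.0, V_th, nV)
    dV = V[1] - V[0]
    P = np.zeros(nV)
    P[np.argmin(np.abs(V - V_r))] = 1.0 / dV     # P(V, t_{i-1}) = delta(V - V_r)
    f = np.zeros(len(Veq))
    survival = 1.0
    for t, veq in enumerate(Veq):
        drift = g * (V - veq) * P                # g * (V - Veq(t)) * P
        dPdt = 0.5 * sigma**2 * np.gradient(np.gradient(P, dV), dV) \
               + np.gradient(drift, dV)
        P = np.maximum(P + dt * dPdt, 0.0)
        P[-1] = 0.0                              # absorbing boundary at threshold
        new_survival = np.sum(P) * dV
        f[t] = max(survival - new_survival, 0.0) / dt   # probability absorbed per unit time
        survival = new_survival
    return f
```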
While elegant and efficient, this density evolution technique turns out to be slightly more powerful than what we need for the MLE: recall that we do not need to compute the conditional rate function f at all times t, but rather just at the set of spike times {t_i}, and thus we can turn to more specialized techniques for faster performance. We employ a rapid technique for computing the likelihood using an algorithm due to Genz [6], designed to compute exactly the kinds of multidimensional Gaussian probability integrals considered here. This algorithm works well when the orthants C_i are defined by fewer than ≈ 10 linear constraints on V(t). The number of actual constraints on V(t) during an interspike interval (t_i - t_{i-1}) grows linearly in the length of the interval: thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model [15], in which we discard all but a small subset of the constraints. The key point is that, due to strong correlations in the noise and the fact that the constraints only figure significantly when V(t) is driven close to threshold, a small number of constraints often suffices to approximate the true likelihood to a high degree of precision.

Figure 4: Demonstration of the estimator’s performance on simulated data. Dashed lines show the true kernel k and aftercurrent h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell, while h is a function specified in a five-dimensional vector space, whose shape induces a slight degree of burstiness in the model’s spike responses. The L-NLIF model was simulated with parameters g = 0.05 (corresponding to a membrane time constant of 20 time samples), σ_noise = 0.5, and V_r = 0. The stimulus was 30,000 time samples of white Gaussian noise with a standard deviation of 0.5. With only 600 spikes of output, the estimator is able to retrieve an estimate of k (gray curve) which closely matches the true kernel. Note that the spike-triggered average (black curve), which is an unbiased estimator for the kernel of an LNP neuron [5], differs significantly from this true kernel (see also [15]).

The accuracy of this approach improves with the number of constraints considered, but performance is fastest with fewer constraints. Therefore, because ascending the likelihood function requires evaluating the likelihood at many different points, we can make this ascent process much quicker by applying a version of the coarse-to-fine idea. Let L_k denote the approximation to the likelihood given by allowing only k constraints in the above algorithm. Then we know, by a proof identical to that of Theorem 1, that L_k has no local maxima; in addition, by the above logic, L_k → L as k grows. It takes little additional effort to prove that argmax L_k → argmax L; thus, we can efficiently ascend the true likelihood surface by ascending the “coarse” approximants L_k, then gradually “refining” our approximation by letting k increase. An application of this algorithm to simulated data is shown in Fig. 4. Further applications to both simulated and real data will be presented elsewhere.
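The coarse-to-fine ascent itself can be expressed generically. In the sketch below, neg_loglik(theta, k) is a hypothetical routine returning -log L_k(theta) (for example, built on the Genz algorithm with only k constraints retained per interval); the optimizer and the refinement schedule are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def coarse_to_fine_mle(neg_loglik, theta0, schedule=(2, 4, 6, 10)):
    """Maximize the likelihood by ascending coarse approximants L_k first.

    neg_loglik(theta, k) -> float is a hypothetical black-box routine returning
    -log L_k(theta), the likelihood computed with only k linear constraints
    retained per interspike interval.  Because each L_k has no non-global
    maxima, a local optimizer warm-started from the previous level suffices."""
    theta = np.asarray(theta0, dtype=float)
    for k in schedule:                           # gradually refine the approximation
        res = minimize(neg_loglik, theta, args=(k,), method="Nelder-Mead")
        theta = res.x                            # warm start for the next, finer level
    return theta
```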
Discussion

We have shown here that the L-NLIF model, which couples a linear filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data using maximum likelihood. Moreover, this model lends itself directly to analysis via tools from the modern theory of point processes. For example, once we have obtained our estimate of the parameters (k, σ, g, V_r, h), how do we verify that the resulting model provides an adequate description of the data? This important “model validation” question has been the focus of some recent elegant research, under the rubric of “time rescaling” techniques [4]. While we lack the room here to review these methods in detail, we can note that they depend essentially on knowledge of the conditional firing rate function f(t). Recall that we showed how to efficiently compute this function in the last section and examined some of its qualitative properties in the L-NLIF context in Figs. 2 and 3. We are currently in the process of applying the model to physiological data recorded both in vivo and in vitro, in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity on the spiking patterns of both single cells and multineuronal ensembles.

References

[1] B. Aguera y Arcas and A. Fairhall. What causes a neuron to spike? Neural Computation, 15:1789–1807, 2003.
[2] M. Berry and M. Meister. Refractoriness and neural precision. Journal of Neuroscience, 18:2200–2211, 1998.
[3] V. Bogachev. Gaussian Measures. AMS, New York, 1998.
[4] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346, 2002.
[5] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199–213, 2001.
[6] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992.
[7] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[8] S. Karlin and H. Taylor. A Second Course in Stochastic Processes. Academic Press, New York, 1981.
[9] J. Keat, P. Reinagel, R. Reid, and M. Meister. Predicting every spike: a model for the responses of visual neurons. Neuron, 30:803–817, 2001.
[10] B. Knight, A. Omurtag, and L. Sirovich. The approach of a neuron population firing rate to a new equilibrium: an exact theoretical result. Neural Computation, 12:1045–1055, 2000.
[11] J. Levin and J. Miller. Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380:165–168, 1996.
[12] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437–464, 2003.
[13] L. Paninski, B. Lau, and A. Reyes. Noise-driven adaptation: in vitro and mathematical analysis. Neurocomputing, 52:877–883, 2003.
[14] L. Paninski, J. Pillow, and E. Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Submitted manuscript (cns.nyu.edu/~liam), 2004.
[15] J. Pillow and E. Simoncelli. Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52:109–115, 2003.
[16] D. Reich, J. Victor, and B. Knight. The power ratio and the interval map: spiking models and extracellular recordings. The Journal of Neuroscience, 18:10090–10104, 1998.
[17] M. Rudd and L. Brown. Noise adaptation in integrate-and-fire neurons. Neural Computation, 9:1047–1069, 1997.
[18] J. Victor. How the brain uses time to represent and process visual information. Brain Research, 886:33–46, 2000.
[19] Y. Yu and T. Lee. Dynamical mechanisms underlying contrast gain control in single neurons. Physical Review E, 68:011901, 2003.

3 0.5093106 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall

Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
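As a rough illustration of how positive equivalence constraints can enter EM, the sketch below groups positively constrained points into "chunklets" that are forced to share a single hidden source, and computes chunklet responsibilities jointly in the E-step. This is a minimal reading of the idea under that assumption, not the authors' exact update equations, and negative constraints are not handled.

```python
import numpy as np

def em_gmm_positive_constraints(X, chunklets, K, n_iter=50, seed=0):
    """GMM EM in which each chunklet (list of indices known to share a source)
    is assigned to mixture components jointly.  Spherical covariances for brevity."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]          # initialize means from data points
    var = np.full(K, X.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: log-responsibility of component k for a chunklet is
        # log pi_k plus the sum of point log-likelihoods inside the chunklet.
        logp = -0.5 * (((X[:, None, :] - mu[None]) ** 2).sum(-1) / var
                       + d * np.log(2 * np.pi * var)) + np.log(pi)   # shape (n, K)
        R = np.zeros((n, K))
        for c in chunklets:
            lc = logp[c].sum(axis=0)
            lc -= lc.max()
            r = np.exp(lc)
            r /= r.sum()
            R[c] = r                                  # every point in c shares r
        # M-step: standard GMM updates using the shared responsibilities.
        Nk = R.sum(axis=0) + 1e-12
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        var = np.array([(R[:, k] * ((X - mu[k]) ** 2).sum(-1)).sum() / (d * Nk[k])
                        for k in range(K)]) + 1e-6
    return pi, mu, var

# Toy usage: unconstrained points are singleton chunklets.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
chunklets = [[0, 1, 2], [50, 51]] + [[i] for i in range(100) if i not in {0, 1, 2, 50, 51}]
print(em_gmm_positive_constraints(X, chunklets, K=2)[0])
```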

4 0.50697708 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) non-Gaussian additive noise, and to unbiased non-additive models. 1

5 0.50176865 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

6 0.47605205 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

7 0.46256557 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

8 0.46163356 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

9 0.45707026 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

10 0.45566717 124 nips-2003-Max-Margin Markov Networks

11 0.43368506 193 nips-2003-Variational Linear Response

12 0.43142045 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

13 0.4036046 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

14 0.39583388 32 nips-2003-Approximate Expectation Maximization

15 0.38043407 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

16 0.35367647 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

17 0.35137576 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

18 0.34402543 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

19 0.3401221 122 nips-2003-Margin Maximizing Loss Functions

20 0.33815464 155 nips-2003-Perspectives on Sparse Bayesian Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.048), (11, 0.01), (27, 0.294), (30, 0.013), (35, 0.099), (53, 0.083), (69, 0.012), (71, 0.101), (76, 0.042), (85, 0.101), (91, 0.081), (99, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80511057 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

2 0.7921201 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

Author: David C. Parkes, Satinder P. Singh

Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total long-term value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1

3 0.74984956 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

Author: Christopher J. Paciorek, Mark J. Schervish

Abstract: We introduce a class of nonstationary covariance functions for Gaussian process (GP) regression. Nonstationary covariance functions allow the model to adapt to functions whose smoothness varies with the inputs. The class includes a nonstationary version of the Matérn stationary covariance, in which the differentiability of the regression function is controlled by a parameter, freeing one from fixing the differentiability in advance. In experiments, the nonstationary GP regression model performs well when the input space is two or three dimensions, outperforming a neural network model and Bayesian free-knot spline models, and competitive with a Bayesian neural network, but is outperformed in one dimension by a state-of-the-art Bayesian free-knot spline model. The model readily generalizes to non-Gaussian data. Use of computational methods for speeding GP fitting may allow for implementation of the method on larger datasets. 1

4 0.57663918 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

5 0.57105941 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
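For reference, the three margin-based losses named in this abstract can be written directly as functions of the margin m = y·f(x); this is a generic illustration, not code from the paper.

```python
import numpy as np

# Margin-based losses, each a function of the margin m = y * f(x).
def hinge_loss(m):        # SVM
    return np.maximum(0.0, 1.0 - m)

def exponential_loss(m):  # AdaBoost
    return np.exp(-m)

def logistic_loss(m):     # logistic regression (binomial deviance)
    return np.log1p(np.exp(-m))

m = np.linspace(-2, 3, 6)
for name, loss in [("hinge", hinge_loss), ("exponential", exponential_loss),
                   ("logistic", logistic_loss)]:
    print(name, np.round(loss(m), 3))
```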

6 0.57053399 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

7 0.56759107 189 nips-2003-Tree-structured Approximations by Expectation Propagation

8 0.56662285 145 nips-2003-Online Classification on a Budget

9 0.56286746 158 nips-2003-Policy Search by Dynamic Programming

10 0.56257659 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

11 0.56119329 41 nips-2003-Boosting versus Covering

12 0.55925816 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

13 0.55901688 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

14 0.55646253 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

15 0.55598474 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

16 0.55474222 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

17 0.55440193 78 nips-2003-Gaussian Processes in Reinforcement Learning

18 0.55361831 126 nips-2003-Measure Based Regularization

19 0.55324042 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

20 0.55313241 124 nips-2003-Max-Margin Markov Networks