nips nips2005 nips2005-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, along with the possibility of learning with such processes.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. [sent-5, score-0.239]
2 We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, the neural functions converge in distribution to stable processes. [sent-9, score-1.779]
3 Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. [sent-10, score-0.258]
4 Some particularly tractable classes of stable distributions are examined, along with the possibility of learning with such processes. [sent-11, score-0.516]
5 1 Introduction. Consider the model f_n(x) = (1/s_n) Σ_{j=1}^n v_j h(x; u_j) ≡ (1/s_n) Σ_{j=1}^n v_j h_j(x)  (1), which can be viewed as a multi-layer perceptron with input x, hidden functions h, hidden weights u_j, output weights v_j, and s_n a sequence of normalizing constants. [sent-12, score-2.749]
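As a rough numerical illustration of model (1) — not code from the paper — the sketch below draws one network function with Gaussian output weights (s_n = √n) and one with Cauchy output weights (α = 1, s_n = n), under an assumed tanh hidden nonlinearity and Gaussian hidden weights; all these specific choices are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # number of hidden units
x = np.linspace(-3.0, 3.0, 200)              # input grid

# hidden units h_j(x) = tanh(a_j + b_j x), with Gaussian (a_j, b_j) playing the role of u_j
a = rng.normal(size=n)
b = rng.normal(size=n)
H = np.tanh(a[:, None] + b[:, None] * x[None, :])    # shape (n, len(x))

# finite-variance (Gaussian) output weights, s_n = sqrt(n): Gaussian-process-like draw
f_gauss = rng.normal(size=n) @ H / np.sqrt(n)

# Cauchy output weights (alpha = 1), s_n = n^{1/alpha} = n: stable-process-like draw
f_cauchy = rng.standard_cauchy(size=n) @ H / n

print(f_gauss[:3])
print(f_cauchy[:3])
```

Plotting the two draws makes the contrast discussed below visible: the Cauchy-weight function is dominated by a few hidden units with very large weights.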
6 The work of Radford Neal [1] showed that, under certain assumptions on the parameter priors {vj , hj }, the distribution over the implied network functions fn converged to that of a Gaussian process, in the large network limit n → ∞. [sent-13, score-0.8]
7 The following questions then arise: what is the relationship between choices of distributions on the model priors, and the asymptotic distribution over the induced neural functions? [sent-18, score-0.169]
8 Previous work on these problems consists mainly in Neal's publication [1], which established that when the output weights v_j have finite variance and are i.i.d., the network functions converge to a Gaussian process, whereas when they are i.i.d. [sent-22, score-0.646]
9 symmetric stable (SS), the first-order marginal distributions of the functions are also SS. [sent-29, score-0.689]
10 This paper conducts a further investigation of these questions, with concentration on the cases where the weight priors can be 1) of infinite variance, and 2) non-i.i.d. [sent-32, score-0.177]
11 In Section 2, we give a general classification of the possible limiting processes that may arise under an i.i.d. [sent-36, score-0.237]
12 assumption on output weights distributed from a certain class — roughly speaking, those weights with tails asymptotic to a power-law — and provide explicit formulae for all the joint distribution functions. [sent-39, score-0.365]
13 As a byproduct, Neal’s preliminary analysis is completed, a full multivariate prescription attained and the convergence of the finite-dimensional distributions proved. [sent-40, score-0.255]
14 We then examine non-i.i.d. priors, specifically independent priors where the “identically distributed” assumption is discarded. [sent-44, score-0.202]
15 An example where a finite-variance non-Gaussian process acts as a limit point for a nontrivial infinite network is presented, followed by an investigation of conditions under which the Gaussian approximation is valid, via the Lindeberg-Feller theorem. [sent-45, score-0.226]
16 Finally, we raise the possibility of replacing network models with the processes themselves for learning applications: here, motivated by the foregoing limit theorems, the set of stable processes forms a natural generalization of the Gaussian case. [sent-46, score-0.864]
17 Classes of stable stochastic processes with particularly simple parameterizations are examined, along with preliminary applications to the nonlinear regression problem. [sent-47, score-0.662]
18 2 Neural Network Limits. Referring to (1), we make the following assumptions: h_j(x) ≡ h(x; u_j) are uniformly bounded in x (as for instance occurs if h is associated with some fixed nonlinearity), and {u_j} is an i.i.d. sequence. [sent-48, score-0.259]
19 With these assumptions, the choice of output priors vj will tend to dictate large-network behavior, independently of uj . [sent-55, score-0.796]
20 In the sequel, we restrict ourselves to functions fn (x) : R → R, as the respective proofs for the generalizations of x and fn to higher-dimensional spaces are routine. [sent-56, score-0.381]
21 2.1 i.i.d. priors. The Gaussian distribution has the feature that if X_1 and X_2 are statistically independent copies of the Gaussian variable X, then their linear combination is also Gaussian, i.e., aX_1 + bX_2 is distributed as cX for some constant c; random variables with this closure property are called stable. [sent-63, score-0.348]
22 If one further demands symmetry of the distribution, then they must have characteristic function Φ(t) = exp(−σ^α |t|^α), for parameters σ > 0 (called the spread) and 0 < α ≤ 2, termed the index. [sent-68, score-0.185]
23 Since the characteristic functions are not generally twice differentiable at t = 0, their variances are infinite, the Gaussian distribution being the only finite variance stable distribution, associated to index α = 2. [sent-69, score-0.768]
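To make the characteristic function Φ(t) = exp(−σ^α|t|^α) concrete, here is a small sketch (not from the paper) that samples standard symmetric α-stable variables with the Chambers–Mallows–Stuck formula and checks the empirical characteristic function against exp(−|t|^α); the choice α = 1.5 and the sample sizes are arbitrary.

```python
import numpy as np

def sym_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for a standard (sigma = 1) symmetric alpha-stable variable."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)     # uniform angle
    w = rng.exponential(1.0, size)                   # unit-mean exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(1)
alpha = 1.5
x = sym_stable(alpha, 200_000, rng)

t = np.array([0.5, 1.0, 2.0])
emp_cf = np.exp(1j * t[:, None] * x[None, :]).mean(axis=1).real
print(emp_cf)                        # empirical characteristic function, close to ...
print(np.exp(-np.abs(t) ** alpha))   # ... exp(-|t|^alpha), the symmetric stable form
```

Setting alpha = 2 in the same sampler reproduces a Gaussian, consistent with the remark that α = 2 is the only finite-variance member of the family.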
24 The attractive feature of stable variables, by definition, is closure under the formation of linear combinations: the linear combination of any two independent stable variables is another stable variable of the same index. [sent-70, score-1.526]
25 Moreover, the stable distributions are attraction points of distributions under a linear combiner operator, and indeed are the only such distributions, in the following sense: if {Y_j} are i.i.d. [sent-71, score-0.732]
26 and a_n + (1/s_n) Σ_{j=1}^n Y_j converges in distribution to X, then X must be stable [5]. [sent-74, score-0.570]
27 Hence, with i.i.d. priors v_j, and assuming (1) converges at all, convergence can occur only to stable variables, for each x. [sent-78, score-1.229]
28 Multivariate analogues are defined similarly: we say a random vector X is (strictly) stable if, for every a, b ∈ R, there exists a constant c such that aX1 + bX2 = cX where Xi are independent copies of X and the equality is in distribution. [sent-79, score-0.574]
29 A symmetric stable random vector is one which is stable and for which the distribution of X is the same as −X. [sent-80, score-1.108]
30 The following important classification theorem gives an explicit Fourier domain description of all multivariate symmetric stable distributions: Theorem 1. [sent-81, score-0.722]
31 X is a symmetric α-stable vector if and only if it has characteristic function Φ(t) = exp(−∫_{S^{d−1}} |⟨t, s⟩|^α dΓ(s))  (2), where Γ is a finite measure on the unit (d−1)-sphere S^{d−1}, and 0 < α ≤ 2. [sent-83, score-0.365]
32 In this case, the (unique) symmetrized measure Γ is called the spectral measure of the stable random vector X. [sent-85, score-0.641]
33 Finally, stable processes are defined as indexed sets of random variables whose finite-dimensional distributions are (multivariate) stable. [sent-86, score-0.711]
34 Let v be a symmetric stable random variable of index 0 < α ≤ 2, and spread σ > 0. [sent-89, score-0.82]
35 Let h be a bounded random variable independent of v, set y = vh, and let {y_i} be i.i.d. copies of y; then S_n = n^{−1/α} Σ_{i=1}^n y_i converges in distribution to an α-stable variable with characteristic function Φ(t) = exp{−|σt|^α E|h|^α}. [sent-94, score-0.408]
36 This follows by computing the characteristic function Φ_{S_n}, then applying standard convergence theorems from measure theory. [sent-96, score-0.227]
37 Then f_n(x) = n^{−1/α} Σ_{j=1}^n v_j h_j(x) converges in distribution to a symmetric α-stable process f(x) as n → ∞. [sent-105, score-1.179]
38 The finite-dimensional stable distribution of (f(x_1), …, f(x_d)), [sent-106, score-0.493]
39 where x_i ∈ R, has characteristic function Ψ(t) = exp(−σ^α E_h |⟨t, h⟩|^α)  (3), where h = (h(x_1), …, h(x_d)), [sent-109, score-0.228]
40 and h(x) is a random variable with the common distribution (across j) of h_j(x). [sent-112, score-0.306]
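As an illustration of how formula (3) can be used in practice — this is a sketch, not the paper's code — the exponent E_h|⟨t, h⟩|^α can be estimated by Monte Carlo for an assumed hidden nonlinearity (tanh with Gaussian parameters here), giving a numerical value of the joint characteristic function Ψ(t).

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, sigma = 1.5, 1.0
xs = np.array([-1.0, 0.0, 2.0])        # evaluation points x_1, ..., x_d
t = np.array([0.3, -0.5, 0.8])         # argument of the characteristic function

# Monte Carlo draws of h = (h(x_1), ..., h(x_d)) for h(x) = tanh(a + b x), Gaussian (a, b)
m = 500_000
a = rng.normal(size=m)
b = rng.normal(size=m)
H = np.tanh(a[:, None] + b[:, None] * xs[None, :])   # shape (m, d)

exponent = sigma ** alpha * np.mean(np.abs(H @ t) ** alpha)   # estimate of sigma^alpha E_h|<t,h>|^alpha
psi = np.exp(-exponent)                                       # joint characteristic function (3) at t
print(psi)
```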
41 If h = (h(x_1), …, h(x_d)) has joint probability density p(h) = p(rs), with s on the sphere S^{d−1} and r the radial component of h, then the finite measure Γ corresponding to the multivariate stable distribution of (f(x_1), …, f(x_d)) is obtained from p by passing to polar coordinates, as given in (4). [sent-116, score-0.619]
42 It suffices to show that every finite-dimensional distribution of f(x) converges to a symmetric multivariate stable characteristic function. [sent-121, score-0.477]
43 We have Σ_{i=1}^d t_i f_n(x_i) = n^{−1/α} Σ_{j=1}^n v_j Σ_{i=1}^d t_i h_j(x_i) for constants {x_1, …, x_d}. [sent-122, score-0.858]
44 The relation between the expectation in (3) and the stable spectral measure (4) is derived from a change of variable to spherical coordinates in the d-dimensional space of h. [sent-130, score-0.582]
45 Remark: When α = 2, the exponent in the characteristic function (3) is a quadratic form in t, and the limit becomes the usual multivariate Gaussian distribution. [sent-131, score-0.269]
46 The above proposition is the rigorous completion of Neal’s analysis, and gives the explicit form of the asymptotic process under i.i.d. priors. [sent-132, score-0.159]
47 More generally, we can consider output weights from the normal domain of attraction of index α, which, roughly, consists of those densities whose tails are asymptotic to |x|^{−(α+1)}, 0 < α < 2 [6]. [sent-136, score-0.33]
48 Let the weights v_j be i.i.d. from the normal domain of attraction of an SS variable with index α and spread σ. [sent-142, score-0.892]
49 Then f_n(x) = n^{−1/α} Σ_{j=1}^n v_j h_j(x) converges in distribution to a symmetric α-stable process f(x), with the joint characteristic functions given as in Proposition 1. [sent-143, score-1.399]
50 Example: Distributions with step-function priors. Let h(x) = sgn(a + ux), where a and u are independent Gaussians with zero mean. [sent-146, score-0.202]
51 Neal in [1] has shown that when the output weights vj are Gaussian, then the choice of the signum nonlinearity for h gives rise to local Brownian motion in the central regime. [sent-148, score-0.851]
52 There is a natural generalization of the Brownian process within the context of symmetric stable processes, called the symmetric α-stable Lévy motion. [sent-149, score-0.871]
53 It is characterised by an indexed sequence {w_t : t ∈ R} satisfying i) w_0 = 0 almost surely, ii) independent increments, and iii) w_t − w_s is distributed symmetric α-stable with spread σ = |t − s|^{1/α}. [sent-150, score-0.417]
54 As we shall now show, the choice of step-function nonlinearity for h and symmetric α-stable priors for v_j leads to locally Lévy stable motion, which provides a theoretical exposition for the empirical observations in [1]. [sent-151, score-1.437]
55 From (3), the random variable f(x) − f(y) is symmetric stable with spread parameter [E_h |h(x) − h(y)|^α]^{1/α}. [sent-153, score-0.765]
56 For the step nonlinearity, E_h|h(x) − h(y)|^α is proportional to the probability that a + ux changes sign between x and y, which is locally proportional to |x − y|; hence locally, the increment f(x) − f(y) is a symmetric stable variable with spread proportional to |x − y|^{1/α}, which is condition (iii) of Lévy motion. [sent-156, score-0.856]
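A quick numerical check of this scaling (an illustrative sketch, not the paper's experiment): with Cauchy output weights (α = 1) and the step nonlinearity, the spread of an increment can be read off from the median absolute increment over repetitions, and it should grow roughly in proportion to |x − y| for small separations. The grid, the sample sizes, and the Gaussian (a, u) priors are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5_000, 500
x = np.array([0.0, 0.05, 0.15, 0.35, 0.75])     # successive spacings 0.05, 0.1, 0.2, 0.4

incs = np.empty((reps, len(x) - 1))
for r in range(reps):
    a = rng.normal(size=n)
    u = rng.normal(size=n)
    H = np.sign(a[:, None] + u[:, None] * x[None, :])   # step-function hidden units
    v = rng.standard_cauchy(size=n)                     # alpha = 1 output weights
    f = v @ H / n                                       # s_n = n^{1/alpha} = n
    incs[r] = np.diff(f)

# For a Cauchy variable the median of |X| equals its spread, so these medians should
# grow roughly in proportion to the spacings 0.05, 0.1, 0.2, 0.4 (exactly so in the
# limit of vanishing spacing).
print(np.median(np.abs(incs), axis=0))
```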
57 Its joint characteristic function in the variables t_1, …, t_{n−1} [sent-165, score-0.23]
58 is Ψ(t_1, …, t_{n−1}) = exp(−2^α (p_1|t_1|^α + · · · + p_{n−1}|t_{n−1}|^α))  (6), which describes a vector of independent α-stable random variables, as the characteristic function splits (the p_i being the probabilities of a sign change of h over the corresponding input intervals). [sent-175, score-0.27]
59 The differences between sample functions arising from Cauchy priors as opposed to Gaussian priors are evident from the figure. [sent-177, score-0.333]
60 The figure shows i.i.d. processes w_n and their “integrated” versions Σ_{i=1}^n w_i, simulating the Lévy motions. [sent-188, score-0.210]
61 Large jumps in the sample paths of the stable process would correspond, in the network, to hidden units with heavy weighting factors v_j. [sent-192, score-0.527]
62 2.2 Non-i.i.d. priors. We begin with an interesting example, which shows that if the “identically distributed” assumption for the output weights is dispensed with, the limiting distribution of (1) can attain a non-stable (and non-Gaussian) form. [sent-197, score-0.404]
63 Take v_j to be independent random variables with P(v_j = 2^{−j}) = P(v_j = −2^{−j}) = 1/2. [sent-198, score-0.657]
64 The characteristic functions can easily be computed as E[e^{itv_j}] = cos(t/2^j). [sent-199, score-0.220]
65 Now recall Viète's formula: Π_{j=1}^n cos(t/2^j) = sin t / (2^n sin(t/2^n))  (7). Taking n → ∞ shows that the limiting characteristic function is a sinc function, which corresponds to the uniform density. [sent-200, score-0.273]
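The following sketch (illustrative, not from the paper) checks Viète's product against its sinc limit, and confirms empirically that the limiting sum Σ_j v_j is uniform on [−1, 1].

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.array([0.5, 2.0, 5.0])
j = np.arange(1, 31)                       # 30 terms is plenty: cos(t/2^j) -> 1 very fast

# finite Vieta product vs its limit sin(t)/t
prod = np.prod(np.cos(t[:, None] / 2.0 ** j[None, :]), axis=1)
print(prod)
print(np.sin(t) / t)

# the limit of sum_j v_j with P(v_j = 2^{-j}) = P(v_j = -2^{-j}) = 1/2 is uniform on [-1, 1]
signs = rng.choice([-1.0, 1.0], size=(100_000, 30))
s = (signs * 2.0 ** -j).sum(axis=1)
counts, _ = np.histogram(s, bins=4, range=(-1.0, 1.0))
print(counts)                              # roughly equal counts in each quarter of [-1, 1]
```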
66 What conditions are required on independent, but not necessarily identically distributed priors vj for convergence to the Gaussian? [sent-202, score-0.784]
67 Let v_j be a sequence of independent random variables, each with zero mean and finite variance; define s_n^2 = var[Σ_{j=1}^n v_j], and assume s_1 ≠ 0. [sent-206, score-1.214]
68 Then the sequence (1/s_n) Σ_{j=1}^n v_j converges in distribution to an N(0, 1) variable provided that, for each ε > 0, [sent-207, score-1.758]
69 lim_{n→∞} (1/s_n^2) Σ_{i=1}^n ∫_{|v| ≥ ε s_n} v^2 dF_{v_i}(v) = 0  (8), where F_{v_i} is the distribution function of v_i. (Footnote 2 to the earlier example: an intuitive proof is to think of Σ_j v_j as a binary expansion of real numbers in [−1, 1]; the prescription of the probability laws for v_j implies all such expansions are equiprobable, manifesting in the uniform distribution.) [sent-208, score-0.706]
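As an illustration of how condition (8) behaves (a sketch with assumed weight laws, not taken from the paper): for, say, weights uniform on [−1, 1] the truncated integrals vanish identically once ε s_n exceeds the bound on |v_j|, while for the ±2^{−j} weights of the earlier example s_n stays bounded and the sum in (8) does not vanish.

```python
import numpy as np

eps = 0.1
n = 10_000
j = np.arange(1, n + 1, dtype=float)

# Case 1: v_j uniform on [-1, 1] (bounded, variance 1/3 each), so s_n^2 = n/3 -> infinity.
s_n = np.sqrt(n / 3.0)
# Once eps*s_n exceeds the bound 1 on |v_j|, the region |v| >= eps*s_n carries no mass,
# so every term of the sum in (8) is exactly zero for this n.
print("case 1: eps*s_n =", eps * s_n, "-> Lindeberg sum = 0")

# Case 2: P(v_j = 2^{-j}) = P(v_j = -2^{-j}) = 1/2, so s_n^2 = sum_j 4^{-j} < 1/3 stays bounded.
s_n2 = np.sqrt(np.sum(4.0 ** -j))
# The truncated second moment of v_j equals 4^{-j} whenever 2^{-j} >= eps*s_n2, else 0.
lindeberg_sum = np.where(2.0 ** -j >= eps * s_n2, 4.0 ** -j, 0.0).sum() / s_n2 ** 2
print("case 2: s_n =", s_n2, " Lindeberg sum =", lindeberg_sum)   # stays near 1, so (8) fails
```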
70 Let the network (1) have independent finite-variance weights vj . [sent-215, score-0.734]
71 We want to show these variables are jointly Gaussian in the limit as n → ∞, by showing that every linear combination of the components converges in distribution to a Gaussian distribution. [sent-225, score-0.273]
72 Fixing k constants µ_i, we have Σ_{i=1}^k µ_i f_n(x_i) = (1/s_n) Σ_{j=1}^n v_j Σ_{i=1}^k µ_i h_j(x_i). [sent-226, score-0.699]
73 Define ξ_j = Σ_{i=1}^k µ_i h_j(x_i), and s̃_n^2 = var(Σ_{j=1}^n v_j ξ_j) = (Eξ^2) s_n^2, where ξ is a random variable with the common distribution of the ξ_j. [sent-227, score-0.785]
74 Then, for some c > 0: (1/s̃_n^2) Σ_{j=1}^n ∫_{|v_j ξ_j| ≥ ε s̃_n} |v_j(ω) ξ_j(ω)|^2 dP(ω) ≤ (c^2/(Eξ^2 s_n^2)) Σ_{j=1}^n ∫_{|v_j| ≥ ε (Eξ^2)^{1/2} s_n / c} |v_j(ω)|^2 dP(ω). The right-hand side can be made arbitrarily small, from the Lindeberg assumption on {v_j}; hence {v_j ξ_j} is Lindeberg, from which the theorem follows. [sent-228, score-0.317]
75 If the output weights {v_j} are a uniformly bounded sequence of independent random variables, and lim_{n→∞} s_n = ∞, then f_n(x) in (1) converges in distribution to a Gaussian process. [sent-231, score-0.649]
76 The non-Gaussian limit in the example of Section 2.2 was made possible precisely because the weights v_j decayed sufficiently quickly with j, with the result that lim_n s_n < ∞. [sent-233, score-0.789]
77 3 Learning with Stable Processes One of the original reasons for focusing machine learning interest on Gaussian processes consisted in the fact that they act as limit points of suitably constructed parametric models [2], [3]. [sent-234, score-0.22]
78 The problem of learning a regression function, which was previously tackled by Bayesian inference on a neural network model, could be reconsidered by directly placing a Gaussian process prior on the fitting functions themselves. [sent-235, score-0.160]
79 Gaussian processes did not seem to capture the richness of finite neural networks — for one, the dependencies between multiple outputs of a network vanished in the Gaussian limit. [sent-237, score-0.186]
80 The obvious generalization of Gaussian process regression involves the placement of a stable process prior of index α on u, and setting the observation noise to be i.i.d. stable of the same index. [sent-244, score-0.683]
81 Then the observations y also form a stable process of index α. [sent-248, score-0.558]
82 The use of such priors on the data u affords a significant broadening in the number of interesting dependency relationships that may be assumed. [sent-252, score-0.193]
83 An understanding of the dependency structure of multivariate stable vectors can be first broached by considering the following basic class. [sent-253, score-0.573]
84 Let v be a vector of i.i.d. symmetric stable variables of the same index, and let H be a matrix of appropriate dimension so that x = Hv is well-defined. [sent-257, score-0.628]
85 Then x has a symmetric stable characteristic function, where the spectral measure Γ in Theorem 1 is discrete, i.e., concentrated on finitely many points of the sphere (the directions of the columns of H). [sent-258, score-0.445]
86 When α = 2, only the covariance HH^T can be recovered from the distribution of x; not so when α < 2, for then the characteristic function for x in general possesses n fundamental discontinuities in higher-order derivatives, where n is the number of columns of H. [sent-263, score-0.185]
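A small check of this construction (an illustrative sketch using Cauchy sources so that no special sampler is needed; the matrix H and test point t are arbitrary): for x = Hv with i.i.d. standard SαS components v, the characteristic function factorizes as exp(−Σ_k |⟨t, h_k⟩|^α) over the columns h_k of H, i.e., a discrete spectral measure in (2).

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 1.0                                    # Cauchy components for convenience
H = np.array([[1.0, 0.5, -0.3],
              [0.0, 2.0,  1.0]])               # 2-dimensional x, three stable sources

v = rng.standard_cauchy(size=(3, 200_000))
x = H @ v                                      # samples of the stable vector x = Hv

t = np.array([0.4, -0.7])
emp_cf = np.exp(1j * (t @ x)).mean().real          # empirical characteristic function at t
theory = np.exp(-np.sum(np.abs(t @ H) ** alpha))   # exp(-sum_k |<t, h_k>|^alpha)
print(emp_cf, theory)
```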
87 A number of techniques already exist in the statistical literature for the estimation of the spectral measure (and hence the mixing H) of multivariate stable vectors from empirical data. [sent-266, score-0.65]
88 The infinite-dimensional generalization of the above situation gives rise to the set of stable processes produced as time-varying filtered versions of i.i.d. [sent-267, score-0.592]
89 stable noise, and similar to the Gaussian process, are parameterized by a centering (mean) function µ(x) and a bivariate filter function h(x, ν) encoding dependency information. [sent-270, score-0.526]
90 Another simple family of stable processes consists of the so-called sub-Gaussian processes. [sent-271, score-0.563]
91 These are processes defined by u(x) = A^{1/2} G(x), where A is a totally right-skewed (α/2)-stable variable [5], and G a Gaussian process of mean zero and covariance K. [sent-272, score-0.675]
92 The result is a symmetric α-stable random process with finite-dimensional characteristic functions of the form Φ(t) = exp(−(1/2) |⟨t, Kt⟩|^{α/2})  (10). The sub-Gaussian processes are then completely parameterized by the statistics of the subordinating Gaussian process G. [sent-273, score-0.624]
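To show how (10) is parameterized entirely by K, here is a minimal sketch with an assumed squared-exponential covariance (the covariance, grid, and α value are illustrative choices, not from the paper).

```python
import numpy as np

alpha = 1.5
xs = np.linspace(0.0, 1.0, 5)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.3 ** 2)   # assumed covariance of G

def subgaussian_cf(t, K, alpha):
    """Finite-dimensional characteristic function (10) of the sub-Gaussian process."""
    return np.exp(-0.5 * np.abs(t @ K @ t) ** (alpha / 2.0))

t = np.array([0.2, -0.1, 0.0, 0.3, -0.4])
print(subgaussian_cf(t, K, alpha))
print(subgaussian_cf(t, K, 2.0))    # alpha = 2 recovers the Gaussian characteristic function
```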
93 Unfortunately, the regression is somewhat trivial, because a calculation shows that the regression coefficients {a_i} appearing in (11) are the same as in the case where the Y_i are assumed jointly Gaussian! [sent-281, score-0.165]
94 It follows that the predictive mean estimates for (10) employing sub-Gaussian priors are identical to the estimates under a Gaussian hypothesis. [sent-283, score-0.211]
95 However, the conditional distribution of Y_n given Y_1, …, Y_{n−1} differs greatly from the Gaussian, and is neither stable nor symmetric about its conditional mean in general. [sent-287, score-0.583]
96 In any case, regression on stable processes suggests the need to compute and investigate the entire a posteriori probability law. [sent-291, score-0.630]
97 The main thrust of our foregoing results indicates that the class of possible limit points of network functions is significantly richer than the family of Gaussian processes, even under relatively restricted (e.g., i.i.d.) assumptions on the priors. [sent-292, score-0.248]
98 Gaussian processes are the appropriate models of large networks with finite variance priors in which no one component dominates another, but when the finite variance assumption is discarded, stable processes become the natural limit points. [sent-298, score-0.93]
99 Our discussion of the stable process regression problem has principally been confined to an exposition of the basic theoretical issues and principles involved, rather than to algorithmic procedures. [sent-303, score-0.603]
100 Nevertheless, since simple closed-form expressions exist for the characteristic functions, the predictive probability laws can all in principle be computed with multi-dimensional Fourier transform techniques. [sent-304, score-0.218]
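As a one-dimensional illustration of that remark (a sketch, not the paper's procedure, assuming σ = 1 and α = 1), a symmetric stable density can be recovered by numerically inverting its characteristic function; with α = 1 the result can be checked against the closed-form Cauchy density.

```python
import numpy as np

alpha = 1.0                                  # Cauchy, so the inversion has a closed-form check
t = np.linspace(0.0, 50.0, 20_001)
phi = np.exp(-t ** alpha)                    # symmetric stable characteristic function, sigma = 1

def density(x):
    # p(x) = (1/pi) * integral_0^inf cos(t x) * phi(t) dt, via the trapezoid rule
    f = np.cos(t * x) * phi
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)) / np.pi

for x in (0.0, 1.0, 3.0):
    print(density(x), 1.0 / (np.pi * (1.0 + x ** 2)))   # numerical vs exact Cauchy density
```

For joint predictive laws the same idea applies in d dimensions with a multi-dimensional Fourier transform, at correspondingly higher cost.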
wordName wordTfidf (topN-words)
[('vj', 0.527), ('stable', 0.445), ('characteristic', 0.185), ('hj', 0.172), ('fn', 0.159), ('priors', 0.149), ('symmetric', 0.138), ('gaussian', 0.133), ('sn', 0.131), ('processes', 0.118), ('cauchy', 0.108), ('neal', 0.104), ('eh', 0.099), ('lindeberg', 0.099), ('spread', 0.096), ('vy', 0.092), ('xd', 0.092), ('limiting', 0.088), ('uj', 0.087), ('tn', 0.087), ('weights', 0.086), ('multivariate', 0.084), ('converges', 0.077), ('attraction', 0.074), ('limit', 0.072), ('limits', 0.072), ('distributions', 0.071), ('network', 0.068), ('regression', 0.067), ('brownian', 0.059), ('nitesimal', 0.059), ('ss', 0.059), ('process', 0.058), ('yn', 0.057), ('theorem', 0.055), ('index', 0.055), ('nite', 0.054), ('variable', 0.054), ('independent', 0.053), ('nonlinearity', 0.053), ('proposition', 0.051), ('xn', 0.051), ('asymptotic', 0.05), ('clt', 0.05), ('signum', 0.05), ('distribution', 0.048), ('identically', 0.047), ('limn', 0.045), ('variables', 0.045), ('dependency', 0.044), ('copies', 0.044), ('characterised', 0.043), ('foregoing', 0.043), ('hv', 0.043), ('xi', 0.043), ('measure', 0.042), ('spectral', 0.041), ('central', 0.04), ('var', 0.04), ('symmetrized', 0.039), ('closure', 0.039), ('mixing', 0.038), ('bivariate', 0.037), ('increments', 0.037), ('prescription', 0.037), ('rs', 0.035), ('functions', 0.035), ('output', 0.033), ('motion', 0.033), ('laws', 0.033), ('exposition', 0.033), ('replacement', 0.033), ('random', 0.032), ('preliminary', 0.032), ('remark', 0.032), ('cx', 0.032), ('tails', 0.032), ('condition', 0.031), ('estimates', 0.031), ('convergence', 0.031), ('arise', 0.031), ('jointly', 0.031), ('sequence', 0.03), ('pennsylvania', 0.03), ('philadelphia', 0.03), ('richer', 0.03), ('suitably', 0.03), ('distributed', 0.03), ('events', 0.03), ('rise', 0.029), ('converged', 0.029), ('york', 0.029), ('generalizations', 0.028), ('dominates', 0.028), ('investigation', 0.028), ('dp', 0.028), ('surely', 0.028), ('rotation', 0.028), ('cos', 0.027), ('wt', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, along with the possibility of learning with such processes.
2 0.21118899 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
3 0.15182064 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
4 0.13798426 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
5 0.13599508 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
Author: Michaël Aupetit
Abstract: Given a set of points and a set of prototypes representing them, how to create a graph of the prototypes whose topology accounts for that of the points? This problem had not yet been explored in the framework of statistical learning theory. In this work, we propose a generative model based on the Delaunay graph of the prototypes and the ExpectationMaximization algorithm to learn the parameters. This work is a first step towards the construction of a topological model of a set of points grounded on statistics. 1 1.1
6 0.10832009 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
7 0.10170499 138 nips-2005-Non-Local Manifold Parzen Windows
8 0.1016801 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
9 0.087774791 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
10 0.085162506 85 nips-2005-Generalization to Unseen Cases
11 0.076976448 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
12 0.074082248 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
13 0.073751174 74 nips-2005-Faster Rates in Regression via Active Learning
14 0.072788462 47 nips-2005-Consistency of one-class SVM and related algorithms
15 0.072263941 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
16 0.072259128 182 nips-2005-Statistical Convergence of Kernel CCA
17 0.071819447 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
18 0.069708459 79 nips-2005-Fusion of Similarity Data in Clustering
19 0.068728097 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
20 0.066688687 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
topicId topicWeight
[(0, 0.225), (1, 0.067), (2, -0.052), (3, -0.084), (4, -0.006), (5, -0.019), (6, -0.058), (7, -0.112), (8, -0.016), (9, 0.002), (10, 0.052), (11, 0.154), (12, 0.009), (13, 0.031), (14, -0.044), (15, 0.167), (16, -0.009), (17, 0.195), (18, 0.213), (19, 0.057), (20, -0.038), (21, 0.068), (22, -0.255), (23, 0.009), (24, -0.1), (25, -0.124), (26, -0.046), (27, -0.097), (28, -0.073), (29, -0.017), (30, 0.06), (31, -0.029), (32, -0.005), (33, -0.016), (34, -0.024), (35, 0.065), (36, -0.019), (37, -0.098), (38, 0.117), (39, -0.0), (40, 0.085), (41, 0.023), (42, -0.021), (43, -0.127), (44, -0.033), (45, 0.089), (46, 0.004), (47, -0.019), (48, 0.036), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96676099 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
2 0.66624701 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
3 0.6039055 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
4 0.59501231 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
Author: Michaël Aupetit
Abstract: Given a set of points and a set of prototypes representing them, how to create a graph of the prototypes whose topology accounts for that of the points? This problem had not yet been explored in the framework of statistical learning theory. In this work, we propose a generative model based on the Delaunay graph of the prototypes and the ExpectationMaximization algorithm to learn the parameters. This work is a first step towards the construction of a topological model of a set of points grounded on statistics. 1 1.1
5 0.46331915 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire
Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.
6 0.45556459 138 nips-2005-Non-Local Manifold Parzen Windows
7 0.45466745 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
8 0.45005792 50 nips-2005-Convex Neural Networks
9 0.43472558 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
10 0.42245275 182 nips-2005-Statistical Convergence of Kernel CCA
11 0.38968024 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
12 0.3861011 178 nips-2005-Soft Clustering on Graphs
13 0.36906368 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
14 0.36318383 167 nips-2005-Robust design of biological experiments
15 0.3513703 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
16 0.33909422 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
17 0.31929091 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
18 0.31775981 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers
19 0.31402075 62 nips-2005-Efficient Estimation of OOMs
20 0.31148291 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
topicId topicWeight
[(3, 0.037), (10, 0.036), (11, 0.012), (27, 0.018), (31, 0.04), (34, 0.069), (55, 0.025), (69, 0.038), (73, 0.036), (88, 0.061), (91, 0.521)]
simIndex simValue paperId paperTitle
same-paper 1 0.97603297 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
2 0.96289825 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
3 0.94136012 118 nips-2005-Learning in Silicon: Timing is Everything
Author: John V. Arthur, Kwabena Boahen
Abstract: We describe a neuromorphic chip that uses binary synapses with spike timing-dependent plasticity (STDP) to learn stimulated patterns of activity and to compensate for variability in excitability. Specifically, STDP preferentially potentiates (turns on) synapses that project from excitable neurons, which spike early, to lethargic neurons, which spike late. The additional excitatory synaptic current makes lethargic neurons spike earlier, thereby causing neurons that belong to the same pattern to spike in synchrony. Once learned, an entire pattern can be recalled by stimulating a subset. 1 Variability in Neural Systems Evidence suggests precise spike timing is important in neural coding, specifically, in the hippocampus. The hippocampus uses timing in the spike activity of place cells (in addition to rate) to encode location in space [1]. Place cells employ a phase code: the timing at which a neuron spikes relative to the phase of the inhibitory theta rhythm (5-12Hz) conveys information. As an animal approaches a place cell’s preferred location, the place cell not only increases its spike rate, but also spikes at earlier phases in the theta cycle. To implement a phase code, the theta rhythm is thought to prevent spiking until the input synaptic current exceeds the sum of the neuron threshold and the decreasing inhibition on the downward phase of the cycle [2]. However, even with identical inputs and common theta inhibition, neurons do not spike in synchrony. Variability in excitability spreads the activity in phase. Lethargic neurons (such as those with high thresholds) spike late in the theta cycle, since their input exceeds the sum of the neuron threshold and theta inhibition only after the theta inhibition has had time to decrease. Conversely, excitable neurons (such as those with low thresholds) spike early in the theta cycle. Consequently, variability in excitability translates into variability in timing. We hypothesize that the hippocampus achieves its precise spike timing (about 10ms) through plasticity enhanced phase-coding (PEP). The source of hippocampal timing precision in the presence of variability (and noise) remains unexplained. Synaptic plasticity can compensate for variability in excitability if it increases excitatory synaptic input to neurons in inverse proportion to their excitabilities. Recasting this in a phase-coding framework, we desire a learning rule that increases excitatory synaptic input to neurons directly related to their phases. Neurons that lag require additional synaptic input, whereas neurons that lead 120µm 190µm A B Figure 1: STDP Chip. A The chip has a 16-by-16 array of microcircuits; one microcircuit includes four principal neurons, each with 21 STDP circuits. B The STDP Chip is embedded in a circuit board including DACs, a CPLD, a RAM chip, and a USB chip, which communicates with a PC. require none. The spike timing-dependent plasticity (STDP) observed in the hippocampus satisfies this requirement [3]. It requires repeated pre-before-post spike pairings (within a time window) to potentiate and repeated post-before-pre pairings to depress a synapse. Here we validate our hypothesis with a model implemented in silicon, where variability is as ubiquitous as it is in biology [4]. Section 2 presents our silicon system, including the STDP Chip. Section 3 describes and characterizes the STDP circuit. Section 4 demonstrates that PEP compensates for variability and provides evidence that STDP is the compensation mechanism. 
Section 5 explores a desirable consequence of PEP: unconventional associative pattern recall. Section 6 discusses the implications of the PEP model, including its benefits and applications in the engineering of neuromorphic systems and in the study of neurobiology. 2 Silicon System We have designed, submitted, and tested a silicon implementation of PEP. The STDP Chip was fabricated through MOSIS in a 1P5M 0.25µm CMOS process, with just under 750,000 transistors in just over 10mm2 of area. It has a 32 by 32 array of excitatory principal neurons commingled with a 16 by 16 array of inhibitory interneurons that are not used here (Figure 1A). Each principal neuron has 21 STDP synapses. The address-event representation (AER) [5] is used to transmit spikes off chip and to receive afferent and recurrent spike input. To configure the STDP Chip as a recurrent network, we embedded it in a circuit board (Figure 1B). The board has five primary components: a CPLD (complex programmable logic device), the STDP Chip, a RAM chip, a USB interface chip, and DACs (digital-to-analog converters). The central component in the system is the CPLD. The CPLD handles AER traffic, mediates communication between devices, and implements recurrent connections by accessing a lookup table, stored in the RAM chip. The USB interface chip provides a bidirectional link with a PC. The DACs control the analog biases in the system, including the leak current, which the PC varies in real-time to create the global inhibitory theta rhythm. The principal neuron consists of a refractory period and calcium-dependent potassium circuit (RCK), a synapse circuit, and a soma circuit (Figure 2A). RCK and the synapse are ISOMA Soma Synapse STDP Presyn. Spike PE LPF A Presyn. Spike Raster AH 0 0.1 Spike probability RCK Postsyn. Spike B 0.05 0.1 0.05 0.1 0.08 0.06 0.04 0.02 0 0 Time(s) Figure 2: Principal neuron. A A simplified schematic is shown, including: the synapse, refractory and calcium-dependent potassium channel (RCK), soma, and axon-hillock (AH) circuits, plus their constituent elements, the pulse extender (PE) and the low-pass filter (LPF). B Spikes (dots) from 81 principal neurons are temporally dispersed, when excited by poisson-like inputs (58Hz) and inhibited by the common 8.3Hz theta rhythm (solid line). The histogram includes spikes from five theta cycles. composed of two reusable blocks: the low-pass filter (LPF) and the pulse extender (PE). The soma is a modified version of the LPF, which receives additional input from an axonhillock circuit (AH). RCK is inhibitory to the neuron. It consists of a PE, which models calcium influx during a spike, and a LPF, which models calcium buffering. When AH fires a spike, a packet of charge is dumped onto a capacitor in the PE. The PE’s output activates until the charge decays away, which takes a few milliseconds. Also, while the PE is active, charge accumulates on the LPF’s capacitor, lowering the LPF’s output voltage. Once the PE deactivates, this charge leaks away as well, but this takes tens of milliseconds because the leak is smaller. The PE’s and the LPF’s inhibitory effects on the soma are both described below in terms of the sum (ISHUNT ) of the currents their output voltages produce in pMOS transistors whose sources are at Vdd (see Figure 2A). Note that, in the absence of spikes, these currents decay exponentially, with a time-constant determined by their respective leaks. The synapse circuit is excitatory to the neuron. 
It is composed of a PE, which represents the neurotransmitter released into the synaptic cleft, and a LPF, which represents the bound neurotransmitter. The synapse circuit is similar to RCK in structure but differs in function: It is activated not by the principal neuron itself but by the STDP circuits (or directly by afferent spikes that bypass these circuits, i.e., fixed synapses). The synapse’s effect on the soma is also described below in terms of the current (ISYN ) its output voltage produces in a pMOS transistor whose source is at Vdd. The soma circuit is a leaky integrator. It receives excitation from the synapse circuit and shunting inhibition from RCK and has a leak current as well. Its temporal behavior is described by: τ dISOMA ISYN I0 + ISOMA = dt ISHUNT where ISOMA is the current the capacitor’s voltage produces in a pMOS transistor whose source is at Vdd (see Figure 2A). ISHUNT is the sum of the leak, refractory, and calciumdependent potassium currents. These currents also determine the time constant: τ = C Ut κISHUNT , where I0 and κ are transistor parameters and Ut is the thermal voltage. STDP circuit ~LTP SRAM Presynaptic spike A ~LTD Inverse number of pairings Integrator Decay Postsynaptic spike Potentiation 0.1 0.05 0 0.05 0.1 Depression -80 -40 0 Presynaptic spike Postsynaptic spike 40 Spike timing: t pre - t post (ms) 80 B Figure 3: STDP circuit design and characterization. A The circuit is composed of three subcircuits: decay, integrator, and SRAM. B The circuit potentiates when the presynaptic spike precedes the postsynaptic spike and depresses when the postsynaptic spike precedes the presynaptic spike. The soma circuit is connected to an AH, the locus of spike generation. The AH consists of model voltage-dependent sodium and potassium channel populations (modified from [6] by Kai Hynna). It initiates the AER signaling process required to send a spike off chip. To characterize principal neuron variability, we excited 81 neurons with poisson-like 58Hz spike trains (Figure 2B). We made these spike trains poisson-like by starting with a regular 200Hz spike train and dropping spikes randomly, with probability of 0.71. Thus spikes were delivered to neurons that won the coin toss in synchrony every 5ms. However, neurons did not lock onto the input synchrony due to filtering by the synaptic time constant (see Figure 2B). They also received a common inhibitory input at the theta frequency (8.3Hz), via their leak current. Each neuron was prevented from firing more than one spike in a theta cycle by its model calcium-dependent potassium channel population. The principal neurons’ spike times were variable. To quantify the spike variability, we used timing precision, which we define as twice the standard deviation of spike times accumulated from five theta cycles. With an input rate of 58Hz the timing precision was 34ms. 3 STDP Circuit The STDP circuit (related to [7]-[8]), for which the STDP Chip is named, is the most abundant, with 21,504 copies on the chip. This circuit is built from three subcircuits: decay, integrator, and SRAM (Figure 3A). The decay and integrator are used to implement potentiation, and depression, in a symmetric fashion. The SRAM holds the current binary state of the synapse, either potentiated or depressed. For potentiation, the decay remembers the last presynaptic spike. Its capacitor is charged when that spike occurs and discharges linearly thereafter. 
A postsynaptic spike samples the charge remaining on the capacitor, passes it through an exponential function, and dumps the resultant charge into the integrator. This charge decays linearly thereafter. At the time of the postsynaptic spike, the SRAM, a cross-coupled inverter pair, reads the voltage on the integrator’s capacitor. If it exceeds a threshold, the SRAM switches state from depressed to potentiated (∼LTD goes high and ∼LTP goes low). The depression side of the STDP circuit is exactly symmetric, except that it responds to postsynaptic activation followed by presynaptic activation and switches the SRAM’s state from potentiated to depressed (∼LTP goes high and ∼LTD goes low). When the SRAM is in the potentiated state, the presynaptic 50 After STDP 83 92 100 Timing precision(ms) Before STDP 75 B Before STDP After STDP 40 30 20 10 0 50 60 70 80 90 Input rate(Hz) 100 50 58 67 text A 0.2 0.4 Time(s) 0.6 0.2 0.4 Time(s) 0.6 C Figure 4: Plasticity enhanced phase-coding. A Spike rasters of 81 neurons (9 by 9 cluster) display synchrony over a two-fold range of input rates after STDP. B The degree of enhancement is quantified by timing precision. C Each neuron (center box) sends synapses to (dark gray) and receives synapses from (light gray) twenty-one randomly chosen neighbors up to five nodes away (black indicates both connections). spike activates the principal neuron’s synapse; otherwise the spike has no effect. We characterized the STDP circuit by activating a plastic synapse and a fixed synapse– which elicits a spike at different relative times. We repeated this pairing at 16Hz. We counted the number of pairings required to potentiate (or depress) the synapse. Based on this count, we calculated the efficacy of each pairing as the inverse number of pairings required (Figure 3B). For example, if twenty pairings were required to potentiate the synapse, the efficacy of that pre-before-post time-interval was one twentieth. The efficacy of both potentiation and depression are fit by exponentials with time constants of 11.4ms and 94.9ms, respectively. This behavior is similar to that observed in the hippocampus: potentiation has a shorter time constant and higher maximum efficacy than depression [3]. 4 Recurrent Network We carried out an experiment designed to test the STDP circuit’s ability to compensate for variability in spike timing through PEP. Each neuron received recurrent connections from 21 randomly selected neurons within an 11 by 11 neighborhood centered on itself (see Figure 4C). Conversely, it made recurrent connections to randomly chosen neurons within the same neighborhood. These connections were mediated by STDP circuits, initialized to the depressed state. We chose a 9 by 9 cluster of neurons and delivered spikes at a mean rate of 50 to 100Hz to each one (dropping spikes with a probability of 0.75 to 0.5 from a regular 200Hz train) and provided common theta inhibition as before. We compared the variability in spike timing after five seconds of learning with the initial distribution. Phase coding was enhanced after STDP (Figure 4A). Before STDP, spike timing among neurons was highly variable (except for the very highest input rate). After STDP, variability was virtually eliminated (except for the very lowest input rate). Initially, the variability, characterized by timing precision, was inversely related to the input rate, decreasing from 34 to 13ms. After five seconds of STDP, variability decreased and was largely independent of input rate, remaining below 11ms. 
Potentiated synapses 25 A Synaptic state after STDP 20 15 10 5 0 B 50 100 150 200 Spiking order 250 Figure 5: Compensating for variability. A Some synapses (dots) become potentiated (light) while others remain depressed (dark) after STDP. B The number of potentiated synapses neurons make (pluses) and receive (circles) is negatively (r = -0.71) and positively (r = 0.76) correlated to their rank in the spiking order, respectively. Comparing the number of potentiated synapses each neuron made or received with its excitability confirmed the PEP hypothesis (i.e., leading neurons provide additional synaptic current to lagging neurons via potentiated recurrent synapses). In this experiment, to eliminate variability due to noise (as opposed to excitability), we provided a 17 by 17 cluster of neurons with a regular 200Hz excitatory input. Theta inhibition was present as before and all synapses were initialized to the depressed state. After 10 seconds of STDP, a large fraction of the synapses were potentiated (Figure 5A). When the number of potentiated synapses each neuron made or received was plotted versus its rank in spiking order (Figure 5B), a clear correlation emerged (r = -0.71 or 0.76, respectively). As expected, neurons that spiked early made more and received fewer potentiated synapses. In contrast, neurons that spiked late made fewer and received more potentiated synapses. 5 Pattern Completion After STDP, we found that the network could recall an entire pattern given a subset, thus the same mechanisms that compensated for variability and noise could also compensate for lack of information. We chose a 9 by 9 cluster of neurons as our pattern and delivered a poisson-like spike train with mean rate of 67Hz to each one as in the first experiment. Theta inhibition was present as before and all synapses were initialized to the depressed state. Before STDP, we stimulated a subset of the pattern and only neurons in that subset spiked (Figure 6A). After five seconds of STDP, we stimulated the same subset again. This time they recruited spikes from other neurons in the pattern, completing it (Figure 6B). Upon varying the fraction of the pattern presented, we found that the fraction recalled increased faster than the fraction presented. We selected subsets of the original pattern randomly, varying the fraction of neurons chosen from 0.1 to 1.0 (ten trials for each). We classified neurons as active if they spiked in the two second period over which we recorded. Thus, we characterized PEP’s pattern-recall performance as a function of the probability that the pattern in question’s neurons are activated (Figure 6C). At a fraction of 0.50 presented, nearly all of the neurons in the pattern are consistently activated (0.91±0.06), showing robust pattern completion. We fitted the recall performance with a sigmoid that reached 0.50 recall fraction with an input fraction of 0.30. No spurious neurons were activated during any trials. Rate(Hz) Rate(Hz) 8 7 7 6 6 5 5 0.6 0.4 2 0.2 0 0 3 3 2 1 1 A 0.8 4 4 Network activity before STDP 1 Fraction of pattern actived 8 0 B Network activity after STDP C 0 0.2 0.4 0.6 0.8 Fraction of pattern stimulated 1 Figure 6: Associative recall. A Before STDP, half of the neurons in a pattern are stimulated; only they are activated. B After STDP, half of the neurons in a pattern are stimulated, and all are activated. C The fraction of the pattern activated grows faster than the fraction stimulated. 
6 Discussion Our results demonstrate that PEP successfully compensates for graded variations in our silicon recurrent network using binary (on–off) synapses (in contrast with [8], where weights are graded). While our chip results are encouraging, variability was not eliminated in every case. In the case of the lowest input (50Hz), we see virtually no change (Figure 4A). We suspect the timing remains imprecise because, with such low input, neurons do not spike every theta cycle and, consequently, provide fewer opportunities for the STDP synapses to potentiate. This shortfall illustrates the system’s limits; it can only compensate for variability within certain bounds, and only for activity appropriate to the PEP model. As expected, STDP is the mechanism responsible for PEP. STDP potentiated recurrent synapses from leading neurons to lagging neurons, reducing the disparity among the diverse population of neurons. Even though the STDP circuits are themselves variable, with different efficacies and time constants, when using timing the sign of the weight-change is always correct (data not shown). For this reason, we chose STDP over other more physiological implementations of plasticity, such as membrane-voltage-dependent plasticity (MVDP), which has the capability to learn with graded voltage signals [9], such as those found in active dendrites, providing more computational power [10]. Previously, we investigated a MVDP circuit, which modeled a voltage-dependent NMDAreceptor-gated synapse [11]. It potentiated when the calcium current analog exceeded a threshold, which was designed to occur only during a dendritic action potential. This circuit produced behavior similar to STDP, implying it could be used in PEP. However, it was sensitive to variability in the NMDA and potentiation thresholds, causing a fraction of the population to potentiate anytime the synapse received an input and another fraction to never potentiate, rendering both subpopulations useless. Therefore, the simpler, less biophysical STDP circuit won out over the MVDP circuit: In our system timing is everything. Associative storage and recall naturally emerge in the PEP network when synapses between neurons coactivated by a pattern are potentiated. These synapses allow neurons to recruit their peers when a subset of the pattern is presented, thereby completing the pattern. However, this form of pattern storage and completion differs from Hopfield’s attractor model [12] . Rather than forming symmetric, recurrent neuronal circuits, our recurrent network forms asymmetric circuits in which neurons make connections exclusively to less excitable neurons in the pattern. In both the poisson-like and regular cases (Figures 4 & 5), only about six percent of potentiated connections were reciprocated, as expected by chance. We plan to investigate the storage capacity of this asymmetric form of associative memory. Our system lends itself to modeling brain regions that use precise spike timing, such as the hippocampus. We plan to extend the work presented to store and recall sequences of patterns, as the hippocampus is hypothesized to do. Place cells that represent different locations spike at different phases of the theta cycle, in relation to the distance to their preferred locations. This sequential spiking will allow us to link patterns representing different locations in the order those locations are visited, thereby realizing episodic memory. 
We propose PEP as a candidate neural mechanism for information coding and storage in the hippocampal system. Observations from the CA1 region of the hippocampus suggest that basal dendrites (which primarily receive excitation from recurrent connections) support submillisecond timing precision, consistent with PEP [13]. We have shown, in a silicon model, PEP’s ability to exploit such fast recurrent connections to sharpen timing precision as well as to associatively store and recall patterns. Acknowledgments We thank Joe Lin for assistance with chip generation. The Office of Naval Research funded this work (Award No. N000140210468). References [1] O’Keefe J. & Recce M.L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus 3(3):317-330. [2] Mehta M.R., Lee A.K. & Wilson M.A. (2002) Role of experience and oscillations in transforming a rate code into a temporal code. Nature 417(6890):741-746. [3] Bi G.Q. & Wang H.X. (2002) Temporal asymmetry in spike timing-dependent synaptic plasticity. Physiology & Behavior 77:551-555. [4] Rodriguez-Vazquez, A., Linan, G., Espejo S. & Dominguez-Castro R. (2003) Mismatch-induced trade-offs and scalability of analog preprocessing visual microprocessor chips. Analog Integrated Circuits and Signal Processing 37:73-83. [5] Boahen K.A. (2000) Point-to-point connectivity between neuromorphic chips using address events. IEEE Transactions on Circuits and Systems II 47:416-434. [6] Culurciello E.R., Etienne-Cummings R. & Boahen K.A. (2003) A biomorphic digital image sensor. IEEE Journal of Solid State Circuits 38:281-294. [7] Bofill A., Murray A.F & Thompson D.P. (2005) Citcuits for VLSI Implementation of Temporally Asymmetric Hebbian Learning. In: Advances in Neural Information Processing Systems 14, MIT Press, 2002. [8] Cameron K., Boonsobhak V., Murray A. & Renshaw D. (2005) Spike timing dependent plasticity (STDP) can ameliorate process variations in neuromorphic VLSI. IEEE Transactions on Neural Networks 16(6):1626-1627. [9] Chicca E., Badoni D., Dante V., D’Andreagiovanni M., Salina G., Carota L., Fusi S. & Del Giudice P. (2003) A VLSI recurrent network of integrate-and-fire neurons connected by plastic synapses with long-term memory. IEEE Transaction on Neural Networks 14(5):1297-1307. [10] Poirazi P., & Mel B.W. (2001) Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron 29(3)779-796. [11] Arthur J.V. & Boahen K. (2004) Recurrently connected silicon neurons with active dendrites for one-shot learning. In: IEEE International Joint Conference on Neural Networks 3, pp.1699-1704. [12] Hopfield J.J. (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science 81(10):3088-3092. [13] Ariav G., Polsky A. & Schiller J. (2003) Submillisecond precision of the input-output transformation function mediated by fast sodium dendritic spikes in basal dendrites of CA1 pyramidal neurons. Journal of Neuroscience 23(21):7750-7758.
4 0.88962954 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
5 0.66919982 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
6 0.62437177 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
7 0.60273951 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
8 0.59075809 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
9 0.58591586 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
10 0.5801332 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
11 0.56388336 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
12 0.56144041 54 nips-2005-Data-Driven Online to Batch Conversions
13 0.55836749 181 nips-2005-Spiking Inputs to a Winner-take-all Network
14 0.55288064 30 nips-2005-Assessing Approximations for Gaussian Process Classification
15 0.54010904 52 nips-2005-Correlated Topic Models
16 0.53553391 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
17 0.531775 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
18 0.52105737 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
19 0.51984167 46 nips-2005-Consensus Propagation
20 0.51272821 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation