nips nips2001 nips2001-136 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola
Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace are shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. [sent-8, score-0.432]
2 We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. [sent-9, score-0.196]
3 Furthermore, the residuals when data is projected into a subspace are shown to be reliably estimated on a random sample. [sent-10, score-0.556]
4 1 Introduction A number of learning algorithms rely on estimating spectral data on a sample of training points and using this data as input to further analyses. [sent-12, score-0.196]
5 For example in Principal Component Analysis (PCA) the subspace spanned by the first k eigenvectors is used to give a k dimensional model of the data with minimal residual, hence forming a low dimensional representation of the data for analysis or clustering. [sent-13, score-0.836]
6 Recently the approach has been applied in kernel defined feature spaces in what has become known as kernel-PCA [5]. [sent-14, score-0.116]
7 This representation has also been related to an Information Retrieval algorithm known as latent semantic indexing, again with kernel defined feature spaces [2]. [sent-15, score-0.225]
8 Furthermore eigenvectors have been used in the HITS [3] and Google's PageRank [1] algorithms. [sent-16, score-0.187]
9 In both cases the entries in the eigenvector corresponding to the maximal eigenvalue are interpreted as authority weightings for individual articles or web pages. [sent-17, score-0.305]
10 The use of these techniques raises the question of how reliably these quantities can be estimated from a random sample of data, or phrased differently, how much data is required to obtain an accurate empirical estimate with high confidence. [sent-18, score-0.17]
11 [6] have undertaken a study of the sensitivity of the estimate of the first eigenvector to perturbations of the connection matrix. [sent-20, score-0.19]
12 They have also highlighted the potential instability that can arise when two eigenvalues are very close in value, so that their eigenspaces become very difficult to distinguish empirically. [sent-21, score-0.349]
13 The aim of this paper is to study the error in estimation that can arise from the random sampling rather than from perturbations of the connectivity. [sent-22, score-0.061]
14 We will show that eigenvalues estimated from a sample of size m are indeed concentrated, and furthermore the sum of the last m - k eigenvalues is subject to a similar concentration effect, both results of independent mathematical interest. [sent-24, score-1.07]
15 The sum of the last m - k eigenvalues is related to the error in forming a k dimensional PCA approximation, and hence will be shown to justify using empirical projection subspaces in such algorithms as kernel-PCA and latent semantic kernels. [sent-25, score-0.753]
16 We provide experimental verification of the theoretical findings in section 4, before drawing our conclusions. [sent-28, score-0.036]
17 Theorem 1 (McDiarmid) Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $f : A^n \to \mathbb{R}$ satisfies, for $1 \le i \le n$, $\sup_{x_1,\ldots,x_n,\hat{x}_i} |f(x_1,\ldots,x_n) - f(x_1,\ldots,\hat{x}_i,\ldots,x_n)| \le c_i$. [sent-39, score-0.116]
18 We will also make use of the following theorem characterising the eigenvectors of a symmetric matrix. [sent-46, score-0.285]
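Theorem 1 is truncated in this extraction; the condition above is the bounded-differences hypothesis of McDiarmid's inequality, whose standard conclusion (assumed here, not quoted from the paper) reads:

```latex
% McDiarmid's bounded-differences inequality, standard form (assumed).
% Under the condition sup |f(x_1,...,x_n) - f(x_1,...,\hat{x}_i,...,x_n)| <= c_i,
% for every epsilon > 0:
\[
  \Pr\bigl\{ f(X_1,\ldots,X_n) - \mathbb{E}\, f(X_1,\ldots,X_n) \ge \epsilon \bigr\}
  \;\le\; \exp\!\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right).
\]
```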
19 Theorem 3 If $M \in \mathbb{R}^{m \times m}$ is symmetric, then for $k = 1, \ldots, m$, $\lambda_k(M) = \max_{\dim(T)=k} \min_{0 \neq v \in T} \frac{v'Mv}{v'v} = \min_{\dim(T)=m-k+1} \max_{0 \neq v \in T} \frac{v'Mv}{v'v}$, with the extrema achieved by the corresponding eigenvector. [sent-50, score-0.546]
20 The approach adopted in the proofs of the next section is to view the eigenvalues as the sums of squares of residuals. [sent-51, score-0.438]
21 This is applicable when the matrix is positive semidefinite and hence can be written as an inner product matrix $M = X'X$, where $X'$ is the transpose of the matrix $X$ containing the $m$ vectors $x_1, \ldots, x_m$. [sent-52, score-0.192]
22 This is the finite dimensional version of Mercer's theorem, and follows immediately if we take $X = \sqrt{\Lambda}\,V'$, where $M = V \Lambda V'$ is the eigenvalue decomposition of $M$. [sent-56, score-0.292]
23 There may be more succinct ways of representing X, but we will assume for simplicity (but without loss of generality) that X is a square matrix with the same dimensions as M. [sent-57, score-0.088]
24 To set the scene, we now present a short description of the residuals viewpoint. [sent-58, score-0.263]
25 The starting point is the singular value decomposition of $X = U \Sigma V'$, where $U$ and $V$ are orthonormal matrices and $\Sigma$ is a diagonal matrix containing the singular values (in descending order). [sent-59, score-0.199]
26 We can now reconstruct the eigenvalue decomposition of $M = X'X = V \Sigma U' U \Sigma V' = V \Lambda V'$, where $\Lambda = \Sigma^2$. [sent-60, score-0.16]
27 But equally we can construct a matrix $N = XX' = U \Sigma V' V \Sigma U' = U \Lambda U'$, with the same eigenvalues as $M$. [sent-61, score-0.365]
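A minimal numerical sketch of this point (not from the paper; the random square matrix is an illustrative assumption): the two Gram-style matrices X'X and XX' built from the same X share their spectrum, which equals the squared singular values of X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))        # square X, as assumed in the text

M = X.T @ X                            # M = X'X = V Lambda V'
N = X @ X.T                            # N = XX' = U Lambda U'

eig_M = np.sort(np.linalg.eigvalsh(M))
eig_N = np.sort(np.linalg.eigvalsh(N))
singular_values = np.linalg.svd(X, compute_uv=False)

# Same eigenvalues for both matrices, equal to Sigma^2.
assert np.allclose(eig_M, eig_N)
assert np.allclose(eig_M, np.sort(singular_values**2))
```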
28 Here $P_v(x)$ ($P_{v^\perp}(x)$) is the projection of $x$ onto the space spanned by $v$ (the space perpendicular to $v$), since $\|x\|^2 = \|P_v(x)\|^2 + \|P_{v^\perp}(x)\|^2$. [sent-64, score-0.346]
29 It follows that the first eigenvector is characterised as the direction for which the sum of the squares of the residuals is minimal. [sent-67, score-0.658]
30 Applying the same line of reasoning to the first equality of Theorem 3 delivers the following equality: $\lambda_k = \max_{\dim(V)=k} \min_{0 \neq v \in V} \sum_{j=1}^{m} \|P_v(x_j)\|^2$ (1). [sent-68, score-0.323]
31 It readily follows by induction over the dimension of $V$ that we can equally characterise the sum of the first $k$ and last $m-k$ eigenvalues by $\sum_{i=1}^{k} \lambda_i = \max_{\dim(V)=k} \sum_{j=1}^{m} \|P_V(x_j)\|^2$ and $\sum_{i=k+1}^{m} \lambda_i = \min_{\dim(V)=k} \sum_{j=1}^{m} \|P_{V^\perp}(x_j)\|^2$ (4). [sent-71, score-0.649]
32 Hence, as for the case when k = 1, the subspace spanned by the first k eigenvectors is characterised as that for which the sum of the squares of the residuals is minimal. [sent-79, score-1.214]
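A short sketch checking this characterisation numerically (illustrative data and dimensions, not the authors' code): the sum of the smallest m - k eigenvalues of the Gram matrix equals the total squared residual left after projecting the points onto the best k-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 10, 50, 3
X = rng.standard_normal((d, m))          # columns are the points x_1, ..., x_m
M = X.T @ X                              # m x m Gram matrix

evals = np.linalg.eigvalsh(M)            # ascending order
tail_sum = evals[:-k].sum()              # sum of the last m - k eigenvalues

# Optimal k-dimensional subspace: span of the top k left singular vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :k] @ U[:, :k].T                # orthogonal projection onto that subspace
residuals = X - P @ X                    # components perpendicular to the subspace

assert np.isclose(tail_sum, (residuals**2).sum())
```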
33 Frequently, we consider all of the above as occurring in a kernel defined feature space, so that wherever we have written $x_j$ we should have put $\phi(x_j)$, where $\phi$ is the corresponding projection. [sent-80, score-0.152]
34 3 Concentration of eigenvalues The previous section outlined the relatively well-known perspective that we now apply to obtain the concentration results for the eigenvalues of positive semi-definite matrices. [sent-81, score-0.844]
35 The key to the results is the characterisation in terms of the sums of residuals given in equations (1) and (4). [sent-82, score-0.391]
36 Theorem 4 Let $K(x,z)$ be a positive semi-definite kernel function on a space $X$, and let $\mu$ be a distribution on $X$. [sent-83, score-0.162]
37 Fix natural numbers $m$ and $1 \le k < m$ and let $S = (x_1, \ldots, x_m) \in X^m$ be a sample of $m$ points drawn according to $\mu$. [sent-84, score-0.366]
38 Here $\lambda_k(S)$ is the k-th eigenvalue of the matrix $K(S)$ with entries $K(S)_{ij} = K(x_i, x_j)$ and $R^2 = \max_{x \in X} K(x,x)$. [sent-91, score-0.248]
39 Proof: The result follows from an application of Theorem 1 provided $\sup_{S, \hat{x}_i} \frac{1}{m} \bigl| \lambda_k(S) - \lambda_k(\hat{S}) \bigr| \le \frac{R^2}{m}$, where $\hat{S}$ denotes $S$ with $x_i$ replaced by $\hat{x}_i$. [sent-92, score-0.153]
40 Let $\hat{S} = S \setminus \{x_i\}$ and let $V$ ($\hat{V}$) be the $k$-dimensional subspace spanned by the first $k$ eigenvectors of $K(S)$ ($K(\hat{S})$). [sent-97, score-0.74]
41 Using equation (1) we obtain the required bound. Surprisingly, a very similar result holds when we consider the sum of the last m - k eigenvalues. [sent-98, score-0.116]
42 Theorem 5 Let $K(x, z)$ be a positive semi-definite kernel function on a space $X$, and let $\mu$ be a distribution on $X$. [sent-99, score-0.162]
43 Fix natural numbers $m$ and $1 \le k < m$ and let $S = (x_1, \ldots, x_m) \in X^m$ be [sent-100, score-0.049]
44 a sample of $m$ points drawn according to $\mu$. [sent-103, score-0.317]
45 Here $\lambda_{>k}(S)$ is the sum of all but the largest $k$ eigenvalues of the matrix $K(S)$ with entries $K(S)_{ij} = K(x_i, x_j)$ and $R^2 = \max_{x \in X} K(x,x)$. [sent-110, score-0.501]
46 Proof: The result follows from an application of Theorem 1 provided $\sup_{S, \hat{x}_i} \frac{1}{m} \bigl| \lambda_{>k}(S) - \lambda_{>k}(\hat{S}) \bigr| \le \frac{R^2}{m}$. [sent-111, score-0.153]
47 Let $\hat{S} = S \setminus \{x_i\}$ and let $V$ ($\hat{V}$) be the $k$-dimensional subspace spanned by the first $k$ eigenvectors of $K(S)$ ($K(\hat{S})$). [sent-116, score-0.74]
48 Using equation (4) we obtain the required bound. Our next result concerns the concentration of the residuals with respect to a fixed subspace. [sent-117, score-0.481]
49 For a subspace $V$ and training set $S$, we introduce the notation $F_V(S) = \frac{1}{m} \sum_{i=1}^{m} \|P_V(x_i)\|^2$. [sent-118, score-0.222]
50 Fix natural numbers $m$ and a subspace $V$ and let $S = (x_1, \ldots, x_m) \in X^m$ be [sent-120, score-0.228]
51 a sample of $m$ points drawn according to $\mu$. [sent-123, score-0.317]
52 Proof: The result follows from an application of Theorem 2 provided $\sup_{S, \hat{x}_i} |F_V(S) - F_V(S \setminus \{x_i\} \cup \{\hat{x}_i\})| \le R^2/m$. [sent-125, score-0.153]
53 Clearly the largest change will occur if one of the points $x_i$ and $\hat{x}_i$ lies in the subspace $V$ and the other does not. [sent-126, score-0.22]
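The theorem statements are truncated in this extraction; given the bounded difference of at most R^2/m established in each proof, the two-sided bound that McDiarmid's inequality would deliver takes the following form (a reconstruction under that assumption, not a quotation of the paper's exact constants):

```latex
% With c_i = R^2/m for i = 1, ..., m we get \sum_i c_i^2 = R^4/m, hence
\[
  \Pr\Bigl\{ \bigl| F_V(S) - \mathbb{E}_S\, F_V(S) \bigr| \ge \epsilon \Bigr\}
  \;\le\; 2 \exp\!\left( \frac{-2 \epsilon^2 m}{R^4} \right),
\]
% with analogous bounds for \lambda_k(S)/m and \lambda_{>k}(S)/m.
```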
54 D 4 Experiments In order to test the concentration results we p erformed experiments with the Breast cancer data using a cubic polynomial kernel. [sent-128, score-0.257]
55 The kernel was chosen to ensure that the spectrum did not decay too fast. [sent-129, score-0.141]
56 We centered the whole data set so that the origin of the feature space is placed at the centre of gravity of the training set. [sent-131, score-0.162]
57 We then performed an eigenvalue decomposition of the training set. [sent-132, score-0.203]
58 The sum of the eigenvalues beyond the k-th gives the sum of the residual squared norms of the training points when we project onto the space spanned by the first k eigenvectors. [sent-133, score-0.945]
59 Dividing this by the average of all the eigenvalues (which measures the average square norm of the training points in the transformed space) gives the fraction of the residual not captured in the k dimensional projection. [sent-134, score-0.712]
60 Figure 1a shows the full spectrum, while Figure 1b shows a zoomed-in subwindow. [sent-137, score-0.036]
61 The very tight error bars clearly show the tight concentration of the sums of the tail of eigenvalues, as predicted by Theorem 5. [sent-138, score-0.731]
62 In order to test the concentration results for subsets we measured the residuals of the test points when they are projected into the subspace spanned by the first k eigenvectors generated above for the training set. [sent-139, score-1.292]
63 The dashed lines in Figure 1 show the ratio of the average squares of these residuals to the average squared norm of the test points. [sent-140, score-0.456]
64 We see the two curves tracking each other very closely, indicating that the subspace identified as optimal for the training set is indeed capturing almost the same amount of information in the test points. [sent-141, score-0.371]
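A sketch of the experiment described above (the cubic kernel, feature-space centering, and the train/test residual comparison follow the text; the dataset loader, split sizes, feature scaling, and variable names are illustrative assumptions, not the authors' code):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def poly3_kernel(A, B):
    # Cubic polynomial kernel, chosen so the spectrum does not decay too fast.
    return (A @ B.T) ** 3

X, _ = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # scale features (assumption)
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=0)
m = X_tr.shape[0]

# Centre the feature space at the centre of gravity of the training set.
K = poly3_kernel(X_tr, X_tr)
J = np.ones((m, m)) / m
Kc = K - J @ K - K @ J + J @ K @ J

evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order

# Fraction of the training norm NOT captured by the first k eigenvectors
# (sum of the tail eigenvalues over the trace).
k = 10
frac_residual_train = evals[k:].sum() / evals.sum()

# Residuals of the test points projected into the training subspace.
n_te = X_te.shape[0]
K_te = poly3_kernel(X_te, X_tr)
J_te = np.ones((n_te, m)) / m
K_te_c = K_te - J_te @ K - K_te @ J + J_te @ K @ J
alphas = evecs[:, :k] / np.sqrt(evals[:k])        # unit-norm directions in feature space
proj = K_te_c @ alphas                            # projections onto the first k eigenvectors
sq_norms_te = np.sum(X_te**2, axis=1) ** 3 - 2 * K_te.mean(axis=1) + K.mean()
frac_residual_test = 1 - (proj**2).sum() / sq_norms_te.sum()

print(frac_residual_train, frac_residual_test)
```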
65 5 Conclusions The paper has shown that the eigenvalues of a positive semi-definite matrix generated from a random sample are concentrated. [sent-142, score-0.438]
66 Furthermore, the sum of the last m - k eigenvalues is similarly concentrated, as is the residual when the data is projected into a fixed subspace. [sent-143, score-0.607]
67 [Figure 1(b): x-axis "Projection Dimensionality", 0 to 100.] Figure 1: Plots of the fraction of the average squared norm captured in the subspace spanned by the first k eigenvectors for different values of k. [sent-158, score-0.768]
68 The continuous line is the fraction for the training set, while the dashed line is for the test set. [sent-159, score-0.131]
69 Experiments are presented that confirm the theoretical predictions on a real world dataset. [sent-161, score-0.082]
70 The results provide a basis for performing PCA or kernel-PCA from a randomly generated sample, as they confirm that the subset identified by the sample will indeed 'generalise' in the sense that it will capture most of the information in a test sample. [sent-162, score-0.304]
71 Further research should look at the question of how the space identified by a subsample relates to the eigenspace of the underlying kernel operator. [sent-163, score-0.222]
wordName wordTfidf (topN-words)
[('eigenvalues', 0.313), ('residuals', 0.263), ('iipv', 0.249), ('concentration', 0.218), ('dim', 0.197), ('eigenvectors', 0.187), ('spanned', 0.187), ('subspace', 0.179), ('xj', 0.179), ('veir', 0.166), ('theoreill', 0.165), ('xm', 0.157), ('xi', 0.144), ('ilpv', 0.125), ('sup', 0.116), ('max', 0.104), ('xl', 0.102), ('eigenvalue', 0.101), ('pv', 0.099), ('theorem', 0.098), ('dimensional', 0.095), ('xd', 0.088), ('min', 0.086), ('eigenvector', 0.086), ('residual', 0.086), ('characterised', 0.083), ('ifv', 0.083), ('vlv', 0.083), ('confirm', 0.082), ('sum', 0.077), ('sample', 0.073), ('identified', 0.073), ('kernel', 0.073), ('characterisation', 0.072), ('mcdiarmid', 0.072), ('ak', 0.07), ('projection', 0.07), ('pca', 0.069), ('squares', 0.069), ('spectrum', 0.068), ('iip', 0.066), ('jaz', 0.066), ('xn', 0.064), ('fix', 0.062), ('perturbations', 0.061), ('reliably', 0.061), ('semantic', 0.059), ('web', 0.059), ('entries', 0.059), ('decomposition', 0.059), ('tail', 0.058), ('fv', 0.058), ('nello', 0.058), ('sums', 0.056), ('projected', 0.053), ('matrix', 0.052), ('forming', 0.05), ('latent', 0.05), ('let', 0.049), ('ll', 0.049), ('fraction', 0.049), ('holloway', 0.049), ('perpendicular', 0.049), ('norm', 0.047), ('en', 0.047), ('ng', 0.047), ('drawn', 0.046), ('equality', 0.045), ('va', 0.044), ('singular', 0.044), ('first', 0.043), ('tight', 0.043), ('training', 0.043), ('feature', 0.043), ('points', 0.041), ('space', 0.04), ('concentrated', 0.039), ('test', 0.039), ('spectral', 0.039), ('last', 0.039), ('squared', 0.038), ('captured', 0.038), ('cristianini', 0.038), ('proof', 0.038), ('follows', 0.037), ('indeed', 0.037), ('verification', 0.036), ('characterise', 0.036), ('av', 0.036), ('cii', 0.036), ('succinct', 0.036), ('zoomed', 0.036), ('maxx', 0.036), ('organised', 0.036), ('instability', 0.036), ('eigenspace', 0.036), ('gravity', 0.036), ('semidefinite', 0.036), ('phrased', 0.036), ('wherever', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 136 nips-2001-On the Concentration of Spectral Properties
Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola
Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1
2 0.21007718 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
3 0.194032 170 nips-2001-Spectral Kernel Methods for Clustering
Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola
Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1
4 0.18982898 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.
5 0.17467719 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
6 0.17446809 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
7 0.15294129 134 nips-2001-On Kernel-Target Alignment
8 0.14894404 74 nips-2001-Face Recognition Using Kernel Methods
9 0.13134669 171 nips-2001-Spectral Relaxation for K-means Clustering
10 0.12020198 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
11 0.10549009 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
12 0.10254436 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
13 0.09356717 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
14 0.088312447 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
15 0.085741833 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
16 0.079569861 79 nips-2001-Gaussian Process Regression with Mismatched Models
17 0.077348836 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
18 0.073307499 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
19 0.067922875 58 nips-2001-Covariance Kernels from Bayesian Generative Models
20 0.063557774 155 nips-2001-Quantizing Density Estimators
topicId topicWeight
[(0, -0.223), (1, 0.111), (2, -0.028), (3, -0.235), (4, 0.249), (5, 0.061), (6, -0.03), (7, 0.05), (8, 0.03), (9, 0.047), (10, -0.022), (11, 0.118), (12, -0.014), (13, 0.094), (14, -0.039), (15, -0.113), (16, -0.057), (17, -0.16), (18, 0.056), (19, 0.074), (20, -0.068), (21, -0.113), (22, 0.075), (23, 0.128), (24, -0.223), (25, -0.113), (26, -0.111), (27, -0.035), (28, 0.021), (29, 0.018), (30, -0.054), (31, 0.083), (32, 0.009), (33, -0.081), (34, -0.057), (35, 0.007), (36, -0.153), (37, -0.019), (38, 0.055), (39, -0.019), (40, 0.004), (41, 0.106), (42, -0.041), (43, 0.007), (44, 0.004), (45, 0.049), (46, -0.035), (47, -0.004), (48, -0.056), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.96856374 136 nips-2001-On the Concentration of Spectral Properties
Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola
Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1
2 0.76096511 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
3 0.57886171 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
Author: Mikhail Belkin, Partha Niyogi
Abstract: Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami op erator on a manifold , and the connections to the heat equation , we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. Several applications are considered. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, gray scale n x n images of a fixed object taken with a moving camera yield data points in rn: n2 . However , the intrinsic dimensionality of the space of all images of t he same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in rn: n2 . While there is a large body of work on dimensionality reduction in general, most existing approaches do not explicitly take into account the structure of the manifold on which the data may possibly reside. Recently, there has been some interest (Tenenbaum et aI, 2000 ; Roweis and Saul, 2000) in the problem of developing low dimensional representations of data in this particular context. In this paper , we present a new algorithm and an accompanying framework of analysis for geometrically motivated dimensionality reduction. The core algorithm is very simple, has a few local computations and one sparse eigenvalu e problem. The solution reflects th e intrinsic geom etric structure of the manifold. Th e justification comes from the role of the Laplacian op erator in providing an optimal emb edding. Th e Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The emb edding maps for the data come from approximations to a natural map that is defined on the entire manifold. The framework of analysis presented here makes this connection explicit. While this connection is known to geometers and specialists in sp ectral graph theory (for example , see [1, 2]) to the best of our knowledge we do not know of any application to data representation yet. The connection of the Laplacian to the heat kernel enables us to choose the weights of the graph in a principled manner. The locality preserving character of the Laplacian Eigenmap algorithm makes it relatively insensitive to outliers and noise. A byproduct of this is that the algorithm implicitly emphasizes the natural clusters in the data. Connections to spectral clustering algorithms developed in learning and computer vision (see Shi and Malik , 1997) become very clear. Following the discussion of Roweis and Saul (2000) , and Tenenbaum et al (2000), we note that the biological perceptual apparatus is confronted with high dimensional stimuli from which it must recover low dimensional structure. One might argue that if the approach to recovering such low-dimensional structure is inherently local , then a natural clustering will emerge and thus might serve as the basis for the development of categories in biological perception. 1 The Algorithm Given k points Xl , ... , Xk in ]]{ I, we construct a weighted graph with k nodes, one for each point , and the set of edges connecting neighboring points to each other. 1. Step 1. 
[Constructing the Graph] We put an edge between nodes i and j if Xi and Xj are
4 0.57474452 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
Author: Anand Rangarajan, Alan L. Yuille
Abstract: Bayesian belief propagation in graphical models has been recently shown to have very close ties to inference methods based in statistical physics. After Yedidia et al. demonstrated that belief propagation fixed points correspond to extrema of the so-called Bethe free energy, Yuille derived a double loop algorithm that is guaranteed to converge to a local minimum of the Bethe free energy. Yuille’s algorithm is based on a certain decomposition of the Bethe free energy and he mentions that other decompositions are possible and may even be fruitful. In the present work, we begin with the Bethe free energy and show that it has a principled interpretation as pairwise mutual information minimization and marginal entropy maximization (MIME). Next, we construct a family of free energy functions from a spectrum of decompositions of the original Bethe free energy. For each free energy in this family, we develop a new algorithm that is guaranteed to converge to a local minimum. Preliminary computer simulations are in agreement with this theoretical development. 1
5 0.52002335 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli
Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities. White noise analysis has emerged as a powerful technique for characterizing response properties of spiking neurons. A sequence of stimuli are drawn randomly from an ensemble and presented in rapid succession, and one examines the subset that elicit action potentials. This “spike-triggered” stimulus ensemble can provide information about the neuron’s response characteristics. In the most widely used form of this analysis, one estimates an excitatory linear kernel by computing the spike-triggered average (STA); that is, the mean stimulus that elicited a spike [e.g., 1, 2]. Under the assumption that spikes are generated by a Poisson process with instantaneous rate determined by linear projection onto a kernel followed by a static nonlinearity, the STA provides an unbiased estimate of this kernel [3]. Recently, a number of authors have developed interesting extensions of white noise analysis. Some have examined spike-triggered averages in a reduced linear subspace of input stimuli [e.g., 4]. Others have recovered excitatory subspaces, by computing the spiketriggered covariance (STC), followed by an eigenvector analysis to determine the subspace axes [e.g., 5, 6]. Sensory neurons exhibit striking nonlinear behaviors that are not explained by fundamentally linear mechanisms. For example, the response of a neuron typically saturates for large amplitude stimuli; the response to the optimal stimulus is often suppressed by the presence of a non-optimal mask [e.g., 7]; and the kernel recovered from STA analysis may change shape as a function of stimulus amplitude [e.g., 8, 9]. A variety of these nonlinear behaviors can be attributed to gain control [e.g., 8, 10, 11, 12, 13, 14], in which neural responses are suppressively modulated by a gain signal derived from the stimulus. Although the underlying mechanisms and time scales associated with such gain control are current topics of research, the basic functional properties appear to be ubiquitous, occurring throughout the nervous system. a b 0 k0 0 Figure 1: Geometric depiction of spike-triggered analyses. a, Spike-triggered averaging with two-dimensional stimuli. Black points indicate raw stimuli. White points indicate stimuli eliciting a spike, and the STA (black vector), which provides an estimate of , corresponds to their center of mass. b, Spike-triggered covariance analysis of suppressive axes. Shown are a set of stimuli lying on a plane perpendicular to the excitatory kernel, . Within the plane, stimuli eliciting a spike are concentrated in an elliptical region. The minor axis of the ellipse corresponds to a suppressive stimulus direction: stimuli with a significant component along this axis are less likely to elicit spikes. The stimulus component along the major axis of the ellipse has no influence on spiking. ¢ £ ¡ ¢ ¡ Here we develop a white noise methodology for characterizing a neuron with gain control. 
We show that a set of suppressive kernels may be recovered by finding the eigenvectors of the spike-triggered covariance matrix associated with smallest variance. We apply the technique to electrophysiological data obtained from ganglion cells in salamander and macaque retina, and recover a set of axes that are shown to reduce responses in the neuron. Moreover, when we fit a gain control model to the data using a maximum likelihood procedure within this subspace, the model accounts for changes in the STA as a function of contrast. 1 Characterizing suppressive axes ¤¥ As in all white noise approaches, we assume that stimuli correspond to vectors, , in some finite-dimensional space (e.g., a neighborhood of pixels or an interval of time samples). We assume a gain control model in which the probability of a stimulus eliciting a spike grows monotonically with the halfwave-rectified projection onto an excitatory linear kernel, , and is suppressively modulated by the fullwave-rectified projection onto a set of . linear kernels, ¨ ¤§ ¤¥ ©¤ § ¨ ¤ ¥ ©¤ § ¦ First, we recover the excitatory kernel, . This is achieved by presenting spherically symmetric input stimuli (e.g., Gaussian white noise) to the neuron and computing the STA (Fig. 1a). STA correctly recovers the excitatory kernel, under the assumption that each of the gain control kernels are orthogonal (or equal) to the excitatory kernel. The proof is essentially the same as that given for recovering the kernel of a linear model followed by a monotonic nonlinearity [3]. In particular, any stimulus can be decomposed into a component in the direction of the excitatory kernel, and a component in a perpendicular direction. This can be paired with another stimulus that is identical, except that its component in the perpendicular direction is negated. The two stimuli are equally likely to occur in a spherically Gaussian stimulus set (since they are equidistant from the origin), and they are equally likely to elicit a spike (since their excitatory components are equal, and their rectified perpendicular components are equal). Their vector average lies in the direction of the excitatory kernel. Thus, the STA (which is an average over all such stimuli, or all such stimulus pairs) must also lie in that direction. In a subsequent section we explain how to Model: Retrieved: Excitatory: Excitatory: Eigenvalues: Suppressive: Suppressive: Weights Variance (eigenvalue) { 1.5 { 2{ 2.5 { 3{ 1 1 Arbitrary 0 Axis number 350 Figure 2: Estimation of kernels from a simulated model (equation 2). Left: Model kernels. Right: Sorted eigenvalues of covariance matrix of stimuli eliciting spikes (STC). Five eigenvalues fall significantly below the others. Middle: STA (excitatory kernel) and eigenvectors (suppressive kernels) associated with the lowest eigenvalues. recover the excitatory kernel when it is not orthogonal to the suppressive kernels. Next, we recover the suppressive subspace, assuming the excitatory kernel is known. Consider the stimuli lying on a plane perpendicular to this kernel. These stimuli all elicit the same response in the excitatory kernel, but they may produce different amounts of suppression. Figure 1b illustrates the behavior in a three-dimensional stimulus space, in which one axis is assumed to be suppressive. The distribution of raw stimuli on the plane is spherically symmetric about the origin. 
But the distribution of stimuli eliciting a spike is narrower along the suppressive direction: these stimuli have a component along the suppressive axis and are therefore less likely to elicit a spike. This behavior is easily generalized from this plane to the entire stimulus space. If we assume that the suppressive axes are fixed, then we expect to see reductions in variance in the same directions for any level of numerator excitation. Given this behavior of the spike-triggered stimulus ensemble, we can recover the suppressive subspace using principal component analysis. We construct the sample covariance matrix of the stimuli eliciting a spike: £ §¥
6 0.51953906 164 nips-2001-Sampling Techniques for Kernel Methods
7 0.50705594 74 nips-2001-Face Recognition Using Kernel Methods
8 0.49601039 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
9 0.46811011 170 nips-2001-Spectral Kernel Methods for Clustering
10 0.45590454 171 nips-2001-Spectral Relaxation for K-means Clustering
11 0.43083769 120 nips-2001-Minimax Probability Machine
12 0.40836784 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
13 0.37589905 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
14 0.35347453 134 nips-2001-On Kernel-Target Alignment
15 0.34535483 31 nips-2001-Algorithmic Luckiness
16 0.32340801 154 nips-2001-Products of Gaussians
17 0.31886747 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
18 0.31652069 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
19 0.27954417 62 nips-2001-Duality, Geometry, and Support Vector Regression
20 0.26524514 79 nips-2001-Gaussian Process Regression with Mismatched Models
topicId topicWeight
[(14, 0.06), (17, 0.506), (19, 0.019), (27, 0.133), (30, 0.035), (59, 0.029), (72, 0.037), (79, 0.024), (91, 0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.92721182 136 nips-2001-On the Concentration of Spectral Properties
Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola
Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1
2 0.91172254 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the detection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
3 0.88665318 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
Author: Shai Fine, Katya Scheinberg
Abstract: We propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework, can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The later method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solve the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. 1
4 0.51782155 134 nips-2001-On Kernel-Target Alignment
Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola
Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1
5 0.51236564 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning. 1
6 0.51082599 59 nips-2001-Direct value-approximation for factored MDPs
7 0.50888824 121 nips-2001-Model-Free Least-Squares Policy Iteration
8 0.49871051 13 nips-2001-A Natural Policy Gradient
9 0.48924142 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
10 0.48228014 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
11 0.4698047 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.45627531 31 nips-2001-Algorithmic Luckiness
14 0.45318237 89 nips-2001-Grouping with Bias
15 0.45317644 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
16 0.4475058 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
17 0.44529888 74 nips-2001-Face Recognition Using Kernel Methods
18 0.44402987 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
19 0.44055575 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
20 0.43640578 171 nips-2001-Spectral Relaxation for K-means Clustering