nips nips2003 nips2003-60 knowledge-graph by maker-knowledge-mining

60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis


Source: pdf

Author: James T. Kwok, Brian Mak, Simon Ho

Abstract: Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. [sent-4, score-0.791]

2 At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. [sent-5, score-0.057]

3 In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. [sent-6, score-0.134]

4 One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. [sent-7, score-0.894]

5 Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. [sent-8, score-1.143]

6 On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. [sent-9, score-1.446]

7 1 Introduction In recent years, there has been a lot of interest in the study of kernel methods [1]. [sent-10, score-0.103]

8 The basic idea is to map data in the input space X to a feature space via some nonlinear map ϕ, and then apply a linear method there. [sent-11, score-0.142]

9 It is now well known that the computational procedure depends only on the inner products¹ $\varphi(x_i)^\top \varphi(x_j)$ in the feature space (where $x_i, x_j \in \mathcal{X}$), which can be obtained efficiently from a suitable kernel function k(·, ·). [sent-12, score-0.222]

10 Besides, kernel methods have the important computational advantage that no nonlinear optimization is involved. [sent-13, score-0.134]

11 Thus, the use of kernels provides elegant nonlinear generalizations of many existing linear algorithms. [sent-14, score-0.057]

12 In unsupervised learning, the kernel idea has also led to methods such as kernel-based clustering algorithms and kernel principal component analysis [2]. [sent-16, score-0.222]

13 In the field of automatic speech recognition, eigenvoice speaker adaptation [3] has drawn some attention in recent years as it is found particularly useful when only a small amount of adaptation speech is available; e. [sent-17, score-1.464]

14 At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. [sent-20, score-0.057]

15 A new speaker is represented as a linear combination of a few (most important) eigenvoices, and the eigenvoice weights are usually estimated by maximizing the likelihood of the adaptation data. [sent-22, score-1.337]

16 In this paper, we investigate the use of nonlinear PCA to find the eigenvoices by kernel methods. [sent-24, score-0.353]

17 In effect, the nonlinear PCA problem is converted to a linear PCA problem in the highdimension feature space using the kernel trick. [sent-25, score-0.179]

18 One of the major challenges is to map the feature-space eigenvoices back to the observation space to compute the state observation likelihood of adaptation data during the estimation of eigenvoice weights and likelihood of test data during decoding. [sent-26, score-1.138]

19 Our solution is to compute kernel PCA using composite kernels. [sent-27, score-0.171]

20 We will call our new method kernel eigenvoice speaker adaptation. [sent-28, score-0.972]

21 Kernel eigenvoice adaptation will have to deal with several parameter spaces. [sent-29, score-0.754]

22 To avoid confusion, we denote the several spaces as follows: the d1 -dimensional observation space as O; the d2 -dimensional speaker (supervector) space as X ; and the d3 -dimensional speaker feature space as F. [sent-30, score-0.762]

23 Brief overviews on eigenvoice speaker adaptation and kernel PCA are given in Sections 2 and 3. [sent-33, score-1.185]

24 Sections 4 and 5 then describe our proposed kernel eigenvoice method and its robust extension. [sent-34, score-0.689]

25 2 Eigenvoice In the standard eigenvoice approach [3], speech training data are collected from many speakers with diverse characteristics. [sent-36, score-0.646]

26 A set of speaker-dependent (SD) acoustic hidden Markov models (HMMs) is trained for each speaker, where each HMM state is modeled as a mixture of Gaussian distributions. [sent-37, score-0.375]

27 A speaker’s voice is then represented by a speaker supervector that is composed by concatenating the mean vectors of all HMM Gaussian distributions. [sent-38, score-0.521]

28 Thus, the ith speaker supervector consists of R constituents, one from each Gaussian, and will be denoted by $x_i = [x_{i1}^\top, \ldots, x_{iR}^\top]^\top$. [sent-41, score-0.56]

29 The similarity between any two speaker supervectors $x_i$ and $x_j$ is measured by their dot product $x_i^\top x_j = \sum_{r=1}^{R} x_{ir}^\top x_{jr}$. (1) [sent-45, score-0.837]

30 PCA is then performed on a set of training speaker supervectors and the resulting eigenvectors are called eigenvoices. [sent-46, score-0.426]

31 To adapt to a new speaker, his/her supervector s is treated as a linear combination of the first M eigenvoices $\{v_1, \ldots, v_M\}$. [sent-47, score-0.412]

32 Only a few eigenvoices (M < 10) are employed, so that only a small amount of adaptation speech is required. [sent-58, score-0.307]

33 Given the adaptation data $o_t$, $t = 1, \ldots, T$, the eigenvoice weights are in turn estimated by maximizing the likelihood of the $o_t$'s. [sent-64, score-0.564]
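The standard eigenvoice pipeline described in sentences 25–33 amounts to PCA on the training speaker supervectors followed by a weighted combination of the leading eigenvectors. A minimal NumPy sketch is given below; it is illustrative rather than the authors' implementation, the function and variable names are assumptions, and whether the training mean is added back (as here) or carried as an extra eigenvoice is a convention that varies.

```python
import numpy as np

def train_eigenvoices(supervectors, M):
    """PCA on the N x d2 matrix of training speaker supervectors; the leading
    right singular vectors of the centered data are the eigenvoices."""
    mean = supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(supervectors - mean, full_matrices=False)
    return mean, vt[:M]                      # eigenvoices: M x d2

def adapt_supervector(mean, eigenvoices, w):
    """New speaker's supervector as a linear combination of the first M
    eigenvoices; the weights w would be estimated by maximizing the
    likelihood of the adaptation data (not shown here)."""
    return mean + w @ eigenvoices
```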

34 Furthermore, $Q_b$ is related to the new speaker supervector s by $Q_b(w) = -\frac{1}{2}\sum_{r=1}^{R}\sum_{t=1}^{T} \gamma_t(r)\big[\, d_1 \log(2\pi) + \log|C_r| + \|o_t - s_r(w)\|^2_{C_r} \big]$ (3), where $\|o_t - s_r(w)\|^2_{C_r} = (o_t - s_r(w))^\top C_r^{-1} (o_t - s_r(w))$ and $C_r$ is the covariance matrix of the Gaussian at state r. [sent-66, score-1.096]

35 3 Kernel PCA In this paper, the computation of eigenvoices is generalized by performing kernel PCA instead of linear PCA. [sent-67, score-0.9]
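Eqn (3) is a Gaussian log-likelihood weighted by the state posteriors γ_t(r). Below is a hedged NumPy sketch of evaluating it, assuming the posteriors, the per-state inverse covariances, and the adapted means s_r(w) have already been computed elsewhere; all argument names are illustrative.

```python
import numpy as np

def q_b(gamma, obs, s, Cr_invs, logdet_Cs, d1):
    """Sketch of Eqn (3): Q_b = -1/2 * sum_{r,t} gamma[t, r] * (d1*log(2*pi)
    + log|C_r| + (o_t - s_r)' C_r^{-1} (o_t - s_r)).
    obs: T x d1 adaptation frames; s: R x d1 adapted state means s_r(w);
    gamma: T x R state posteriors."""
    total = 0.0
    for r, (C_inv, logdet) in enumerate(zip(Cr_invs, logdet_Cs)):
        diff = obs - s[r]                                    # T x d1 residuals
        mahal = np.einsum('td,de,te->t', diff, C_inv, diff)  # Mahalanobis terms
        total += np.sum(gamma[:, r] * (d1 * np.log(2 * np.pi) + logdet + mahal))
    return -0.5 * total
```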

36 In the following, let k(·, ·) be the kernel with associated mapping ϕ which maps a pattern x in the speaker supervector space X to ϕ(x) in the speaker feature space F. [sent-68, score-1.014]

37 The mth orthonormal eigenvector of the covariance matrix in the feature space is then given by [2] as $v_m = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}}\, \tilde{\varphi}(x_i)$. [sent-87, score-0.13]

38 4 Kernel Eigenvoice As seen from Eqn (3), the estimation of eigenvoice weights requires the evaluation of the distance between adaptation data ot and the Gaussian means of the new speaker in the observation space O. [sent-88, score-1.393]
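Sentences 36–37 describe standard kernel PCA [2]: eigendecompose the centered kernel matrix and read off the coefficients α_mi and eigenvalues λ̃_m that define the feature-space eigenvectors. A minimal sketch under that reading (names are illustrative):

```python
import numpy as np

def kernel_pca(K, M):
    """Kernel PCA on an N x N kernel matrix K, K[i, j] = k(x_i, x_j).
    Returns alphas (M x N) and eigenvalues lam (length M) of the centered
    kernel matrix, so that v_m = sum_i (alphas[m, i] / sqrt(lam[m])) * phi~(x_i)
    is a unit-norm feature-space eigenvector."""
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # double centering
    lam, A = np.linalg.eigh(K_c)                          # ascending eigenvalues
    order = np.argsort(lam)[::-1][:M]                     # keep M largest (assumed > 0)
    return A[:, order].T, lam[order]
```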

39 In the standard eigenvoice method, this is done by first breaking down the adapted speaker supervector s into its R constituent Gaussians $s_1, \ldots, s_R$. [sent-89, score-1.112]

40 However, the use of kernel PCA does not allow us to access each constituent Gaussian directly. [sent-93, score-0.137]

41 To get around the problem, we investigate the use of composite kernels. [sent-94, score-0.068]

42 4.1 Definition of the Composite Kernel For the ith speaker supervector $x_i$, we map each constituent $x_{ir}$ separately via a kernel $k_r(\cdot,\cdot)$ to $\varphi_r(x_{ir})$, and then construct $\varphi(x_i)$ as $\varphi(x_i) = [\varphi_1(x_{i1})^\top, \ldots, \varphi_R(x_{iR})^\top]^\top$. [sent-96, score-1.051]

43 Analogous to Eqn (1), the similarity between two speaker supervectors $x_i$ and $x_j$ in the composite feature space is measured by $k(x_i, x_j) = \sum_{r=1}^{R} k_r(x_{ir}, x_{jr})$. [sent-100, score-0.889]

44 Note that if the $k_r$'s are valid Mercer kernels, so is k [1]. [sent-101, score-0.157]
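Under the composite-kernel definition in sentences 42–44, the kernel between two supervectors is just the sum of per-Gaussian kernels; sentence 55 later chooses these to be Gaussian kernels with a Mahalanobis distance. A small sketch combining the two (the β_r values, the inverse covariances, and the data layout are assumptions):

```python
import numpy as np

def constituent_kernel(x_ir, x_jr, beta_r, C_inv_r):
    """k_r(x_ir, x_jr) = exp(-beta_r * (x_ir - x_jr)' C_r^{-1} (x_ir - x_jr))."""
    d = x_ir - x_jr
    return np.exp(-beta_r * d @ C_inv_r @ d)

def composite_kernel(x_i, x_j, betas, C_invs):
    """k(x_i, x_j) = sum_r k_r(x_ir, x_jr), where x_i and x_j are lists of the
    R constituent mean vectors (one per HMM Gaussian)."""
    return sum(constituent_kernel(a, b, beta, C_inv)
               for a, b, beta, C_inv in zip(x_i, x_j, betas, C_invs))
```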

45 Using this composite kernel, we can then proceed with the usual kernel PCA on the set of N training speaker supervectors and obtain αm ’s, λm ’s, and the orthonormal eigenvectors vm ’s (m = 1, . [sent-102, score-0.668]

46 , M ) of the covariance matrix in the feature space F. [sent-105, score-0.059]

47 4.2 New Speaker in the Feature Space In the following, we denote the supervector of a new speaker by s. [sent-107, score-0.521]

48 Similar to the standard eigenvoice approach, its ϕ-mapped speaker feature vector² $\tilde{\varphi}^{(kev)}(s)$ is assumed to be a linear combination of the first M eigenvectors. [sent-108, score-1.257]

49 (Footnote 2: The notation for a new speaker in the feature space requires some explanation; since the pre-image of a speaker in the feature space may not exist, the notation $\tilde{\varphi}^{(kev)}(s)$ is not exactly correct.) [sent-110, score-0.373]

50 That is, $\tilde{\varphi}^{(kev)}(s) = \sum_{m=1}^{M} w_m v_m = \sum_{m=1}^{M}\sum_{i=1}^{N} \frac{w_m\,\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}}\, \tilde{\varphi}(x_i)$. (4) [sent-112, score-0.129]

51 Its rth constituent is then given by $\tilde{\varphi}_r^{(kev)}(s_r) = \sum_{m=1}^{M}\sum_{i=1}^{N} \frac{w_m\,\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}}\, \varphi_r(x_{ir})$ (5); hence the similarity $k_r^{(kev)}(s_r, o_t) = \tilde{\varphi}_r^{(kev)}(s_r)^\top \varphi_r(o_t)$ between $s_r$ and an observation $o_t$ can be expressed through the kernel values $k_r(x_{ir}, o_t)$. [sent-113, score-0.559]

52 4.3 Maximum Likelihood Adaptation Using an Isotropic Kernel On adaptation, we have to express $\|o_t - s_r\|^2_{C_r}$ of Eqn (3) as a function of w. [sent-116, score-0.399]

53 Consider using isotropic kernels for $k_r$ so that $k_r(x_{ir}, x_{jr}) = \kappa(\|x_{ir} - x_{jr}\|_{C_r})$. [sent-117, score-0.712]

54 Then $k_r^{(kev)}(s_r, o_t) = \kappa(\|o_t - s_r\|^2_{C_r})$, and if $\kappa$ is invertible, $\|o_t - s_r\|^2_{C_r}$ will be a function of $k_r^{(kev)}(s_r, o_t)$, which in turn is a function of w by Eqn (5). [sent-118, score-1.606]

55 In the sequel, we will use the Gaussian kernel $k_r(x_{ir}, x_{jr}) = \exp(-\beta_r \|x_{ir} - x_{jr}\|^2_{C_r})$, and hence $\|o_t - s_r\|^2_{C_r} = -\frac{1}{\beta_r}\log k_r^{(kev)}(s_r, o_t) = -\frac{1}{\beta_r}\big[\log A(r,t) + \sum_{m=1}^{M}\frac{w_m}{\sqrt{\tilde{\lambda}_m}}\, B(m,r,t)\big]$. (6) [sent-119, score-1.018]

56 Substituting Eqn (6) into the $Q_b$ function of Eqn (3) and differentiating with respect to each eigenvoice weight $w_j$, $j = 1, \ldots, M$, gives the gradient required for the maximization. [sent-120, score-0.974]

57 The resulting gradient expression, Eqn (7), involves the quantities $\beta_r\, k_r^{(kev)}(s_r, o_t)$. [sent-124, score-0.404]
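Sentences 51–56 carry the key computational trick: the kernel value between the adapted constituent s_r and an observation o_t is linear in the weights w, and with the Gaussian kernel its logarithm recovers the Mahalanobis distance needed by Eqn (3). The sketch below ignores the feature-space centering bookkeeping (the A(r,t) and B(m,r,t) terms of Eqn (6)), so it is an uncentered approximation; all names are assumptions.

```python
import numpy as np

def kev_kernel_value(w, alphas, lam, k_r_obs):
    """Uncentered sketch of Eqn (5): k_r^(kev)(s_r, o_t) =
    sum_m sum_i (w_m * alpha_mi / sqrt(lam_m)) * k_r(x_ir, o_t).
    k_r_obs: length-N vector of k_r(x_ir, o_t) over the training speakers."""
    proj = alphas @ k_r_obs                 # length M: sum_i alpha_mi * k_r(x_ir, o_t)
    return float(np.sum(w * proj / np.sqrt(lam)))

def squared_distance(k_val, beta_r):
    """Invert the Gaussian kernel as in Eqn (6):
    ||o_t - s_r||^2_{C_r} = -(1/beta_r) * log k_r^(kev)(s_r, o_t)."""
    return -np.log(k_val) / beta_r
```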

58 4.4 Generalized EM Algorithm Because of the nonlinear nature of kernel PCA, Eqn (6) is nonlinear in w and there is no closed-form solution for the optimal w. [sent-128, score-0.165]
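Because Eqn (6) is nonlinear in w, sentence 58 resorts to a generalized EM procedure: the E-step fixes the state posteriors, and the M-step merely improves (rather than maximizes) Q_b. A schematic loop is sketched below; the fixed step size and iteration count are placeholders, and a true GEM implementation would verify that each step actually increases Q_b (e.g., by step halving).

```python
import numpy as np

def gem_adapt(w_init, grad_q_b, n_iters=20, step=0.1):
    """Schematic generalized-EM update of the eigenvoice weights.
    grad_q_b(w) is assumed to return the gradient of Q_b (Eqn (7)) with the
    state posteriors gamma_t(r) held fixed from the current E-step."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_iters):
        w = w + step * grad_q_b(w)   # any update that increases Q_b suffices for GEM
    return w
```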

59 One reasonable approach is to start with the eigenvoice weights of the supervector composed from the speaker-independent model x(si) . [sent-135, score-0.741]

60 That is, $w_m = \tilde{v}_m^\top \tilde{\varphi}(x^{(si)}) = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}}\, \tilde{\varphi}(x_i)^\top \tilde{\varphi}(x^{(si)}) = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}}\, [\varphi(x_i) - \bar{\varphi}]^\top [\varphi(x^{(si)}) - \bar{\varphi}] = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\tilde{\lambda}_m}} \Big[ k(x_i, x^{(si)}) - \frac{1}{N}\sum_{p=1}^{N} k(x_i, x_p) - \frac{1}{N}\sum_{p=1}^{N} k(x^{(si)}, x_p) + \frac{1}{N^2}\sum_{p,q=1}^{N} k(x_p, x_q) \Big]$. [sent-136, score-0.41]
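Sentence 60 initializes each weight as the projection of the ϕ-mapped, centered speaker-independent supervector onto the mth kernel eigenvoice, expressed purely through kernel values. A direct NumPy transcription of that centered-kernel formula (array names are illustrative):

```python
import numpy as np

def init_weights_from_si(K, k_si, alphas, lam):
    """w_m = sum_i alpha_mi / sqrt(lam_m) * [ k(x_i, x_si)
             - (1/N) sum_p k(x_i, x_p) - (1/N) sum_p k(x_si, x_p)
             + (1/N^2) sum_{p,q} k(x_p, x_q) ].
    K: N x N training kernel matrix; k_si: length-N vector k(x_i, x_si);
    alphas: M x N kernel PCA coefficients; lam: length-M eigenvalues."""
    centered = k_si - K.mean(axis=1) - k_si.mean() + K.mean()
    return (alphas @ centered) / np.sqrt(lam)
```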

61 However, since the amount of adaptation data is so little, the adaptation performance may vary widely. [sent-138, score-0.463]

62 By replacing $\tilde{\varphi}^{(kev)}(s)$ by $\tilde{\varphi}^{(rkev)}(s)$ for the computation of the kernel value of Eqn (5), and following the mathematical steps in Section 4, one may derive the required gradients for the joint maximum-likelihood estimation of $w_0$ and the other eigenvoice weights in the GEM algorithm. [sent-142, score-0.651]

63 Notice that $\tilde{\varphi}^{(rkev)}(s)$ also contains components in $\tilde{\varphi}(x^{(si)})$ from eigenvectors beyond the M selected kernel eigenvoices for adaptation. [sent-143, score-0.343]

64 Thus, robust KEV adaptation may have the additional benefit of preserving the speaker-independent projections on the remaining less important but robust eigenvoices in the final speaker-adapted model. [sent-144, score-0.535]

65 6 Experimental Evaluation The proposed kernel eigenvoice adaptation method was evaluated on the TIDIGITS speech corpus [5]. [sent-145, score-0.957]

66 Its performance was compared with that of the speaker-independent model and the standard eigenvoice adaptation method using only 3s, 5. [sent-146, score-0.767]

67 If we exclude the leading and ending silence, the average duration of adaptation speech is 2. [sent-148, score-0.284]

68 There are 163 speakers (of both genders) in each set, each pronouncing 77 utterances of one to seven digits (out of the eleven digits: "0", "1", …, "9", and "oh"). [sent-155, score-0.092]

69 The speaker characteristics are quite diverse, with speakers coming from 22 dialect regions of the USA and their ages ranging from 6 to 70 years old. [sent-160, score-0.403]

70 In all the following experiments, only the training set was used to train the speaker-independent (SI) HMMs and speaker-dependent (SD) HMMs from which the SI and SD speaker supervectors were derived. [sent-161, score-0.405]

71 6.2 Acoustic Models All training data were processed to extract 12 mel-frequency cepstral coefficients and the normalized frame energy from each speech frame of 25 ms at every 10 ms. [sent-163, score-0.058]

72 Each of the eleven digit models was a strictly left-to-right HMM comprising 16 states and one Gaussian with diagonal covariance per state. [sent-164, score-0.056]

73 In addition, there were a 3-state “sil” model to capture silence speech and a 1-state “sp” model to capture short pauses between digits. [sent-165, score-0.089]

74 Thus, the dimension of the observation space is $d_1 = 13$ and that of the speaker supervector space is $d_2 = 11 \times 16 \times 13 = 2288$. [sent-167, score-0.582]

75 Then an SD model was trained for each individual speaker by borrowing the variances and transition matrices from the corresponding SI models, and only the Gaussian means were estimated. [sent-169, score-0.339]

76 Furthermore, the sil and sp models were simply copied to the SD model. [sent-170, score-0.052]

77 6.3 Experiments The following five models/systems were compared: SI: the speaker-independent model; EV: the speaker-adapted model found by the standard eigenvoice adaptation method. [sent-172, score-0.768]

78 Robust-EV: speaker-adapted models found by our robust version of EV, which is the interpolation between the SI supervector and the supervector found by EV. [sent-173, score-0.472]

79 KEV: speaker-adapted model found by our new kernel eigenvoice adaptation method as described in Section 4. [sent-177, score-0.884]

80 Robust-KEV: speaker-adapted model found by our robust KEV as described in Section 5. [sent-178, score-0.059]

81 All adaptation results are the averages of 5-fold cross-validation taken over all 163 test speaker data. [sent-179, score-0.554]

82 The detailed results using different numbers of eigenvoices are shown in Figure 1, while the best result for each model is shown in Table 1. [sent-180, score-0.219]

83 Table 1: Word recognition accuracies of SI model and the best adapted models found by EV, robust EV, KEV, and robust KEV using 2. [sent-181, score-0.166]

84 From Table 1, we observe that the standard eigenvoice approach cannot obtain better performance than the SI model (footnote 3). [sent-201, score-0.528]

85 On the other hand, using our kernel eigenvoice (KEV) method, we obtain a word error rate (WER) reduction of 16. [sent-202, score-0.665]

86 When the SI model is interpolated with the KEV model in our robust KEV method, the WER reduction further improves to 27. [sent-209, score-0.059]

87 The results show that nonlinear PCA using composite kernels can be more effective in finding the eigenvoices. [sent-214, score-0.125]

88 Figure 1: Word recognition accuracies of adapted models found by KEV and robust KEV using different numbers of eigenvoices (x-axis: number of kernel eigenvoices, 0–10). [sent-223, score-0.121]

89 From Figure 1, the KEV method can outperform the SI model even with only two eigenvoices using only 2. [sent-224, score-0.244]

90 Its performance then improves slightly with more eigenvoices or more adaptation data. [sent-226, score-0.459]

91 If we allow interpolation with the SI model as in robust (footnote 3: the word accuracy of our SI model is not as good as the best reported result on TIDIGITS, which is about 99%.) [sent-227, score-0.092]

92 The use of this simple model allowed us to run experiments with 5-fold cross-validation using very short adaptation speech. [sent-230, score-0.226]

93 Right now our approach requires computation of many kernel function values and is very computationally expensive. [sent-231, score-0.103]

94 KEV, the saturation effect is even more pronounced: even with one eigenvoice, the adaptation performance is already better than that of SI model, and then the performance does not change much with more eigenvoices or adaptation data. [sent-234, score-0.671]

95 7 Conclusions In this paper, we improve the standard eigenvoice speaker adaptation method using kernel PCA with a composite kernel. [sent-236, score-1.266]

96 In the TIDIGITS task, it is found that while the standard eigenvoice approach does not help, our kernel eigenvoice method may outperform the speaker-independent model by about 28–33% (in terms of error rate improvement). [sent-237, score-1.198]

97 One possible solution is to apply sparse kernel PCA [6] so that computation of the first M principal components involves only M (instead of N, with M ≪ N) kernel functions. [sent-239, score-0.222]

98 Another direction is to use compactly supported kernels [7], in which the value of $\kappa(\|x_i - x_j\|)$ vanishes when $\|x_i - x_j\|$ is greater than a certain threshold. [sent-240, score-0.174]

99 Moreover, no more computation is required when $\|x_i - x_j\|$ is large. [sent-242, score-0.074]

100 Nonlinear component analysis as a kernel eigenvalue problem. [sent-255, score-0.103]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eigenvoice', 0.528), ('kev', 0.425), ('speaker', 0.328), ('ot', 0.247), ('adaptation', 0.226), ('eigenvoices', 0.219), ('supervector', 0.193), ('xir', 0.179), ('kr', 0.157), ('sr', 0.152), ('si', 0.126), ('kernel', 0.103), ('eqn', 0.102), ('pca', 0.097), ('xjr', 0.09), ('ev', 0.082), ('supervectors', 0.077), ('wm', 0.072), ('qb', 0.071), ('mi', 0.071), ('composite', 0.068), ('tidigits', 0.067), ('speech', 0.058), ('vm', 0.057), ('gem', 0.052), ('rkev', 0.052), ('cr', 0.051), ('robust', 0.045), ('xi', 0.039), ('hmm', 0.039), ('sd', 0.039), ('speakers', 0.036), ('xj', 0.035), ('word', 0.034), ('constituent', 0.034), ('rth', 0.034), ('xp', 0.034), ('nonlinear', 0.031), ('hmms', 0.031), ('qa', 0.031), ('corpus', 0.029), ('adapted', 0.029), ('em', 0.029), ('kong', 0.029), ('feature', 0.028), ('observation', 0.027), ('eleven', 0.026), ('sil', 0.026), ('wer', 0.026), ('hong', 0.026), ('kernels', 0.026), ('diverse', 0.024), ('state', 0.024), ('acoustic', 0.023), ('eigenvectors', 0.021), ('br', 0.02), ('silence', 0.02), ('weights', 0.02), ('map', 0.018), ('gaussians', 0.017), ('space', 0.017), ('utterances', 0.017), ('recognition', 0.017), ('gaussian', 0.016), ('wj', 0.016), ('heart', 0.016), ('likelihood', 0.016), ('principal', 0.016), ('accuracies', 0.016), ('digit', 0.016), ('sp', 0.015), ('years', 0.015), ('similarity', 0.015), ('found', 0.014), ('covariance', 0.014), ('likelihoods', 0.014), ('improves', 0.014), ('sch', 0.014), ('orthonormal', 0.014), ('centered', 0.014), ('log', 0.013), ('digits', 0.013), ('method', 0.013), ('isotropic', 0.013), ('interpolation', 0.013), ('outperform', 0.012), ('employed', 0.012), ('notice', 0.012), ('amount', 0.011), ('com', 0.011), ('sar', 0.011), ('pauses', 0.011), ('bay', 0.011), ('borrowing', 0.011), ('copied', 0.011), ('genders', 0.011), ('kwok', 0.011), ('nguyen', 0.011), ('readers', 0.011), ('rev', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

Author: James T. Kwok, Brian Mak, Simon Ho

Abstract: Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. 1

2 0.16919874 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek

Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.

3 0.1555437 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos

Abstract: Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM’s and the generative classifiers in speaker identification and image classification. 1

4 0.085423335 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

Author: Pedro F. Felzenszwalb, Daniel P. Huttenlocher, Jon M. Kleinberg

Abstract: In applying Hidden Markov Models to the analysis of massive data streams, it is often necessary to use an artificially reduced set of states; this is due in large part to the fact that the basic HMM estimation algorithms have a quadratic dependence on the size of the state set. We present algorithms that reduce this computational bottleneck to linear or near-linear time, when the states can be embedded in an underlying grid of parameters. This type of state representation arises in many domains; in particular, we show an application to traffic analysis at a high-volume Web site. 1

5 0.067956239 193 nips-2003-Variational Linear Response

Author: Manfred Opper, Ole Winther

Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.

6 0.062563084 112 nips-2003-Learning to Find Pre-Images

7 0.062032394 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

8 0.059320696 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

9 0.059272412 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

10 0.058935851 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

11 0.053138793 5 nips-2003-A Classification-based Cocktail-party Processor

12 0.05284645 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

13 0.05274494 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

14 0.051678494 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

15 0.047632616 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

16 0.044675522 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

17 0.044340398 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

18 0.04241493 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

19 0.035293769 130 nips-2003-Model Uncertainty in Classical Conditioning

20 0.035085086 100 nips-2003-Laplace Propagation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.117), (1, -0.033), (2, 0.022), (3, -0.011), (4, -0.017), (5, 0.083), (6, 0.092), (7, -0.113), (8, 0.026), (9, 0.031), (10, 0.118), (11, -0.152), (12, -0.153), (13, -0.134), (14, 0.144), (15, -0.069), (16, 0.173), (17, 0.008), (18, -0.036), (19, -0.046), (20, -0.053), (21, -0.045), (22, 0.057), (23, 0.021), (24, -0.009), (25, -0.058), (26, -0.058), (27, -0.06), (28, -0.143), (29, -0.01), (30, -0.022), (31, 0.079), (32, -0.03), (33, -0.006), (34, -0.043), (35, 0.213), (36, -0.184), (37, 0.0), (38, 0.089), (39, 0.051), (40, 0.095), (41, -0.006), (42, 0.039), (43, -0.043), (44, -0.031), (45, -0.051), (46, 0.054), (47, -0.002), (48, 0.009), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93800986 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

Author: James T. Kwok, Brian Mak, Simon Ho

Abstract: Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. 1

2 0.82278633 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek

Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.

3 0.63603634 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos

Abstract: Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM’s and the generative classifiers in speaker identification and image classification. 1

4 0.39301789 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Author: Kannan Achan, Sam T. Roweis, Brendan J. Frey

Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers. 1

5 0.32889378 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

Author: Pedro F. Felzenszwalb, Daniel P. Huttenlocher, Jon M. Kleinberg

Abstract: In applying Hidden Markov Models to the analysis of massive data streams, it is often necessary to use an artificially reduced set of states; this is due in large part to the fact that the basic HMM estimation algorithms have a quadratic dependence on the size of the state set. We present algorithms that reduce this computational bottleneck to linear or near-linear time, when the states can be embedded in an underlying grid of parameters. This type of state representation arises in many domains; in particular, we show an application to traffic analysis at a high-volume Web site. 1

6 0.29232013 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

7 0.29099274 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

8 0.29042363 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

9 0.28509426 112 nips-2003-Learning to Find Pre-Images

10 0.26523691 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

11 0.2576001 123 nips-2003-Markov Models for Automated ECG Interval Analysis

12 0.24627963 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

13 0.23629875 193 nips-2003-Variational Linear Response

14 0.23572916 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels

15 0.22944552 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

16 0.21721697 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

17 0.21497846 1 nips-2003-1-norm Support Vector Machines

18 0.21378043 108 nips-2003-Learning a Distance Metric from Relative Comparisons

19 0.21349153 100 nips-2003-Laplace Propagation

20 0.21298513 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (11, 0.026), (24, 0.393), (29, 0.022), (30, 0.013), (35, 0.051), (53, 0.085), (69, 0.011), (71, 0.115), (76, 0.04), (85, 0.053), (91, 0.041), (99, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77626634 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

Author: James T. Kwok, Brian Mak, Simon Ho

Abstract: Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. 1

2 0.72979873 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

3 0.59809595 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations.

4 0.39386317 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

5 0.38708302 117 nips-2003-Linear Response for Approximate Inference

Author: Max Welling, Yee W. Teh

Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

6 0.38352039 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

7 0.38274494 122 nips-2003-Margin Maximizing Loss Functions

8 0.37653279 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

9 0.37452513 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

10 0.37437072 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

11 0.37298223 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

12 0.37190831 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

13 0.3718701 145 nips-2003-Online Classification on a Budget

14 0.37166178 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

15 0.37164646 189 nips-2003-Tree-structured Approximations by Expectation Propagation

16 0.37094566 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

17 0.36964673 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

18 0.36837265 126 nips-2003-Measure Based Regularization

19 0.36665124 100 nips-2003-Laplace Propagation

20 0.365592 108 nips-2003-Learning a Distance Metric from Relative Comparisons