nips nips2001 nips2001-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. [sent-18, score-0.385]
2 Their so-called "Fisher kernel" has been combined with discriminative classifiers such as the SVM and applied successfully in e. [sent-19, score-0.13]
3 Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. [sent-22, score-0.665]
4 Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. [sent-23, score-0.392]
5 In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel. [sent-24, score-0.436]
6 1 Introduction In classification tasks, learning enables us to predict the output y ∈ {-1, +1} of some unknown system given the input x. [sent-25, score-0.136]
7 {x_i, y_i}_{i=1}^n. The purpose of a feature extractor f : X → R^D is to convert the representation of data without losing the information needed for classification [3]. [sent-27, score-0.429]
8 When X is a vector space like R^d, a variety of feature extractors have been proposed (e. [sent-28, score-0.281]
9 However, they are typically not applicable when X is a set of sequences of symbols and does not have the structure of a vector space, as in DNA or protein analysis [2]. [sent-31, score-0.304]
10 Recently, the Fisher kernel (FK) [6] was proposed to compute features from a probabilistic model p(x|θ). [sent-32, score-0.385]
11 Then, the tangent vector of the log marginal likelihood log p(x|θ) is used as a feature vector. [sent-35, score-0.253]
12 The Fisher kernel refers to the inner product in this feature space, but the method is effectively a feature extractor (since the features are computed explicitly). [sent-36, score-0.746]
13 The Fisher kernel was combined with a discriminative classifier such as the SVM and achieved excellent classification results in several fields, for example in DNA and protein analysis [6, 5]. [sent-37, score-0.898]
14 Empirically, it is reported that the FK-SVM system often outperforms the classification performance of the plug-in estimate (see footnote 1). [sent-38, score-0.136]
15 Note that the Fisher kernel is only one possible member in the family of feature extractors f_θ(x) : X → R^D that can be derived from probabilistic models. [sent-39, score-0.727]
16 Since model-dependent feature extractors have been newly developed, the performance measures for them are not yet established. [sent-42, score-0.325]
17 Then, we define a new kernel (or equivalently a feature extractor) derived from the Tangent vector Of Posterior log-odds, which we denote as the TOP kernel. [sent-44, score-0.453]
18 We will analyze the performance of the TOP kernel and the Fisher kernel in terms of our performance measures. [sent-45, score-0.636]
19 Then the TOP kernel is shown to compare favorably to the Fisher kernel in a protein classification experiment. [sent-46, score-1.051]
20 Let x ∈ X be the input 'point' and y ∈ {-1, +1} be the class label. [sent-48, score-0.169]
21 As the Fisher kernel is commonly used in combination with linear classifiers such as SVMs, one reasonable performance measure is the classification error of a linear classifier w^T f_θ(x) + b (w ∈ R^D and b ∈ R) in the feature space. [sent-62, score-0.761]
22 Usually w and b are learned, so the optimal feature extractor is different with regard to each learning algorithm. [sent-63, score-0.395]
23 When w and b are optimally chosen, the classification error is R(f_θ) = min_{w ∈ S, b ∈ R} E_{x,y}[-y(w^T f_θ(x) + b)]. [sent-65, score-0.231]
24 R(f_θ) is at least as large as the Bayes error L* [3], and R(f_θ) = L* only if the linear classifier implements the same decision rule as the Bayes optimal rule. [sent-67, score-0.133]
25 As a related measure, we consider the estimation error of the posterior probability by a logistic regressor F(w^T f_θ(x) + b), with e. [sent-68, score-0.089]
26 The relationship between D(f_θ) and R(f_θ) is illustrated as follows: let L be the classification error rate of a posterior probability estimator P(y = +1|x). [sent-72, score-0.295]
27 R(f_θ) - L* ≤ 2D(f_θ). Footnote 1: In classification by the plug-in estimate, x is classified by thresholding the posterior probability: ŷ = sign(P(y = +1|x, θ̂) - 0.5). [sent-77, score-0.326]
28 Since D(f_θ) is an upper bound on R(f_θ), it is useful to derive a new kernel to minimize D(f_θ), as will be done in Sec. [sent-79, score-0.318]
29 3 The Fisher kernel The Fisher kernel (FK) is defined (see footnote 2) as K(x, x') = s(x, θ̂)^T Z^{-1}(θ̂) s(x', θ̂), where s is the Fisher score s(x, θ̂) = (∂_{θ_1} log p(x|θ̂), ..., ∂_{θ_p} log p(x|θ̂))^T. [sent-81, score-0.636]
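As an illustration only (not from the paper's text), the following minimal Python sketch computes the Fisher score and the Fisher kernel for a one-dimensional Gaussian model p(x|θ) with θ = (μ, σ); estimating Z empirically as the second moment of the scores and adding a small ridge for invertibility are assumptions of this sketch.

import numpy as np

def fisher_score(x, mu, sigma):
    # Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma).
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.array([d_mu, d_sigma])

def fisher_kernel(x1, x2, mu, sigma, X_train):
    # Z is estimated empirically as the second moment of scores on training data.
    S = np.stack([fisher_score(x, mu, sigma) for x in X_train])
    Z = S.T @ S / len(X_train) + 1e-8 * np.eye(2)   # small ridge for invertibility
    Zinv = np.linalg.inv(Z)
    return fisher_score(x1, mu, sigma) @ Zinv @ fisher_score(x2, mu, sigma)

X_train = np.random.default_rng(0).normal(0.3, 1.2, size=200)
mu_hat, sigma_hat = X_train.mean(), X_train.std()
print(fisher_kernel(0.5, -0.2, mu_hat, sigma_hat, X_train))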
30 The theoretical foundation of FK is described in the following theorem [6]: "a kernel classifier employing the Fisher kernel derived from a model that contains the label as a latent variable is, asymptotically, at least as good a classifier as the MAP labeling based on the model". [sent-85, score-0.913]
31 The theorem says that the Fisher kernel can perform at least as well as the plug-in estimate, if the parameters of the linear classifier are properly determined (cf. [sent-86, score-0.539]
32 With our performance measure, this theorem can be represented more concisely: R(f_θ) is bounded by the classification error of the plug-in estimate, R(f_θ) ≤ E_{x,y}[-y(P(y = +1|x, θ̂) - 0.5)]. (3.1) [sent-88, score-0.21]
33 Note that the classification rule constructed by the plug-in estimate P(y = +1|x, θ̂) can also be realized by a linear classifier in feature space. [sent-91, score-0.388]
34 (3.1) is important since it guarantees that the Fisher kernel performs better when the optimal w and b are available. [sent-93, score-0.318]
35 However, the Fisher kernel is not the only one to satisfy this inequality. [sent-94, score-0.318]
36 In the next section, we present a new kernel which satisfies (3.1). [sent-95, score-0.355]
37 4 The TOP Kernel Definition Now we proceed to propose a new kernel: our aim is to obtain a feature extractor that achieves a small D(f_θ). [sent-97, score-0.293]
38 Let us define v(x, θ) = F^{-1}(P(y = +1|x, θ)) = log(P(y = +1|x, θ)) - log(P(y = -1|x, θ)), which is called the posterior log-odds of a probabilistic model [1]. [sent-105, score-0.123]
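A small illustrative sketch (assumptions: a two-class mixture with prior α and hand-chosen Gaussian class densities q_plus and q_minus) of the posterior log-odds v(x, θ), together with a check that it equals F^{-1} applied to the posterior, where F is the logistic function.

import numpy as np

def log_odds(x, alpha, q_plus, q_minus):
    # v(x) = log P(y=+1|x) - log P(y=-1|x) = log(alpha*q_plus(x)) - log((1-alpha)*q_minus(x))
    return np.log(alpha * q_plus(x)) - np.log((1.0 - alpha) * q_minus(x))

def posterior_plus(x, alpha, q_plus, q_minus):
    num = alpha * q_plus(x)
    return num / (num + (1.0 - alpha) * q_minus(x))

gauss = lambda m, s: (lambda x: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi)))
q_plus, q_minus, alpha, x = gauss(1.0, 1.0), gauss(-1.0, 1.5), 0.4, 0.3

v = log_odds(x, alpha, q_plus, q_minus)
p = posterior_plus(x, alpha, q_plus, q_minus)
assert np.isclose(v, np.log(p) - np.log(1 - p))   # v = F^{-1}(P(y=+1|x))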
39 By a Taylor expansion around the estimate θ̂ up to the first order (see footnote 4), we can approximate v(x, θ*) as v(x, θ*) ≈ v(x, θ̂) + Σ_{i=1}^{p} ∂_{θ_i} v(x, θ̂)(θ*_i - θ̂_i). (4.2) [sent-106, score-0.079]
40 Footnote 2: In practice, some variants of the Fisher kernel are used. [sent-108, score-0.318]
41 For example, if the derivative of each class distribution, not the marginal, is taken, the feature vector of FK is quite similar to that of our kernel. [sent-109, score-0.199]
42 However, these variants should be deliberately distinguished from the Fisher kernel in theoretical discussions. [sent-110, score-0.415]
43 Throughout this paper, including experiments, we adopt the original definition of the Fisher kernel from [6]. [sent-111, score-0.289]
44 Footnote 3: Notice that F^{-1}(t) = log t - log(1 - t). Footnote 4: One can easily derive TOP kernels from higher-order Taylor expansions. [sent-112, score-0.11]
45 However, we will only deal with the first-order expansion here, because higher-order expansions would induce extremely high-dimensional feature vectors in practical cases. [sent-113, score-0.245]
46 Since a Tangent vector Of the Posterior log-odds constitutes the main part of the feature vector, we call the inner product of the two feature vectors the "TOP kernel": K(x, x') = f_θ(x)^T f_θ(x'). (4.5) [sent-118, score-0.27]
47 It is easy to verify that the TOP kernel satisfies (3.1). [sent-119, score-0.403]
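A hedged sketch of the TOP construction: the feature vector below stacks v(x, θ̂) with a numerically computed tangent vector ∂_θ v(x, θ̂), and the kernel is their inner product. The concrete toy model and the use of numerical differentiation are assumptions of this sketch, not the paper's implementation.

import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def top_features(x, theta, log_odds):
    # f_theta(x) = (v(x, theta), d v / d theta_1, ..., d v / d theta_p)
    v = lambda th: log_odds(x, th)
    return np.concatenate(([v(theta)], numerical_grad(v, theta)))

def top_kernel(x1, x2, theta, log_odds):
    return top_features(x1, theta, log_odds) @ top_features(x2, theta, log_odds)

# Toy model: two 1-D Gaussians with means theta[0], theta[1], unit variance, equal priors.
def toy_log_odds(x, theta):
    logq = lambda m: -0.5 * (x - m)**2
    return logq(theta[0]) - logq(theta[1])

theta_hat = np.array([1.0, -1.0])
print(top_kernel(0.2, 0.7, theta_hat, toy_log_odds))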
48 A Theoretical Analysis In this section, we compare the TOP kernel with the plug-in estimate in terms of performance measures. [sent-126, score-0.362]
49 When the TOP kernel is used, D(f_θ) ≤ E_x |F((w*)^T f_θ(x)) - P(y = +1|x, θ*)|. [sent-131, score-0.318]
50 Thus, D(f_θ) ≤ M E_x |(w*)^T f_θ(x) - F^{-1}(P(y = +1|x, θ*))|. [sent-135, score-0.22]
51 This shows that if w and b are optimally chosen, the rate of convergence of the TOP kernel is much faster than that of the plug-in estimate. [sent-140, score-0.383]
52 This result is closely related to large-sample performance: assuming that θ̂ is an n^{1/2}-consistent estimator with asymptotic normality (e. [sent-141, score-0.133]
53 So we can directly derive the convergence order as D_v(θ̂) = O_p(n^{-1/2}) and D(f_θ) = O_p(n^{-1}) (see footnote 5). [sent-145, score-0.098]
54 Therefore, the TOP kernel has a much better convergence rate in R(f_θ), which is a strong motivation to use the TOP kernel instead of the plug-in estimate. [sent-148, score-0.861]
55 Footnote 5: For a detailed discussion about the convergence orders of classification error, see Chapter 6 of [1]. However, we must notice that this fast rate is possible only when the optimal linear classifier is combined with the TOP kernel. [sent-149, score-0.216]
56 Since non-optimal linear classifiers typically have the rate O_p(n^{-1/2}) [1], the overall rate is dominated by the slower rate and turns out to be O_p(n^{-1/2}). [sent-150, score-0.153]
57 But this theoretical analysis is still meaningful, because it shows the existence of a very efficient linear boundary in the TOP feature space. [sent-151, score-0.179]
58 Exponential Family: A Special Case When the distributions of the two classes belong to the exponential family, the TOP kernel can achieve an even better result than shown above. [sent-155, score-0.425]
59 Distributions of the exponential family can be written as q(x, η) = exp(η^T t(x) + ...). [sent-156, score-0.231]
60 Then, the probabilistic model is described as p(x|θ) = αq(x, η_{+1}) + (1 - α)q(x, η_{-1}), where θ = {α, η_{+1}, η_{-1}}. The posterior log-odds reads v(x, θ) = log α - log(1 - α) + log q(x, η_{+1}) - log q(x, η_{-1}), and the TOP feature vector f_θ(x) is obtained from its tangent vectors as before. [sent-160, score-0.258]
61 Thus F^{-1}(P(y = +1|x, θ*)) can be constructed as a linear function in the feature space. [sent-167, score-0.135]
62 Furthermore, since each feature is represented as a linear function of the sufficient statistics t_{+1}(x) and t_{-1}(x), one can construct an equivalent feature space as (t_{+1}(x)^T, t_{-1}(x)^T)^T without knowing θ̂. [sent-170, score-0.297]
63 5 Experiments on Protein Data In order to illustrate that the TOP kernel works well for real-world problems, we will show the results on protein classification. [sent-172, score-0.545]
64 The protein sequence data is obtained from the Superfamily website (see footnote 6). [sent-173, score-0.227]
65 This site provides sequence files with different degrees of redundancy filtering; we used the one with 10% redundancy filtering. [sent-174, score-0.074]
66 In our experiment, we determine the top category "classes" as the learning target. [sent-177, score-0.252]
67 The numbers of sequences in the classes are 791, 1277, 1015, 915, 84, 76, and 383. [sent-178, score-0.106]
68 We only use the first 4 classes, and 6 two-class problems are generated from all pairs among the 4 classes. [sent-179, score-0.071]
69 The 5th and 6th classes are not used because the number of examples is too small. [sent-180, score-0.071]
70 In each two-class problem, the examples are randomly divided into a 25% training set, a 25% validation set and a 50% test set. [sent-182, score-0.117]
71 Footnote 6: ...uk/SUPERFAMILY/. As a probabilistic model for protein sequences, we make use of hidden Markov models [2] with fully connected states. [sent-188, score-0.294]
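For concreteness, here is a minimal sketch (not the paper's implementation) of how the sequence log-likelihood log p(x|θ) under a fully connected HMM can be computed with the scaled forward algorithm; all matrices below are random placeholders.

import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    # obs: sequence of symbol indices; pi: initial state probabilities (S,);
    # A: transition matrix (S, S); B: emission matrix (S, num_symbols).
    alpha = pi * B[:, obs[0]]
    log_scale = 0.0
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = alpha.sum()
        alpha /= c                 # rescale to avoid numerical underflow
        log_scale += np.log(c)
    return log_scale + np.log(alpha.sum())

rng = np.random.default_rng(0)
S, V = 3, 20                              # states, alphabet size (20 would fit amino acids)
pi = np.full(S, 1.0 / S)
A = rng.dirichlet(np.ones(S), size=S)     # fully connected transitions
B = rng.dirichlet(np.ones(V), size=S)
obs = rng.integers(0, V, size=50)
print(hmm_log_likelihood(obs, pi, A, B))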
72 To construct the FK and TOP kernels, the derivatives with respect to all parameters of the HMMs from both classes are included. [sent-192, score-0.177]
73 Then, the marginal distribution is written as p(x|θ) = αq(x, θ_{+1}) + (1 - α)q(x, θ_{-1}), where α is a parameter corresponding to the class prior. [sent-194, score-0.095]
74 The feature vector of FK consists of the following: ∇_{θ_s} log p(x|θ) = P(y = s|x, θ) ∇_{θ_s} log q(x, θ_s) for s ∈ {-1, +1}, and ∂_α log p(x|θ) = (1/α) P(y = +1|x, θ) - (1/(1 - α)) P(y = -1|x, θ), [sent-195, score-0.291]
75 while the feature vector of TOP includes ∇_{θ_s} v(x, θ) = s ∇_{θ_s} log q(x, θ_s) for s ∈ {+1, -1}. [sent-197, score-0.447]
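To make the two feature maps above concrete, here is an illustrative sketch for a generic two-class mixture p(x|θ) = αq(x, θ_{+1}) + (1 - α)q(x, θ_{-1}); the toy Gaussian class models and the gradient helpers are assumptions of this sketch, and further components of the TOP vector (such as v(x, θ̂) itself) are omitted for brevity.

import numpy as np

def fk_and_top_features(x, alpha, q_plus, q_minus, grad_logq_plus, grad_logq_minus):
    qp, qm = q_plus(x), q_minus(x)
    post_p = alpha * qp / (alpha * qp + (1 - alpha) * qm)      # P(y=+1|x, theta)
    post_m = 1.0 - post_p
    # FK features: class posteriors times per-class scores, plus the derivative w.r.t. alpha.
    fk = np.concatenate([post_p * grad_logq_plus(x),
                         post_m * grad_logq_minus(x),
                         [post_p / alpha - post_m / (1 - alpha)]])
    # TOP features: signed per-class scores, s * grad log q(x, theta_s).
    top = np.concatenate([+1 * grad_logq_plus(x),
                          -1 * grad_logq_minus(x)])
    return fk, top

# Toy 1-D Gaussians with unit variance (the mean is the only parameter of each class).
gauss = lambda m: (lambda x: np.exp(-0.5 * (x - m)**2) / np.sqrt(2 * np.pi))
grad = lambda m: (lambda x: np.array([x - m]))                 # d/dm log q
fk, top = fk_and_top_features(0.3, 0.4, gauss(1.0), gauss(-1.0), grad(1.0), grad(-1.0))
print(fk, top)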
76 In (4.5), we did not include any normalization of feature vectors. [sent-201, score-0.135]
77 However, in practical situations, it is effective to normalize features to improve classification performance. [sent-202, score-0.271]
78 Here, each feature of the TOP kernel is normalized to have mean 0 and variance 1. [sent-203, score-0.453]
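A minimal sketch of the normalization step just described: each feature dimension is standardized to mean 0 and variance 1; estimating the statistics on the training set and reusing them for the test set is an assumption of this sketch.

import numpy as np

def standardize(F_train, F_test, eps=1e-12):
    # F_train, F_test: (n_samples, n_features) arrays of feature vectors.
    mean = F_train.mean(axis=0)
    std = F_train.std(axis=0) + eps        # guard against constant features
    return (F_train - mean) / std, (F_test - mean) / std

F_train = np.random.default_rng(1).normal(size=(100, 5))
F_test = np.random.default_rng(2).normal(size=(40, 5))
F_train_n, F_test_n = standardize(F_train, F_test)
print(F_train_n.mean(axis=0).round(6), F_train_n.std(axis=0).round(6))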
79 Both the TOP kernel and FK are combined with SVMs with bias terms. [sent-205, score-0.343]
80 To avoid this problem, the threshold is determined such that the false positive rate and the false negative rate are equal in the test set. [sent-210, score-0.128]
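A sketch of the equal-error thresholding described above; the exhaustive search over the observed scores is an assumption of this sketch.

import numpy as np

def equal_error_threshold(scores, labels):
    # labels in {-1, +1}; pick the threshold where FPR and FNR are closest.
    candidates = np.sort(scores)
    best_t, best_gap = candidates[0], np.inf
    for t in candidates:
        pred = np.where(scores > t, 1, -1)
        fpr = np.mean(pred[labels == -1] == 1)
        fnr = np.mean(pred[labels == +1] == -1)
        if abs(fpr - fnr) < best_gap:
            best_gap, best_t = abs(fpr - fnr), t
    return best_t

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(50), -np.ones(50)])
scores = rng.normal(labels * 0.8, 1.0)
print(equal_error_threshold(scores, labels))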
81 The hybrid HMM-TOP-SVM system has several model parameters: the number of HMM states, the pseudo count value [2] and the regularization parameter C of the SVM. [sent-212, score-0.138]
82 We determine these parameters as follows: first, the number of states and the pseudo count value are determined such that the error of the HMM on the validation set (i. [sent-213, score-0.207]
83 Based on the chosen HMM model, the parameter C is determined such that the validation error of TOP-SVM is minimized. [sent-216, score-0.273]
84 Here, the number of states and the pseudo count value are chosen from {3, 5, 7, 10, 15, 20, 30, 40, 60} and {10^{-10}, 10^{-7}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}}, respectively. [sent-217, score-0.107]
85 Note that the model selection is performed in the same manner for the Fisher kernel as well. [sent-219, score-0.318]
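A schematic of the two-stage model selection described above; hmm_validation_error, top_svm_validation_error and the C grid are placeholders invented for this sketch (the text does not list the C candidates).

import itertools, random

STATES = [3, 5, 7, 10, 15, 20, 30, 40, 60]
PSEUDO = [1e-10, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2]
C_GRID = [0.1, 1.0, 10.0]                      # illustrative values only

def hmm_validation_error(n_states, pseudo_count):
    # Placeholder standing in for training the HMMs and measuring plug-in validation error.
    random.seed(hash((n_states, pseudo_count)) % 2**32)
    return random.random()

def top_svm_validation_error(n_states, pseudo_count, C):
    # Placeholder standing in for training the TOP-SVM and measuring validation error.
    random.seed(hash((n_states, pseudo_count, C)) % 2**32)
    return random.random()

# Stage 1: choose HMM hyperparameters by plug-in validation error.
best_states, best_pc = min(itertools.product(STATES, PSEUDO),
                           key=lambda p: hmm_validation_error(*p))
# Stage 2: with the chosen HMM, choose C by TOP-SVM validation error.
best_C = min(C_GRID, key=lambda c: top_svm_validation_error(best_states, best_pc, c))
print(best_states, best_pc, best_C)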
86 Compared with the plug-in estimate, the Fisher kernel performed significantly better in several settings (i. [sent-222, score-0.394]
87 However, our TOP approach significantly outperforms the Fisher kernel: according to the Wilcoxon signed-ranks test, the TOP kernel was significantly better. Footnote 7: Several HMM models have been engineered for protein classification [2]. [sent-226, score-0.803]
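For reference, a paired comparison of this kind can be run with SciPy's Wilcoxon signed-rank test; the error values below are made-up placeholders, not the paper's results.

import numpy as np
from scipy.stats import wilcoxon

# Paired test-set error rates (one pair per experimental setting); placeholder values only.
fk_errors  = np.array([0.34, 0.31, 0.29, 0.33, 0.30, 0.32, 0.28, 0.35])
top_errors = np.array([0.30, 0.285, 0.275, 0.30, 0.265, 0.31, 0.275, 0.33])
stat, p_value = wilcoxon(fk_errors, top_errors)
print(f"Wilcoxon statistic={stat}, p-value={p_value:.3f}")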
88 Figure 1: The error rates of SVMs with two feature extractors in the protein classification experiments. [sent-258, score-0.712]
89 The labels 'P', 'FK', 'TOP' denote the plug-in estimate, the Fisher kernel and the TOP kernel, respectively. [sent-259, score-0.433]
90 The title on each subfigure shows the two protein classes used for the experiment. [sent-260, score-0.103]
91 Figure 2: Comparison of the error rates of the Fisher kernel and the TOP kernel in discrimination between classes 1 and 2. [sent-261, score-0.897]
92 Except in two cases, the TOP kernel achieves smaller error rates. [sent-263, score-0.418]
93 This indicates that the TOP kernel was able to capture discriminative information better than the Fisher kernel. [sent-274, score-0.45]
94 6 Conclusion In this study, we presented the new discriminative TOP kernel derived from probabilistic models. [sent-275, score-0.517]
95 Since the theoretical framework for such kernels has so far not been established, we proposed two performance measures to analyze them and gave bounds and rates to gain a better insight into model-dependent feature extractors from probabilistic models. [sent-276, score-0.459]
96 Experimentally, we showed that the TOP kernel compares favorably to FK in a realistic protein classification experiment. [sent-277, score-0.733]
97 Note that Smith and Gales [8] have shown that a similar approach works excellently in speech recognition tasks as well. [sent-278, score-0.111]
98 Future research will focus on constructing small-sample bounds for the TOP kernel to extend the validity of this work. [sent-279, score-0.636]
99 Since other nonlinear transformations F are possible for obtaining different and possibly better features, we will furthermore consider learning the nonlinear transformation F from training samples. [sent-280, score-0.195]
100 An interesting point is that, so far, TOP kernels perform local linear approximations; it would be interesting to move in the direction of local or even ... Table 1: P-values of statistical tests in the protein classification experiments. [sent-281, score-0.389]
wordName wordTfidf (topN-words)
[('fk', 0.332), ('kernel', 0.318), ('fisher', 0.292), ('op', 0.253), ('top', 0.252), ('protein', 0.227), ('fo', 0.22), ('fij', 0.207), ('extractor', 0.158), ('oc', 0.156), ('extractors', 0.146), ('classification', 0.136), ('feature', 0.135), ('re', 0.134), ('wx', 0.117), ('classifier', 0.1), ('discrim', 0.079), ('isher', 0.079), ('classes', 0.071), ('rd', 0.07), ('pseudo', 0.069), ('vve', 0.069), ('probabilistic', 0.067), ('validation', 0.067), ('ij', 0.067), ('discriminative', 0.066), ('tangent', 0.064), ('ure', 0.063), ('family', 0.061), ('posterior', 0.056), ('convergen', 0.053), ('disc', 0.053), ('elij', 0.053), ('erforman', 0.053), ('ermined', 0.053), ('feat', 0.053), ('inative', 0.053), ('kern', 0.053), ('llre', 0.053), ('logq', 0.053), ('paramet', 0.053), ('sdam', 0.053), ('transformat', 0.053), ('wilcoxson', 0.053), ('favorably', 0.052), ('dna', 0.05), ('est', 0.05), ('hat', 0.048), ('logp', 0.046), ('etter', 0.046), ('pot', 0.046), ('hmm', 0.046), ('ce', 0.045), ('taylor', 0.045), ('measures', 0.044), ('theoretical', 0.044), ('ion', 0.043), ('tion', 0.043), ('expansion', 0.043), ('hmms', 0.042), ('det', 0.042), ('ave', 0.042), ('classifiers', 0.039), ('jaakkola', 0.039), ('rate', 0.038), ('count', 0.038), ('satisfies', 0.037), ('ranks', 0.037), ('redundancy', 0.037), ('wit', 0.037), ('exponential', 0.036), ('estimate', 0.036), ('class', 0.035), ('sequences', 0.035), ('sm', 0.035), ('tsuda', 0.035), ('wand', 0.035), ('rates', 0.035), ('svms', 0.034), ('el', 0.033), ('signed', 0.033), ('error', 0.033), ('estimator', 0.032), ('wo', 0.032), ('parameter', 0.031), ('ex', 0.03), ('regard', 0.03), ('significant', 0.029), ('marginal', 0.029), ('log', 0.027), ('optimally', 0.027), ('construct', 0.027), ('false', 0.026), ('ni', 0.026), ('significantly', 0.026), ('kernels', 0.026), ('properly', 0.026), ('excellent', 0.026), ('er', 0.025), ('combined', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called
2 0.21451892 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
3 0.18286876 172 nips-2001-Speech Recognition using SVMs
Author: N. Smith, Mark Gales
Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1
4 0.17589962 60 nips-2001-Discriminative Direction for Kernel Classifiers
Author: Polina Golland
Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.
5 0.17031106 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.
6 0.1483357 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
7 0.1446726 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.14234598 134 nips-2001-On Kernel-Target Alignment
9 0.13564692 74 nips-2001-Face Recognition Using Kernel Methods
10 0.13282965 170 nips-2001-Spectral Kernel Methods for Clustering
11 0.11360861 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
12 0.098380245 139 nips-2001-Online Learning with Kernels
13 0.097481452 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
14 0.094300672 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
15 0.090704806 21 nips-2001-A Variational Approach to Learning Curves
16 0.089520156 56 nips-2001-Convolution Kernels for Natural Language
17 0.086423956 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
18 0.083945289 105 nips-2001-Kernel Machines and Boolean Functions
19 0.08294867 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
20 0.081669211 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
topicId topicWeight
[(0, -0.267), (1, 0.153), (2, -0.076), (3, -0.077), (4, 0.032), (5, 0.177), (6, 0.115), (7, -0.024), (8, -0.053), (9, 0.131), (10, -0.099), (11, -0.006), (12, -0.011), (13, -0.079), (14, 0.033), (15, 0.001), (16, 0.007), (17, 0.067), (18, -0.02), (19, -0.015), (20, 0.085), (21, 0.129), (22, -0.093), (23, -0.062), (24, -0.036), (25, 0.068), (26, -0.055), (27, 0.144), (28, -0.108), (29, 0.074), (30, -0.103), (31, -0.005), (32, -0.005), (33, -0.115), (34, 0.074), (35, 0.044), (36, 0.068), (37, -0.058), (38, -0.022), (39, -0.029), (40, -0.077), (41, 0.026), (42, -0.027), (43, 0.059), (44, 0.003), (45, 0.012), (46, -0.037), (47, 0.032), (48, 0.046), (49, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.96296692 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called
2 0.73809731 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
3 0.69186348 172 nips-2001-Speech Recognition using SVMs
Author: N. Smith, Mark Gales
Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood. 1
4 0.61829948 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.
5 0.60625219 74 nips-2001-Face Recognition Using Kernel Methods
Author: Ming-Hsuan Yang
Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction
6 0.59847033 60 nips-2001-Discriminative Direction for Kernel Classifiers
7 0.59521568 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.59429014 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
10 0.56681657 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
11 0.55882591 134 nips-2001-On Kernel-Target Alignment
12 0.53856432 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
13 0.51654625 155 nips-2001-Quantizing Density Estimators
14 0.51290935 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists
15 0.48161542 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
16 0.47778046 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
17 0.474567 170 nips-2001-Spectral Kernel Methods for Clustering
18 0.46307901 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
19 0.45504364 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
20 0.44953543 56 nips-2001-Convolution Kernels for Natural Language
topicId topicWeight
[(14, 0.071), (17, 0.031), (19, 0.027), (27, 0.13), (30, 0.048), (36, 0.014), (38, 0.018), (44, 0.322), (59, 0.022), (72, 0.086), (79, 0.051), (83, 0.012), (91, 0.095)]
simIndex simValue paperId paperTitle
same-paper 1 0.79064417 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called
2 0.68968427 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
3 0.53934127 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
4 0.53486288 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
5 0.53242493 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schćžšlkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1
6 0.53132063 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
7 0.53006542 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.52964824 137 nips-2001-On the Convergence of Leveraging
9 0.52953094 58 nips-2001-Covariance Kernels from Bayesian Generative Models
10 0.52774107 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
11 0.52761608 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
12 0.52609825 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
13 0.52561533 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
14 0.52503628 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
15 0.52424842 134 nips-2001-On Kernel-Target Alignment
17 0.52057886 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
18 0.52005839 155 nips-2001-Quantizing Density Estimators
19 0.51984596 114 nips-2001-Learning from Infinite Data in Finite Time
20 0.51850033 188 nips-2001-The Unified Propagation and Scaling Algorithm