nips nips2003 nips2003-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. [sent-8, score-1.206]
2 We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. [sent-9, score-0.163]
3 1 Introduction
The bootstrap method [1, 2] is a widely applicable approach to assess the expected qualities of statistical estimators and predictors. [sent-10, score-0.656]
4 Say, for example, in a supervised learning problem, we are interested in measuring the expected error of our favorite prediction method on test points¹ which are not contained in the training set D0. [sent-11, score-0.138]
5 If we have no hold out data, we can use the bootstrap approach to create artificial bootstrap data sets D by resampling with replacement training data from the original set D0 . [sent-12, score-1.205]
6 Since the resampling is done with replacement, some of the examples will appear several times in the bootstrap sample and others not at all. [sent-15, score-0.599]
7 A proxy for the true average test error can be obtained by retraining the model on each bootstrap training set D, calculating the test error only on those points which are not contained in D and finally averaging over all possible sets D. [sent-16, score-0.871]
8 While in general bootstrap averages can be approximated to any desired accuracy by the Monte-Carlo method, by generating a large enough number of random samples, it is also useful to have analytical approximations which avoid the time-consuming retraining of the model for each new sample. [sent-17, score-0.873]
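The Monte-Carlo baseline that such analytical approximations are meant to replace can be sketched as follows. This code is not part of the paper; it is a minimal illustration which assumes NumPy and scikit-learn, and uses SVC with a large penalty C as a stand-in for the hard margin SVM of Section 2.

```python
import numpy as np
from sklearn.svm import SVC

def mc_bootstrap_error(X, y, n_boot=200, gamma=0.5, seed=0):
    """Monte-Carlo bootstrap: retrain on each resampled set D and average
    the test error over the points of D0 that are not contained in D."""
    rng = np.random.default_rng(seed)
    N = len(y)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)          # resample with replacement
        oob = np.setdiff1d(np.arange(N), idx)     # out-of-bootstrap test points
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue                              # skip degenerate samples
        clf = SVC(C=1e6, kernel="rbf", gamma=gamma)  # large C approximates hard margin
        clf.fit(X[idx], y[idx])
        errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    return float(np.mean(errors))
```

The analytical approach developed below replaces this explicit retraining loop by the solution of a set of nonlinear (TAP) equations.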
9 Existing analytical approximations (based on asymptotic techniques), such as the delta method and the saddle point method, usually require explicit analytical formulas for the estimators of the parameters of a trained model. [sent-18, score-0.466]
10 These may not be easily obtained for more complex models in machine learning such as support vector machines (SVMs). [sent-21, score-0.04]
11 Recently, we introduced a novel approach for the approximate calculation of bootstrap averages [4] which avoids explicit formulas for parameter estimates. [sent-22, score-0.844]
12 Instead, we define statistical estimators and predictors implicitly through a Gibbs distribution, as described below.¹ (¹The average is over the unknown distribution of training data sets.) [sent-23, score-0.246]
13 Within this formulation, it becomes possible to perform averages over bootstrap samples analytically using the so-called “replica trick” of statistical physics [5]. [sent-25, score-0.819]
14 The latter involves a specific analytic continuation of the original statistical model. [sent-26, score-0.094]
15 As a final step, we use techniques for approximate inference to treat the probabilistic model. [sent-28, score-0.074]
16 This combination of techniques allows us to obtain approximate bootstrap averages by solving a set of nonlinear equations rather than by explicit sampling. [sent-29, score-0.803]
17 Our method has passed a first test successfully on the simple case of Gaussian process (GP) regression, where explicit predictions are still cheaply computed. [sent-30, score-0.099]
18 Also, since the original model is a smooth probabilistic one, the success of approximate inference techniques may not be too surprising. [sent-31, score-0.074]
19 In this paper, we will address a more challenging problem, that of the support vector machine. [sent-32, score-0.04]
20 Treating the SVM within our framework requires a singular limit of the underlying probabilistic model; hence it is not clear a priori if our framework would survive these delicate limiting manipulations and still be able to give good approximate answers. [sent-35, score-0.076]
21 2 Hard Margin Support Vector Classifiers
The hard margin SVM is a classifier which predicts binary class labels y = sign[f̂_D0(x)] ∈ {−1, 1} for inputs x ∈ R^d based on a set of training points D0 = (z1, z2, …, zN), where zi = (xi, yi) (for details see [6]). [sent-36, score-0.149]
23 The usually nonlinear activation function f̂_D0(x) (which we will call the “internal field”) is expressed as f̂_D0(x) = Σ_{i=1}^N yi αi K(x, xi), where K(x, x′) is a positive definite kernel and the set of αi's is computed from D0 by solving a certain convex optimization problem. [sent-40, score-0.238]
24 For bootstrap problems, we fix the pool of training data D0 and consider the statistics of the vectors f̂_D = (f̂_D(x1), …, f̂_D(xN)) at all inputs xi ∈ D0, when the predictor f̂ is computed on randomly chosen subsets D of D0. [sent-41, score-0.606]
26 Unfortunately, we do not have an explicit analytical expression for f̂_D; it is obtained implicitly as the vector f = (f1, …, fN) which solves the constrained optimization problem
Minimize f^T K^{−1} f subject to fi yi ≥ 1 for all i such that (xi, yi) ∈ D,   (1)
where K is the kernel matrix with elements K(xi, xj). [sent-48, score-0.399]
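As an aside, Eq. (1) can be solved directly with a generic convex solver. The sketch below is not from the paper; it assumes the cvxpy package and returns the internal fields at the training inputs of a bootstrap set D.

```python
import numpy as np
import cvxpy as cp

def hard_margin_internal_fields(K, y):
    """Solve Eq. (1): minimize f^T K^{-1} f subject to y_i f_i >= 1,
    where K is the kernel matrix of the (bootstrap) training set and y its labels."""
    N = len(y)
    K_inv = np.linalg.inv(K + 1e-10 * np.eye(N))   # small jitter for numerical stability
    K_inv = 0.5 * (K_inv + K_inv.T)                # enforce exact symmetry
    f = cp.Variable(N)
    problem = cp.Problem(cp.Minimize(cp.quad_form(f, K_inv)),
                         [cp.multiply(y, f) >= 1])
    problem.solve()
    return f.value
```

In practice one would of course use a dedicated SVM solver; the point here is only that f̂_D is defined by an optimization problem rather than by a closed-form formula.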
28 Our general notation suggests that this principle will apply to a variety of estimators and predictors of the MAP type. [sent-50, score-0.121]
29 The pseudo-likelihood (which does not allow a full probabilistic interpretation [8]) is defined by
P(D|f) = ∏_{j: zj ∈ D} P(zj|fj) = ∏_{j: zj ∈ D} Θ(yj fj − 1),   (5)
where Θ(u) = 1 for u > 0 and 0 otherwise. [sent-55, score-0.334]
30 In the limit β → ∞, the measure P[f|D] ∝ µ[f] P(D|f) obviously concentrates at the vector f̂ which solves Eq. (1). [sent-56, score-0.045]
31 4 Analytical Bootstrap Averages Using the Replica Trick
With the bootstrap method, we would like to compute average properties of the estimator f̂_D, Eq. (1). [sent-58, score-0.636]
32 An important class of such averages are of the type of a generalization error ε, which are expectations of loss functions g(f̂_D(xi); xi, yi) over test points i, i.e., those examples which are in D0 but not contained in the bootstrap training set D. [sent-60, score-0.434]
34 Explicitly,
ε = (1/N) Σ_{i=1}^N E_D[δ_{si,0} g(f̂_D(xi); xi, yi)] / E_D[δ_{si,0}],   (6)
where E_D[···] denotes the expectation over random bootstrap samples D which are created from the original training set D0. [sent-64, score-0.844]
35 A bootstrap sample is characterized by the vector of occupation numbers s = (s1, …, sN), where si is the number of times example zi appears in the set D and Σ_{i=1}^N si = S. [sent-68, score-0.26]
36 The Kronecker symbol, defined by δ_{si,0} = 1 for si = 0 and 0 else, guarantees that only those realizations of bootstrap training sets D which do not contain example zi contribute to Eq. (6). [sent-69, score-0.766]
37 For fixed bootstrap sample size S, the distribution of the si's is multinomial. [sent-71, score-0.631]
38 Then we get the simpler, factorizing joint distribution
P(s) = ∏_{i=1}^N (S/N)^{si} e^{−S/N} / si!,   (7)
i.e., the si become independent Poisson variables with mean S/N. [sent-73, score-0.162]
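The following small check is not from the paper; it simply illustrates Eq. (7): drawing the occupation numbers as independent Poisson variables with mean S/N, the probability that a given example is absent from the bootstrap set is E_D[δ_{si,0}] = e^{−S/N}.

```python
import numpy as np

def sample_occupations(N, S, n_samples=10000, seed=0):
    """Draw occupation vectors s from the factorizing distribution Eq. (7)."""
    rng = np.random.default_rng(seed)
    s = rng.poisson(lam=S / N, size=(n_samples, N))
    print("empirical P(s_i = 0):", (s == 0).mean(), " exact exp(-S/N):", np.exp(-S / N))
    return s
```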
39 Eq. (6) can then be rewritten as
ε(S) = (1/N) Σ_{i=1}^N E_D[ δ_{si,0} Z^{−r} ∏_{a=1}^r ∫ df^a µ[f^a] f_i^a ∏_{j=1}^N (P(zj|f_j^a))^{sj} ] / E_D[δ_{si,0}],   (8)
where the loss g is understood to be expanded in a power series of the internal field (the series over r is resummed below) and Z denotes the partition function of the model retrained on the bootstrap sample. [sent-80, score-0.288]
40 If the partition functions Z in Eq. (8) were raised to positive powers rather than negative ones, one could perform the bootstrap average over the distribution Eq. (7) directly. [sent-88, score-0.079]
41 (The superscripts a should not be confused with powers of the variables.) [sent-89, score-0.045]
43 To enable such an analytical average over the vector s (which is the “quenched disorder” in the language of statistical physics) one introduces the following “trick” extensively used in statistical physics of amorphous systems [5]. [sent-92, score-0.281]
44 We introduce the auxiliary quantity
ε_n(S) = (e^{S/N}/N) Σ_{i=1}^N E_D[ δ_{si,0} Z^{n−r} ∏_{a=1}^r ∫ df^a µ[f^a] f_i^a ∏_{j=1}^N (P(zj|f_j^a))^{sj} ]   (9)
for arbitrary real n, which allows us to write ε(S) = lim_{n→0} ε_n(S).   (10) [sent-93, score-0.352]
45 The advantage of this definition is that for integers n ≥ r, ε_n(S) can be represented in terms of n replicas f^1, f^2, …, f^n of the original variable f, for which an explicit average over the si's is possible. [sent-94, score-0.045]
47 At the end of all calculations an analytical continuation to arbitrary real n and the limit n → 0 must be performed. [sent-98, score-0.246]
48 We insert the definition Eq. (3), exchange the expectation over data sets with the expectation over the f's, and use the explicit form of the distribution Eq. (7). [sent-100, score-0.185]
49 The resulting expressions can be rewritten as
ε_n(S) = (1/N) Σ_{i=1}^N ⟨ Ξ_n^\i ∏_{a=1}^r f_i^a ⟩_\i,   (11)
where ⟨···⟩_\i denotes an average with respect to the so-called cavity distribution P_\i for the replicated variables f_i = (f_i^1, …, f_i^n), defined by
P_\i(f_i) ∝ (1/L_i(f_i)) ∫ ∏_{j≠i} df_j P(f_1, …, f_N)   (12)
with the replicated likelihood L_i(f_i) = ∏_{a=1}^n P(zi|f_i^a). [sent-102, score-0.496]
51 The joint distribution of replica variables P(f_1, …, f_N) is obtained by carrying out the average over the occupation numbers s explicitly. [sent-109, score-0.195]
52 The ADATAP approach replaces the analytically intractable cavity distribution P_\i(f_i), Eq. (12), by a Gaussian distribution of the replicated variables f_i^1, …, f_i^n. [sent-117, score-0.206]
53 We assume replica symmetry and further introduce an explicit scaling of all parameters with β. [sent-126, score-0.214]
55 This scaling was found to make all final expressions finite in the limit β → ∞. [sent-129, score-0.045]
56 The ansatz Eq. (15) keeps the number of adjustable parameters independent of n and allows us to perform the “replica limit” n → 0 and the “SVM limit” β → ∞ in all equations analytically, before we start the final numerical parameter optimization. [sent-137, score-0.062]
57 Inserting Eqs. (14) and (15) and resumming the power series over r yields the final theoretical expression for Eq. (6):
ε(S) = (1/N) Σ_{i=1}^N ∫ dG(u) g( (γ_c(i) + u √(−λ_c(i))) / ∆λ_c(i); xi, yi ),   (16)
where dG(u) = du (2π)^{−1/2} e^{−u²/2} is the Gaussian measure and g is an arbitrary loss function. [sent-141, score-0.194]
59 With g(f̂_D(xi); xi, yi) = Θ(−yi f̂_D(xi)) we obtain the bootstrapped classification error
ε(S) = (1/N) Σ_{i=1}^N Φ( −yi γ_c(i) / √(−λ_c(i)) ),   (17)
where Φ(x) = ∫_{−∞}^x dG(u). [sent-142, score-0.543]
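Once the cavity parameters γ_c(i) and λ_c(i) have been obtained from the TAP equations in the Appendix, evaluating Eq. (17) is straightforward. The sketch below is not from the paper; it assumes the parameters are available as NumPy arrays with λ_c(i) < 0.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_classification_error(y, gamma_c, lambda_c):
    """Eq. (17): eps(S) = (1/N) sum_i Phi(-y_i * gamma_c(i) / sqrt(-lambda_c(i)))."""
    return float(np.mean(norm.cdf(-y * gamma_c / np.sqrt(-lambda_c))))
```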
60 Besides the computation of generalization errors, we can use our method to quantify the uncertainty of the SVM prediction at test points. [sent-143, score-0.081]
61 This can be obtained by computing the bootstrap distribution of the “internal field” f̂_D(xi) at a test input xi. [sent-144, score-0.714]
62 This is obtained from Eq. (16) using the Dirac δ-function, g(f̂_D(xi); xi, yi) = δ(f̂_D(xi) − h). [sent-145, score-0.194]
63 Here, m_i^c = γ_c(i)/∆λ_c(i) and V_ii^c = −λ_c(i)/(∆λ_c(i))² are the predicted mean and variance of the internal field. [sent-148, score-0.159]
64 (The predicted posterior variance of the internal field is (β∆λ_c(i))^{−1} and goes to zero as β → ∞, indicating the transition to a deterministic model.) [sent-149, score-0.057]
65 For a test input x ∉ D0, this replaces ∆λ_c(i), γ_c(i), λ_c(i) by
∆λ_c(x) = [ K(x, x) − Σ_{i=1}^N K(x, xi) ∆λ(i) T_i(x) ]^{−1},   (19)
γ_c(x) = ∆λ_c(x) Σ_{i=1}^N T_i(x) γ(i),
λ_c(x) = (∆λ_c(x))² Σ_{i=1}^N (T_i(x))² λ(i),
with T_i(x) = Σ_{j=1}^N K(x, xj) [(I + diag(∆λ) K)^{−1}]_{ji}. [sent-152, score-0.12]
66 Figure 1: Left: Average bootstrapped generalization error for hard margin support vector classification on different data sets (simulation: symbols, theory: lines). [sent-187, score-0.465]
67 Right: Bootstrapped distribution of the internal field for Sonar data at a test input x ∉ D0. [sent-188, score-0.123]
68 We show an atypical case (simulation: histogram, theory: line) which nevertheless predicts the relative weights for both class labels fairly well. [sent-191, score-0.071]
69 We solved Eqs. (21)–(23) for four benchmark data sets D0 [10] and different sample sizes S, using an RBF kernel K(x, x′) = exp(−(1/2) Σ_{k=1}^d v_k (x_k − x′_k)²) with individually customized hyperparameters v_k. [sent-194, score-0.201]
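A direct implementation of this kernel (not from the paper, using NumPy) is:

```python
import numpy as np

def rbf_kernel(X1, X2, v):
    """K(x, x') = exp(-0.5 * sum_k v_k (x_k - x'_k)^2) with per-dimension scales v_k."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2   # pairwise squared differences
    return np.exp(-0.5 * (diff2 * np.asarray(v)).sum(axis=-1))
```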
70 Fig. 1 (left) compares our theoretical results for the bootstrapped learning curves obtained by Eq. (17) with the corresponding Monte-Carlo simulation results. [sent-196, score-0.204]
71 The Gaussian approximation of the cavity distribution is based on the assumption that the model prediction at a training input is influenced by a sufficiently large number of neighboring inputs. [sent-198, score-0.172]
72 We expect it to work well for sufficiently broad kernel functions. [sent-199, score-0.044]
73 This was the case for the Crabs and Wisconsin data sets where our theory is very accurate. [sent-200, score-0.069]
74 In comparison, the Sonar and Pima data sets were learnt with narrow RBF kernels. [sent-203, score-0.06]
75 However, our results still provide a reasonable estimate for the bootstrapped generalization error at sample size S = N. [sent-205, score-0.311]
76 While for practical applications of estimating the “true” generalization error using Efron's 0.632 bootstrap estimator the case S = N is of main importance, it is also interesting to discuss the limit of extreme oversampling S → ∞. [sent-206, score-0.077]
78 Since the hard margin SVM gains no additional information from the multiple presentation of the same data point, in this limit all bootstrap sets D supply exactly the same information as the data set D0 and the data average E_D[···] becomes trivial. [sent-208, score-0.753]
79 Using Eqs. (21)–(23), we can write the average prediction m_i at input xi ∈ D0 as m_i = Σ_{j=1}^N yj αj K(xi, xj) with weights αj = [∆λ(j)∆λ_c(j)/(∆λ(j)+∆λ_c(j))] (yj m_j − yj m_j^c), and recover for S → ∞ the Kuhn-Tucker conditions αi ≥ 0 and αi Θ(yi m_i − 1) = 0. [sent-217, score-0.726]
80 In this limit, Eq. (17) is found to converge to the approximate leave-one-out error of Opper and Winther [8],
lim_{S→∞} ε(S) = (1/N) Σ_{i=1}^N Θ(−yi m_i^c) = (1/N) Σ_{i∈SV} Θ( αi / [K_SV^{−1}]_{ii} − 1 ),   (20)
where the weights αi are given by the SVM algorithm on D0 and K_SV is the kernel matrix on the set of SVs. [sent-219, score-0.279]
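The right-hand side of Eq. (20) needs only a single SVM fit on D0. A minimal sketch (not from the paper, assuming the reconstruction of Eq. (20) above and using scikit-learn with a large C as a hard margin surrogate):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def approx_loo_error(X, y, gamma=0.5):
    """Approximate leave-one-out error of Eq. (20): count support vectors
    with alpha_i / [K_SV^{-1}]_{ii} > 1 and divide by the total sample size N."""
    clf = SVC(C=1e6, kernel="rbf", gamma=gamma).fit(X, y)
    sv = clf.support_
    alpha = np.abs(clf.dual_coef_).ravel()               # alpha_i of the support vectors
    K_sv = rbf_kernel(X[sv], X[sv], gamma=gamma)
    K_sv_inv = np.linalg.inv(K_sv + 1e-10 * np.eye(len(sv)))
    return float(np.sum(alpha / np.diag(K_sv_inv) > 1.0) / len(y))
```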
81 The 0.632 bootstrap estimate [2] of the generalization error, which is built from ε(N), is thereby approximated within our theory by the differentiable expression Eq. (17), which may be used for kernel hyperparameter estimation. [sent-221, score-0.685]
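For reference, the 0.632 estimator combines the resubstitution (training) error with ε(N); the formula below is the standard one and is not specific to this paper.

```python
def efron_632_estimate(train_error, eps_boot_N):
    """Efron's 0.632 bootstrap estimate of the generalization error,
    with eps_boot_N = eps(S=N) taken e.g. from Eq. (17)."""
    return 0.368 * train_error + 0.632 * eps_boot_N
```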
83 Fig. 1 (right) shows results for the bootstrapped distribution of the internal field on test inputs x ∉ D0. [sent-225, score-0.327]
84 The data set D0 contained N = 188 Sonar data points and the bootstrap is at sample size S = N. [sent-226, score-0.636]
85 We find that the true distribution is often very Gaussian-like and well described by our theory. [sent-227, score-0.071]
86 Both SVM training and the computation of our approximate SVM bootstrap require running iterative algorithms. [sent-231, score-0.646]
87 We compared the time t_train for training a single SVM on each of the four benchmark data sets D0 with the time t_theo needed to solve our theory for SVM bootstrap estimates on these data for S = N. [sent-232, score-0.856]
88 For sufficiently broad kernels we find t_train ≥ t_theo and our theory is reliable. [sent-233, score-0.183]
89 For narrow kernels (see Fig. 1 (left)) we find t_theo > t_train; our theory is then still faster to compute but less reliable than a good Monte-Carlo estimate of the bootstrap. [sent-236, score-0.183]
90 7 Outlook
Our experiments on SVMs show that the approximate replica bootstrap approach is robust enough to apply to models which only fit into our framework after some delicate limiting process. [sent-237, score-0.808]
91 Experiments on benchmark data showed that our theory is appreciably faster to compute than a good Monte-Carlo estimate of the bootstrap and yields reliable results for kernels which are sufficiently broad. [sent-239, score-0.645]
92 It will be interesting to apply our approach to other kernel methods such as kernel PCA. [sent-240, score-0.088]
93 Since our method is based on a fairly general framework, we will also investigate if it can be applied to models where the bootstrapped parameters have a more complicated structure. [sent-241, score-0.236]
94 Acknowledgments DM gratefully acknowledges financial support by the Copenhagen Image and Signal Processing Graduate School and by the Postgraduate Programme ”Natural Disasters” at the University of Karlsruhe. [sent-244, score-0.04]
95 Appendix: TAP Equations
The ADATAP approach computes the set of parameters Λ_c(i), γ_c(i) by constructing an alternative set of tractable likelihoods L̂_j(f) = e^{−(1/2) f^T Λ(j) f + γ̂(j)^T f}, defining an auxiliary Gaussian joint distribution P_G(f_1, …, f_N) of the replicated variables. [sent-245, score-0.121]
96 We use replica symmetry and a specific scaling of the parameters with β: γ̂^a(j) = βγ(j), Λ_{aa}(j) = Λ_0(j) = β²λ_0(j) for all a, Λ_{ab}(j) = Λ(j) = β²λ(j) for a ≠ b, and ∆λ(j) = β^{−1}(Λ_0(j) − Λ(j)). [sent-249, score-0.195]
97 Further,
χ_ii = (G)_ii,   m_i = (G γ)_i,   V_ii = −(G diag(λ) G)_ii,   (22)
with the N × N matrix G = (K^{−1} + diag(∆λ))^{−1}, and
χ_ii = 1/(∆λ(i) + ∆λ_c(i)),   m_i = (γ(i) + γ_c(i))/(∆λ(i) + ∆λ_c(i)),   V_ii = −(λ(i) + λ_c(i))/(∆λ(i) + ∆λ_c(i))².   (23) [sent-255, score-0.142]
98 We solve Eqs. (22) and (23) to update the sets of parameters {γ_c(i), ∆λ_c(i), λ_c(i)} and {γ(i), ∆λ(i), λ(i)}, respectively. [sent-258, score-0.062]
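A sketch of the corresponding fixed-point step is given below. It is not from the paper: it relies on the reconstruction of Eqs. (22) and (23) above and omits the model-specific moment-matching step that updates the site parameters {γ(i), ∆λ(i), λ(i)} from the hard margin likelihood.

```python
import numpy as np

def cavity_update(K, gamma, dlam, lam):
    """Given site parameters {gamma(i), dlam(i), lam(i)}, compute m_i, chi_ii, V_ii
    via Eq. (22) and invert Eq. (23) for the cavity parameters {gamma_c, dlam_c, lam_c}."""
    G = np.linalg.inv(np.linalg.inv(K) + np.diag(dlam))   # Eq. (22)
    chi = np.diag(G).copy()
    m = G @ gamma
    V = -np.diag(G @ np.diag(lam) @ G)
    dlam_c = 1.0 / chi - dlam                             # invert Eq. (23)
    gamma_c = m / chi - gamma
    lam_c = -V / chi**2 - lam
    return m, chi, V, gamma_c, dlam_c, lam_c
```

Iterating this half-step together with the likelihood moment matching until the two parameter sets are consistent yields the γ_c(i), ∆λ_c(i), λ_c(i) that enter Eqs. (16) and (17).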
99 Reasonable start values are ∆λ(i) = ∆λ, λ(i) = −∆λ, γ(i) = yi ∆λ, where ∆λ is obtained as the root of 0 = 1 − (1/N) Σ_{i=1}^N ωi ∆λ/(1 + ωi ∆λ) − (1 − (1 − e^{−S/N}) Φ(∆^c)) with ∆^c = −0.… [sent-259, score-0.115]
100 D. Malzahn and M. Opper, A statistical mechanics approach to approximate analytical Bootstrap averages, NIPS 15. [sent-277, score-0.095]
wordName wordTfidf (topN-words)
[('bootstrap', 0.569), ('fd', 0.265), ('vii', 0.232), ('bootstrapped', 0.204), ('fia', 0.179), ('replica', 0.163), ('mi', 0.142), ('analytical', 0.134), ('si', 0.13), ('averages', 0.129), ('fi', 0.125), ('fj', 0.12), ('yi', 0.115), ('zj', 0.107), ('svm', 0.106), ('cavity', 0.103), ('mc', 0.102), ('opper', 0.095), ('tap', 0.089), ('lj', 0.082), ('sonar', 0.082), ('winther', 0.082), ('df', 0.081), ('ii', 0.08), ('xi', 0.079), ('ttheo', 0.077), ('continuation', 0.067), ('ttrain', 0.067), ('explicit', 0.065), ('gp', 0.065), ('physics', 0.064), ('lim', 0.063), ('dg', 0.061), ('efron', 0.061), ('pima', 0.061), ('predictors', 0.061), ('estimators', 0.06), ('internal', 0.057), ('pg', 0.057), ('adatap', 0.051), ('crabs', 0.051), ('fib', 0.051), ('fin', 0.051), ('karlsruhe', 0.051), ('occupation', 0.051), ('resumming', 0.051), ('fn', 0.051), ('wisconsin', 0.051), ('sv', 0.049), ('generalization', 0.047), ('ti', 0.047), ('limit', 0.045), ('powers', 0.045), ('replicas', 0.045), ('denmark', 0.045), ('yj', 0.045), ('expectation', 0.044), ('kernel', 0.044), ('gibbs', 0.041), ('replaces', 0.041), ('formulas', 0.041), ('retraining', 0.041), ('hard', 0.041), ('approximate', 0.04), ('support', 0.04), ('theory', 0.039), ('margin', 0.039), ('trick', 0.038), ('estimator', 0.038), ('contained', 0.037), ('benchmark', 0.037), ('training', 0.037), ('delicate', 0.036), ('aa', 0.036), ('diag', 0.036), ('test', 0.034), ('ab', 0.034), ('cc', 0.034), ('partition', 0.034), ('probabilistic', 0.034), ('li', 0.033), ('inset', 0.032), ('predictor', 0.032), ('parameters', 0.032), ('predicts', 0.032), ('distribution', 0.032), ('eld', 0.031), ('suf', 0.031), ('analytically', 0.03), ('narrow', 0.03), ('vk', 0.03), ('sample', 0.03), ('error', 0.03), ('sets', 0.03), ('auxiliary', 0.029), ('average', 0.029), ('likelihoods', 0.028), ('rewritten', 0.028), ('symbols', 0.028), ('statistical', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.44061336 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
3 0.12064029 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
4 0.11720975 1 nips-2003-1-norm Support Vector Machines
Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1
5 0.10909735 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
6 0.10632102 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
7 0.10621244 122 nips-2003-Margin Maximizing Loss Functions
8 0.10215144 100 nips-2003-Laplace Propagation
9 0.095215775 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
10 0.092578717 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
11 0.081852391 115 nips-2003-Linear Dependent Dimensionality Reduction
12 0.081754953 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
13 0.079117097 113 nips-2003-Learning with Local and Global Consistency
14 0.07526613 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
15 0.074972384 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
16 0.073854014 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
17 0.072746642 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
18 0.069817863 117 nips-2003-Linear Response for Approximate Inference
19 0.069356449 78 nips-2003-Gaussian Processes in Reinforcement Learning
20 0.066523328 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
topicId topicWeight
[(0, -0.228), (1, -0.095), (2, -0.045), (3, -0.01), (4, 0.168), (5, 0.041), (6, 0.073), (7, -0.204), (8, -0.004), (9, -0.001), (10, -0.08), (11, 0.044), (12, -0.203), (13, -0.145), (14, 0.098), (15, 0.195), (16, 0.041), (17, -0.399), (18, 0.138), (19, 0.188), (20, 0.163), (21, 0.01), (22, 0.026), (23, -0.097), (24, 0.07), (25, -0.031), (26, -0.075), (27, 0.062), (28, 0.151), (29, -0.007), (30, -0.006), (31, 0.088), (32, 0.129), (33, -0.095), (34, 0.058), (35, 0.144), (36, 0.067), (37, 0.054), (38, -0.125), (39, -0.139), (40, 0.022), (41, -0.056), (42, 0.03), (43, -0.062), (44, -0.018), (45, 0.012), (46, -0.06), (47, 0.053), (48, -0.076), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.934026 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.89420855 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
3 0.37202394 1 nips-2003-1-norm Support Vector Machines
Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1
4 0.32981881 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
Author: Yoshua Bengio, Yves Grandvalet
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1
5 0.31837851 100 nips-2003-Laplace Propagation
Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1
6 0.30499473 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
7 0.29366291 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
8 0.27576435 124 nips-2003-Max-Margin Markov Networks
9 0.27212769 111 nips-2003-Learning the k in k-means
10 0.2673347 193 nips-2003-Variational Linear Response
11 0.25712079 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
12 0.24739523 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
13 0.247235 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
14 0.24388197 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
15 0.24231517 122 nips-2003-Margin Maximizing Loss Functions
16 0.23563136 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
17 0.23110542 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
18 0.22965127 115 nips-2003-Linear Dependent Dimensionality Reduction
19 0.22776763 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
20 0.22742848 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
topicId topicWeight
[(0, 0.033), (11, 0.019), (29, 0.022), (35, 0.07), (49, 0.012), (53, 0.109), (71, 0.071), (73, 0.343), (76, 0.065), (85, 0.105), (91, 0.058), (99, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.79128164 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.77639747 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
Author: Charles Rosenberg, Alok Ladsariya, Tom Minka
Abstract: We present a Bayesian approach to color constancy which utilizes a nonGaussian probabilistic model of the image formation process. The parameters of this model are estimated directly from an uncalibrated image set and a small number of additional algorithmic parameters are chosen using cross validation. The algorithm is empirically shown to exhibit RMS error lower than other color constancy algorithms based on the Lambertian surface reflectance model when estimating the illuminants of a set of test images. This is demonstrated via a direct performance comparison utilizing a publicly available set of real world test images and code base.
3 0.52048784 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
4 0.49091619 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
5 0.49070352 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.48927703 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
7 0.48815051 189 nips-2003-Tree-structured Approximations by Expectation Propagation
8 0.48753703 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
9 0.48731989 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
10 0.48639214 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
11 0.4862971 124 nips-2003-Max-Margin Markov Networks
12 0.4847647 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
13 0.48206338 112 nips-2003-Learning to Find Pre-Images
14 0.48195228 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
15 0.48170877 126 nips-2003-Measure Based Regularization
16 0.48141649 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
17 0.4808093 145 nips-2003-Online Classification on a Budget
18 0.48000708 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
19 0.47911724 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
20 0.47896168 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers