nips nips2003 nips2003-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. [sent-8, score-1.206]
2 We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. [sent-9, score-0.163]
3 1 Introduction
The bootstrap method [1, 2] is a widely applicable approach to assess the expected qualities of statistical estimators and predictors. [sent-10, score-0.656]
4 Say, for example, in a supervised learning problem, we are interested in measuring the expected error of our favorite prediction method on test points¹ which are not contained in the training set D0. [sent-11, score-0.138]
5 If we have no hold out data, we can use the bootstrap approach to create artificial bootstrap data sets D by resampling with replacement training data from the original set D0 . [sent-12, score-1.205]
6 Since the resampling is done with replacement, some of the examples will appear several times in the bootstrap sample and others not at all. [sent-15, score-0.599]
7 A proxy for the true average test error can be obtained by retraining the model on each bootstrap training set D, calculating the test error only on those points which are not contained in D and finally averaging over all possible sets D. [sent-16, score-0.871]
8 While in general bootstrap averages can be approximated to any desired accuracy by the Monte-Carlo method, by generating a large enough number of random samples, it is also useful to have analytical approximations which avoid the time-consuming retraining of the model for each new sample. [sent-17, score-0.873]
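The Monte-Carlo baseline that such analytical approximations are meant to replace can be sketched as follows. This code is not part of the paper; it is a minimal illustration which assumes NumPy and scikit-learn, and uses SVC with a large penalty C as a stand-in for the hard margin SVM of Section 2.

```python
import numpy as np
from sklearn.svm import SVC

def mc_bootstrap_error(X, y, n_boot=200, gamma=0.5, seed=0):
    """Monte-Carlo bootstrap: retrain on each resampled set D and average
    the test error over the points of D0 that are not contained in D."""
    rng = np.random.default_rng(seed)
    N = len(y)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)          # resample with replacement
        oob = np.setdiff1d(np.arange(N), idx)     # out-of-bootstrap test points
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue                              # skip degenerate samples
        clf = SVC(C=1e6, kernel="rbf", gamma=gamma)  # large C approximates hard margin
        clf.fit(X[idx], y[idx])
        errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    return float(np.mean(errors))
```

The analytical approach developed below replaces this explicit retraining loop by the solution of a set of nonlinear (TAP) equations.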
9 Existing analytical approximations (based on asymptotic techniques), such as the delta method and the saddle point method, usually require explicit analytical formulas for the estimators of the parameters of a trained model. [sent-18, score-0.466]
10 These may not be easily obtained for more complex models in machine learning such as support vector machines (SVMs). [sent-21, score-0.04]
11 Recently, we introduced a novel approach for the approximate calculation of bootstrap averages [4] which avoids explicit formulas for parameter estimates. [sent-22, score-0.844]
12 Instead, we define statistical estimators and predictors implicitly through a Gibbs distribution, as described below.¹ (¹The average is over the unknown distribution of training data sets.) [sent-23, score-0.246]
13 Within this formulation, it becomes possible to perform averages over bootstrap samples analytically using the so-called “replica trick” of statistical physics [5]. [sent-25, score-0.819]
14 The latter involves a specific analytic continuation of the original statistical model. [sent-26, score-0.094]
15 As a final step, we use techniques for approximate inference to treat the probabilistic model. [sent-28, score-0.074]
16 This combination of techniques allows us to obtain approximate bootstrap averages by solving a set of nonlinear equations rather than by explicit sampling. [sent-29, score-0.803]
17 Our method has passed a first test successfully on the simple case of Gaussian process (GP) regression, where explicit predictions are still cheaply computed. [sent-30, score-0.099]
18 Also, since the original model is a smooth probabilistic one, the success of approximate inference techniques may not be too surprising. [sent-31, score-0.074]
19 In this paper, we will address a more challenging problem, that of the support vector machine. [sent-32, score-0.04]
20 Treating the SVM within our framework requires a singular limit of the underlying probabilistic model; hence it is not clear a priori if our framework would survive these delicate limiting manipulations and still be able to give good approximate answers. [sent-35, score-0.076]
21 2 Hard Margin Support Vector Classifiers
The hard margin SVM is a classifier which predicts binary class labels y = sign[f̂_D0(x)] ∈ {−1, 1} for inputs x ∈ R^d based on a set of training points D0 = (z1, z2, …, zN), where zi = (xi, yi) (for details see [6]). [sent-36, score-0.149]
23 The usually nonlinear activation function f̂_D0(x) (which we will call the “internal field”) is expressed as f̂_D0(x) = Σ_{i=1}^N yi αi K(x, xi), where K(x, x′) is a positive definite kernel and the set of αi's is computed from D0 by solving a certain convex optimization problem. [sent-40, score-0.238]
24 For bootstrap problems, we fix the pool of training data D0 and consider the statistics of the vectors f̂_D = (f̂_D(x1), …, f̂_D(xN)) at all inputs xi ∈ D0, when the predictor f̂ is computed on randomly chosen subsets D of D0. [sent-41, score-0.606]
26 Unfortunately, we do not have an explicit analytical expression for f̂_D; it is obtained implicitly as the vector f = (f1, …, fN) which solves the constrained optimization problem
Minimize f^T K^{−1} f subject to fi yi ≥ 1 for all i such that (xi, yi) ∈ D,   (1)
where K is the kernel matrix with elements K(xi, xj). [sent-48, score-0.399]
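As an aside, Eq. (1) can be solved directly with a generic convex solver. The sketch below is not from the paper; it assumes the cvxpy package and returns the internal fields at the training inputs of a bootstrap set D.

```python
import numpy as np
import cvxpy as cp

def hard_margin_internal_fields(K, y):
    """Solve Eq. (1): minimize f^T K^{-1} f subject to y_i f_i >= 1,
    where K is the kernel matrix of the (bootstrap) training set and y its labels."""
    N = len(y)
    K_inv = np.linalg.inv(K + 1e-10 * np.eye(N))   # small jitter for numerical stability
    K_inv = 0.5 * (K_inv + K_inv.T)                # enforce exact symmetry
    f = cp.Variable(N)
    problem = cp.Problem(cp.Minimize(cp.quad_form(f, K_inv)),
                         [cp.multiply(y, f) >= 1])
    problem.solve()
    return f.value
```

In practice one would of course use a dedicated SVM solver; the point here is only that f̂_D is defined by an optimization problem rather than by a closed-form formula.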
28 Our general notation suggests that this principle will apply to a variety of estimators and predictors of the MAP type. [sent-50, score-0.121]
29 The pseudo-likelihood (which does not allow a full probabilistic interpretation [8]) is defined by
P(D|f) = ∏_{j: zj ∈ D} P(zj|fj) = ∏_{j: zj ∈ D} Θ(yj fj − 1),   (5)
where Θ(u) = 1 for u > 0 and 0 otherwise. [sent-55, score-0.334]
30 In the limit β → ∞, the measure P[f|D] ∝ µ[f] P(D|f) obviously concentrates at the vector f̂ which solves Eq. (1). [sent-56, score-0.045]
31 4 Analytical Bootstrap Averages Using the Replica Trick
With the bootstrap method, we would like to compute average properties of the estimator f̂_D, Eq. (1). [sent-58, score-0.636]
32 An important class of such averages are of the type of a generalization error ε, which are expectations of loss functions g(f̂_D(xi); xi, yi) over test points i, i.e., those examples which are in D0 but not contained in the bootstrap training set D. [sent-60, score-0.434]
34 Explicitly,
ε = (1/N) Σ_{i=1}^N E_D[δ_{si,0} g(f̂_D(xi); xi, yi)] / E_D[δ_{si,0}],   (6)
where E_D[···] denotes the expectation over random bootstrap samples D which are created from the original training set D0. [sent-64, score-0.844]
35 A bootstrap sample is characterized by the vector of occupation numbers s = (s1, …, sN), where si is the number of times example zi appears in the set D and Σ_{i=1}^N si = S. [sent-68, score-0.26]
36 The Kronecker symbol, defined by δ_{si,0} = 1 for si = 0 and 0 else, guarantees that only those realizations of bootstrap training sets D which do not contain example zi contribute to Eq. (6). [sent-69, score-0.766]
37 For fixed bootstrap sample size S, the distribution of the si's is multinomial. [sent-71, score-0.631]
38 Then we get the simpler, factorizing joint distribution
P(s) = ∏_{i=1}^N (S/N)^{si} e^{−S/N} / si!,   (7)
i.e., the si become independent Poisson variables with mean S/N. [sent-73, score-0.162]
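The following small check is not from the paper; it simply illustrates Eq. (7): drawing the occupation numbers as independent Poisson variables with mean S/N, the probability that a given example is absent from the bootstrap set is E_D[δ_{si,0}] = e^{−S/N}.

```python
import numpy as np

def sample_occupations(N, S, n_samples=10000, seed=0):
    """Draw occupation vectors s from the factorizing distribution Eq. (7)."""
    rng = np.random.default_rng(seed)
    s = rng.poisson(lam=S / N, size=(n_samples, N))
    print("empirical P(s_i = 0):", (s == 0).mean(), " exact exp(-S/N):", np.exp(-S / N))
    return s
```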
39 Eq. (6) can then be rewritten as
ε(S) = (1/N) Σ_{i=1}^N E_D[ δ_{si,0} Z^{−r} ∏_{a=1}^r ∫ df^a µ[f^a] f_i^a ∏_{j=1}^N (P(zj|f_j^a))^{sj} ] / E_D[δ_{si,0}],   (8)
where the loss g is understood to be expanded in a power series of the internal field (the series over r is resummed below) and Z denotes the partition function of the model retrained on the bootstrap sample. [sent-80, score-0.288]
40 If the partition functions Z in Eq. (8) were raised to positive powers rather than negative ones, one could perform the bootstrap average over the distribution Eq. (7) directly. [sent-88, score-0.079]
41 (The superscripts a should not be confused with powers of the variables.) [sent-89, score-0.045]
43 To enable such an analytical average over the vector s (which is the “quenched disorder” in the language of statistical physics) one introduces the following “trick” extensively used in statistical physics of amorphous systems [5]. [sent-92, score-0.281]
44 We introduce the auxiliary quantity
ε_n(S) = (e^{S/N}/N) Σ_{i=1}^N E_D[ δ_{si,0} Z^{n−r} ∏_{a=1}^r ∫ df^a µ[f^a] f_i^a ∏_{j=1}^N (P(zj|f_j^a))^{sj} ]   (9)
for arbitrary real n, which allows us to write ε(S) = lim_{n→0} ε_n(S).   (10) [sent-93, score-0.352]
45 The advantage of this definition is that for integers n ≥ r, ε_n(S) can be represented in terms of n replicas f^1, f^2, …, f^n of the original variable f, for which an explicit average over the si's is possible. [sent-94, score-0.045]
47 At the end of all calculations an analytical continuation to arbitrary real n and the limit n → 0 must be performed. [sent-98, score-0.246]
48 We insert the definition Eq. (3), exchange the expectation over data sets with the expectation over the f's, and use the explicit form of the distribution Eq. (7). [sent-100, score-0.185]
49 The resulting expressions can be rewritten as
ε_n(S) = (1/N) Σ_{i=1}^N ⟨ Ξ_n^\i ∏_{a=1}^r f_i^a ⟩_\i,   (11)
where ⟨···⟩_\i denotes an average with respect to the so-called cavity distribution P_\i for the replicated variables f_i = (f_i^1, …, f_i^n), defined by
P_\i(f_i) ∝ (1/L_i(f_i)) ∫ ∏_{j≠i} df_j P(f_1, …, f_N)   (12)
with the replicated likelihood L_i(f_i) = ∏_{a=1}^n P(zi|f_i^a). [sent-102, score-0.496]
51 The joint distribution of replica variables P(f_1, …, f_N) is obtained by carrying out the average over the occupation numbers s explicitly. [sent-109, score-0.195]
52 The ADATAP approach replaces the analytically intractable cavity distribution P_\i(f_i), Eq. (12), by a Gaussian distribution of the replicated variables f_i^1, …, f_i^n. [sent-117, score-0.206]
53 We assume replica symmetry and further introduce an explicit scaling of all parameters with β. [sent-126, score-0.214]
55 This scaling was found to make all final expressions finite in the limit β → ∞. [sent-129, score-0.045]
56 The ansatz Eq. (15) keeps the number of adjustable parameters independent of n and allows us to perform the “replica limit” n → 0 and the “SVM limit” β → ∞ in all equations analytically, before we start the final numerical parameter optimization. [sent-137, score-0.062]
57 Inserting Eqs. (14) and (15) and resumming the power series over r yields the final theoretical expression for Eq. (6):
ε(S) = (1/N) Σ_{i=1}^N ∫ dG(u) g( (γ_c(i) + u √(−λ_c(i))) / ∆λ_c(i); xi, yi ),   (16)
where dG(u) = du (2π)^{−1/2} e^{−u²/2} is the Gaussian measure and g is an arbitrary loss function. [sent-141, score-0.194]
59 With g(f̂_D(xi); xi, yi) = Θ(−yi f̂_D(xi)) we obtain the bootstrapped classification error
ε(S) = (1/N) Σ_{i=1}^N Φ( −yi γ_c(i) / √(−λ_c(i)) ),   (17)
where Φ(x) = ∫_{−∞}^x dG(u). [sent-142, score-0.543]
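Once the cavity parameters γ_c(i) and λ_c(i) have been obtained from the TAP equations in the Appendix, evaluating Eq. (17) is straightforward. The sketch below is not from the paper; it assumes the parameters are available as NumPy arrays with λ_c(i) < 0.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_classification_error(y, gamma_c, lambda_c):
    """Eq. (17): eps(S) = (1/N) sum_i Phi(-y_i * gamma_c(i) / sqrt(-lambda_c(i)))."""
    return float(np.mean(norm.cdf(-y * gamma_c / np.sqrt(-lambda_c))))
```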
60 Besides the computation of generalization errors, we can use our method to quantify the uncertainty of the SVM prediction at test points. [sent-143, score-0.081]
61 This can be obtained by computing the bootstrap distribution of the “internal field” f̂_D(xi) at a test input xi. [sent-144, score-0.714]
62 This is obtained from Eq. (16) using the Dirac δ-function, g(f̂_D(xi); xi, yi) = δ(f̂_D(xi) − h). [sent-145, score-0.194]
63 Here, m_i^c = γ_c(i)/∆λ_c(i) and V_ii^c = −λ_c(i)/(∆λ_c(i))² are the predicted mean and variance of the internal field. [sent-148, score-0.159]
64 (The predicted posterior variance of the internal field is (β∆λ_c(i))^{−1} and goes to zero as β → ∞, indicating the transition to a deterministic model.) [sent-149, score-0.057]
65 For a test input x ∉ D0, this replaces ∆λ_c(i), γ_c(i), λ_c(i) by
∆λ_c(x) = [ K(x, x) − Σ_{i=1}^N K(x, xi) ∆λ(i) T_i(x) ]^{−1},   (19)
γ_c(x) = ∆λ_c(x) Σ_{i=1}^N T_i(x) γ(i),
λ_c(x) = (∆λ_c(x))² Σ_{i=1}^N (T_i(x))² λ(i),
with T_i(x) = Σ_{j=1}^N K(x, xj) [(I + diag(∆λ) K)^{−1}]_{ji}. [sent-152, score-0.12]
66 Figure 1: Left: Average bootstrapped generalization error for hard margin support vector classification on different data sets (simulation: symbols, theory: lines). [sent-187, score-0.465]
67 Right: Bootstrapped distribution of the internal field for Sonar data at a test input x ∉ D0. [sent-188, score-0.123]
68 We show an atypical case (simulation: histogram, theory: line) which nevertheless predicts the relative weights for both class labels fairly well. [sent-191, score-0.071]
69 We solved Eqs. (21)–(23) for four benchmark data sets D0 [10] and different sample sizes S, using an RBF kernel K(x, x′) = exp(−(1/2) Σ_{k=1}^d v_k (x_k − x′_k)²) with individually customized hyperparameters v_k. [sent-194, score-0.201]
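A direct implementation of this kernel (not from the paper, using NumPy) is:

```python
import numpy as np

def rbf_kernel(X1, X2, v):
    """K(x, x') = exp(-0.5 * sum_k v_k (x_k - x'_k)^2) with per-dimension scales v_k."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2   # pairwise squared differences
    return np.exp(-0.5 * (diff2 * np.asarray(v)).sum(axis=-1))
```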
70 Fig. 1 (left) compares our theoretical results for the bootstrapped learning curves obtained by Eq. (17) with the corresponding Monte-Carlo simulation results. [sent-196, score-0.204]
71 The Gaussian approximation of the cavity distribution is based on the assumption that the model prediction at a training input is influenced by a sufficiently large number of neighboring inputs. [sent-198, score-0.172]
72 We expect it to work well for sufficiently broad kernel functions. [sent-199, score-0.044]
73 This was the case for the Crabs and Wisconsin data sets where our theory is very accurate. [sent-200, score-0.069]
74 In comparison, the Sonar and Pima data sets were learnt with narrow RBF kernels. [sent-203, score-0.06]
75 However, our results still provide a reasonable estimate for the bootstrapped generalization error at sample size S = N. [sent-205, score-0.311]
76 While for practical applications of estimating the “true” generalization error using Efron's 0.632 bootstrap estimator the case S = N is of main importance, it is also interesting to discuss the limit of extreme oversampling S → ∞. [sent-206, score-0.077]
78 Since the hard margin SVM gains no additional information from the multiple presentation of the same data point, in this limit all bootstrap sets D supply exactly the same information as the data set D0 and the data average E_D[···] becomes trivial. [sent-208, score-0.753]
79 Using Eqs. (21)–(23), we can write the average prediction m_i at input xi ∈ D0 as m_i = Σ_{j=1}^N yj αj K(xi, xj) with weights αj = [∆λ(j)∆λ_c(j)/(∆λ(j)+∆λ_c(j))] (yj m_j − yj m_j^c), and recover for S → ∞ the Kuhn-Tucker conditions αi ≥ 0 and αi Θ(yi m_i − 1) = 0. [sent-217, score-0.726]
80 In this limit, Eq. (17) is found to converge to the approximate leave-one-out error of Opper and Winther [8],
lim_{S→∞} ε(S) = (1/N) Σ_{i=1}^N Θ(−yi m_i^c) = (1/N) Σ_{i∈SV} Θ( αi / [K_SV^{−1}]_{ii} − 1 ),   (20)
where the weights αi are given by the SVM algorithm on D0 and K_SV is the kernel matrix on the set of SVs. [sent-219, score-0.279]
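The right-hand side of Eq. (20) needs only a single SVM fit on D0. A minimal sketch (not from the paper, assuming the reconstruction of Eq. (20) above and using scikit-learn with a large C as a hard margin surrogate):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def approx_loo_error(X, y, gamma=0.5):
    """Approximate leave-one-out error of Eq. (20): count support vectors
    with alpha_i / [K_SV^{-1}]_{ii} > 1 and divide by the total sample size N."""
    clf = SVC(C=1e6, kernel="rbf", gamma=gamma).fit(X, y)
    sv = clf.support_
    alpha = np.abs(clf.dual_coef_).ravel()               # alpha_i of the support vectors
    K_sv = rbf_kernel(X[sv], X[sv], gamma=gamma)
    K_sv_inv = np.linalg.inv(K_sv + 1e-10 * np.eye(len(sv)))
    return float(np.sum(alpha / np.diag(K_sv_inv) > 1.0) / len(y))
```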
81 The 0.632 bootstrap estimate [2] of the generalization error, which is built from ε(N), is thereby approximated within our theory by the differentiable expression Eq. (17), which may be used for kernel hyperparameter estimation. [sent-221, score-0.685]
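For reference, the 0.632 estimator combines the resubstitution (training) error with ε(N); the formula below is the standard one and is not specific to this paper.

```python
def efron_632_estimate(train_error, eps_boot_N):
    """Efron's 0.632 bootstrap estimate of the generalization error,
    with eps_boot_N = eps(S=N) taken e.g. from Eq. (17)."""
    return 0.368 * train_error + 0.632 * eps_boot_N
```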
83 Fig. 1 (right) shows results for the bootstrapped distribution of the internal field on test inputs x ∉ D0. [sent-225, score-0.327]
84 The data set D0 contained N = 188 Sonar data points and the bootstrap is at sample size S = N. [sent-226, score-0.636]
85 We find that the true distribution is often very Gaussian-like and well described by our theory. [sent-227, score-0.071]
86 Both SVM training and the computation of our approximate SVM bootstrap require running iterative algorithms. [sent-231, score-0.646]
87 We compared the time t_train for training a single SVM on each of the four benchmark data sets D0 with the time t_theo needed to solve our theory for SVM bootstrap estimates on these data for S = N. [sent-232, score-0.856]
88 For sufficiently broad kernels we find t_train ≥ t_theo and our theory is reliable. [sent-233, score-0.183]
89 For narrow kernels (see Fig. 1 (left)) we find t_theo > t_train; our theory is then still faster to compute but less reliable than a good Monte-Carlo estimate of the bootstrap. [sent-236, score-0.183]
90 7 Outlook
Our experiments on SVMs show that the approximate replica bootstrap approach is robust enough to apply to models which only fit into our framework after some delicate limiting process. [sent-237, score-0.808]
91 Experiments on benchmark data showed that our theory is appreciably faster to compute than a good Monte-Carlo estimate of the bootstrap and yields reliable results for kernels which are sufficiently broad. [sent-239, score-0.645]
92 It will be interesting to apply our approach to other kernel methods such as kernel PCA. [sent-240, score-0.088]
93 Since our method is based on a fairly general framework, we will also investigate if it can be applied to models where the bootstrapped parameters have a more complicated structure. [sent-241, score-0.236]
94 Acknowledgments DM gratefully acknowledges financial support by the Copenhagen Image and Signal Processing Graduate School and by the Postgraduate Programme ”Natural Disasters” at the University of Karlsruhe. [sent-244, score-0.04]
95 Appendix: TAP Equations
The ADATAP approach computes the set of parameters Λ_c(i), γ_c(i) by constructing an alternative set of tractable likelihoods L̂_j(f) = e^{−(1/2) f^T Λ(j) f + γ̂(j)^T f}, defining an auxiliary Gaussian joint distribution P_G(f_1, …, f_N) of the replicated variables. [sent-245, score-0.121]
96 We use replica symmetry and a specific scaling of the parameters with β: γ̂^a(j) = βγ(j), Λ_{aa}(j) = Λ_0(j) = β²λ_0(j) for all a, Λ_{ab}(j) = Λ(j) = β²λ(j) for a ≠ b, and ∆λ(j) = β^{−1}(Λ_0(j) − Λ(j)). [sent-249, score-0.195]
97 Further,
χ_ii = (G)_ii,   m_i = (G γ)_i,   V_ii = −(G diag(λ) G)_ii,   (22)
with the N × N matrix G = (K^{−1} + diag(∆λ))^{−1}, and
χ_ii = 1/(∆λ(i) + ∆λ_c(i)),   m_i = (γ(i) + γ_c(i))/(∆λ(i) + ∆λ_c(i)),   V_ii = −(λ(i) + λ_c(i))/(∆λ(i) + ∆λ_c(i))².   (23) [sent-255, score-0.142]
98 We solve Eqs. (22) and (23) to update the sets of parameters {γ_c(i), ∆λ_c(i), λ_c(i)} and {γ(i), ∆λ(i), λ(i)}, respectively. [sent-258, score-0.062]
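A sketch of the corresponding fixed-point step is given below. It is not from the paper: it relies on the reconstruction of Eqs. (22) and (23) above and omits the model-specific moment-matching step that updates the site parameters {γ(i), ∆λ(i), λ(i)} from the hard margin likelihood.

```python
import numpy as np

def cavity_update(K, gamma, dlam, lam):
    """Given site parameters {gamma(i), dlam(i), lam(i)}, compute m_i, chi_ii, V_ii
    via Eq. (22) and invert Eq. (23) for the cavity parameters {gamma_c, dlam_c, lam_c}."""
    G = np.linalg.inv(np.linalg.inv(K) + np.diag(dlam))   # Eq. (22)
    chi = np.diag(G).copy()
    m = G @ gamma
    V = -np.diag(G @ np.diag(lam) @ G)
    dlam_c = 1.0 / chi - dlam                             # invert Eq. (23)
    gamma_c = m / chi - gamma
    lam_c = -V / chi**2 - lam
    return m, chi, V, gamma_c, dlam_c, lam_c
```

Iterating this half-step together with the likelihood moment matching until the two parameter sets are consistent yields the γ_c(i), ∆λ_c(i), λ_c(i) that enter Eqs. (16) and (17).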
99 Reasonable start values are ∆λ(i) = ∆λ, λ(i) = −∆λ, γ(i) = yi ∆λ, where ∆λ is obtained as the root of 0 = 1 − (1/N) Σ_{i=1}^N ωi ∆λ/(1 + ωi ∆λ) − (1 − (1 − e^{−S/N}) Φ(∆^c)) with ∆^c = −0.… [sent-259, score-0.115]
100 D. Malzahn and M. Opper, A statistical mechanics approach to approximate analytical Bootstrap averages, NIPS 15. [sent-277, score-0.095]
wordName wordTfidf (topN-words)
[('bootstrap', 0.569), ('fd', 0.265), ('vii', 0.232), ('bootstrapped', 0.204), ('fia', 0.179), ('replica', 0.163), ('mi', 0.142), ('analytical', 0.134), ('si', 0.13), ('averages', 0.129), ('fi', 0.125), ('fj', 0.12), ('yi', 0.115), ('zj', 0.107), ('svm', 0.106), ('cavity', 0.103), ('mc', 0.102), ('opper', 0.095), ('tap', 0.089), ('lj', 0.082), ('sonar', 0.082), ('winther', 0.082), ('df', 0.081), ('ii', 0.08), ('xi', 0.079), ('ttheo', 0.077), ('continuation', 0.067), ('ttrain', 0.067), ('explicit', 0.065), ('gp', 0.065), ('physics', 0.064), ('lim', 0.063), ('dg', 0.061), ('efron', 0.061), ('pima', 0.061), ('predictors', 0.061), ('estimators', 0.06), ('internal', 0.057), ('pg', 0.057), ('adatap', 0.051), ('crabs', 0.051), ('fib', 0.051), ('fin', 0.051), ('karlsruhe', 0.051), ('occupation', 0.051), ('resumming', 0.051), ('fn', 0.051), ('wisconsin', 0.051), ('sv', 0.049), ('generalization', 0.047), ('ti', 0.047), ('limit', 0.045), ('powers', 0.045), ('replicas', 0.045), ('denmark', 0.045), ('yj', 0.045), ('expectation', 0.044), ('kernel', 0.044), ('gibbs', 0.041), ('replaces', 0.041), ('formulas', 0.041), ('retraining', 0.041), ('hard', 0.041), ('approximate', 0.04), ('support', 0.04), ('theory', 0.039), ('margin', 0.039), ('trick', 0.038), ('estimator', 0.038), ('contained', 0.037), ('benchmark', 0.037), ('training', 0.037), ('delicate', 0.036), ('aa', 0.036), ('diag', 0.036), ('test', 0.034), ('ab', 0.034), ('cc', 0.034), ('partition', 0.034), ('probabilistic', 0.034), ('li', 0.033), ('inset', 0.032), ('predictor', 0.032), ('parameters', 0.032), ('predicts', 0.032), ('distribution', 0.032), ('eld', 0.031), ('suf', 0.031), ('analytically', 0.03), ('narrow', 0.03), ('vk', 0.03), ('sample', 0.03), ('error', 0.03), ('sets', 0.03), ('auxiliary', 0.029), ('average', 0.029), ('likelihoods', 0.028), ('rewritten', 0.028), ('symbols', 0.028), ('statistical', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.44061336 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
3 0.12064029 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
4 0.11720975 1 nips-2003-1-norm Support Vector Machines
Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1
5 0.10909735 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
6 0.10632102 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
7 0.10621244 122 nips-2003-Margin Maximizing Loss Functions
8 0.10215144 100 nips-2003-Laplace Propagation
9 0.095215775 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
10 0.092578717 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
11 0.081852391 115 nips-2003-Linear Dependent Dimensionality Reduction
12 0.081754953 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
13 0.079117097 113 nips-2003-Learning with Local and Global Consistency
14 0.07526613 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
15 0.074972384 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
16 0.073854014 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
17 0.072746642 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
18 0.069817863 117 nips-2003-Linear Response for Approximate Inference
19 0.069356449 78 nips-2003-Gaussian Processes in Reinforcement Learning
20 0.066523328 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
topicId topicWeight
[(0, -0.228), (1, -0.095), (2, -0.045), (3, -0.01), (4, 0.168), (5, 0.041), (6, 0.073), (7, -0.204), (8, -0.004), (9, -0.001), (10, -0.08), (11, 0.044), (12, -0.203), (13, -0.145), (14, 0.098), (15, 0.195), (16, 0.041), (17, -0.399), (18, 0.138), (19, 0.188), (20, 0.163), (21, 0.01), (22, 0.026), (23, -0.097), (24, 0.07), (25, -0.031), (26, -0.075), (27, 0.062), (28, 0.151), (29, -0.007), (30, -0.006), (31, 0.088), (32, 0.129), (33, -0.095), (34, 0.058), (35, 0.144), (36, 0.067), (37, 0.054), (38, -0.125), (39, -0.139), (40, 0.022), (41, -0.056), (42, 0.03), (43, -0.062), (44, -0.018), (45, 0.012), (46, -0.06), (47, 0.053), (48, -0.076), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.934026 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.89420855 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty
Author: Harald Steck, Tommi S. Jaakkola
Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1
3 0.37202394 1 nips-2003-1-norm Support Vector Machines
Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1
4 0.32981881 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
Author: Yoshua Bengio, Yves Grandvalet
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1
5 0.31837851 100 nips-2003-Laplace Propagation
Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1
6 0.30499473 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
7 0.29366291 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
8 0.27576435 124 nips-2003-Max-Margin Markov Networks
9 0.27212769 111 nips-2003-Learning the k in k-means
10 0.2673347 193 nips-2003-Variational Linear Response
11 0.25712079 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
12 0.24739523 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
13 0.247235 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
14 0.24388197 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
15 0.24231517 122 nips-2003-Margin Maximizing Loss Functions
16 0.23563136 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
17 0.23110542 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
18 0.22965127 115 nips-2003-Linear Dependent Dimensionality Reduction
19 0.22776763 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
20 0.22742848 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
topicId topicWeight
[(0, 0.033), (11, 0.019), (29, 0.022), (35, 0.07), (49, 0.012), (53, 0.109), (71, 0.071), (73, 0.343), (76, 0.065), (85, 0.105), (91, 0.058), (99, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.79128164 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Author: Dörthe Malzahn, Manfred Opper
Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1
2 0.77639747 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
Author: Charles Rosenberg, Alok Ladsariya, Tom Minka
Abstract: We present a Bayesian approach to color constancy which utilizes a nonGaussian probabilistic model of the image formation process. The parameters of this model are estimated directly from an uncalibrated image set and a small number of additional algorithmic parameters are chosen using cross validation. The algorithm is empirically shown to exhibit RMS error lower than other color constancy algorithms based on the Lambertian surface reflectance model when estimating the illuminants of a set of test images. This is demonstrated via a direct performance comparison utilizing a publicly available set of real world test images and code base.
3 0.52048784 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
4 0.49091619 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
5 0.49070352 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.48927703 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
7 0.48815051 189 nips-2003-Tree-structured Approximations by Expectation Propagation
8 0.48753703 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
9 0.48731989 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
10 0.48639214 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
11 0.4862971 124 nips-2003-Max-Margin Markov Networks
12 0.4847647 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
13 0.48206338 112 nips-2003-Learning to Find Pre-Images
14 0.48195228 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
15 0.48170877 126 nips-2003-Measure Based Regularization
16 0.48141649 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
17 0.4808093 145 nips-2003-Online Classification on a Budget
18 0.48000708 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
19 0.47911724 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
20 0.47896168 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers