nips nips2006 nips2006-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. [sent-12, score-0.245]
2 The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. [sent-13, score-0.493]
3 Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. [sent-14, score-0.424]
4 In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. [sent-15, score-0.324]
5 The danger of overfitting in such high-dimensional spaces is countered by maximising the margin of the classifier on the training examples. [sent-17, score-0.142]
6 For this reason there has been considerable interest in bounds on the generalisation in terms of the margin. [sent-18, score-0.107]
7 Early bounds have relied on covering number computations [7], while later bounds have considered Rademacher complexity. [sent-19, score-0.146]
8 The tightest bounds for practical applications appear to be the PAC-Bayes bounds [4, 5]. [sent-20, score-0.356]
9 In particular, the form given in [3] is especially attractive for margin classifiers such as the SVM. [sent-21, score-0.051]
10 PAC-Bayesian bounds have also been applied to other Machine Learning models such as Gaussian Processes [6]. [sent-22, score-0.073]
11 The aim of this paper is to consider a refinement of the PAC-Bayes approach and investigate whether it can improve on the original PAC-Bayes bound and uphold its capabilities of delivering reliable model selection. [sent-23, score-0.326]
12 The standard PAC-Bayes bound uses a Gaussian prior centred at the origin in weight space. [sent-24, score-0.556]
13 The key to the new bound is to use part of the training set to compute a more informative prior and then compute the bound on the remainder of the examples relative to this prior. [sent-25, score-0.8]
14 The bounds are tested experimentally on several classification tasks, including model selection, on common benchmark datasets. [sent-26, score-0.073]
15 Section 2 briefly reviews the PAC-Bayes bound for SVMs obtained in [3]. [sent-28, score-0.245]
16 The new bound obtained by means of the refinement of the prior is presented in Section 3. [sent-29, score-0.402]
17 The experimental work, included in Section 4, compares the tightness of the new bound with the original one and assesses its usability in a model selection task. [sent-30, score-0.355]
18 Let us consider a distribution D of patterns x lying in a certain input space X , with their corresponding output labels y, y ∈ {−1, 1}. [sent-33, score-0.076]
19 For these two quantities we can derive the PAC-Bayes Bound on the true error of the distribution of classifiers: Theorem 2. [sent-37, score-0.071]
20 1 (PAC-Bayes Bound) For all prior distributions $P(c)$ over the classifiers $c$, and for any $\delta \in (0, 1]$,
$$\Pr_{S \sim D^m}\left( \forall Q(c):\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) \le \frac{\mathrm{KL}(Q(c)\,\|\,P(c)) + \ln\frac{m+1}{\delta}}{m} \right) \ge 1 - \delta,$$
where KL is the Kullback-Leibler divergence, $\mathrm{KL}(q\,\|\,p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ and $\mathrm{KL}(Q(c)\,\|\,P(c)) = \mathbf{E}_{c \sim Q} \ln\frac{Q(c)}{P(c)}$. [sent-38, score-0.64]
21 The proof of the theorem can be found in [3]. [sent-39, score-0.067]
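Turning Theorem 2.1 into a numeric risk bound requires inverting the binary KL divergence in its second argument: given the empirical stochastic error and the right-hand side, one looks for the largest true error still consistent with them. Since $\mathrm{KL}(\hat{q}\,\|\,p)$ is increasing in $p$ for $p \ge \hat{q}$, a simple bisection suffices. The sketch below is only an illustration of that step; Python and the function names are my own choices, not part of the paper.

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli distributions, clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, bound):
    """Largest p in [q_hat, 1) with kl(q_hat||p) <= bound, found by bisection."""
    lo, hi = q_hat, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(q_hat, kl_qp, m, delta=0.05):
    """Upper bound on the true stochastic error Q_D implied by Theorem 2.1."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse(q_hat, rhs)
```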
22 This bound can be particularised for the case of linear classifiers in the following way. [sent-40, score-0.321]
23 For any vector w we can define a stochastic classifier in the following way: we choose the distribution Q = Q(w, µ) to be a spherical Gaussian with identity covariance matrix centred on the direction given by w at a distance µ from the origin. [sent-42, score-0.341]
24 Moreover, we can choose the prior P (c) to be a spherical Gaussian with identity covariance matrix centred on the origin. [sent-43, score-0.435]
25 2 (PAC-Bayes Bound for margin classifiers [3]) For all distributions D, for all classifiers given by $w$ and $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_S(w, \mu) \,\|\, Q_D(w, \mu)) \le \frac{\frac{\mu^2}{2} + \ln\frac{m+1}{\delta}}{m} \right) \ge 1 - \delta.$$ [sent-48, score-0.051]
26 It can be shown (see [3]) that
$$\hat{Q}_S(w, \mu) = \mathbf{E}_m\big[\tilde{F}(\mu\,\gamma(x, y))\big] \qquad (2)$$
where $\mathbf{E}_m$ is the average over the m training examples, $\gamma(x, y)$ is the normalised margin of the training patterns
$$\gamma(x, y) = \frac{y\,w^T\phi(x)}{\|\phi(x)\|\,\|w\|} \qquad (3)$$
and $\tilde{F} = 1 - F$, where $F$ is the cumulative normal distribution $F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$. [sent-49, score-0.347]
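Given the normalised margins of equation (3), the empirical stochastic error of equation (2) is just an average of Gaussian tail probabilities. A minimal sketch, assuming the margins have already been computed; the use of Python/SciPy is my own choice, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def stochastic_error(margins, mu):
    """Empirical stochastic error Q_S(w, mu) = E_m[ F_tilde(mu * gamma(x, y)) ].

    margins[i] is the normalised margin y_i w^T phi(x_i) / (||phi(x_i)|| ||w||)
    of equation (3); norm.sf is the Gaussian tail F_tilde = 1 - F of equation (2).
    """
    return float(np.mean(norm.sf(mu * np.asarray(margins))))
```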
27 The generalisation error of such a classifier can be bounded by at most twice the true (stochastic) error QD (w, µ) in Corollary 2. [sent-51, score-0.176]
28 3 Choosing a prior for the PAC-Bayes Bound Our first contribution is motivated by the fact that the PAC-Bayes bound allows us to choose the prior distribution, P (c). [sent-53, score-0.559]
29 In the standard application of the bound this is chosen to be a Gaussian centred at the origin. [sent-54, score-0.377]
30 We now consider learning a different prior based on training an SVM on a subset R of the training set comprising r training patterns and labels. [sent-55, score-0.506]
31 With these r examples we can determine an SVM classifier $w_r$, and form a prior $P(w|w_r)$ consisting of a Gaussian distribution with identity covariance matrix centred on $w_r$. [sent-57, score-1.293]
32 The introduction of this prior P (w|wr ) in Theorem 2. [sent-58, score-0.157]
33 1 (Single Prior based PAC-Bayes Bound for margin classifiers) Let us consider a prior on the distribution of classifiers consisting of a spherical Gaussian with identity covariance centred along the direction given by $w_r$ at a distance η from the origin. [sent-61, score-0.968]
34 Then, for all distributions D, for all classifiers $w_m$ and $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_{S\setminus R}(w_m, \mu) \,\|\, Q_D(w_m, \mu)) \le \frac{\frac{\|\eta w_r - \mu w_m\|^2}{2} + \ln\frac{m-r+1}{\delta}}{m-r} \right) \ge 1 - \delta$$
where $\hat{Q}_{S\setminus R}$ is a stochastic measure of the error of the classifier on the m − r samples not used to learn the prior. [sent-62, score-0.353]
35 This stochastic error is computed as indicated in equation (2) averaged over S\R. [sent-63, score-0.155]
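Compared with the standard bound, only two quantities change: the KL term between posterior and prior becomes $\|\eta w_r - \mu w_m\|^2/2$, and the sample size becomes m − r. Below is a sketch of how the bound of Theorem 3.1 could be assembled, under the assumption that both weight vectors are given as unit-length directions in the same representation; `kl_inverse` is the bisection from the earlier sketch, and none of the names come from the paper.

```python
import math
import numpy as np

def prior_pac_bayes_bound(w_r, w_m, eta, mu, q_hat_holdout, m_minus_r, delta=0.05):
    """Upper bound on Q_D(w_m, mu) implied by Theorem 3.1.

    w_r            -- unit-length direction of the prior, learnt on the r reserved examples
    w_m            -- unit-length direction of the SVM trained on the whole training set
    q_hat_holdout  -- empirical stochastic error (equation (2)) averaged over the
                      examples in S that are not in R
    """
    kl_term = 0.5 * np.sum((eta * np.asarray(w_r) - mu * np.asarray(w_m)) ** 2)
    rhs = (kl_term + math.log((m_minus_r + 1) / delta)) / m_minus_r
    return kl_inverse(q_hat_holdout, rhs)  # bisection on the binary KL, as sketched above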
36 Proof Since we separate r instances to learn the prior, the actual size of the training set to which we apply the bound is m − r. [sent-64, score-0.431]
37 In addition, the stochastic error must be computed only on the instances not used to learn the prior, i.e. on S \ R. [sent-65, score-0.182]
38 The remainder of the argument follows that of Theorem 2.2 when applied to the SVM weight vector on the whole training set. [sent-69, score-0.091]
39 It is perhaps worth stressing that the bound holds for all wm and so can be applied to the SVM trained on the whole set. [sent-70, score-0.5]
40 This might at first appear as ’cheating’, but the critical point is that the bound is evaluated on the set S\R not involved in generating the prior. [sent-71, score-0.269]
41 The experimental work illustrates how in fact this bound can be tighter than the standard PAC-Bayes bound. [sent-72, score-0.343]
42 Moreover, the selection of the prior may be further refined in exchange for a very small increase in the penalty term. [sent-73, score-0.211]
43 Then, for all priors $P(c) \in \{P_j(c)\}_{j=1}^{J}$, for all posterior distributions Q(c), for all $\delta \in (0, 1]$,
$$\Pr_{S \sim D^m}\left( \forall Q(c), \forall j:\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) \le \frac{\mathrm{KL}(Q(c)\,\|\,P_j(c)) + \ln\frac{m+1}{\delta} + \ln\frac{1}{\pi_j}}{m} \right) \ge 1 - \delta.$$
Proof The bound in Theorem 2. [sent-77, score-0.613]
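The argument behind Theorem 3.2 is the standard union-bound refinement of Theorem 2.1: apply it with each prior $P_j$ at confidence $\pi_j\delta$ and sum the failure probabilities, which is exactly where the extra $\ln\frac{1}{\pi_j}$ term comes from (a sketch of that step, not the paper's own proof text):
$$\Pr_{S \sim D^m}\left( \exists j, \exists Q:\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) > \frac{\mathrm{KL}(Q\,\|\,P_j) + \ln\frac{m+1}{\pi_j\delta}}{m} \right) \le \sum_{j=1}^{J} \pi_j\delta = \delta, \qquad \ln\frac{m+1}{\pi_j\delta} = \ln\frac{m+1}{\delta} + \ln\frac{1}{\pi_j}.$$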
44 This result can also be particularised for the case of SVM classifiers. [sent-80, score-0.076]
45 The set of priors is constructed by allocating Gaussian distributions with identity covariance matrix along the direction given by $w_r$ at distances $\{\eta_j\}_{j=1}^{J}$ from the origin, where $\{\eta_j\}_{j=1}^{J}$ are real numbers. [sent-81, score-0.658]
46 3 (Multiple Prior PAC-Bayes Bound for linear classifiers) Let us consider a set $\{P_j(w|w_r, \eta_j)\}_{j=1}^{J}$ of prior distributions of classifiers consisting of spherical Gaussian distributions with identity covariance matrix centred on $\eta_j w_r$, where $\{\eta_j\}_{j=1}^{J}$ are real numbers. [sent-83, score-0.894]
47 Then, for all distributions D, for all classifiers $w$, for all $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_{S\setminus R}(w, \mu) \,\|\, Q_D(w, \mu)) \le \frac{\frac{\|\eta_j w_r - \mu w\|^2}{2} + \ln\frac{m-r+1}{\delta} + \ln J}{m-r} \right) \ge 1 - \delta$$
Proof The proof is straightforward, substituting $\pi_j = \frac{1}{J}$ for all $j$ in Theorem 3. [sent-84, score-0.652]
48 2 and computing the KL divergence between prior and posterior as in the proof of Corollary 3. [sent-85, score-0.218]
49 In the case of several priors, the search is repeated for every prior and the reported value of the bound is the tightest. [sent-90, score-0.425]
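Under Corollary 3.3, trying J different prior lengths costs only an additive ln J in the numerator, so the search simply evaluates the bound at every scaling and keeps the tightest value. A sketch of that loop follows; it is illustrative only, the scalings are taken equally spaced between η = 1 and η = 100 as in the experiments of Section 4, the search over µ is omitted for brevity, and `kl_inverse` is again the bisection from the earlier sketch.

```python
import math
import numpy as np

def multiple_prior_bound(w_r, w_m, mu, q_hat_holdout, m_minus_r,
                         n_scalings=10, eta_min=1.0, eta_max=100.0, delta=0.05):
    """Tightest bound over a grid of prior lengths eta_j (Corollary 3.3)."""
    best = 1.0
    for eta in np.linspace(eta_min, eta_max, n_scalings):
        kl_term = 0.5 * np.sum((eta * np.asarray(w_r) - mu * np.asarray(w_m)) ** 2)
        rhs = (kl_term + math.log((m_minus_r + 1) / delta)
               + math.log(n_scalings)) / m_minus_r
        best = min(best, kl_inverse(q_hat_holdout, rhs))
    return best
```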
50 In Section 4 we present experimental results comparing this new bound to the standard PAC-Bayes bound and using it to guide model selection. [sent-91, score-0.49]
51 4 Experiments The tightness of the new bound is evaluated in a model selection and classification task using some UCI [1] datasets (see their description in terms of number of instances, input dimension and number of positive/negative examples in Table 1). [sent-92, score-0.38]
52 For every dataset, we obtain 50 different training/test set partitions with 80% of the samples forming the training set and the remaining 20% forming the test set. [sent-95, score-0.24]
53 With each of the partitions we learn an SVM classifier with a Gaussian RBF kernel, preceded by model selection. [sent-96, score-0.055]
54 The model selection consists of determining an optimal pair of hyperparameters (C, σ). [sent-97, score-0.122]
55 C is the SVM trade-off between the maximisation of the margin and the minimisation of the hinge loss of the training samples, while σ is the width of the Gaussian kernel. [sent-98, score-0.142]
56 For completeness, this model selection is guided by the PAC-Bayes bound: we select the model corresponding to the pair that yields the lowest value of QD in the bound. [sent-105, score-0.115]
57 For every partition we use the minimum value of the bound resulting from all the pairs (C, σ) of the grid. [sent-107, score-0.29]
58 Note that this procedure is computationally less costly than the commonly used N -fold cross validation model selection, since it saves the training of N classifiers (one for each fold) for each parameter combination. [sent-108, score-0.117]
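A sketch of the bound-driven grid search described above; scikit-learn's SVC is used purely for illustration, and `compute_bound` stands for whichever of the bounds is being evaluated — neither choice comes from the paper.

```python
import itertools
from sklearn.svm import SVC

def select_by_bound(X_train, y_train, C_grid, sigma_grid, compute_bound):
    """Pick the (C, sigma) pair whose trained SVM yields the smallest bound value.

    compute_bound(model, X_train, y_train) should return an upper bound on the
    stochastic error; only one SVM is trained per parameter pair, versus N per
    pair for N-fold cross validation.
    """
    best_pair, best_value = None, float("inf")
    for C, sigma in itertools.product(C_grid, sigma_grid):
        model = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
        model.fit(X_train, y_train)
        value = compute_bound(model, X_train, y_train)
        if value < best_value:
            best_pair, best_value = (C, sigma), value
    return best_pair, best_value
```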
59 Table 2: Averaged PAC-Bayes Bound and Test Error Rate obtained by the model that yielded the lowest bound in each of the 50 training/test partitions. [sent-125, score-0.245]
60 We repeated this experiment using the Prior PAC-Bayes Bound with different configurations for learning the prior distribution of classifiers. [sent-126, score-0.157]
61 These configurations are defined by variations on the percentage of training patterns separated to compute the prior and on the number of scalings of the magnitude of that prior. [sent-127, score-0.791]
62 The scalings represent different lengths η of ||wr || equally spaced between η = 1 and η = 100. [sent-128, score-0.323]
63 To summarize, for every training/test partition and for every pair (% patterns, # of scalings) we look at the pair (C, σ) that outputs the smallest value of QD. [sent-129, score-0.142]
64 In this case, using the Prior PAC-Bayes Bound to perform the model selection increases the computational burden over the original PAC-Bayes bound by the training of one extra classifier (the one used to learn the prior), compared with the extra N classifiers (one per fold) needed by N-fold cross validation. [sent-130, score-0.275]
65 It seems that ten scalings of the prior are enough to obtain tighter bounds, since the use of 100 or 500 scalings does not improve the best results. [sent-132, score-0.928]
66 With respect to the percentage of training instances left out to learn the prior, something close to 50% of the training set works well in the considered problems. [sent-133, score-0.366]
67 It is worth mentioning that we treat each position in the Table as a separate experiment. [sent-134, score-0.048]
68 [Table 3: rows indexed by the number of scalings of the prior; columns by the percentage of the training set (10%, 20%, 30%, 40%, 50%) used to compute the prior; the numeric entries were not recoverable from the extraction.]
72 Table 3: Averaged Prior PAC-Bayes bound for different settings of percentage of training instances reserved to compute the prior and of number of scalings of the normalised prior. [sent-303, score-1.036]
73 This would have involved a further application of the union bound with the 20 entries of the Table for each problem, at the cost of adding an extra ln(20)/m (0. [sent-305, score-0.292]
74 We decided to fix the number of scalings and the amount of training patterns used to compute the prior, since trying all of the different options would increase the computational burden of the model selection. [sent-308, score-0.749]
75 Table 5 displays the test error rate obtained by SVMs with their hyperparameters tuned on the above mentioned grid by means of ten-fold cross-validation, which serves as a baseline method for comparison purposes. [sent-310, score-0.235]
76 According to the values shown in the tables, the Prior PAC-Bayes bound achieves tighter predictions of the generalization error of the randomized classifier in almost all cases. [sent-311, score-0.438]
77 Notice how the length of the prior is not so critical in comparison with its direction. [sent-312, score-0.157]
78 The goodness of the latter relies on the subset of samples left out for the purpose of learning the prior classifier. [sent-313, score-0.191]
79 [Table 4: same layout as Table 3 (rows indexed by the number of scalings, columns by the percentage of the training set used to compute the prior); the numeric entries were not recoverable from the extraction.]
83 Table 4: Averaged Test Error Rate corresponding to the model determined by the bound for the different settings of Table 3. [sent-483, score-0.245]
84 [Table 5 header: Problem (Wdbc, Image, Waveform, Ringnorm) and Cross-validation error rate; the numeric entries were not recoverable from the extraction.] [sent-484, score-0.119]
85 For every partition we select the test error rate corresponding to the model reporting the smallest cross-validation error. [sent-501, score-0.236]
86 However, the comparison with Table 5 points out that the PAC-Bayes bound is not as accurate as ten-fold cross-validation when it comes to selecting a model that yields a low test error rate. [sent-502, score-0.364]
87 Nevertheless, in two out of the four problems (waveform and wdbc) the bound provided a model as good as the one found by cross-validation, and in ringnorm the error bars overlap. [sent-503, score-0.449]
88 We conclude the discussion by pointing out that the cross-validation error rate cannot be used directly as a prediction of the expected test error rate in the sense of worst-case performance. [sent-504, score-0.286]
89 Of course the values of the cross-validation error rate and the test error rate are close, but it is difficult to predict how close they are going to be. [sent-505, score-0.286]
90 5 Conclusions and ongoing research In this paper we have presented a version of the PAC-Bayes bound for linear classifiers that introduces the learning of the prior distribution over the classifiers. [sent-506, score-0.428]
91 This prior distribution is a Gaussian with identity covariance matrix. [sent-507, score-0.243]
92 The mean weight vector is learnt in the following way: its direction is determined from a separate subset of the training examples, while its length has to be chosen from an a priori fixed set of lengths. [sent-508, score-0.138]
93 The experimental work shows that this new version of the bound achieves tighter predictions of the generalization error of the stochastic classifier, compared to the original PAC-Bayes bound predictions. [sent-509, score-0.723]
94 Moreover, if the model selection is driven by the bound, the Prior PAC-Bayes does not degrade the quality of the model selected by the original bound. [sent-510, score-0.078]
95 Nevertheless, it has to be said that in some of our experiments the model selected by the bounds was as accurate as the one selected by ten-fold cross-validation in terms of test error rate on a separate test set. [sent-511, score-0.312]
96 This fact is remarkable since including model selection in the training of the classifier roughly multiplies the computational burden by ten when using ten-fold cross-validation, but only roughly by two when using the prior PAC-Bayes bound. [sent-512, score-0.491]
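One way to make these factors concrete, under the assumption of a grid of G candidate (C, σ) pairs: ten-fold cross-validation trains about ten classifiers per pair, while the prior PAC-Bayes selection trains about two (the prior on the reserved subset R and the classifier on the full training set), and the original PAC-Bayes bound trains one:
$$\text{ten-fold CV: } \approx 10\,G \ \text{trainings}, \qquad \text{prior PAC-Bayes: } \approx 2\,G, \qquad \text{original PAC-Bayes: } \approx G.$$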
97 Of course the original PAC-Bayes bound provides a cheaper model selection, but its predictions about the generalization capabilities are more pessimistic. [sent-513, score-0.105]
98 The amount of training patterns used to learn the prior seems to be a key aspect in the goodness of this prior and thus in the tightness of the bound. [sent-514, score-0.604]
99 Therefore, ongoing research includes methods to systematically determine an amount of patterns that provides suitable priors. [sent-515, score-0.102]
100 Another line of research explores the use of these bounds to reinforce different properties of the design of classifiers, such as sparsity. [sent-516, score-0.073]
wordName wordTfidf (topN-words)
[('wr', 0.459), ('scalings', 0.323), ('qd', 0.28), ('bound', 0.245), ('wm', 0.209), ('kl', 0.181), ('qs', 0.169), ('ln', 0.161), ('prior', 0.157), ('prs', 0.153), ('classi', 0.152), ('ers', 0.15), ('ringnorm', 0.133), ('centred', 0.132), ('pj', 0.122), ('percentage', 0.113), ('er', 0.111), ('wdbc', 0.111), ('ew', 0.107), ('waveform', 0.107), ('tighter', 0.098), ('training', 0.091), ('capabilities', 0.081), ('misclassifying', 0.076), ('particularised', 0.076), ('patterns', 0.076), ('svm', 0.074), ('corollary', 0.073), ('bounds', 0.073), ('error', 0.071), ('burden', 0.071), ('dm', 0.068), ('spherical', 0.06), ('table', 0.058), ('tightness', 0.056), ('selection', 0.054), ('cd', 0.053), ('lund', 0.051), ('winsconsin', 0.051), ('margin', 0.051), ('rate', 0.048), ('identity', 0.048), ('test', 0.048), ('priors', 0.046), ('ec', 0.045), ('cs', 0.045), ('spain', 0.044), ('averaged', 0.044), ('pr', 0.043), ('stochastic', 0.04), ('instances', 0.038), ('covariance', 0.038), ('breast', 0.038), ('nement', 0.038), ('tightest', 0.038), ('normalised', 0.038), ('displays', 0.037), ('pair', 0.037), ('theorem', 0.035), ('generalisation', 0.034), ('goodness', 0.034), ('learn', 0.033), ('proof', 0.032), ('compute', 0.031), ('hyperparameters', 0.031), ('divergence', 0.029), ('wt', 0.029), ('cancer', 0.028), ('forming', 0.028), ('gaussian', 0.028), ('fold', 0.027), ('ten', 0.027), ('ongoing', 0.026), ('trick', 0.026), ('cross', 0.026), ('description', 0.025), ('london', 0.024), ('uci', 0.024), ('selected', 0.024), ('involved', 0.024), ('separate', 0.024), ('select', 0.024), ('worth', 0.024), ('predictions', 0.024), ('arrive', 0.023), ('union', 0.023), ('direction', 0.023), ('every', 0.023), ('partition', 0.022), ('origin', 0.022), ('gurations', 0.022), ('partitions', 0.022), ('ywt', 0.022), ('selections', 0.022), ('allocating', 0.022), ('cheating', 0.022), ('isabelle', 0.022), ('stressing', 0.022), ('remarked', 0.022), ('favourably', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.16194411 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set.
3 0.12385213 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
4 0.1140115 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
5 0.10969641 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
6 0.092297308 186 nips-2006-Support Vector Machines on a Budget
7 0.090551741 50 nips-2006-Chained Boosting
8 0.087754592 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
9 0.087013252 33 nips-2006-Analysis of Representations for Domain Adaptation
10 0.084656477 109 nips-2006-Learnability and the doubling dimension
11 0.07624846 21 nips-2006-AdaBoost is Consistent
12 0.07514964 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
13 0.074051805 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.073349729 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
15 0.073301353 156 nips-2006-Ordinal Regression by Extended Binary Classification
16 0.070177801 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.069801778 150 nips-2006-On Transductive Regression
18 0.06965135 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
19 0.068754219 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
20 0.06764181 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
topicId topicWeight
[(0, -0.2), (1, 0.074), (2, -0.044), (3, -0.064), (4, -0.064), (5, 0.26), (6, -0.067), (7, 0.075), (8, 0.03), (9, 0.053), (10, 0.08), (11, 0.092), (12, -0.062), (13, -0.004), (14, -0.019), (15, -0.02), (16, -0.001), (17, -0.048), (18, 0.012), (19, 0.07), (20, -0.013), (21, 0.018), (22, -0.032), (23, -0.024), (24, -0.036), (25, -0.026), (26, 0.05), (27, 0.112), (28, 0.07), (29, -0.034), (30, 0.032), (31, 0.043), (32, 0.094), (33, 0.032), (34, 0.025), (35, 0.04), (36, 0.014), (37, 0.016), (38, 0.074), (39, -0.017), (40, -0.063), (41, 0.045), (42, -0.03), (43, -0.14), (44, 0.107), (45, -0.116), (46, -0.13), (47, 0.044), (48, -0.076), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.94361913 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.67474723 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. 
For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . 
(5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. 
Moreover, we will restrict ourselves to the case where the $k$ classifiers from $H$ are chosen independently, each according to the prior $P$ on $H$ (however, other choices for $\bar{P}$ are clearly possible). In this case we have
$$\mathrm{KL}(\bar{Q}\,\|\,\bar{P}) = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\mathop{\mathbf{E}}_{h_{1\text{-}k}\sim Q^k}\ln\frac{|g(k)|\cdot Q^{k}(h_{1\text{-}k})}{|g(k)|\cdot P^{k}(h_{1\text{-}k})} = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\mathop{\mathbf{E}}_{h_1\sim Q}\cdots\mathop{\mathbf{E}}_{h_k\sim Q}\ln\prod_{i=1}^{k}\frac{Q(h_i)}{P(h_i)} = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\cdot k\cdot\mathop{\mathbf{E}}_{h\sim Q}\ln\frac{Q(h)}{P(h)} = \bar{k}\cdot\mathrm{KL}(Q\,\|\,P), \qquad (10)$$
where
$$\bar{k} \stackrel{\mathrm{def}}{=} \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\cdot k. \qquad (11)$$
We then have the following theorem.

Theorem 1. For any set $H$ of binary classifiers, any prior distribution $\bar{P}$ on $H^{*}$, and any $\delta \in (0,1]$, we have
$$\Pr_{S\sim D^{m}}\Big(\forall\, \bar{Q} \text{ on } H^{*}:\ \mathrm{kl}\big(R_S(G_{\bar{Q}})\,\|\,R(G_{\bar{Q}})\big) \le \frac{1}{m}\Big[\mathrm{KL}(\bar{Q}\,\|\,\bar{P}) + \ln\frac{m+1}{\delta}\Big]\Big) \ge 1-\delta.$$
Proof. The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space $H^{*}$ of binary classifiers with any zero-one valued loss function.

Note that Theorem 1 directly provides upper and lower bounds on $\zeta_Q$ when we use Equations 7 and 8 to relate $R(G_{\bar{Q}})$ and $R_S(G_{\bar{Q}})$ to $\zeta_Q$ and $\hat{\zeta}_Q$, and when we use Equation 10 for $\mathrm{KL}(\bar{Q}\,\|\,\bar{P})$. Consequently, we have the following theorem.

Theorem 2. Consider any loss function $\zeta_Q(x,y)$ defined by Equation 1. Let $\zeta_Q$ and $\hat{\zeta}_Q$ be, respectively, the expected loss and its empirical estimate (on a sample of $m$ examples) as defined by Equation 3. Let $c$ and $\bar{k}$ be defined by Equations 6 and 11, respectively. Then for any set $H$ of binary classifiers, any prior distribution $P$ on $H$, and any $\delta \in (0,1]$, we have
$$\Pr_{S\sim D^{m}}\Big(\forall\, Q \text{ on } H:\ \mathrm{kl}\Big(\frac{1}{c}\big(\hat{\zeta}_Q - \tfrac{1}{2}\big) + \tfrac{1}{2}\ \Big\|\ \frac{1}{c}\big(\zeta_Q - \tfrac{1}{2}\big) + \tfrac{1}{2}\Big) \le \frac{1}{m}\Big[\bar{k}\cdot\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}\Big]\Big) \ge 1-\delta.$$

4 Bound Behavior During Adaboost

We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk $\mathbf{E}_{(x,y)} W_Q(x,y)$ (i.e., a small expected margin) and a small $\mathrm{Var}_{(x,y)} W_Q(x,y)$ (i.e., a small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances.

We have used the "symmetric" version of Adaboost [10, 9] where, at each boosting round $t$, the weak learning algorithm produces a classifier $h_t$ with the smallest empirical error
$$\epsilon_t = \sum_{i=1}^{m} D_t(i)\, I[h_t(x_i) \neq y_i]$$
with respect to the boosting distribution $D_t(i)$ on the indices $i \in \{1,\dots,m\}$ of the training examples. After each boosting round $t$, this distribution is updated according to
$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i)\exp\big(-y_i\,\alpha_t\,h_t(x_i)\big),$$
where $Z_t$ is the normalization constant required for $D_{t+1}$ to be a distribution, and where
$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.$$
A short sketch of this reweighting step is given below. Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard "decision stumps" for the set $H$ of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs $+y$ when the tested attribute exceeds the threshold and predicts $-y$ otherwise, where $y \in \{-1,+1\}$. For each decision stump $h \in H$, its boolean complement is also in $H$. Hence, we have $2[k(i)-1]$ possible decision stumps on an attribute $i$ having $k(i)$ possible (discrete) values. Hence, for data sets having $n$ attributes, we have exactly
$$|H| = \sum_{i=1}^{n} 2[k(i)-1]$$
classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments.
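The reweighting step of the "symmetric" Adaboost variant described above can be written in a few lines. The sketch below is an illustrative addition run on hypothetical toy data, not the experimental code used in the paper.

```python
import numpy as np

def adaboost_reweight(D_t, h_t_outputs, y):
    """One round of the 'symmetric' Adaboost reweighting described above.

    D_t:          boosting distribution over the m training examples
    h_t_outputs:  h_t(x_i) in {-1, +1} for each example
    y:            labels in {-1, +1}
    Returns (D_{t+1}, alpha_t, eps_t).
    """
    eps_t = float(np.sum(D_t * (h_t_outputs != y)))      # weighted training error of h_t
    alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)        # vote weight of h_t
    D_next = D_t * np.exp(-y * alpha_t * h_t_outputs)    # up-weight the mistakes of h_t
    return D_next / D_next.sum(), alpha_t, eps_t         # division by Z_t

# Toy run on six hypothetical examples and one stump's predictions.
y = np.array([1, 1, 1, -1, -1, -1])
h = np.array([1, 1, -1, -1, -1, 1])                      # errs on examples 3 and 6
D = np.full(6, 1.0 / 6.0)
D, alpha, eps = adaboost_reweight(D, h, y)
print(eps, alpha, D)                                     # eps = 1/3; the two mistakes now weigh more
```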
From Theorem 2 and Equation 10, the bound on $\zeta_Q$ depends on $\mathrm{KL}(Q\,\|\,P)$. We have chosen a uniform prior $P(h) = 1/|H|$ $\forall h \in H$. We therefore have
$$\mathrm{KL}(Q\,\|\,P) = \sum_{h\in H} Q(h)\ln\frac{Q(h)}{P(h)} = \sum_{h\in H} Q(h)\ln Q(h) + \ln|H| = -H(Q) + \ln|H|.$$
At boosting round $t$, Adaboost changes the distribution from $D_t$ to $D_{t+1}$ by putting more weight on the examples that are incorrectly classified by $h_t$. This strategy is supported by the proposed bound on $\zeta_Q$ since it has the effect of increasing the entropy $H(Q)$ as a function of $t$. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of $t$ in all of our boosting experiments.

We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss.

4.1 Results for the Exponential Loss

The exponential loss $E_Q(x,y)$ is the obvious choice for boosting since the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round (Footnote 2). More precisely, we have chosen
$$E_Q(x,y) \stackrel{\mathrm{def}}{=} \frac{1}{2}\exp\big(\beta[2W_Q(x,y)-1]\big). \qquad (12)$$
For this loss function, we have
$$c = e^{\beta} - 1, \qquad \bar{k} = \frac{\beta}{1 - e^{-\beta}}.$$
Since $c$ increases exponentially rapidly with $\beta$, so will the risk upper bound for $E_Q$. Hence, unfortunately, we can obtain a tight upper bound only for small values of $\beta$.

All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior of the exponential loss bound on the Mushroom and Sonar data sets, containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as a function of the number $T$ of boosting rounds, the risk of the Gibbs classifier $\mathbf{E}_{(x,y)} W_Q(x,y)$ increases as a function of $T$ while its variance $\mathrm{Var}_{(x,y)} W_Q(x,y)$ decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far (Footnote 3). Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of $\beta$. Interestingly, however, the risk bound curves remain parallel to the true risk curves.

4.2 Results for the Sigmoid Loss

We have also investigated the sigmoid loss $T_Q(x,y)$ defined by
$$T_Q(x,y) \stackrel{\mathrm{def}}{=} \frac{1}{2} + \frac{1}{2}\tanh\big(\beta[2W_Q(x,y)-1]\big). \qquad (13)$$

Footnote 2: In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination $f_Q$ is not always decreasing, as we shall see.
Footnote 3: These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vs-kp, USvotes, mushroom, and sonar.

Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ) on test), its variance (Var(WQ) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.
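Putting the pieces together, the Theorem 2 bound for the exponential loss can be evaluated by computing $\mathrm{KL}(Q\,\|\,P) = -H(Q) + \ln|H|$, forming the right-hand side with $c = e^{\beta}-1$ and $\bar{k} = \beta/(1-e^{-\beta})$, and numerically inverting the binary kl divergence. The sketch below is an illustrative reconstruction of that computation with made-up numbers, not the authors' code.

```python
import math

def kl_bernoulli(q, p):
    # kl(q || p) between Bernoulli distributions, clipped away from {0, 1}
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q_hat, B, tol=1e-10):
    """Largest p >= q_hat with kl(q_hat || p) <= B, found by binary search."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl_bernoulli(q_hat, mid) <= B else (lo, mid)
    return lo

def exp_loss_bound(zeta_hat, entropy_Q, H_size, m, beta, delta):
    """Upper bound on zeta_Q from Theorem 2, specialized to the exponential loss."""
    c = math.exp(beta) - 1.0
    k_bar = beta / (1.0 - math.exp(-beta))
    KL_QP = -entropy_Q + math.log(H_size)          # uniform prior P(h) = 1/|H|
    B = (k_bar * KL_QP + math.log((m + 1) / delta)) / m
    q_hat = (zeta_hat - 0.5) / c + 0.5             # empirical risk of G_Qbar (Eq. 8)
    p_up = kl_inverse_upper(q_hat, B)              # bound on R(G_Qbar)
    return c * (p_up - 0.5) + 0.5                  # back to a bound on zeta_Q (Eq. 7)

# Illustrative numbers only (not taken from the paper's experiments).
print(exp_loss_bound(zeta_hat=0.42, entropy_Q=3.1, H_size=2000,
                     m=4062, beta=math.log(2), delta=0.05))
```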
Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set.

Since the Taylor series expansion of $\tanh(x)$ about $x = 0$ converges only for $|x| < \pi/2$, we are limited to $\beta \le \pi/2$. Under these circumstances, we have
$$c = \tan(\beta), \qquad \bar{k} = \frac{\beta}{\cos(\beta)\sin(\beta)}.$$
Similarly as in Figure 1, we see in Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk in Figure 1.

5 Conclusion

By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function $\zeta_Q$ that has a convergent Taylor expansion around $W_Q = 1/2$ (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk bound is tight only for small values of the scaling factor $c$ involved in the relation between the expected loss $\zeta_Q$ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier $G_{\bar{Q}}$. However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss) behaves very similarly to the true loss.

Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ) on test), its variance (Var(WQ) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.

Acknowledgments
Work supported by NSERC Discovery grants 262067 and 122405.

References
[1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.
[2] Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002.
[3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003.
[4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
[5] François Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 481–488, 2005.
[6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423–430. MIT Press, Cambridge, MA, 2003.
[7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686, 1998.
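As a check on the sigmoid-loss constants quoted in Section 4.2 above (the closed form for $\bar{k}$ was reconstructed here from Equation 11, since the extracted text was garbled at that point), the following illustrative sketch recovers $c$ and $\bar{k}$ directly from the Taylor coefficients of $\tanh$ and compares them with $\tan(\beta)$ and $\beta/(\cos\beta\sin\beta)$.

```python
import math

def tan_series_coeffs(K):
    """Coefficients t[k] of u^k in tan(u), computed from tan'(u) = 1 + tan(u)^2.
    The absolute Taylor coefficients of tanh(u) are exactly these t[k]."""
    t = [0.0] * (K + 1)
    for k in range(K):
        conv = sum(t[i] * t[k - i] for i in range(k + 1))
        t[k + 1] = (conv + (1.0 if k == 0 else 0.0)) / (k + 1)
    return t

beta = 1.0                    # must satisfy beta < pi/2 for the series to converge
K = 60
t = tan_series_coeffs(K)
g_abs = [t[k] * beta ** k for k in range(K + 1)]      # |g(k)| for the sigmoid loss

c = sum(g_abs[1:])
k_bar = sum(k * g_abs[k] for k in range(1, K + 1)) / c

print(c, math.tan(beta))                                # c matches tan(beta)
print(k_bar, beta / (math.cos(beta) * math.sin(beta)))  # k_bar matches beta / (cos(beta) sin(beta))
```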
3 0.64581579 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
4 0.62919885 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
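The disagreement metric described in this abstract is straightforward to estimate empirically. The sketch below is an illustrative addition using hypothetical threshold classifiers and a uniform domain distribution; it is not code from the cited paper.

```python
import numpy as np

def disagreement_distance(h1, h2, sample_x):
    """Monte Carlo estimate of d(h1, h2) = Pr_x[h1(x) != h2(x)] under the
    domain distribution from which sample_x was drawn."""
    preds1 = np.array([h1(x) for x in sample_x])
    preds2 = np.array([h2(x) for x in sample_x])
    return float(np.mean(preds1 != preds2))

# Two hypothetical threshold classifiers under the uniform distribution on [0, 1].
rng = np.random.default_rng(2)
xs = rng.uniform(0.0, 1.0, size=100_000)
h_a = lambda x: 1 if x > 0.3 else -1
h_b = lambda x: 1 if x > 0.5 else -1
print(disagreement_distance(h_a, h_b, xs))   # approximately 0.2 = |0.5 - 0.3|
```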
5 0.58585709 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
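One common way to instantiate the three steps sketched in this abstract is to attach a threshold index to each example and label it by whether the true rank exceeds that threshold; the ranking rule then counts the thresholds the binary classifier says are exceeded. The code below is an illustrative sketch of that idea only; the cited paper's exact construction (for instance, its per-example weighting) may differ.

```python
import numpy as np

def make_extended_examples(X, y, K):
    """Extended examples ((x, k), z) with z = +1 if y > k else -1, for k = 1..K-1.
    The threshold index k is appended as a one-hot code."""
    Xe, ze = [], []
    for x, label in zip(X, y):
        for k in range(1, K):
            Xe.append(np.concatenate([x, np.eye(K - 1)[k - 1]]))
            ze.append(1 if label > k else -1)
    return np.array(Xe), np.array(ze)

def rank_from_binary(x, K, binary_predict):
    """Ranking rule: 1 plus the number of thresholds predicted as exceeded."""
    votes = sum(binary_predict(np.concatenate([x, np.eye(K - 1)[k - 1]])) == 1
                for k in range(1, K))
    return 1 + votes

# Toy usage with K = 4 ordinal labels and 2-dimensional inputs; any binary
# learner could then be trained on (Xe, ze) and plugged into rank_from_binary.
X = np.array([[0.1, 0.2], [0.8, 0.7]])
y = np.array([1, 4])
Xe, ze = make_extended_examples(X, y, K=4)
print(Xe.shape, ze)
```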
6 0.57658786 116 nips-2006-Learning from Multiple Sources
7 0.56501937 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
8 0.53844815 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
9 0.4964608 159 nips-2006-Parameter Expanded Variational Bayesian Methods
10 0.49308848 186 nips-2006-Support Vector Machines on a Budget
11 0.48492587 155 nips-2006-Optimal Single-Class Classification Strategies
12 0.47859415 50 nips-2006-Chained Boosting
13 0.47564092 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
14 0.47241631 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
15 0.47120148 33 nips-2006-Analysis of Representations for Domain Adaptation
16 0.46348971 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
17 0.46085832 150 nips-2006-On Transductive Regression
18 0.45370644 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
19 0.41512167 185 nips-2006-Subordinate class recognition using relational object models
20 0.40934792 21 nips-2006-AdaBoost is Consistent
topicId topicWeight
[(1, 0.109), (3, 0.016), (7, 0.086), (9, 0.034), (10, 0.299), (20, 0.027), (22, 0.079), (44, 0.108), (57, 0.071), (65, 0.048), (69, 0.017), (90, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.76073676 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.66103411 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
3 0.57180744 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
4 0.56612337 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated. 1
5 0.56585342 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
6 0.56581002 65 nips-2006-Denoising and Dimension Reduction in Feature Space
7 0.56545782 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.56115723 117 nips-2006-Learning on Graph with Laplacian Regularization
9 0.56008953 21 nips-2006-AdaBoost is Consistent
10 0.55988973 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
11 0.55968708 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
12 0.55668622 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
13 0.55600756 5 nips-2006-A Kernel Method for the Two-Sample-Problem
14 0.5555231 159 nips-2006-Parameter Expanded Variational Bayesian Methods
15 0.5545758 169 nips-2006-Relational Learning with Gaussian Processes
16 0.55394995 121 nips-2006-Learning to be Bayesian without Supervision
17 0.55346328 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
18 0.55309349 139 nips-2006-Multi-dynamic Bayesian Networks
19 0.55271429 98 nips-2006-Inferring Network Structure from Co-Occurrences
20 0.55269098 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions