nips nips2006 nips2006-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q ... E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . (11) k=1 We then have the following theorem. Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). Consequently, we have the following theorem. Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k be defined by Equations 6 and 11 respectively. Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i.e., small expected margin) and a small Var (x,y) WQ (x, y) (i.e., small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . . . , m} of the training examples. After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). Hence, for data sets having n attributes, we have exactly n |H| = 2 i=1 2[k(i) − 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments. From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). We have chosen a uniform prior P (h) = 1/|H| ∀h ∈ H. We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. 4.1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far.3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. Interestingly however, the risk bound curves remain parallel to the true risk curves. 4.2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. 3 These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vskp, USvotes, mushroom, and sonar. 0.6 0.4 0.5 0.3 0.4 0.2 0.1 EQ bound EQ on test 0.3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. 0.5 0.8 0.7 0.4 0.6 0.5 0.3 0.4 β=1 β=2 β=3 β=4 MV error on test 0.2 0.1 β=1 β=2 β=3 β=4 MV error on test 0.3 0.2 0.1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. Since the Taylor series expansion for tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2. Under these circumstances, we have c = k = tan(β) 1 . cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk 0.6 0.4 0.5 0.4 0.3 0.2 0.1 TQ bound TQ on test 0.3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. Acknowledgments Work supported by NSERC Discovery grants 262067 and 122405. References [1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999. [2] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002. [3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. [4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005. [5] Francois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed ¸ Gibbs classifiers. Proceedings of the 22nth International Conference on Machine Learning (ICML 2005), pages 481–488, 2005. [6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423– 430. MIT Press, Cambridge, MA, 2003. [7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997. [9] Robert E. Schapire and Yoram Singer. Improved bosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651– 1686, 1998.
Reference: text
sentIndex sentText sentNum sentScore
1 A PAC-Bayes Risk Bound for General Loss Functions Pascal Germain D´ partement IFT-GLO e Universit´ Laval e Qu´ bec, Canada e Pascal. [sent-1, score-0.069]
2 ca Alexandre Lacasse D´ partement IFT-GLO e Universit´ Laval e Qu´ bec, Canada e Alexandre. [sent-4, score-0.069]
3 ca Francois Laviolette ¸ D´ partement IFT-GLO e Universit´ Laval e Qu´ bec, Canada e Francois. [sent-7, score-0.069]
4 ca Mario Marchand D´ partement IFT-GLO e Universit´ Laval e Qu´ bec, Canada e Mario. [sent-10, score-0.069]
5 ca Abstract We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). [sent-13, score-0.917]
6 Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. [sent-14, score-0.381]
7 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. [sent-15, score-0.398]
8 A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. [sent-17, score-0.585]
9 In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. [sent-18, score-0.43]
10 Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. [sent-19, score-0.634]
11 Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. [sent-20, score-1.055]
12 While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. [sent-21, score-1.034]
13 Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. [sent-22, score-0.399]
14 Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. [sent-24, score-1.005]
15 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. [sent-25, score-0.036]
16 We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. [sent-28, score-0.129]
17 Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. [sent-29, score-0.036]
18 For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . [sent-30, score-0.022]
19 Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. [sent-32, score-0.148]
20 Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. [sent-33, score-0.029]
21 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. [sent-35, score-0.133]
22 In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . [sent-36, score-0.542]
23 Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. [sent-37, score-0.636]
24 Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. [sent-38, score-0.843]
25 The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). [sent-39, score-0.73]
26 More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. [sent-40, score-0.596]
27 , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . [sent-44, score-0.411]
28 (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . [sent-45, score-0.051]
29 In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). [sent-46, score-0.139]
30 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. [sent-48, score-0.263]
31 Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. [sent-50, score-0.641]
32 For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . [sent-51, score-0.069]
33 ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). [sent-52, score-0.174]
34 We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. [sent-53, score-0.224]
35 If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . [sent-54, score-0.069]
36 (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. [sent-55, score-0.216]
37 Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . [sent-56, score-0.392]
38 The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. [sent-59, score-0.245]
39 We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. [sent-60, score-0.392]
40 Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . [sent-63, score-0.286]
41 Therefore, within this approach, the bound for ζQ can be tight only for small values of c. [sent-64, score-0.249]
42 Note however that loss functions having a small value of c are commonly used in practice. [sent-65, score-0.198]
43 Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. [sent-66, score-0.234]
44 Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . [sent-68, score-0.345]
45 This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. [sent-70, score-0.603]
46 Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). [sent-71, score-0.52]
47 The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . [sent-72, score-0.639]
48 In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. [sent-73, score-0.286]
49 To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. [sent-74, score-0.021]
50 Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). [sent-76, score-0.021]
51 In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q . [sent-77, score-0.117]
52 E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . [sent-80, score-0.499]
53 Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. [sent-82, score-0.174]
54 Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. [sent-83, score-0.428]
55 Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). [sent-84, score-0.119]
56 Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. [sent-86, score-0.198]
57 Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. [sent-87, score-0.244]
58 Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. [sent-89, score-0.306]
59 Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i. [sent-90, score-0.679]
60 , small expected margin) and a small Var (x,y) WQ (x, y) (i. [sent-92, score-0.024]
61 Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. [sent-95, score-0.87]
62 We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . [sent-96, score-0.394]
63 After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . [sent-100, score-0.328]
64 t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. [sent-101, score-0.603]
65 Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. [sent-102, score-0.081]
66 For each decision stump h ∈ H, its boolean complement is also in H. [sent-103, score-0.053]
67 Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). [sent-104, score-0.076]
68 Data sets having continuous-valued attributes have been discretized in our numerical experiments. [sent-106, score-0.04]
69 From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). [sent-107, score-0.143]
70 We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . [sent-109, score-0.547]
71 KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . [sent-110, score-0.35]
72 This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. [sent-111, score-0.143]
73 Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. [sent-112, score-0.135]
74 We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. [sent-113, score-0.679]
75 1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . [sent-115, score-0.903]
76 More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . [sent-116, score-0.196]
77 EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . [sent-117, score-0.198]
78 k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . [sent-118, score-0.245]
79 Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. [sent-119, score-0.106]
80 Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. [sent-121, score-0.049]
81 Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. [sent-122, score-0.47]
82 We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. [sent-123, score-0.804]
83 Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. [sent-124, score-0.863]
84 3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. [sent-126, score-0.388]
85 Interestingly however, the risk bound curves remain parallel to the true risk curves. [sent-127, score-0.682]
86 2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . [sent-129, score-0.655]
87 TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. [sent-130, score-0.051]
88 The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. [sent-131, score-0.424]
89 3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0. [sent-141, score-0.407]
90 2 E(WQ ) on test MV error on test Var(WQ ) on test 0. [sent-142, score-0.205]
91 The risk bound and the true risk were computed for β = ln 2. [sent-144, score-0.779]
92 1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. [sent-158, score-0.862]
93 cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. [sent-161, score-1.029]
94 Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. [sent-162, score-0.594]
95 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). [sent-163, score-1.71]
96 3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0. [sent-172, score-0.407]
97 2 E(WQ ) on test MV error on test Var(WQ ) on test 0. [sent-173, score-0.205]
98 The risk bound and the true risk were computed for β = ln 2. [sent-175, score-0.779]
99 bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . [sent-176, score-0.762]
100 However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. [sent-177, score-0.905]
wordName wordTfidf (topN-words)
[('wq', 0.429), ('gq', 0.395), ('risk', 0.245), ('fq', 0.223), ('vote', 0.209), ('loss', 0.198), ('def', 0.196), ('sigmoid', 0.183), ('gibbs', 0.153), ('majority', 0.149), ('bound', 0.143), ('tq', 0.137), ('kl', 0.132), ('ers', 0.132), ('classi', 0.132), ('mv', 0.125), ('qk', 0.124), ('ln', 0.117), ('rs', 0.116), ('boosting', 0.114), ('tight', 0.106), ('yh', 0.103), ('eq', 0.101), ('exponential', 0.1), ('er', 0.091), ('var', 0.089), ('adaboost', 0.087), ('mushroom', 0.078), ('tanh', 0.078), ('hk', 0.069), ('laval', 0.069), ('partement', 0.069), ('bec', 0.063), ('sgn', 0.063), ('dt', 0.062), ('test', 0.059), ('round', 0.057), ('qu', 0.055), ('sonar', 0.051), ('upper', 0.051), ('bounds', 0.047), ('universit', 0.046), ('theorem', 0.041), ('ht', 0.04), ('yfq', 0.039), ('equation', 0.039), ('canada', 0.039), ('taylor', 0.038), ('binary', 0.036), ('convex', 0.035), ('francois', 0.034), ('laviolette', 0.034), ('mario', 0.034), ('stump', 0.034), ('behaves', 0.032), ('bagging', 0.031), ('stumps', 0.029), ('yoav', 0.029), ('behavior', 0.029), ('true', 0.029), ('attribute', 0.028), ('margin', 0.028), ('error', 0.028), ('testing', 0.026), ('curve', 0.026), ('robert', 0.026), ('twice', 0.025), ('err', 0.024), ('expected', 0.024), ('training', 0.023), ('indeed', 0.023), ('numerical', 0.022), ('empirical', 0.022), ('posterior', 0.022), ('incorrectly', 0.022), ('combination', 0.022), ('prior', 0.021), ('apart', 0.021), ('relate', 0.021), ('valued', 0.021), ('freund', 0.021), ('combinations', 0.021), ('hi', 0.02), ('schapire', 0.02), ('zt', 0.02), ('parallel', 0.02), ('yi', 0.019), ('uci', 0.019), ('decision', 0.019), ('consequently', 0.019), ('hence', 0.019), ('relates', 0.018), ('normally', 0.018), ('stochastic', 0.018), ('precisely', 0.018), ('attributes', 0.018), ('john', 0.018), ('germain', 0.017), ('lacasse', 0.017), ('marchand', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q ... E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . (11) k=1 We then have the following theorem. Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). Consequently, we have the following theorem. Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k be defined by Equations 6 and 11 respectively. Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i.e., small expected margin) and a small Var (x,y) WQ (x, y) (i.e., small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . . . , m} of the training examples. After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). Hence, for data sets having n attributes, we have exactly n |H| = 2 i=1 2[k(i) − 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments. From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). We have chosen a uniform prior P (h) = 1/|H| ∀h ∈ H. We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. 4.1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far.3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. Interestingly however, the risk bound curves remain parallel to the true risk curves. 4.2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. 3 These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vskp, USvotes, mushroom, and sonar. 0.6 0.4 0.5 0.3 0.4 0.2 0.1 EQ bound EQ on test 0.3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. 0.5 0.8 0.7 0.4 0.6 0.5 0.3 0.4 β=1 β=2 β=3 β=4 MV error on test 0.2 0.1 β=1 β=2 β=3 β=4 MV error on test 0.3 0.2 0.1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. Since the Taylor series expansion for tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2. Under these circumstances, we have c = k = tan(β) 1 . cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk 0.6 0.4 0.5 0.4 0.3 0.2 0.1 TQ bound TQ on test 0.3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. Acknowledgments Work supported by NSERC Discovery grants 262067 and 122405. References [1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999. [2] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002. [3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. [4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005. [5] Francois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed ¸ Gibbs classifiers. Proceedings of the 22nth International Conference on Machine Learning (ICML 2005), pages 481–488, 2005. [6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423– 430. MIT Press, Cambridge, MA, 2003. [7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997. [9] Robert E. Schapire and Yoram Singer. Improved bosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651– 1686, 1998.
2 0.64252555 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
3 0.16194411 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
4 0.16161604 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
5 0.12184228 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
6 0.10681783 50 nips-2006-Chained Boosting
8 0.093938231 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
9 0.08716017 33 nips-2006-Analysis of Representations for Domain Adaptation
10 0.073991917 61 nips-2006-Convex Repeated Games and Fenchel Duality
11 0.066714592 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
12 0.065985449 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.065476209 181 nips-2006-Stability of $K$-Means Clustering
14 0.065264061 109 nips-2006-Learnability and the doubling dimension
15 0.064845681 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
16 0.060418587 186 nips-2006-Support Vector Machines on a Budget
17 0.059158724 150 nips-2006-On Transductive Regression
18 0.057202954 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
19 0.053251483 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
20 0.052394483 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
topicId topicWeight
[(0, -0.179), (1, 0.082), (2, -0.142), (3, -0.078), (4, -0.117), (5, 0.463), (6, -0.106), (7, 0.134), (8, -0.014), (9, 0.15), (10, 0.336), (11, 0.247), (12, -0.195), (13, 0.065), (14, -0.112), (15, -0.201), (16, -0.191), (17, 0.058), (18, 0.036), (19, 0.176), (20, 0.071), (21, -0.173), (22, 0.013), (23, -0.038), (24, 0.074), (25, 0.107), (26, -0.032), (27, -0.159), (28, -0.014), (29, -0.008), (30, -0.052), (31, 0.022), (32, -0.044), (33, -0.087), (34, -0.001), (35, 0.015), (36, 0.034), (37, 0.016), (38, 0.057), (39, 0.061), (40, 0.043), (41, 0.026), (42, -0.033), (43, 0.015), (44, -0.03), (45, 0.029), (46, -0.001), (47, -0.042), (48, -0.003), (49, -0.01)]
simIndex simValue paperId paperTitle
1 0.96642512 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
same-paper 2 0.93880647 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q ... E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . (11) k=1 We then have the following theorem. Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). Consequently, we have the following theorem. Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k be defined by Equations 6 and 11 respectively. Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i.e., small expected margin) and a small Var (x,y) WQ (x, y) (i.e., small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . . . , m} of the training examples. After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). Hence, for data sets having n attributes, we have exactly n |H| = 2 i=1 2[k(i) − 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments. From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). We have chosen a uniform prior P (h) = 1/|H| ∀h ∈ H. We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. 4.1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far.3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. Interestingly however, the risk bound curves remain parallel to the true risk curves. 4.2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. 3 These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vskp, USvotes, mushroom, and sonar. 0.6 0.4 0.5 0.3 0.4 0.2 0.1 EQ bound EQ on test 0.3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. 0.5 0.8 0.7 0.4 0.6 0.5 0.3 0.4 β=1 β=2 β=3 β=4 MV error on test 0.2 0.1 β=1 β=2 β=3 β=4 MV error on test 0.3 0.2 0.1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. Since the Taylor series expansion for tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2. Under these circumstances, we have c = k = tan(β) 1 . cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk 0.6 0.4 0.5 0.4 0.3 0.2 0.1 TQ bound TQ on test 0.3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. Acknowledgments Work supported by NSERC Discovery grants 262067 and 122405. References [1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999. [2] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002. [3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. [4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005. [5] Francois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed ¸ Gibbs classifiers. Proceedings of the 22nth International Conference on Machine Learning (ICML 2005), pages 481–488, 2005. [6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423– 430. MIT Press, Cambridge, MA, 2003. [7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997. [9] Robert E. Schapire and Yoram Singer. Improved bosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651– 1686, 1998.
3 0.55747694 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
4 0.39253229 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
5 0.35765529 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
6 0.30185002 50 nips-2006-Chained Boosting
7 0.29953286 156 nips-2006-Ordinal Regression by Extended Binary Classification
8 0.2896443 109 nips-2006-Learnability and the doubling dimension
9 0.258654 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
10 0.25268131 33 nips-2006-Analysis of Representations for Domain Adaptation
11 0.25005129 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
12 0.23488018 150 nips-2006-On Transductive Regression
13 0.21687225 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
14 0.19689175 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
15 0.19390038 146 nips-2006-No-regret Algorithms for Online Convex Programs
16 0.19251195 155 nips-2006-Optimal Single-Class Classification Strategies
17 0.18653995 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
18 0.18440299 163 nips-2006-Prediction on a Graph with a Perceptron
19 0.18266612 186 nips-2006-Support Vector Machines on a Budget
20 0.18140349 61 nips-2006-Convex Repeated Games and Fenchel Duality
topicId topicWeight
[(1, 0.082), (7, 0.051), (9, 0.025), (20, 0.011), (22, 0.123), (44, 0.111), (57, 0.09), (58, 0.025), (62, 0.321), (65, 0.038), (69, 0.019)]
simIndex simValue paperId paperTitle
1 0.80868089 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
Author: Joseph M. Kimmel, Richard M. Salter, Peter J. Thomas
Abstract: Chemical reaction networks by which individual cells gather and process information about their chemical environments have been dubbed “signal transduction” networks. Despite this suggestive terminology, there have been few attempts to analyze chemical signaling systems with the quantitative tools of information theory. Gradient sensing in the social amoeba Dictyostelium discoideum is a well characterized signal transduction system in which a cell estimates the direction of a source of diffusing chemoattractant molecules based on the spatiotemporal sequence of ligand-receptor binding events at the cell membrane. Using Monte Carlo techniques (MCell) we construct a simulation in which a collection of individual ligand particles undergoing Brownian diffusion in a three-dimensional volume interact with receptors on the surface of a static amoeboid cell. Adapting a method for estimation of spike train entropies described by Victor (originally due to Kozachenko and Leonenko), we estimate lower bounds on the mutual information between the transmitted signal (direction of ligand source) and the received signal (spatiotemporal pattern of receptor binding/unbinding events). Hence we provide a quantitative framework for addressing the question: how much could the cell know, and when could it know it? We show that the time course of the mutual information between the cell’s surface receptors and the (unknown) gradient direction is consistent with experimentally measured cellular response times. We find that the acquisition of directional information depends strongly on the time constant at which the intracellular response is filtered. 1 Introduction: gradient sensing in eukaryotes Biochemical signal transduction networks provide the computational machinery by which neurons, amoebae or other single cells sense and react to their chemical environments. The precision of this chemical sensing is limited by fluctuations inherent in reaction and diffusion processes involving a ∗ Current address: Computational Neuroscience Graduate Program, The University of Chicago. Oberlin Center for Computation and Modeling, http://occam.oberlin.edu/. ‡ To whom correspondence should be addressed. http://www.case.edu/artsci/math/thomas/thomas.html; Oberlin College Research Associate. † finite quantity of molecules [1, 2]. The theory of communication provides a framework that makes explicit the noise dependence of chemical signaling. For example, in any reaction A + B → C, we may view the time varying reactant concentrations A(t) and B(t) as input signals to a noisy channel, and the product concentration C(t) as an output signal carrying information about A(t) and B(t). In the present study we show that the mutual information between the (known) state of the cell’s surface receptors and the (unknown) gradient direction follows a time course consistent with experimentally measured cellular response times, reinforcing earlier claims that information theory can play a role in understanding biochemical cellular communication [3, 4]. Dictyostelium is a soil dwelling amoeba that aggregates into a multicellular form in order to survive conditions of drought or starvation. During aggregation individual amoebae perform chemotaxis, or chemically guided movement, towards sources of the signaling molecule cAMP, secreted by nearby amoebae. Quantitive studies have shown that Dictyostelium amoebae can sense shallow, static gradients of cAMP over long time scales (∼30 minutes), and that gradient steepness plays a crucial role in guiding cells [5]. The chemotactic efficiency (CE), the population average of the cosine between the cell displacement directions and the true gradient direction, peaks at a cAMP concentration of 25 nanoMolar, similar to the equilibrium constant for the cAMP receptor (the Keq is the concentration of cAMP at which the receptor has a 50% chance of being bound or unbound, respectively). For smaller or larger concentrations the CE dropped rapidly. Nevertheless over long times cells were able (on average) to detect gradients as small as 2% change in [cAMP] per cell length. At an early stage of development when the pattern of chemotactic centers and spirals is still forming, individual amoebae presumably experience an inchoate barrage of weak, noisy and conflicting directional signals. When cAMP binds receptors on a cell’s surface, second messengers trigger a chain of subsequent intracellular events including a rapid spatial reorganization of proteins involved in cell motility. Advances in fluorescence microscopy have revealed that the oriented subcellular response to cAMP stimulation is already well underway within two seconds [6, 7]. In order to understand the fundamental limits to communication in this cell signaling process we abstract the problem faced by a cell to that of rapidly identifying the direction of origin of a stimulus gradient superimposed on an existing mean background concentration. We model gradient sensing as an information channel in which an input signal – the direction of a chemical source – is noisily transmitted via a gradient of diffusing signaling molecules; and the “received signal” is the spatiotemporal pattern of binding events between cAMP and the cAMP receptors [8]. We neglect downstream intracellular events, which cannot increase the mutual information between the state of the cell and the direction of the imposed extracellular gradient [9]. The analysis of any signal transmission system depends on precise representation of the noise corrupting transmitted signals. We develop a Monte Carlo simulation (MCell, [10, 11]) in which a simulated cell is exposed to a cAMP distribution that evolves from a uniform background to a gradient at low (1 nMol) average concentration. The noise inherent in the communication of a diffusionmediated signal is accurately represented by this method. Our approach bridges both the transient and the steady state regimes and allows us to estimate the amount of stimulus-related information that is in principle available to the cell through its receptors as a function of time after stimulus initiation. Other efforts to address aspects of cell signaling using the conceptual tools of information theory have considered neurotransmitter release [3] and sensing temporal signals [4], but not gradient sensing in eukaryotic cells. A typical natural habitat for social amoebae such as Dictyostelium is the complex anisotropic threedimensional matrix of the forest floor. Under experimental conditions cells typically aggregate on a flat two-dimensional surface. We approach the problem of gradient sensing on a sphere, which is both harder and more natural for the ameoba, while still simple enough for us analytically and numerically. Directional data is naturally described using unit vectors in spherical coordinates, but the ameobae receive signals as binding events involving intramembrane protein complexes, so we have developed a method for projecting the ensemble of receptor bindings onto coordinates in R3 . In loose analogy with the chemotactic efficiency [5], we compare the projected directional estimate with the true gradient direction represented as a unit vector on S2 . Consistent with observed timing of the cell’s response to cAMP stimulation, we find that the directional signal converges quickly enough for the cell to make a decision about which direction to move within the first two seconds following stimulus onset. 2 Methods 2.1 Monte Carlo simulations Using MCell and DReAMM [10, 11] we construct a spherical cell (radius R = 7.5µm, [12]) centered in a cubic volume (side length L = 30µm). N = 980 triangular tiles partition the surface (mesh generated by DOME1 ); each contained one cell surface receptor for cAMP with binding rate k+ = 4.4 × 107 sec−1 M−1 , first-order cAMP unbinding rate k− = 1.1 sec−1 [12] and Keq = k− /k+ = 25nMol cAMP. We established a baseline concentration of approximately 1nMol by releasing a cAMP bolus at time 0 inside the cube with zero-flux boundary conditions imposed on each wall. At t = 2 seconds we introduced a steady flux at the x = −L/2 wall of 1 molecule of cAMP per square micron per msec, adding signaling molecules from the left. Simultaneously, the x = +L/2 wall of the cube assumes absorbing boundary conditions. The new boundary conditions lead (at equilibrium) to a linear gradient of 2 nMol/30µm, ranging from ≈ 2.0 nMol at the flux source wall to ≈ 0 nMol at the absorbing wall (see Figure 1); the concentration profile approaches this new steady state with time constant of approximately 1.25 msec. Sampling boxes centered along the planes x = ±13.5µm measured the local concentration, allowing us to validate the expected model behavior. Figure 1: Gradient sensing simulations performed with MCell (a Monte Carlo simulator of cellular microphysiology, http://www.mcell.cnl.salk.edu/) and rendered with DReAMM (Design, Render, and Animate MCell Models, http://www.mcell.psc.edu/). The model cell comprised a sphere triangulated with 980 tiles with one cAMP receptor per tile. Cell radius R = 7.5µm; cube side L = 30µm. Left: Initial equilibrium condition, before imposition of gradient. [cAMP] ≈ 1nMol (c. 15,000 molecules in the volume outside the sphere). Right: Gradient condition after transient (c. 15,000 molecules; see Methods for details). 2.2 Analysis 2.2.1 Assumptions We make the following assumptions to simplify the analysis of the distribution of receptor activities at equilibrium, whether pre- or post-stimulus onset: 1. Independence. At equilibrium, the state of each receptor (bound vs unbound) is independent of the states of the other receptors. 2. Linear Gradient. At equilibrium under the imposed gradient condition, the concentration of ligand molecule varies linearly with position along the gradient axis. 3. Symmetry. 1 http://nwg.phy.bnl.gov/∼bviren/uno/other/ (a) Rotational equivariance of receptor activities. In the absence of an applied gradient signal, the probability distribution describing the receptor states is equivariant with respect to arbitrary rotations of the sphere. (b) Rotational invariance of gradient direction. The imposed gradient seen by a model cell is equally likely to be coming from any direction; therefore the gradient direction vector is uniformly distributed over S2 . (c) Axial equivariance about the gradient direction. Once a gradient direction is imposed, the probability distribution describing receptor states is rotationally equivariant with respect to rotations about the axis parallel with the gradient. Berg and Purcell [1] calculate the inaccuracy in concentration estimates due to nonindependence of adjacent receptors; for our parameters (effective receptor radius = 5nm, receptor spacing ∼ 1µm) the fractional error in estimating concentration differences due to receptor nonindependence is negligible ( 10−11 ) [1, 2]. Because we fix receptors to be in 1:1 correspondence with surface tiles, spherical symmetry and uniform distribution of the receptors are only approximate. The gradient signal communicated via diffusion does not involve sharp spatial changes on the scale of the distance between nearby receptors, therefore spherical symmetry and uniform identical receptor distribution are good analytic approximations of the model configuration. By rotational equivariance we mean that combining any rotation of the sphere with a corresponding rotation of the indices labeling the N receptors, {j = 1, · · · , N }, leads to a statistically indistinguishable distribution of receptor activities. This same spherical symmetry is reflected in the a priori distribution of gradient directions, which is uniform over the sphere (with density 1/4π). Spherical symmetry is broken by the gradient signal, which fixes a preferred direction in space. About this axis however, we assume the system retains the rotational symmetry of the cylinder. 2.2.2 Mutual information of the receptors In order to quantify the directional information available to the cell from its surface receptors we construct an explicit model for the receptor states and the cell’s estimated direction. We model the receptor states via a collection of random variables {Bj } and develop an expression for the entropy of {Bj }. Then in section 2.2.3 we present a method for projecting a temporally filtered estimated direction, g , into three (rather than N ) dimensions. ˆ N Let the random variables {Bj }j=1 represent the states of the N cAMP receptors on the cell surface; Bj = 1 if the receptor is bound to a molecule of cAMP, otherwise Bj = 0. Let xj ∈ S2 represent the direction from the center of the center of the cell to the j th receptor. Invoking assumption 2 above, we take the equilibrium concentration of cAMP at x to be c(x|g) = a + b(x · g) where g ∈ S2 is a unit vector in the direction of the gradient. The parameter a is the mean concentration over the cell surface, and b = R| c| is half the drop in concentration from one extreme on the cell surface to the other. Before the stimulus begins, the gradient direction is undefined. It can be shown (see Supplemental Materials) that the entropy of receptor states given a fixed gradient direction g, H[{Bj }|g], is given by an integral over the sphere: π 2π a + b cos(θ) sin(θ) dφ dθ (as N → ∞). (1) a + b cos(θ) + Keq 4π θ=0 φ=0 On the other hand, if the gradient direction remains unspecified, the entropy of receptor states is given by H[{Bj }|g] ∼ N Φ π 2π θ=0 φ=0 H[{Bj }] ∼ N Φ a + b cos(θ) a + b cos(θ) + Keq sin(θ) dφ dθ (as N → ∞), 4π − (p log2 (p) + (1 − p) log2 (1 − p)) , 0 < p < 1 0, p = 0 or 1 binary random variable with state probabilities p and (1 − p). where Φ[p] = (2) denotes the entropy for a In both equations (1) and (2), the argument of Φ is a probability taking values 0 ≤ p ≤ 1. In (1) the values of Φ are averaged over the sphere; in (2) Φ is evaluated after averaging probabilities. Because Φ[p] is convex for 0 ≤ p ≤ 1, the integral in equation 1 cannot exceed that in equation 2. Therefore the mutual information upon receiving the signal is nonnegative (as expected): ∆ M I[{Bj }; g] = H[{Bj }] − H[{Bj }|g] ≥ 0. The analytic solution for equation (1) involves the polylogarithm function. For the parameters shown in the simulation (a = 1.078 nMol, b = .512 nMol, Keq = 25 nMol), the mutual information with 980 receptors is 2.16 bits. As one would expect, the mutual information peaks when the mean concentration is close to the Keq of the receptor, exceeding 16 bits when a = 25, b = 12.5 and Keq = 25 (nMol). 2.2.3 Dimension reduction The estimate obtained above does not give tell us how quickly the directional information available to the cell evolves over time. Direct estimate of the mutual information from stochastic simulations is impractical because the aggregate random variables occupy a 980 dimensional space that a limited number of simulation runs cannot sample adequately. Instead, we construct a deterministic function from the set of 980 time courses of the receptors, {Bj (t)}, to an aggregate directional estimate in R3 . Because of the cylindrical symmetry inherent in the system, our directional estimator g is ˆ an unbiased estimator of the true gradient direction g. The estimator g (t) may be thought of as ˆ representing a downstream chemical process that accumulates directional information and decays with some time constant τ . Let {xj }N be the spatial locations of the N receptors on the cell’s j=1 surface. Each vector is associated with a weight wj . Whenever the j th receptor binds a cAMP molecule, wj is incremented by one; otherwise wj decays with time constant τ . We construct an instantaneous estimate of the gradient direction from the linear combination of receptor positions, N gτ (t) = j=1 wj (t)xj . This procedure reflects the accumulation and reabsorption of intracellular ˆ second messengers released from the cell membrane upon receptor binding. Before the stimulus is applied, the weighted directional estimates gτ are small in absolute magniˆ tude, with direction uniformly distributed on S2 . In order to determine the information gained as the estimate vector evolves after stimulus application, we wish to determine the change in entropy in an ensemble of such estimates. As the cell gains information about the direction of the gradient signal from its receptors, the entropy of the estimate should decrease, leading to a rise in mutual information. By repeating multiple runs (M = 600) of the simulation we obtain samples from the ensemble of direction estimates, given a particular stimulus direction, g. In the method of Kozachenko and Leonenko [13], adapted for the analysis of neural spike train data by Victor [14] (“KLV method”), the cumulative distribution function is approximated directly from the observed samples, and the entropy is estimated via a change of variables transformation (see below). This method may be formulated in vector spaces Rd for d > 1 ([13]), but it is not guaranteed to be unbiased in the multivariate case [15] and has not been extended to curved manifolds such as the sphere. In the present case, however, we may exploit the symmetries inherent in the model (Assumptions 3a-3c) to reduce the empirical entropy estimation problem to one dimension. Adapting the argument in [14] to the case of spherical data from a distribution with rotational symmetry about a given axis, we obtain an estimate of the entropy based on a series of observations of the angles {θ1 , · · · , θM } between the estimates gτ and the true gradient direction g (for details, see ˆ Supplemental Materials): 1 H∼ M M log2 (λk ) + log2 (2(M − 1)) + k=1 γ + log2 (2π) + log2 (sin(θk )) loge (2) (3) ∆ (as M → ∞) where after sorting the θk in monotonic order, λk = min(|θk − θk±1 |) is the distance between each angle and its nearest neighbor in the sample, and γ is the Euler-Mascheroni constant. As shown in Figure 2, this approximation agrees with the analytic result for the uniform distribution, Hunif = log2 (4π) ≈ 3.651. 3 Results Figure 3 shows the results of M = 600 simulation runs. Panel A shows the concentration averaged across a set of 1µm3 sample boxes, four in the x = −13.5µm plane and four in the x = +13.5µm Figure 2: Monte Carlo simulation results and information analysis. A: Average concentration profiles along two planes perpendicular to the gradient, at x = ±13.5µm. B: Estimated direction vector (x, y, and z components; x = dark blue trace) gτ , τ = 500 msec. C: Entropy of the ensemble of diˆ rectional vector estimates for different values of the intracellular filtering time constant τ . Given the directions of the estimates θk , φk on each of M runs, we calculate the entropy of the ensemble using equation (3). All time constants yield uniformly distributed directional estimates in the pre-stimulus period, 0 ≤ t ≤ 2 (sec). After stimulus onset, directional estimates obtained with shorter time constants respond more quickly but achieve smaller gains in mutual information (smaller reductions in entropy). Filtering time constants τ range from lightest to darkest colors: 20, 50, 100, 200, 500, 1000, 2000 msec. plane. The initial bolus of cAMP released into the volume at t = 0 sec is not uniformly distributed, but spreads out evenly within 0.25 sec. At t = 2.0 sec the boundary conditions are changed, causing a gradient to emerge along a realistic time course. Consistent with the analytic solution for the mean concentration (not shown), the concentration approaches equilibrium more rapidly near the absorbing wall (descending trace) than at the imposed flux wall (ascending trace). Panel B shows the evolution of a directional estimate vector gτ for a single run, with τ = 500 ˆ msec. During uniform conditions all vectors fluctuate near the origin. After gradient onset the variance increases and the x component (dark trace) becomes biased towards the gradient source (g = [−1, 0, 0]) while the y and z components still have a mean of zero. Across all 600 runs the mean of the y and z components remains close to zero, while the mean of the x component systematically departs from zero shortly after stimulus onset (not shown). Hence the directional estimator is unbiased (as required by symmetry). See Supplemental Materials for the population average of g . ˆ Panel C shows the time course of the entropy of the ensemble of normalized directional estimate vectors gτ /|ˆτ | over M = 600 simulations, for intracellular filtering time constants ranging from 20 ˆ g msec to 2000 msec (light to dark shading), calculated using equation (3). Following stimulus onset, entropy decreases steadily, showing an increase in information available to the amoeba about the direction of the stimulus; the mutual information at a given point in time is the difference between the entropy at that time and before stimulus onset. For a cell with roughly 1000 receptors the mutual information has increased at most by ∼ 2 bits of information by one second (for τ = 500 msec), and at most by ∼ 3 bits of information by two seconds (for τ =1000 or 2000 msec), under our stimulation protocol. A one bit reduction in uncertainty is equivalent to identifying the correct value of the x component (positive versus negative) when the stimulus direction is aligned along the x-axis. Alternatively, note that a one bit reduction results in going from the uniform distribution on the sphere to the uniform distribution on one hemisphere. For τ ≤ 100 msec, the weighted average with decay time τ never gains more than one bit of information about the stimulus direction, even at long times. This observation suggestions that signaling must involve some chemical components with lifetimes longer than 100 msec. The τ = 200 msec filter saturates after about one second, at ∼ 1 bit of information gain. Longer lived second messengers would respond more slowly to changes from the background stimulus distribution, but would provide better more informative estimates over time. The τ = 500 msec estimate gains roughly two bits of information within 1.5 seconds, but not much more over time. Heuristically, we may think of a two bit gain in information as corresponding to the change from a uniform distribution to one covering uniformly covering one quarter of S2 , i.e. all points within π/3 of the true direction. Within two seconds the τ = 1000 msec and τ = 2000 msec weighted averages have each gained approximately three bits of information, equivalent to a uniform distribution covering all points with 0.23π or 41o of the true direction. 4 Discussion & conclusions Clearly there is an opportunity for more precise control of experimental conditions to deepen our understanding of spatio-temporal information processing at the membranes of gradient-sensitive cells. Efforts in this direction are now using microfluidic technology to create carefully regulated spatial profiles for probing cellular responses [16]. Our results suggest that molecular processes relevant to these responses must have lasting effects ≥ 100 msec. We use a static, immobile cell. Could cell motion relative to the medium increase sensitivity to changes in the gradient? No: the Dictyostelium velocity required to affect concentration perception is on order 1cm sec−1 [1], whereas reported velocities are on the order µm sec−1 [5]. The chemotactic response mechanism is known to begin modifying the cell membrane on the edge facing up the gradient within two seconds after stimulus initiation [7, 6], suggesting that the cell strikes a balance between gathering data and deciding quickly. Indeed, our results show that the reported activation of the G-protein signaling system on the leading edge of a chemotactically responsive cell [7] rises at roughly the same rate as the available chemotactic information. Results such as these ([7, 6]) are obtained by introducing a pipette into the medium near the amoeba; the magnitude and time course of cAMP release are not precisely known, and when estimated the cAMP concentration at the cell surface is over 25 nMol by a full order of magnitude. Thomson and Kristan [17] show that for discrete probability distributions and for continuous distributions over linear spaces, stimulus discriminability may be better quantified using ideal observer analysis (mean squared error, for continuous variables) than information theory. The machinery of mean squared error (variance, expectation) do not carry over to the case of directional data without fundamental modifications [18]; in particular the notion of mean squared error is best represented by the mean resultant length 0 ≤ ρ ≤ 1, the expected length of the vector average of a collection of unit vectors representing samples from directional data. A resultant with length ρ ≈ 1 corresponds to a highly focused probability density function on the sphere. In addition to measuring the mutual information between the gradient direction and an intracellular estimate of direction, we also calculated the time evolution of ρ (see Supplemental Materials.) We find that ρ rapidly approaches 1 and can exceed 0.9, depending on τ . We found that in this case at least the behavior of the mean resultant length and the mutual information are very similar; there is no evidence of discrepancies of the sort described in [17]. We have shown that the mutual information between an arbitrarily oriented stimulus and the directional signal available at the cell’s receptors evolves with a time course consistent with observed reaction times of Dictyostelium amoeba. Our results reinforce earlier claims that information theory can play a role in understanding biochemical cellular communication. Acknowledgments MCell simulations were run on the Oberlin College Beowulf Cluster, supported by NSF grant CHE0420717. References [1] Howard C. Berg and Edward M. Purcell. Physics of chemoreception. Biophysical Journal, 20:193, 1977. [2] William Bialek and Sima Setayeshgar. Physical limits to biochemical signaling. PNAS, 102(29):10040– 10045, July 19 2005. [3] S. Qazi, A. Beltukov, and B.A. Trimmer. Simulation modeling of ligand receptor interactions at nonequilibrium conditions: processing of noisy inputs by ionotropic receptors. Math Biosci., 187(1):93–110, Jan 2004. [4] D. J. Spencer, S. K. Hampton, P. Park, J. P. Zurkus, and P. J. Thomas. The diffusion-limited biochemical signal-relay channel. In S. Thrun, L. Saul, and B. Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16. MIT Press, Cambridge, MA, 2004. [5] P.R. Fisher, R. Merkl, and G. Gerisch. Quantitative analysis of cell motility and chemotaxis in Dictyostelium discoideum by using an image processing system and a novel chemotaxis chamber providing stationary chemical gradients. J. Cell Biology, 108:973–984, March 1989. [6] Carole A. Parent, Brenda J. Blacklock, Wendy M. Froehlich, Douglas B. Murphy, and Peter N. Devreotes. G protein signaling events are activated at the leading edge of chemotactic cells. Cell, 95:81–91, 2 October 1998. [7] Xuehua Xu, Martin Meier-Schellersheim, Xuanmao Jiao, Lauren E. Nelson, and Tian Jin. Quantitative imaging of single live cells reveals spatiotemporal dynamics of multistep signaling events of chemoattractant gradient sensing in dictyostelium. Molecular Biology of the Cell, 16:676–688, February 2005. [8] Jan Wouter-Rappel, Peter. J Thomas, Herbert Levine, and William F. Loomis. Establishing direction during chemotaxis in eukaryotic cells. Biophys. J., 83:1361–1367, 2002. [9] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, New York, 1990. [10] J. R. Stiles, D. Van Helden, T. M. Bartol, E.E. Salpeter, and M. M. Salpeter. Miniature endplate current rise times less than 100 microseconds from improved dual recordings can be modeled with passive acetylcholine diffusion from a synaptic vesicle. Proc. Natl. Acad. Sci. U.S.A., 93(12):5747–52, Jun 11 1996. [11] J. R. Stiles and T. M. Bartol. Computational Neuroscience: Realistic Modeling for Experimentalists, chapter Monte Carlo methods for realistic simulation of synaptic microphysiology using MCell, pages 87–127. CRC Press, Boca Raton, FL, 2001. [12] M. Ueda, Y. Sako, T. Tanaka, P. Devreotes, and T. Yanagida. Single-molecule analysis of chemotactic signaling in Dictyostelium cells. Science, 294:864–867, October 2001. [13] L.F. Kozachenko and N.N. Leonenko. Probl. Peredachi Inf. [Probl. Inf. Transm.], 23(9):95, 1987. [14] Jonathan D. Victor. Binless strategies for estimation of information from neural data. Physical Review E, 66:051903, Nov 11 2002. [15] Marc M. Van Hulle. Edgeworth approximation of multivariate differential entropy. Neural Computation, 17:1903–1910, 2005. [16] Loling Song, Sharvari M. Nadkarnia, Hendrik U. B¨ dekera, Carsten Beta, Albert Bae, Carl Franck, o Wouter-Jan Rappel, William F. Loomis, and Eberhard Bodenschatz. Dictyostelium discoideum chemotaxis: Threshold for directed motion. Euro. J. Cell Bio, 85(9-10):981–9, 2006. [17] Eric E. Thomson and William B. Kristan. Quantifying stimulus discriminability: A comparison of information theory and ideal observer analysis. Neural Computation, 17:741–778, 2005. [18] Kanti V. Mardia and Peter E. Jupp. Directional Statistics. John Wiley & Sons, West Sussex, England, 2000.
same-paper 2 0.79487163 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q ... E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . (11) k=1 We then have the following theorem. Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). Consequently, we have the following theorem. Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k be defined by Equations 6 and 11 respectively. Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i.e., small expected margin) and a small Var (x,y) WQ (x, y) (i.e., small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . . . , m} of the training examples. After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). Hence, for data sets having n attributes, we have exactly n |H| = 2 i=1 2[k(i) − 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments. From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). We have chosen a uniform prior P (h) = 1/|H| ∀h ∈ H. We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. 4.1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far.3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. Interestingly however, the risk bound curves remain parallel to the true risk curves. 4.2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. 3 These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vskp, USvotes, mushroom, and sonar. 0.6 0.4 0.5 0.3 0.4 0.2 0.1 EQ bound EQ on test 0.3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. 0.5 0.8 0.7 0.4 0.6 0.5 0.3 0.4 β=1 β=2 β=3 β=4 MV error on test 0.2 0.1 β=1 β=2 β=3 β=4 MV error on test 0.3 0.2 0.1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. Since the Taylor series expansion for tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2. Under these circumstances, we have c = k = tan(β) 1 . cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk 0.6 0.4 0.5 0.4 0.3 0.2 0.1 TQ bound TQ on test 0.3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. Acknowledgments Work supported by NSERC Discovery grants 262067 and 122405. References [1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999. [2] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002. [3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. [4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005. [5] Francois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed ¸ Gibbs classifiers. Proceedings of the 22nth International Conference on Machine Learning (ICML 2005), pages 481–488, 2005. [6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423– 430. MIT Press, Cambridge, MA, 2003. [7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997. [9] Robert E. Schapire and Yoram Singer. Improved bosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651– 1686, 1998.
3 0.71192694 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
4 0.51929426 61 nips-2006-Convex Repeated Games and Fenchel Duality
Author: Shai Shalev-shwartz, Yoram Singer
Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
5 0.51547557 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.51434839 175 nips-2006-Simplifying Mixture Models through Function Approximation
7 0.51185966 98 nips-2006-Inferring Network Structure from Co-Occurrences
8 0.51107967 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
9 0.51086962 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
10 0.5086776 21 nips-2006-AdaBoost is Consistent
11 0.50692952 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
12 0.50627118 139 nips-2006-Multi-dynamic Bayesian Networks
13 0.50624096 194 nips-2006-Towards a general independent subspace analysis
14 0.50513172 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
15 0.5041858 5 nips-2006-A Kernel Method for the Two-Sample-Problem
16 0.50233001 131 nips-2006-Mixture Regression for Covariate Shift
17 0.50151461 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
18 0.50063396 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
19 0.49961928 193 nips-2006-Tighter PAC-Bayes Bounds
20 0.49946573 159 nips-2006-Parameter Expanded Variational Bayesian Methods