nips nips2006 nips2006-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. [sent-12, score-0.245]
2 The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. [sent-13, score-0.493]
3 Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. [sent-14, score-0.424]
4 In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. [sent-15, score-0.324]
5 The danger of overfitting in such high-dimensional spaces is countered by maximising the margin of the classifier on the training examples. [sent-17, score-0.142]
6 For this reason there has been considerable interest in bounds on the generalisation in terms of the margin. [sent-18, score-0.107]
7 Early bounds have relied on covering number computations [7], while later bounds have considered Rademacher complexity. [sent-19, score-0.146]
8 The tightest bounds for practical applications appear to be the PAC-Bayes bounds [4, 5]. [sent-20, score-0.356]
9 In particular, the form given in [3] is especially attractive for margin classifiers such as the SVM. [sent-21, score-0.051]
10 PAC-Bayesian bounds have also been applied to other Machine Learning models such as Gaussian Processes [6]. [sent-22, score-0.073]
11 The aim of this paper is to consider a refinement of the PAC-Bayes approach and investigate whether it can improve on the original PAC-Bayes bound and uphold its capabilities of delivering reliable model selection. [sent-23, score-0.326]
12 The standard PAC-Bayes bound uses a Gaussian prior centred at the origin in weight space. [sent-24, score-0.556]
13 The key to the new bound is to use part of the training set to compute a more informative prior and then compute the bound on the remainder of the examples relative to this prior. [sent-25, score-0.8]
14 The bounds are tested experimentally on several classification tasks, including model selection, on common benchmark datasets. [sent-26, score-0.073]
15 Section 2 briefly reviews the PAC-Bayes bound for SVMs obtained in [3]. [sent-28, score-0.245]
16 The new bound obtained by means of the refinement of the prior is presented in Section 3. [sent-29, score-0.402]
17 The experimental work, included in Section 4, compares the tightness of the new bound with the original one and assesses its usability in a model selection task. [sent-30, score-0.355]
18 Let us consider a distribution D of patterns x lying in a certain input space X , with their corresponding output labels y, y ∈ {−1, 1}. [sent-33, score-0.076]
19 For these two quantities we can derive the PAC-Bayes Bound on the true error of the distribution of classifiers: Theorem 2. [sent-37, score-0.071]
20 1 (PAC-Bayes Bound) For all prior distributions $P(c)$ over the classifiers $c$, and for any $\delta \in (0, 1]$,
$$\Pr_{S \sim D^m}\left( \forall Q(c):\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) \le \frac{\mathrm{KL}(Q(c)\,\|\,P(c)) + \ln\frac{m+1}{\delta}}{m} \right) \ge 1 - \delta,$$
where KL is the Kullback-Leibler divergence, $\mathrm{KL}(q\,\|\,p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ and $\mathrm{KL}(Q(c)\,\|\,P(c)) = \mathbf{E}_{c \sim Q} \ln\frac{Q(c)}{P(c)}$. [sent-38, score-0.64]
21 The proof of the theorem can be found in [3]. [sent-39, score-0.067]
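Turning Theorem 2.1 into a numeric risk bound requires inverting the binary KL divergence in its second argument: given the empirical stochastic error and the right-hand side, one looks for the largest true error still consistent with them. Since $\mathrm{KL}(\hat{q}\,\|\,p)$ is increasing in $p$ for $p \ge \hat{q}$, a simple bisection suffices. The sketch below is only an illustration of that step; Python and the function names are my own choices, not part of the paper.

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli distributions, clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, bound):
    """Largest p in [q_hat, 1) with kl(q_hat||p) <= bound, found by bisection."""
    lo, hi = q_hat, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(q_hat, kl_qp, m, delta=0.05):
    """Upper bound on the true stochastic error Q_D implied by Theorem 2.1."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse(q_hat, rhs)
```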
22 This bound can be particularised for the case of linear classifiers in the following way. [sent-40, score-0.321]
23 For any vector w we can define a stochastic classifier in the following way: we choose the distribution Q = Q(w, µ) to be a spherical Gaussian with identity covariance matrix centred on the direction given by w at a distance µ from the origin. [sent-42, score-0.341]
24 Moreover, we can choose the prior P (c) to be a spherical Gaussian with identity covariance matrix centred on the origin. [sent-43, score-0.435]
25 2 (PAC-Bayes Bound for margin classifiers [3]) For all distributions D, for all classifiers given by $w$ and $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_S(w, \mu) \,\|\, Q_D(w, \mu)) \le \frac{\frac{\mu^2}{2} + \ln\frac{m+1}{\delta}}{m} \right) \ge 1 - \delta.$$ [sent-48, score-0.051]
26 It can be shown (see [3]) that
$$\hat{Q}_S(w, \mu) = \mathbf{E}_m\big[\tilde{F}(\mu\,\gamma(x, y))\big] \qquad (2)$$
where $\mathbf{E}_m$ is the average over the m training examples, $\gamma(x, y)$ is the normalised margin of the training patterns
$$\gamma(x, y) = \frac{y\,w^T\phi(x)}{\|\phi(x)\|\,\|w\|} \qquad (3)$$
and $\tilde{F} = 1 - F$, where $F$ is the cumulative normal distribution $F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$. [sent-49, score-0.347]
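Given the normalised margins of equation (3), the empirical stochastic error of equation (2) is just an average of Gaussian tail probabilities. A minimal sketch, assuming the margins have already been computed; the use of Python/SciPy is my own choice, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def stochastic_error(margins, mu):
    """Empirical stochastic error Q_S(w, mu) = E_m[ F_tilde(mu * gamma(x, y)) ].

    margins[i] is the normalised margin y_i w^T phi(x_i) / (||phi(x_i)|| ||w||)
    of equation (3); norm.sf is the Gaussian tail F_tilde = 1 - F of equation (2).
    """
    return float(np.mean(norm.sf(mu * np.asarray(margins))))
```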
27 The generalisation error of such a classifier can be bounded by at most twice the true (stochastic) error QD (w, µ) in Corollary 2. [sent-51, score-0.176]
28 3 Choosing a prior for the PAC-Bayes Bound Our first contribution is motivated by the fact that the PAC-Bayes bound allows us to choose the prior distribution, P (c). [sent-53, score-0.559]
29 In the standard application of the bound this is chosen to be a Gaussian centred at the origin. [sent-54, score-0.377]
30 We now consider learning a different prior based on training an SVM on a subset R of the training set comprising r training patterns and labels. [sent-55, score-0.506]
31 With these r examples we can determine an SVM classifier $w_r$, and form a prior $P(w|w_r)$ consisting of a Gaussian distribution with identity covariance matrix centred on $w_r$. [sent-57, score-1.293]
32 The introduction of this prior P (w|wr ) in Theorem 2. [sent-58, score-0.157]
33 1 (Single Prior based PAC-Bayes Bound for margin classifiers) Let us consider a prior on the distribution of classifiers consisting of a spherical Gaussian with identity covariance centred along the direction given by $w_r$ at a distance η from the origin. [sent-61, score-0.968]
34 Then, for all distributions D, for all classifiers $w_m$ and $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_{S\setminus R}(w_m, \mu) \,\|\, Q_D(w_m, \mu)) \le \frac{\frac{\|\eta w_r - \mu w_m\|^2}{2} + \ln\frac{m-r+1}{\delta}}{m-r} \right) \ge 1 - \delta$$
where $\hat{Q}_{S\setminus R}$ is a stochastic measure of the error of the classifier on the m − r samples not used to learn the prior. [sent-62, score-0.353]
35 This stochastic error is computed as indicated in equation (2) averaged over S\R. [sent-63, score-0.155]
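Compared with the standard bound, only two quantities change: the KL term between posterior and prior becomes $\|\eta w_r - \mu w_m\|^2/2$, and the sample size becomes m − r. Below is a sketch of how the bound of Theorem 3.1 could be assembled, under the assumption that both weight vectors are given as unit-length directions in the same representation; `kl_inverse` is the bisection from the earlier sketch, and none of the names come from the paper.

```python
import math
import numpy as np

def prior_pac_bayes_bound(w_r, w_m, eta, mu, q_hat_holdout, m_minus_r, delta=0.05):
    """Upper bound on Q_D(w_m, mu) implied by Theorem 3.1.

    w_r            -- unit-length direction of the prior, learnt on the r reserved examples
    w_m            -- unit-length direction of the SVM trained on the whole training set
    q_hat_holdout  -- empirical stochastic error (equation (2)) averaged over the
                      examples in S that are not in R
    """
    kl_term = 0.5 * np.sum((eta * np.asarray(w_r) - mu * np.asarray(w_m)) ** 2)
    rhs = (kl_term + math.log((m_minus_r + 1) / delta)) / m_minus_r
    return kl_inverse(q_hat_holdout, rhs)  # bisection on the binary KL, as sketched above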
36 Proof Since we separate r instances to learn the prior, the actual size of the training set to which we apply the bound is m − r. [sent-64, score-0.431]
37 In addition, the stochastic error must be computed only on the instances not used to learn the prior, i.e. on S \ R. [sent-65, score-0.182]
38 The remainder of the argument follows that of Theorem 2.2 when applied to the SVM weight vector on the whole training set. [sent-69, score-0.091]
39 It is perhaps worth stressing that the bound holds for all wm and so can be applied to the SVM trained on the whole set. [sent-70, score-0.5]
40 This might at first appear as ’cheating’, but the critical point is that the bound is evaluated on the set S\R not involved in generating the prior. [sent-71, score-0.269]
41 The experimental work illustrates how in fact this bound can be tighter than the standard PAC-Bayes bound. [sent-72, score-0.343]
42 Moreover, the selection of the prior may be further refined in exchange for a very small increase in the penalty term. [sent-73, score-0.211]
43 Then, for all priors $P(c) \in \{P_j(c)\}_{j=1}^{J}$, for all posterior distributions Q(c), for all $\delta \in (0, 1]$,
$$\Pr_{S \sim D^m}\left( \forall Q(c), \forall j:\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) \le \frac{\mathrm{KL}(Q(c)\,\|\,P_j(c)) + \ln\frac{m+1}{\delta} + \ln\frac{1}{\pi_j}}{m} \right) \ge 1 - \delta.$$
Proof The bound in Theorem 2. [sent-77, score-0.613]
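The argument behind Theorem 3.2 is the standard union-bound refinement of Theorem 2.1: apply it with each prior $P_j$ at confidence $\pi_j\delta$ and sum the failure probabilities, which is exactly where the extra $\ln\frac{1}{\pi_j}$ term comes from (a sketch of that step, not the paper's own proof text):
$$\Pr_{S \sim D^m}\left( \exists j, \exists Q:\; \mathrm{KL}(\hat{Q}_S \,\|\, Q_D) > \frac{\mathrm{KL}(Q\,\|\,P_j) + \ln\frac{m+1}{\pi_j\delta}}{m} \right) \le \sum_{j=1}^{J} \pi_j\delta = \delta, \qquad \ln\frac{m+1}{\pi_j\delta} = \ln\frac{m+1}{\delta} + \ln\frac{1}{\pi_j}.$$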
44 This result can also be particularised for the case of SVM classifiers. [sent-80, score-0.076]
45 The set of priors is constructed by allocating Gaussian distributions with identity covariance matrix along the direction given by $w_r$ at distances $\{\eta_j\}_{j=1}^{J}$ from the origin, where $\{\eta_j\}_{j=1}^{J}$ are real numbers. [sent-81, score-0.658]
46 3 (Multiple Prior PAC-Bayes Bound for linear classifiers) Let us consider a set $\{P_j(w|w_r, \eta_j)\}_{j=1}^{J}$ of prior distributions of classifiers consisting of spherical Gaussian distributions with identity covariance matrix centred on $\eta_j w_r$, where $\{\eta_j\}_{j=1}^{J}$ are real numbers. [sent-83, score-0.894]
47 Then, for all distributions D, for all classifiers $w$, for all $\mu > 0$, for all $\delta \in (0, 1]$, we have
$$\Pr_{S \sim D^m}\left( \mathrm{KL}(\hat{Q}_{S\setminus R}(w, \mu) \,\|\, Q_D(w, \mu)) \le \frac{\frac{\|\eta_j w_r - \mu w\|^2}{2} + \ln\frac{m-r+1}{\delta} + \ln J}{m-r} \right) \ge 1 - \delta$$
Proof The proof is straightforward, substituting $\pi_j = \frac{1}{J}$ for all $j$ in Theorem 3. [sent-84, score-0.652]
48 2 and computing the KL divergence between prior and posterior as in the proof of Corollary 3. [sent-85, score-0.218]
49 In the case of several priors, the search is repeated for every prior and the reported value of the bound is the tightest. [sent-90, score-0.425]
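Under Corollary 3.3, trying J different prior lengths costs only an additive ln J in the numerator, so the search simply evaluates the bound at every scaling and keeps the tightest value. A sketch of that loop follows; it is illustrative only, the scalings are taken equally spaced between η = 1 and η = 100 as in the experiments of Section 4, the search over µ is omitted for brevity, and `kl_inverse` is again the bisection from the earlier sketch.

```python
import math
import numpy as np

def multiple_prior_bound(w_r, w_m, mu, q_hat_holdout, m_minus_r,
                         n_scalings=10, eta_min=1.0, eta_max=100.0, delta=0.05):
    """Tightest bound over a grid of prior lengths eta_j (Corollary 3.3)."""
    best = 1.0
    for eta in np.linspace(eta_min, eta_max, n_scalings):
        kl_term = 0.5 * np.sum((eta * np.asarray(w_r) - mu * np.asarray(w_m)) ** 2)
        rhs = (kl_term + math.log((m_minus_r + 1) / delta)
               + math.log(n_scalings)) / m_minus_r
        best = min(best, kl_inverse(q_hat_holdout, rhs))
    return best
```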
50 In Section 4 we present experimental results comparing this new bound to the standard PAC-Bayes bound and using it to guide model selection. [sent-91, score-0.49]
51 4 Experiments The tightness of the new bound is evaluated in a model selection and classification task using some UCI [1] datasets (see their description in terms of number of instances, input dimension and number of positive/negative examples in Table 1). [sent-92, score-0.38]
52 For every dataset, we obtain 50 different training/test set partitions with 80% of the samples forming the training set and the remaining 20% forming the test set. [sent-95, score-0.24]
53 With each of the partitions we learn an SVM classifier with a Gaussian RBF kernel, preceded by model selection. [sent-96, score-0.055]
54 The model selection consists of determining an optimal pair of hyperparameters (C, σ). [sent-97, score-0.122]
55 C is the SVM trade-off between the maximisation of the margin and the minimisation of the hinge loss of the training samples, while σ is the width of the Gaussian kernel. [sent-98, score-0.142]
56 For completeness, this model selection is guided by the PAC-Bayes bound: we select the model corresponding to the pair that yields the lowest value of QD in the bound. [sent-105, score-0.115]
57 For every partition we use the minimum value of the bound resulting from all the pairs (C, σ) of the grid. [sent-107, score-0.29]
58 Note that this procedure is computationally less costly than the commonly used N -fold cross validation model selection, since it saves the training of N classifiers (one for each fold) for each parameter combination. [sent-108, score-0.117]
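A sketch of the bound-driven grid search described above; scikit-learn's SVC is used purely for illustration, and `compute_bound` stands for whichever of the bounds is being evaluated — neither choice comes from the paper.

```python
import itertools
from sklearn.svm import SVC

def select_by_bound(X_train, y_train, C_grid, sigma_grid, compute_bound):
    """Pick the (C, sigma) pair whose trained SVM yields the smallest bound value.

    compute_bound(model, X_train, y_train) should return an upper bound on the
    stochastic error; only one SVM is trained per parameter pair, versus N per
    pair for N-fold cross validation.
    """
    best_pair, best_value = None, float("inf")
    for C, sigma in itertools.product(C_grid, sigma_grid):
        model = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
        model.fit(X_train, y_train)
        value = compute_bound(model, X_train, y_train)
        if value < best_value:
            best_pair, best_value = (C, sigma), value
    return best_pair, best_value
```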
59 Table 2: Averaged PAC-Bayes Bound and Test Error Rate obtained by the model that yielded the lowest bound in each of the 50 training/test partitions. [sent-125, score-0.245]
60 We repeated this experiment using the Prior PAC-Bayes Bound with different configurations for learning the prior distribution of classifiers. [sent-126, score-0.157]
61 These configurations are defined by variations on the percentage of training patterns separated to compute the prior and on the number of scalings of the magnitude of that prior. [sent-127, score-0.791]
62 The scalings represent different lengths η of ||wr || equally spaced between η = 1 and η = 100. [sent-128, score-0.323]
63 To summarize, for every training/test partition and for every pair (% patterns, # of scalings) we look at the pair (C, σ) that outputs the smallest value of QD. [sent-129, score-0.142]
64 In this case, using the Prior PAC-Bayes Bound to perform the model selection increases the computational burden over the original PAC-Bayes bound by the training of one extra classifier (the one used to learn the prior), compared with the extra N classifiers (one per fold) needed by N-fold cross validation. [sent-130, score-0.275]
65 It seems that ten scalings of the prior are enough to obtain tighter bounds, since the use of 100 or 500 scalings does not improve the best results. [sent-132, score-0.928]
66 With respect to the percentage of training instances left out to learn the prior, something close to 50% of the training set works well in the considered problems. [sent-133, score-0.366]
67 It is worth mentioning that we treat each position in the Table as a separate experiment. [sent-134, score-0.048]
68 [Table 3: rows indexed by the number of scalings of the prior; columns by the percentage of the training set (10%, 20%, 30%, 40%, 50%) used to compute the prior; the numeric entries were not recoverable from the extraction.]
72 Table 3: Averaged Prior PAC-Bayes bound for different settings of percentage of training instances reserved to compute the prior and of number of scalings of the normalised prior. [sent-303, score-1.036]
73 This would have involved a further application of the union bound with the 20 entries of the Table for each problem, at the cost of adding an extra ln(20)/m (0. [sent-305, score-0.292]
74 We decided to fix the number of scalings and the amount of training patterns used to compute the prior, since trying all of the different options would increase the computational burden of the model selection. [sent-308, score-0.749]
75 Table 5 displays the test error rate obtained by SVMs with their hyperparameters tuned on the above mentioned grid by means of ten-fold cross-validation, which serves as a baseline method for comparison purposes. [sent-310, score-0.235]
76 According to the values shown in the tables, the Prior PAC-Bayes bound achieves tighter predictions of the generalization error of the randomized classifier in almost all cases. [sent-311, score-0.438]
77 Notice how the length of the prior is not so critical in comparison with its direction. [sent-312, score-0.157]
78 The goodness of the latter relies on the subset of samples left out for the purpose of learning the prior classifier. [sent-313, score-0.191]
79 [Table 4: same layout as Table 3 (rows indexed by the number of scalings, columns by the percentage of the training set used to compute the prior); the numeric entries were not recoverable from the extraction.]
83 Table 4: Averaged Test Error Rate corresponding to the model determined by the bound for the different settings of Table 3. [sent-483, score-0.245]
84 [Table 5 header: Problem (Wdbc, Image, Waveform, Ringnorm) and Cross-validation error rate; the numeric entries were not recoverable from the extraction.] [sent-484, score-0.119]
85 For every partition we select the test error rate corresponding to the model reporting the smallest cross-validation error. [sent-501, score-0.236]
86 However, the comparison with Table 5 points out that the PAC-Bayes bound is not as accurate as ten-fold cross-validation when it comes to selecting a model that yields a low test error rate. [sent-502, score-0.364]
87 Nevertheless, in two out of the four problems (waveform and wdbc) the bound provided a model as good as the one found by cross-validation, and in ringnorm the error bars overlap. [sent-503, score-0.449]
88 We conclude the discussion by pointing out that the cross-validation error rate cannot be used directly as a prediction of the expected test error rate in the sense of worst-case performance. [sent-504, score-0.286]
89 Of course the values of the cross-validation error rate and the test error rate are close, but it is difficult to predict how close they are going to be. [sent-505, score-0.286]
90 5 Conclusions and ongoing research In this paper we have presented a version of the PAC-Bayes bound for linear classifiers that introduces the learning of the prior distribution over the classifiers. [sent-506, score-0.428]
91 This prior distribution is a Gaussian with identity covariance matrix. [sent-507, score-0.243]
92 The mean weight vector is learnt in the following way: its direction is determined from a separate subset of the training examples, while its length has to be chosen from an a priori fixed set of lengths. [sent-508, score-0.138]
93 The experimental work shows that this new version of the bound achieves tighter predictions of the generalization error of the stochastic classifier, compared to the original PAC-Bayes bound predictions. [sent-509, score-0.723]
94 Moreover, if the model selection is driven by the bound, the Prior PAC-Bayes does not degrade the quality of the model selected by the original bound. [sent-510, score-0.078]
95 Nevertheless, it has to be said that in some of our experiments the model selected by the bounds was as accurate as the one selected by ten-fold cross-validation in terms of test error rate on a separate test set. [sent-511, score-0.312]
96 This fact is remarkable since including model selection in the training of the classifier roughly multiplies the computational burden by ten when using ten-fold cross-validation, but only roughly by two when using the prior PAC-Bayes bound. [sent-512, score-0.491]
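One way to make these factors concrete, under the assumption of a grid of G candidate (C, σ) pairs: ten-fold cross-validation trains about ten classifiers per pair, while the prior PAC-Bayes selection trains about two (the prior on the reserved subset R and the classifier on the full training set), and the original PAC-Bayes bound trains one:
$$\text{ten-fold CV: } \approx 10\,G \ \text{trainings}, \qquad \text{prior PAC-Bayes: } \approx 2\,G, \qquad \text{original PAC-Bayes: } \approx G.$$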
97 Of course the original PAC-Bayes bound provides a cheaper model selection, but its predictions about the generalization capabilities are more pessimistic. [sent-513, score-0.105]
98 The amount of training patterns used to learn the prior seems to be a key aspect in the goodness of this prior and thus in the tightness of the bound. [sent-514, score-0.604]
99 Therefore, ongoing research includes methods to systematically determine an amount of patterns that provides suitable priors. [sent-515, score-0.102]
100 Another line of research explores the use of these bounds to reinforce different properties of the design of classifiers, such as sparsity. [sent-516, score-0.073]
wordName wordTfidf (topN-words)
[('wr', 0.459), ('scalings', 0.323), ('qd', 0.28), ('bound', 0.245), ('wm', 0.209), ('kl', 0.181), ('qs', 0.169), ('ln', 0.161), ('prior', 0.157), ('prs', 0.153), ('classi', 0.152), ('ers', 0.15), ('ringnorm', 0.133), ('centred', 0.132), ('pj', 0.122), ('percentage', 0.113), ('er', 0.111), ('wdbc', 0.111), ('ew', 0.107), ('waveform', 0.107), ('tighter', 0.098), ('training', 0.091), ('capabilities', 0.081), ('misclassifying', 0.076), ('particularised', 0.076), ('patterns', 0.076), ('svm', 0.074), ('corollary', 0.073), ('bounds', 0.073), ('error', 0.071), ('burden', 0.071), ('dm', 0.068), ('spherical', 0.06), ('table', 0.058), ('tightness', 0.056), ('selection', 0.054), ('cd', 0.053), ('lund', 0.051), ('winsconsin', 0.051), ('margin', 0.051), ('rate', 0.048), ('identity', 0.048), ('test', 0.048), ('priors', 0.046), ('ec', 0.045), ('cs', 0.045), ('spain', 0.044), ('averaged', 0.044), ('pr', 0.043), ('stochastic', 0.04), ('instances', 0.038), ('covariance', 0.038), ('breast', 0.038), ('nement', 0.038), ('tightest', 0.038), ('normalised', 0.038), ('displays', 0.037), ('pair', 0.037), ('theorem', 0.035), ('generalisation', 0.034), ('goodness', 0.034), ('learn', 0.033), ('proof', 0.032), ('compute', 0.031), ('hyperparameters', 0.031), ('divergence', 0.029), ('wt', 0.029), ('cancer', 0.028), ('forming', 0.028), ('gaussian', 0.028), ('fold', 0.027), ('ten', 0.027), ('ongoing', 0.026), ('trick', 0.026), ('cross', 0.026), ('description', 0.025), ('london', 0.024), ('uci', 0.024), ('selected', 0.024), ('involved', 0.024), ('separate', 0.024), ('select', 0.024), ('worth', 0.024), ('predictions', 0.024), ('arrive', 0.023), ('union', 0.023), ('direction', 0.023), ('every', 0.023), ('partition', 0.022), ('origin', 0.022), ('gurations', 0.022), ('partitions', 0.022), ('ywt', 0.022), ('selections', 0.022), ('allocating', 0.022), ('cheating', 0.022), ('isabelle', 0.022), ('stressing', 0.022), ('remarked', 0.022), ('favourably', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.16194411 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set.
3 0.12385213 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
4 0.1140115 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
5 0.10969641 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
6 0.092297308 186 nips-2006-Support Vector Machines on a Budget
7 0.090551741 50 nips-2006-Chained Boosting
8 0.087754592 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
9 0.087013252 33 nips-2006-Analysis of Representations for Domain Adaptation
10 0.084656477 109 nips-2006-Learnability and the doubling dimension
11 0.07624846 21 nips-2006-AdaBoost is Consistent
12 0.07514964 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
13 0.074051805 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.073349729 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
15 0.073301353 156 nips-2006-Ordinal Regression by Extended Binary Classification
16 0.070177801 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.069801778 150 nips-2006-On Transductive Regression
18 0.06965135 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
19 0.068754219 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
20 0.06764181 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
topicId topicWeight
[(0, -0.2), (1, 0.074), (2, -0.044), (3, -0.064), (4, -0.064), (5, 0.26), (6, -0.067), (7, 0.075), (8, 0.03), (9, 0.053), (10, 0.08), (11, 0.092), (12, -0.062), (13, -0.004), (14, -0.019), (15, -0.02), (16, -0.001), (17, -0.048), (18, 0.012), (19, 0.07), (20, -0.013), (21, 0.018), (22, -0.032), (23, -0.024), (24, -0.036), (25, -0.026), (26, 0.05), (27, 0.112), (28, 0.07), (29, -0.034), (30, 0.032), (31, 0.043), (32, 0.094), (33, 0.032), (34, 0.025), (35, 0.04), (36, 0.014), (37, 0.016), (38, 0.074), (39, -0.017), (40, -0.063), (41, 0.045), (42, -0.03), (43, -0.14), (44, 0.107), (45, -0.116), (46, -0.13), (47, 0.044), (48, -0.076), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.94361913 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.67474723 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. 
For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . 
(5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. 
Moreover, we will restrict ourselves to the case where the $k$ classifiers from $H$ are chosen independently, each according to the prior $P$ on $H$ (however, other choices for $\bar{P}$ are clearly possible). In this case we have
$$\mathrm{KL}(\bar{Q}\,\|\,\bar{P}) = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\mathop{\mathbf{E}}_{h_{1\text{-}k}\sim Q^k}\ln\frac{|g(k)|\cdot Q^{k}(h_{1\text{-}k})}{|g(k)|\cdot P^{k}(h_{1\text{-}k})} = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\mathop{\mathbf{E}}_{h_1\sim Q}\cdots\mathop{\mathbf{E}}_{h_k\sim Q}\ln\prod_{i=1}^{k}\frac{Q(h_i)}{P(h_i)} = \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\cdot k\cdot\mathop{\mathbf{E}}_{h\sim Q}\ln\frac{Q(h)}{P(h)} = \bar{k}\cdot\mathrm{KL}(Q\,\|\,P), \qquad (10)$$
where
$$\bar{k} \stackrel{\mathrm{def}}{=} \frac{1}{c}\sum_{k=1}^{\infty}|g(k)|\cdot k. \qquad (11)$$
We then have the following theorem.

Theorem 1. For any set $H$ of binary classifiers, any prior distribution $\bar{P}$ on $H^{*}$, and any $\delta \in (0,1]$, we have
$$\Pr_{S\sim D^{m}}\Big(\forall\, \bar{Q} \text{ on } H^{*}:\ \mathrm{kl}\big(R_S(G_{\bar{Q}})\,\|\,R(G_{\bar{Q}})\big) \le \frac{1}{m}\Big[\mathrm{KL}(\bar{Q}\,\|\,\bar{P}) + \ln\frac{m+1}{\delta}\Big]\Big) \ge 1-\delta.$$
Proof. The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space $H^{*}$ of binary classifiers with any zero-one valued loss function.

Note that Theorem 1 directly provides upper and lower bounds on $\zeta_Q$ when we use Equations 7 and 8 to relate $R(G_{\bar{Q}})$ and $R_S(G_{\bar{Q}})$ to $\zeta_Q$ and $\hat{\zeta}_Q$, and when we use Equation 10 for $\mathrm{KL}(\bar{Q}\,\|\,\bar{P})$. Consequently, we have the following theorem.

Theorem 2. Consider any loss function $\zeta_Q(x,y)$ defined by Equation 1. Let $\zeta_Q$ and $\hat{\zeta}_Q$ be, respectively, the expected loss and its empirical estimate (on a sample of $m$ examples) as defined by Equation 3. Let $c$ and $\bar{k}$ be defined by Equations 6 and 11, respectively. Then for any set $H$ of binary classifiers, any prior distribution $P$ on $H$, and any $\delta \in (0,1]$, we have
$$\Pr_{S\sim D^{m}}\Big(\forall\, Q \text{ on } H:\ \mathrm{kl}\Big(\frac{1}{c}\big(\hat{\zeta}_Q - \tfrac{1}{2}\big) + \tfrac{1}{2}\ \Big\|\ \frac{1}{c}\big(\zeta_Q - \tfrac{1}{2}\big) + \tfrac{1}{2}\Big) \le \frac{1}{m}\Big[\bar{k}\cdot\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}\Big]\Big) \ge 1-\delta.$$

4 Bound Behavior During Adaboost

We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk $\mathbf{E}_{(x,y)} W_Q(x,y)$ (i.e., a small expected margin) and a small $\mathrm{Var}_{(x,y)} W_Q(x,y)$ (i.e., a small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances.

We have used the "symmetric" version of Adaboost [10, 9] where, at each boosting round $t$, the weak learning algorithm produces a classifier $h_t$ with the smallest empirical error
$$\epsilon_t = \sum_{i=1}^{m} D_t(i)\, I[h_t(x_i) \neq y_i]$$
with respect to the boosting distribution $D_t(i)$ on the indices $i \in \{1,\dots,m\}$ of the training examples. After each boosting round $t$, this distribution is updated according to
$$D_{t+1}(i) = \frac{1}{Z_t} D_t(i)\exp\big(-y_i\,\alpha_t\,h_t(x_i)\big),$$
where $Z_t$ is the normalization constant required for $D_{t+1}$ to be a distribution, and where
$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.$$
A short sketch of this reweighting step is given below. Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard "decision stumps" for the set $H$ of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs $+y$ when the tested attribute exceeds the threshold and predicts $-y$ otherwise, where $y \in \{-1,+1\}$. For each decision stump $h \in H$, its boolean complement is also in $H$. Hence, we have $2[k(i)-1]$ possible decision stumps on an attribute $i$ having $k(i)$ possible (discrete) values. Hence, for data sets having $n$ attributes, we have exactly
$$|H| = \sum_{i=1}^{n} 2[k(i)-1]$$
classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments.
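The reweighting step of the "symmetric" Adaboost variant described above can be written in a few lines. The sketch below is an illustrative addition run on hypothetical toy data, not the experimental code used in the paper.

```python
import numpy as np

def adaboost_reweight(D_t, h_t_outputs, y):
    """One round of the 'symmetric' Adaboost reweighting described above.

    D_t:          boosting distribution over the m training examples
    h_t_outputs:  h_t(x_i) in {-1, +1} for each example
    y:            labels in {-1, +1}
    Returns (D_{t+1}, alpha_t, eps_t).
    """
    eps_t = float(np.sum(D_t * (h_t_outputs != y)))      # weighted training error of h_t
    alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)        # vote weight of h_t
    D_next = D_t * np.exp(-y * alpha_t * h_t_outputs)    # up-weight the mistakes of h_t
    return D_next / D_next.sum(), alpha_t, eps_t         # division by Z_t

# Toy run on six hypothetical examples and one stump's predictions.
y = np.array([1, 1, 1, -1, -1, -1])
h = np.array([1, 1, -1, -1, -1, 1])                      # errs on examples 3 and 6
D = np.full(6, 1.0 / 6.0)
D, alpha, eps = adaboost_reweight(D, h, y)
print(eps, alpha, D)                                     # eps = 1/3; the two mistakes now weigh more
```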
From Theorem 2 and Equation 10, the bound on $\zeta_Q$ depends on $\mathrm{KL}(Q\,\|\,P)$. We have chosen a uniform prior $P(h) = 1/|H|$ $\forall h \in H$. We therefore have
$$\mathrm{KL}(Q\,\|\,P) = \sum_{h\in H} Q(h)\ln\frac{Q(h)}{P(h)} = \sum_{h\in H} Q(h)\ln Q(h) + \ln|H| = -H(Q) + \ln|H|.$$
At boosting round $t$, Adaboost changes the distribution from $D_t$ to $D_{t+1}$ by putting more weight on the examples that are incorrectly classified by $h_t$. This strategy is supported by the proposed bound on $\zeta_Q$ since it has the effect of increasing the entropy $H(Q)$ as a function of $t$. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of $t$ in all of our boosting experiments.

We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss.

4.1 Results for the Exponential Loss

The exponential loss $E_Q(x,y)$ is the obvious choice for boosting since the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round (Footnote 2). More precisely, we have chosen
$$E_Q(x,y) \stackrel{\mathrm{def}}{=} \frac{1}{2}\exp\big(\beta[2W_Q(x,y)-1]\big). \qquad (12)$$
For this loss function, we have
$$c = e^{\beta} - 1, \qquad \bar{k} = \frac{\beta}{1 - e^{-\beta}}.$$
Since $c$ increases exponentially rapidly with $\beta$, so will the risk upper bound for $E_Q$. Hence, unfortunately, we can obtain a tight upper bound only for small values of $\beta$.

All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior of the exponential loss bound on the Mushroom and Sonar data sets, containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as a function of the number $T$ of boosting rounds, the risk of the Gibbs classifier $\mathbf{E}_{(x,y)} W_Q(x,y)$ increases as a function of $T$ while its variance $\mathrm{Var}_{(x,y)} W_Q(x,y)$ decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far (Footnote 3). Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of $\beta$. Interestingly, however, the risk bound curves remain parallel to the true risk curves.

4.2 Results for the Sigmoid Loss

We have also investigated the sigmoid loss $T_Q(x,y)$ defined by
$$T_Q(x,y) \stackrel{\mathrm{def}}{=} \frac{1}{2} + \frac{1}{2}\tanh\big(\beta[2W_Q(x,y)-1]\big). \qquad (13)$$

Footnote 2: In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination $f_Q$ is not always decreasing, as we shall see.
Footnote 3: These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vs-kp, USvotes, mushroom, and sonar.

Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ) on test), its variance (Var(WQ) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.
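Putting the pieces together, the Theorem 2 bound for the exponential loss can be evaluated by computing $\mathrm{KL}(Q\,\|\,P) = -H(Q) + \ln|H|$, forming the right-hand side with $c = e^{\beta}-1$ and $\bar{k} = \beta/(1-e^{-\beta})$, and numerically inverting the binary kl divergence. The sketch below is an illustrative reconstruction of that computation with made-up numbers, not the authors' code.

```python
import math

def kl_bernoulli(q, p):
    # kl(q || p) between Bernoulli distributions, clipped away from {0, 1}
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q_hat, B, tol=1e-10):
    """Largest p >= q_hat with kl(q_hat || p) <= B, found by binary search."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl_bernoulli(q_hat, mid) <= B else (lo, mid)
    return lo

def exp_loss_bound(zeta_hat, entropy_Q, H_size, m, beta, delta):
    """Upper bound on zeta_Q from Theorem 2, specialized to the exponential loss."""
    c = math.exp(beta) - 1.0
    k_bar = beta / (1.0 - math.exp(-beta))
    KL_QP = -entropy_Q + math.log(H_size)          # uniform prior P(h) = 1/|H|
    B = (k_bar * KL_QP + math.log((m + 1) / delta)) / m
    q_hat = (zeta_hat - 0.5) / c + 0.5             # empirical risk of G_Qbar (Eq. 8)
    p_up = kl_inverse_upper(q_hat, B)              # bound on R(G_Qbar)
    return c * (p_up - 0.5) + 0.5                  # back to a bound on zeta_Q (Eq. 7)

# Illustrative numbers only (not taken from the paper's experiments).
print(exp_loss_bound(zeta_hat=0.42, entropy_Q=3.1, H_size=2000,
                     m=4062, beta=math.log(2), delta=0.05))
```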
Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set.

Since the Taylor series expansion of $\tanh(x)$ about $x = 0$ converges only for $|x| < \pi/2$, we are limited to $\beta \le \pi/2$. Under these circumstances, we have
$$c = \tan(\beta), \qquad \bar{k} = \frac{\beta}{\cos(\beta)\sin(\beta)}.$$
Similarly as in Figure 1, we see in Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk in Figure 1.

5 Conclusion

By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function $\zeta_Q$ that has a convergent Taylor expansion around $W_Q = 1/2$ (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk bound is tight only for small values of the scaling factor $c$ involved in the relation between the expected loss $\zeta_Q$ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier $G_{\bar{Q}}$. However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss) behaves very similarly to the true loss.

Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ) on test), its variance (Var(WQ) on test), and the test error of the majority vote (MV error on test) as a function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2.

Acknowledgments
Work supported by NSERC Discovery grants 262067 and 122405.

References
[1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.
[2] Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002.
[3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003.
[4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
[5] François Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 481–488, 2005.
[6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423–430. MIT Press, Cambridge, MA, 2003.
[7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686, 1998.
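As a check on the sigmoid-loss constants quoted in Section 4.2 above (the closed form for $\bar{k}$ was reconstructed here from Equation 11, since the extracted text was garbled at that point), the following illustrative sketch recovers $c$ and $\bar{k}$ directly from the Taylor coefficients of $\tanh$ and compares them with $\tan(\beta)$ and $\beta/(\cos\beta\sin\beta)$.

```python
import math

def tan_series_coeffs(K):
    """Coefficients t[k] of u^k in tan(u), computed from tan'(u) = 1 + tan(u)^2.
    The absolute Taylor coefficients of tanh(u) are exactly these t[k]."""
    t = [0.0] * (K + 1)
    for k in range(K):
        conv = sum(t[i] * t[k - i] for i in range(k + 1))
        t[k + 1] = (conv + (1.0 if k == 0 else 0.0)) / (k + 1)
    return t

beta = 1.0                    # must satisfy beta < pi/2 for the series to converge
K = 60
t = tan_series_coeffs(K)
g_abs = [t[k] * beta ** k for k in range(K + 1)]      # |g(k)| for the sigmoid loss

c = sum(g_abs[1:])
k_bar = sum(k * g_abs[k] for k in range(1, K + 1)) / c

print(c, math.tan(beta))                                # c matches tan(beta)
print(k_bar, beta / (math.cos(beta) * math.sin(beta)))  # k_bar matches beta / (cos(beta) sin(beta))
```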
3 0.64581579 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
Author: Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, Nicolas Usunier
Abstract: We propose new PAC-Bayes bounds for the risk of the weighted majority vote that depend on the mean and variance of the error of its associated Gibbs classifier. We show that these bounds can be smaller than the risk of the Gibbs classifier and can be arbitrarily close to zero even if the risk of the Gibbs classifier is close to 1/2. Moreover, we show that these bounds can be uniformly estimated on the training data for all possible posteriors Q. Moreover, they can be improved by using a large sample of unlabelled data. 1
4 0.62919885 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
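The disagreement metric described in this abstract is straightforward to estimate empirically. The sketch below is an illustrative addition using hypothetical threshold classifiers and a uniform domain distribution; it is not code from the cited paper.

```python
import numpy as np

def disagreement_distance(h1, h2, sample_x):
    """Monte Carlo estimate of d(h1, h2) = Pr_x[h1(x) != h2(x)] under the
    domain distribution from which sample_x was drawn."""
    preds1 = np.array([h1(x) for x in sample_x])
    preds2 = np.array([h2(x) for x in sample_x])
    return float(np.mean(preds1 != preds2))

# Two hypothetical threshold classifiers under the uniform distribution on [0, 1].
rng = np.random.default_rng(2)
xs = rng.uniform(0.0, 1.0, size=100_000)
h_a = lambda x: 1 if x > 0.3 else -1
h_b = lambda x: 1 if x > 0.5 else -1
print(disagreement_distance(h_a, h_b, xs))   # approximately 0.2 = |0.5 - 0.3|
```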
5 0.58585709 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
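One common way to instantiate the three steps sketched in this abstract is to attach a threshold index to each example and label it by whether the true rank exceeds that threshold; the ranking rule then counts the thresholds the binary classifier says are exceeded. The code below is an illustrative sketch of that idea only; the cited paper's exact construction (for instance, its per-example weighting) may differ.

```python
import numpy as np

def make_extended_examples(X, y, K):
    """Extended examples ((x, k), z) with z = +1 if y > k else -1, for k = 1..K-1.
    The threshold index k is appended as a one-hot code."""
    Xe, ze = [], []
    for x, label in zip(X, y):
        for k in range(1, K):
            Xe.append(np.concatenate([x, np.eye(K - 1)[k - 1]]))
            ze.append(1 if label > k else -1)
    return np.array(Xe), np.array(ze)

def rank_from_binary(x, K, binary_predict):
    """Ranking rule: 1 plus the number of thresholds predicted as exceeded."""
    votes = sum(binary_predict(np.concatenate([x, np.eye(K - 1)[k - 1]])) == 1
                for k in range(1, K))
    return 1 + votes

# Toy usage with K = 4 ordinal labels and 2-dimensional inputs; any binary
# learner could then be trained on (Xe, ze) and plugged into rank_from_binary.
X = np.array([[0.1, 0.2], [0.8, 0.7]])
y = np.array([1, 4])
Xe, ze = make_extended_examples(X, y, K=4)
print(Xe.shape, ze)
```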
6 0.57658786 116 nips-2006-Learning from Multiple Sources
7 0.56501937 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
8 0.53844815 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
9 0.4964608 159 nips-2006-Parameter Expanded Variational Bayesian Methods
10 0.49308848 186 nips-2006-Support Vector Machines on a Budget
11 0.48492587 155 nips-2006-Optimal Single-Class Classification Strategies
12 0.47859415 50 nips-2006-Chained Boosting
13 0.47564092 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
14 0.47241631 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
15 0.47120148 33 nips-2006-Analysis of Representations for Domain Adaptation
16 0.46348971 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
17 0.46085832 150 nips-2006-On Transductive Regression
18 0.45370644 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
19 0.41512167 185 nips-2006-Subordinate class recognition using relational object models
20 0.40934792 21 nips-2006-AdaBoost is Consistent
topicId topicWeight
[(1, 0.109), (3, 0.016), (7, 0.086), (9, 0.034), (10, 0.299), (20, 0.027), (22, 0.079), (44, 0.108), (57, 0.071), (65, 0.048), (69, 0.017), (90, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.76073676 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
2 0.66103411 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
3 0.57180744 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
4 0.56612337 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach on some of the problems on which IC has been demonstrated. 1
5 0.56585342 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
6 0.56581002 65 nips-2006-Denoising and Dimension Reduction in Feature Space
7 0.56545782 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.56115723 117 nips-2006-Learning on Graph with Laplacian Regularization
9 0.56008953 21 nips-2006-AdaBoost is Consistent
10 0.55988973 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
11 0.55968708 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
12 0.55668622 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
13 0.55600756 5 nips-2006-A Kernel Method for the Two-Sample-Problem
14 0.5555231 159 nips-2006-Parameter Expanded Variational Bayesian Methods
15 0.5545758 169 nips-2006-Relational Learning with Gaussian Processes
16 0.55394995 121 nips-2006-Learning to be Bayesian without Supervision
17 0.55346328 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
18 0.55309349 139 nips-2006-Multi-dynamic Bayesian Networks
19 0.55271429 98 nips-2006-Inferring Network Structure from Co-Occurrences
20 0.55269098 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions