jmlr jmlr2012 jmlr2012-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and η-prior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China. Editor: Gabor Lugosi. Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. [sent-11, score-0.553]
2 The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. [sent-12, score-0.9]
3 We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. [sent-13, score-0.836]
4 The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and η-prior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. [sent-14, score-1.311]
5 The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. [sent-15, score-1.185]
6 The PAC-Bayes bound (reviewed in Section 2) uses a Gaussian prior centered at the origin in the weight space. [sent-39, score-0.545]
7 The key to the new bounds introduced here is to use part of the training set to compute a more informative prior and then compute the bound on the remainder of the examples relative to this prior. [sent-40, score-0.653]
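To make this splitting idea concrete, here is a minimal Python sketch (not taken from the paper): it assumes a linear kernel and uses scikit-learn's LinearSVC only as a stand-in for the SVM solver; the function names are illustrative.

```python
# Hedged sketch: reserve r examples (subset T) to learn the prior direction w_r,
# so that the PAC-Bayes bound can be evaluated on the remaining m - r examples.
import numpy as np
from sklearn.svm import LinearSVC

def split_for_prior(X, y, r, seed=0):
    """Split the m training examples into T (r examples, used only to learn the
    prior) and S \\ T (m - r examples, used to evaluate the bound)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    return (X[idx[:r]], y[idx[:r]]), (X[idx[r:]], y[idx[r:]])

def learn_prior_direction(X_prior, y_prior, C=1.0):
    """Train an SVM on T and return the unit-norm weight vector w_r that
    defines the direction of the informative prior."""
    svm = LinearSVC(C=C).fit(X_prior, y_prior)
    w_r = svm.coef_.ravel()
    return w_r / np.linalg.norm(w_r)
```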
8 The prior PAC-Bayes bound was initially presented by Ambroladze et al. [sent-42, score-0.514]
9 A slight nuisance of the prior PAC-Bayes bound is that a separate data set must be available in order to fix the prior. [sent-44, score-0.546]
10 Section 4.1 introduces a new classification algorithm, the prior SVM, which replaces the margin maximization in the optimization problem by a regularization term that pushes towards the minimization of the PAC-Bayes bound. [sent-52, score-0.444]
11 The first one involves the learning of a prior formed by an ensemble of Gaussian distributions centered at different distances along the same direction. [sent-54, score-0.456]
12 During the second stage, each component of the prior is mapped with a posterior that improves its classification accuracy while tightening the PAC-Bayes bound. [sent-55, score-0.524]
13 In the last stage the prior component/posterior pair that achieves the lowest value of the PAC-Bayes bound is selected as the prior SVM classifier. [sent-56, score-0.9]
14 Section 4.2 presents a second algorithm, named η-prior SVM, as a variant of prior SVMs where the position of the component of the prior that goes into the overall classifier is optimised in a continuous range (not picked from a fixed set). [sent-58, score-0.772]
15 Therefore, η-prior SVMs include a first optimization where the direction of the prior is learnt from a separate set of training patterns, and a second optimization that determines (i) the exact position of the prior along the already learnt direction and (ii) the position of the posterior. [sent-59, score-0.974]
16 The experiments illustrate the capabilities of the prior PAC-Bayes bound to provide tighter predictions of the generalisation of an SVM. [sent-62, score-0.606]
17 Moreover, the combination of the new bounds and the two prior SVM algorithms yields more dramatic tightenings of the bound. [sent-63, score-0.46]
18 We finish the experimental work showing that the use of a different value of C for the prior and the posterior that form the (η-)prior SVM leads to a further tightening of the bound. [sent-65, score-0.524]
19 Moreover, we can choose the prior cu : u ∼ N (0, I) to be a spherical Gaussian with identity covariance matrix centered at the origin. [sent-81, score-0.556]
20 These new bounds introduce more sophisticated designs for the prior distribution over the classifiers in order to reduce its divergence from the posterior distribution. [sent-91, score-0.585]
21 The first set of bounds learns the prior distribution from a separate training data set that will not be used in the computation of the bound, whilst the second set learns the prior from mathematical expectations, so that no subset of patterns needs to be left out to calculate the bound. [sent-92, score-0.945]
22 Our first contribution is motivated by the fact that the PAC-Bayes bound allows us to choose the prior distribution, P(c). [sent-96, score-0.514]
23 We now consider learning a different prior based on training an SVM on a subset T of the training set comprising r training patterns and labels. [sent-98, score-0.533]
24 With these r examples we can learn a (unit and biased) SVM classifier, wr, and form a prior P(wr, η) ∼ N(ηwr, I) consisting of a Gaussian distribution with identity covariance matrix centered along wr at a distance η from the origin. [sent-100, score-1.233]
25 Corollary 5 (Single-prior PAC-Bayes bound for SVMs) Let us consider a prior on the distribution of classifiers consisting of a spherical Gaussian with identity covariance centered along the direction given by wr at a distance η from the origin. [sent-102, score-1.051]
26 Classifier wr has been learnt from a subset T of r examples a priori separated from a training set S of m samples. [sent-103, score-0.478]
27 Intuitively, if the selection of the prior is appropriate, the bound can be tighter than the one given in Corollary 4 when applied to the SVM weight vector on the whole training set. [sent-110, score-0.607]
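As an illustration of how such a bound can be evaluated in practice, here is a hedged Python sketch of the single-prior computation for a linear kernel; the helper names and the choice of SciPy routines are ours, not the paper's, and in the paper µ is additionally tuned to tighten the bound.

```python
# Hedged sketch of the single-prior PAC-Bayes bound (Corollary 5), linear kernel,
# assuming ||w|| = ||w_r|| = 1.  Posterior Q(w, mu) = N(mu*w, I), prior N(eta*w_r, I).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def kl_binary(q, p, eps=1e-12):
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q_hat, rhs):
    """Largest p >= q_hat with KL(q_hat || p) <= rhs (the KL_+ inversion)."""
    if kl_binary(q_hat, 1 - 1e-12) <= rhs:
        return 1.0
    return brentq(lambda p: kl_binary(q_hat, p) - rhs, q_hat, 1 - 1e-12)

def single_prior_bound(X_rest, y_rest, w, mu, w_r, eta, delta=0.05):
    m_rest = len(y_rest)
    # Empirical stochastic (Gibbs) error on S \ T via the Gaussian tail of the
    # normalized margin gamma(x, y) = y <w, x> / (||w|| ||x||).
    margins = y_rest * (X_rest @ w) / np.linalg.norm(X_rest, axis=1)
    q_hat = np.mean(norm.sf(mu * margins))
    kl_qp = 0.5 * np.linalg.norm(mu * w - eta * w_r) ** 2      # KL(Q || P)
    rhs = (kl_qp + np.log((m_rest + 1) / delta)) / m_rest
    return kl_inverse(q_hat, rhs)                              # bound on Q_D
```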
28 Theorem 6 (Mixture prior PAC-Bayes bound) Let P(c) = ∑_{j=1}^{J} π_j P_j(c) be a prior distribution over classifiers consisting of a mixture of J components {P_j(c)}_{j=1}^{J} combined with positive weights {π_j}_{j=1}^{J} so that ∑_{j=1}^{J} π_j = 1. [sent-116, score-0.796]
29 Proof The bound in Theorem 3 can be instantiated for the ensemble prior P(c): Pr_{S∼D^m}( ∀Q(c) : KL_+(Q̂_S || Q_D) ≤ [KL(Q(c)||P(c)) + ln((m+1)/δ)] / m ) ≥ 1 − δ. [sent-118, score-0.535]
30 We now bound the KL divergence between the posterior Q(c) and the ensemble prior P (c). [sent-119, score-0.66]
31 The mixture prior is constructed by allocating Gaussian distributions with identity covariance matrix along the direction given by wr at distances {η_j}_{j=1}^{J} from the origin, where the η_j are positive real numbers. [sent-125, score-0.89]
32 Then, for all distributions D, for all posteriors (w, µ) and for all δ ∈ (0, 1], we have that, with probability greater than 1 − δ over all the training sets S of size m sampled from D, KL_+(Q̂_{S\T}(w, µ) || Q_D(w, µ)) ≤ min_j [ ||η_j wr − µw||² / 2 + ln((m−r+1)/δ) + ln J ] / (m − r). [sent-127, score-0.561]
33 Proof The proof is straightforward and can be completed by substituting 1/J for all π_j in Theorem 6 and computing the KL divergence between prior and posterior as in the proof of Corollary 5. [sent-128, score-0.511]
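Continuing the sketch above, the mixture-prior version only changes the complexity term: it keeps the best of the J scalings and pays an extra ln J. Again a hedged illustration for a linear kernel, reusing the kl_inverse helper defined in the previous snippet.

```python
# Hedged sketch of the mixture-prior bound (Corollary 7), linear kernel;
# reuses kl_inverse() from the single-prior sketch above.
import numpy as np
from scipy.stats import norm

def mixture_prior_bound(X_rest, y_rest, w, mu, w_r, etas, delta=0.05):
    m_rest, J = len(y_rest), len(etas)
    margins = y_rest * (X_rest @ w) / np.linalg.norm(X_rest, axis=1)
    q_hat = np.mean(norm.sf(mu * margins))
    # With J equiprobable components N(eta_j * w_r, I), the KL penalty is the
    # smallest component term plus a ln J charge for the union over components.
    kl_min = min(0.5 * np.linalg.norm(eta * w_r - mu * w) ** 2 for eta in etas)
    rhs = (kl_min + np.log((m_rest + 1) / delta) + np.log(J)) / m_rest
    return kl_inverse(q_hat, rhs)
```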
34 Rather than take a spherically symmetric prior distribution we choose the variance in the direction of the prior vector to be τ > 1. [sent-133, score-0.795]
35 As with the prior PAC-Bayes bound, the mean of the prior distribution is also shifted from the origin in the direction wr. [sent-134, score-1.3]
36 Our application to SVMs is not restricted to using specific priors and posteriors so that we have the flexibility to adapt our distributions in order to accommodate the prior derived from the last part of the data. [sent-136, score-0.515]
37 Theorem 8 (τ-prior PAC-Bayes bound for linear classifiers) Let us consider a prior distribution of classifiers P(c|wr, τ, η) consisting of a Gaussian distribution centred on ηwr, with identity covariance matrix in all directions except wr, in which the variance is τ². [sent-138, score-0.981]
38 Theorem 10 (Mixture-expectation-prior PAC-Bayes bound for SVMs) For all D, for all mixtures of Gaussian priors P(c) = ∑_{j=1}^{J} π_j P_j(c) where P_j ∼ N(η_j w_p, I) (j = 1, . . . [sent-166, score-0.514]
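For the expectation priors, the prior direction w_p is not learnt from a held-out subset but estimated from a statistic of the whole training sample; the sketch below assumes (our assumption, to be checked against the paper's exact definition) that this statistic is the normalized empirical mean of y φ(x).

```python
# Hedged sketch: expectation-prior direction estimated from ALL m training
# samples (linear kernel).  The exact statistic is the one defined in the paper;
# the empirical mean of y*x is used here only as an illustrative assumption.
import numpy as np

def expectation_prior_direction(X, y):
    w_p = (y[:, None] * X).mean(axis=0)
    return w_p / np.linalg.norm(w_p)
```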
39 Optimising the Prior PAC-Bayes Bound in the Design of the Classifier. Up to this point we have introduced the prior PAC-Bayes bounds as a means to tighten the original PAC-Bayes bound (this fact is illustrated in the experiments included in Section 5). [sent-182, score-0.613]
40 The next contribution of this paper consists of the introduction of the optimisation of the prior PAC-Bayes bound into the design of the classifier. [sent-183, score-0.514]
41 4.1 Prior SVM. The new philosophy is implemented in the prior SVM by replacing the maximization of the margin in the optimization problem defining the original SVM with a term that pushes towards the tightening of the prior PAC-Bayes bound. [sent-186, score-0.867]
42 This subsection introduces the formulation of the new algorithm, a method to determine the classifier by means of off-the-shelf quadratic programming solvers, and a procedure to compute the prior PAC-Bayes bound for these new classifiers. [sent-187, score-0.514]
43 4.1.1 Formulation of the Prior SVMs. As stated before, the design criterion for the prior SVMs involves the minimization of the prior PAC-Bayes bound. [sent-190, score-0.772]
44 Let us consider the simplest case of the bound, that is, a single prior centered on ηwr , where wr is the unit vector weight of the SVM constructed with r training samples and η is a scalar fixed a priori. [sent-191, score-0.834]
45 wr = ∑_{l=m−r+1}^{m} y_l α_l φ(x_l). In such a case, a small bound on the error of the classifier is the result of a small value of ||ηwr − µw||², and a large value of the normalized margin of Equation (3) for the remaining training examples γ(x_i, y_i), i = 1, . . . [sent-194, score-0.663]
46 Once w is found through the solution of (13) with constraints (14), the proper bound on the average true error of the prior SVM can be obtained by means of a further tuning of µ (that is, using µw instead of w as the mean of the posterior distribution), where this last tuning will not change w. [sent-204, score-0.639]
47 Then we construct a mixture prior with J Gaussian components with identity covariance matrices centered at η j wr , with η j being J real positive constants. [sent-219, score-0.88]
48 For every element in the mixture we obtain a prior SVM classifier w^j solving min_{w^j, ξ_i} (1/2)||w^j − η_j wr||² + C ∑_{i=1}^{m−r} ξ_i subject to y_i φ(x_i)^T w^j ≥ 1 − ξ_i, i = 1, . . . [sent-221, score-0.834]
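For a linear kernel, this per-component optimization is a standard quadratic program; a minimal cvxpy sketch of the stated problem (ours, with the usual non-negativity constraint on the slacks made explicit) could look as follows.

```python
# Hedged sketch of the prior SVM QP for one mixture component (linear kernel):
#   min_{w, xi}  0.5*||w - eta_j*w_r||^2 + C*sum(xi)
#   s.t.         y_i <w, x_i> >= 1 - xi_i,   xi_i >= 0.
import cvxpy as cp
import numpy as np

def train_prior_svm(X_rest, y_rest, w_r, eta_j, C=1.0):
    n, d = X_rest.shape
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(0.5 * cp.sum_squares(w - eta_j * w_r) + C * cp.sum(xi))
    constraints = [cp.multiply(y_rest, X_rest @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value
```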
49 Afterwards, we obtain the bounds Q_D^j corresponding to the average true error of each one of the J prior SVMs by tuning µ (see Corollary 6). [sent-228, score-0.484]
50 We finally select as the prior SVM the w^j that yields the lowest bound Q_D^j. [sent-230, score-0.514]
51 It should be pointed out that each prior scaling (η j ) that is tried increases the computational burden of the training of the prior SVMs by an amount corresponding to an SVM problem with m − r data points. [sent-231, score-0.812]
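Putting the pieces together, the three-stage procedure can be sketched as a loop over the candidate scalings, using the illustrative helpers defined above (train_prior_svm and mixture_prior_bound are ours, not the paper's code).

```python
# Hedged sketch of the mixture-prior SVM selection: fit a posterior for each
# scaling eta_j, tune mu to tighten the bound, and keep the tightest pair.
import numpy as np

def select_prior_svm(X_rest, y_rest, w_r, etas, C=1.0, delta=0.05):
    mus = np.linspace(0.1, 10.0, 50)       # grid over the posterior scaling mu
    best = None
    for eta_j in etas:
        w = train_prior_svm(X_rest, y_rest, w_r, eta_j, C=C)
        w = w / np.linalg.norm(w)          # the bound sketch assumes unit w
        bound = min(mixture_prior_bound(X_rest, y_rest, w, mu, w_r, etas, delta)
                    for mu in mus)
        if best is None or bound < best[0]:
            best = (bound, eta_j, w)
    return best                            # (tightest bound, chosen eta_j, classifier)
```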
52 4.1.2 Computing the PAC-Bayes Bound for the Prior SVMs. The remainder of the section presents a method to compute the PAC-Bayes bound for a prior SVM obtained through the procedure described above. [sent-235, score-0.514]
53 To simplify notation we have introduced the non-unit weight vector w_{m−r} = w − ηwr, which includes the posterior part of the prior SVM. [sent-236, score-0.487]
54 The bound is based on the relationship between two distributions of classifiers: the prior P(wr , η) ∼ N (ηwr , I) and the posterior Q(w, µ) ∼ N (µw, I). [sent-237, score-0.633]
55 4.2 η-Prior SVM. When the prior SVM is learnt within a mixture-prior setting, the last stage of the optimization is the selection of the best prior-component/posterior pair among the J possibilities. [sent-241, score-0.563]
56 These prior-component/posterior pairs are denoted by (η_j, w^j), where η_j is the jth scaling of the normalized prior wr. [sent-242, score-0.763]
57 Therefore, instead of optimizing a posterior for every scaling of the prior, the optimal scaling and posterior given a normalized prior are the output of the same optimization problem. [sent-248, score-0.588]
58 The sequel is devoted to the derivation of the resulting algorithm, called the η-prior SVM, and to its analysis using the prior PAC-Bayes bound framework. [sent-249, score-0.533]
59 i = 1, . . . , m − r, where g_i = ∑_{k=m−r+1}^{m} α̃_k y_k κ(x_i, x_k) and the α̃_k are the normalized dual variables for the prior learnt from the last r samples, {x_k}_{k=m−r+1}^{m}. [sent-266, score-0.499]
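The joint optimization over the posterior part and the prior scaling can be written, for a linear kernel, roughly as the convex program below. This is a hedged reading of the formulation: the classifier is taken as w = v + η wr, g_i plays the role of the prior's score on x_i, and the exact objective and constraints are those of the paper, not this sketch.

```python
# Hedged sketch of the eta-prior SVM idea (linear kernel): optimize the
# posterior part v = w - eta*w_r and the prior scaling eta in one program,
# instead of looping over a fixed grid of eta_j values.
import cvxpy as cp
import numpy as np

def train_eta_prior_svm(X_rest, y_rest, w_r, C=1.0):
    n, d = X_rest.shape
    v = cp.Variable(d)                 # posterior part w_{m-r}
    eta = cp.Variable(nonneg=True)     # prior scaling, assumed non-negative here
    xi = cp.Variable(n, nonneg=True)
    g = X_rest @ w_r                   # linear-kernel analogue of the g_i scores
    objective = cp.Minimize(0.5 * cp.sum_squares(v) + C * cp.sum(xi))
    constraints = [cp.multiply(y_rest, X_rest @ v + eta * g) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return v.value + eta.value * w_r, eta.value
```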
60 4.2.2 Bounds for η-Prior SVMs. The statistical analysis of the η-prior SVMs can be performed using the τ-prior PAC-Bayes bound of Theorem 8 and the τ-expectation-prior PAC-Bayes bound. [sent-270, score-0.514]
61 Rather than take a spherically symmetric prior distribution we choose the variance in the direction of the prior vector to be τ² > 1. [sent-271, score-0.795]
62 As with the prior SVM analysis the mean of the prior distribution is also shifted from the origin in the direction wr . [sent-272, score-1.172]
63 With respect to the parameters needed by the prior PAC-Bayes bounds, the number of priors J and the amount of patterns separated to learn the prior, the experiments reported by Ambroladze et al. [sent-301, score-0.543]
64 5.2 Results and Discussion. The section starts by presenting an analysis of the performance of the SVM with the prior PAC-Bayes bounds introduced in this paper. [sent-320, score-0.46]
65 We show how in most cases the use of an informative prior leads to a significant tightening of the bounds on the true error of the classifier. [sent-321, score-0.546]
66 Finally, the prior SVM framework enables the use of a different value of the parameter C for the prior and the posterior, which can be tuned using the prior PAC-Bayes bound. [sent-325, score-1.158]
67 E prior PB: expectation-prior PAC-Bayes bound of Theorem 10. [sent-333, score-0.514]
68 τ-E prior PB: τ-expectation prior PAC-Bayes bound of Theorem 11. [sent-334, score-0.9]
69 In most of the cases, the bounds with an informative prior are tighter than the original PAC-Bayes bound with a spherical prior centred on the origin. [sent-337, score-1.124]
70 The expectation prior is significantly better on data sets wav and pim, whilst the prior PAC-Bayes and the τ-prior PAC-Bayes bounds are tighter on problems rin and spa. [sent-338, score-1.036]
71 Moreover, an examination of the slopes of the plots corresponding to the bounds shows that those that learn the prior from a separate training set converge faster than the original PAC-Bayes and the expectation-prior PAC-Bayes bounds. [sent-340, score-0.918]
72 However, the experimental results show that it is better to devote those separate training patterns to acquire a more informative prior than to increase the weight of the denominator in the penalty term. [sent-342, score-0.51]
73 2 A NALYSIS OF P RIOR SVM AND η-P RIOR SVM We repeated the study on the new algorithms, prior SVM and η-prior SVM, which are designed to actually optimise prior PAC-Bayes bounds. [sent-441, score-0.772]
74 The classifier-bound configurations considered for this study were the following: prior SVM + Prior PB: prior SVM described on page 14 and mixture-prior PAC-Bayes bound of Corollary 7 with J = 10 priors. [sent-442, score-1.011]
75 η-prior SVM + Prior PB: η-prior SVM and mixture-prior PAC-Bayes bound of Corollary 7, considering that η comes from a mixture-prior setting of J = 50 components η_j wr, with the η_j equally spaced between η_1 = 1 and η_50 = 100. [sent-507, score-0.915]
76 This setting minimizes the penalty term in the prior PAC-Bayes bound as we are not actually using these components to learn the posterior. [sent-508, score-0.514]
77 As baseline results we include the best bounds found in the analysis of the SVM: τ-Prior PB: τ-prior PAC-Bayes bound of Theorem 8 with J = 10 and τ = 50. [sent-510, score-0.588]
78 In general, the bounds achieved on prior SVM and η-prior SVM are significantly tighter than the bounds on the SVM, with the mixture-prior PAC-Bayes bound on the prior SVM being the tightest result. [sent-514, score-1.131]
79 001 Table 3: Values of the bounds on the prior SVM and η-prior SVM classifiers. [sent-565, score-0.46]
80 Figure 3: Bounds when prior and posterior have a different value of C. [sent-627, score-0.487]
81 Notice that in most of the configurations where the prior is learnt from a separate set the new bounds achieve a significant cut in the value of the PAC-Bayes bound, which indicates that learning an informative prior distribution helps to tighten the PAC-Bayes bound. [sent-628, score-0.97]
82 Furthermore, the two-stage training of the prior SVM and the η-prior SVM enables the use of a different value of C for the prior and posterior classifiers. [sent-629, score-0.913]
83 The intuition behind this proposal is that once the prior is fixed, the posterior could possibly accept a higher value of C without overfitting. [sent-630, score-0.487]
84 005 Table 4: Values of the bounds on the prior SVM and η-prior SVM classifiers when different values of C are used for prior and posterior. [sent-691, score-0.846]
85 006 Table 5: Test error rates achieved by prior SVM and η-prior SVM classifiers when the hyperparameters are those that minimise a PAC Bayes bound. [sent-762, score-0.463]
86 To evaluate the goodness of this modification, we carried out again the experiments in this subsection but now allowing the prior and posterior to take different values of C from within the range proposed at the beginning of the section. [sent-764, score-0.487]
87 Conclusions. In this paper we have presented some strategies to tighten the already tight PAC-Bayes bound for binary classifiers by learning an informative prior distribution on classifiers. [sent-789, score-0.564]
88 The first strategy, named the prior PAC-Bayes bound, considers an identity covariance matrix. [sent-791, score-0.448]
89 This prior can be further refined in the τ-prior PAC-Bayes bound case, where this direction is also used to stretch the covariance matrix. [sent-793, score-0.574]
90 The second strategy, named the expectation-prior PAC-Bayes bound, also considers identity covariances, but expresses the direction in which to place the prior as a statistic of the training data distribution and uses all the training samples to estimate this statistic. [sent-794, score-1.028]
91 The expectation prior can also be refined by stretching the covariance along the direction of the mean, yielding the τ-expectation-prior PAC-Bayes bound. [sent-795, score-0.832]
92 The experimental work shows that these prior PAC-Bayes bounds achieve estimations of the expected true error of SVMs significantly tighter than those obtained with the original PAC-Bayes bound. [sent-796, score-0.556]
93 It is remarkable that the prior PAC-Bayes bounds improve the tightness of the PAC-Bayes bound even when the size of the training set experiences reductions of up to 80% of its size. [sent-797, score-0.628]
94 The structure of the prior PAC-Bayes bound (learn a prior classifier using some data and then consider the SVM to be a posterior classifier) inspired the design of new algorithms to train SVM-like classifiers. [sent-798, score-0.873]
95 The prior SVM proposes a set of prior parts (fixed scalings along a prior direction learnt with separate data) and then fits a posterior part to each prior. [sent-799, score-1.383]
96 The η-prior SVM learns the scaling of the prior part and the posterior in the same quadratic program, thus significantly reducing the computational burden of the training. [sent-801, score-0.487]
97 The analysis of these classifiers under the prior PACBayes framework shows that the achieved bounds are dramatically tighter than those obtained for the original SVM under the same framework. [sent-802, score-0.513]
98 Moreover, the prior SVM enables the use of different values of the regularisation constant C for both prior and posterior parts, which further tightens the bounds. [sent-804, score-0.873]
99 We find the study of ways of extracting relevant prior domain knowledge from the available data and incorporating such knowledge in the form of the prior distribution to be a really promising line of research. [sent-807, score-0.772]
100 Once (26) is solved, the overall prior SVM classifier w can be retrieved from (20): w = ∑_{i=1}^{m−r} α_i y_i φ(x_i) + η ∑_{k=m−r+1}^{m} α̃_k y_k φ(x_k). [sent-848, score-0.457]
wordName wordTfidf (topN-words)
[('prior', 0.386), ('wr', 0.377), ('pb', 0.275), ('qd', 0.268), ('svm', 0.248), ('pac', 0.239), ('qs', 0.19), ('kl', 0.154), ('prs', 0.15), ('hawe', 0.139), ('mbroladze', 0.129), ('parrado', 0.129), ('wav', 0.129), ('bound', 0.128), ('ln', 0.126), ('bayes', 0.124), ('priors', 0.111), ('ependent', 0.11), ('andez', 0.11), ('ern', 0.11), ('pim', 0.107), ('ounds', 0.107), ('posterior', 0.101), ('riors', 0.091), ('dataset', 0.087), ('ambroladze', 0.086), ('spa', 0.086), ('rin', 0.082), ('svms', 0.079), ('wp', 0.076), ('classi', 0.074), ('bounds', 0.074), ('rior', 0.058), ('ers', 0.058), ('pj', 0.057), ('er', 0.055), ('pwr', 0.054), ('tighter', 0.053), ('hyperparameters', 0.053), ('langford', 0.053), ('han', 0.051), ('un', 0.051), ('spherical', 0.044), ('yk', 0.043), ('learnt', 0.042), ('pw', 0.041), ('training', 0.04), ('covariance', 0.037), ('tightening', 0.037), ('cu', 0.033), ('yl', 0.033), ('margin', 0.033), ('crossvalidated', 0.032), ('shiliang', 0.032), ('slt', 0.032), ('separate', 0.032), ('centered', 0.031), ('tightest', 0.03), ('xk', 0.028), ('centred', 0.028), ('yi', 0.028), ('gaussian', 0.027), ('scalings', 0.027), ('supx', 0.027), ('patterns', 0.027), ('informative', 0.025), ('tighten', 0.025), ('pushes', 0.025), ('misclassifying', 0.025), ('ndez', 0.025), ('identity', 0.025), ('divergence', 0.024), ('wt', 0.024), ('mixture', 0.024), ('cs', 0.024), ('error', 0.024), ('pu', 0.023), ('direction', 0.023), ('predictions', 0.021), ('amiran', 0.021), ('emilio', 0.021), ('equiprobable', 0.021), ('germain', 0.021), ('pacbayes', 0.021), ('tbilisi', 0.021), ('xl', 0.021), ('ensemble', 0.021), ('fold', 0.021), ('wm', 0.02), ('corollary', 0.02), ('cd', 0.02), ('separated', 0.019), ('estimations', 0.019), ('ec', 0.019), ('theorem', 0.019), ('subject', 0.019), ('devoted', 0.019), ('capabilities', 0.018), ('distributions', 0.018), ('xi', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
2 0.099244691 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
3 0.093944304 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
4 0.084697656 80 jmlr-2012-On Ranking and Generalization Bounds
Author: Wojciech Rejchel
Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process
5 0.07803826 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
6 0.062215295 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
7 0.060700893 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
8 0.057206463 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.056771994 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
10 0.05184735 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
11 0.051272701 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
12 0.051020846 13 jmlr-2012-Active Learning via Perfect Selective Classification
13 0.050553657 111 jmlr-2012-Structured Sparsity and Generalization
14 0.045505755 91 jmlr-2012-Plug-in Approach to Active Learning
15 0.044974253 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.044856638 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
17 0.044202834 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
18 0.043380704 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
19 0.043313399 73 jmlr-2012-Multi-task Regression using Minimal Penalties
20 0.042957053 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
topicId topicWeight
[(0, -0.206), (1, 0.038), (2, -0.01), (3, 0.069), (4, 0.074), (5, -0.05), (6, 0.122), (7, 0.097), (8, -0.119), (9, 0.063), (10, -0.004), (11, 0.003), (12, 0.068), (13, -0.0), (14, 0.055), (15, -0.141), (16, 0.203), (17, 0.14), (18, 0.048), (19, 0.071), (20, 0.065), (21, 0.077), (22, -0.021), (23, 0.077), (24, -0.061), (25, 0.059), (26, -0.162), (27, -0.103), (28, 0.144), (29, 0.032), (30, 0.051), (31, 0.032), (32, 0.089), (33, -0.084), (34, 0.063), (35, -0.084), (36, -0.136), (37, -0.063), (38, -0.121), (39, -0.027), (40, -0.057), (41, -0.047), (42, 0.1), (43, 0.09), (44, -0.118), (45, -0.012), (46, -0.025), (47, 0.129), (48, -0.129), (49, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.9698568 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
2 0.62735581 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
3 0.53824329 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
4 0.50389779 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski
Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T 2 ) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions
5 0.45029333 80 jmlr-2012-On Ranking and Generalization Bounds
Author: Wojciech Rejchel
Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process
6 0.40176153 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
7 0.3966662 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
8 0.36863872 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
9 0.35998347 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
10 0.35436985 97 jmlr-2012-Regularization Techniques for Learning with Matrices
11 0.35197118 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
12 0.34904304 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
13 0.30747569 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.29487774 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
15 0.28981078 13 jmlr-2012-Active Learning via Perfect Selective Classification
16 0.28597841 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
17 0.28284025 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
18 0.27939552 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
19 0.26963314 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
20 0.25806284 111 jmlr-2012-Structured Sparsity and Generalization
topicId topicWeight
[(4, 0.356), (7, 0.013), (21, 0.024), (26, 0.054), (29, 0.038), (35, 0.015), (49, 0.066), (56, 0.013), (57, 0.015), (64, 0.018), (75, 0.073), (79, 0.012), (81, 0.012), (92, 0.083), (96, 0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.74479604 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
2 0.45050246 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
3 0.4330354 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.43273532 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.43142858 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan
Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods
6 0.43087769 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
7 0.42883885 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
8 0.42750639 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.42723548 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
10 0.42535949 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
11 0.42257547 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
12 0.4215042 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
13 0.42083907 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.41950941 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
15 0.41891634 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
16 0.41860607 13 jmlr-2012-Active Learning via Perfect Selective Classification
17 0.41828918 80 jmlr-2012-On Ranking and Generalization Bounds
18 0.41662771 103 jmlr-2012-Sampling Methods for the Nyström Method
19 0.41609031 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
20 0.41564956 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices