nips nips2009 nips2009-98 knowledge-graph by maker-knowledge-mining

98 nips-2009-From PAC-Bayes Bounds to KL Regularization


Source: pdf

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors that we call quasi-uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi-uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. [sent-4, score-0.961]

2 By restricting ourselves to a class of posteriors that we call quasi-uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. [sent-5, score-0.277]

3 We show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi-uniform posterior and the uniform prior. [sent-6, score-0.533]

4 We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. [sent-7, score-0.106]

5 The universally accepted guarantee on the true risk, however, always comes with a so-called risk bound that holds uniformly over a set of classifiers. [sent-10, score-0.237]

6 Since a risk bound can be computed from what a classifier achieves on the training data, it automatically suggests that learning algorithms should find a classifier that minimizes a tight risk (upper) bound. [sent-11, score-0.442]

7 Among the data-dependent bounds that have been proposed recently, the PAC-Bayes bounds [6, 8, 4, 1, 3] seem to be especially tight. [sent-12, score-0.06]

8 In that respect, [4, 5, 3] have proposed to use isotropic Gaussian posteriors over the space of linear classifiers. [sent-14, score-0.059]

9 But a computational drawback of this approach is the fact that the Gibbs empirical risk is not a quasi-convex function of the parameters of the posterior. [sent-15, score-0.178]

10 Consequently, the resultant PAC-Bayes bound may have several local minima for certain data sets—thus giving an intractable optimization problem in the general case. [sent-16, score-0.059]

11 To avoid such computational problems, we propose here to use convex loss functions for stochastic Gibbs classifiers that upper-bound the standard zero-one loss used for the weighted majority vote. [sent-17, score-0.547]

12 By restricting ourselves to a class of posteriors that we call quasi-uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. [sent-18, score-0.277]

13 We show that there is no loss of discriminative power in restricting the posterior to be quasi-uniform. [sent-19, score-0.428]

14 We also show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi-uniform posterior and the uniform prior. [sent-20, score-0.533]

15 We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost [7]. [sent-21, score-0.106]

16 The risk R(h) of any classifier h : X → Y is defined as the probability that h misclassifies an example drawn according to D. [sent-25, score-0.178]

17 Given a training set S of m examples, the empirical risk RS (h) of any classifier h is defined by the frequency of training errors of h on S. [sent-26, score-0.178]

18 Hence R(h) ≝ E_{(x,y)∼D} I(h(x) ≠ y) and RS(h) ≝ (1/m) Σ_{i=1}^m I(h(xi) ≠ yi), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-27, score-0.742]

19 After observing the training set S, the task of the learner is to choose a posterior distribution Q over a space H of classifiers such that the Q-weighted majority vote classifier BQ will have the smallest possible risk. [sent-28, score-0.374]

20 On any input example x, the output BQ(x) of the majority vote classifier BQ (sometimes called the Bayes classifier) is given by BQ(x) ≝ sgn(E_{h∼Q} h(x)), where sgn(s) = +1 if s > 0 and sgn(s) = −1 otherwise. [sent-29, score-0.758]

21 The output of the deterministic majority vote classifier BQ is closely related to the output of a stochastic classifier called the Gibbs classifier GQ . [sent-30, score-0.403]

22 The true risk R(GQ) and the empirical risk RS(GQ) of the Gibbs classifier are thus given by R(GQ) = E_{h∼Q} R(h) ; RS(GQ) = E_{h∼Q} RS(h). [sent-32, score-0.356]
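
To make the distinction concrete, here is a minimal sketch (not from the paper) that computes both quantities on a finite sample; `votes`, `Q` and `y` are hypothetical names for the classifier outputs, the posterior weights and the labels.

```python
import numpy as np

def majority_vote_risk(votes, Q, y):
    """Zero-one risk of B_Q(x) = sgn(E_{h~Q} h(x)).

    votes[j, i] = h_j(x_i) in {-1, +1}; Q[j] = posterior weight of h_j;
    y[i] = label of example i in {-1, +1}."""
    margin = Q @ votes                          # E_{h~Q} h(x_i) for each i
    predictions = np.where(margin > 0, 1, -1)   # sgn, with sgn(0) = -1
    return np.mean(predictions != y)

def gibbs_risk(votes, Q, y):
    """R_S(G_Q) = E_{h~Q} R_S(h): the Q-weighted average of individual risks."""
    per_classifier_risk = np.mean(votes != y, axis=1)   # R_S(h_j) for each j
    return float(Q @ per_classifier_risk)
```

With many weak classifiers, gibbs_risk stays near 1/2 even when majority_vote_risk is zero, which is exactly the motivation given below for replacing the zero-one Gibbs loss by a convex surrogate.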

23 Any bound for R(GQ) can straightforwardly be turned into a bound for the risk of the majority vote R(BQ). [sent-33, score-0.642]

24 3 PAC-Bayes Bounds and General Loss Functions: In this paper, we use the following PAC-Bayes bound, which is obtained directly from Theorem 1. [sent-38, score-0.059]

25 Note that the dependence on Q of the upper bound on R(GQ) is realized via Gibbs’ empirical risk RS(GQ) and the PAC-Bayes regularizer KL(Q‖P). [sent-45, score-0.281]

26 As in boosting, we focus on the case where the a priori defined class H consists (mostly) of “weak” classifiers having large risk R(h). [sent-46, score-0.178]

27 In this case, R(GQ ) is (almost) always large (near 1/2) for any Q even if the majority vote BQ has null risk. [sent-47, score-0.305]

28 One way to obtain a more relevant bound on R(BQ) from PAC-Bayes theory is to use a loss function ζQ(x, y) for stochastic classifiers which is distinct from the loss used for the deterministic classifiers (the zero-one loss in our case). [sent-49, score-0.592]

29 In order to obtain a tractable optimization problem for a learning algorithm to solve, we propose here to use a loss ζQ (x, y) which is convex in Q and that upper-bounds as closely as possible the zero-one loss of the deterministic majority vote BQ . [sent-50, score-0.699]

30 Consider WQ(x, y) ≝ E_{h∼Q} I(h(x) ≠ y), the Q-fraction of binary classifiers that err on example (x, y). [sent-51, score-0.334]

31 Following [2], we consider any non-negative convex loss ζQ(x, y) that can be expanded in a Taylor series around WQ(x, y) = 1/2: ζQ(x, y) ≝ 1 + Σ_{k=1}^∞ a_k (2WQ(x, y) − 1)^k = 1 + Σ_{k=1}^∞ a_k (E_{h∼Q} −y h(x))^k, and that upper-bounds the zero-one loss of the majority vote BQ, i.e. ζQ(x, y) ≥ I(BQ(x) ≠ y). [sent-53, score-1.219]
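
As an illustration only, the following sketch evaluates a truncated version of this Taylor expansion; the coefficient sequence `a` is a free input here and must be chosen (as in [2]) so that the resulting loss actually dominates the zero-one loss of BQ.

```python
def zeta_loss(W_Q, a):
    """Evaluate zeta_Q(x, y) = 1 + sum_{k>=1} a_k (2 W_Q - 1)^k, truncated
    to the coefficients a = [a_1, ..., a_K] supplied by the caller.
    W_Q is the Q-fraction of classifiers erring on the example."""
    t = 2.0 * W_Q - 1.0
    return 1.0 + sum(a_k * t ** (k + 1) for k, a_k in enumerate(a))
```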

32 It has been shown [2] that ζQ(x, y) can be expressed in terms of the risk on example (x, y) of a Gibbs classifier described by a transformed posterior Q̄ on N × H^∞, i. [sent-56, score-0.215]

33 e., ζQ(x, y) = 1 + c_a (2W̄Q(x, y) − 1), where c_a ≝ Σ_{k=1}^∞ |a_k| and W̄Q(x, y) ≝ (1/c_a) Σ_{k=1}^∞ |a_k| E_{h1∼Q} ⋯ E_{hk∼Q} [zero-one loss on (x, y) of the product classifier sgn(a_k)·h1(x)⋯hk(x)]. [sent-58, score-0.896]

34 Since W̄Q(x, y) is the expectation of a Boolean random variable, Theorem 3. [sent-65, score-0.122]

35 1 holds if we replace (P, Q) by (P̄, Q̄), with R(ḠQ) ≝ E_{(x,y)∼D} W̄Q(x, y) and RS(ḠQ) ≝ (1/m) Σ_{i=1}^m W̄Q(xi, yi). [sent-66, score-0.742]

36 Moreover, it has been shown [2] that KL(Q̄‖P̄) = k̄ · KL(Q‖P), where k̄ ≝ (1/c_a) Σ_{k=1}^∞ |a_k| · k. [sent-67, score-0.41]

37 If we define ζ̄Q ≝ E_{(x,y)∼D} ζQ(x, y) = 1 + c_a [2R(ḠQ) − 1] and ζ̂Q ≝ (1/m) Σ_{i=1}^m ζQ(xi, yi) = 1 + c_a [2RS(ḠQ) − 1], Theorem 3. [sent-68, score-0.894]

38 1 gives an upper bound on ζ̄Q and, consequently, on the true risk R(BQ) of the majority vote. [sent-69, score-0.397]

39 For any D, any H, any P of support H, any δ ∈ (0, 1], any positive real number C′, and any loss function ζQ(x, y) defined above, we have Pr_{S∼D^m}( ∀Q on H: ζ̄Q ≤ g(c_a, C′) + [C′/(1 − e^{−C′})] · ( ζ̂Q + (2c_a/(mC′)) · ( k̄·KL(Q‖P) + ln(1/δ) ) ) ) ≥ 1 − δ, where g(c_a, C′) ≝ 1 − c_a + [C′/(1 − e^{−C′})] · (c_a − 1). [sent-73, score-0.689]
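
The bound above is stated with reconstructed constants; as a hedged sketch only, it could be evaluated as below, where every argument name is ours and the exact form of g and of the ln(1/δ) term should be checked against the published theorem.

```python
import numpy as np

def pac_bayes_bound(zeta_hat, kl, m, c_a, k_bar, C_prime, delta):
    """Upper bound of Theorem 3.2 as reconstructed above.

    zeta_hat: empirical loss; kl: KL(Q||P); m: sample size;
    c_a, k_bar: loss-dependent constants; C_prime: free parameter."""
    ratio = C_prime / (1.0 - np.exp(-C_prime))
    g = 1.0 - c_a + ratio * (c_a - 1.0)
    complexity = 2.0 * c_a / (m * C_prime) * (k_bar * kl + np.log(1.0 / delta))
    return g + ratio * (zeta_hat + complexity)
```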

40 4 Bound Minimization Learning Algorithms: The task of the learner is to find the posterior Q that minimizes the upper bound on ζ̄Q for a fixed loss function given by the coefficients {a_k}_{k=1}^∞ of the Taylor series expansion of ζQ(x, y). [sent-74, score-0.339]

41 Finding the Q that minimizes the upper bound given by Theorem 3. [sent-75, score-0.109]

42 2 is equivalent to finding the Q that minimizes f(Q) ≝ C Σ_{i=1}^m ζQ(xi, yi) + KL(Q‖P), where C ≝ C′/(2 c_a k̄). [sent-76, score-0.769]

43 For this choice of loss, we have c_a = 2γ⁻¹ + γ⁻² and k̄ = (2γ + 2)/(2γ + 1). [sent-80, score-0.076]

44 Note that this loss has the minimum value of zero for examples having a margin y Σ_{h∈H} Q(h) h(x) = γ. [sent-81, score-0.161]

45 With these two choices of loss functions, ζQ (x, y) is convex in Q. [sent-82, score-0.205]
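
The two losses can be written as functions of the signed margin s = y w·h(x); the sketch below (names ours) uses the normalization implied by the derivatives given later, with the quadratic loss vanishing exactly at s = γ, and checks numerically that both dominate the zero-one loss of the vote, which errs iff s ≤ 0.

```python
import numpy as np

def quadratic_loss(s, gamma):
    return (s / gamma - 1.0) ** 2     # zero exactly at margin s = gamma

def exponential_loss(s, gamma):
    return np.exp(-s / gamma)

s = np.linspace(-1.0, 1.0, 201)
zero_one = (s <= 0).astype(float)     # zero-one loss of the majority vote
assert np.all(quadratic_loss(s, 0.5) >= zero_one)
assert np.all(exponential_loss(s, 0.5) >= zero_one)
```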

46 Since a sum of convex functions is also convex, it follows that the objective function f is convex in Q (which has a convex domain). [sent-84, score-0.189]

47 Since Q must remain a probability distribution (i.e., Σ_{h∈H} Q(h) = 1), each coordinate minimization will consist of a transfer of weight from one classifier to another. [sent-89, score-0.045]

48 1 Quasi-Uniform Posteriors: We consider learning algorithms that work in a space H of binary classifiers such that for each h ∈ H, the boolean complement of h is also in H. [sent-91, score-0.174]

49 Hence H = {h1, …, h2n}, where hi(x) = −h_{n+i}(x) ∀x ∈ X and ∀i ∈ {1, …, n}. [sent-98, score-0.075]

50 We thus say that (hi , hn+i ) constitutes a boolean complement pair of classifiers. [sent-102, score-0.194]

51 We consider a uniform prior distribution P over H, i.e., Pi = 1/(2n) ∀i ∈ {1, …, 2n}. [sent-103, score-0.047]

52 The posterior distribution Q over H is constrained to be quasi-uniform. [sent-109, score-0.233]

53 I.e., the total weight assigned to each boolean complement pair of classifiers is fixed to 1/n: Qi + Q_{n+i} = 1/n, and we define wi ≝ Qi − Q_{n+i} ∀i ∈ {1, …, n}. [sent-115, score-0.528]

54 For any quasi-uniform Q, the output BQ(x) of the majority vote on any example x is given by BQ(x) = sgn( Σ_{i=1}^{2n} Qi hi(x) ) = sgn( Σ_{i=1}^n wi hi(x) ) ≝ sgn( w · h(x) ). [sent-127, score-1.392]

55 Consequently, the set of majority votes BQ over quasi-uniform posteriors is isomorphic to the set of linear separators with real weights. [sent-128, score-0.439]

56 There is thus no loss of discriminative power if we restrict ourselves to quasi-uniform posteriors. [sent-129, score-0.404]
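
The correspondence is easy to make explicit; the following sketch (our notation, with Q stored as a length-2n vector) converts between a quasi-uniform posterior and its weight vector w, using wi = Qi − Q_{n+i} as read off the equation above.

```python
import numpy as np

def weights_from_posterior(Q, n):
    """Map a quasi-uniform Q (length 2n, Q_i + Q_{n+i} = 1/n) to w."""
    assert np.allclose(Q[:n] + Q[n:], 1.0 / n)
    return Q[:n] - Q[n:]

def posterior_from_weights(w, n):
    """Inverse map; requires |w_i| <= 1/n so that all Q_i are >= 0."""
    return np.concatenate([(1.0 / n + w) / 2.0, (1.0 / n - w) / 2.0])
```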

57 Since all loss functions that we consider are functions of 2WQ(x, y) − 1 = −y Σ_i Qi hi(x), they are thus functions of y w · h(x). [sent-130, score-0.341]

58 The basic iteration for the learning algorithm consists of choosing (at random) a boolean complement pair of classifiers, call it (h1, h_{n+1}), and then attempting to change only Q1, Q_{n+1} and w1 according to Q1 ← Q1 + δ/2 ; Q_{n+1} ← Q_{n+1} − δ/2 ; w1 ← w1 + δ, (1) for some optimally chosen value of δ. [sent-132, score-0.194]

59 Let Qδ and wδ be, respectively, the new posterior and the new weight vector obtained with such a change. [sent-133, score-0.037]

60 The above-mentioned convexity properties of the objective function f imply that we only need to look for the value δ* satisfying df(Qδ)/dδ = 0. [sent-134, score-0.104]

61 For the objective function f we simply have df(Qδ)/dδ = Cm · dζ̂Qδ/dδ + dKL(Qδ‖P)/dδ, (3) where dKL(Qδ‖P)/dδ = (d/dδ)[ (Q1 + δ/2) ln( (Q1 + δ/2)/(1/2n) ) + (Q_{n+1} − δ/2) ln( (Q_{n+1} − δ/2)/(1/2n) ) ] = (1/2) ln( (Q1 + δ/2)/(Q_{n+1} − δ/2) ). (4) [sent-138, score-0.414]

62 For the quadratic loss, we find m · dζ̂Qδ/dδ = 2mδ/γ² + (2/γ²) Σ_{i=1}^m D^ql_w(i) yi h1(xi), (5) where D^ql_w(i) ≝ yi w · h(xi) − γ. (6) [sent-139, score-0.711]

63 Consequently, for the quadratic loss case, the optimal value δ* satisfies 2Cmδ/γ² + (2C/γ²) Σ_{i=1}^m D^ql_w(i) yi h1(xi) + (1/2) ln( (Q1 + δ/2)/(Q_{n+1} − δ/2) ) = 0. (7) [sent-140, score-0.456]

64 For the exponential loss, we find m · dζ̂Qδ/dδ = (e^{δ/γ}/γ) Σ_{i=1}^m D^el_w(i) I(h1(xi) ≠ yi) − (e^{−δ/γ}/γ) Σ_{i=1}^m D^el_w(i) I(h1(xi) = yi), (8) where D^el_w(i) ≝ exp( −(1/γ) yi w · h(xi) ). (9) [sent-141, score-0.429]

65 Consequently, for the exponential loss case, the optimal value δ* satisfies (C e^{δ/γ}/γ) Σ_{i=1}^m D^el_w(i) I(h1(xi) ≠ yi) − (C e^{−δ/γ}/γ) Σ_{i=1}^m D^el_w(i) I(h1(xi) = yi) + (1/2) ln( (Q1 + δ/2)/(Q_{n+1} − δ/2) ) = 0. (10) [sent-142, score-1.054]
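
Both optimality conditions are one-dimensional root-finding problems in δ. Here is a sketch for the exponential-loss case (our code, assuming SciPy is available): the bracket comes from the constraints Q1 + δ/2 ≥ 0 and Q_{n+1} − δ/2 ≥ 0, at whose endpoints the logarithmic term diverges to −∞ and +∞ respectively, so a root always lies strictly inside.

```python
import numpy as np
from scipy.optimize import brentq

def solve_delta_exponential(D, err_mask, Q1, Qn1, C, gamma, eps=1e-12):
    """Solve Equation 10 for delta*.  D[i] = D^el_w(i); err_mask[i] is True
    iff h1(x_i) != y_i; Q1, Qn1 are the current weights of the chosen pair."""
    S_err = np.sum(D[err_mask])
    S_ok = np.sum(D[~err_mask])

    def lhs(delta):                      # left-hand side of Equation 10
        return (C * np.exp(delta / gamma) / gamma * S_err
                - C * np.exp(-delta / gamma) / gamma * S_ok
                + 0.5 * np.log((Q1 + delta / 2) / (Qn1 - delta / 2)))

    # eps keeps the evaluation strictly inside the feasible interval;
    # shrink it if brentq reports equal signs at the endpoints.
    return brentq(lhs, -2 * Q1 + eps, 2 * Qn1 - eps)
```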

66 After each such update of w1, the D values follow from D^ql_w(i) ← D^ql_w(i) + yi h1(xi) δ (quadratic loss case) (11) and D^el_w(i) ← D^el_w(i) e^{−(1/γ) yi h1(xi) δ} (exponential loss case). (12)

67 Since, initially, we have D^ql_w(i) = −γ ∀i ∈ {1, …, m} (quadratic loss case) (13)

68 and D^el_w(i) = 1 ∀i ∈ {1, …, m} (exponential loss case), (14)

69 the dot product present in Equations 6 and 9 never needs to be computed. [sent-155, score-0.161]

70 Footnote 1: D_w(i) stands for either D^ql_w(i) or D^el_w(i). [sent-161, score-0.212]

71 Algorithm 1: f minimization. 1: Initialization: Let Qi = Q_{n+i} = 1/(2n) and wi = 0, ∀i ∈ {1, …, n}. [sent-162, score-0.072]

72 2: repeat 3: Choose at random h ∈ H and call it h1 (h_{n+1} is then the boolean complement of h1). [sent-167, score-0.174]

73 We first shuffle the n boolean complement pairs of classifiers at random and then go sequentially over each pair (hi, h_{n+i}) to update wi and D_w. [sent-174, score-0.245]
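
Putting the pieces together, here is a hedged reconstruction of Algorithm 1 for the quadratic loss, using the update rules numbered (1), (7), (11) and (13) as reconstructed above; the data layout and epoch count are our own choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import brentq

def kl_ql_minimize(votes, y, C, gamma, n_epochs=100, eps=1e-12):
    """Coordinate descent on f(Q) for the quadratic loss.

    votes[i, j] = h_i(x_j) in {-1, +1} for the n base classifiers
    (their boolean complements are -votes[i]); y[j] in {-1, +1}."""
    n, m = votes.shape
    Q = np.full(2 * n, 1.0 / (2 * n))   # initialization: Q_i = Q_{n+i} = 1/2n
    w = np.zeros(n)
    D = np.full(m, -gamma)              # Equation 13: y w.h(x) - gamma at w = 0

    for _ in range(n_epochs):
        for i in np.random.permutation(n):   # one epoch over the n pairs
            yh = y * votes[i]                # y_j h_i(x_j)

            def lhs(delta):                  # left-hand side of Equation 7
                return (2 * C * m * delta / gamma ** 2
                        + 2 * C / gamma ** 2 * np.dot(D, yh)
                        + 0.5 * np.log((Q[i] + delta / 2)
                                       / (Q[n + i] - delta / 2)))

            delta = brentq(lhs, -2 * Q[i] + eps, 2 * Q[n + i] - eps)
            Q[i] += delta / 2                # weight transfer, Equation 1
            Q[n + i] -= delta / 2
            w[i] += delta
            D += yh * delta                  # Equation 11
    return w, Q
```

For the exponential loss, the same loop applies with Equation 10 in place of Equation 7 and the multiplicative update (12) in place of (11).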

74 2 From KL(Q‖P) to ℓ2 Regularization: We can recover ℓ2 regularization if we upper-bound KL(Q‖P) by a quadratic function. [sent-177, score-0.074]

75 Indeed, if q ln q + (1/n − q) ln(1/n − q) ≤ 2·(1/(2n)) ln(1/(2n)) + 4n (q − 1/(2n))² ∀q ∈ [0, 1/n], (15) we obtain, for the uniform prior Pi = 1/(2n), KL(Q‖P) = ln(2n) + Σ_{i=1}^n [ Qi ln Qi + (1/n − Qi) ln(1/n − Qi) ] ≤ 4n Σ_{i=1}^n (Qi − 1/(2n))² = n Σ_{i=1}^n wi². (16) [sent-178, score-0.688]

76 With this approximation, the objective function to minimize becomes f_{ℓ2}(w) = C Σ_{i=1}^m ζ( (1/γ) yi w · h(xi) ) + n ‖w‖₂², (17) subject to the ℓ∞ constraint |wj| ≤ 1/n ∀j ∈ {1, …, n}. [sent-179, score-0.132]

77 Here ‖w‖₂ denotes the Euclidean norm of w, and ζ(x) = (x − 1)² for the quadratic loss and e^{−x} for the exponential loss. [sent-183, score-0.247]

78 If, instead, we minimize f_{ℓ2} for v ≝ w/γ and remove the ℓ∞ constraint, we recover exactly ridge regression for the quadratic loss case and ℓ2-regularized boosting for the exponential loss case. [sent-184, score-0.922]

79 We can obtain an ℓ1-regularized version of Equation 17 by repeating the above steps and using 4n (q − 1/(2n))² ≤ 2 |q − 1/(2n)| ∀q ∈ [0, 1/n], since, in that case, we find that KL(Q‖P) ≤ Σ_{i=1}^n |wi| ≝ ‖w‖₁. [sent-185, score-0.334]
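
Both relaxations are easy to sanity-check numerically; the snippet below (ours) draws a random quasi-uniform posterior against the uniform prior Pi = 1/(2n) and verifies KL(Q‖P) ≤ n‖w‖₂² and KL(Q‖P) ≤ ‖w‖₁.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
Qi = rng.uniform(0.0, 1.0 / n, size=n)        # random quasi-uniform posterior
Q = np.concatenate([Qi, 1.0 / n - Qi])
w = Q[:n] - Q[n:]

P = np.full(2 * n, 1.0 / (2 * n))
mask = Q > 0                                   # 0 log 0 = 0 convention
kl = float(np.sum(Q[mask] * np.log(Q[mask] / P[mask])))

assert kl <= n * np.sum(w ** 2) + 1e-9         # the l2 relaxation (16)
assert kl <= np.sum(np.abs(w)) + 1e-9          # the l1 relaxation
```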

80 To sum up, the KL-regularized objective function f immediately follows from PAC-Bayes theory, and ℓp regularization is obtained from a relaxation of f. [sent-186, score-0.081]

81 Consequently, PAC-Bayes theory favors the use of KL regularization if the goal of the learner is to produce a weighted majority vote with good generalization. [sent-187, score-0.36]

82 Interestingly, [9] has recently proposed a KL-regularized version of LPBoost, but their objective function was not derived from a uniform risk bound. [sent-188, score-0.26]

83 5 Empirical Results: For the sake of comparison, all learning algorithms of this section produce a weighted majority vote classifier on the set of basis functions {h1, …, hn}. [sent-189, score-0.327]

84 Each decision stump hi is a threshold classifier that depends on a single attribute: its output is +b if the tested attribute exceeds a threshold value t, and −b otherwise, where b ∈ {−1, +1}. [sent-193, score-0.177]
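
A decision stump as described is a one-line classifier; here is a minimal sketch with illustrative parameter names a, t and b.

```python
import numpy as np

def decision_stump(X, a, t, b):
    """Output +b where attribute X[:, a] exceeds threshold t, -b elsewhere;
    b in {-1, +1}.  The pair (b = +1, b = -1) at the same (a, t) forms a
    boolean complement pair as required by Algorithm 1."""
    return np.where(X[:, a] > t, b, -b)
```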

85 Recall that, although Algorithm 1 needs a set H of 2n classifiers containing n boolean complement pairs, it outputs a majority vote with n real-valued weights defined on {h1, …, hn}. [sent-195, score-0.479]

86 We have compared Algorithm 1 with quadratic loss (KL-QL) and exponential loss (KL-EL) to AdaBoost [7] (AdB) and ridge regression (RR). [sent-200, score-0.514]

87 For AdaBoost, the number of boosting rounds was fixed to 200. [sent-204, score-0.051]

88 In addition to this, the “C” and “γ” columns in Table 1 refer, respectively, to the C value of the objective function f and to the γ parameter present in the loss functions. [sent-206, score-0.196]

89 These hyperparameters were determined from the training set only, by performing 10-fold cross-validation (CV). [sent-207, score-0.047]

90 The hyperparameters that gave the smallest 10-fold CV error were then used to train the algorithms on the whole training set, and the resulting classifiers were then run on the testing set. [sent-208, score-0.072]

91 This, in turn, gives a risk bound (computed from Theorem 3. [sent-364, score-0.237]

92 We have also tried to choose C and γ from the risk bound values. [sent-366, score-0.237]

93 This method for selecting hyperparameters turned out to produce classifiers having larger testing errors (results not shown here). [sent-367, score-0.113]

94 To determine whether or not a difference of empirical risk measured on the testing set T is statistically significant, we have used the test set bound method of [4] (based on the binomial tail inversion). Footnote 3: From the standard union bound argument, the bound of Theorem 3. [sent-368, score-0.404]

95 It turns out that no algorithm has succeeded in choosing a majority vote classifier which was statistically significantly better (SSB) than the one chosen by another algorithm except for the 4 cases that are listed in the column “SSB” of Table 1. [sent-371, score-0.329]

96 We see that in these cases, Algorithm 1 turned out to be statistically significantly better. [sent-372, score-0.065]

97 6 Conclusion: Our numerical results indicate that Algorithm 1 generally outperforms AdaBoost and ridge regression when the hyperparameters C and γ are chosen by cross-validation. [sent-373, score-0.153]

98 This indicates that the empirical loss ζ̂Q and the KL(Q‖P) regularizer that are present in the PAC-Bayes bound of Theorem 3. [sent-374, score-0.241]

99 2 at selecting good values for the hyperparameters indicates that PAC-Bayes theory does not yet capture quantitatively the proper tradeoff between ζ̂Q and KL(Q‖P) that learners should optimize on the training data. [sent-377, score-0.047]

100 A PAC-Bayes risk bound for general loss functions. [sent-391, score-0.398]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gq', 0.377), ('dw', 0.366), ('def', 0.334), ('bq', 0.277), ('qn', 0.204), ('quasi', 0.196), ('risk', 0.178), ('vote', 0.168), ('loss', 0.161), ('qi', 0.153), ('kl', 0.148), ('majority', 0.137), ('ql', 0.126), ('wq', 0.121), ('ln', 0.118), ('hn', 0.112), ('rs', 0.108), ('ers', 0.108), ('classi', 0.099), ('boolean', 0.097), ('sgn', 0.095), ('er', 0.089), ('el', 0.086), ('ridge', 0.08), ('adaboost', 0.079), ('complement', 0.077), ('ca', 0.076), ('consequently', 0.076), ('hi', 0.075), ('yi', 0.074), ('ak', 0.072), ('germain', 0.067), ('lacasse', 0.067), ('laviolette', 0.067), ('ssb', 0.067), ('bound', 0.059), ('posteriors', 0.059), ('gibbs', 0.057), ('francois', 0.054), ('mario', 0.054), ('rt', 0.052), ('quadratic', 0.051), ('wi', 0.051), ('boosting', 0.051), ('mnist', 0.051), ('alexandre', 0.05), ('uniform', 0.047), ('hyperparameters', 0.047), ('xi', 0.045), ('adb', 0.045), ('convex', 0.044), ('letter', 0.044), ('misclassi', 0.042), ('turned', 0.041), ('yw', 0.039), ('posterior', 0.037), ('pascal', 0.037), ('attribute', 0.036), ('objective', 0.035), ('exponential', 0.035), ('restricting', 0.034), ('summations', 0.032), ('learner', 0.032), ('bounds', 0.03), ('theorem', 0.03), ('dkl', 0.029), ('cv', 0.029), ('deterministic', 0.028), ('rr', 0.028), ('equation', 0.028), ('minimizes', 0.027), ('regression', 0.026), ('testing', 0.025), ('hk', 0.025), ('df', 0.025), ('coordinate', 0.024), ('statistically', 0.024), ('output', 0.024), ('upper', 0.023), ('minimize', 0.023), ('regularization', 0.023), ('relaxation', 0.023), ('exceeds', 0.022), ('functions', 0.022), ('stochastic', 0.022), ('regularizer', 0.021), ('minimization', 0.021), ('divergence', 0.02), ('john', 0.02), ('pair', 0.02), ('ado', 0.02), ('bec', 0.02), ('qu', 0.02), ('monograph', 0.02), ('wee', 0.02), ('breastcancer', 0.02), ('wdbc', 0.02), ('karen', 0.02), ('lpboost', 0.02), ('stump', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors that we call quasi-uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi-uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

2 0.2244404 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

3 0.14733432 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

Author: Kian M. Chai

Abstract: We provide some insights into how task correlations in multi-task Gaussian process (GP) regression affect the generalization error and the learning curve. We analyze the asymmetric two-tasks case, where a secondary task is to help the learning of a primary task. Within this setting, we give bounds on the generalization error and the learning curve of the primary task. Our approach admits intuitive understandings of the multi-task GP by relating it to single-task GPs. For the case of one-dimensional input-space under optimal sampling with data only for the secondary task, the limitations of multi-task GP can be quantified explicitly. 1

4 0.13528626 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting “Compressed Least-Squares Regression” (CLSR) in terms of N, K, and M. When we choose M = O(√K), we show that CLSR has an estimation error of order O(log K/√K). 1

5 0.11656044 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

6 0.11352241 47 nips-2009-Boosting with Spatial Regularization

7 0.10546049 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

8 0.10058384 74 nips-2009-Efficient Bregman Range Search

9 0.097294003 260 nips-2009-Zero-shot Learning with Semantic Output Codes

10 0.091488078 71 nips-2009-Distribution-Calibrated Hierarchical Classification

11 0.089296594 69 nips-2009-Discrete MDL Predicts in Total Variation

12 0.082732871 256 nips-2009-Which graphical models are difficult to learn?

13 0.081589147 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

14 0.07938318 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

15 0.076574109 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

16 0.072778635 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

17 0.071333744 182 nips-2009-Optimal Scoring for Unsupervised Learning

18 0.069909945 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

19 0.069019102 72 nips-2009-Distribution Matching for Transduction

20 0.066441216 230 nips-2009-Statistical Consistency of Top-k Ranking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.185), (1, 0.153), (2, -0.024), (3, 0.051), (4, 0.007), (5, -0.031), (6, -0.079), (7, -0.079), (8, -0.002), (9, 0.068), (10, 0.08), (11, -0.013), (12, -0.034), (13, -0.076), (14, 0.194), (15, 0.053), (16, 0.069), (17, 0.048), (18, 0.245), (19, -0.088), (20, 0.175), (21, -0.128), (22, 0.042), (23, -0.185), (24, 0.097), (25, -0.039), (26, -0.036), (27, 0.006), (28, 0.095), (29, 0.051), (30, 0.101), (31, -0.14), (32, 0.126), (33, -0.053), (34, 0.094), (35, -0.054), (36, 0.122), (37, -0.029), (38, -0.05), (39, 0.015), (40, 0.088), (41, -0.118), (42, -0.003), (43, 0.005), (44, 0.031), (45, 0.074), (46, 0.045), (47, -0.142), (48, 0.135), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94262213 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

2 0.72583461 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1
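As a concrete point of comparison for the abstract above, the sketch below picks a ridge parameter by plain held-out validation. This is the naive baseline that an asymptotic analysis of regularizers refines, not the algorithm proposed in the paper; the λ grid, data split, and synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (X^T X + lam I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def select_lambda(X, y, lambdas, n_train):
    # Choose lambda by empirical risk on a held-out split.
    Xtr, ytr, Xva, yva = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    risks = [np.mean((Xva @ ridge_fit(Xtr, ytr, lam) - yva) ** 2)
             for lam in lambdas]
    return lambdas[int(np.argmin(risks))]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + 0.5 * rng.normal(size=300)
print(select_lambda(X, y, [0.01, 0.1, 1.0, 10.0], n_train=200))
```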

3 0.56921917 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1
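A minimal sketch of the prediction side of the multi-view setup described above: one linear scorer per view, with a view-generating function completing missing views before the per-view scores are averaged. The `translate` function standing in for an MT system plus feature extraction is hypothetical, as are the feature dimensions; the paper's actual learning algorithm and generalization analysis are more involved.

```python
import numpy as np

def translate(x, target_view):
    # Hypothetical view-generating function (in the paper, an MT system
    # followed by feature extraction); here a fixed random linear map per view.
    rng = np.random.default_rng(target_view)
    A = rng.normal(size=(x.shape[0], x.shape[0])) / np.sqrt(x.shape[0])
    return A @ x

def predict(views, weights):
    # views: dict view_id -> feature vector, or None when the view is missing.
    # Complete missing views approximately, then average the per-view scores.
    observed = next(v for v in views.values() if v is not None)
    scores = [w @ (views[vid] if views[vid] is not None
                   else translate(observed, vid))
              for vid, w in weights.items()]
    return np.sign(np.mean(scores))

# Usage: two views (languages), the second unobserved for this example.
rng = np.random.default_rng(1)
weights = {0: rng.normal(size=20), 1: rng.normal(size=20)}
views = {0: rng.normal(size=20), 1: None}
print(predict(views, weights))
```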

4 0.54653931 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting "Compressed Least-Squares Regression" (CLSR) in terms of N, K, and M. When we choose M = O(√K), we show that CLSR has an estimation error of order O(log K/√K). 1
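A minimal sketch of the CLSR procedure summarized above, assuming Gaussian random projection entries with variance 1/M (as in the paper's setup), ordinary least squares in the compressed domain, and the clipped predictor used in the paper's analysis. The feature matrix, clipping level L, and data are illustrative stand-ins.

```python
import numpy as np

def clsr_fit(Phi, y, M, rng):
    # Phi: K x N feature matrix in the initial domain (N may exceed K).
    # Compress the features with a random M x N matrix, then solve LS there.
    N = Phi.shape[1]
    A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
    Psi = Phi @ A.T                        # K x M compressed feature matrix
    beta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return A, beta

def clsr_predict(phi_x, A, beta, L=np.inf):
    # Truncated predictor T_L[g_beta(x)] for a single feature vector phi(x).
    return np.clip((A @ phi_x) @ beta, -L, L)

rng = np.random.default_rng(0)
K, N = 200, 1000                           # more features than data
Phi = rng.normal(size=(K, N))
y = Phi[:, :5].sum(axis=1) + 0.1 * rng.normal(size=K)
M = int(np.sqrt(K))                        # M = O(sqrt(K)) as in the abstract
A, beta = clsr_fit(Phi, y, M, rng)
print(clsr_predict(Phi[0], A, beta))
```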

5 0.51641423 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

6 0.48183572 69 nips-2009-Discrete MDL Predicts in Total Variation

7 0.42592841 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

8 0.4249166 47 nips-2009-Boosting with Spatial Regularization

9 0.42075792 94 nips-2009-Fast Learning from Non-i.i.d. Observations

10 0.41527173 182 nips-2009-Optimal Scoring for Unsupervised Learning

11 0.40658242 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

12 0.39401028 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

13 0.39029425 193 nips-2009-Potential-Based Agnostic Boosting

14 0.37542009 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

15 0.342464 14 nips-2009-A Parameter-free Hedging Algorithm

16 0.33378953 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

17 0.33114287 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

18 0.32949629 72 nips-2009-Distribution Matching for Transduction

19 0.32942265 67 nips-2009-Directed Regression

20 0.32190764 149 nips-2009-Maximin affinity learning of image segmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(17, 0.279), (24, 0.059), (25, 0.072), (35, 0.057), (36, 0.141), (39, 0.035), (58, 0.1), (61, 0.037), (71, 0.047), (86, 0.059), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81006867 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

2 0.75004941 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1
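A minimal sketch of the AROW update as it is commonly presented (the squared-hinge variant): the learner maintains a Gaussian over weight vectors with mean mu and covariance Sigma, shifting the mean and shrinking the covariance along each example's direction. The regularization constant r, the full-covariance form, and the toy stream are illustrative choices.

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    # y in {-1, +1}. Update only on a margin violation (hinge at 1).
    margin = y * (mu @ x)
    if margin >= 1.0:
        return mu, Sigma
    v = x @ Sigma @ x                      # current confidence along x
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta          # squared-hinge step size
    Sx = Sigma @ x
    mu = mu + alpha * y * Sx               # move the mean toward the correct side
    Sigma = Sigma - beta * np.outer(Sx, Sx)  # reduce variance along x
    return mu, Sigma

# Usage: one pass over a toy stream, then test-sign agreement.
rng = np.random.default_rng(0)
d = 5
mu, Sigma = np.zeros(d), np.eye(d)
w_true = rng.normal(size=d)
for _ in range(500):
    x = rng.normal(size=d)
    mu, Sigma = arow_update(mu, Sigma, x, np.sign(w_true @ x))
Xte = rng.normal(size=(200, d))
print(np.mean(np.sign(Xte @ mu) == np.sign(Xte @ w_true)))
```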

3 0.60708255 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1
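As a reference point for the abstract above, here is a minimal sketch of the Expectation-Maximization baseline the authors compare against: mixture-of-regressions EM alternates responsibilities (E-step) with per-component weighted least squares (M-step). A fixed noise variance and two components are simplifying assumptions for the sketch; the paper's contribution is a convex relaxation of the corresponding MAP problem, not this procedure.

```python
import numpy as np

def em_mixreg(X, y, k=2, iters=50, sigma2=0.25, seed=0):
    # EM for y ~ sum_j pi_j N(x^T w_j, sigma2) with fixed noise variance.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(k, d))
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities proportional to pi_j * N(y | x^T w_j, sigma2).
        resid = y[:, None] - X @ W.T                  # n x k residuals
        logp = np.log(pi) - resid ** 2 / (2 * sigma2)
        R = np.exp(logp - logp.max(axis=1, keepdims=True))
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component; update mixing weights.
        for j in range(k):
            Xw = X * R[:, j:j + 1]
            W[j] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y)
        pi = R.mean(axis=0)
    return W, pi

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
z = rng.integers(0, 2, size=300)
W_true = np.array([[2.0, 0.0], [-1.0, 1.5]])
y = (X * W_true[z]).sum(axis=1) + 0.3 * rng.normal(size=300)
W, pi = em_mixreg(X, y)
print(np.round(W, 2), np.round(pi, 2))
```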

4 0.6058377 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1
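The two-phase scheme above has a closed form when the regularizer is ℓ1: a gradient step followed by coordinate-wise soft-thresholding. A minimal sketch for least-squares loss, assuming a constant step size eta and regularization strength lam chosen purely for illustration:

```python
import numpy as np

def soft_threshold(w, tau):
    # Proximal step for tau * ||w||_1: shrink each coordinate toward zero.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def fobos_l1(X, y, lam=0.1, eta=0.01, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n                   # phase 1: gradient step
        w = soft_threshold(w - eta * grad, eta * lam)  # phase 2: proximal step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:3] = (1.0, -2.0, 0.5)
y = X @ w_true + 0.1 * rng.normal(size=200)
w = fobos_l1(X, y)
print(np.flatnonzero(np.abs(w) > 1e-3))  # sparse support recovered
```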

5 0.60457921 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1
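In the Euclidean special case, the family above contains the classical Nadaraya-Watson estimator (a weighted mean of the outputs) and a robust variant that replaces the mean by a weighted geometric median. A minimal sketch under that special case, with a Weiszfeld-style iteration and bandwidth h as illustrative choices; the paper itself works with general metric-space outputs such as manifolds.

```python
import numpy as np

def kernel_weights(x, X, h=0.5):
    # Gaussian smoothing weights from distances in the input space.
    d2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * h * h))
    return w / w.sum()

def nw_mean(x, X, Y, h=0.5):
    # Classical estimator: weighted mean of outputs (squared-loss minimizer).
    return kernel_weights(x, X, h) @ Y

def nw_median(x, X, Y, h=0.5, iters=50):
    # Robust variant: weighted geometric median via Weiszfeld iterations.
    w = kernel_weights(x, X, h)
    m = nw_mean(x, X, Y, h)
    for _ in range(iters):
        d = np.linalg.norm(Y - m, axis=1) + 1e-12
        c = w / d
        m = (c[:, None] * Y).sum(axis=0) / c.sum()
    return m

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 1))
Y = np.hstack([np.sin(3 * X), np.cos(3 * X)])
Y[::10] += 5.0                           # gross outliers in the outputs
print(nw_mean(np.array([0.5]), X, Y), nw_median(np.array([0.5]), X, Y))
```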

6 0.60423362 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

7 0.60314322 72 nips-2009-Distribution Matching for Transduction

8 0.60152256 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

9 0.60146815 97 nips-2009-Free energy score space

10 0.60137057 180 nips-2009-On the Convergence of the Concave-Convex Procedure

11 0.59960777 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

12 0.59950894 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

13 0.59945184 129 nips-2009-Learning a Small Mixture of Trees

14 0.59894383 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

15 0.59858197 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

16 0.59849143 71 nips-2009-Distribution-Calibrated Hierarchical Classification

17 0.59758621 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

18 0.5967468 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

19 0.59673774 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

20 0.59671086 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models