nips nips2009 nips2009-64 knowledge-graph by maker-knowledge-mining

64 nips-2009-Data-driven calibration of linear estimators with minimal penalties


Source: pdf

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Data-driven calibration of linear estimators with minimal penalties. Sylvain Arlot, CNRS; Willow Project-Team, Laboratoire d’Informatique de l’Ecole Normale Superieure (CNRS/ENS/INRIA UMR 8548), 23 avenue d’Italie, F-75013 Paris, France. sylvain. [sent-1, score-0.591]

2 We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. [sent-6, score-0.483]

3 Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. [sent-7, score-0.493]

4 Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. [sent-8, score-0.774]

5 A central issue common to all regularization frameworks is the choice of the regularization parameter: while most practitioners use cross-validation procedures to select such a parameter, data-driven procedures not based on cross-validation are rarely used. [sent-10, score-0.26]

6 The choice of the kernel, a seemingly unrelated issue, is also important for good predictive performance: several techniques exist, either based on cross-validation, Gaussian processes or multiple kernel learning [3, 4, 5]. [sent-11, score-0.252]

7 In this paper, we consider least-squares regression and cast these two problems as the problem of selecting among several linear estimators, where the goal is to choose an estimator with a quadratic risk which is as small as possible. [sent-12, score-0.397]

8 This problem includes for instance model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning (see Section 2). [sent-13, score-1.148]

9 The main contribution of the paper is to extend the notion of minimal penalty [6, 7] to all discrete classes of linear operators, and to use it for defining a fully data-driven selection algorithm satisfying a non-asymptotic oracle inequality. [sent-14, score-0.682]

10 Our new theoretical results presented in Section 4 extend similar results which were limited to unregularized least-squares regression (i. [sent-15, score-0.119]

11 Finally, in Section 5, we show that our algorithm improves the performance of classical selection procedures, such as GCV [8] and 10-fold cross-validation, for kernel ridge regression or multiple kernel learning, for moderate values of the sample size. [sent-18, score-0.814]

12 Linear estimators: In this section, we define the problem we aim to solve and give several examples of linear estimators. [sent-25, score-0.219]

13 The goal is to reconstruct the signal F = (f(xi))_{1≤i≤n} ∈ R^n with some estimator F̂ ∈ R^n, depending only on (x1, Y1), ... [sent-41, score-0.125]

14 ..., (xn, Yn), and having a small quadratic risk n^{-1}‖F̂ − F‖_2^2, where for all t ∈ R^n we denote by ‖t‖_2 the ℓ2-norm of t, defined as ‖t‖_2^2 := Σ_{i=1}^n ti^2. [sent-44, score-0.18]

15 In this paper, we focus on linear estimators F̂ that can be written as a linear function of Y = (Y1, ..., Yn), i.e., F̂ = AY. [sent-45, score-0.219]

16 The matrix A may depend on x1, ..., xn (which are known and deterministic), but not on Y, and may be parameterized by certain quantities, usually a regularization parameter or kernel combination weights. [sent-55, score-0.249]

17 Examples of linear estimators: In this paper, our theoretical results apply to matrices A which are symmetric positive semi-definite, such as the ones defined below. [sent-57, score-0.326]

18 If we consider linear predictors from a design matrix X ∈ R^{n×p}, then F̂ = AY with A = X(X^⊤X)^{-1}X^⊤, which is a projection matrix. [sent-59, score-0.139]
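
As a concrete illustration of this first example, here is a minimal numpy sketch (the variable names and sizes are our own, not from the paper) of the hat matrix giving F̂ = AY for ordinary least squares; its trace equals the model dimension p, anticipating the quantity df(λ) = tr(Aλ) used below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))      # fixed design matrix

# Hat matrix A = X (X^T X)^{-1} X^T, so that F_hat = A @ Y
A = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(A @ A, A))         # A is a projection: A^2 = A
print(np.trace(A))                   # tr(A) = p, the model dimension
```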

19 We assume that a positive definite kernel k : X × X → R is given, and we are looking for a function f : X → R in the associated reproducing kernel Hilbert space (RKHS) F, with norm ‖·‖_F. [sent-67, score-0.36]

20 If K denotes the n × n kernel matrix, defined by K_{ab} = k(xa, xb), then the ridge regression estimator—a.k.a. [sent-68, score-0.426]

21 the spline smoothing estimator for spline kernels [9]—is obtained by minimizing with respect to f ∈ F [2]: n^{-1} Σ_{i=1}^n (Yi − f(xi))^2 + λ‖f‖_F^2. [sent-71, score-0.472]

22 This leads to the smoothing matrix Aλ = K(K + nλIn )−1 , parameterized by the regularization parameter λ ∈ R+ . [sent-73, score-0.197]
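
A minimal sketch of this smoothing matrix Aλ = K(K + nλIn)^{-1} and its degrees of freedom df(λ) = tr(Aλ); the Gaussian kernel and the bandwidth below are illustrative choices of ours, not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(x, bandwidth=1.0):
    # x: (n, d) inputs; returns the n x n kernel matrix K_ab = k(x_a, x_b)
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def ridge_smoother(K, lam):
    # A_lambda = K (K + n*lambda*I_n)^{-1}; K and the inverse commute,
    # so solving (K + n*lambda*I_n) A = K yields the same matrix.
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), K)

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 4))
A = ridge_smoother(gaussian_kernel(x), lam=1e-3)
print(np.trace(A))                   # df(lambda) = tr(A_lambda)
```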

23 The group Lasso [10] and multiple kernel learning [11, 5] frameworks consider the following objective function: [sent-79, score-0.26]

24 J(f1, ..., fp) = n^{-1} Σ_{i=1}^n ( yi − Σ_{j=1}^p ⟨fj, Φj(xi)⟩ )^2 + 2λ Σ_{j=1}^p ‖fj‖_{Fj} = L(f1, ..., fp) + 2λ Σ_{j=1}^p ‖fj‖_{Fj}. [sent-82, score-0.351]

25 Using a^{1/2} = min_{b>0} (1/2){ a/b + b }, we obtain a variational formulation of the sum of norms: 2 Σ_{j=1}^p ‖fj‖ = min_{η∈R^p_+} Σ_{j=1}^p ( ‖fj‖^2/ηj + ηj ). [sent-87, score-0.262]

26 Minimizing J(f1, ..., fp) is equivalent to minimizing with respect to η ∈ R^p_+ (see [5] for more details): [sent-94, score-0.124]

27 min_{f1,...,fp} L(f1, ..., fp) + λ Σ_{j=1}^p ( ‖fj‖_{Fj}^2/ηj + ηj ) = λ y^⊤ ( Σ_{j=1}^p ηj Kj + nλIn )^{-1} y + λ Σ_{j=1}^p ηj, where In is the n × n identity matrix. [sent-100, score-0.131]

28 Moreover, given η, this leads to a smoothing matrix of the form Aη,λ = ( Σ_{j=1}^p ηj Kj )( Σ_{j=1}^p ηj Kj + nλIn )^{-1}, (1) parameterized by the regularization parameter λ ∈ R_+ and the kernel combinations in R^p_+—note that it depends only on λ^{-1}η, which can be grouped in a single parameter in R^p_+. [sent-101, score-0.377]
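
A sketch of the multiple-kernel smoothing matrix of Eq. (1), with η and λ left as free arguments; how to calibrate them is exactly the question addressed by the paper.

```python
import numpy as np

def mkl_smoother(kernels, eta, lam):
    # A_{eta,lambda} = (sum_j eta_j K_j)(sum_j eta_j K_j + n*lambda*I_n)^{-1}
    K_eta = sum(e * K for e, K in zip(eta, kernels))
    n = K_eta.shape[0]
    # K_eta and (K_eta + n*lambda*I_n)^{-1} commute, so one solve suffices
    return np.linalg.solve(K_eta + n * lam * np.eye(n), K_eta)
```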

29 However, while the model selection problem is by nature combinatorial, our optimization problems for multiple kernels are all differentiable and are thus amenable to gradient descent procedures—which only find local optima. [sent-105, score-0.177]

30 Other linear estimators are commonly used, such as nearest-neighbor regression or the Nadaraya-Watson estimator [13]; those however lead to non-symmetric matrices A, and are not entirely covered by our theoretical results. [sent-107, score-0.594]

31 Linear estimator selection: In this section, we first describe the statistical framework of linear estimator selection and introduce the notion of minimal penalty. [sent-108, score-0.624]

32 Unbiased risk estimation heuristics: Usually, several estimators of the form F̂ = AY can be used. [sent-110, score-0.519]

33 2), hence a family of estimators (Fλ )λ∈Λ can be used, with Fλ := Aλ Y . [sent-113, score-0.219]

34 The goal is to choose from data some λ ∈ Λ , so that the quadratic risk of Fλ is as small as possible. [sent-114, score-0.18]

35 Many classical selection methods are built upon the “unbiased risk estimation” heuristics: If λ minimizes a criterion crit(λ) such that ∀λ ∈ Λ, E[crit(λ)] ≈ E[n^{-1}‖F̂λ − F‖_2^2], then λ satisfies an oracle inequality such as in Eq. [sent-117, score-0.6]

36 One way of implementing this heuristics is penalization, which consists in minimizing the sum of the empirical risk and a penalty term, i. [sent-120, score-0.518]

37 The unbiased risk estimation heuristics, also called Mallows’ heuristics, then leads to the ideal (deterministic) penalty penid(λ) := E[n^{-1}‖F̂λ − F‖_2^2] − E[n^{-1}‖F̂λ − Y‖_2^2]. [sent-123, score-0.589]

38 Up to the term −E[n^{-1}‖Y − F‖_2^2], which does not depend on λ, this gives penid(λ) = 2σ²n^{-1} tr(Aλ) (Eqs. (4)–(5)). Note that df(λ) = tr(Aλ) is called the effective dimensionality or degrees of freedom [16], so that the ideal penalty in Eq. [sent-128, score-0.516]

39 (5) is proportional to the dimensionality associated with the matrix Aλ — for projection matrices, we get back the dimension of the subspace, which is classical in model selection. [sent-129, score-0.19]

40 (5) led to several selection procedures, in particular Mallows’ CL (called Cp in the case of projection estimators) [17], where σ² is replaced by some estimator of σ². [sent-131, score-0.295]
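
A minimal sketch of the Mallows' CL criterion of Eq. (5), empirical risk plus 2σ²tr(Aλ)/n, written for a generic smoothing matrix; which variance estimate to plug in is the subject of the rest of the paper.

```python
import numpy as np

def mallows_cl(A, Y, sigma2_hat):
    # crit(lambda) = n^{-1} ||A_lambda Y - Y||^2 + 2 * sigma2_hat * tr(A_lambda) / n
    n = len(Y)
    residual = A @ Y - Y
    return residual @ residual / n + 2.0 * sigma2_hat * np.trace(A) / n
```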

41 The estimator of σ 2 usually used with CL is based upon the value of the empirical risk at some λ0 with df(λ0 ) large; it has the drawback of overestimating the risk, in a way which depends on λ0 and F [18]. [sent-132, score-0.381]
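
For concreteness, one common residual-based form of such an estimator is sketched below; the normalization n − tr(A_{λ0}) is our own assumption, as the paper only states that the estimator is based on the empirical risk at some λ0 with large df(λ0), and points out its bias.

```python
import numpy as np

def residual_variance_estimate(A0, Y):
    # sigma2_hat = ||Y - A0 Y||^2 / (n - tr(A0)), for a low-bias smoother A0
    n = len(Y)
    r = Y - A0 @ Y
    return (r @ r) / (n - np.trace(A0))
```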

42 GCV, which implicitly estimates σ 2 , has the drawback of overfitting if the family (Aλ )λ∈Λ contains a matrix too close to In [19]; GCV also overestimates the risk even more than CL for most Aλ (see (7. [sent-133, score-0.21]

43 In this paper, we define an estimator of σ 2 directly related to the selection task which does not have similar drawbacks. [sent-135, score-0.216]

44 Our estimator relies on the concept of minimal penalty, introduced by Birgé and Massart [6] and further studied in [7]. [sent-136, score-0.317]

45 (7), let us consider the kernel ridge regression example and assume that the risk and the empirical risk behave as their expectations in Eq. [sent-148, score-0.786]

46 Completely rigorous arguments based upon concentration inequalities are developed in [20] and summarized in Section 4, leading to the same conclusion as the present informal reasoning. [sent-152, score-0.18]

47 First, as proved in [20], the bias n^{-1}‖(Aλ − In)F‖_2^2 is a decreasing function of the dimensionality df(λ) = tr(Aλ), and the variance n^{-1}tr(Aλ^⊤Aλ)σ² is an increasing function of df(λ), as well as 2 tr(Aλ) − tr(Aλ^⊤Aλ). [sent-153, score-0.191]
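
A small helper making this decomposition explicit, assuming homoscedastic noise of variance σ² (this mirrors Eq. (6) as described here; F denotes the true signal vector, which is of course only available in simulations).

```python
import numpy as np

def risk_decomposition(A, F, sigma2):
    # E[ n^{-1} ||A Y - F||^2 ] = bias + variance, with Y = F + noise
    n = len(F)
    bias = ((A @ F - F) @ (A @ F - F)) / n
    variance = sigma2 * np.trace(A.T @ A) / n
    return bias, variance
```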

48 (6) shows that the optimal λ realizes the best trade-off between bias (which decreases with df(λ)) and variance (which increases with df(λ)), which is a classical fact in model selection. [sent-155, score-0.145]

49 Second, the expectation of the empirical risk in Eq. [sent-156, score-0.18]

50 (7) can be decomposed into the bias and a negative variance term which is the opposite of penmin(λ) := n^{-1}( 2 tr(Aλ) − tr(Aλ^⊤Aλ) ) σ². [sent-157, score-0.374]
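
The minimal penalty in the notation above, as a one-line helper for later use:

```python
import numpy as np

def pen_min(A, sigma2):
    # pen_min(lambda) = n^{-1} (2 tr(A_lambda) - tr(A_lambda^T A_lambda)) * sigma^2
    n = A.shape[0]
    return (2.0 * np.trace(A) - np.trace(A.T @ A)) * sigma2 / n
```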

51 [Figure 1 legend: bias; variance ∼ σ² tr(Aλ^2); generalization error ∼ bias + σ² tr(Aλ^2); empirical error − σ² ∼ bias + σ² tr(Aλ^2) − 2σ² tr(Aλ); x-axis: degrees of freedom tr(Aλ).] [sent-159, score-0.785]

52 Figure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties. [sent-160, score-0.416]

53 As suggested by the notation penmin , we will show it is a minimal penalty in the following sense. [sent-161, score-0.685]

54 If ∀C ≥ 0, λmin(C) ∈ arg min_{λ∈Λ} { n^{-1}‖F̂λ − Y‖_2^2 + C penmin(λ) }, then, up to concentration inequalities that are detailed in Section 4. [sent-162, score-0.416]

55 2, λmin(C) behaves like a minimizer of gC(λ) = E[n^{-1}‖F̂λ − Y‖_2^2] + C penmin(λ) − σ² = n^{-1}‖(Aλ − In)F‖_2^2 + (C − 1) penmin(λ). Therefore, two main cases can be distinguished: if C < 1, then gC(λ) decreases with df(λ) so that df(λmin(C)) is huge: λmin(C) overfits. [sent-163, score-0.55]

56 As a conclusion, penmin (λ) is the minimal amount of penalization needed so that a minimizer λ of a penalized criterion is not clearly overfitting. [sent-165, score-0.543]

57 Following an idea first proposed in [6] and further analyzed or used in several other papers such as [21, 7, 22], we now propose to use that penmin (λ) is a minimal penalty for estimating σ 2 and plug this estimator into Eq. [sent-166, score-0.81]

58 (9) is new; it generalizes previous results [6, 7] where penmin (Aλ ) = n−1 tr(Aλ )σ 2 because all Aλ were assumed to be projection matrices, i. [sent-171, score-0.354]

59 Furthermore, our results generalize the slope heuristics penid ≈ 2 penmin (only valid for projection estimators [6, 7]) to general linear estimators for which penid / penmin ∈ (1, 2] . [sent-174, score-1.515]

60 Algorithm: The following algorithm first computes an estimator Ĉ of σ² using the minimal penalty in Eq. [sent-177, score-0.535]

61 Alternatively, using the same grid in log-scale, we can select C with maximal jump between successive values of df(λ0 (C))—note that our theoretical result then does not entirely hold, as we show the presence of a jump around σ 2 , but do not show the absence of similar jumps elsewhere. [sent-187, score-0.241]
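
A sketch of our reading of the resulting two-step procedure: slide C over a grid, penalize the empirical risk by C times the minimal-penalty shape, locate the jump in the selected degrees of freedom to get the variance estimate, then plug it into Mallows' CL. The grid, the maximal-jump rule and taking the right-hand side of the jump are simplifying assumptions of ours, not the paper's exact prescription.

```python
import numpy as np

def selected_df(smoothers, Y, C):
    # lambda_0(C): minimize empirical risk + C * (2 tr(A) - tr(A^T A)) / n over the family
    n = len(Y)
    crits = []
    for A in smoothers:
        r = A @ Y - Y
        crits.append(r @ r / n + C * (2.0 * np.trace(A) - np.trace(A.T @ A)) / n)
    return np.trace(smoothers[int(np.argmin(crits))])

def estimate_sigma2_by_jump(smoothers, Y, C_grid):
    # C_grid increasing (e.g. log-scale); take the C right after the largest df drop
    dfs = np.array([selected_df(smoothers, Y, C) for C in C_grid])
    jump = int(np.argmax(dfs[:-1] - dfs[1:]))
    return C_grid[jump + 1]

def select_estimator(smoothers, Y, sigma2_hat):
    # plug sigma2_hat into Mallows' C_L and return the index of the minimizer
    n = len(Y)
    crits = []
    for A in smoothers:
        r = A @ Y - Y
        crits.append(r @ r / n + 2.0 * sigma2_hat * np.trace(A) / n)
    return int(np.argmin(crits))
```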

62 The assumption that matrices Aλ must be symmetric can certainly be relaxed, since it is only used for deriving from Eq. [sent-208, score-0.121]

63 (A1−2 ) holds if maxλ∈Λ { df(λ) } ≥ n/2 and the bias is smaller than c df(λ)−d for some c, d > 0 , a quite classical assumption in the context of model selection. [sent-212, score-0.108]

64 (6), (A3) holds with κ = 1 when Aλ is a projection matrix since tr(Aλ^⊤Aλ) = tr(Aλ). [sent-218, score-0.109]

65 In the kernel ridge regression framework, (A3) holds as soon as the eigenvalues of the kernel matrix K decrease like j^{−α}—see [20]. [sent-219, score-0.683]

66 In general, (A3 ) means that Fλ should not have a risk smaller than the parametric convergence rate associated with a model of dimension df(λ) = tr(Aλ ) . [sent-220, score-0.18]

67 When (A3 ) does not hold, selecting among estimators whose risks are below the parametric rate is a rather difficult problem and it may not be possible to attain the risk of the oracle in general. [sent-221, score-0.58]

68 [Figure 2: selected degrees of freedom plotted against log(C/σ²) for two settings; left panel: minimal penalty vs. optimal penalty / 2 (kernel ridge); right panel: optimal/2, minimal (discrete), minimal (continuous) (multiple kernel learning).] Figure 2: Selected degrees of freedom vs. [sent-222, score-1.324]

69 penalty strength log(C/σ 2 ) : note that when penalizing by the minimal penalty, there is a strong jump at C = σ 2 , while when using half the optimal penalty, this is not the case. [sent-223, score-0.498]

70 Nevertheless, an oracle inequality can still be proved without (A3 ), at the price of enlarging C slightly and adding a small fraction of σ 2 n−1 tr(Aλ ) in the right-hand side of Eq. [sent-225, score-0.321]

71 Enlarging C is necessary in general: if tr(Aλ^⊤Aλ) ≪ tr(Aλ) for most λ ∈ Λ, the minimal penalty is very close to 2σ²n^{-1}tr(Aλ), so that according to Eq. [sent-227, score-0.41]

72 The first part of Theorem 1 shows that C is a consistent estimator of σ 2 in a general framework and under mild assumptions. [sent-231, score-0.125]

73 Compared to classical estimators of σ 2 , such as the one usually used with Mallows’ CL , C does not depend on the choice of some model assumed to have almost no bias, which can lead to overestimating σ 2 by an unknown amount [18]. [sent-232, score-0.336]

74 Our algorithm satisfies an oracle inequality with high probability, as shown by Eq. [sent-234, score-0.22]

75 (11): The risk of the selected estimator F̂λ is close to the risk of the oracle, up to a remainder term which is negligible when the dimensionality df(λ⋆) grows with n faster than ln(n), a typical situation when the bias is never equal to zero, for instance in kernel ridge regression. [sent-235, score-0.973]
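
Schematically, the oracle inequality described here has the following shape (this display only sketches the form of the bound; the exact constants and the precise remainder term are those of the full paper [20] and are not reproduced here):

$$
\frac{1}{n}\bigl\|\widehat F_{\widehat\lambda} - F\bigr\|_2^2
\;\le\;
(1 + o(1))\,\inf_{\lambda \in \Lambda} \frac{1}{n}\bigl\|\widehat F_{\lambda} - F\bigr\|_2^2
\;+\; O\!\left(\frac{\ln(n)\,\sigma^2}{n}\right),
$$

with high probability.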

76 Several oracle inequalities have been proved in the statistical literature for Mallows’ CL with a consistent estimator of σ 2 , for instance in [23]. [sent-236, score-0.458]

77 This assumption can be problematic for several learning problems, for instance in multiple kernel learning when the number p of kernels may grow with n . [sent-238, score-0.296]

78 According to Theorem 1 and previous theoretical results [23, 19], CL , GCV, cross-validation and our algorithm satisfy similar oracle inequalities in various frameworks. [sent-242, score-0.273]

79 Furthermore, our algorithm never overfits too much because df(λ) is by construction smaller than the effective dimensionality of λ0 (C) at which the jump occurs. [sent-246, score-0.123]

80 Figure 3: Comparison of various smoothing parameter selection methods (minikernel, GCV, 10-fold cross-validation) for various numbers of observations, averaged over 20 replications. [sent-269, score-0.189]

81 In Figure 2 (left), we consider data xi ∈ R6 , n = 1000, and study the size of the jump in Figure 2 for kernel ridge regression. [sent-272, score-0.422]

82 With half the optimal penalty (which is used in traditional variable selection for linear regression), we do not get any jump, while with the minimal penalty we always do. [sent-273, score-0.719]

83 In Figure 2 (right), we plot the same curves for the multiple kernel learning problem with two kernels on two different 4-dimensional variables, with similar results. [sent-274, score-0.266]

84 In addition, we show two ways of optimizing over λ ∈ Λ = R^2_+, by discrete optimization with n different kernel matrices—a situation covered by Theorem 1—or with continuous optimization with respect to η in Eq. [sent-275, score-0.231]

85 In Figure 3, we plot model selection results for 20 replications of data (d = 4, n = 500), comparing GCV [8], our minimal penalty algorithm, and cross-validation methods. [sent-278, score-0.501]

86 In the left part (single kernel), we compare to the oracle (which can be computed because we can enumerate Λ), and use for cross-validation all possible values of λ . [sent-279, score-0.212]

87 We also compare to using our minimal penalty algorithm with the sum of kernels. [sent-284, score-0.41]

88 Theorem 1 generalizes some results first proved in [6] where all Aλ are assumed to be projection matrices, a framework where assumption (A3 ) is automatically satisfied. [sent-286, score-0.136]

89 To this extent, Birgé and Massart’s slope heuristics has been modified in a way that sheds a new light on the “magical” factor 2 between the minimal and the optimal penalty, as proved in [6, 7]. [sent-287, score-0.447]

90 Indeed, Theorem 1 shows that for general linear estimators, penid(λ) / penmin(λ) = 2 tr(Aλ) / ( 2 tr(Aλ) − tr(Aλ^⊤Aλ) ), (14) which can take any value in (1, 2] in general; this ratio is only equal to 2 when tr(Aλ) ≈ tr(Aλ^⊤Aλ), hence mostly when Aλ is a projection matrix. [sent-288, score-0.479]
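
A direct check of Eq. (14) in code: for a projection matrix the ratio is exactly 2 (since tr(Aλ^⊤Aλ) = tr(Aλ)), while for a kernel ridge smoother, whose eigenvalues lie in [0, 1), it is strictly between 1 and 2.

```python
import numpy as np

def penalty_ratio(A):
    # pen_id / pen_min = 2 tr(A) / (2 tr(A) - tr(A^T A)), which lies in (1, 2]
    t, t2 = np.trace(A), np.trace(A.T @ A)
    return 2.0 * t / (2.0 * t - t2)
```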

91 In the case of projection estimators, the slope heuristics still holds when the design is random and data are heteroscedastic [7]; we would like to know whether Eq. [sent-290, score-0.309]

92 In addition, the good empirical performances of elbow heuristics based algorithms (i. [sent-292, score-0.15]

93 Another interesting open problem would be to extend the results of Section 4, where Card(Λ) ≤ Knα is assumed, to continuous sets Λ such as the ones appearing naturally in kernel ridge regression and multiple kernel learning. [sent-295, score-0.647]

94 We conjecture that Theorem 1 is valid without modification for a “small” continuous Λ , such as in kernel ridge regression where taking a grid of size n in log-scale is almost equivalent to taking Λ = R+ . [sent-296, score-0.464]

95 On the contrary, in applications such as the Lasso with p ≫ n variables, the natural set Λ cannot be well covered by a grid of cardinality nα with α small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified. [sent-297, score-0.54]

96 Consistency of the group Lasso and multiple kernel learning. [sent-322, score-0.221]

97 Model selection and estimation in regression with grouped variables. [sent-354, score-0.183]

98 Learning bounds for kernel regression using effective data dimensionality. [sent-397, score-0.272]

99 Data-driven calibration of linear estimators with minimal penalties, September 2009. [sent-423, score-0.479]

100 Slope heuristics for variable selection and clustering via gaussian mixtures. [sent-435, score-0.211]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('df', 0.332), ('tr', 0.312), ('gcv', 0.275), ('penmin', 0.275), ('estimators', 0.219), ('penalty', 0.218), ('minimal', 0.192), ('oracle', 0.181), ('risk', 0.18), ('kernel', 0.18), ('ridge', 0.154), ('cl', 0.146), ('fj', 0.131), ('mallows', 0.131), ('penid', 0.125), ('estimator', 0.125), ('heuristics', 0.12), ('spline', 0.102), ('smoothing', 0.098), ('regression', 0.092), ('selection', 0.091), ('fp', 0.089), ('jump', 0.088), ('projection', 0.079), ('slope', 0.078), ('penalties', 0.076), ('arlot', 0.075), ('calibration', 0.068), ('crit', 0.066), ('inequalities', 0.065), ('bias', 0.062), ('freedom', 0.061), ('kj', 0.06), ('birg', 0.06), ('lasso', 0.059), ('procedures', 0.059), ('proved', 0.057), ('ln', 0.057), ('gc', 0.056), ('theorem', 0.056), ('ay', 0.055), ('rn', 0.054), ('rp', 0.054), ('covered', 0.051), ('italie', 0.05), ('massart', 0.05), ('superieure', 0.05), ('penalization', 0.049), ('cv', 0.049), ('soon', 0.047), ('classical', 0.046), ('cp', 0.045), ('kernels', 0.045), ('card', 0.044), ('kn', 0.044), ('technometrics', 0.044), ('enlarging', 0.044), ('degrees', 0.043), ('sp', 0.043), ('matrices', 0.043), ('certainly', 0.041), ('concentration', 0.041), ('multiple', 0.041), ('informatique', 0.04), ('overestimating', 0.04), ('frameworks', 0.039), ('inequality', 0.039), ('relaxed', 0.038), ('grid', 0.038), ('umr', 0.038), ('willow', 0.038), ('normale', 0.038), ('informal', 0.038), ('laboratoire', 0.038), ('symmetric', 0.037), ('variance', 0.037), ('upon', 0.036), ('regularization', 0.036), ('avenue', 0.036), ('dimensionality', 0.035), ('min', 0.035), ('mkl', 0.034), ('ideal', 0.034), ('parameterized', 0.033), ('heteroscedastic', 0.032), ('cb', 0.032), ('unbiased', 0.032), ('deterministic', 0.032), ('choice', 0.031), ('enumerate', 0.031), ('performances', 0.03), ('matrix', 0.03), ('instance', 0.03), ('inria', 0.029), ('ecole', 0.028), ('paris', 0.028), ('rkhs', 0.028), ('criterion', 0.027), ('negligible', 0.027), ('theoretical', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

2 0.20651801 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

3 0.15854986 74 nips-2009-Efficient Bregman Range Search

Author: Lawrence Cayton

Abstract: We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences. 1

4 0.15301217 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

5 0.13919339 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

6 0.13007861 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

7 0.12138502 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

8 0.12108947 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

9 0.12022319 128 nips-2009-Learning Non-Linear Combinations of Kernels

10 0.11784661 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

11 0.10860839 71 nips-2009-Distribution-Calibrated Hierarchical Classification

12 0.10367044 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

13 0.096653119 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

14 0.096591771 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

15 0.092654251 55 nips-2009-Compressed Least-Squares Regression

16 0.085919388 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

17 0.081589147 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

18 0.081465811 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

19 0.079348385 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

20 0.077839762 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.225), (1, 0.198), (2, 0.002), (3, 0.163), (4, -0.032), (5, -0.085), (6, 0.047), (7, -0.029), (8, 0.057), (9, 0.052), (10, 0.162), (11, -0.133), (12, 0.035), (13, 0.035), (14, 0.113), (15, 0.086), (16, -0.024), (17, -0.005), (18, 0.101), (19, 0.081), (20, 0.064), (21, 0.05), (22, 0.018), (23, -0.119), (24, 0.028), (25, -0.011), (26, 0.009), (27, 0.059), (28, 0.14), (29, -0.009), (30, 0.045), (31, 0.066), (32, -0.075), (33, -0.062), (34, -0.06), (35, 0.082), (36, 0.067), (37, 0.092), (38, 0.027), (39, 0.05), (40, 0.087), (41, -0.03), (42, -0.037), (43, -0.137), (44, 0.025), (45, -0.013), (46, -0.029), (47, 0.065), (48, 0.055), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95756811 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

2 0.72069389 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

3 0.62926626 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

Author: Samory Kpotufe

Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1

4 0.60306561 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M . From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting “Compressed Least Squares Re√ gression” (CLSR) in terms of N , K, and M . When we choose M = O( K), we √ show that CLSR has an estimation error of order O(log K/ K). 1 Problem setting We consider a regression problem where we observe data DK = ({xk , yk }k≤K ) (where xk ∈ X and yk ∈ R) are assumed to be independently and identically distributed (i.i.d.) from some distribution P , where xk ∼ PX and yk = f ∗ (xk ) + ηk (xk ), where f ∗ is the (unknown) target function, and ηk a centered independent noise of variance σ 2 (xk ). For a given class of functions F, and f ∈ F, we define the empirical (quadratic) error def LK (f ) = 1 K K [yk − f (xk )]2 , k=1 and the generalization (quadratic) error def L(f ) = E(X,Y )∼P [(Y − f (X))2 ]. Our goal is to return a regression function f ∈ F with lowest possible generalization error L(f ). Notations: In the sequel we will make use of the following notations about norms: for h : X → R, we write ||h||P for the L2 norm of h with respect to (w.r.t.) the measure P , ||h||PK for the L2 norm n 2 1/2 of h w.r.t. the empirical measure PK , and for u ∈ Rn , ||u|| denotes by default . i=1 ui The measurable function minimizing the generalization error is f ∗ , but it may be the case that f ∗ ∈ F. For any regression function f , we define the excess risk / L(f ) − L(f ∗ ) = ||f − f ∗ ||2 , P which decomposes as the sum of the estimation error L(f ) − inf f ∈F L(f ) and the approximation error inf f ∈F L(f ) − L(f ∗ ) = inf f ∈F ||f − f ∗ ||2 which measures the distance between f ∗ and the P function space F. 1 In this paper we consider a class of linear functions FN defined as the span of a set of N functions def def N {ϕn }1≤n≤N called features. Thus: FN = {fα = n=1 αn ϕn , α ∈ RN }. When the number of data K is larger than the number of features N , the ordinary Least-Squares Regression (LSR) provides the LS solution fα which is the minimizer of the empirical risk LK (f ) b 1 in FN . Note that here LK (fα ) rewrites K ||Φα − Y ||K where Φ is the K × N matrix with elements (ϕn (xk ))1≤n≤N,1≤k≤K and Y the K-vector with components (yk )1≤k≤K . Usual results provide bound on the estimation error as a function of the capacity of the function space and the number of data. In the case of linear approximation, the capacity measures (such as covering numbers [23] or the pseudo-dimension [16]) depend on the number of features (for example the pseudo-dimension is at most N + 1). For example, let fα be a LS estimate (minimizer of LK b in FN ), then (a more precise statement will be stated later in Subsection 3) the expected estimation error is bounded as: N log K E L(fα ) − inf L(f ) ≤ cσ2 , (1) b f ∈FN K def where c is a universal constant, σ = supx∈X σ(x), and the expectation is taken with respect to P . 
Now, the excess risk is the sum of this estimation error and the approximation error inf f ∈FN ||f − f ∗ ||P of the class FN . Since the later usually decreases when the number of features N increases [13] (e.g. when N FN is dense in L2 (P )), we see the usual tradeoff between small estimation error (low N ) and small approximation error (large N ). In this paper we are interested in the setting when N is large so that the approximation error is small. Whenever N is larger than K we face the overfitting problem since there are more parameters than actual data (more variables than constraints), which is illustrated in the bound (1) which provides no information about the generalization ability of any LS estimate. In addition, there are many minimizers (in fact a vector space of same dimension as the null space of ΦT Φ) of the empirical risk. To overcome the problem, several approaches have been proposed in the literature: • LS solution with minimal norm: The solution is the minimizer of the empirical error with minimal (l1 or l2 )-norm: α = arg minΦα=Y ||α||1 or 2 , (or a robust solution arg min||Φα−Y ||2 ≤ε ||α||1 ). The choice of 2 -norm yields the ordinary LS solution. The choice of 1 -norm has been used for generating sparse solutions (e.g. the Basis Pursuit [10]), and assuming that the target function admits a sparse decomposition, the field of Compressed Sensing [9, 21] provides sufficient conditions for recovering the exact solution. However, such conditions (e.g. that Φ possesses a Restricted Isometric Property (RIP)) does not hold in general in this regression setting. On another aspect, solving these problems (both for l1 or l2 -norm) when N is large is numerically expensive. • Regularization. The solution is the minimizer of the empirical error plus a penalty term, for example f = arg min LK (f ) + λ||f ||p , for p = 1 or 2. p f ∈FN where λ is a parameter and usual choices for the norm are 2 (ridge-regression [20]) and 1 (LASSO [19]). A close alternative is the Dantzig selector [8, 5] which solves: α = arg min||α||1 ≤λ ||ΦT (Y − Φα)||∞ . The numerical complexity and generalization bounds of those methods depend on the sparsity of the target function decomposition in FN . Now if we possess a sequence of function classes (FN )N ≥1 with increasing capacity, we may perform structural risk minimization [22] by solving in each model the empirical risk penalized by a term that depends on the size of the model: fN = arg minf ∈FN ,N ≥1 LK (f ) + pen(N, K), where the penalty term measures the capacity of the function space. In this paper we follow another approach where instead of searching in the large space FN (where N > K) for a solution that minimizes the empirical error plus a penalty term, we simply search for the empirical error minimizer in a (randomly generated) lower dimensional subspace GM ⊂ FN (where M < K). Our contribution: We consider a set of M random linear combinations of the initial N features and perform our favorite LS regression algorithm (possibly regularized) using those “compressed 2 features”. This is equivalent to projecting the K points {ϕ(xk ) ∈ RN , k = 1..K} from the initial domain (of size N ) onto a random subspace of dimension M , and then performing the regression in the “compressed domain” (i.e. span of the compressed features). This is made possible because random projections approximately preserve inner products between vectors (by a variant of the Johnson-Lindenstrauss Lemma stated in Proposition 1. 
Our main result is a bound on the excess risk of a linear estimator built in the compressed domain in terms of the excess risk of the linear estimator built in the initial domain (Section 2). We further detail the case of ordinary Least-Squares Regression (Section 3) and discuss, in terms of M , N , K, the different tradeoffs concerning the excess risk (reduced estimation error in the compressed domain versus increased approximation error introduced by the random projection) and the numerical complexity (reduced complexity of solving the LSR in the compressed domain versus the additional load of performing the projection). √ As a consequence, we show that by choosing M = O( K) projections we define a Compressed Least-Squares Regression which uses O(N K 3/2 ) elementary operations to compute a regression √ function with estimation error (relatively to the initial function space FN ) of order log K/ K up to a multiplicative factor which depends on the best approximation of f ∗ in FN . This is competitive with the best methods, up to our knowledge. Related works: Using dimension reduction and random projections in various learning areas has received considerable interest over the past few years. In [7], the authors use a SVM algorithm in a compressed space for the purpose of classification and show that their resulting algorithm has good generalization properties. In [25], the authors consider a notion of compressed linear regression. For data Y = Xβ + ε, where β is the target and ε a standard noise, they use compression of the set of data, thus considering AY = AXβ + Aε, where A has a Restricted Isometric Property. They provide an analysis of the LASSO estimator built from these compressed data, and discuss a property called sparsistency, i.e. the number of random projections needed to recover β (with high probability) when it is sparse. These works differ from our approach in the fact that we do not consider a compressed (input and/or output) data space but a compressed feature space instead. In [11], the authors discuss how compressed measurements may be useful to solve many detection, classification and estimation problems without having to reconstruct the signal ever. Interestingly, they make no assumption about the signal being sparse, like in our work. In [6, 17], the authors show how to map a kernel k(x, y) = ϕ(x) · ϕ(y) into a low-dimensional space, while still approximately preserving the inner products. Thus they build a low-dimensional feature space specific for (translation invariant) kernels. 2 Linear regression in the compressed domain We remind that the initial set of features is {ϕn : X → def N FN = {fα = n=1 αn ϕn , α ∈ components (ϕn (x))n≤N . Let us R, 1 ≤ n ≤ N } and the initial domain R } is the span of those features. We write ϕ(x) the N -vector of N now define the random projection. Let A be a M × N matrix of i.i.d. elements drawn for some distribution ρ. Examples of distributions are: • Gaussian random variables N (0, 1/M ), √ • ± Bernoulli distributions, i.e. which takes values ±1/ M with equal probability 1/2, • Distribution taking values ± 3/M with probability 1/6 and 0 with probability 2/3. The following result (proof in the supplementary material) states the property that inner-product are approximately preserved through random projections (this is a simple consequence of the JohnsonLindenstrauss Lemma): Proposition 1 Let (uk )1≤k≤K and v be vectors of RN . Let A be a M × N matrix of i.i.d. elements drawn from one of the previously defined distributions. 
For any ε > 0, δ > 0, for M ≥ ε2 1 ε3 log 4K , we have, with probability at least 1 − δ, for all k ≤ K, δ 4 − 6 |Auk · Av − uk · v| ≤ ε||uk || ||v||. 3 def We now introduce the set of M compressed features (ψm )1≤m≤M such that ψm (x) = N We also write ψ(x) the M -vector of components (ψm (x))m≤M . Thus n=1 Am,n ϕn (x). ψ(x) = Aϕ(x). We define the compressed domain GM = {gβ = m=1 βm ψm , β ∈ RM } the span of the compressed features (vector space of dimension at most M ). Note that each ψm ∈ FN , thus GM is a subspace of FN . def 2.1 M Approximation error We now compare the approximation error assessed in the compressed domain GM versus in the initial space FN . This applies to the linear algorithms mentioned in the introduction such as ordinary LS regression (analyzed in details in Section 3), but also its penalized versions, e.g. LASSO and ridge regression. Define α+ = arg minα∈RN L(fα ) − L(f ∗ ) the parameter of the best regression function in FN . Theorem 1 For any δ > 0, any M ≥ 15 log(8K/δ), let A be a random M × N matrix defined like in Proposition 1, and GM be the compressed domain resulting from this choice of A. Then with probability at least 1 − δ, inf ||g−f ∗ ||2 ≤ P g∈GM 8 log(8K/δ) + 2 ||α || M E ||ϕ(X)||2 +2 sup ||ϕ(x)||2 x∈X log 4/δ + inf ||f −f ∗ ||2 . P f ∈FN 2K (2) This theorem shows the tradeoff in terms of estimation and approximation errors for an estimator g obtained in the compressed domain compared to an estimator f obtained in the initial domain: • Bounds on the estimation error of g in GM are usually smaller than that of f in FN when M < N (since the capacity of FN is larger than that of GM ). • Theorem 1 says that the approximation error assessed in GM increases by at most O( log(K/δ) )||α+ ||2 E||ϕ(X)||2 compared to that in FN . M def def Proof: Let us write f + = fα+ = arg minf ∈FN ||f − f ∗ ||P and g + = gAα+ . The approximation error assessed in the compressed domain GM is bounded as inf ||g − f ∗ ||2 P g∈GM ≤ ||g + − f ∗ ||2 = ||g + − f + ||2 + ||f + − f ∗ ||2 , P P P (3) since f + is the orthogonal projection of f ∗ on FN and g + belongs to FN . We now bound ||g + − def def f + ||2 using concentration inequalities. Define Z(x) = Aα+ · Aϕ(x) − α+ · ϕ(x). Define ε2 = P log(8K/δ) 8 M log(8K/δ). For M ≥ 15 log(8K/δ) we have ε < 3/4 thus M ≥ ε2 /4−ε3 /6 . Proposition 1 applies and says that on an event E of probability at least 1 − δ/2, we have for all k ≤ K, def |Z(xk )| ≤ ε||α+ || ||ϕ(xk )|| ≤ ε||α+ || sup ||ϕ(x)|| = C (4) x∈X On the event E, we have with probability at least 1 − δ , ||g + − f + ||2 P = ≤ ≤ EX∼PX |Z(X)|2 ≤ ε2 ||α+ ||2 ε2 ||α+ ||2 1 K 1 K K |Z(xk )|2 + C 2 k=1 K ||ϕ(xk )||2 + sup ||ϕ(x)||2 x∈X k=1 E ||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x∈X log(2/δ ) 2K log(2/δ ) 2K log(2/δ ) . 2K where we applied two times Chernoff-Hoeffding’s inequality. Combining with (3), unconditioning, and setting δ = δ/2 then with probability at least (1 − δ/2)(1 − δ ) ≥ 1 − δ we have (2). 4 2.2 Computational issues We now discuss the relative computational costs of a given algorithm applied either in the initial or in the compressed domain. Let us write Cx(DK , FN , P ) the complexity (e.g. number of elementary operations) of an algorithm A to compute the regression function f when provided with the data DK and function space FN . 
We plot in the table below, both for the initial and the compressed versions of the algorithm A, the order of complexity for (i) the cost for building the feature matrix, (ii) the cost for computing the estimator, (iii) the cost for making one prediction (i.e. computing f (x) for any x): Construction of the feature matrix Computing the regression function Making one prediction Initial domain NK Cx(DK , FN , P ) N Compressed domain N KM Cx(DK , GM , P ) NM Note that the values mentioned for the compressed domain are upper-bounds on the real complexity and do not take into account the possible sparsity of the projection matrix A (which would speed up matrix computations, see e.g. [2, 1]). 3 Compressed Least-Squares Regression We now analyze the specific case of Least-Squares Regression. 3.1 Excess risk of ordinary Least Squares regression In order to bound the estimation error, we follow the approach of [13] which truncates (up to the level ±L where L is a bound, assumed to be known, on ||f ∗ ||∞ ) the prediction of the LS regression function. The ordinary LS regression provides the regression function fα where b α= argmin α∈argminα ∈ RN ||α||. ||Y −Φα || Note that ΦΦT α = ΦT Y , hence α = Φ† Y ∈ RN where Φ† is the Penrose pseudo-inverse of Φ1 . def Then the truncated predictor is: fL (x) = TL [fα (x)], where b def TL (u) = u if |u| ≤ L, L sign(u) otherwise. Truncation after the computation of the parameter α ∈ RN , which is the solution of an unconstrained optimization problem, is easier than solving an optimization problem under the constraint that ||α|| is small (which is the approach followed in [23]) and allows for consistency results and prediction bounds. Indeed, the excess risk of fL is bounded as 1 + log K E(||f − f ∗ ||2 ) ≤ c max{σ2 , L2 } N + 8 inf ||f − f ∗ ||2 (5) P P f ∈FN K where a bound on c is 9216 (see [13]). We have a simpler bound when we consider the expectation EY conditionally on the input data: N EY (||f − f ∗ ||2 K ) ≤ σ2 + inf ||f − f ∗ ||2 K (6) P P K f ∈F Remark: Note that because we use the quadratic loss function, by following the analysis in [3], or by deriving tight bounds on the Rademacher complexity [14] and following Theorem 5.2 of Koltchinskii’s Saint Flour course, it is actually possible to state assumptions under which we can remove the log K term in (5). We will not further detail such bounds since our motivation here is not to provide the tightest possible bounds, but rather to show how the excess risk bound for LS regression in the initial domain extends to the compressed domain. 1 In the full rank case, Φ† = (ΦT Φ)−1 ΦT when K ≥ N and Φ† = ΦT (ΦΦT )−1 when K ≤ N 5 3.2 Compressed Least-Squares Regression (CLSR) CLSR is defined as the ordinary LSR in the compressed domain. Let β = Ψ† Y ∈ RM , where Ψ is the K × M matrix with elements (ψm (xk ))1≤m≤M,1≤k≤K . The CLSR estimate is defined as def gL (x) = TL [gβ (x)]. From Theorem 1, (5) and (6), we deduce the following excess risk bounds for b the CLSR estimate: √ ||α+ || E||ϕ(X)||2 K log(8K/δ) Corollary 1 For any δ > 0, set M = 8 max(σ,L) c (1+log K) . Then whenever M ≥ 15 log(8K/δ), with probability at least 1 − δ, the expected excess risk of the CLSR estimate is bounded as √ E(||gL − f ∗ ||2 ) ≤ 16 c max{σ, L}||α+ || E||ϕ(X)||2 P × 1+ supx ||ϕ(x)||2 E||ϕ(X)||2 (1 + log K) log(8K/δ) K log 4/δ + 8 inf ||f − f ∗ ||2 . P f ∈FN 2K (7) √ ||α+ || E||ϕ(X)||2 Now set M = 8K log(8K/δ). Assume N > K and that the features (ϕk )1≤k≤K σ are linearly independent. 
Then whenever M ≥ 15 log(8K/δ), with probability at least 1 − δ, the expected excess risk of the CLSR estimate conditionally on the input samples is upper bounded as 2 log(8K/δ) supx ||ϕ(x)||2 1+ K E||ϕ(X)||2 EY (||gL − f ∗ ||2 K ) ≤ 4σ||α+ || E||ϕ(X)||2 P log 4/δ . 2K Proof: Whenever M ≥ 15 log(8K/δ) we deduce from Theorem 1 and (5) that the excess risk of gL is bounded as E(||gL − f ∗ ||2 ) ≤ c max{σ2 , L2 } P +8 8 log(8K/δ) + 2 ||α || M 1 + log K M K E||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x log 4/δ + inf ||f − f ∗ ||2 . P f ∈FN 2K By optimizing on M , we deduce (7). Similarly, using (6) we deduce the following bound on EY (||gL − f ∗ ||2 K ): P σ2 8 M + log(8K/δ)||α+ ||2 K M E||ϕ(X)||2 + 2 sup ||ϕ(x)||2 x log 4/δ + inf ||f − f ∗ ||2 K . P f ∈FN 2K By optimizing on M and noticing that inf f ∈FN ||f − f ∗ ||2 K = 0 whenever N > K and the features P (ϕk )1≤k≤K are linearly independent, we deduce the second result. Remark 1 Note that the second term in the parenthesis of (7) is negligible whenever K Thus we have the expected excess risk log K/δ + inf ||f − f ∗ ||2 . P f ∈FN K E(||gL − f ∗ ||2 ) = O ||α+ || E||ϕ(X)||2 √ P log 1/δ. (8) The choice of M in the previous corollary depends on ||α+ || and E||ϕ(X)|| which are a priori unknown (since f ∗ and PX are unknown). If we set M independently of ||α+ ||, then an additional multiplicative factor of ||α+ || appears in the bound, and if we replace E||ϕ(X)|| by its bound supx ||ϕ(x)|| (which is known) then this latter factor will appear instead of the former in the bound. Complexity of CLSR: The complexity of LSR for computing the regression function in the compressed domain only depends on M and K, and is (see e.g. [4]) Cx(DK , GM , P ) = O(M K 2 ) which √ is of order O(K 5/2 ) when we choose the optimized number of projections M = O( K). However the leading term when using CLSR is the cost for building the Ψ matrix: O(N K 3/2 ). 6 4 4.1 Discussion The factor ||α+ || E||ϕ(X)||2 In light of Corollary 1, the important factor which will determine whether the CLSR provides low generalization error or not is ||α+ || E||ϕ(X)||2 . This factor indicates that a good set of features (for CLSR) should be such that the norm of those features as well as the norm of the parameter α+ of the projection of f ∗ onto the span of those features should be small. A natural question is whether this product can be made small for appropriate choices of features. We now provide two specific cases for which this is actually the case: (1) when the features are rescaled orthonormal basis functions, and (2) when the features are specific wavelet functions. In both cases, we relate the bound to an assumption of regularity on the function f ∗ , and show that the dependency w.r.t. N decreases when the regularity increases, and may even vanish. Rescaled Orthonormal Features: Consider a set of orthonormal functions (ηi )i≥1 w.r.t a measure µ, i.e. ηi , ηj µ = δi,j . In addition we assume that the law of the input data is dominated by µ, i.e. PX ≤ Cµ where C is a constant. For instance, this is the case when the set X is compact, µ is the uniform measure and PX has bounded density. def We define the set of N features as: ϕi = ci ηi , where ci > 0, for i ∈ {1, . . . , N }. Then any f ∈ FN decomposes as f = 2 we have: ||α|| = ||α+ ||2 E||ϕ||2 ≤ C N bi 2 i=1 ( ci ) N bi 2 i=1 ( ci ) and N i=1 N bi i=1 ci ϕi , where N 2 2 i=1 ci X ηi (x)dPX (x) f, ηi ηi = E||ϕ|| = 2 def bi = f, ηi . Thus ≤ C N 2 i=1 ci . Thus N 2 i=1 ci . 
Now, linear approximation theory (Jackson-type theorems) tells us that assuming a function f ∗ ∈ L2 (µ) is smooth, it may be decomposed onto the span of the N first (ηi )i∈{1,...,N } functions with decreasing coefficients |bi | ≤ i−λ for some λ ≥ 0 that depends on the smoothness of f ∗ . For example the class of functions with bounded total variation may be decomposed with Fourier basis (in dimension 1) with coefficients |bi | ≤ ||f ||V /(2πi). Thus here λ = 1. Other classes (such as Sobolev spaces) lead to larger values of λ related to the order of differentiability. √ N By choosing ci = i−λ/2 , we have ||α+ || E||ϕ||2 ≤ C i=1 i−λ . Thus if λ > 1, then this term is bounded by a constant that does not depend on N . If λ = 1 then it is bounded by O(log N ), and if 0 < λ < 1, then it is bounded by O(N 1−λ ). However any orthonormal basis, even rescaled, would not necessarily yield a small ||α+ || E||ϕ||2 term (this is all the more true when the dimension of X is large). The desired property that the coefficients (α+ )i of the decomposition of f ∗ rapidly decrease to 0 indicates that hierarchical bases, such as wavelets, that would decompose the function at different scales, may be interesting. Wavelets: Consider an infinite family of wavelets in [0, 1]: (ϕ0 ) = (ϕ0 ) (indexed by n ≥ 1 or n h,l equivalently by the scale h ≥ 0 and translation 0 ≤ l ≤ 2h − 1) where ϕ0 (x) = 2h/2 ϕ0 (2h x − l) h,l and ϕ0 is the mother wavelet. Then consider N = 2H features (ϕh,l )1≤h≤H defined as the rescaled def wavelets ϕh,l = ch 2−h/2 ϕ0 , where ch > 0 are some coefficients. Assume the mother wavelet h,l is C p (for p ≥ 1), has at least p vanishing moments, and that for all h ≥ 0, supx l ϕ0 (2h x − l)2 ≤ 1. Then the following result (proof in the supplementary material) provides a bound on supx∈X ||ϕ(x)||2 (thus on E||ϕ(X)||2 ) by a constant independent of N : Proposition 2 Assume that f ∗ is (L, γ)-Lipschitz (i.e. for all v ∈ X there exists a polynomial pv of degree γ such that for all u ∈ X , |f (u) − pv (u)| ≤ L|u − v|γ ) with 1/2 < γ ≤ p. Then setting γ 1 ch = 2h(1−2γ)/4 , we have ||α+ || supx ||ϕ(x)|| ≤ L 1−22 |ϕ0 |, which is independent of N . 1/2−γ 0 Notice that the Haar walevets has p = 1 vanishing moment but is not C 1 , thus the Proposition does not apply directly. However direct computations show that if f ∗ is L-Lipschitz (i.e. γ = 1) then L 0 αh,l ≤ L2−3h/2−2 , and thus ||α+ || supx ||ϕ(x)|| ≤ 4(1−2−1/2 ) with ch = 2−h/4 . 7 4.2 Comparison with other methods In the case when the factor ||α+ || E||ϕ(X)||2 does not depend on N (such as in the previous example), the bound (8) on the excess risk of CLSR states that the estimation error (assessed in √ √ terms of FN ) of CLSR is O(log K/ K). It is clear that whenever N > K (which is the case of interest here), this is better than the ordinary LSR in the initial domain, whose estimation error is O(N log K/K). It is difficult to compare this result with LASSO (or the Dantzig selector that has similar properties [5]) for which an important aspect is to design sparse regression functions or to recover a solution assumed to be sparse. From [12, 15, 24] one deduces that under some assumptions, the estimation error of LASSO is of order S log N where S is the sparsity (number of non-zero coefficients) of the K√ best regressor f + in FN . If S < K then LASSO is more interesting than CLSR in terms of excess risk. 
Otherwise, CLSR may be an interesting alternative, although this method makes no assumption about the sparsity of f⁺ and its goal is not to recover a possibly sparse f⁺ but only to make good predictions. However, in some sense our method does find a sparse solution, in that the regression function g_L lies in a space G_M of small dimension M ≪ N and can thus be expressed using only M coefficients. In terms of numerical complexity, CLSR requires O(N K^{3/2}) operations to build the matrix and compute the regression function, whereas, according to [18], the (heuristic) complexity of the LASSO algorithm is O(N K²) in the best cases (assuming that the number of steps required for convergence is O(K), which is not proved theoretically). Thus CLSR appears to be a good and simple competitor to the LASSO.

5 Conclusion

We considered the case where the number of features N is larger than the number of data points K. The result stated in Theorem 1 makes it possible to analyze the excess risk of any linear regression algorithm (LS or its penalized versions) performed in the compressed domain G_M versus in the initial space F_N. In the compressed domain the estimation error is reduced, but an additional (controlled) approximation error (relative to the best regressor in F_N) comes into the picture. In the case of LS regression, when the term ||α⁺|| √(E||ϕ(X)||²) has a mild dependency on N, then by choosing a random subspace of dimension M = O(√K), CLSR has an estimation error (assessed in terms of F_N) bounded by O(log K/√K) and a numerical complexity of O(N K^{3/2}). In short, CLSR provides an alternative to the usual penalization techniques, in which one first selects a random subspace of lower dimension and then performs empirical risk minimization in this subspace (a schematic sketch of this pipeline is given after the reference list below). Further work is needed to provide additional settings (when the space X is of dimension > 1) for which the term ||α⁺|| √(E||ϕ(X)||²) is small.

Acknowledgements: The authors wish to thank Laurent Jacques for numerous comments and Alessandro Lazaric and Mohammad Ghavamzadeh for exciting discussions. This work has been supported by the French National Research Agency (ANR) through the COSINUS program (project EXPLO-RA, ANR-08-COSI-004).

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, June 2003.
[2] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC '06: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 557–563, New York, NY, USA, 2006. ACM.
[3] Jean-Yves Audibert and Olivier Catoni. Risk bounds in linear regression through PAC-Bayesian truncation. Technical Report HAL: hal-00360268, 2009.
[4] David Bau III and Lloyd N. Trefethen. Numerical linear algebra. Philadelphia: Society for Industrial and Applied Mathematics, 1997.
[5] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. To appear in Annals of Statistics, 2008.
[6] Avrim Blum. Random projection, margins, kernels, and feature-selection. Subspace, Latent Structure and Feature Selection, pages 52–68, 2006.
[7] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical Report, 2009.
[8] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35:2313, 2007.
[9] Emmanuel J. Candes and Justin K. Romberg. Signal recovery from random projections. Volume 5674, pages 76–86. SPIE, 2005.
[10] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.
[11] Mark A. Davenport, Michael B. Wakin, and Richard G. Baraniuk. Detection and estimation with compressive measurements. Technical Report TREE 0610, Department of Electrical and Computer Engineering, Rice University, 2006.
[12] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10:971–988, 2004.
[13] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer-Verlag, 2002.
[14] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Leon Bottou, editors, Neural Information Processing Systems, pages 793–800. MIT Press, 2008.
[15] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electron. J. Statist., 2:605–633, 2008.
[16] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, New York, 1984.
[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Neural Information Processing Systems, 2007.
[18] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012, 2007.
[19] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
[20] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4, pages 1035–1038, 1963.
[21] Yaakov Tsaig and David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289–1306, 2006.
[22] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[23] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
[24] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. To appear in Annals of Statistics, 2009.
[25] Shuheng Zhou, John D. Lafferty, and Larry A. Wasserman. Compressed regression. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Neural Information Processing Systems. MIT Press, 2007.
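To make the pipeline summarized in the conclusion above concrete, here is a minimal sketch of compressed least-squares regression: project the N features onto a random M-dimensional subspace with M = O(√K), then run ordinary least squares in the compressed domain. The Gaussian projection, the synthetic data and all variable names below are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

# Schematic sketch of compressed least-squares regression (CLSR) as summarized
# in the conclusion above: project N features onto a random subspace of
# dimension M = O(sqrt(K)), then run ordinary least squares in that subspace.

rng = np.random.default_rng(0)

K, N = 1000, 10_000            # K data points, N >> K features
M = int(np.sqrt(K))            # compressed dimension M = O(sqrt(K))

# Synthetic data: feature vectors phi(x) in R^N, responses from a noisy linear model.
Phi = rng.normal(size=(K, N)) / np.sqrt(N)
alpha_star = np.zeros(N)
alpha_star[:50] = rng.normal(size=50)
y = Phi @ alpha_star + 0.1 * rng.normal(size=K)

# Random projection A : R^N -> R^M, compressed features psi(x) = A phi(x).
A = rng.normal(size=(M, N)) / np.sqrt(M)
Psi = Phi @ A.T                          # compressed design matrix, shape (K, M)

# Ordinary least squares in the compressed domain G_M.
beta, *_ = np.linalg.lstsq(Psi, y, rcond=None)

# Prediction for a new point x with feature vector phi_x: g(x) = beta . (A phi_x).
phi_new = rng.normal(size=N) / np.sqrt(N)
y_hat = beta @ (A @ phi_new)
print("compressed dimension M =", M, " prediction:", round(float(y_hat), 4))
```

Forming the compressed design matrix costs O(NKM) = O(NK^{3/2}) operations, in line with the complexity quoted above for building the matrix; structured projections (e.g. the fast Johnson-Lindenstrauss transform of [2]) could reduce this cost further.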

5 0.59842044 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

6 0.56716788 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

7 0.5600546 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

8 0.54738224 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

9 0.53347367 128 nips-2009-Learning Non-Linear Combinations of Kernels

10 0.52713889 67 nips-2009-Directed Regression

11 0.52569175 124 nips-2009-Lattice Regression

12 0.52350336 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

13 0.52349198 74 nips-2009-Efficient Bregman Range Search

14 0.51887667 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

15 0.50931978 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

16 0.49362424 81 nips-2009-Ensemble Nystrom Method

17 0.49357104 94 nips-2009-Fast Learning from Non-i.i.d. Observations

18 0.47739241 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

19 0.4424881 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

20 0.4377048 33 nips-2009-Analysis of SVM with Indefinite Kernels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.047), (25, 0.086), (35, 0.033), (36, 0.114), (39, 0.032), (58, 0.122), (61, 0.366), (71, 0.046), (86, 0.063), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91298175 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1

2 0.89580977 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-Velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.88029951 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1

same-paper 4 0.84970748 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

5 0.81811953 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

6 0.70643485 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

7 0.65972662 134 nips-2009-Learning to Explore and Exploit in POMDPs

8 0.6370331 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

9 0.62260121 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

10 0.61763018 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

11 0.60940081 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

12 0.60805047 113 nips-2009-Improving Existing Fault Recovery Policies

13 0.602503 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

14 0.60176802 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

15 0.59817451 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

16 0.59069234 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

17 0.58625585 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

18 0.58522415 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

19 0.56177747 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

20 0.55995977 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains