jmlr jmlr2012 jmlr2012-73 knowledge-graph by maker-knowledge-mining

73 jmlr-2012-Multi-task Regression using Minimal Penalties


Source: pdf

Author: Matthieu Solnon, Sylvain Arlot, Francis Bach

Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. [sent-8, score-0.085]

2 We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. [sent-10, score-0.087]

3 Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. [sent-11, score-0.276]

4 In this paper we study the kernel ridge regression procedure in a multi-task framework. [sent-18, score-0.062]

5 One-dimensional kernel ridge regression, which we refer to as “single-task” regression, has been widely studied. [sent-20, score-0.062]

6 As we briefly review in Section 3, one has, given n data points (X_i, Y_i)_{i=1}^n, to estimate a function f, often the conditional expectation f(X_i) = E[Y_i | X_i], by minimizing the quadratic risk of the estimator regularized by a certain norm. [sent-21, score-0.094]
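As a concrete illustration of this single-task setting, the sketch below builds the kernel ridge smoother A_λ = K(K + nλI_n)^{-1} (the closed form recalled later in this summary) and the fit f̂ = A_λ y on synthetic data. The Gaussian kernel, the helper names and all constants are our own choices, not the paper's.

```python
import numpy as np

def gaussian_kernel(X, bandwidth=0.2):
    """Gram matrix K[i, j] = exp(-(X_i - X_j)^2 / (2 bandwidth^2)) for 1-D inputs."""
    diffs = X[:, None] - X[None, :]
    return np.exp(-diffs ** 2 / (2.0 * bandwidth ** 2))

def ridge_smoother(K, lam):
    """Single-task smoother A_lambda = K (K + n lambda I_n)^{-1}."""
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0.0, 1.0, n)
f_true = np.sin(2.0 * np.pi * X)             # target f(X_i) = E[Y_i | X_i]
y = f_true + 0.3 * rng.standard_normal(n)    # noisy observations
A = ridge_smoother(gaussian_kernel(X), lam=1e-2)
f_hat = A @ y                                # regularized estimate of (f(X_1), ..., f(X_n))
```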

7 In this paper, we focus on the concept of minimal penalty, which was first introduced by Birgé and Massart (2007) and Arlot and Massart (2009) for model selection, then extended to linear estimators such as kernel ridge regression by Arlot and Bach (2011). [sent-31, score-0.115]

8 The main remaining theoretical problem is the calibration of a matrix parameter M (typically of size p), which characterizes the relationship between the tasks and extends the regularization parameter from single-task regression. [sent-44, score-0.087]

9 We give here a new algorithm to estimate Σ, and show that the estimation is sharp enough to derive an oracle inequality for the estimation of the task similarity matrix M, both with high probability and in expectation. [sent-51, score-0.177]

10 Multi-task Regression: Problem Set-up. We consider p kernel ridge regression tasks. [sent-77, score-0.062]

11 (1) This means that, while the observations are independent, the outputs of the different tasks can be correlated, with correlation matrix Σ between the tasks. [sent-111, score-0.086]

12 The estimator of f = (f^1, …, f^p) ∈ F^p is defined by f̂_M ∈ argmin_{g∈F^p} { (1/(np)) ∑_{i=1}^n ∑_{j=1}^p (y_i^j − g^j(X_i))^2 + ∑_{j=1}^p ∑_{ℓ=1}^p M_{j,ℓ} ⟨g^j, g^ℓ⟩_F }. [sent-137, score-0.428]

13 Taking M = M_similar(λ, µ) (diagonal entries λ + (p − 1)µ, off-diagonal entries −µ) in Equation (2) leads to the criterion (1/(np)) ∑_{i=1}^n ∑_{j=1}^p (y_i^j − g^j(X_i))^2 + λ ∑_{j=1}^p ||g^j||_F^2 + (µ/2) ∑_{j=1}^p ∑_{k=1}^p ||g^j − g^k||_F^2. [sent-151, score-0.419]
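To make the multi-task criterion concrete, here is a small sketch that builds M_similar(λ, µ) (diagonal λ + (p − 1)µ, off-diagonal −µ) and the corresponding multi-task smoother. The closed form A_M = (M^{-1} ⊗ K)((M^{-1} ⊗ K) + np I_np)^{-1} is our reading of the diagonalized expression quoted later in this summary (it reduces to the single-task smoother when p = 1); the stacking convention and helper names are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def m_similar(lam, mu, p):
    """M_similar(lam, mu) = (lam + p*mu) I_p - mu * 11^T."""
    return (lam + p * mu) * np.eye(p) - mu * np.ones((p, p))

def multitask_smoother(K, M):
    """A_M = (M^{-1} kron K) ((M^{-1} kron K) + n p I_np)^{-1}.

    Acts on y stacked task by task: y = (y^1_1,...,y^1_n, ..., y^p_1,...,y^p_n)."""
    n, p = K.shape[0], M.shape[0]
    Kt = np.kron(np.linalg.inv(M), K)
    return Kt @ np.linalg.inv(Kt + n * p * np.eye(n * p))

# Usage: f_hat_M = multitask_smoother(K, m_similar(lam, mu, p)) @ y
```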

14 2 Optimal Choice of the Kernel. Now when working in multi-task regression, a set M ⊂ S_p^{++}(R) of matrices M is given, and the goal is to select the “best” one, that is, minimizing over M the quadratic risk n^{−1} ||f̂_M − f||_2^2. [sent-211, score-0.071]

15 The oracle risk is defined as inf_{M∈M} ||f̂_M − f||_2^2. [sent-214, score-0.163]

16 (6) The ideal choice, called the oracle, is any matrix M^⋆ ∈ argmin_{M∈M} ||f̂_M − f||_2^2. [sent-215, score-0.104]

17 However the oracle M^⋆ is not an estimator, since it depends on f. [sent-219, score-0.115]

18 Example 12 (Partial computation of the oracle in a simple setting) It is possible in certain simple settings to exactly compute the oracle (or, at least, some part of it). [sent-220, score-0.23]

19 Using the estimator f̂_M = A_M y we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36): E||f̂_M − f||_2^2 = ||(A_M − I_np) f||_2^2 + tr(A_M^⊤ A_M · (Σ ⊗ I_n)). [sent-223, score-0.094]
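Since the decomposition above gives the exact risk of f̂_M = A_M y once f and Σ are known, the oracle of Equation (6) can be approximated on synthetic data by scanning a finite grid of candidate matrices. A minimal sketch; the helper name and the grid-scan idea are ours.

```python
import numpy as np

def exact_risk(A_M, f, Sigma, n):
    """E ||f_hat_M - f||_2^2 = ||(A_M - I_np) f||_2^2 + tr(A_M^T A_M (Sigma kron I_n))."""
    d = A_M.shape[0]                                  # d = n * p
    bias = np.sum(((A_M - np.eye(d)) @ f) ** 2)
    variance = np.trace(A_M.T @ A_M @ np.kron(Sigma, np.eye(n)))
    return bias + variance

# Oracle (benchmark only, since it needs the true f and Sigma):
#   M_star = argmin over a grid of candidate M of exact_risk(A_M, f, Sigma, n).
```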

20 Thus the oracle is obtained when µ = +∞, leading to a situation where the oracle functions f^{1,⋆}, …, f^{p,⋆} are all equal. [sent-225, score-0.23]

21 As explained by Arlot and Bach (2011), we choose M̂ ∈ argmin_{M∈M} {crit(M)} with crit(M) = (1/(np)) ||y − f̂_M||_2^2 + pen(M), where the penalty term pen(M) has to be chosen appropriately. [sent-230, score-0.454]

22 Our goal is to build an estimator whose risk is as close as possible to the oracle risk. [sent-236, score-0.174]

23 The unbiased risk estimation principle (introduced by Akaike, 1970) requires E[crit(M)] ≈ E[(1/(np)) ||f̂_M − f||_2^2], which leads to the (deterministic) ideal penalty pen_id(M) := E[(1/(np)) ||f̂_M − f||_2^2] − E[(1/(np)) ||y − f̂_M||_2^2]. [sent-242, score-1.353]

24 Since ε is centered and M is deterministic, we get, up to an additive factor independent of M, pen_id(M) = (2/(np)) E[⟨ε, A_M ε⟩], that is, as the covariance matrix of ε is Σ ⊗ I_n, pen_id(M) = 2 tr(A_M · (Σ ⊗ I_n)) / (np). (7) [sent-244, score-1.252]
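For reference, the ideal penalty of Equation (7) is straightforward to evaluate once A_M and a (true or estimated) covariance matrix are available; a sketch with our own helper name.

```python
import numpy as np

def ideal_penalty(A_M, Sigma, n, p):
    """pen_id(M) = 2 tr(A_M (Sigma kron I_n)) / (n p)."""
    return 2.0 * np.trace(A_M @ np.kron(Sigma, np.eye(n))) / (n * p)

# Plugging the minimal-penalty estimate Sigma_hat in place of Sigma gives the
# data-driven penalty actually used for model selection.
```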

25 In order to approach this penalty as precisely as possible, we have to sharply estimate Σ. [sent-245, score-0.061]

26 Writing M = λ with λ ∈ [0, +∞], the regularization matrix is, for all λ ∈ (0, +∞), A_λ = A_{λ,K} = K(K + nλI_n)^{−1}, with A_0 = I_n and A_{+∞} = 0; the ideal penalty becomes pen_id(λ) = 2σ² tr(A_λ)/n. [sent-252, score-0.202]

27 The ideal penalty however depends on σ²; in order to have a fully data-driven penalty we have to replace σ² by an estimator σ̂² inside pen_id(λ). [sent-254, score-0.294]

28 For every λ ∈ [0, +∞], define pen_min(λ) = pen_min(λ, K) := (2 tr(A_{λ,K}) − tr(A_{λ,K}^⊤ A_{λ,K})) / n. [sent-255, score-0.148]

29 We shall see now that it is a minimal penalty in the following sense. [sent-256, score-0.061]

30 If for every C > 0, λ̂_0(C) ∈ argmin_{λ∈[0,+∞]} { (1/n) ||A_{λ,K} Y − Y||_2^2 + C pen_min(λ, K) }, then, up to concentration inequalities, λ̂_0(C) acts as a minimizer of g_C(λ) = E[(1/n) ||A_λ Y − Y||_2^2 + C pen_min(λ)] − σ² = (1/n) ||(A_λ − I_n) f||_2^2 + (C − σ²) pen_min(λ). [sent-257, score-0.276]
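The minimal-penalty idea above can be sketched numerically: for each constant C on a grid, find λ̂_0(C) minimizing the penalized empirical criterion and record df(λ̂_0(C)) = tr(A_{λ̂_0(C)}); the jump of this curve localizes σ². Grid sizes, helper names and the brute-force search are our own simplifications, not the paper's Algorithm 14 verbatim.

```python
import numpy as np

def ridge_smoother(K, lam):
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def pen_min(A):
    """pen_min(lam, K) = (2 tr(A_lam) - tr(A_lam^T A_lam)) / n."""
    n = A.shape[0]
    return (2.0 * np.trace(A) - np.trace(A.T @ A)) / n

def df_at_best_lambda(K, Y, C, lambdas):
    """df(lam_0(C)), where lam_0(C) minimizes ||A_lam Y - Y||_2^2 / n + C pen_min(lam, K)."""
    n = K.shape[0]
    best_crit, best_df = np.inf, 0.0
    for lam in lambdas:
        A = ridge_smoother(K, lam)
        crit = np.sum((A @ Y - Y) ** 2) / n + C * pen_min(A)
        if crit < best_crit:
            best_crit, best_df = crit, np.trace(A)   # df(lam) = tr(A_lam)
    return best_df

# sigma2_hat is read off where C -> df_at_best_lambda(K, Y, C, lambdas) jumps from ~n
# down to a much smaller value (see the jump-detection remark quoted further below).
```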

31 Suppose that ε ∼ N(0, σ² I_n) with σ² > 0, and that λ_0 ∈ (0, +∞) and d_n ≥ 1 exist such that df(λ_0) ≤ √n and (1/n) ||(A_{λ_0} − I_n)F||_2^2 ≤ d_n σ² ln(n)/n. (8) [sent-267, score-0.306]

32 Then for every δ ≥ 2, some constant n_0(δ) and an event Ω exist such that P(Ω) ≥ 1 − n^{−δ} and, if n ≥ n_0(δ), on Ω, (1 − β(2 + δ)√(ln(n)/n)) σ² ≤ Ĉ ≤ (1 + β(2 + δ) d_n √(ln(n)/n)) σ². [sent-268, score-0.266]

33 For every j ∈ {1, …, p}, there exists λ_{0,j} ∈ (0, +∞) such that df(λ_{0,j}) ≤ √n and (1/n) ||(A_{λ_{0,j}} − I_n) F e_j||_2^2 ≤ Σ_{j,j} ln(n)/n. (13) We can now state the first main result of the paper. [sent-337, score-0.21]

34 The multiplicative nature of the error is crucial for deriving the oracle inequality stated in Section 5, since it allows us to show that the ideal penalty defined in Equation (7) is precisely estimated when Σ is replaced by Σ̂. [sent-342, score-0.217]

35 The upper bound ln(n)/n can be multiplied by any factor 1 ≤ d_n ≪ n/ln(n) (as in Theorem 15), at the price of multiplying η by d_n in the upper bound of Equation (14). [sent-348, score-0.096]

36 We show two oracle inequalities (Theorems 26 and 29) that correspond to two possible definitions of Σ. [sent-360, score-0.115]

37 , 2011) when the risk of the proposed estimator is controlled by the risk of an estimator using information coming from the true parameter (that is, available only if provided by an oracle). [sent-362, score-0.118]

38 1 A General Result for Discrete Matrix Sets M. We first show that the estimator introduced in Equation (11) is precise enough to derive an oracle inequality when plugged into the penalty defined in Equation (7), in the case where M is finite. [sent-364, score-0.235]

39 We define M̂ ∈ argmin_{M∈M} { ||f̂_M − y||_2^2 + 2 tr(A_M · (Σ̂ ⊗ I_n)) }. [sent-366, score-0.301]
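Once Σ̂ is available, the selection rule above is a plain minimization over a finite family; a sketch in which the closed form for A_M is the same assumption as in the earlier multi-task snippet, and the function name is ours.

```python
import numpy as np

def select_M(y, K, candidates, Sigma_hat):
    """M_hat = argmin over candidates of ||f_hat_M - y||_2^2 + 2 tr(A_M (Sigma_hat kron I_n))."""
    n = K.shape[0]
    best_val, best_M = np.inf, None
    for M in candidates:                 # candidates: list of p x p symmetric positive definite matrices
        p = M.shape[0]
        Kt = np.kron(np.linalg.inv(M), K)
        A_M = Kt @ np.linalg.inv(Kt + n * p * np.eye(n * p))
        val = np.sum((A_M @ y - y) ** 2) + 2.0 * np.trace(A_M @ np.kron(Sigma_hat, np.eye(n)))
        if val < best_val:
            best_val, best_M = val, M
    return best_M
```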

40 First, on Ω, (1/(np)) ||f̂_M̂ − f||_2^2 ≤ (1 + 1/ln(n))² inf_{M∈M} { (1/(np)) ||f̂_M − f||_2^2 } + L_2 c(Σ)^4 tr(Σ) (α + δ)² p³ ln(n)³ / (np). (16) [sent-370, score-0.834]

41 Second, an absolute constant L_3 exists such that E[(1/(np)) ||f̂_M̂ − f||_2^2] ≤ (1 + 1/ln(n))² E[inf_{M∈M} (1/(np)) ||f̂_M − f||_2^2] + L_2 c(Σ)^4 tr(Σ) (α + δ)² p³ ln(n)³ / (np) + L_3 (p(p + C)/n^{δ/2}) (|||Σ||| + ||f||_2^2/(np)). (17) [sent-371, score-2.013]

42 It turns out that a faster algorithm can be used instead of Equation (11), with a reduced error and a larger probability event in the oracle inequality. [sent-376, score-0.147]

43 Computations detailed in Appendix D show that the ideal penalty introduced in Equation (7) can be written, for all M = P^⊤ Diag(d_1, …, d_p) P ∈ M, as follows. [sent-383, score-0.102]

44 For M = P^⊤ Diag(d_1, …, d_p) P ∈ M, pen_id(M) = 2 tr(A_M · (Σ ⊗ I_n))/(np) = (2/(np)) ∑_{j=1}^p tr(A_{p d_j}) Σ_{j,j}. [sent-386, score-1.163]
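The decomposition above only involves p single-task smoothers A_{p d_j} of size n × n, which is what makes the penalty cheap to evaluate for jointly diagonalizable collections. A sketch under our reading that the Σ_{j,j} are the diagonal entries of the noise covariance expressed in the basis P that diagonalizes M; the helper name is ours.

```python
import numpy as np

def ideal_penalty_diag(K, d, sigma_diag):
    """pen_id(M) = (2 / (n p)) * sum_j tr(A_{p d_j}) * sigma_diag[j] for M = P^T Diag(d) P."""
    n, p = K.shape[0], len(d)
    total = 0.0
    for j in range(p):
        A_j = K @ np.linalg.inv(K + n * (p * d[j]) * np.eye(n))   # single-task smoother at level p * d_j
        total += np.trace(A_j) * sigma_diag[j]
    return 2.0 * total / (n * p)
```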

45 For every j ∈ {1, …, p}, a(u_j) denotes the output of Algorithm 14 applied to Problem (P_{u_j}), and M̂_HM ∈ argmin_{M∈M} { ||f̂_M − y||_2^2 + 2 tr(A_M · (Σ̂_HM ⊗ I_n)) }. [sent-404, score-0.301]

46 First, on Ω, (1/(np)) ||f̂_{M̂_HM} − f||_2^2 ≤ (1 + 1/ln(n))² inf_{M∈M} (1/(np)) ||f̂_M − f||_2^2 + L_2 tr(Σ)(2 + δ)² ln(n)³ / n. (21) [sent-407, score-0.834]

47 Second, an absolute constant L_4 exists such that E[(1/(np)) ||f̂_{M̂_HM} − f||_2^2] ≤ (1 + 1/ln(n))² E[inf_{M∈M} (1/(np)) ||f̂_M − f||_2^2] + L_4 ( tr(Σ)(2 + δ)² ln(n)³ / n + (p/n^{δ/2}) ||f||_2^2 / (np) ). (22) [sent-408, score-0.834]

48 Theorem 29 is proved in Section F. [sent-409, score-0.393]

49 Remark 31 (Scaling of (n, p)) When assumption (15) holds, Equation (16) implies the asymptotic optimality of the estimator f̂_M̂ when c(Σ)^4 (tr Σ / p) × (p³ (ln(n))³ / n) ≪ inf_{M∈M} (1/(np)) ||f̂_M − f||_2^2. [sent-412, score-0.766]

50 When assumption (18) holds, the scalings required to ensure optimality in Equation (21) are more favorable: tr Σ × ((ln(n))³ / n) ≪ inf_{M∈M} (1/(np)) ||f̂_M − f||_2^2. [sent-414, score-0.707]

51 It is to be noted that p still influences the left-hand side via tr Σ. [sent-415, score-0.266]

52 Remark 32 Theorems 26 and 29 are non-asymptotic oracle inequalities, with a multiplicative term of the form 1 + o(1). [sent-416, score-0.115]

53 This allows us to claim that our selection procedure is nearly optimal, since our estimator is close (with regard to the empirical quadratic norm) to the oracle one. [sent-417, score-0.209]

54 Compared to the setting of Theorem 26, assumption (18) allows a simpler estimator for the penalty (19), with an increased probability and a reduced error in the oracle inequality. [sent-423, score-0.235]

55 The structure of kernel ridge regression allows us to have a uniform control over a continuous set for the single-task estimators at the “cost” of n pointwise controls, which can then be extended to the multi-task setting via (18). [sent-425, score-0.115]

56 If M_1, …, M_K all satisfy (18) (with different matrices P_k), then Theorem 29 still holds for M = ∪_{k=1}^K M_k with the penalty defined by Equation (20) with P = P_k when M ∈ M_k, and P(Ω) ≥ 1 − 9K p² n^{−δ}, by applying the union bound in the proof. [sent-430, score-0.097]

57 (2008) has shown that if we also minimize Equation (2) with respect to the matrix M, subject to the constraint tr(M^{−1}) = 1, then we obtain an equivalent regularization by the nuclear norm (a.k.a. the trace norm). [sent-433, score-0.294]

58 In the multi-task case, the trace-norm regularization, though computationally efficient, does not lead to an oracle inequality, while our criterion is an unbiased estimate of the generalization error, which turns out to be non-convex in the matrix M. [sent-441, score-0.143]

59 In Experiment D we consider the following three estimators, which depend on the choice of the collection M: for all β ∈ {clus, interval, ind}, f̂_β := f̂_{M̂_β} = A_{M̂_β} y, where M̂_β ∈ argmin_{M∈M_β} { (1/(np)) ||y − f̂_M||_2^2 + 2 tr(A_M · (Σ̂ ⊗ I_n)) / (np) } and Σ̂ is defined by Equation (11). [sent-524, score-1.087]

60 As explained in the following remark, the parameters of the former estimator are chosen by optimizing (20), in practice over a grid. [sent-526, score-0.135]

61 Remark 37 (Finding the jump in Algorithm 14) Algorithm 14 raises the question of how to detect the jump of df(λ̂_0(C)), which happens around C = σ². [sent-531, score-0.096]

62 This approach has a major drawback, because it sometimes selects a jump far away from the “real” jump around σ², when the real jump consists of several small jumps. [sent-534, score-0.144]
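Given the curve C ↦ df(λ̂_0(C)) from the earlier minimal-penalty sketch, the two heuristics discussed here can be written in a few lines; both are illustrative only and neither reproduces the exact rule used in Algorithm 14.

```python
import numpy as np

def sigma2_biggest_jump(C_grid, df_values):
    """'Biggest jump' rule: return the C at which df(lam_0(C)) drops the most."""
    drops = df_values[:-1] - df_values[1:]        # df(lam_0(C)) is non-increasing in C
    i = int(np.argmax(drops))
    return C_grid[i + 1]

def sigma2_threshold(C_grid, df_values, n, frac=0.5):
    """Alternative rule: first C at which df(lam_0(C)) falls below frac * n."""
    below = np.where(df_values < frac * n)[0]
    return C_grid[below[0]] if below.size else C_grid[-1]
```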

63 Increasing the number of tasks rapidly reduces the quadratic error with multi-task estimators (Figure 2), contrary to what happens with single-task estimators (Figure 3). [sent-546, score-0.199]

64 (Figure residue) Legend: “With the estimated Σ”; y-axis: “Ratio of the quadratic errors”. [sent-547, score-0.06]

65 Figure 2: Increasing the number of tasks p (Experiment A), quadratic errors of multi-task estimators (np)^{−1} E[||f̂_{similar,S} − f||²]. [sent-566, score-0.171]

66 Figure 3: Increasing the number of tasks p (Experiment A), quadratic errors of single-task estimators (np)^{−1} E[||f̂_{ind,S} − f||²]. [sent-578, score-0.171]

67 Figure 4: Increasing the sample size n (Experiment B), quadratic errors of multi-task estimators (np)^{−1} E[||f̂_{similar,S} − f||²]. [sent-589, score-0.113]

68 Figure 5: Increasing the sample size n (Experiment B), quadratic errors of single-task estimators (np)^{−1} E[||f̂_{ind,S} − f||²]. [sent-600, score-0.113]

69 Figure 7: Increasing the signal-to-noise ratio (Experiment C; x-axis: intensity of the noise matrix Σ), quadratic errors of multi-task estimators (np)^{−1} E[||f̂_{similar,S} − f||²]. [sent-618, score-0.141]

70 (Table residue) Ratios q of quadratic errors, ||f̂_clus − f||²/||f̂_ind − f||², ||f̂_interval − f||²/||f̂_ind − f||² and ||f̂_interval − f||²/||f̂_clus − f||², together with their standard deviations Std[q]. [sent-653, score-0.124]

71 A noticeable phenomenon also occurs in Figure 2 and even more in Figure 3: the estimator f̂_{ind,Σ} (that is, obtained knowing the true covariance matrix Σ) is less efficient than f̂_{ind,Σ̂}, where the covariance matrix is estimated. [sent-667, score-0.171]

72 It corresponds to the combination of two facts: (i) multiplying the ideal penalty by a small factor 1 < C_n < 1 + o(1) is known to often improve performance in practice when the sample size is small (see Section 6. [sent-668, score-0.102]

73 2 of Arlot, 2009), and (ii) minimal-penalty algorithms like Algorithm 14 are conjectured to overpenalize slightly when n is small or the noise level is large (Lerasle, 2011), as confirmed by Figure 7. [sent-670, score-0.061]

74 The last line of Table 2 does not show that the clustering setting improves over the “segmentation into intervals” one, which would be expected if a model close to the oracle were selected in both cases. [sent-674, score-0.115]

75 The crucial point is to estimate the p × p covariance matrix Σ of the noise (covariance between tasks), in order to learn the task similarity matrix M. [sent-678, score-0.118]

76 Second, we show an oracle inequality (Theorem 26), in particular with a simplified estimation of Σ and improved performance when the matrices of M are jointly diagonalizable (which often corresponds to cases where we have prior knowledge of what the relations between the tasks might be). [sent-681, score-0.24]

77 We use the set M_similar := { M_similar(λ, µ) = (λ + pµ) I_p − µ 11^⊤ : (λ, µ) ∈ (0, +∞)² }. Using the estimator f̂_M = A_M y we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36): E||f̂_M − f||_2^2 = ||(A_M − I_np) f||_2^2 + tr(A_M^⊤ A_M · (Σ ⊗ I_n)). [sent-725, score-0.094]

78 Thus, with µ̃ = λ + pµ, we can diagonalize in an orthonormal basis any matrix M_{λ,µ} ∈ M as M = P^⊤ D_{λ,µ} P, with D = D_{λ,µ} = Diag{λ, µ̃, …, µ̃}. [sent-734, score-0.079]
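The eigenstructure stated above is easy to check numerically: M_similar(λ, µ) has the eigenvalue λ on the constant vector and µ̃ = λ + pµ with multiplicity p − 1 on its orthogonal complement. A small verification sketch (the numerical values are chosen arbitrarily):

```python
import numpy as np

p, lam, mu = 4, 0.5, 0.2
M = (lam + p * mu) * np.eye(p) - mu * np.ones((p, p))   # M_similar(lam, mu)
print(np.sort(np.linalg.eigvalsh(M)))                    # [0.5, 1.3, 1.3, 1.3] = [lam, lam + p*mu, ...]
```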

79 Bias: We can first remark that (P^⊤ ⊗ Q^⊤) = (P ⊗ Q)^⊤ is an orthogonal matrix and that P × 1 = (1, 0, …). [sent-751, score-0.104]

80 Thus we can finally write tr(A_M^⊤ A_M · (Σ ⊗ I_n)) = tr[ (P ⊗ Q)^⊤ ((D^{−1} ⊗ ∆)((D^{−1} ⊗ ∆) + np I_np)^{−1})² (P ⊗ Q)(Σ ⊗ I_n) ] = tr[ ((D^{−1} ⊗ ∆)((D^{−1} ⊗ ∆) + np I_np)^{−1})² (P ⊗ Q)(Σ ⊗ I_n)(P ⊗ Q)^⊤ ] = ∑_{i=1}^n [ (µ_i/(µ_i + npλ))² Σ_{1,1} + (µ_i/(µ_i + npµ̃))² ∑_{j=2}^p Σ_{j,j} ]. [sent-782, score-0.532]

81 As noted at the end of Example 12, this leads to an oracle which has all its p functions equal. [sent-783, score-0.115]

82 The computations detailed above also show that the ideal penalty introduced in Equation (7) can be written as pen_id(M) = 2 tr(A_M · (Σ ⊗ I_n))/(np) = (2/(np)) ∑_{j=1}^p tr(A_{p d_j}) Σ_{j,j}. [sent-791, score-1.265]

83 For the last condition, remark that for every j ∈ {1, …, p}, … [sent-855, score-0.1]

84 1 Key Quantities and their Concentration Around their Means. Definition 50: We introduce, for S ∈ S_p^{++}(R), M_o(S) ∈ argmin_{M∈M} { ||F̂_M − Y||_2^2 + 2 tr(A_M · (S ⊗ I_n)) }. (29) Definition 51: Let S ∈ S_p(R); we denote by S_+ the symmetric matrix where the eigenvalues of S have been thresholded at 0. [sent-930, score-0.353]

85 We can see that the quantities δ_i decouple, therefore |δ_1(M)| = |∑_{i=1}^p (⟨ε_i, A_{p d_i} ε_i⟩ − E[⟨ε_i, A_{p d_i} ε_i⟩])|, |δ_2(M)| = |∑_{i=1}^p (||A_{p d_i} ε_i||_2^2 − E[||A_{p d_i} ε_i||_2^2])|, |δ_3(M)| = |∑_{i=1}^p 2⟨A_{p d_i} ε_i, (A_{p d_i} − I_n) f_i⟩|, |δ_4(M)| = |∑_{i=1}^p 2⟨ε_i, (I_n − A_{p d_i}) f_i⟩|. [sent-1000, score-0.798]

86 Assumption (18) implies that the matrix P used above is the same for all the matrices M of M . [sent-1002, score-0.064]

87 2 Intermediate Result. We first prove a general oracle inequality, under the assumption that the penalty we use (with an estimator of Σ) does not underestimate the ideal penalty (involving Σ) too much. [sent-1012, score-0.337]

88 Combining Equations (29) and (37), we get: ||f̂_{M_o(S)} − f||_2^2 + 2 tr(A_{M_o(S)} · ((S − Σ)_+ ⊗ I_n)) + ∆(M_o(S)) ≤ inf_{M∈M} { ||f̂_M − f||_2^2 + 2 tr(A_M · ((S − Σ) ⊗ I_n)) + ∆(M) }. (38) [sent-1016, score-0.58]

89 With Equation (38), and with C_1 = C_A, C_2 = 2C_C + 4C_D + 4C_E and C_3 = C_B + C_F, we get (1 − 2θ) ||f̂_{M_o(S)} − f||_2^2 + 2 tr(A_{M_o(S)} · ((S − Σ)_+ ⊗ I_n)) ≤ inf_{M∈M} { ||f̂_M − f||_2^2 + 2 tr(A_M · ((S − Σ) ⊗ I_n)) } + (C_1 + C_2 θ + C_3/θ) γ ln(n) |||Σ|||. (40) [sent-1020, score-0.58]

90 Using elementary algebra it is easy to show that, for all symmetric positive definite matrices A, M and N of size p, M ⪰ N implies that tr(AM) ≥ tr(AN). [sent-1029, score-0.084]

91 Using Equation (27) and defining η_2 := 12β(α + δ)γ ln(n)/n, we get (1 − 3/ln(n)) (1/(np)) ||f̂_M̂ − f||_2^2 ≤ (1 + 1/ln(n)) inf_{M∈M} { (1/(np)) ||f̂_M − f||_2^2 + η_2 tr(A_M · (Σ ⊗ I_n))/(np) } + (729 β² γ² tr(Σ)² (α + δ)² ln(n)² / (4 |||Σ|||²)) |||Σ||| (2C_A + 3C_C + 6C_D + 6C_E + ln(n)(18C_B + 18C_F))/(np). (42) [sent-1034, score-1.62]

92 Now, to get a classical oracle inequality, we have to show that η_2 v_1(M) = η_2 tr(A_M · (Σ ⊗ I_n)) is negligible compared to ||f̂_M − f||_2^2. [sent-1035, score-0.508]

93 Finally we deduce an oracle inequality in expectation by noting that, if (1/(np)) ||f̂_M̂ − f||_2^2 ≤ R_{n,δ} on Ω, then, using the Cauchy-Schwarz inequality, E[(1/(np)) ||f̂_M̂ − f||_2^2] = E[1_Ω (1/(np)) ||f̂_M̂ − f||_2^2] + E[1_{Ω^c} (1/(np)) ||f̂_M̂ − f||_2^2] ≤ E[R_{n,δ}] + (1/(np)) √( (4p(p + 1) + 6pC)/n^δ ) √( E[||f̂_M̂ − f||_2^4] ). [sent-1047, score-1.687]

94 (44) We can remark that, since |||A_M||| ≤ 1, ||f̂_M − f||_2^2 ≤ 2 ||A_M ε||_2^2 + 2 ||(I_np − A_M) f||_2^2 ≤ 2 ||ε||_2^2 + 8 ||f||_2^2, so that √(E[||f̂_M − f||_2^4]) ≤ 12 np |||Σ||| + 4 ||f||_2^2. [sent-1048, score-0.076]

95 This, together with Equation (42) and Equation (44), induces Equation (17), using that for some constant L_3 > 0, 12 √( (p(p + 1)/2 + 6pC)/n^δ ) ( |||Σ||| + (4/(np)) ||f||_2^2 ) ≤ L_3 (p(p + C)/n^{δ/2}) ( |||Σ||| + (1/(np)) ||f||_2^2 ). [sent-1049, score-0.786]

96 Then, inf_{A ∈ M_np(R), |||A||| ≤ 1} { [ tr(A^⊤A · (Σ ⊗ I_n)) + x|||Σ||| ] / tr(A · (Σ ⊗ I_n)) } ≥ 2 √( x|||Σ||| / (n tr(Σ)) ). Proof: First note that the bilinear form on M_np(R), (A, B) → tr(A^⊤B · (Σ ⊗ I_n)), is a scalar product. [sent-1051, score-0.072]

97 Therefore inf_{A ∈ M_np(R), |||A||| ≤ 1} { [ tr(A^⊤A · (Σ ⊗ I_n)) + x|||Σ||| ] / tr(A · (Σ ⊗ I_n)) } ≥ inf_{c > 0} { c/(n tr(Σ)) + x|||Σ|||/c } ≥ 2 √( x|||Σ||| / (n tr(Σ)) ). [sent-1054, score-0.096]

98 For every δ ≥ 2, a constant n_0(δ), an absolute constant L_1 > 0 and an event Ω_HM exist such that P(Ω_HM) ≥ 1 − pn^{−δ} and, for every n ≥ n_0(δ), on Ω_HM, for every M in M, (1 − η) tr(A_M · (Σ ⊗ I_n)) ≤ tr(A_M · (Σ̂_HM ⊗ I_n)) ≤ (1 + η) tr(A_M · (Σ ⊗ I_n)), (45) where η := L_1(α + δ) √(ln(n)/n). [sent-1058, score-0.104]

99 To get to the oracle inequality in expectation we use the same technique as above, but we note that P(Ω^c) ≤ L_4 × p/n^{δ/2}. [sent-1078, score-0.115]

100 Data-driven calibration of linear estimators with minimal penalties, July 2011. [sent-1100, score-0.082]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('np', 0.393), ('fm', 0.35), ('arlot', 0.3), ('tr', 0.266), ('ln', 0.21), ('olnon', 0.207), ('rlot', 0.207), ('enalties', 0.197), ('inimal', 0.197), ('bach', 0.16), ('inp', 0.155), ('fsimilar', 0.135), ('msimilar', 0.124), ('hm', 0.123), ('oracle', 0.115), ('pdi', 0.114), ('ulti', 0.111), ('npinp', 0.104), ('egression', 0.094), ('mo', 0.093), ('fe', 0.08), ('remark', 0.076), ('diag', 0.076), ('df', 0.073), ('amo', 0.072), ('penid', 0.072), ('penmin', 0.062), ('find', 0.062), ('ks', 0.062), ('penalty', 0.061), ('estimator', 0.059), ('tasks', 0.058), ('equation', 0.056), ('cf', 0.055), ('estimators', 0.053), ('sylvain', 0.052), ('cc', 0.052), ('kronecker', 0.049), ('cb', 0.049), ('inf', 0.048), ('dn', 0.048), ('jump', 0.048), ('cn', 0.048), ('fei', 0.047), ('sierra', 0.041), ('ideal', 0.041), ('pd', 0.039), ('cd', 0.038), ('ridge', 0.038), ('matrices', 0.036), ('ens', 0.035), ('mnp', 0.035), ('quadratic', 0.035), ('evgeniou', 0.035), ('var', 0.035), ('argmin', 0.035), ('task', 0.034), ('theorem', 0.033), ('event', 0.032), ('experiment', 0.032), ('concentration', 0.031), ('ei', 0.031), ('diagonalizable', 0.031), ('italie', 0.031), ('matthieu', 0.031), ('mhm', 0.031), ('rnp', 0.031), ('solnon', 0.031), ('umr', 0.031), ('ce', 0.031), ('proposition', 0.03), ('vec', 0.029), ('calibration', 0.029), ('gn', 0.029), ('matrix', 0.028), ('covariance', 0.028), ('massimiliano', 0.027), ('normale', 0.027), ('diagonalize', 0.027), ('crit', 0.027), ('lounici', 0.026), ('gk', 0.026), ('pm', 0.026), ('proof', 0.026), ('errors', 0.025), ('penalties', 0.025), ('tl', 0.025), ('fa', 0.025), ('basis', 0.024), ('symmetric', 0.024), ('every', 0.024), ('francis', 0.024), ('cedex', 0.024), ('rieure', 0.024), ('bilinear', 0.024), ('kernel', 0.024), ('multitask', 0.023), ('issn', 0.023), ('sin', 0.023), ('sup', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 73 jmlr-2012-Multi-task Regression using Minimal Penalties

Author: Matthieu Solnon, Sylvain Arlot, Francis Bach

Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory

2 0.18621112 91 jmlr-2012-Plug-in Approach to Active Learning

Author: Stanislav Minsker

Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands

3 0.11340851 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

Author: Michael U. Gutmann, Aapo Hyvärinen

Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image

4 0.10917637 82 jmlr-2012-On the Necessity of Irrelevant Variables

Author: David P. Helmbold, Philip M. Long

Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory

5 0.093763955 97 jmlr-2012-Regularization Techniques for Learning with Matrices

Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari

Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning

6 0.092638604 80 jmlr-2012-On Ranking and Generalization Bounds

7 0.092473947 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

8 0.084793501 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions

9 0.076596051 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

10 0.061701484 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

11 0.061691433 111 jmlr-2012-Structured Sparsity and Generalization

12 0.061190743 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

13 0.059388459 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

14 0.055028256 59 jmlr-2012-Linear Regression With Random Projections

15 0.054649267 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

16 0.053193934 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

17 0.049993534 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

18 0.049712248 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

19 0.049167056 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

20 0.048330929 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.253), (1, 0.118), (2, -0.144), (3, 0.065), (4, -0.088), (5, -0.11), (6, 0.014), (7, 0.068), (8, -0.062), (9, -0.022), (10, -0.059), (11, -0.015), (12, -0.004), (13, 0.064), (14, -0.162), (15, -0.132), (16, -0.094), (17, 0.051), (18, 0.165), (19, -0.042), (20, -0.21), (21, 0.393), (22, -0.085), (23, 0.022), (24, -0.001), (25, 0.0), (26, 0.127), (27, 0.093), (28, 0.04), (29, 0.064), (30, 0.038), (31, -0.009), (32, -0.054), (33, 0.082), (34, -0.019), (35, -0.017), (36, 0.004), (37, 0.015), (38, -0.063), (39, 0.035), (40, 0.012), (41, -0.037), (42, 0.031), (43, -0.091), (44, -0.026), (45, -0.098), (46, -0.009), (47, 0.016), (48, 0.131), (49, -0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92966932 73 jmlr-2012-Multi-task Regression using Minimal Penalties

Author: Matthieu Solnon, Sylvain Arlot, Francis Bach

Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory

2 0.62310296 91 jmlr-2012-Plug-in Approach to Active Learning

Author: Stanislav Minsker

Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands

3 0.61944073 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions

Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller

Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning

4 0.51082563 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

Author: Michael U. Gutmann, Aapo Hyvärinen

Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image

5 0.37309304 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

6 0.34751365 82 jmlr-2012-On the Necessity of Irrelevant Variables

7 0.34283373 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

8 0.33539972 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

9 0.33010229 97 jmlr-2012-Regularization Techniques for Learning with Matrices

10 0.31950417 111 jmlr-2012-Structured Sparsity and Generalization

11 0.29817775 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

12 0.29117513 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

13 0.27855209 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

14 0.2703822 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

15 0.26255393 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization

16 0.24946347 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

17 0.24051978 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

18 0.23856534 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

19 0.23271079 59 jmlr-2012-Linear Regression With Random Projections

20 0.22880472 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.023), (21, 0.07), (25, 0.294), (26, 0.052), (27, 0.01), (29, 0.031), (35, 0.011), (49, 0.031), (64, 0.013), (75, 0.056), (77, 0.012), (79, 0.012), (81, 0.01), (92, 0.128), (96, 0.158)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75993192 73 jmlr-2012-Multi-task Regression using Minimal Penalties

Author: Matthieu Solnon, Sylvain Arlot, Francis Bach

Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory

2 0.59532166 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

3 0.59508801 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Author: Sangkyun Lee, Stephen J. Wright

Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification

4 0.59433162 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao

Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization

5 0.58753312 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang

Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality

6 0.58713388 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

7 0.5831393 80 jmlr-2012-On Ranking and Generalization Bounds

8 0.58183789 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

9 0.57890409 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

10 0.57608116 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

11 0.57555544 91 jmlr-2012-Plug-in Approach to Active Learning

12 0.57505214 103 jmlr-2012-Sampling Methods for the Nyström Method

13 0.57447475 13 jmlr-2012-Active Learning via Perfect Selective Classification

14 0.57029462 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

15 0.56885272 111 jmlr-2012-Structured Sparsity and Generalization

16 0.56787121 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

17 0.56754816 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

18 0.56561232 84 jmlr-2012-Online Submodular Minimization

19 0.56289965 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

20 0.56287849 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing