jmlr jmlr2012 jmlr2012-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthieu Solnon, Sylvain Arlot, Francis Bach
Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory
Reference: text
sentIndex sentText sentNum sentScore
1 The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. [sent-8, score-0.085]
2 We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. [sent-10, score-0.087]
3 Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. [sent-11, score-0.276]
4 In this paper we study the kernel ridge regression procedure in a multi-task framework. [sent-18, score-0.062]
5 S OLNON , A RLOT AND BACH One-dimensional kernel ridge regression, which we refer to as “single-task” regression, has been widely studied. [sent-20, score-0.062]
6 As we briefly review in Section 3 one has, given n data points (Xi ,Yi )n , to estimate i=1 a function f , often the conditional expectation f (Xi ) = E[Yi |Xi ], by minimizing the quadratic risk of the estimator regularized by a certain norm. [sent-21, score-0.094]
7 In this paper, we focus on the concept of minimal penalty, which was first introduced by Birg´ and Massart (2007) e and Arlot and Massart (2009) for model selection, then extended to linear estimators such as kernel ridge regression by Arlot and Bach (2011). [sent-31, score-0.115]
8 The main remaining theoretical problem is the calibration of a matricial parameter M (typically of size p), which characterizes the relationship between the tasks and extends the regularization parameter from single-task regression. [sent-44, score-0.087]
9 We give here a new algorithm to estimate Σ, and show that the estimation is sharp enough to derive an oracle inequality for the estimation of the task similarity matrix M, both with high probability and in expectation. [sent-51, score-0.177]
10 Multi-task Regression: Problem Set-up We consider p kernel ridge regression tasks. [sent-77, score-0.062]
11 j j (1) This means that, while the observations are independent, the outputs of the different tasks can be correlated, with correlation matrix Σ between the tasks. [sent-111, score-0.086]
12 , f p ) ∈ F p by fM ∈ argmin g∈F p p p 1 n p j (yi − g j (Xi ))2 + ∑ ∑ M j,l g j , gℓ F ∑∑ np i=1 j=1 j=1 ℓ=1 . [sent-137, score-0.428]
13 −µ λ + (p − 1)µ Taking M = Msimilar (λ, µ) in Equation (2) leads to the criterion p µ 1 n p j 2 ∑ ∑ (yi − g j (Xi ))2 + λ ∑ g j F + 2 np i=1 j=1 j=1 p p ∑∑ j=1 k=1 2 g j − gk F . [sent-151, score-0.419]
14 2 Optimal Choice of the Kernel ++ Now when working in multi-task regression, a set M ⊂ S p (R) of matrices M is given, and the goal is to select the “best” one, that is, minimizing over M the quadratic risk n−1 fM − f 2 . [sent-211, score-0.071]
15 The oracle risk is defined as inf M∈M fM − f 2 2 . [sent-214, score-0.163]
16 (6) The ideal choice, called the oracle, is any matrix M ⋆ ∈ argmin M∈M fM − f 2 2 . [sent-215, score-0.104]
17 However the oracle M ⋆ is not an estimator, since it depends on f . [sent-219, score-0.115]
18 Example 12 (Partial computation of the oracle in a simple setting) It is possible in certain simple settings to exactly compute the oracle (or, at least, some part of it). [sent-220, score-0.23]
19 p Using the estimator fM = AM y we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36): E fM − f 2 2 = (AM − Inp ) f 2 ⊤ 2 + tr(AM AM · (Σ ⊗ In )) . [sent-223, score-0.094]
20 Thus the oracle is obtained when µ = +∞, leading to a situation where the oracle functions f 1,⋆ , . [sent-225, score-0.23]
21 2778 M ULTI - TASK R EGRESSION USING M INIMAL P ENALTIES As explained by Arlot and Bach (2011), we choose M ∈ argmin{crit(M)} with crit(M) = M∈M 1 y − fM np 2 2 + pen(M) , where the penalty term pen(M) has to be chosen appropriately. [sent-230, score-0.454]
22 Our goal is to build an estimator whose risk is the closest possible to the oracle risk. [sent-236, score-0.174]
23 The unbiased risk estimation principle (introduced by Akaike, 1970) requires E [crit(M)] ≈ E 1 fM − f np 2 , 2 which leads to the (deterministic) ideal penalty penid (M) := E 1 fM − f np 2 2 −E 1 y − fM np 2 2 . [sent-242, score-1.353]
24 Since ε is centered and M is deterministic, we get, up to an additive factor independent of M, penid (M) = 2E [ ε, AM ε ] , np that is, as the covariance matrix of ε is Σ ⊗ In , penid (M) = 2 tr AM · (Σ ⊗ In ) np . [sent-244, score-1.252]
25 (7) In order to approach this penalty as precisely as possible, we have to sharply estimate Σ. [sent-245, score-0.061]
26 Writing M = λ with λ ∈ [0, +∞], the regularization matrix is ∀λ ∈ (0, +∞) , Aλ = Aλ,K = K(K + nλIn )−1 , A0 = In and A+∞ = 0; the ideal penalty becomes penid (λ) = 2σ2 tr(Aλ ) . [sent-252, score-0.202]
27 The ideal penalty however depends on σ2 ; in order to have a fully data-driven penalty we have to replace σ2 by an estimator σ2 inside penid (λ). [sent-254, score-0.294]
28 For every λ ∈ [0, +∞], define penmin (λ) = penmin (λ, K) := (2 tr(Aλ,K ) − tr(A⊤ Aλ,K )) λ,K . [sent-255, score-0.148]
29 n We shall see now that it is a minimal penalty in the following sense. [sent-256, score-0.061]
30 If for every C > 0 λ0 (C) ∈ argmin λ∈[0,+∞] 1 A Y −Y n λ,K 2 +C penmin (λ, K) 2 , then—up to concentration inequalities—λ0 (C) acts as a mimimizer of gC (λ) = E 1 A Y −Y n λ 2 2 +C penmin (λ) − σ2 = 1 (Aλ − In ) f n 2 2 2 + (C − σ ) penmin (λ) . [sent-257, score-0.276]
31 ε ∼ N (0, σ2 In ) with σ2 > 0, and that λ0 ∈ (0, +∞) and dn ≥ 1 exist such that df(λ0 ) ≤ √ n and 1 (Aλ0 − In )F n 2 2 ≤ dn σ2 ln n . [sent-267, score-0.306]
32 n Suppose (8) Then for every δ ≥ 2, some constant n0 (δ) and an event Ω exist such that P(Ω) ≥ 1 − n−δ and if n ≥ n0 (δ), on Ω, 1 − β(2 + δ) ln n n σ2 ≤ C ≤ 1 + β(2 + δ)dn ln(n) n σ2 . [sent-268, score-0.266]
33 , p} , ∃λ0, j ∈ (0, +∞) , (13) 2 √ ln n 1 (Aλ0, j − In )Fe j ≤ Σ j, j df(λ0, j ) ≤ n and 2 n n We can now state the first main result of the paper. [sent-337, score-0.21]
34 The multiplicative nature of the error is crucial for deriving the oracle inequality stated in Section 5, since it allows to show the ideal penalty defined in Equation (7) is precisely estimated when Σ is replaced by Σ. [sent-342, score-0.217]
35 The upper bound ln(n)/n can be multiplied by any factor 1 ≤ dn ≪ n/ ln(n) (as in Theorem 15), at the price of multiplying η by dn in the upper bound of Equation (14). [sent-348, score-0.096]
36 We show two oracle inequalities (Theorems 26 and 29) that correspond to two possible definitions of Σ. [sent-360, score-0.115]
37 , 2011) when the risk of the proposed estimator is controlled by the risk of an estimator using information coming from the true parameter (that is, available only if provided by an oracle). [sent-362, score-0.118]
38 1 A General Result for Discrete Matrix Sets M We first show that the estimator introduced in Equation (11) is precise enough to derive an oracle inequality when plugged in the penalty defined in Equation (7) in the case where M is finite. [sent-364, score-0.235]
39 We define M ∈ argmin M∈M fM − y 2 2 + 2 tr AM · (Σ ⊗ In ) . [sent-366, score-0.301]
40 First, on Ω, 1 f −f np M 2 2 ≤ 1+ 1 ln(n) 2 inf M∈M 2 1 fM − f np + L2 c(Σ)4 tr(Σ)(α + δ)2 2 p3 ln(n)3 . [sent-370, score-0.834]
41 (16) np Second, an absolute constant L3 exists such that E 1 f −f np M 2 ≤ 1+ 2 1 ln(n) +L2 c(Σ) tr(Σ)(α + δ) 4 2 2 E inf M∈M p3 ln(n)3 np 1 fM − f np + L3 2 2 p(p +C) nδ/2 f 2 2 |||Σ||| + np (17) . [sent-371, score-2.013]
42 It turns out a faster algorithm can be used instead of Equation (11) with a reduced error and a larger probability event in the oracle inequality. [sent-376, score-0.147]
43 Computations detailed in Appendix D show that the ideal penalty introduced in Equation (7) can be written as ∀M = P⊤ Diag(d1 , . [sent-383, score-0.102]
44 , d p )P ∈ M , penid (M) = 2 tr AM · (Σ ⊗ In ) 2 = np np p ∑ tr(A pd )Σ j, j j . [sent-386, score-1.163]
45 , p}, a(u j ) denotes the output of Algorithm 14 applied to Problem (Puj ), and MHM ∈ argmin M∈M fM − y 2 2 + 2 tr AM · (ΣHM ⊗ In ) 2784 . [sent-404, score-0.301]
46 First, on Ω, 1 −f f np MHM 2 2 ≤ 1+ 1 ln(n) 2 inf M∈M 1 fM − f np 2 2 + L2 tr(Σ)(2 + δ)2 ln(n)3 . [sent-407, score-0.834]
47 n (21) Second, an absolute constant L4 exists such that E 1 f −f np MHM 2 2 ≤ 1+ 1 ln(n) 2 E 1 fM − f np inf M∈M +L4 tr(Σ)(2 + δ) 2 ln(n) 3 n + 2 2 f 2 2 . [sent-408, score-0.834]
48 np p nδ/2 (22) Theorem 29 is proved in Section F. [sent-409, score-0.393]
49 Remark 31 (Scaling of (n, p)) When assumption (15) holds, Equation (16) implies the asymptotic optimality of the estimator fM when c(Σ)4 tr Σ p3 (ln(n))3 × ≪ inf p n M∈M 1 fM − f np 2 2 . [sent-412, score-0.766]
50 When assumption (18) holds, the scalings required to ensure optimality in Equation (21) are more favorable: tr Σ × (ln(n))3 ≪ inf n M∈M 1 fM − f np 2 2 . [sent-414, score-0.707]
51 It is to be noted that p still influences the left hand side via tr Σ. [sent-415, score-0.266]
52 Remark 32 Theorems 26 and 29 are non asymptotic oracle inequalities, with a multiplicative term of the form 1 + o(1). [sent-416, score-0.115]
53 This allows us to claim that our selection procedure is nearly optimal, since our estimator is close (with regard to the empirical quadratic norm) to the oracle one. [sent-417, score-0.209]
54 2785 S OLNON , A RLOT AND BACH Compared to the setting of Theorem 26, assumption (18) allows a simpler estimator for the penalty (19), with an increased probability and a reduced error in the oracle inequality. [sent-423, score-0.235]
55 The structure of kernel ridge regression allows us to have a uniform control over a continuous set for the single-task estimators at the “cost” of n pointwise controls, which can then be extended to the multi-task setting via (18). [sent-425, score-0.115]
56 , MK all satisfy (18) (with different matrices Pk ), then Theorem 29 still holds for M = K Mk with the penalty defined by Equation (20) with P = Pk when M ∈ Mk , and k=1 P(Ω) ≥ 1 − 9K p2 n−δ , by applying the union bound in the proof. [sent-430, score-0.097]
57 (2008) has shown that if we also minimize Equation (2) with respect to the matrix M subject to the constraint tr M −1 = 1, then we obtain an equivalent regularization by the nuclear norm (a. [sent-433, score-0.294]
58 In the multi-task case, the trace-norm regularization, though efficient computationally, does not lead to an oracle inequality, while our criterion is an unbiased estimate of the generalization error, which turns out to be non-convex in the matrix M. [sent-441, score-0.143]
59 In ExF periment D we consider the following three estimators, that depend on the choice of the collection M: ∀β ∈ {clus, interval, ind} , where Mβ ∈ argmin M∈Mβ 1 y − fM np fβ := fMβ = AMβ y 2 2 + 2 tr AM · (Σ ⊗ In ) np and Σ is defined by Equation (11). [sent-524, score-1.087]
60 As explained in the following remark the parameters of the former estimator are chosen by optimizing (20), in practice by choosing a grid. [sent-526, score-0.135]
61 Remark 37 (Finding the jump in Algorithm 14) Algorithm 14 raises the question of how to detect the jump of df(λ), which happens around C = σ2 . [sent-531, score-0.096]
62 This approach has a major drawback, because it sometimes selects a jump far away from the “real” jump around σ2 , when the real jump consists of several small jumps. [sent-534, score-0.144]
63 Increasing the number of tasks rapidly reduces the quadratic error with multi-task estimators (Figure 2) contrary to what happens with single-task estimators (Figure 3). [sent-546, score-0.199]
64 2788 M ULTI - TASK R EGRESSION USING M INIMAL P ENALTIES 1 With the estimated Σ Ratio of the quadratic errors 0. [sent-547, score-0.06]
65 1 0 0 2 4 6 8 10 p 12 14 16 18 20 Figure 2: Increasing the number of tasks p (Experiment A), quadratic errors of multi-task estimators (np)−1 E[ fsimilar,S − f 2 ]. [sent-566, score-0.171]
66 1 0 0 2 4 6 8 10 p 12 14 16 18 20 Figure 3: Increasing the number of tasks p (Experiment A), quadratic errors of single-task estimators (np)−1 E[ find,S − f 2 ]. [sent-578, score-0.171]
67 02 0 0 50 n 100 150 Figure 4: Increasing the sample size n (Experiment B), quadratic errors of multi-task estimators (np)−1 E[ fsimilar,S − f 2 ]. [sent-589, score-0.113]
68 1 0 0 50 n 100 150 Figure 5: Increasing the sample size n (Experiment B), quadratic errors of single-task estimators (np)−1 E[ find,S − f 2 ]. [sent-600, score-0.113]
69 02 0 2 4 6 8 Intensity of the noise matrix Σ 10 12 Figure 7: Increasing the signal-to-noise ratio (Experiment C), quadratic errors of multi-task estimators (np)−1 E[ fsimilar,S − f 2 ]. [sent-618, score-0.141]
70 00 q 2/ fclus − f find − f finterval − f 2 / find − f finterval − f 2 / fclus − f 2 2 2 Std[q] 0. [sent-653, score-0.124]
71 A noticeable phenomenon also occurs in Figure 2 and even more in Figure 3: the estimator find,Σ (that is, obtained knowing the true covariance matrix Σ) is less efficient than find,Σ where the covariance matrix is estimated. [sent-667, score-0.171]
72 It corresponds to the combination of two facts: (i) multiplying the ideal penalty by a small factor 1 < Cn < 1+o(1) is known to often improve performances in practice when the sample size is small (see Section 6. [sent-668, score-0.102]
73 2 of Arlot, 2009), and (ii) minimal penalty algorithms like Algorithm 14 are conjectured to overpenalize slightly when n is small or the noise-level is large (Lerasle, 2011) (as confirmed by Figure 7). [sent-670, score-0.061]
74 The last line of Table 2 does not show that the clustering setting improves over the “segmentation into intervals” one, which was awaited if a model close to the oracle is selected in both cases. [sent-674, score-0.115]
75 The crucial point is to estimate the p × p covariance matrix Σ of the noise (covariance between tasks), in order to learn the task similarity matrix M. [sent-678, score-0.118]
76 Second, we show an oracle inequality (Theorem 26), more particularly with a simplified estimation of Σ and increased performances when the matrices of M are jointly diagonalizable (which often corresponds to cases where we have a prior knowledge of what the relations between the tasks would be). [sent-681, score-0.24]
77 We use the set Msimilar : µ p Msimilar := Msimilar (λ, µ) = (λ + pµ)Ip − 11⊤ / (λ, µ) ∈ (0, +∞)2 Using the estimator fM = AM y we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36): E fM − f 2 2 = (AM − Inp ) f 2 ⊤ 2 + tr(AM AM · (Σ ⊗ In )) . [sent-725, score-0.094]
78 Thus, with µ = λ + pµ we can diagonalize in an orthonormal basis any matrix Mλ,µ ∈ M as M = P⊤ Dλ,µ P, with D = Dλ,µ = Diag{λ, µ, . [sent-734, score-0.079]
79 Bias: We can first remark that (P⊤ ⊗ Q⊤ ) = (P ⊗ Q)⊤ is an orthogonal matrix and that P × 1 = (1, 0, . [sent-751, score-0.104]
80 Thus we can finally write tr(A⊤ AM · (Σ ⊗ In )) = tr P ⊗ Q)⊤ (D−1 ⊗ ∆) (D−1 ⊗ ∆) + npInp M −1 2 × (P ⊗ Q)(Σ ⊗ In ) = tr (D−1 ⊗ ∆) (D−1 ⊗ ∆) + npInp −1 2 × (P ⊗ Q)(Σ ⊗ In )(P ⊗ Q)⊤ n =∑ i=1 µi µi + npλ 2 Σ1,1 + µi µi + npµ 2 p ∑ Σ j, j . [sent-782, score-0.532]
81 j=2 As noted at the end of Example 12 this leads to an oracle which has all its p functions equal. [sent-783, score-0.115]
82 The computations detailed above also show that the ideal penalty introduced in Equation (7) can be written as penid (M) = 2 tr AM · (Σ ⊗ In ) 2 = np np p ∑ tr(A pd )Σ j, j j . [sent-791, score-1.265]
83 For the last condition, remark that for every j ∈ {1, . [sent-855, score-0.1]
84 1 Key Quantities and their Concentration Around their Means ++ Definition 50 We introduce, for S ∈ S p (R), Mo (S) ∈ argmin FM −Y M∈M 2 + 2 tr (AM · (S ⊗ In )) (29) Definition 51 Let S ∈ S p (R), we note S+ the symmetric matrix where the eigenvalues of S have been thresholded at 0. [sent-930, score-0.353]
85 2 , S OLNON , A RLOT AND BACH We can see that the quantities δi decouple, therefore p |δ1 (M)| = ∑ εi , A pdi εi − E [ εi , A pdi ε ] , i=1 p |δ2 (M)| = ∑ A pdi εi i=1 p 2 2 −E A pdi εi 2 2 , |δ3 (M)| = ∑ 2 A pdi εi , (A pdi − In ) fi , i=1 p |δ4 (M)| = ∑ 2 εi , (In − A pdi ) fi . [sent-1000, score-0.798]
86 Assumption (18) implies that the matrix P used above is the same for all the matrices M of M . [sent-1002, score-0.064]
87 2 Intermediate Result We first prove a general oracle inequality, under the assumption that the penalty we use (with an estimator of Σ) does not underestimate the ideal penalty (involving Σ) too much. [sent-1012, score-0.337]
88 (37) Combining Equation (29) and (37), we get: fMo (S) − f ≤ inf 2 2 + 2 tr AMo (S) · ((S − Σ)+ ⊗ In ) + ∆(Mo (S)) fM − f M∈M 2 2 + 2 tr (AM · ((S − Σ) ⊗ In )) + ∆(M) (38) . [sent-1016, score-0.58]
89 θ (40) S OLNON , A RLOT AND BACH With Equation (38), and with C1 = CA , C2 = 2CC + 4CD + 4CE and C3 = CB +CF we get (1 − 2θ) fMo (S) − f inf M∈M fM − f 2 2 2 2 + 2 tr AMo (S) · ((S − Σ)+ ⊗ In ) ≤ + 2 tr (AM · ((S − Σ) ⊗ In )) + C1 +C2 θ + C3 θ γ ln(n)|||Σ||| . [sent-1020, score-0.58]
90 Using elementary algebra it is easy to show that, for every symmetric positive definite matrices A, M and N of size p, M N implies that tr(AM) ≥ tr(AN). [sent-1029, score-0.084]
91 np 2CA + 3CC + 6CD + 6CE + ln(n) 18CB + 18CF + Using Equation (27) and defining η2 := 12β(α + δ)γ ln(n) , n we get 1 f −f np M + 1− 2 3 ln(n) −1 2 ≤ 1+ 1 ln(n) inf M∈M 1 fM − f np 2 2 + η2 tr(AM · (Σ ⊗ In )) np 729β2 γ2 tr(Σ)2 4|||Σ|||2 ln(n)2 |||Σ||| ×(α + δ)2 . [sent-1034, score-1.62]
92 np 2CA + 3CC + 6CD + 6CE + ln(n) 18CB + 18CF + (42) Now, to get a classical oracle inequality, we have to show that η2 v1 (M) = η2 tr(AM · (Σ ⊗ In )) is negligible in front of fM − f 2 . [sent-1035, score-0.508]
93 Finally we deduce an oracle inequality in expectation by noting that if n−1 fM − f Ω, using Cauchy-Schwarz inequality E 1 f −f np M 2 2 =E 1Ω f −f np M ≤ E Rn,δ + 2 2 +E 1Ωc np fM − f 4p(p + 1) + 6pC nδ 1 np 2 ≤ Rn,δ on 2 2 E fM − f 4 2 . [sent-1047, score-1.687]
94 (44) We can remark that, since |||AM ||| ≤ 1, fM − f 2 2 ≤ 2 AM ε So E fM − f 2 2 +2 4 2 (Inp − AM ) f 2 2 ≤2 ε ≤ 12 np|||Σ||| + 4 f 2 2 2 2 +8 2 f 2 2 . [sent-1048, score-0.076]
95 , together with Equation (42) and Equation (44), induces Equation (17), using that for some constant L3 > 0, 12 p(p + 1)/2 + 6pC 4 f |||Σ||| + δ np n 2 2 2808 ≤ L3 p(p +C) 1 |||Σ||| + f δ/2 np n 2 2 . [sent-1049, score-0.786]
96 Then, tr(A⊤ A · (Σ ⊗ In )) + x|||Σ||| tr(A · (Σ ⊗ In )) inf A∈Mnp (R),|||A|||≤1 ≥2 x|||Σ||| n tr(Σ) Proof First note that the bilinear form on Mnp (R), (A, B) → tr(A⊤ B · (Σ ⊗ In )) is a scalar product. [sent-1051, score-0.072]
97 n tr(Σ) Therefore inf A∈Mnp (R),|||A|||≤1 tr(A⊤ A · (Σ ⊗ In )) + x|||Σ||| tr(A · (Σ ⊗ In )) ≥ inf c>0 ≥2 x|||Σ||| c + n tr(Σ) c x|||Σ||| . [sent-1054, score-0.096]
98 For every δ ≥ 2, a constant n0 (δ), an absolute constant L1 > 0 and an event Ω exist such that P(ΩHM ) ≥ 1 − pn−δ and for every n ≥ n0 (δ), on ΩHM , for every M in M (1 − η) tr(AM · (Σ ⊗ In )) ≤ tr(AM · (ΣHM ⊗ In )) ≤ (1 + η) tr(AM · (Σ ⊗ In )) , where η := L1 (α + δ) (45) ln(n) . [sent-1058, score-0.104]
99 To get to the oracle inequality in expectation we use the same technique than above, but we note that P(Ωc ) ≤ L4 × p/nδ/2 . [sent-1078, score-0.115]
100 Data-driven calibration of linear estimators with minimal penalties, July 2011. [sent-1100, score-0.082]
wordName wordTfidf (topN-words)
[('np', 0.393), ('fm', 0.35), ('arlot', 0.3), ('tr', 0.266), ('ln', 0.21), ('olnon', 0.207), ('rlot', 0.207), ('enalties', 0.197), ('inimal', 0.197), ('bach', 0.16), ('inp', 0.155), ('fsimilar', 0.135), ('msimilar', 0.124), ('hm', 0.123), ('oracle', 0.115), ('pdi', 0.114), ('ulti', 0.111), ('npinp', 0.104), ('egression', 0.094), ('mo', 0.093), ('fe', 0.08), ('remark', 0.076), ('diag', 0.076), ('df', 0.073), ('amo', 0.072), ('penid', 0.072), ('penmin', 0.062), ('find', 0.062), ('ks', 0.062), ('penalty', 0.061), ('estimator', 0.059), ('tasks', 0.058), ('equation', 0.056), ('cf', 0.055), ('estimators', 0.053), ('sylvain', 0.052), ('cc', 0.052), ('kronecker', 0.049), ('cb', 0.049), ('inf', 0.048), ('dn', 0.048), ('jump', 0.048), ('cn', 0.048), ('fei', 0.047), ('sierra', 0.041), ('ideal', 0.041), ('pd', 0.039), ('cd', 0.038), ('ridge', 0.038), ('matrices', 0.036), ('ens', 0.035), ('mnp', 0.035), ('quadratic', 0.035), ('evgeniou', 0.035), ('var', 0.035), ('argmin', 0.035), ('task', 0.034), ('theorem', 0.033), ('event', 0.032), ('experiment', 0.032), ('concentration', 0.031), ('ei', 0.031), ('diagonalizable', 0.031), ('italie', 0.031), ('matthieu', 0.031), ('mhm', 0.031), ('rnp', 0.031), ('solnon', 0.031), ('umr', 0.031), ('ce', 0.031), ('proposition', 0.03), ('vec', 0.029), ('calibration', 0.029), ('gn', 0.029), ('matrix', 0.028), ('covariance', 0.028), ('massimiliano', 0.027), ('normale', 0.027), ('diagonalize', 0.027), ('crit', 0.027), ('lounici', 0.026), ('gk', 0.026), ('pm', 0.026), ('proof', 0.026), ('errors', 0.025), ('penalties', 0.025), ('tl', 0.025), ('fa', 0.025), ('basis', 0.024), ('symmetric', 0.024), ('every', 0.024), ('francis', 0.024), ('cedex', 0.024), ('rieure', 0.024), ('bilinear', 0.024), ('kernel', 0.024), ('multitask', 0.023), ('issn', 0.023), ('sin', 0.023), ('sup', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 73 jmlr-2012-Multi-task Regression using Minimal Penalties
Author: Matthieu Solnon, Sylvain Arlot, Francis Bach
Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory
2 0.18621112 91 jmlr-2012-Plug-in Approach to Active Learning
Author: Stanislav Minsker
Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
4 0.10917637 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
5 0.093763955 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
6 0.092638604 80 jmlr-2012-On Ranking and Generalization Bounds
7 0.092473947 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
8 0.084793501 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
9 0.076596051 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
10 0.061701484 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
11 0.061691433 111 jmlr-2012-Structured Sparsity and Generalization
12 0.061190743 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
13 0.059388459 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
14 0.055028256 59 jmlr-2012-Linear Regression With Random Projections
15 0.054649267 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
16 0.053193934 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
17 0.049993534 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
18 0.049712248 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.049167056 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
20 0.048330929 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
topicId topicWeight
[(0, -0.253), (1, 0.118), (2, -0.144), (3, 0.065), (4, -0.088), (5, -0.11), (6, 0.014), (7, 0.068), (8, -0.062), (9, -0.022), (10, -0.059), (11, -0.015), (12, -0.004), (13, 0.064), (14, -0.162), (15, -0.132), (16, -0.094), (17, 0.051), (18, 0.165), (19, -0.042), (20, -0.21), (21, 0.393), (22, -0.085), (23, 0.022), (24, -0.001), (25, 0.0), (26, 0.127), (27, 0.093), (28, 0.04), (29, 0.064), (30, 0.038), (31, -0.009), (32, -0.054), (33, 0.082), (34, -0.019), (35, -0.017), (36, 0.004), (37, 0.015), (38, -0.063), (39, 0.035), (40, 0.012), (41, -0.037), (42, 0.031), (43, -0.091), (44, -0.026), (45, -0.098), (46, -0.009), (47, 0.016), (48, 0.131), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.92966932 73 jmlr-2012-Multi-task Regression using Minimal Penalties
Author: Matthieu Solnon, Sylvain Arlot, Francis Bach
Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory
2 0.62310296 91 jmlr-2012-Plug-in Approach to Active Learning
Author: Stanislav Minsker
Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands
3 0.61944073 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
5 0.37309304 80 jmlr-2012-On Ranking and Generalization Bounds
Author: Wojciech Rejchel
Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process
6 0.34751365 82 jmlr-2012-On the Necessity of Irrelevant Variables
7 0.34283373 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
8 0.33539972 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
9 0.33010229 97 jmlr-2012-Regularization Techniques for Learning with Matrices
10 0.31950417 111 jmlr-2012-Structured Sparsity and Generalization
11 0.29817775 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
12 0.29117513 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
13 0.27855209 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
14 0.2703822 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
15 0.26255393 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
16 0.24946347 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
17 0.24051978 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
18 0.23856534 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.23271079 59 jmlr-2012-Linear Regression With Random Projections
20 0.22880472 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
topicId topicWeight
[(7, 0.023), (21, 0.07), (25, 0.294), (26, 0.052), (27, 0.01), (29, 0.031), (35, 0.011), (49, 0.031), (64, 0.013), (75, 0.056), (77, 0.012), (79, 0.012), (81, 0.01), (92, 0.128), (96, 0.158)]
simIndex simValue paperId paperTitle
same-paper 1 0.75993192 73 jmlr-2012-Multi-task Regression using Minimal Penalties
Author: Matthieu Solnon, Sylvain Arlot, Francis Bach
Abstract: In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples. Keywords: multi-task, oracle inequality, learning theory
2 0.59532166 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
3 0.59508801 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
4 0.59433162 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.58753312 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
6 0.58713388 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
7 0.5831393 80 jmlr-2012-On Ranking and Generalization Bounds
8 0.58183789 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
9 0.57890409 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
10 0.57608116 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
11 0.57555544 91 jmlr-2012-Plug-in Approach to Active Learning
12 0.57505214 103 jmlr-2012-Sampling Methods for the Nyström Method
13 0.57447475 13 jmlr-2012-Active Learning via Perfect Selective Classification
14 0.57029462 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
15 0.56885272 111 jmlr-2012-Structured Sparsity and Generalization
16 0.56787121 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
17 0.56754816 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
18 0.56561232 84 jmlr-2012-Online Submodular Minimization
19 0.56289965 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
20 0.56287849 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing