jmlr jmlr2011 jmlr2011-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of all the elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
Reference: text
sentIndex sentText sentNum sentScore
1 Up to now, optimal aggregation procedures are convex combinations of all the elements of F. [sent-6, score-0.52]
2 In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. [sent-7, score-0.52]
3 Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. [sent-8, score-0.462]
4 Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. [sent-9, score-0.947]
5 We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. [sent-11, score-0.946]
6 Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars 1. [sent-12, score-0.187]
7 This set of functions is often a set of estimators computed on a training sample, which is independent of the sample Dn (learning sample). [sent-18, score-0.06]
8 The aim of the problem of aggregation is to construct a procedure f˜ (called an aggregate) using Dn and F with a risk which is very close to the smallest risk over F. [sent-22, score-0.562]
9 Inequalities of the form (1) are called exact oracle inequalities and r(F, n) is called the residue. [sent-25, score-0.137]
10 , 2008) says that aggregates with values in F cannot satisfy an inequality like (1) with a residue smaller than ((log M)/n)^{1/2} for every F. [sent-27, score-0.143]
11 Nevertheless, it is possible to mimic the oracle (an oracle is an element in F achieving the minimal risk over F) up to the residue (log M)/n (see Juditsky et al. [sent-28, score-0.349]
12 , 2008 and Lecué and Mendelson, 2009, among others) using an aggregate f̃ that combines all the elements of F. [sent-29, score-0.212]
13 In this case, we say that f˜ is an optimal aggregation procedure. [sent-30, score-0.462]
14 Given the set of functions F, a natural way to predict Y is to compute the empirical risk minimization procedure (ERM), the one that minimizes the empirical risk R_n(f) := (1/n) ∑_{i=1}^n (Y_i − f(X_i))² over F. [sent-32, score-0.1]
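To make the ERM step above concrete, here is a minimal NumPy sketch (ours, not the authors' code) of the empirical risk R_n(f) and its minimizer over a finite dictionary; it assumes the dictionary is supplied as a matrix whose column j holds the predictions f_j(X_1), ..., f_j(X_n).

```python
import numpy as np

def empirical_risks(preds, y):
    """R_n(f_j) = (1/n) * sum_i (Y_i - f_j(X_i))^2 for every column j of preds.

    preds : array of shape (n, M), column j holds f_j(X_1), ..., f_j(X_n)
    y     : array of shape (n,), the responses Y_1, ..., Y_n
    """
    return np.mean((y[:, None] - preds) ** 2, axis=0)

def erm_index(preds, y):
    """Index of the empirical risk minimizer over the dictionary."""
    return int(np.argmin(empirical_risks(preds, y)))
```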
15 This very basic principle is at the core of aggregation procedures (for regression with squared loss). [sent-33, score-0.52]
16 An aggregate is typically represented as a convex combination of the elements of F. [sent-34, score-0.212]
17 Up to now, most of the optimal aggregation procedures are based on exponential weights: aggregation with cumulated exponential weights (ACEW), see Catoni (2001), Yang (2004), Yang (2000), Juditsky et al. [sent-36, score-1.011]
18 (2005), Audibert (2009) and aggregation with exponential weights (AEW), see Leung and Barron (2006) and Dalalyan and Tsybakov (2007), among others. [sent-38, score-0.491]
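For later comparison, a rough sketch of aggregation with exponential weights (AEW) in the same setting; the temperature parameter is our placeholder, since its calibration differs across the references above.

```python
import numpy as np

def aew_weights(preds, y, temperature=1.0):
    """Exponential weights w_j proportional to exp(-n * R_n(f_j) / T)."""
    n = len(y)
    risks = np.mean((y[:, None] - preds) ** 2, axis=0)   # R_n(f_j) for each dictionary element
    logits = -n * risks / temperature
    logits -= logits.max()                               # for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def aew_predict(preds_new, weights):
    """AEW aggregate on new points: sum_j w_j * f_j(x)."""
    return preds_new @ weights
```

Note that, as discussed below, such weights are typically non-zero for every element of the dictionary, which is exactly what the hyper-sparse aggregate avoids.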
19 This can be a problem when one wants to use aggregation to construct adaptive procedures. [sent-47, score-0.481]
20 Indeed, one could imagine large dictionaries containing many different types of estimators (kernel estimators, projection estimators, etc. [sent-48, score-0.096]
21 Some of the estimators are likely to be more adapted than the others, depending on the kind of model that fits the data well, and there may be only a few of them in a large dictionary. [sent-51, score-0.06]
22 An aggregate that combines only the most adapted estimators from the dictionary and that removes the irrelevant ones is suitable in this case. [sent-52, score-0.482]
23 An improvement going in this direction has been made using a preselection step in Lecué and Mendelson (2009). [sent-54, score-0.158]
24 This preselection step allows one to remove all the estimators in F which perform badly on a learning subsample. [sent-55, score-0.193]
25 In this paper, we want to go a step further: we look for an aggregation algorithm that shares the same property of optimality, but with as few non-zero coefficients θ j as possible, hence the name hyper-sparse aggregate. [sent-56, score-0.462]
26 This leads to the following question: Question 1 What is the minimal number of non-zero coefficients θ_j such that an aggregation procedure f̃ = ∑_{j=1}^M θ_j f_j is optimal? [sent-57, score-0.462]
27 Indeed, if every coefficient is zero, except for one, the aggregate coincides with an element of F, and as we mentioned before, such a procedure can only achieve the rate ((log M)/n)^{1/2} (unless extra properties are satisfied by F and ν). [sent-59, score-0.212]
28 We prove in Theorem 2 below that these procedures are optimal, since they achieve the rate (log M)/n. [sent-61, score-0.058]
29 • (Bounded setup) There is a constant b > 0 such that: max( ‖Y‖_∞, sup_{f∈F} ‖f(X)‖_{L_∞} ) ≤ b. (2) [sent-71, score-0.101]
30 • (Sub-exponential setup) There is a constant b > 0 such that: max( ‖ε‖_{ψ₁}, sup_{f∈F} ‖f(X) − f₀(X)‖_{L_∞} ) ≤ b. (3) [sent-72, score-0.101]
31 The results given below differ a bit depending on the considered assumption (there is an extra log n term in the sub-exponential case). [sent-74, score-0.053]
32 Preselection) Use D_{n,1} to define a random subset of F: F₁ = { f ∈ F : R_{n,1}(f) ≤ R_{n,1}(f_{n,1}) + c max( φ ‖f_{n,1} − f‖_{n,1}, φ² ) }, (4) where ‖f‖²_{n,1} = n^{−1} ∑_{i=1}^n f(X_i)², R_{n,1}(f) = n^{−1} ∑_{i=1}^n (f(X_i) − Y_i)², and f_{n,1} ∈ argmin_{f∈F} R_{n,1}(f). [sent-84, score-0.059]
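A sketch of the preselection step (4) on the first subsample; the constant c and the quantity φ are left as inputs here, since their exact values are specified in the paper and not reproduced in this summary.

```python
import numpy as np

def preselect(preds1, y1, phi, c=1.0):
    """Indices of the preselected set F_1 defined in (4).

    preds1 : (n, M) predictions of the dictionary elements on D_{n,1}
    y1     : (n,) responses of D_{n,1}
    phi, c : threshold quantities (their precise values come from the paper)
    """
    risks = np.mean((y1[:, None] - preds1) ** 2, axis=0)        # R_{n,1}(f_j)
    j_erm = int(np.argmin(risks))                                # index of f_{n,1}
    # empirical distance ||f_{n,1} - f_j||_{n,1}
    dist = np.sqrt(np.mean((preds1 - preds1[:, [j_erm]]) ** 2, axis=0))
    threshold = risks[j_erm] + c * np.maximum(phi * dist, phi ** 2)
    return np.flatnonzero(risks <= threshold)
```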
33 In Figure 1 we summarize the aggregation steps in the three cases. [sent-88, score-0.462]
34 In Figure 2 we give a simulated illustration of the preselection step, and we show the values of the AEW weights for comparison. [sent-89, score-0.162]
35 As mentioned above, Step 3 of the algorithm returns, when F is given by (6) or (7), an aggregate which is a convex combination of only two functions in F, among the ones remaining after the preselection step. [sent-90, score-0.345]
36 The preselection step was introduced in Lecué and Mendelson (2009), with the use of (5) only for the aggregation step. [sent-91, score-0.595]
37 Theorem 2 Let x > 0 be a confidence level, F be a dictionary with cardinality M and f̃ be one of the aggregation procedures given in Definition 1. [sent-94, score-0.672]
38 If (3) holds, we have, with ν^{2n}-probability at least 1 − 4e^{−x}: R(f̃) ≤ min_{f∈F} R(f) + c_{σ_ε,b} (1 + x) (log M)(log n)/n. [sent-97, score-0.106]
39 Now, we give details for the computation of the star-shaped aggregate, namely the aggregate f̃ given by Definition 1 when F is given by (7). [sent-102, score-0.212]
40 Only the elements of F with an empirical risk smaller than the threshold are kept from the dictionary for the construction of the aggregates of Definition 1. [sent-108, score-0.402]
41 The first and third examples correspond to a case where an aggregate with a preselection step improves upon AEW, while in the second example, both procedures behave similarly. [sent-109, score-0.403]
42 Since aggregation procedures are known (see references above) to outperform selectors in terms of prediction error, it is tempting to use aggregation for the choice of the tuning parameters. [sent-117, score-1.055]
43 Unfortunately, as we mentioned before, most aggregation procedures give non-zero weights to many irrelevant elements of the dictionary: this is a problem for variable selection. [sent-118, score-0.549]
44 Indeed, if we use, for instance, the AEW on a dictionary consisting of the full path of Lasso estimators (provided by the Lars algorithm, see Efron et al. [sent-119, score-0.27]
45 , 2004), then the resulting aggregate is likely to select all the variables since the Lasso with a small regularization parameter is very close (and equal if it is zero) to ordinary least-squares (which does not perform any variable selection). [sent-120, score-0.212]
46 So, in this context, the hyper-sparse aggregate of Section 2 is of particular interest. [sent-121, score-0.212]
47 In this section, we compare the prediction error and the accuracy of variable selection of our star-shaped aggregation algorithm to Mallow’s Cp heuristic, leave-one-out cross-validation and 10-fold cross-validation. [sent-122, score-0.516]
48 2 we consider a dictionary consisting of the … Algorithm 1: Computation of the star-shaped aggregate. [sent-124, score-0.21]
49 Input: dictionary F, data (X_i, Y_i)_{i=1}^{2n}, and a confidence level x > 0. Output: star-shaped aggregate f̃. Split D_{2n} into two samples D_{n,1} and D_{n,2}. foreach j ∈ {1, …, M} do [sent-125, score-0.466]
50 Compute R_{n,1}(f_j) and R_{n,2}(f_j), and use this loop to find f_{n,1} ∈ argmin_{f∈F} R_{n,1}(f); end. foreach j ∈ {1, … [sent-128, score-0.073]
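A minimal sketch of the aggregation step in the star-shaped case (7): on the second subsample, the empirical risk is minimized in closed form over each segment {λ f_{n,1} + (1 − λ) f_j : λ ∈ [0, 1]} with f_j in the preselected set. Variable names are ours and this is not the authors' implementation.

```python
import numpy as np

def star_shaped_aggregate(preds2, y2, j_erm, candidates):
    """Best two-function convex combination lam * f_{j_erm} + (1 - lam) * f_j on D_{n,2}.

    preds2     : (n, M) predictions of the dictionary on D_{n,2}
    y2         : (n,) responses of D_{n,2}
    j_erm      : index of f_{n,1}, the ERM over the first subsample
    candidates : indices of the preselected set F_1
    Returns (j, lam): the aggregate is lam * f_{j_erm} + (1 - lam) * f_j.
    """
    g = preds2[:, j_erm]
    best_j, best_lam, best_risk = j_erm, 1.0, np.mean((y2 - g) ** 2)
    for j in candidates:
        h = preds2[:, j]
        d = g - h                                  # direction along the segment
        denom = np.dot(d, d)
        # closed-form minimizer of sum_i (y_i - h_i - lam * d_i)^2, clipped to [0, 1]
        lam = 1.0 if denom == 0 else float(np.clip(np.dot(y2 - h, d) / denom, 0.0, 1.0))
        risk = np.mean((y2 - (lam * g + (1 - lam) * h)) ** 2)
        if risk < best_risk:
            best_j, best_lam, best_risk = j, lam, risk
    return best_j, best_lam
```

The output has at most two non-zero coefficients, which is the hyper-sparsity property discussed in Section 2.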
51 Indeed, let us recall that here, the focus is on the comparison of selection and aggregation procedures for the choice of tuning parameters, and not on the comparison of the procedures inside the dictionary themselves. [sent-135, score-0.828]
52 The noise εi is N(0, σ2 ) with σ equal to 1 or 3. [sent-147, score-0.054]
53 The noise εi is N(0, σ2 ) with σ equal to 15 or 7. [sent-158, score-0.054]
54 2 Procedures We consider a dictionary consisting of the entire sequence of Lasso estimators and a dictionary with several sequences of elastic-net estimators, corresponding to ridge parameters in the set of values {0, 0.1, 1, 5, 10, 20, 50, 100} (these dictionaries are computed with the lars and enet routines from R). [sent-180, score-0.501] [sent-182, score-0.093]
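The dictionaries themselves are built in the paper with the lars and enet routines in R; purely as an illustration, here is a rough Python analogue for the Lasso dictionary using scikit-learn's lars_path (the elastic-net variant, the intercept, and standardization are omitted).

```python
import numpy as np
from sklearn.linear_model import lars_path

def lasso_path_dictionary(X_train, y_train, X_eval):
    """Dictionary of Lasso estimators along the full regularization path.

    Returns the path knots, the coefficient vectors, and the predictions of each
    path estimator on X_eval (one column per regularization parameter), to be
    used as the dictionary F. Assumes centered data (no intercept handling).
    """
    # coefs has shape (n_features, n_alphas): one coefficient vector per knot of the path
    alphas, active, coefs = lars_path(X_train, y_train, method="lasso")
    preds = X_eval @ coefs              # shape (n_eval, n_alphas)
    return alphas, coefs, preds
```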
56 For each dictionary, we compute the prediction errors |X(β̂ − β)|_2^2 (where X is the matrix with rows X_1^⊤, …, X_n^⊤). [sent-183, score-0.065]
57 For both aggregates we use jackknife: we compute the mean of 100 aggregates obtained with several splits chosen at random. [sent-188, score-0.236]
58 As a matter of fact, we observed in our numerical studies that Star-shaped aggregation with and without the preselection step (see Definition 1) provides very similar estimators. [sent-194, score-0.595]
59 So, in order to reduce the computational burden, the numerical results of the Star-shaped aggregate reproduced here are the ones obtained without the preselection step. [sent-195, score-0.345]
60 We need to explain how variable selection is performed based on J star-shaped aggregates coming from J random splits (here we take J = 100). [sent-196, score-0.14]
61 This procedure is close in spirit to the stability selection procedure described in Meinshausen and Bühlmann (2010), since each aggregate is related to a subsample. [sent-205, score-0.234]
62 Finally, the selected covariates are the ones in S = {k ∈ {1, … [sent-206, score-0.064]
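The exact selection rule (the set S built from the weights of the J aggregates) is truncated in the extraction above; the sketch below only keeps its spirit, in the style of stability selection: keep a covariate if it is selected in a large enough fraction of the J random splits. The frequency threshold pi is our assumption, not a value from the paper.

```python
import numpy as np

def select_covariates(selected_sets, p, pi=0.5):
    """Keep covariate k if it appears in at least a fraction pi of the J splits.

    selected_sets : list of J iterables of covariate indices (one per random split)
    p             : total number of covariates
    pi            : selection-frequency threshold (assumed, see lead-in)
    """
    freq = np.zeros(p)
    for s in selected_sets:
        freq[list(s)] += 1.0
    freq /= len(selected_sets)
    return np.flatnonzero(freq >= pi)
```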
63 For each of the Models 1-6, the boxplots of the 200 prediction errors are given in Figures 3 and 4. [sent-212, score-0.065]
64 Note that in a high-dimensional setting (p > n), we do not reproduce the Cp prediction errors, since in this case the lars package does not give them correctly. [sent-213, score-0.115]
65 The results concerning variable selection for the Lars and the Elastic-Net dictionaries are given in Tables 1 and 2. [sent-215, score-0.058]
66 In these tables we report the number of variables selected by each procedure, and the number of noise variables (selected variables which are not active in the true model). [sent-216, score-0.142]
67 3 Conclusion In most cases, the Star-Shaped aggregate improves upon the AEW and the considered selection procedures both in terms of prediction error and variable selection. [sent-218, score-0.324]
68 The proposed variable selection algorithm based on star-shaped aggregation and stability selection tends to select smaller models than the Cp and cross-validation methods (see Table 1, Models 1-4), leading to fewer noise variables. [sent-219, score-0.56]
69 In particular, in high-dimensional cases (p > n), it is much more stable regarding the sample size and noise level, and provides better results most of the time (see Table 1, Models 5-6). [sent-220, score-0.054]
70 We can say that, roughly, the Cp and the cross-validations are better than the Star-Shaped aggregate only for non-sparse vectors (since these selection procedures tend to select larger models), in particular when n is small and σ is large. [sent-222, score-0.292]
71 We can conclude by saying that, in the worst cases, the Star-shaped algorithm has prediction and selection performance comparable to cross-validation and the Cp heuristic, but, on the other hand, it can improve on them substantially (in particular for sparse vectors). [sent-223, score-0.054]
72 One can think of the Star-Shaped aggregation algorithm as an alternative to cross-validation and Cp . [sent-224, score-0.462]
73 …, X_n are independent random variables and F is a countable set of functions such that E f(X) = 0 for all f ∈ F and sup_{f∈F} ‖f(X)‖_{ψ₁} < +∞. [sent-397, score-0.071]
74 Define Z := sup_{f∈F} (1/n) ∑_{i=1}^n f(X_i), σ² := sup_{f∈F} E f(X)², and b := max_{i=1,…,n} sup_{f∈F} |f(X_i)|. [sent-398, score-0.243]
75 Then, for any η ∈ (0, 1) and δ > 0, there is c = c_{η,δ} such that for any x > 0: P( Z ≥ (1 + η) E Z + σ √(2(1 + δ) x / n) + c b x/n ) ≤ 4 e^{−x} and P( Z ≤ (1 − η) E Z − σ √(2(1 + δ) x / n) − c b x/n ) ≤ 4 e^{−x}. [sent-402, score-0.218]
76 For any function f define (P − P_n)(f) := n^{−1} ∑_{i=1}^n f(Z_i) − E f(Z), and for a class of functions F define ‖P − P_n‖_F := sup_{f∈F} |(P − P_n)(f)|. [sent-405, score-0.071]
77 Lemma 7 Define σ²(F) = sup_{f∈F} E[f(X)²], d(F) := diam(F, L₂(µ)), C = conv(F), and L_C(C) = { (Y − f(X))² − (Y − f_C(X))² : f ∈ C }, where f_C ∈ argmin_{g∈C} R(g). [sent-408, score-0.071]
78 If (2) holds, we have E sup_{f∈F} (1/n) ∑_{i=1}^n f(X_i)² ≤ c max( σ²(F), b² (log M)/n ), and E‖P_n − P‖_{L_C(C)} ≤ c b √((log M)/n) max( b √((log M)/n), d(F) ). [sent-409, score-0.399]
79 n If (3) holds, we have b2 log M 1 n 2 ∑ f (Xi ) ≤ c max σ2 (F), n , and f ∈F n i=1 E sup log M log n max b n E Pn − P LC (C ) ≤ cb log M log n , d(F) . [sent-410, score-0.505]
80 Define r² := sup_{f∈F} (1/n) ∑_{i=1}^n f(X_i)², and note that E_X(r²) ≤ E_X ‖P − P_n‖_{F²} + σ(F)², where F² := { f² : f ∈ F }. [sent-412, score-0.071]
81 Using the Giné-Zinn symmetrization argument, see Giné and Zinn (1984), we have E_X ‖P − P_n‖_{F²} ≤ (c/n) E_X E_g sup_{f∈F} | ∑_{i=1}^n g_i f²(X_i) |, where (g_i) are i.i.d. [sent-413, score-0.108]
82 Using (3) we have d_{n,∞}(f, f′) ≤ 2b for any f, f′ ∈ F, so using Dudley’s entropy integral, we have E_g ‖P − P_n‖_{F²} ≤ (c/√n) ∫_0^{2b} √( log N(F, d_{n,∞}, t) ) dt ≤ c r √((log M)/n). [sent-421, score-0.106]
83 n So, we get EX P − Pn F2 log M EX [r] ≤ cb n ≤ cb log M n EX [r2 ], which entails that b2 log M + σ(F)2 . [sent-422, score-0.377]
84 Using the same argument as before, we have E ‖P − P_n‖_{L_C(C)} ≤ (c/n) E_{(X,Y)} E_g sup_{f∈C} | ∑_{i=1}^n g_i L_f(X_i, Y_i) |. [sent-425, score-0.138]
85 Therefore, by Slepian’s Lemma, we have for every (X_i, Y_i)_{i=1}^n: E_g sup_{f∈C} Z_f ≤ max_{i=1,…,n} sup_{f,f′∈C} |2Y_i − f(X_i) − f′(X_i)| × E_g sup_{f∈C} Z′_f. [sent-434, score-0.243]
86 …, M and ∑_{j=1}^M α_j = 1, with Z′_f = ∑_{j=1}^M α_j Z′_{f_j}, we have E_g sup_{f∈C} Z′_f ≤ E_g sup_{f∈F} Z′_f. [sent-440, score-0.071]
87 f ∈F f ∈C Moreover, we have, using Dudley’s entropy integral argument, 1 c Eg sup Z ′f ≤ √ n n f ∈F ∆n (F ′ ) 0 N(F, · 1829 n ,t)dt ≤c log M ′ r, n ´ G A¨FFAS AND L ECU E I where F ′ := { f − f C : f ∈ F} and ∆n (F ′ ) := diam(F ′ , · n) and 1 n ∑ f (Xi )2 . [sent-441, score-0.124]
88 Hence, we proved that E ‖P − P_n‖_{L_C(C)} ≤ c √((log M)/n) ( E max_{i=1,…,n} sup_{f,f′∈C} |2Y_i − f(X_i) − f′(X_i)|² )^{1/2} … [sent-442, score-0.065]
89 Using Pisier’s inequality for ψ₁ random variables and the fact that E(U²) ≤ 4‖U‖²_{ψ₁} for any ψ₁ random variable U, together with (3), we obtain that E max_{i=1,…,n} sup_{f,f′∈C} |2Y_i − f(X_i) − f′(X_i)|² ≤ c b² log(n). (8) [sent-446, score-0.101]
90 So, we finally obtain E ‖P − P_n‖_{L_C(C)} ≤ c √((log n · log M)/n) √( E(r′²) ), and the conclusion follows from the first part of the Lemma, since σ(F′) ≤ d(F). [sent-450, score-0.106]
91 If (3) holds, we have, with probability larger than 1 − 4e^{−x}, that for every f ∈ C: | (1/n) ∑_{i=1}^n L_f(X_i, Y_i) − E L_f(X,Y) | ≤ c(σ_ε + b) √((log M + x) log n / n) max( b √((log M + x) log n / n), d(F) ). [sent-454, score-0.136]
92 where σ(C)² = sup_{f∈C} E[L_f(X,Y)²], and b_n(C) = max_{i=1,…,n} sup_{f∈C} |L_f(X_i, Y_i) − E[L_f(X,Y)]|. [sent-456, score-0.198]
93 … ‖ε‖_{ψ₁}, we have b_n(C) ≤ 2 log(n + 1) sup_{f∈C} ‖L_f(X,Y)‖_{ψ₁}. [sent-461, score-0.097]
94 Putting all this together, and using Lemma 7, we arrive at Z ≤ c(σ_ε + b) √((log M + x) log n / n) max( b √((log M + x) log n / n), d(F) ), with probability larger than 1 − 4e^{−x} for any x > 0. [sent-463, score-0.136]
95 If (3) holds, we have with probability larger than 1 − 4e^{−x}, that for every f ∈ F: | (1/n) ∑_{i=1}^n L_f(X_i, Y_i) − E L_f(X,Y) | ≤ c(σ_ε + b) √((log M + x) log n / n) max( b √((log M + x) log n / n), ‖f − f_F‖_n ). [sent-466, score-0.136]
96 Also, with probability at least 1 − 4e^{−x}, we have for every f, g ∈ F: | ‖f − g‖²_n − ‖f − g‖² | ≤ c b √((log M + x) log n / n) max( b √((log M + x) log n / n), ‖f − g‖_n ). [sent-467, score-0.245]
97 If (3) holds, we have with probability at least 1 − 4 exp(−x) that f_F ∈ F₁, and any function f ∈ F₁ satisfies R(f) ≤ R(f_F) + c(σ_ε + b) √((log M + x) log n / n) max( b √((log M + x) log n / n), d(F₁) ). [sent-471, score-0.136]
98 n If (2) holds, we have with probability at least 1 − 2 exp(−x) that f F ∈ F1 , and any function f ∈ F1 satisfies R( f ) ≤ R( f F ) + cb log M + x max b n log M + x , d(F1 ) . [sent-472, score-0.245]
99 Recursive aggregation of estimators by the mirror descent method with averaging. [sent-517, score-0.541]
100 On the optimality of the aggregate with exponential weights for small temperatures. [sent-545, score-0.029]
wordName wordTfidf (topN-words)
[('aggregation', 0.462), ('aew', 0.427), ('loo', 0.236), ('star', 0.22), ('cp', 0.214), ('aggregate', 0.212), ('dictionary', 0.21), ('ffas', 0.177), ('lecu', 0.162), ('ecu', 0.147), ('yper', 0.147), ('oracle', 0.137), ('preselection', 0.133), ('aggregates', 0.118), ('ptimal', 0.113), ('cb', 0.109), ('lc', 0.1), ('eg', 0.094), ('pn', 0.092), ('parse', 0.086), ('dn', 0.086), ('juditsky', 0.075), ('erm', 0.072), ('sup', 0.071), ('conv', 0.068), ('issn', 0.063), ('lasso', 0.063), ('ff', 0.063), ('lf', 0.062), ('estimators', 0.06), ('acew', 0.059), ('dalalyan', 0.059), ('procedures', 0.058), ('xi', 0.058), ('lars', 0.057), ('noise', 0.054), ('log', 0.053), ('diam', 0.052), ('tsybakov', 0.051), ('ex', 0.051), ('risk', 0.05), ('mendelson', 0.048), ('guillaume', 0.045), ('ferm', 0.044), ('foreach', 0.044), ('seg', 0.044), ('talagrand', 0.044), ('el', 0.04), ('mallow', 0.038), ('gi', 0.037), ('dictionaries', 0.036), ('doi', 0.034), ('gin', 0.034), ('errors', 0.033), ('covariates', 0.033), ('prediction', 0.032), ('selected', 0.031), ('max', 0.03), ('weights', 0.029), ('anatoli', 0.029), ('appliqu', 0.029), ('argming', 0.029), ('cances', 0.029), ('fother', 0.029), ('mendele', 0.029), ('yuhong', 0.029), ('argmin', 0.029), ('alexandre', 0.027), ('zi', 0.027), ('bn', 0.026), ('reproduce', 0.026), ('meinshausen', 0.026), ('leung', 0.025), ('temperatures', 0.025), ('son', 0.025), ('residue', 0.025), ('threshold', 0.024), ('split', 0.023), ('shahar', 0.023), ('selectors', 0.023), ('temperature', 0.023), ('ez', 0.022), ('url', 0.022), ('selection', 0.022), ('truth', 0.022), ('ridge', 0.021), ('dudley', 0.021), ('segments', 0.021), ('efron', 0.021), ('zou', 0.021), ('lemma', 0.02), ('phane', 0.019), ('mirror', 0.019), ('laboratoire', 0.019), ('excess', 0.019), ('wants', 0.019), ('anr', 0.019), ('unbounded', 0.018), ('tuning', 0.018), ('hui', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of all the elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
2 0.19650573 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
3 0.14458111 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.11665761 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
5 0.067804962 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
6 0.067534208 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
7 0.066242509 97 jmlr-2011-Union Support Recovery in Multi-task Learning
8 0.062357657 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
9 0.060105175 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
10 0.051543977 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
11 0.050112717 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.04095193 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
13 0.038450576 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
14 0.035519138 35 jmlr-2011-Forest Density Estimation
15 0.03428679 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
16 0.033961389 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
17 0.030596623 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
18 0.029496506 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
19 0.027342604 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
20 0.02567298 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
topicId topicWeight
[(0, 0.179), (1, -0.019), (2, 0.132), (3, 0.253), (4, 0.008), (5, 0.055), (6, -0.087), (7, -0.069), (8, -0.093), (9, -0.041), (10, -0.098), (11, 0.076), (12, -0.181), (13, 0.202), (14, -0.117), (15, 0.272), (16, 0.027), (17, 0.308), (18, -0.205), (19, 0.141), (20, 0.069), (21, 0.006), (22, -0.021), (23, -0.059), (24, 0.039), (25, -0.092), (26, -0.079), (27, -0.19), (28, 0.026), (29, -0.09), (30, 0.069), (31, -0.064), (32, 0.045), (33, -0.236), (34, -0.078), (35, 0.087), (36, 0.057), (37, 0.068), (38, 0.059), (39, 0.011), (40, 0.087), (41, -0.004), (42, -0.066), (43, 0.023), (44, -0.061), (45, -0.04), (46, 0.07), (47, -0.037), (48, 0.034), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.94087893 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of all the elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
2 0.66516477 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
3 0.44003224 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.26302367 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
5 0.24151352 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
6 0.22574356 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
7 0.22466259 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.21730433 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
9 0.20332491 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
10 0.19559456 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
11 0.1739879 97 jmlr-2011-Union Support Recovery in Multi-task Learning
12 0.15888052 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
13 0.15061718 35 jmlr-2011-Forest Density Estimation
14 0.13976128 16 jmlr-2011-Clustering Algorithms for Chains
15 0.13571359 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
16 0.12985802 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
17 0.12536663 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
18 0.12297376 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
19 0.11908793 22 jmlr-2011-Differentially Private Empirical Risk Minimization
20 0.11758464 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
topicId topicWeight
[(4, 0.057), (6, 0.01), (9, 0.064), (10, 0.023), (24, 0.035), (31, 0.043), (32, 0.03), (41, 0.029), (60, 0.025), (64, 0.013), (73, 0.03), (78, 0.108), (94, 0.417)]
simIndex simValue paperId paperTitle
same-paper 1 0.68723166 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of all the elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
2 0.63390112 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
Author: Piotr Zwiernik
Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold
3 0.32789421 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.32655108 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
5 0.31627345 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
6 0.31314892 59 jmlr-2011-Learning with Structured Sparsity
7 0.31277633 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
8 0.31212577 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
9 0.30880547 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
10 0.30833462 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
11 0.30797499 22 jmlr-2011-Differentially Private Empirical Risk Minimization
12 0.30771893 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
13 0.3065061 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
14 0.30643708 105 jmlr-2011-lp-Norm Multiple Kernel Learning
15 0.30630451 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
16 0.30590358 104 jmlr-2011-X-Armed Bandits
17 0.30564675 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
18 0.30336189 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
19 0.30017322 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
20 0.29980111 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms