jmlr jmlr2009 jmlr2009-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
Reference: text
sentIndex sentText sentNum sentScore
1 Statistics Department, Rutgers University, 110 Frelinghuysen Road, NJ 08854. Editor: Bin Yu. Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. [sent-3, score-0.532]
2 We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. [sent-4, score-0.735]
3 The condition is identical to a corresponding condition for Lasso. [sent-5, score-0.096]
4 Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. [sent-6, score-0.517]
5 In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. [sent-7, score-0.111]
6 Introduction We are interested in the statistical feature selection problem for least squares regression. [sent-9, score-0.211]
7 The observation y = [y1 , . . . , yn ] ∈ Rn is generated from a sparse linear combination of the basis vectors {x_j} plus a zero-mean stochastic noise vector z ∈ Rn: y = X β̄ + z = ∑_{j=1}^d β̄_j x_j + z, (1) where most coefficients β̄_j equal zero. [sent-20, score-0.117]
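For concreteness, the following is a minimal NumPy sketch of the data model (1); the dimensions, sparsity level, and noise scale are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of the sparse linear model (1): y = X beta_bar + z.
# All problem sizes below are illustrative placeholders.
rng = np.random.default_rng(0)
n, d, s, sigma = 200, 500, 5, 0.5

X = rng.standard_normal((n, d))             # columns are the basis vectors x_1, ..., x_d
beta_bar = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
beta_bar[support] = rng.uniform(1.0, 2.0, size=s) * rng.choice([-1.0, 1.0], size=s)

z = sigma * rng.standard_normal(n)          # zero-mean stochastic noise
y = X @ beta_bar + z                        # observed response, model (1)
F_bar = set(support.tolist())               # true feature set supp(beta_bar)
```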
8 The goal of feature selection is to identify the set of nonzeros F̄ = { j : β̄_j ≠ 0}, where β̄ = [β̄_1 , . . . , β̄_d ]. [sent-21, score-0.089]
9 The purpose of this paper is to study the performance of greedy least squares regression for feature selection. [sent-25, score-0.504]
10 That is, β̂_X(F̄, y) is the least squares solution with coefficients restricted to F̄. [sent-35, score-0.146]
11 The following quantity, which appeared in Tropp (2004), is important in our analysis (also see Wainwright, 2006): μ_X(F̄) = max_{j∉F̄} ‖(X_F̄^T X_F̄)^{-1} X_F̄^T x_j‖_1. [sent-41, score-0.081]
12 The quantity ρ_X(F) is the smallest eigenvalue of the restricted design matrix (1/n) X_F^T X_F, which has also appeared in previous work such as Wainwright (2006) and Zhao and Yu (2006). [sent-46, score-0.092]
13 The requirement that ρX (F) is bounded away from zero for small |F| is often referred to as the sparse eigenvalue condition (or the restricted isometry condition). [sent-47, score-0.146]
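A sketch of how these two quantities could be computed for a given feature set, directly from the definitions above; the function names and the reuse of X and F_bar from the earlier sketch are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mu_X(X, F):
    """Quantity from Tropp (2004): max over j not in F of
    ||(X_F^T X_F)^{-1} X_F^T x_j||_1."""
    F = sorted(F)
    XF = X[:, F]
    G_inv = np.linalg.inv(XF.T @ XF)
    outside = [j for j in range(X.shape[1]) if j not in set(F)]
    return max(np.abs(G_inv @ (XF.T @ X[:, j])).sum() for j in outside)

def rho_X(X, F):
    """Sparse eigenvalue: smallest eigenvalue of (1/n) X_F^T X_F."""
    XF = X[:, sorted(F)]
    return np.linalg.eigvalsh(XF.T @ XF / X.shape[0]).min()

# Example (X, F_bar from the sketch above):
# print(mu_X(X, F_bar), rho_X(X, F_bar))
```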
14 One of the most frequently used methods for feature selection is Lasso, which solves the following L1 regularization problem: β̂ = arg min_β (1/n) ‖∑_{j=1}^d β_j x_j − y‖_2^2 + λ ‖β‖_1, (2) where λ > 0 is an appropriately chosen regularization parameter. [sent-50, score-0.085]
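As a point of comparison for the sketches here, problem (2) can be solved with an off-the-shelf solver. The snippet below uses scikit-learn, whose Lasso objective is (1/(2n))‖y − Xw‖_2^2 + α‖w‖_1, so α plays the role of λ/2 under the scaling of (2); this mapping and the example value of λ are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(X, y, lam):
    # scikit-learn's alpha corresponds to lambda/2 under the scaling of (2).
    model = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return set(np.flatnonzero(model.coef_).tolist())

# Example (X, y from the earlier sketch; lam is an arbitrary illustrative value):
# print(lasso_support(X, y, lam=0.1))
```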
15 The effectiveness of feature selection using Lasso was established in Zhao and Yu (2006) (also see Meinshausen and Buhlmann, 2006) under irrepresentable conditions that depend on the signs of the true target sgn(β̄). [sent-51, score-0.488]
16 Results established in this paper have cruder forms that depend only on the design matrix but not the sparse target β̄. [sent-52, score-0.135]
17 In addition to Lasso, greedy algorithms have also been widely used for feature selection. [sent-54, score-0.316]
18 Greedy algorithms for least squares regression are called matching pursuit in the signal processing community (Mallat and Zhang, 1993). [sent-55, score-0.249]
19 The algorithm is often called forward greedy selection in the machine learning literature. [sent-57, score-0.351]
20 This paper investigates the behavior of the greedy least squares algorithm in Figure 1 for feature selection under the stochastic noise model (1). [sent-58, score-0.554]
21 It was shown by Tropp (2004) that μ_X(F̄) < 1 is sufficient for the greedy algorithm to identify the correct feature set supp(β̄) when the noise vector z = 0. [sent-60, score-0.385]
22 In particular, we will establish conditions on min_{j∈supp(β̄)} |β̄_j| and the stopping criterion ε in Figure 1, so that the algorithm finds the correct feature set supp(β̄). [sent-62, score-0.161]
23 The selection of the stopping criterion ε in the greedy algorithm is equivalent to selecting an appropriate regularization parameter. (Figure 1, greedy least squares: Input X = [x1 , . . . , xd ], y, and stopping criterion ε.) [sent-63, score-0.361]
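Only fragments of Figure 1 survive in this extraction, so the following is a sketch of a standard forward greedy (orthogonal-matching-pursuit-style) least squares loop consistent with the description: pick the feature most correlated with the current residual, refit by least squares on the selected set, and stop once the largest residual correlation falls to ε or below. The paper's Figure 1 may differ in normalization or in the exact form of the stopping test.

```python
import numpy as np

def forward_greedy_ls(X, y, eps, max_steps=None):
    """Sketch of forward greedy least squares selection (Figure 1 style)."""
    n, d = X.shape
    if max_steps is None:
        max_steps = min(n, d)
    F, beta = [], np.zeros(d)
    for _ in range(max_steps):
        corr = X.T @ (X @ beta - y)            # (X beta^{(k-1)} - y)^T x_j for all j
        j = int(np.argmax(np.abs(corr)))
        if np.abs(corr[j]) <= eps:             # stopping criterion on the correlation
            break
        if j not in F:
            F.append(j)
        coef_F, *_ = np.linalg.lstsq(X[:, F], y, rcond=None)
        beta = np.zeros(d)
        beta[F] = coef_F                       # least squares refit restricted to F
    return F, beta

# Example (X, y from the earlier sketch; eps as suggested by Theorem 1 below):
# F_hat, beta_hat = forward_greedy_ls(X, y, eps=5.0)
```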
24 In fact, our result shows that the condition on min_{j∈supp(β̄)} |β̄_j| required for the greedy algorithm is weaker than the corresponding condition for Lasso. [sent-74, score-0.395]
25 The greedy algorithm analysis employed in this paper is a combination of an observation by Tropp (2004, see Lemma 11) and some technical lemmas for the behavior of greedy least squares regression by Zhang (2008), which are included in Appendix A for completeness. [sent-75, score-0.763]
26 Note that Zhang (2008) only studied a forward-backward procedure, but not the more standard forward greedy algorithm considered here. [sent-76, score-0.338]
27 As we shall see in this paper, the condition μ_X(F̄) ≤ 1 is necessary for the success of the forward greedy procedure. [sent-78, score-0.371]
28 It is worth mentioning that Lasso is consistent in parameter estimation under a weaker sparse eigenvalue condition, even if the condition μ_X(F̄) ≤ 1 fails (which means Lasso may not estimate the true feature set correctly): for example, see Meinshausen and Yu (2008) and Zhang (2009). [sent-79, score-0.166]
29 Although similar results may be obtained for greedy least squares regression, when the condition μ_X(F̄) ≤ 1 fails, it was shown by Zhang (2008) that the performance of the greedy algorithm can be improved by incorporating backward steps. [sent-80, score-0.517]
30 In contrast, results in this paper show that if the design matrix satisfies the additional condition μ_X(F̄) < 1, then the standard forward greedy algorithm will be successful without complicated backward steps. [sent-81, score-0.406]
31 Feature Selection using Greedy Least Squares Regression: We would like to establish conditions under which the forward greedy algorithm in Figure 1 never makes any mistake (with large probability), and is thus suitable for feature selection. [sent-83, score-0.41]
32 The following theorem gives conditions under which the forward greedy algorithm can identify the correct set of features. [sent-101, score-0.405]
33 Theorem 1 Consider the greedy least squares algorithm in Figure 1, where Assumption 1 holds. [sent-102, score-0.425]
34 Given η ∈ (0, 0.5), with probability larger than 1 − 2η, if the stopping criterion satisfies ε > σ√(2 ln(2d/η)) / (1 − μ_X(F̄)) and min_{j∈F̄} |β̄_j| ≥ 3ε ρ_X(F̄)^{-1} / √n, then when the procedure stops, we have F^{(k−1)} = F̄ and ‖β^{(k−1)} − β̄‖_∞ ≤ σ √((2 ln(2|F̄|/η)) / (n ρ_X(F̄))). [sent-104, score-0.116]
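A small numeric sketch of the two conditions in Theorem 1; the constants simply transcribe the displayed bounds, and the values assigned to sigma, n, d, eta, mu, and rho are illustrative placeholders (in practice mu and rho would come from mu_X and rho_X above).

```python
import numpy as np

# Evaluate the two conditions of Theorem 1 for assumed problem constants.
sigma, n, d, eta = 0.5, 200, 500, 0.05
mu, rho = 0.6, 0.8                       # assumed values of mu_X(F_bar), rho_X(F_bar)

eps_min = sigma * np.sqrt(2 * np.log(2 * d / eta)) / (1 - mu)   # need eps > eps_min
eps = 1.05 * eps_min                                            # any value above the bound
beta_min_needed = 3 * eps / (rho * np.sqrt(n))                  # need min_j |beta_bar_j| >= this

print(f"stopping threshold eps > {eps_min:.3f}; "
      f"each nonzero |beta_bar_j| should be >= {beta_min_needed:.3f}")
```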
35 Theorem 2 Consider the greedy least squares algorithm in Figure 1, where Assumption 1 holds. [sent-106, score-0.425]
36 Given η ∈ (0, 0.5), with probability larger than 1 − 2η, if the stopping criterion satisfies ε > σ√(2 ln(2d/η)) / (1 − μ_X(F̄)), then when the procedure stops, the following claims are true: • F^{(k−1)} ⊂ F̄. [sent-108, score-0.145]
37 In such a case, the feature set F^{(k−1)} selected by the greedy least squares algorithm is approximately correct. [sent-114, score-0.425]
38 This implies that when μ_X(F̄) < 1, greedy least squares regression leads to a good estimation of the true parameter β̄. [sent-121, score-0.495]
39 By choosing ε appropriately, we obtain ‖β^{(k−1)} − β̄‖_2 = O(√(|F̄|/n + k(ε) ln d / n)). [sent-122, score-0.154]
40 Therefore, if k(ε) is small, then Lasso is inferior due to the extra ln d factor. [sent-124, score-0.154]
41 In this paper, we are mainly interested in the situation k(ε) = 0, which implies that with the stopping criterion ε, greedy least squares regression can correctly identify all features with large probability. [sent-126, score-0.598]
42 This means that under the assumption of Theorem 1, it is possible to identify all features correctly using the greedy least squares algorithm as long as the target coefficients β̄_j (j ∈ F̄) are larger than the order σ√(ln(d/η)/n). [sent-129, score-0.573]
43 In fact, since σ√(ln(d/η)/n) is the noise level, if there exists a target coefficient β̄_j that is smaller than O(σ√(ln(d/η)/n)) in absolute value, then we cannot distinguish such a small coefficient from zero (or noise) with large probability. [sent-130, score-0.109]
44 Therefore when the condition μ_X(F̄) < 1 holds, it is not possible to do much better than greedy least squares regression except for the constant hidden in O(·) and its dependency on ρ(F̄) and μ_X(F̄). [sent-131, score-0.515]
45 In comparison, for Lasso, the condition required of min_{j∈F̄} |β̄_j| depends not only on ρ(F̄)^{-1} and (1 − μ_X(F̄))^{-1}, but also on the quantity ‖(X_F̄^T X_F̄)^{-1}‖_{∞,∞} (see Wainwright, 2006), where ‖(X_F̄^T X_F̄)^{-1}‖_{∞,∞} = sup_{u∈R^{|F̄|}} ‖(X_F̄^T X_F̄)^{-1} u‖_∞ / ‖u‖_∞. [sent-132, score-0.115]
46 This is a more restrictive condition than that of greedy least squares regression. [sent-143, score-0.473]
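The (∞,∞) operator norm above is just the maximum absolute row sum of the inverse Gram matrix, so it can be evaluated directly; the following one-function NumPy sketch (with an assumed function name) illustrates this.

```python
import numpy as np

def inf_inf_norm(X, F):
    """||(X_F^T X_F)^{-1}||_{inf,inf}: operator norm induced by the sup-norm,
    i.e. the maximum absolute row sum of the inverse Gram matrix."""
    XF = X[:, sorted(F)]
    G_inv = np.linalg.inv(XF.T @ XF)
    return np.abs(G_inv).sum(axis=1).max()

# Example (X, F_bar from the first sketch):
# print(inf_inf_norm(X, F_bar))
```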
47 For example, for the greedy algorithm, we can show ‖β̂_X(F̄, y) − β̄‖_2 = O(σ√(|F̄|/n)) and ‖β̂_X(F̄, y) − β̄‖_∞ = O(σ√(ln|F̄|/n)). [sent-146, score-0.433]
48 Under the conditions of Theorem 1, we have β^{(k−1)} = β̂_X(F̄, y), and thus ‖β^{(k−1)} − β̄‖_2 = O(σ√(|F̄|/n)) and ‖β^{(k−1)} − β̄‖_∞ = O(σ√(ln|F̄|/n)). [sent-147, score-0.189]
49 Under the same conditions, for the Lasso estimator β̂ of (2), we have ‖β̂ − β̄‖_2 = O(σ√(|F̄| ln d / n)) and ‖β̂ − β̄‖_∞ = O(σ√(ln d / n)). [sent-148, score-0.308]
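The restricted ("oracle") least squares estimator β̂_X(F̄, y) that appears in these bounds is straightforward to compute; a sketch, again reusing the variables from the first snippet, with the two error norms being compared shown as comments.

```python
import numpy as np

def restricted_ls(X, y, F):
    """beta_hat_X(F, y): least squares with coefficients restricted to F."""
    F = sorted(F)
    beta = np.zeros(X.shape[1])
    beta[F], *_ = np.linalg.lstsq(X[:, F], y, rcond=None)
    return beta

# Example (X, y, beta_bar, F_bar from the first sketch):
# beta_hat = restricted_ls(X, y, F_bar)
# err_l2  = np.linalg.norm(beta_hat - beta_bar, 2)       # compare to O(sigma * sqrt(|F|/n))
# err_inf = np.linalg.norm(beta_hat - beta_bar, np.inf)  # compare to O(sigma * sqrt(ln|F|/n))
```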
50 The ln d factor (bias) is inherent to Lasso, which can be removed with two-stage procedures (e. [sent-149, score-0.154]
51 However, such procedures are less robust and more complicated than the simple greedy algorithm. [sent-152, score-0.279]
52 Feature Selection Consistency: As we have mentioned before, the effectiveness of feature selection using Lasso was established by Zhao and Yu (2006), under more refined irrepresentable conditions that depend on the signs of the true target sgn(β̄). [sent-154, score-0.488]
53 In comparison, the condition μ_X(F̄) < 1 in Theorem 2 depends only on the design matrix but not the sparse target β̄; that is, the condition is with respect to the worst case choice of sgn(β̄). [sent-155, score-0.234]
54 Due to the complexity of the greedy procedure, we cannot establish a simple target-dependent condition that ensures feature selection consistency. [sent-157, score-0.471]
55 This means that for any specific target, the behavior of the forward greedy algorithm and Lasso might be different, and one may be preferred over the other under different scenarios. [sent-158, score-0.323]
56 In the following, we introduce the target independent irrepresentable conditions that are equivalent to the irrepresentable conditions of Zhao and Yu (2006) with the worst case choice of sgn(β̄) (also see Wainwright, 2006). [sent-160, score-0.74]
57 Let F̄^{(n)} be the feature set, where Ey = X^{(n)} β̄^{(n)}. We say that the sequence satisfies the strong target independent irrepresentable condition if there exists δ > 0 such that lim_{n→∞} μ_{X^{(n)}}(F̄^{(n)}) ≤ 1 − δ. [sent-163, score-0.451]
58 We say that the sequence satisfies the weak target independent irrepresentable condition if μ_{X^{(n)}}(F̄^{(n)}) ≤ 1 for all sufficiently large n. [sent-164, score-0.422]
59 It was shown by Zhao and Yu (2006) that the strong (target independent) irrepresentable condition is sufficient for Lasso to select features consistently for all possible sign combinations of β̄^{(n)} when n → ∞ (under appropriate assumptions). [sent-165, score-0.417]
60 In addition, the weak (target independent) irrepresentable condition is necessary for Lasso to select features consistently when n → ∞. [sent-166, score-0.425]
61 The target independent irrepresentable conditions are considered by Zhao and Yu (2006) and Wainwright (2006). [sent-167, score-0.385]
62 Specifically, the following two theorems show that the strong target independent irrepresentable condition is sufficient for Algorithm 1 to select features consistently, while the weak target independent irrepresentable condition is necessary. [sent-170, score-0.882]
63 Theorem 4 Consider regression problems indexed by the sample size n, and use notations in Definition 3. [sent-171, score-0.084]
64 For each problem of sample size n, denote by Fn the feature set from Algorithm 1 when it stops with ε = n^{s/2} for some s ∈ (0, 1]. [sent-174, score-0.107]
65 Then for all sufficiently large n, we have P(Fn ≠ F̄^{(n)}) ≤ exp(−n^s / ln n) if the following conditions hold: 1. [sent-175, score-0.189]
66 Theorem 5 Consider regression problems indexed by the sample size n, and use notations in Definition 3. [sent-181, score-0.084]
67 Assume that the weak irrepresentable condition is violated at sample sizes n1 < n2 < · · · . [sent-183, score-0.358]
68 Proof: By definition of μ_X(F̄), there exists v = (X_F̄^T X_F̄)u ∈ R^{|F̄|} such that μ_X(F̄) = max_{j∉F̄} ‖(X_F̄^T X_F̄)^{-1} X_F̄^T x_j‖_1 = max_{j∉F̄} |v^T (X_F̄^T X_F̄)^{-1} X_F̄^T x_j| / ‖v‖_∞ = max_{j∉F̄} |u^T X_F̄^T x_j| / ‖(X_F̄^T X_F̄)u‖_∞ = max_{j∉F̄} |x_j^T X_F̄ u| / max_{i∈F̄} |(x_i^T X_F̄)u|. [sent-186, score-0.26]
69 At any sample size n = n_j, since μ_{X^{(n)}}(F̄^{(n)}) > 1, we can find a sufficiently large target vector β̄^{(n)} such that max_{i∈F̄^{(n)}} |x_i^T X^{(n)} β̄^{(n)}| < max_{j∉F̄^{(n)}} |x_j^T X^{(n)} β̄^{(n)}| − 2δ_n. [sent-192, score-0.194]
70 Therefore (3) implies that max_{i∈F̄^{(n)}} [|x_i^T X^{(n)} β̄^{(n)}| + |x_i^T (y − X^{(n)} β̄^{(n)})|] < max_{j∉F̄^{(n)}} [|x_j^T X^{(n)} β̄^{(n)}| − |x_j^T (y − X^{(n)} β̄^{(n)})|]. [sent-198, score-0.158]
71 Therefore max_{i∈F̄^{(n)}} |x_i^T y| < max_{j∉F̄^{(n)}} |x_j^T y|. [sent-199, score-0.13]
72 Conclusion: We have shown that the weak and strong target independent irrepresentable conditions are necessary and sufficient conditions for a greedy least squares regression algorithm to select features consistently. [sent-203, score-0.973]
73 These conditions match the target independent versions of the necessary and sufficient conditions for Lasso by Zhao and Yu (2006). [sent-204, score-0.134]
74 Moreover, if the eigenvalue ρ(F̄) is bounded away from zero, then the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. [sent-205, score-0.496]
75 In comparison, under the same condition, Lasso may require the coefficients to be larger than O(√s) times the noise level, where s is the number of nonzero coefficients. [sent-206, score-0.092]
76 This implies that under some conditions, greedy least squares regression may potentially select features more effectively than Lasso in the presence of stochastic noise. [sent-207, score-0.56]
77 Although the target independent versions of the irrepresentable conditions for greedy least squares regression match those of Lasso, our result does not show which algorithm is better for any specific target. [sent-208, score-0.852]
78 Acknowledgments This paper is the outcome of some private discussion with Joel Tropp, which drew the author’s attention to the similarities and differences between orthogonal matching pursuit and Lasso for estimating sparse signals. [sent-210, score-0.099]
79 Auxiliary Lemmas The following technical lemmas from Zhang (2008) are needed to analyze the behavior of greedy least squares regression under stochastic noise. [sent-212, score-0.503]
80 The following lemma relates the squared error reduction in one greedy step to correlation coefficients. [sent-214, score-0.324]
81 The following lemma provides a bound on the squared error reduction of one forward greedy step. [sent-217, score-0.368]
82 . . . , m, we have for all η ∈ (0, 1), with probability larger than 1 − η: sup_j |∑_{i=1}^n g_{i,j} (ξ_i − Eξ_i)| ≤ a √(2 ln(2m/η)), where a² = sup_j ∑_{i=1}^n g²_{i,j} σ². [sent-246, score-0.081]
83 This implies that P(sup_j |∑_{i=1}^n g_{i,j} (ξ_i − Eξ_i)| ≥ ε) ≤ m sup_j P(|∑_{i=1}^n g_{i,j} (ξ_i − Eξ_i)| ≥ ε) ≤ 2m e^{−ε²/(2a²)}. [sent-249, score-0.09]
84 Since by definition, ρ(F̄)n is no larger than the smallest eigenvalue of G^T G, the desired inequality follows from the estimate ‖e_j^T (G^T G)^{-1} G^T‖_2^2 = e_j^T (G^T G)^{-1} e_j ≤ 1/(n ρ(F̄)). [sent-264, score-0.078]
85 For all η ∈ (0, 1), with probability larger than 1 − η, we have max_{1≤j≤d} |(X β̂_X(F̄, y) − y)^T x_j| ≤ σ √(2n ln(2d/η)). [sent-271, score-0.084]
86 Lemma 8 implies that with probability larger than 1 − η: sup_{j=1,...,d} |(y − Ey)^T (I − P) x_j| ≤ σ √(2n ln(2d/η)). [sent-273, score-0.078]
87 First, Lemma 10 implies that with large probability, max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| is small. [sent-283, score-0.08]
88 The claims of the theorem then follow by induction. [sent-285, score-0.072]
89 We have max_{j∉F̄} |(Xβ − y)^T x_j| ≤ max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| + μ_X(F̄) max_{i∈F̄} |(Xβ − y)^T x_i|. [sent-289, score-0.131]
90 Therefore max_{i∈F̄} |(Xβ − y)^T x_i| = max_{i∈F̄} |x_i^T X(β − β′)| = ‖X_F̄^T X_F̄ (β − β′)‖_∞. [sent-292, score-0.15]
91 Let v = X_F̄^T X_F̄ (β − β′); then the definition of μ_X(F̄) implies that μ_X(F̄) ≥ max_{j∉F̄} |x_j^T X_F̄ (X_F̄^T X_F̄)^{-1} v| / ‖v‖_∞ = max_{j∉F̄} |x_j^T X_F̄ (β − β′)| / ‖X_F̄^T X_F̄ (β − β′)‖_∞. [sent-293, score-0.158]
92 We obtain from the above max_{j∉F̄} |x_j^T X(β − β′)| ≤ μ_X(F̄) ‖X_F̄^T X_F̄ (β − β′)‖_∞ = μ_X(F̄) max_{i∈F̄} |(Xβ − y)^T x_i|. [sent-294, score-0.15]
93 Now, the lemma follows from the simple inequality max_{j∉F̄} |(Xβ − y)^T x_j| ≤ max_{j∉F̄} |x_j^T X(β − β′)| + max_{j∉F̄} |x_j^T (Xβ′ − y)|. Now we are ready to prove the theorem. [sent-295, score-0.256]
94 From Lemma 10, we obtain with probability larger than 1 − η, max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| ≤ σ √(2 ln(2d/η)) < (1 − μ_X(F̄)) ε. [sent-297, score-0.084]
95 In this case, we have max_{i∈F̄} |(Xβ^{(k−1)} − y)^T x_i| > ε > max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| / (1 − μ_X(F̄)). [sent-305, score-0.15]
96 Now by combining this estimate with Lemma 11, we obtain max_{j∉F̄} |(Xβ^{(k−1)} − y)^T x_j| < max_{i∈F̄} |(Xβ^{(k−1)} − y)^T x_i|. [sent-307, score-0.15]
97 i^{(k)} ∉ F̄: By Lemma 11, we must have max_{i∈F̄} |(Xβ − y)^T x_i| ≤ max_{j∉F̄} |(Xβ − y)^T x_j| ≤ max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| + μ_X(F̄) max_{i∈F̄} |(Xβ − y)^T x_i| ≤ max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| + μ_X(F̄) max_{j∉F̄} |(Xβ − y)^T x_j|. [sent-311, score-0.43]
98 Therefore max_{j∉F̄} |(Xβ − y)^T x_j| ≤ (1/(1 − μ_X(F̄))) max_{j∉F̄} |(X β̂_X(F̄, y) − y)^T x_j| < ε. [sent-312, score-0.19]
99 Therefore by induction, when the procedure stops, the following three claims hold: • F^{(k−1)} ⊂ F̄. [sent-321, score-0.072]
100 Some sharp performance bounds for least squares regression with L1 regularization. [sent-358, score-0.205]
wordName wordTfidf (topN-words)
[('xf', 0.714), ('irrepresentable', 0.286), ('greedy', 0.279), ('lasso', 0.243), ('ln', 0.154), ('supp', 0.143), ('gt', 0.143), ('squares', 0.122), ('zhao', 0.12), ('ey', 0.112), ('tropp', 0.109), ('reedy', 0.107), ('xt', 0.097), ('quares', 0.091), ('yu', 0.088), ('hang', 0.081), ('eature', 0.075), ('east', 0.075), ('stops', 0.07), ('wainwright', 0.069), ('max', 0.065), ('target', 0.064), ('election', 0.056), ('claims', 0.049), ('condition', 0.048), ('maxi', 0.047), ('rn', 0.045), ('noise', 0.045), ('lemma', 0.045), ('forward', 0.044), ('zhang', 0.044), ('gi', 0.044), ('eigenvalue', 0.043), ('regression', 0.042), ('meinshausen', 0.039), ('stopping', 0.038), ('sparse', 0.038), ('pursuit', 0.037), ('feature', 0.037), ('coef', 0.037), ('eyi', 0.036), ('mallat', 0.036), ('omp', 0.036), ('ns', 0.035), ('conditions', 0.035), ('tong', 0.034), ('sup', 0.031), ('rutgers', 0.031), ('joel', 0.03), ('selection', 0.028), ('bin', 0.028), ('implies', 0.028), ('nonzero', 0.028), ('nicolai', 0.027), ('inf', 0.027), ('rd', 0.027), ('features', 0.025), ('least', 0.024), ('matching', 0.024), ('identify', 0.024), ('weak', 0.024), ('procedure', 0.023), ('theorem', 0.023), ('sgn', 0.023), ('signs', 0.022), ('notations', 0.021), ('select', 0.021), ('indexed', 0.021), ('consistently', 0.021), ('xd', 0.021), ('xi', 0.02), ('min', 0.02), ('stochastic', 0.019), ('larger', 0.019), ('cients', 0.019), ('worst', 0.019), ('recovery', 0.019), ('backward', 0.018), ('consistency', 0.018), ('design', 0.017), ('sharp', 0.017), ('lemmas', 0.017), ('away', 0.017), ('reliably', 0.016), ('criterion', 0.016), ('strong', 0.016), ('proof', 0.016), ('quantity', 0.016), ('appeared', 0.016), ('established', 0.016), ('inequality', 0.016), ('assumption', 0.016), ('basis', 0.015), ('establish', 0.015), ('rithm', 0.015), ('limn', 0.015), ('pursuits', 0.015), ('plies', 0.015), ('alent', 0.015), ('eet', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
2 0.08818724 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
3 0.079493798 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
4 0.077895492 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
5 0.063800864 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
6 0.062434167 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
7 0.059848774 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
8 0.053099103 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
9 0.048498519 16 jmlr-2009-Classification with Gaussians and Convex Loss
10 0.044430822 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
11 0.040760007 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
12 0.039453749 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
13 0.038949568 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
14 0.037297178 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)
15 0.035721164 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
16 0.035393942 78 jmlr-2009-Refinement of Reproducing Kernels
17 0.034758486 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
18 0.034560248 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
19 0.034340989 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
20 0.033767916 49 jmlr-2009-Learning Permutations with Exponential Weights
topicId topicWeight
[(0, 0.175), (1, 0.05), (2, -0.005), (3, -0.014), (4, 0.009), (5, 0.075), (6, -0.076), (7, 0.034), (8, -0.135), (9, -0.021), (10, 0.13), (11, 0.019), (12, -0.015), (13, 0.17), (14, 0.153), (15, -0.031), (16, -0.319), (17, -0.16), (18, -0.01), (19, 0.084), (20, -0.004), (21, 0.042), (22, 0.214), (23, -0.07), (24, 0.037), (25, 0.042), (26, 0.102), (27, 0.14), (28, -0.012), (29, -0.127), (30, 0.097), (31, -0.019), (32, 0.05), (33, -0.055), (34, 0.228), (35, 0.092), (36, 0.091), (37, 0.058), (38, 0.072), (39, 0.055), (40, -0.077), (41, -0.031), (42, 0.076), (43, -0.127), (44, 0.2), (45, 0.046), (46, -0.086), (47, 0.121), (48, -0.116), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.96094108 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
2 0.46481243 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
3 0.37413645 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
4 0.32950166 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
5 0.3188518 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
7 0.26548797 90 jmlr-2009-Structure Spaces
8 0.2437185 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
9 0.2398957 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
10 0.23300995 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
11 0.22058597 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
12 0.22058316 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
13 0.19244045 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
14 0.19170617 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
15 0.18707491 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
16 0.18209258 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
17 0.17620577 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
18 0.1531994 46 jmlr-2009-Learning Halfspaces with Malicious Noise
19 0.15280668 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
20 0.1507545 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
topicId topicWeight
[(8, 0.01), (11, 0.025), (19, 0.018), (26, 0.042), (38, 0.016), (41, 0.355), (52, 0.035), (55, 0.078), (58, 0.033), (66, 0.146), (68, 0.043), (90, 0.047), (96, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.70233911 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
2 0.44632018 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
3 0.44154489 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.43738434 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
6 0.43015793 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.42854214 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
8 0.42804843 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
9 0.42586085 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
10 0.42540547 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
11 0.42405403 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
12 0.42353258 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
13 0.42298877 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
14 0.41985196 16 jmlr-2009-Classification with Gaussians and Convex Loss
15 0.41899967 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
16 0.4167197 18 jmlr-2009-Consistency and Localizability
17 0.41573012 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
18 0.4156788 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence
19 0.41407672 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.41373086 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost