jmlr jmlr2013 jmlr2013-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
Reference: text
sentIndex sentText sentNum sentScore
1 This article introduces a general tuning parameter selection criterion based on variable selection stability. [sent-8, score-0.633]
2 Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning 1. [sent-12, score-0.816]
3 For example, Zhao and Yu (2006) showed that, under the irrepresentable condition, the lasso regression is selection consistent when the tuning parameter converges to 0 at a rate slower than O(n^{-1/2}). [sent-21, score-0.696]
4 In the literature, many classical selection criteria have been applied to the penalized regression models, including cross validation (Stone, 1974), generalized cross validation (Craven and Wahba, 1979), Mallows’ Cp (Mallows, 1973), AIC (Akaike, 1974) and BIC (Schwarz, 1978). [sent-24, score-0.448]
5 To the best of our knowledge, few criteria have been developed that focus directly on the selection of the informative variables. [sent-30, score-0.368]
6 This article proposes a tuning parameter selection criterion based on variable selection stability. [sent-31, score-0.633]
7 The similarity between two informative variable sets is measured by Cohen’s kappa coefficient (Cohen, 1960), which adjusts the actual variable selection agreement relative to the possible agreement by chance. [sent-33, score-0.967]
8 Whereas the stability selection method (Meinshausen and Bühlmann, 2010) also follows the idea of variable selection stability, it mainly focuses on selecting the informative variables as opposed to selecting the tuning parameters for a given variable selection method. [sent-36, score-1.057]
9 More importantly, its asymptotic selection consistency is established, showing that the variable selection method with the selected tuning parameter would recover the truly informative variable set with probability tending to one. [sent-38, score-0.881]
10 Section 3 presents the idea of variable selection stability as well as the proposed kappa selection criterion. [sent-41, score-0.994]
11 Section 4 establishes the asymptotic selection consistency of the kappa selection criterion. [sent-42, score-0.917]
12 Here a penalty term is said to be selection consistent if the probability that the fitted regression model includes only the truly informative variables tends to one, and λ is replaced by λn to emphasize its dependence on n in quantifying the asymptotic behaviors. [sent-56, score-0.512]
13 Therefore, it is desirable to devise a tuning parameter selection criterion that can be employed by the penalized regression models so that their variable selection performance can be optimized. [sent-60, score-0.753]
14 Tuning via Variable Selection Stability This section introduces the proposed tuning parameter selection criterion based on the concept of variable selection stability. [sent-62, score-0.633]
15 The key idea is that if we repeatedly draw samples from the population and apply the candidate variable selection methods, a desirable method should produce the informative variable set that does not vary much from one sample to another. [sent-63, score-0.426]
16 Clearly, variable selection stability is assumption free and can be used to tune any penalized regression model. [sent-64, score-0.461]
17 A base variable selection method Ψ(zn ; λ) with a given training sample zn and a tuning parameter λ yields a set of selected informative variables A ⊂ {1, · · · , p}, called the active set. [sent-67, score-0.566]
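To make the notion of an active set concrete, the sketch below fits a lasso regression at a fixed tuning parameter and returns the indices of the nonzero coefficients. This is not the authors' code; it is a minimal Python sketch using scikit-learn's Lasso, whose alpha argument plays the role of λ up to the package's scaling convention, and the zero threshold tol is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

def active_set(X, y, lam, tol=1e-8):
    """Fit a lasso regression at tuning parameter `lam` and return the
    active set A = {j : beta_hat_j != 0} of selected variable indices."""
    fit = Lasso(alpha=lam, fit_intercept=True, max_iter=10000).fit(X, y)
    return set(np.flatnonzero(np.abs(fit.coef_) > tol))
```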
18 Suppose that two active sets A1 and A2 are produced; the agreement between A1 and A2 can be measured by Cohen’s kappa coefficient (Cohen, 1960), κ(A1, A2) = (Pr(a) − Pr(e)) / (1 − Pr(e)). [sent-69, score-0.535]
19 As a consequence, the kappa coefficient in (2) is not suitable for evaluating the null model with no informative variable and the complete model with all variables. [sent-82, score-0.64]
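To make the agreement measure concrete, the sketch below computes Cohen's kappa between two active sets over p candidate variables, with Pr(a) the observed proportion of variables on which the two sets agree (selected by both or excluded by both) and Pr(e) the agreement expected by chance. As noted above, kappa is undefined when Pr(e) = 1, which happens exactly when both sets correspond to the null model or both to the complete model; returning NaN in that case is an illustrative convention, not the authors' implementation.

```python
import numpy as np

def cohen_kappa(A1, A2, p):
    """Cohen's kappa agreement between two active sets A1, A2 over p
    candidate variables: kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))."""
    A1, A2 = set(A1), set(A2)
    n11 = len(A1 & A2)            # variables selected by both
    n00 = p - len(A1 | A2)        # variables excluded by both
    pr_a = (n11 + n00) / p        # observed agreement
    pr_e = (len(A1) * len(A2) + (p - len(A1)) * (p - len(A2))) / p**2  # chance agreement
    if pr_e == 1.0:               # both null or both complete models
        return np.nan
    return (pr_a - pr_e) / (1.0 - pr_e)
```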
20 2 Kappa Selection Criterion This section proposes an estimation scheme of the variable selection stability based on cross validation, and develops a kappa selection criterion to tune the penalized regression models by maximizing the estimated variable selection stability. [sent-88, score-1.504]
21 Furthermore, in order to reduce the estimation variability due to the splitting randomness, multiple data splitting can be conducted and the average estimated variable selection stability over all splittings is computed. [sent-91, score-0.375]
22 The proposed kappa selection criterion is presented as follows. [sent-93, score-0.758]
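The sketch below outlines one plausible implementation of the scheme just described: for each candidate λ the data are randomly split in half B times, the base method is applied to each half, and the kappa agreement between the two resulting active sets is averaged over splits. The final thresholding step is an assumption on my part (here λ̂ is taken as the smallest λ whose averaged stability is within a proportion αn of the maximum) and may differ in detail from Algorithm 1 of the paper; active_set and cohen_kappa are the helper sketches given earlier.

```python
import numpy as np

def kappa_stability(X, y, lam, B=20, rng=None):
    """Estimated variable selection stability at tuning parameter `lam`:
    the average Cohen's kappa agreement between active sets fitted on the
    two halves of B random splits of the data."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    kappas = []
    for _ in range(B):
        perm = rng.permutation(n)
        half1, half2 = perm[: n // 2], perm[n // 2:]
        A1 = active_set(X[half1], y[half1], lam)
        A2 = active_set(X[half2], y[half2], lam)
        kappas.append(cohen_kappa(A1, A2, p))
    return np.nanmean(kappas)

def select_lambda_kappa(X, y, lambdas, alpha_n=0.1, B=20, rng=None):
    """Kappa selection criterion (sketch): estimate the stability for each
    candidate lambda and return the smallest lambda whose stability is within
    a proportion alpha_n of the maximum (assumed thresholding rule)."""
    rng = np.random.default_rng(rng)
    s_hat = np.array([kappa_stability(X, y, lam, B=B, rng=rng) for lam in lambdas])
    s_max = np.nanmax(s_hat)
    qualified = [lam for lam, s in zip(lambdas, s_hat) if s >= (1 - alpha_n) * s_max]
    return min(qualified), s_hat
```

The simulations reported below use B = 20 random splittings, which is taken as the default here.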
23 A large value of λ may produce an active set that consistently overlooks the weakly informative variables, which leads to an underfitted model with large variable selection stability. [sent-104, score-0.412]
24 The proposed kappa selection criterion shares a similar idea of variable selection stability with the stability selection method (Meinshausen and Bühlmann, 2010), but they differ in a number of ways. [sent-111, score-1.381]
25 First, the stability selection method is a competitive variable selection method, which combines the randomized lasso regression and the bootstrap, and achieves superior variable selection performance. [sent-112, score-1.162]
26 In contrast, the kappa selection criterion is a model selection criterion designed to select appropriate tuning parameters for any variable selection method. [sent-113, score-1.414]
27 Second, despite its robustness, the stability selection method still requires a number of tuning parameters. [sent-114, score-0.371]
28 In contrast, the kappa selection criterion can be directly applied to select the tuning parameters for the stability selection method. [sent-117, score-1.152]
29 Asymptotic Selection Consistency This section presents the asymptotic selection consistency of the proposed kappa selection criterion. [sent-119, score-0.917]
30 Furthermore, we denote rn ≺ sn if rn converges to 0 at a faster rate than sn, rn ∼ sn if rn converges to 0 at the same rate as sn, and rn ⪯ sn if rn converges to 0 at a rate not slower than sn. [sent-121, score-0.6]
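For reference, one common formalization of these three rate comparisons (a restatement of the definitions above, not a quotation from the paper):

```latex
r_n \prec s_n \iff \frac{r_n}{s_n} \to 0, \qquad
r_n \sim  s_n \iff \frac{r_n}{s_n} \to c \in (0,\infty), \qquad
r_n \preceq s_n \iff \limsup_{n\to\infty} \frac{r_n}{s_n} < \infty .
```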
31 Assumption 1: There exist positive rn and sn such that the base variable selection method is selection consistent if rn ≺ λn ≺ sn . [sent-124, score-0.688]
32 Assumption 1 specifies an asymptotic working interval for λn within which the base variable selection method is selection consistent. [sent-127, score-0.507]
33 For instance, Lemma 2 in the online supplementary material shows that Assumptions 1 and 2 are satisfied by the lasso regression, the adaptive lasso, and the SCAD. [sent-134, score-0.426]
34 The assumptions can also be verified for other methods such as the elastic-net (Zou and Hastie, 2005), the adaptive elastic net (Zou and Zhang, 2009), the group lasso (Yuan and Lin, 2006), and the adaptive group lasso (Wang and Leng, 2008). [sent-135, score-0.758]
35 Given that the base variable selection method is selection consistent with appropriately selected λn ’s, Theorem 1 shows that the proposed kappa selection criterion is able to identify such λn ’s. [sent-136, score-1.28]
36 Theorem 1 Under Assumptions 1 and 2, any variable selection method in (1) with λ̂n selected as in Algorithm 1 with αn ≻ εn is selection consistent. [sent-137, score-0.473]
37 Theorem 1 claims the asymptotic selection consistency of the proposed kappa selection criterion when p is fixed. [sent-139, score-1.022]
38 That is, with probability tending to one, the selected active set by the resultant variable selection method with tuning parameter λ̂n contains only the truly informative variables. [sent-140, score-0.628]
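The displayed limit that accompanies Theorem 1 in the original paper is not recoverable from this extract. A plausible rendering, consistent with the surrounding sentence and using A0 for the truly informative set and Â_{λ̂n} for the active set selected with the tuned parameter (notation assumed here), is:

```latex
\lim_{n\to\infty} \lim_{B\to\infty}
\Pr\!\left( \widehat{A}_{\hat{\lambda}_n} = A_0 \right) = 1 .
```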
39 As long as αn converges to 0 not too fast, the kappa selection criterion is guaranteed to be consistent. [sent-141, score-0.758]
40 Therefore, the value of αn is expected to have little effect on the performance of the kappa selection criterion, which agrees with the sensitivity study in Section 5. [sent-142, score-0.653]
41 2 Consistency with Diverging pn In high-dimensional data analysis, it is of interest to study the asymptotic behavior of the proposed kappa selection criterion with diverging pn, where the size of the truly informative set p0n may also diverge with n. [sent-145, score-1.321]
42 Assumption 1a: There exist positive rn and sn such that the base variable selection method is selection consistent if rn ≺ λn ≺ sn . [sent-147, score-0.688]
43 Theorem 2 Under Assumptions 1a and 2a, any variable selection method in (1) with λ̂n selected as in Algorithm 1 with min(pn(1 − c̃1n), pn⁻¹ c̲1n c̲2n) ≻ αn ≻ εn is selection consistent, where c̃1n = supλ0 c1n(λ0), c̲1n = infλ0 c1n(λ0), and c̲2n = infλ0 c2n(λ0). [sent-151, score-0.602]
44 Theorem 2 shows the asymptotic selection consistency of the proposed kappa selection criterion with a suitably chosen αn for diverging pn, where the diverging speed of pn is bounded as in Theorem 2 and depends on the base variable selection method. [sent-152, score-1.762]
45 Lemma 3 in the online supplementary material shows that (7)-(10) in Assumptions 1a and 2a are satisfied by the lasso regression. [sent-153, score-0.383]
46 Simulations This section examines the effectiveness of the proposed kappa selection criterion in simulated examples. [sent-157, score-0.758]
47 To assess the performance of each selection criterion, we report the percentage of selecting the true model over all replicates, as well as the number of correctly selected zeros and incorrectly selected zeros in β̂(λ̂). [sent-163, score-0.489]
48 For comparison, we set n = 40, 60 or 80 and implement the lasso regression, the adaptive lasso and the SCAD as the base variable selection methods. [sent-172, score-0.995]
49 The lasso regression and the adaptive lasso are implemented by the package ‘lars’ (Efron et al. [sent-173, score-0.761]
50 The number of splittings B for the kappa selection criterion is 20. [sent-179, score-0.775]
51 Each simulation is replicated 100 times, and the percentages of selecting the true active set, the average numbers of correctly selected zeros (C) and incorrectly selected zeros (I), and the relative prediction errors (RPE) are summarized in Tables 1-2 and Figure 1. [sent-180, score-0.417]
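The sketch below computes the reported per-replicate measures: whether the true active set is recovered exactly, the numbers of correctly (C) and incorrectly (I) selected zeros, and a relative prediction error. The RPE formula used here, (β̂ − β)ᵀΣ(β̂ − β)/σ², is an illustrative assumption; the paper's exact definition is not reproduced in this extract.

```python
import numpy as np

def replicate_metrics(beta_hat, beta_true, Sigma, sigma2):
    """Performance measures for one simulation replicate.

    Returns (exact_recovery, C, I, rpe) where
      C = number of truly-zero coefficients estimated as zero,
      I = number of truly-nonzero coefficients estimated as zero.
    The RPE formula is an illustrative assumption (see lead-in)."""
    est_zero = beta_hat == 0
    true_zero = beta_true == 0
    C = int(np.sum(est_zero & true_zero))        # correctly selected zeros
    I = int(np.sum(est_zero & ~true_zero))       # incorrectly selected zeros
    exact = bool(np.all(est_zero == true_zero))  # true active set recovered
    diff = beta_hat - beta_true
    rpe = float(diff @ Sigma @ diff / sigma2)    # assumed relative prediction error
    return exact, C, I, rpe
```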
52 [Table 1 body: rows n = 40, 60, 80 crossed with penalties Lasso, Ada lasso, SCAD; columns Ks, Cp, BIC, CV, GCV; the numerical entries are not recoverable from this extract.] [sent-181, score-1.008]
53 Table 1: The percentages of selecting the true active set for various selection criteria in simulations of Section 5.1. [sent-225, score-0.38]
54 Here ‘Ks’, ‘Cp’, ‘BIC’, ‘CV’ and ‘GCV’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-227, score-0.653]
55 [Table 2 body: rows n = 40, 60, 80 crossed with penalties Lasso, Ada lasso, SCAD; columns report C and I under Ks, Cp, BIC, CV, GCV; the numerical entries are not recoverable from this extract.] [sent-228, score-1.008]
56 Table 2: The average numbers of correctly selected zeros (C) and incorrectly selected zeros (I) for various selection criteria in simulations of Section 5.1. [sent-273, score-0.507]
57 Here ‘Ks’, ‘Cp’, ‘BIC’, ‘CV’ and ‘GCV’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-275, score-0.653]
58 Evidently, the proposed kappa selection criterion delivers superior performance against its competitors in terms of both variable selection accuracy and relative prediction error. [sent-276, score-1.123]
59 As shown in Table 1, the kappa selection criterion has the largest probability of choosing the true active set and consistently outperforms other selection criteria, especially when the lasso regression is used as the base variable selection method. [sent-277, score-1.655]
60 Table 2 shows that the kappa selection criterion yields the largest number of correctly selected zeros in all scenarios, and it yields almost perfect performance for the adaptive lasso and the SCAD. [sent-279, score-1.254]
61 In addition, all selection criteria barely select any incorrect zeros, whereas the kappa selection criterion is relatively more aggressive in that it has a small chance of shrinking some informative variables to zero when the sample size is small. [sent-280, score-1.232]
62 Besides the superior variable selection performance, the kappa selection criterion also delivers accurate prediction performance and yields a small relative prediction error, as displayed in Figure 1. [sent-282, score-1.122]
63 Here ‘K’, ‘Cp’, ‘B’, ‘C’ and ‘G’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-298, score-0.653]
64 To illustrate the effectiveness of the kappa selection criterion, we randomly select one replication with n = 40 and display the estimated variable selection stability as well as the results of detection and sparsity for various λ’s for the lasso regression. [sent-299, score-1.415]
65 Figure 2: The detection and sparsity of the lasso regression with the kappa selection criterion are shown on the top and the sensitivity of α to the relative prediction error is shown on the bottom. [sent-312, score-1.238]
66 The optimal log(λ) selected by the kappa selection criterion is denoted as the filled triangle in the detection and sparsity plots. [sent-313, score-0.837]
67 More importantly, the selection performance of the kappa selection criterion is very stable against αn when it is small. [sent-316, score-0.948]
68 Specifically, we apply the kappa selection criterion to the lasso regression for αn ∈ {l/100; l = 0, . . . [sent-317, score-1.14]
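A usage sketch of this sensitivity study, reusing select_lambda_kappa from the earlier sketch on a given data set (X, y) and candidate grid lambdas; since the upper end of the α grid is truncated in this extract, the range l = 0, ..., 30 is an illustrative assumption.

```python
# Sensitivity of the kappa selection criterion to the threshold alpha_n
# (illustrative alpha grid; X, y and lambdas are assumed to be defined as above).
alphas = [l / 100 for l in range(31)]
chosen_lambda = {a: select_lambda_kappa(X, y, lambdas, alpha_n=a, B=20)[0] for a in alphas}
```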
69 2 Scenario II: Diverging pn To investigate the effects of the noise level and the dimensionality, we compare all the selection criteria in the diverging pn scenario with a similar simulation model as in Scenario I, except that β = (5, 4, 3, 2, 1, 0, · · · , 0)T, pn = [√n], and σ = 1 or 6. [sent-324, score-0.757]
70 More specifically, 8 cases are examined: n = 100, pn = 10; n = 200, pn = 14; n = 400, pn = 20; and n = 800, pn = 28, with σ = 1 or 6 respectively. [sent-325, score-0.516]
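A data-generating sketch for Scenario II under the settings listed above; because the covariate correlation structure of Scenario I is not reproduced in this extract, independent standard normal predictors are assumed purely for illustration.

```python
import numpy as np

def simulate_scenario2(n, sigma=1.0, rng=None):
    """Generate one Scenario II data set: p_n = floor(sqrt(n)) predictors,
    beta = (5, 4, 3, 2, 1, 0, ..., 0), Gaussian noise with standard deviation
    `sigma`. Predictors are drawn i.i.d. N(0, 1) as an illustrative assumption."""
    rng = np.random.default_rng(rng)
    p_n = int(np.floor(np.sqrt(n)))
    beta = np.zeros(p_n)
    beta[:5] = [5, 4, 3, 2, 1]
    X = rng.standard_normal((n, p_n))
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

# The eight cases examined: n in {100, 200, 400, 800} (p_n = 10, 14, 20, 28), sigma = 1 or 6.
```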
71 The percentages of selecting the true active set, the average numbers of correctly selected zeros (C) and incorrectly selected zeros (I), and the relative prediction errors (RPE) are summarized in Tables 3-4 and Figures 3-4. [sent-327, score-0.417]
72 [Table 3 body, part 1: rows (n, pn) = (100, 10), (200, 14), (400, 20), (800, 28) for σ = 1 and σ = 6, crossed with penalties Lasso, Ada lasso, SCAD; columns Ks, Cp, BIC, CV, GCV; the numerical entries are not recoverable from this extract.] [sent-328, score-1.473]
73 [Table 3 body, part 2: continuation of the rows above; the numerical entries are not recoverable from this extract.] [sent-343, score-1.344]
74 Table 3: The percentages of selecting the true active set for various selection criteria in simulations of Section 5.2. [sent-439, score-0.38]
75 Here ‘Ks’, ‘Cp’, ‘BIC’, ‘CV’ and ‘GCV’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-441, score-0.653]
76 In the low noise case with σ = 1, the proposed kappa selection criterion outperforms other competitors in both variable selection and prediction performance. [sent-442, score-1.066]
77 As illustrated in Tables 3-4, the kappa selection criterion delivers the largest percentage of selecting the true active set among all the selection criteria, and achieves perfect variable selection performance for all the variable selection methods when n ≥ 200. [sent-443, score-1.586]
78 Furthermore, as shown in Figure 3, the kappa selection criterion yields the smallest relative prediction error across all cases. [sent-444, score-0.811]
79 As the noise level increases to σ = 6, the kappa selection criterion still delivers the largest percentage of selecting the true active set among all scenarios except for the adaptive lasso with n = 400, where the percentage is slightly smaller than that of BIC. [sent-445, score-1.304]
80 [Table 4 body: rows (n, pn) as in Table 3, crossed with penalties Lasso, Ada lasso, SCAD; columns report C and I under Ks, Cp, BIC, CV, GCV; the numerical entries are not recoverable from this extract.] [sent-461, score-1.344]
81 Table 4: The average numbers of correctly selected zeros (C) and incorrectly selected zeros (I) for various selection criteria in simulations of Section 5.2. [sent-614, score-0.507]
82 Here ‘Ks’, ‘Cp’, ‘BIC’, ‘CV’ and ‘GCV’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-616, score-0.653]
83 The kappa selection criterion also yields the largest number of correctly selected zeros. [sent-617, score-0.506]
84 Considering the smaller relative prediction errors achieved by the kappa selection criterion and BIC, these two criteria tend to produce sparser models with satisfactory prediction performance. [sent-620, score-0.917]
85 In practice, if false negatives are of concern, one can increase the thresholding value αn in the kappa selection criterion to allow a higher tolerance of instability and hence decrease the chance of claiming false negatives. [sent-621, score-0.674]
86 In addition, as shown in Figure 4, the kappa selection criterion yields the smallest relative prediction error for the lasso regression and the adaptive lasso among all scenarios, whereas the advantage is considerably less significant for the SCAD. [sent-622, score-1.572]
87 Here ‘K’, ‘Cp’, ‘B’, ‘C’ and ‘G’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-640, score-0.653]
88 Real Application In this section, we apply the kappa selection criterion to the prostate cancer data (Stamey et al. [sent-642, score-0.846]
89 Since it is unknown whether the clinical measures are truly informative or not, the performances of all the selection criteria are compared by computing their corresponding prediction errors on the test set in Table 5. [sent-650, score-0.481]
90 As shown in Table 5, the proposed kappa selection criterion yields the sparsest model and achieves the smallest prediction error for the lasso regression and the SCAD. [sent-651, score-1.169]
91 Here ‘K’, ‘Cp’, ‘B’, ‘C’ and ‘G’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-667, score-0.653]
92 [Table 5 body: selected active sets and prediction errors (PE) for Lasso, Ada lasso and SCAD under Ks, Cp, BIC, CV, GCV; under Ks the selected active sets are {1,2,4,5}, {1,2,5} and {1,2,4,5} respectively; the remaining entries are not recoverable from this extract.] [sent-668, score-0.672]
93 Table 5: The selected active sets and the prediction errors (PE) for various selection criteria in the prostate cancer example. [sent-683, score-0.463]
94 Here ‘Ks’, ‘Cp’, ‘BIC’, ‘CV’ and ‘GCV’ represent the kappa selection criterion, Mallows’ Cp , BIC, CV and GCV, respectively. [sent-684, score-0.653]
95 The prediction error for the adaptive lasso is comparable to the minimum. [sent-685, score-0.379]
96 In contrast to the models selected by the other criteria, the variable age is excluded by the kappa selection criterion for all base variable selection methods, which agrees with the findings in Zou and Hastie (2005). [sent-687, score-1.333]
97 Discussion This article proposes a tuning parameter selection criterion based on the concept of variable selection stability. [sent-689, score-0.633]
98 Its key idea is to select the tuning parameter so that the resultant variable selection method is stable in selecting the informative variables. [sent-690, score-0.539]
99 The lemma stated below shows that if a variable selection method is selection consistent and εn ≺ αn , then its variable selection stability converges to 1 in probability. [sent-700, score-0.798]
100 Shrinkage tuning parameter selection with a diverging number of parameters. [sent-885, score-0.38]
wordName wordTfidf (topN-words)
[('kappa', 0.463), ('scad', 0.379), ('lasso', 0.336), ('cp', 0.305), ('ada', 0.202), ('rpe', 0.197), ('selection', 0.19), ('gcv', 0.182), ('bic', 0.16), ('pn', 0.129), ('cv', 0.128), ('informative', 0.118), ('election', 0.117), ('adalasso', 0.108), ('criterion', 0.105), ('diverging', 0.101), ('onsistent', 0.098), ('tability', 0.098), ('uning', 0.098), ('mallows', 0.093), ('stability', 0.092), ('tuning', 0.089), ('fang', 0.074), ('penalized', 0.074), ('ks', 0.07), ('prostate', 0.069), ('zeros', 0.062), ('criteria', 0.06), ('zou', 0.059), ('variable', 0.059), ('rn', 0.053), ('sup', 0.05), ('truly', 0.049), ('wang', 0.048), ('sn', 0.047), ('regression', 0.046), ('active', 0.045), ('adaptive', 0.043), ('un', 0.042), ('sse', 0.039), ('consistency', 0.037), ('asymptotic', 0.037), ('selecting', 0.035), ('penalty', 0.035), ('selected', 0.034), ('fan', 0.034), ('delivers', 0.033), ('base', 0.031), ('percentages', 0.03), ('competitors', 0.03), ('gleason', 0.03), ('lcavol', 0.03), ('lweight', 0.03), ('svi', 0.03), ('prediction', 0.029), ('lim', 0.029), ('meinshausen', 0.028), ('agreement', 0.027), ('percentage', 0.027), ('junhui', 0.025), ('uninformative', 0.025), ('supplementary', 0.025), ('resultant', 0.025), ('incorrectly', 0.024), ('sun', 0.024), ('relative', 0.024), ('select', 0.023), ('detection', 0.023), ('material', 0.022), ('cohen', 0.022), ('sparsity', 0.022), ('shen', 0.021), ('chance', 0.021), ('correctly', 0.021), ('simulations', 0.02), ('antigen', 0.02), ('breheny', 0.02), ('craven', 0.02), ('lbph', 0.02), ('prostatectomy', 0.02), ('purdue', 0.02), ('stamey', 0.02), ('yixin', 0.02), ('validation', 0.02), ('scenario', 0.019), ('cross', 0.019), ('cancer', 0.019), ('tending', 0.019), ('consistent', 0.018), ('clinical', 0.018), ('pr', 0.018), ('errors', 0.017), ('li', 0.017), ('tted', 0.017), ('falsely', 0.017), ('irrepresentable', 0.017), ('popularly', 0.017), ('radical', 0.017), ('splittings', 0.017), ('estimated', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
2 0.18443657 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
3 0.12715061 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
4 0.070477217 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont
Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction
5 0.062352385 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
6 0.05602368 104 jmlr-2013-Sparse Single-Index Model
7 0.052590765 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
8 0.051859852 76 jmlr-2013-Nonparametric Sparsity and Regularization
9 0.047808614 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
10 0.046903234 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
11 0.041223943 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
12 0.041072428 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
13 0.040556889 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
14 0.040017933 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
15 0.03768618 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.033671182 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
17 0.030526146 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory
18 0.028666513 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
19 0.027933847 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
20 0.0273584 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
topicId topicWeight
[(0, -0.148), (1, 0.03), (2, 0.057), (3, 0.065), (4, 0.169), (5, -0.117), (6, -0.02), (7, -0.031), (8, 0.298), (9, 0.001), (10, -0.087), (11, -0.169), (12, -0.141), (13, 0.104), (14, -0.071), (15, -0.167), (16, 0.119), (17, 0.29), (18, 0.082), (19, -0.032), (20, -0.006), (21, 0.08), (22, 0.111), (23, -0.169), (24, 0.118), (25, 0.036), (26, 0.123), (27, -0.029), (28, 0.003), (29, 0.022), (30, -0.105), (31, 0.223), (32, -0.074), (33, 0.123), (34, -0.048), (35, -0.022), (36, 0.044), (37, 0.028), (38, -0.028), (39, -0.024), (40, 0.029), (41, 0.02), (42, 0.024), (43, 0.019), (44, 0.031), (45, 0.034), (46, 0.047), (47, -0.097), (48, 0.0), (49, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.97545069 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
2 0.67340666 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
3 0.61065972 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
4 0.3509056 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
5 0.29935789 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R^p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in R^d , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
6 0.28998232 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.28339201 76 jmlr-2013-Nonparametric Sparsity and Regularization
8 0.27379704 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
9 0.26720589 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
10 0.24748553 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
11 0.21060011 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.19681047 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
13 0.1966317 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
14 0.18283843 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
15 0.16986696 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
17 0.14872967 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
18 0.14460848 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
19 0.13371843 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
20 0.13249594 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
topicId topicWeight
[(5, 0.103), (6, 0.025), (10, 0.074), (20, 0.013), (23, 0.025), (48, 0.409), (68, 0.056), (70, 0.033), (75, 0.07), (85, 0.025), (87, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.71413934 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
2 0.65471691 32 jmlr-2013-Differential Privacy for Functions and Functional Data
Author: Rob Hall, Alessandro Rinaldo, Larry Wasserman
Abstract: Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when releasing summaries of a database. Previous work has focused mainly on methods for which the output is a finite dimensional vector, or an element of some discrete set. We develop methods for releasing functions while preserving differential privacy. Specifically, we show that adding an appropriate Gaussian process to the function of interest yields differential privacy. When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the Gaussian process, then the correct noise level is established by measuring the “sensitivity” of the function in the RKHS norm. As examples we consider kernel density estimation, kernel support vector machines, and functions in RKHSs. Keywords: differential privacy, density estimation, Gaussian processes, reproducing kernel Hilbert space
3 0.34719563 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
Author: Alberto N. Escalante-B., Laurenz Wiskott
Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis
5 0.34270361 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
6 0.34202716 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
8 0.34129104 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
9 0.3405987 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
10 0.33846402 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
11 0.33820084 59 jmlr-2013-Large-scale SVD and Manifold Learning
12 0.33809468 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
13 0.33794436 86 jmlr-2013-Parallel Vector Field Embedding
14 0.33757019 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
15 0.33726251 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.3360028 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
17 0.33578101 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
18 0.33564559 21 jmlr-2013-Classifier Selection using the Predicate Depth
19 0.33507639 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
20 0.33443406 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion