jmlr jmlr2009 jmlr2009-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
Reference: text
sentIndex sentText sentNum sentScore
1 Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al. [sent-9, score-0.143]
2 Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. [sent-13, score-0.101]
3 A new technique is introduced to reduce the false selection rate in the feature screening stage. [sent-15, score-0.37]
4 Keywords: classification, feature screening, generalized linear models, robust regression, feature selection 1. [sent-17, score-0.146]
5 The dimensionality grows very rapidly when interactions of the features are considered, which is necessary for many scientific endeavors. [sent-28, score-0.104]
6 One particularly popular family of methods is based on penalized least-squares or, more generally, penalized pseudo-likelihood. [sent-38, score-0.19]
7 Independence screening recruits those features having the best marginal utility, which corresponds to the largest marginal correlation with the response in the context of least-squares regression. [sent-52, score-0.512]
8 Under certain regularity conditions, Fan & Lv (2008) show surprisingly that this fast feature selection method has a ‘sure screening property’; that is, with probability very close to 1, the independence screening technique retains all of the important features in the model. [sent-53, score-0.777]
9 The sure screening property is described in greater detail in Section 3. [sent-55, score-0.36]
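As a rough illustration of the marginal screening step just described (a minimal sketch in the least-squares setting; the function name, simulated data and the d = ⌊n/log n⌋ choice below are ours, not the authors' implementation):

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank features by absolute marginal correlation with y and keep the top d."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    # Marginal utilities: one correlation per column, no joint fit.
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:d]

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.standard_normal(n)
keep = sis_screen(X, y, d=int(n / np.log(n)))
print(sorted(keep.tolist())[:5])  # the truly important columns 0 and 1 should survive the screen
```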
10 Roughly, ISIS works by iteratively performing feature selection to recruit a small number of features, computing residuals based on the model fitted using these recruited features, and then using the working residuals as the response variable to continue recruiting new features. [sent-58, score-0.337]
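A minimal sketch of that residual-based loop (the Fan & Lv, 2008, least-squares version that the present paper generalizes); the helper, step sizes and stopping rule are illustrative assumptions of ours:

```python
import numpy as np

def marginal_rank(X, y, k):
    # Same marginal-correlation ranking used for SIS in the sketch above.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]

def isis_residual(X, y, d, k_per_step=5, n_iter=4):
    selected, resid = [], y.copy()
    for _ in range(n_iter):
        remaining = np.setdiff1d(np.arange(X.shape[1]), selected)
        # Recruit a few features by screening the remaining ones against the working residuals.
        picked = remaining[marginal_rank(X[:, remaining], resid, k_per_step)]
        selected = sorted(set(selected) | set(picked.tolist()))
        # Refit on everything recruited so far and update the working residuals.
        beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ beta
        if len(selected) >= d:
            break
    return selected
```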
11 Independence screening is a commonly used technique for feature selection. [sent-61, score-0.335]
12 This screening method is indeed a form of independence screening and has been justified by Fan & Fan (2008) under some ideal situations. [sent-69, score-0.623]
13 It is easy to construct features that are marginally unrelated, but jointly related with the response. [sent-72, score-0.159]
14 Such features will be screened out by independent learning methods such as the two-sample t test. [sent-73, score-0.102]
15 How can we construct better feature selection procedures in ultrahigh dimensional feature space than the independence screening popularly used in feature selection? [sent-75, score-0.55]
16 We seek a coefficient vector β = (β1 , . . . , β p )T that is sparse and minimizes an objective function of the form Q(β0 , β) = ∑_{i=1}^n L(Yi , β0 + x_i^T β), where (x_i^T , Yi ) are the covariate vector and response for the ith individual. [sent-83, score-0.101]
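For concreteness, the pseudo-likelihood objective above can be instantiated with any smooth loss L; a minimal sketch using the negative Bernoulli log-likelihood, one of the generalized linear model losses covered by the framework (names and data below are ours):

```python
import numpy as np

def logistic_loss(y, eta):
    # Negative Bernoulli log-likelihood with y in {0, 1}: log(1 + e^eta) - y * eta.
    return np.logaddexp(0.0, eta) - y * eta

def Q(beta0, beta, X, y, loss=logistic_loss):
    # Q(beta0, beta) = sum_i L(Y_i, beta0 + x_i' beta)
    return np.sum(loss(y, beta0 + X @ beta))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = (rng.random(50) < 0.5).astype(float)
print(Q(0.0, np.zeros(4), X, y))  # equals 50 * log 2 when all coefficients are zero
```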
17 Any screening method, by default, has a large false selection rate (FSR), namely, many unimportant features are selected after screening. [sent-95, score-0.447]
18 The first has the desirable property that in highdimensional problems the probability of incorrectly selecting unimportant features is small. [sent-98, score-0.122]
19 The second method is less aggressive, and for the linear model has the same sure screening property as the original SIS technique. [sent-100, score-0.36]
20 It uses only a marginal relation between features and the response variable to screen variables. [sent-142, score-0.178]
21 For classification problems with quadratic loss L, Fan & Lv (2008) show that SIS reduces to feature screening using a two-sample t-statistic. [sent-145, score-0.335]
22 2 for some theoretical results on the sure screening property. [sent-148, score-0.36]
23 2 Penalized Pseudo-likelihood With features crudely selected by SIS, variable selection and parameter estimation can further be carried out simultaneously using a more refined penalized (pseudo)-likelihood method, as we now describe. [sent-150, score-0.204]
24 In the penalized likelihood approach, we seek to minimize ℓ(β0 , β) = n^{−1} ∑_{i=1}^n L(Yi , β0 + x_i^T β) + ∑_{j=1}^d pλ (|β j |). [sent-163, score-0.121]
25 Park & Hastie (2007) describe an iterative algorithm for minimizing the objective function for the L1 penalty, and Zhang (2009) proposes a PLUS algorithm for finding solution paths to the penalized least-squares problem with a general penalty pλ (·). [sent-170, score-0.153]
26 For a class of penalty functions that includes the SCAD penalty and when d is fixed as n diverges, Fan & Li (2001) established an oracle property; that is, the penalized estimates perform asymptotically as well as if an oracle had told us in advance which components of β were non-zero. [sent-181, score-0.255]
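A sketch of the SCAD penalty pλ (·) of Fan & Li (2001) referred to above, with the commonly recommended value a = 3.7 (the implementation is ours); adding its coordinatewise sum to n^{−1} Q(β0 , β) gives the penalized objective ℓ(β0 , β):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    # Piecewise SCAD penalty: linear near zero, quadratic transition, then constant.
    t = np.abs(theta)
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                    (a + 1) * lam ** 2 / 2))

print(scad_penalty(np.array([0.5, 2.0, 10.0]), lam=1.0))
# The penalty is flat beyond a * lam, so large coefficients are not over-shrunk.
```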
27 3 Iterative Feature Selection The SIS methodology may break down if a feature is marginally unrelated, but jointly related with the response, or if a feature is jointly uncorrelated with the response but has higher marginal correlation with the response than some important features. [sent-187, score-0.43]
28 In the former case, the important feature has already been screened out at the first stage, whereas in the latter case, the unimportant feature is ranked too high by the independence screening technique. [sent-188, score-0.456]
29 In the first step, we apply SIS to pick a set A1 of indices of size k1 , and then employ a penalized (pseudo)-likelihood method such as Lasso, SCAD, MCP or the adaptive Lasso to select a subset M1 of these indices. [sent-190, score-0.155]
30 In this screening stage, an alternative approach is to substitute the fitted value β_{M1} from the first stage into (3) and the optimization problem (3) would only be bivariate. [sent-202, score-0.318]
31 The conditional contributions of features are more relevant in recruiting variables at the second stage, but the computation is more expensive. [sent-204, score-0.112]
32 After the prescreening step, we use penalized likelihood to obtain β2 = argmin_{β0 , β_{M1} , β_{A2}} n^{−1} ∑_{i=1}^n L(Yi , β0 + x_{i,M1}^T β_{M1} + x_{i,A2}^T β_{A2} ) + ∑_{j∈M1 ∪A2} pλ (|β j |). [sent-207, score-0.121]
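A sketch of the conditional recruitment step that precedes this fit: for every feature j outside the current set M1, refit jointly with M1 and rank the candidates by the resulting loss. The least-squares loss is used below purely to keep the sketch self-contained; the names M1, k2 and conditional_recruit are ours, not the authors' code:

```python
import numpy as np

def conditional_recruit(X, y, M1, k2):
    # Additional (conditional) contribution of each candidate feature given the features in M1.
    n = X.shape[0]
    base = np.hstack([np.ones((n, 1)), X[:, list(M1)]])
    in_M1 = set(M1)
    candidates = [j for j in range(X.shape[1]) if j not in in_M1]
    losses = []
    for j in candidates:
        design = np.hstack([base, X[:, [j]]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        losses.append(np.sum((y - design @ beta) ** 2))
    order = np.argsort(losses)              # smallest conditional loss first
    return [candidates[i] for i in order[:k2]]
```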
33 The indices of β2 that are non-zero yield a new estimated set M2 of active indices. [sent-209, score-0.105]
34 It allows the procedure to delete variables from the previously selected features with indices in M1 . [sent-211, score-0.134]
35 The reason is that the fitting of the residuals from the (r − 1)th iteration on the remaining features significantly weakens the priority of those unimportant features that are highly correlated with the response through their associations with {X j : j ∈ Mr−1 }. [sent-221, score-0.317]
36 This is due to the fact that the features {X j : j ∈ Mr−1 } have lower correlation with the residuals than with the original responses. [sent-222, score-0.135]
37 Reduction of False Selection Rates Sure independence screening approaches are simple and quick methods to screen out irrelevant features. [sent-240, score-0.333]
38 The first is an aggressive feature selection method that is particularly useful when the dimensionality is very large relative to the sample size; the second is a more conservative procedure. [sent-243, score-0.135]
39 We write A for the set of active indices—that is, the set containing those indices j for which β j ≠ 0 in the true model. [sent-246, score-0.105]
40 Write XA = {X j : j ∈ A } and XA c = {X j : j ∈ A c } for the corresponding sets of active and inactive variables respectively. [sent-247, score-0.105]
41 Apply SIS or ISIS separately to the data in each partition (with d = ⌊n/ log n⌋ or larger, say), yielding two estimates A (1) and A (2) of the set of active indices A . [sent-249, score-0.131]
42 Both of them should have large FSRs, as they are constructed from a crude screening method. [sent-250, score-0.29]
43 Then, the active features should appear in both sets with probability tending to one. [sent-252, score-0.119]
44 However, this estimate contains many fewer indices corresponding to inactive features, as such indices have to appear twice at random in the sets A (1) and A (2) . [sent-255, score-0.18]
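A minimal sketch of this first variant (randomly split, screen each half, intersect); the marginal screen and the sizes are the same illustrative assumptions as in the earlier sketches:

```python
import numpy as np

def sis_screen(X, y, d):
    # Marginal-correlation screening, as in the earlier sketch.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:d]

def split_and_intersect(X, y, d, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    half1, half2 = perm[: len(perm) // 2], perm[len(perm) // 2:]
    A1 = set(sis_screen(X[half1], y[half1], d).tolist())
    A2 = set(sis_screen(X[half2], y[half2], d).tolist())
    # Inactive features rarely survive both screens, so the false selection rate drops sharply.
    return sorted(A1 & A2)
```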
45 Just as in the original formulation of SIS in Section 2, we can now use a penalized (pseudo)likelihood method such as SCAD to perform final feature selection from A and parameter estimation. [sent-257, score-0.175]
46 Condition (A1): the collection {(X_{j1} , . . . , X_{jr} ) : j1 , . . . , jr are distinct elements of A^c } is exchangeable. [sent-266, score-0.161]
47 This condition ensures that each inactive feature is equally likely to be recruited by SIS. [sent-267, score-0.162]
48 Note that (A1) certainly allows inactive features to be correlated with the response, but does imply that each inactive feature has the same marginal distribution. [sent-268, score-0.263]
49 In Theorem 1 below, the case r = 1 is particularly important, as it gives an upper bound on the probability of recruiting any inactive features into the model. [sent-269, score-0.172]
50 Then P(|Â ∩ A^c| ≥ r) ≤ ∑_{(j1,...,jr)∈J} P(j1 ∈ Â, · · · , jr ∈ Â) = ∑_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1))², [sent-281, score-0.322]
51 in which we use the random splitting in the last equality. [sent-287, score-0.161]
52 Obviously, the last probability is bounded by max_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1)) ∑_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1)). (5) [sent-288, score-0.322]
53 Since there are at most d inactive features from A^c in the set Â^(1), the number of r-tuples from J falling in the set Â^(1) can not be more than the total number of such r-tuples in Â^(1); that is, ∑_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1)) ≤ (d choose r). (6) [sent-295, score-0.295]
54 Substituting (6) into (5), we obtain P(|Â ∩ A^c| ≥ r) ≤ (d choose r) max_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1)). [sent-303, score-0.161]
55 Now, under the exchangeability condition (A1), each r-tuple of distinct indices in A c is equally likely to be recruited into A (1) . [sent-307, score-0.157]
56 Hence max_{(j1,...,jr)∈J} P(j1 ∈ Â^(1), · · · , jr ∈ Â^(1)) ≤ (d choose r) / ((p − |A|) choose r), and the first result follows. [sent-311, score-0.322]
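Reading off the r = 1 case of the bound assembled above, the probability that any inactive feature survives both screens is at most d²/(p − |A|). The Monte Carlo sketch below, with purely hypothetical sizes and with each half recruiting inactive indices uniformly at random (as the exchangeability condition permits), is one way to sanity-check that rate:

```python
import numpy as np

rng = np.random.default_rng(1)
p_inactive, d, reps = 20000, 37, 2000   # hypothetical p - |A| and screening size d
hits = 0
for _ in range(reps):
    A1 = rng.choice(p_inactive, size=d, replace=False)  # inactive indices kept by half 1
    A2 = rng.choice(p_inactive, size=d, replace=False)  # inactive indices kept by half 2
    hits += len(np.intersect1d(A1, A2)) >= 1
print(hits / reps, "<=", d ** 2 / p_inactive)  # empirical rate versus the r = 1 bound
```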
57 Intuitively, if p is large, then each inactive feature has small probability of being included in the estimated active set, so it is very unlikely indeed that it will appear in the estimated active sets from both partitions. [sent-318, score-0.195]
58 More generally, one could partition the data into K > 2 groups and estimate the active set by A^(1) ∩ · · · ∩ A^(K), where A^(k) represents the estimated set of active indices from the kth partition. [sent-324, score-0.105]
59 Such a variable selection procedure would be even more aggressive than the K = 2 version; improved bounds in Theorem 1 could be obtained, but the probability of missing true active indices would be increased. [sent-325, score-0.165]
60 In the iterated version of this first variant of SIS, we apply SIS to each partition separately to obtain two sets of indices A1^(1) and A1^(2), each having k1 elements. [sent-327, score-0.121]
61 After forming the intersection A1 = A1^(1) ∩ A1^(2), we carry out penalized likelihood estimation as before to give a first approximation M1 to the true active set of features. [sent-328, score-0.186]
62 Taking the intersection of these sets and re-estimating parameters using penalized likelihood as in Section 2 gives a second approximation M2 to the true active set. [sent-330, score-0.186]
63 2 Second Variant of ISIS Our second variant of SIS is a more conservative feature selection procedure and also relies on randomly partitioning the data into K = 2 groups as before. [sent-333, score-0.11]
64 Again, we apply SIS to each partition separately, but now we recruit as many features into equal-sized sets of active indices A^(1) and A^(2) as are required to ensure that the intersection à = A^(1) ∩ A^(2) has d elements. [sent-334, score-0.199]
65 We then apply a penalized pseudo-likelihood method to the features X_à = {X j : j ∈ Ã} for final feature selection and parameter estimation. [sent-335, score-0.249]
66 Theoretical support for this method can be provided in the case of the linear model; namely, under certain regularity conditions, this variant of SIS possesses the sure screening property. [sent-336, score-0.39]
67 Thus, these technical conditions ensure that any non-zero signal is not too small, and that the features are not too close to being collinear, and the dimensionality is also controlled via log p = o(n^{1−2κ} / log n), which is still of an exponential order. [sent-343, score-0.156]
68 See Fan & Lv (2008) for further discussion of the sure screening property. [sent-344, score-0.36]
69 The sure screening property does not depend on the correlation of the features, as expected. [sent-350, score-0.38]
70 At the first stage we apply SIS, taking enough features in equal-sized sets of active indices A1^(1) and A1^(2) to ensure that the intersection A1 = A1^(1) ∩ A1^(2) has k1 elements. [sent-356, score-0.227]
71 Applying penalized likelihood to the features with indices in A1 gives a first approximation M1 to the true set of active indices. [sent-357, score-0.3]
72 We then carry out a second stage of the ISIS procedure of Section 2 to each partition separately to obtain equal-sized new sets of indices A2^(1) and A2^(2), taking enough features to ensure that A2 = A2^(1) ∩ A2^(2) has k2 elements. [sent-358, score-0.162]
73 We will see later that for both Case 2 and Case 3 the true coefficients are chosen such that the response is marginally uncorrelated with X4 . [sent-384, score-0.156]
74 Regarding the choice of d, the asymptotic theory of Fan & Lv (2008) shows that in the linear model there exists θ* > 0 such that we may obtain the sure screening property with ⌊n^{1−θ*}⌋ < d < n. [sent-387, score-0.36]
75 In particular, the binary response of a logistic regression model and, to a lesser extent, the integer-valued response in a Poisson regression model are less informative than the real-valued response in a linear model. [sent-391, score-0.358]
76 We therefore used d = ⌊n/(4 log n)⌋ in the logistic regression and multicategory classification settings of Sections 4. [sent-392, score-0.148]
77 For the first variant (Var1-SIS), however, we used d = ⌊n/ log n⌋ = 66; note that since this means the selected features are in the intersection of two sets of size d, we typically end up with far fewer than d features selected by this method. [sent-406, score-0.224]
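As a hypothetical arithmetic check of these choices (n = 400 is only an illustrative sample size, not necessarily the one used in the simulations):

```python
import math

n = 400
print(math.floor(n / math.log(n)))        # d = floor(n / log n): 66 features screened
print(math.floor(n / (4 * math.log(n))))  # d = floor(n / (4 log n)): 16 features screened
```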
78 The reason for using the independent tuning data set is that the lack of information in the binary response means that cross-validation is particularly prone to overfitting in logistic regression, and therefore performs poorly for all methods. [sent-410, score-0.12]
79 The fact that X4 is marginally independent of the response is designed to make it difficult for a popular method such as the two-sample t test or other independent learning methods to recognize this important feature. [sent-430, score-0.164]
80 For Case 2, the ideal variables picked up by the two sample test or independence screening technique are X1 , X2 and X3 . [sent-432, score-0.362]
81 In fact one may exaggerate Case 2 to make the Bayes error using the independence screening technique close to 0. [sent-436, score-0.333]
82 For example, the Bayes error using the independence screening technique, which deletes X4 , is 0. [sent-438, score-0.333]
83 The fifth row gives the median final number of features selected. [sent-444, score-0.142]
84 We also report Akaike's information criterion (Akaike, 1974), which adds twice the number of features in the final model, and the Bayesian information criterion (Schwarz, 1978), which adds the product of log n and the number of features in the final model. [sent-446, score-0.174]
85 Finally, an independent test data set of size 100n was used to evaluate the median value of 2Q(β̂0 , β̂) on the test data (Row 9), as well as to report the median 0-1 test error (Row 10), where we observe an error if the test response differs from the fitted response by more than 1/2. [sent-447, score-0.412]
86 The most noticeable observation is that while the LASSO always includes all of the important features, it does so by selecting very large models—a median of 94 variables, as opposed to the correct number, 6, which is the median model size reported by all three SIS-based methods. [sent-449, score-0.136]
87 Table 2 displays the results of repeating the Case 1 simulations for Van-SIS, Var1-SIS and Var2-SIS under the same conditions, but using the LASSO penalty function rather than the SCAD penalty function after the SIS step. [sent-457, score-0.116]
88 On the other hand, the results are less successful than applying SIS and its variants in conjunction with the SCAD penalty for final feature selection and parameter estimation. [sent-460, score-0.138]
89 2 Poisson Regression In our second example, the generic response Y is distributed, conditional on X = x, as Poisson(µ(x)), where log µ(x) = β0 + xT β. [sent-543, score-0.106]
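The corresponding loss L in the pseudo-likelihood framework is the negative Poisson log-likelihood under the log link; a minimal sketch (illustrative only, dropping the log y! term that does not involve β):

```python
import numpy as np

def poisson_loss(y, eta):
    # -log p(y | mu = exp(eta)) up to the additive log(y!) constant.
    return np.exp(eta) - y * eta
```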
90 On the other hand, LASSO missed the difficult variables in cases 2 and 3 and also selected models with a large number of features to attenuate the bias of the variable selection procedure. [sent-781, score-0.109]
91 Using n = 70 and d = n/2, our new ISIS method includes all five important variables for 91 out of the 100 repetitions, while the original ISIS without feature deletion includes all the important features for only 36 out of the 100 repetitions. [sent-793, score-0.175]
92 The median model size of our new variable selection procedure with variable deletion is 21, whereas the median model size corresponding to the original ISIS of Fan & Lv (2008) is 19. [sent-794, score-0.227]
93 For 97 out of 100 repetitions, our new ISIS includes all the important features while ISIS without variable deletion includes all the important features for only 72 repetitions. [sent-796, score-0.204]
94 They analyzed 251 neuroblastoma specimens using a customized oligonucleotide microarray with the goal of developing a gene expression-based classification rule for neuroblastoma patients to reliably predict courses of the disease. [sent-904, score-0.21]
95 For the gender endpoint, the numbers of selected features were 6, 2, 4, 2, 42 and 3 across the methods, with testing errors of 4/126, 4/126, 4/126, 4/126, 5/126 and 4/126 (Table 9: results from analyzing two endpoints of the neuroblastoma data). We can see from Table 9 that our (I)SIS methods compare favorably with the LASSO and NSC. [sent-924, score-0.15]
96 Variable selection via nonconcave penalized likelihood and its oracle properties. [sent-1052, score-0.178]
97 Sure independence screening for ultra-high dimensional feature space (with discussion). [sent-1059, score-0.378]
98 On non-concave penalized likelihood with diverging number of parameters. [sent-1067, score-0.121]
99 “Pre-conditioning” for feature selection and regression in high-dimensional problems. [sent-1165, score-0.119]
100 One-step sparse estimates in nonconcave penalized likelihood models (with discussion). [sent-1220, score-0.121]
wordName wordTfidf (topN-words)
[('sis', 0.569), ('isis', 0.36), ('fan', 0.303), ('screening', 0.29), ('lasso', 0.198), ('lv', 0.173), ('jr', 0.161), ('scad', 0.142), ('amworth', 0.123), ('ltrahigh', 0.114), ('xt', 0.1), ('penalized', 0.095), ('jianqing', 0.095), ('nsc', 0.089), ('imensional', 0.087), ('response', 0.08), ('eature', 0.079), ('neuroblastoma', 0.076), ('features', 0.074), ('sure', 0.07), ('median', 0.068), ('poisson', 0.066), ('indices', 0.06), ('inactive', 0.06), ('election', 0.059), ('penalty', 0.058), ('efs', 0.057), ('recruited', 0.057), ('deletion', 0.056), ('marginally', 0.055), ('corr', 0.051), ('xp', 0.049), ('dispersion', 0.048), ('unimportant', 0.048), ('ultrahigh', 0.047), ('feature', 0.045), ('active', 0.045), ('multicategory', 0.043), ('independence', 0.043), ('zou', 0.043), ('nal', 0.042), ('yi', 0.042), ('residuals', 0.041), ('logistic', 0.04), ('exchangeability', 0.04), ('regression', 0.039), ('aic', 0.038), ('mr', 0.038), ('recruiting', 0.038), ('trevor', 0.037), ('bic', 0.036), ('selection', 0.035), ('tibshirani', 0.033), ('hastie', 0.031), ('iterated', 0.031), ('jointly', 0.03), ('variant', 0.03), ('dimensionality', 0.03), ('test', 0.029), ('microarray', 0.029), ('gender', 0.029), ('gene', 0.029), ('bair', 0.028), ('dantzig', 0.028), ('kr', 0.028), ('samworth', 0.028), ('screened', 0.028), ('yichao', 0.028), ('stage', 0.028), ('guyon', 0.027), ('log', 0.026), ('likelihood', 0.026), ('li', 0.026), ('probe', 0.025), ('aggressive', 0.025), ('genes', 0.024), ('probes', 0.024), ('shrunken', 0.024), ('ews', 0.024), ('marginal', 0.024), ('xa', 0.023), ('zhao', 0.023), ('positives', 0.022), ('oracle', 0.022), ('khan', 0.022), ('huan', 0.022), ('covariate', 0.021), ('uncorrelated', 0.021), ('generalized', 0.021), ('correlation', 0.02), ('bayes', 0.02), ('utilities', 0.02), ('arrays', 0.02), ('efron', 0.02), ('intersection', 0.02), ('almuallim', 0.019), ('antoniadis', 0.019), ('debashis', 0.019), ('mcp', 0.019), ('nbs', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
2 0.11323473 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008), suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction is since that only the means of a few selected variables should be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
3 0.08818724 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
4 0.074994847 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iterations moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables
5 0.058740024 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
6 0.055134222 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.05483168 23 jmlr-2009-Discriminative Learning Under Covariate Shift
9 0.048733387 90 jmlr-2009-Structure Spaces
10 0.037487052 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
11 0.033892684 13 jmlr-2009-Bounded Kernel-Based Online Learning
12 0.031795375 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
13 0.031372346 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
14 0.030802371 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
15 0.030469907 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
16 0.030429771 50 jmlr-2009-Learning When Concepts Abound
17 0.029612914 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning
18 0.028645994 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
19 0.027525047 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
20 0.026883589 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
topicId topicWeight
[(0, 0.168), (1, 0.004), (2, 0.062), (3, 0.028), (4, 0.052), (5, -0.079), (6, -0.04), (7, 0.082), (8, -0.054), (9, -0.029), (10, 0.185), (11, -0.001), (12, 0.003), (13, 0.008), (14, 0.334), (15, -0.093), (16, -0.251), (17, -0.235), (18, 0.05), (19, 0.133), (20, 0.009), (21, 0.047), (22, 0.106), (23, 0.048), (24, 0.157), (25, -0.029), (26, 0.066), (27, -0.111), (28, -0.017), (29, 0.093), (30, 0.212), (31, -0.016), (32, -0.03), (33, 0.116), (34, -0.055), (35, -0.035), (36, 0.092), (37, 0.105), (38, 0.013), (39, -0.074), (40, 0.004), (41, -0.03), (42, -0.198), (43, -0.013), (44, 0.009), (45, 0.061), (46, 0.098), (47, 0.112), (48, -0.017), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.92500722 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
2 0.60929507 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008), suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction is since that only the means of a few selected variables should be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
3 0.46963644 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
4 0.43583062 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iterations moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables
Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola
Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals
6 0.23096359 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.22261652 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
8 0.21828482 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
9 0.19278003 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
10 0.19261192 38 jmlr-2009-Hash Kernels for Structured Data
11 0.18126871 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
12 0.18057077 23 jmlr-2009-Discriminative Learning Under Covariate Shift
13 0.17657506 50 jmlr-2009-Learning When Concepts Abound
14 0.16432677 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
15 0.16064698 90 jmlr-2009-Structure Spaces
16 0.15955491 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning
17 0.1576445 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
18 0.15217099 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
19 0.14770181 29 jmlr-2009-Estimating Labels from Label Proportions
20 0.14564003 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
topicId topicWeight
[(8, 0.027), (11, 0.03), (19, 0.017), (26, 0.05), (38, 0.029), (44, 0.391), (47, 0.018), (52, 0.066), (55, 0.055), (58, 0.029), (65, 0.01), (66, 0.063), (68, 0.043), (90, 0.048), (96, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.68425083 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
2 0.3072474 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
3 0.30701414 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
4 0.30603245 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
Author: Marc Boullé
Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging
5 0.3059094 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
6 0.30375883 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.30091286 38 jmlr-2009-Hash Kernels for Structured Data
8 0.29881957 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
10 0.29562831 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.29544777 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
12 0.29512233 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
13 0.2931619 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
14 0.29301235 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
15 0.29137957 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
16 0.28989369 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
17 0.28984651 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
18 0.28849909 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
20 0.28381851 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent