jmlr jmlr2009 jmlr2009-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using a high dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means; this reduction is achieved because only the means of a few selected variables need to be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
Reference: text
sentIndex sentText sentNum sentScore
1 We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. [sent-12, score-0.234]
2 In sparse setups the empirical Bayes implicitly performs an efficient variable selection. [sent-13, score-0.172]
3 It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. [sent-14, score-0.165]
4 We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. [sent-15, score-0.365]
5 Keywords: non parametric empirical Bayes, high dimension, classification. [sent-16, score-0.213]
6 Suppose the distribution of the explanatory variables, conditional on Y = −1 and on Y = 1, is G1 and G2 respectively, where Gi, i = 1, 2, are multivariate normal. [sent-45, score-0.082]
7 Then apply Fisher’s rule by plugging in the estimated diagonal covariance matrix and the estimated vectors of means. [sent-52, score-0.073]
8 Bickel and Levina (2004) showed that in many cases, by this trivial estimation of the covariance matrix, one does not lose too much in terms of classification error, relative to incorporating the true covariance matrix, and suggested this practice. [sent-53, score-0.12]
9 Note that the bottom line of this practice is to treat the explanatory variables as if they are independent, or to act “assuming” independence of the explanatory variables. [sent-54, score-0.137]
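To make this concrete, here is a minimal sketch (in Python, with our own function and variable names, not the authors') of the plug-in independence rule described above:

```python
import numpy as np

def independence_rule(X1, X2):
    """Plug-in Fisher rule with a diagonal covariance estimate (naive Bayes / IR).

    X1, X2 : arrays of shape (n1, p) and (n2, p) holding the training samples
    of the two classes.  Returns a function mapping a new observation x
    (length p) to a predicted label in {1, 2}.
    """
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled per-coordinate variances; all off-diagonal covariances are ignored
    var = ((n1 - 1) * X1.var(axis=0, ddof=1) +
           (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    w = (mu1 - mu2) / var                  # Fisher weights under independence
    midpoint = (mu1 + mu2) / 2.0
    return lambda x: 1 if np.dot(w, np.asarray(x) - midpoint) > 0 else 2
```

Because only p means per class and p variances are estimated, the rule remains well defined even when p greatly exceeds n1 + n2.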
10 In other words, often, attempting to estimate the 2p coordinates of the two mean vectors by the corresponding averages of n observations on each is already “too much”, and leads to overfitting. [sent-60, score-0.108]
11 The FAIR approach suggested by Fan and Fan (2008), and the conditional MLE approach suggested by Greenshtein et al. [sent-61, score-0.09]
12 The FAIR method estimates the means of the selected variables by the corresponding sample means (the MLE), while the conditional MLE method estimates them by the conditional MLE, conditional on the event that the variables were selected. [sent-65, score-0.156]
13 The above approaches are helpful especially in a high dimensional sparse setup, while the non parametric Empirical Bayes approach that we will present is helpful also in non-sparse setups. [sent-66, score-0.218]
14 A ‘sparse’ setup is one in which relatively few of the explanatory variables are informative for the classification. [sent-68, score-0.110]
15 However, by a sparse vector ν we mean that most of its coordinates are exactly zero. [sent-71, score-0.138]
16 In our simulation, we achieve p ≫ ||ν|| by considering the following three types of configurations for the vector ν: (a) very few non-zero coordinates of a large/moderate magnitude (i.e., sparse vectors); [sent-76, score-0.081]
17 (b) very few coordinates of a large magnitude, mixed with many very small coordinates; [sent-78, score-0.219]
18 (c) many coordinates of a very small magnitude. [sent-81, score-0.081]
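The following small sketch (our own construction, with illustrative choices of p, the number of non-zero coordinates, and their magnitudes) produces a vector ν of each of the three types while keeping p ≫ ||ν||:

```python
import numpy as np

p = 10_000
rng = np.random.default_rng(0)

# (a) sparse: a few large coordinates, the rest exactly zero
nu_a = np.zeros(p)
nu_a[:30] = 1.0

# (b) a few large coordinates mixed with many very small ones
nu_b = rng.normal(0.0, 0.01, size=p)
nu_b[:10] = 1.5

# (c) non-sparse: many coordinates of very small magnitude
nu_c = rng.normal(0.0, 0.02, size=p)

for name, nu in [("(a)", nu_a), ("(b)", nu_b), ("(c)", nu_c)]:
    print(name, "||nu|| =", round(float(np.linalg.norm(nu)), 2), " p =", p)
```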
19 Specifically, it is better in moderately sparse setups, while in extremely sparse cases, it is inferior. [sent-85, score-0.143]
20 See also a recent paper by Greenshtein and Ritov (2009) on the efficiency of compound and empirical Bayes procedures with respect to the class of permutation invariant procedures. [sent-90, score-0.149]
21 A recent comprehensive study and performance comparison of various methods for estimating a vector of normal means under squared error loss was conducted by Brown (2008); the very good performance of non parametric empirical Bayes methods is demonstrated there as well. [sent-91, score-0.404]
22 We will introduce and explain the virtues of our empirical Bayes classification method and provide simulation evidence, as well as real data evidence, of its excellent performance. [sent-93, score-0.081]
23 In the next section we introduce our formal setup and explain the relation between estimating a vector of means under a squared loss and classification. [sent-98, score-0.099]
24 In Section 3 we introduce a class of non-parametric empirical Bayes estimators of a vector of normal means and define our classifier. [sent-99, score-0.191]
25 Preliminaries. Assume a multivariate normal distribution of the vector (X1, . . . , Xp). [sent-102, score-0.096]
26 The CLT arguments are problematic when the X j s have heavy tails. [sent-127, score-0.076]
27 In Table 5 of Section 4 some simulations are carried out to demonstrate the effect of heavy tailed distributions. [sent-128, score-0.204]
28 This implies that the coordinates a_j^opt of the optimal choice satisfy a_j^opt = ν_j / √(∑_j ν_j²), j = 1, . . . , p. [sent-153, score-0.313]
29 For this choice ∑_j a_j^opt ν_j = ||ν||; consequently, for the optimal choice a^opt, the Bayes risk is 1 − Φ(||ν||/(2s)), where Φ is the cumulative distribution function of a standard normal distribution. [sent-181, score-0.328]
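As a small numerical check of these formulas (the values of s and ν below are illustrative choices of ours):

```python
import numpy as np
from scipy.stats import norm

s = 1.0                                                   # assumed per-coordinate standard deviation
nu = np.concatenate([np.full(30, 1.0), np.zeros(9970)])   # an illustrative mean-difference vector

a_opt = nu / np.sqrt(np.sum(nu ** 2))      # optimal weights, proportional to nu
margin = a_opt @ nu                        # equals ||nu||
bayes_risk = norm.cdf(-margin / (2 * s))   # = 1 - Phi(||nu|| / (2 s))
print(round(float(bayes_risk), 4))         # about 0.003 for ||nu|| = sqrt(30)
```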
30 The differences Z_j = X̄_j1 − X̄_j2, j = 1, . . . , p, defined in (5), are independent normal random variables with EZ_j = ν_j and variance, denoted σ², equal to 2s²/n. [sent-198, score-0.129]
31 Note however, that for two procedures with very accurate classification rates, ignoring the variability of V might be significantly misleading even if E(V) is large compared to the standard deviation of V; this is due to the thin tail of the normal distribution. [sent-227, score-0.147]
32 Since the optimal choice of a_j, j = 1, . . . , p, is a_j^opt = ν_j / √(∑_j ν_j²), a natural way to proceed is to estimate ν_j by a ‘good’ estimator ν̂_j, and then plug in, that is, let â_j = ν̂_j / √(∑_j ν̂_j²). [sent-232, score-0.189]
33 In the sequel we will indicate why the squared error loss function is especially appropriate. [sent-234, score-0.078]
34 In general, the fact that ν̂ is a good estimator for ν under a squared error loss does not indicate that T(ν̂) is a good estimator for T(ν) under (say) a squared loss. [sent-236, score-0.268]
35 For example, in the case T(ν) = ∑ ν_j, plugging in the MLE for ν will often be better than plugging in the James-Stein estimator, because the bias of the J-S estimator accumulates. [sent-237, score-0.202]
36 This is so even though the J-S estimator dominates the MLE in estimating ν under a squared error loss. [sent-238, score-0.147]
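A short numerical illustration of this point (with our own choices of p and ν, and the positive-part James-Stein estimator shrinking toward zero):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5000
nu = np.full(p, 0.5)                        # many moderate means (illustrative)
err_mle, err_js = [], []
for _ in range(200):
    z = nu + rng.standard_normal(p)         # z_j ~ N(nu_j, 1)
    shrink = max(0.0, 1.0 - (p - 2) / np.sum(z ** 2))
    nu_js = shrink * z                      # positive-part James-Stein estimate
    err_mle.append((z.sum() - nu.sum()) ** 2)
    err_js.append((nu_js.sum() - nu.sum()) ** 2)
# J-S wins coordinate-wise in mean squared error, but its accumulated bias
# makes the plug-in estimate of sum_j nu_j far worse than the MLE's.
print(np.mean(err_mle), np.mean(err_js))
```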
37 Hence good properties of the empirical Bayes as an estimator for ν under squared loss in high dimensions do not automatically indicate that it should be plugged in in order to obtain good estimators for a_j^opt, and thus provide good classifiers. [sent-239, score-0.335]
38 The last equation motivates the particular choice of a squared error loss when evaluating an estimator ν̂_j. [sent-248, score-0.113]
39 An estimator with particularly appealing properties, in estimation of a vector of means under a squared loss in high dimensions, is the non-parametric empirical Bayes estimator, see Brown and Greenshtein (2009). [sent-251, score-0.165]
40 The more difficult the task of estimating ν by our non parametric empirical Bayes method is in terms of MSE, the less successful our classification method is. [sent-255, score-0.247]
41 Known Homoscedastic Variance. In the sequel we rescale X_j, so that Z_j defined in (5) will have variance σ² = 1, j = 1, . . . , p. [sent-265, score-0.096]
42 The extension of this subsection to the latter case and to unequal sample sizes n1 and n2 is explicitly given in the next subsection. [sent-271, score-0.103]
43 Under the non-parametric empirical Bayes approach for estimating a vector of means, we consider the means ν_i = E(Z_i), i = 1, . . . , p. [sent-272, score-0.086]
44 The estimator that we suggest for δ_G is of the form δ̂_h(z) = z + ĝ*_h′(z) / ĝ*_h(z), where ĝ*_h(z) and ĝ*_h′(z) are appropriate kernel estimators for the density g*(z) and its derivative g*′(z). [sent-302, score-0.116]
45 In Brown and Greenshtein (2009), it is suggested to let the bandwidth h converge slowly to zero as p → ∞; they suggested that h² should approach zero ‘just faster’ than 1/log(p). [sent-308, score-0.114]
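A minimal sketch of this estimator with a Gaussian kernel and a bandwidth shrinking roughly like 1/√log(p) (the constant c and the helper name are our own choices, not a verbatim reproduction of the authors' implementation):

```python
import numpy as np

def eb_means(z, c=1.0):
    """Nonparametric empirical Bayes estimates delta_h(z_j) = z_j + g_h'(z_j)/g_h(z_j).

    z : 1-D array of length p whose entries are (approximately) N(nu_j, 1).
    g_h is a Gaussian kernel density estimate of the marginal of the z_j's.
    """
    z = np.asarray(z, dtype=float)
    p = z.size
    h = c / np.sqrt(np.log(p))              # bandwidth; h^2 vanishes like 1/log(p)
    u = (z[:, None] - z[None, :]) / h       # pairwise standardized differences
    k = np.exp(-0.5 * u ** 2)               # unnormalized Gaussian kernel values
    g = k.sum(axis=1)                       # proportional to g_h(z_j)
    g_prime = (-u / h * k).sum(axis=1)      # proportional to g_h'(z_j)
    return z + g_prime / g                  # normalizing constants cancel in the ratio

# toy check: strong shrinkage where nu_j = 0, little shrinkage for large signals
rng = np.random.default_rng(1)
nu = np.concatenate([np.full(20, 4.0), np.zeros(1980)])
z = nu + rng.standard_normal(nu.size)
print(np.mean((eb_means(z) - nu) ** 2), np.mean((z - nu) ** 2))
```

The O(p²) pairwise computation is only for clarity; for very large p one would evaluate the kernel sums on a grid or with binning.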
46 Let Ŝ_j = √((ŝ_j1)²/n1 + (ŝ_j2)²/n2); thus Ŝ_j is our estimator for the standard deviation of X̄_j1 − X̄_j2. [sent-356, score-0.097]
47 As before, let ν̂_i be the empirical Bayes estimators of E(Z_i), and let â_i = ν̂_i / √(∑_j ν̂_j²), i = 1, . . . , p. [sent-361, score-0.095]
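Continuing the sketch above, the standardization by Ŝ_j and the plug-in weights â_i might be computed as follows (again a sketch under our own naming, not the authors' code):

```python
import numpy as np

def classifier_weights(X1, X2, eb_means):
    """Standardize the coordinate-wise mean differences and form plug-in weights.

    X1, X2   : training samples of the two classes, shapes (n1, p) and (n2, p).
    eb_means : a function mapping the standardized differences Z to estimates
               nu_hat (for example the empirical Bayes estimator sketched earlier).
    Returns (a_hat, S): the classifier weights and the estimated standard
    deviations of X1.mean(0) - X2.mean(0).
    """
    n1, n2 = len(X1), len(X2)
    S = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2)
    Z = (X1.mean(axis=0) - X2.mean(axis=0)) / S    # approximately N(nu_j, 1)
    nu_hat = eb_means(Z)
    a_hat = nu_hat / np.sqrt(np.sum(nu_hat ** 2))  # the normalization does not affect the rule
    return a_hat, S
```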
48 We study sparse configurations where for l variables the corresponding mean is fixed ν j = ∆, while the remaining p − l variables have ν j = 0, p ≫ l. [sent-385, score-0.123]
49 The small variance of the normal distribution is in order to control the magnitude of ||ν||; recall from the introduction, we want p ≫ ||ν||. [sent-388, score-0.129]
50 The distribution of X_j is normal throughout this section, except for the simulations reported in Table 5. [sent-396, score-0.167]
51 In Table 5 the effect of a heavy tailed distribution of the variables X_j is studied. [sent-397, score-0.166]
52 Table 1 shows the misclassification rates of the empirical Bayes, conditional MLE, FAIR, and the plug-in Fisher’s rule, which is also termed IR (Independence Rule). [sent-398, score-0.139]
53 We see that the empirical Bayes approach produces the best results for non-sparse and for moderately sparse configurations. [sent-403, score-0.138]
54 This other version screens variables more aggressively and involves computation of eigenvalues of the empirical covariance matrix. [sent-408, score-0.13]
55 In addition, computing eigenvalues for an empirical covariance matrix with p = 10⁵ is computationally intensive. [sent-410, score-0.097]
56 In order to demonstrate the effect of dependence and to compare the methods for correlated variables, we also consider correlated normal variables where the correlation of X_i and X_j, namely ρ_ij, has the form ρ^|i−j|, known as the AR(1) model. [sent-420, score-0.129]
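For completeness, AR(1)-correlated normal features with corr(X_i, X_j) = ρ^|i−j| can be simulated with the usual recursive construction (a sketch with illustrative sizes, not the paper's exact settings):

```python
import numpy as np

def ar1_normal(n, p, rho, mean, rng):
    """Draw n samples of a p-vector with unit variances and corr(X_i, X_j) = rho**|i-j|."""
    eps = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = eps[:, 0]
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * eps[:, j]
    return X + mean                      # shift by the class mean vector (broadcast)

rng = np.random.default_rng(2)
X1 = ar1_normal(25, 1000, 0.4, mean=0.0, rng=rng)    # class 1, illustrative sizes
```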
57 The empirical Bayes still achieves the lowest error rates for all those non-sparse configurations. [sent-423, score-0.079]
58 In general the effect of correlation (especially heavy positive correlation) on the EB classification method is stronger than on the other methods. [sent-434, score-0.076]
59 In Table 4, we compare the above mentioned procedures in non sparse setups where there are many small signals. [sent-605, score-0.228]
60 However, attempting to estimate the corresponding means by the FAIR, Fisher’s plug-in, or conditional MLE methods yields poor classifiers, while the non parametric empirical Bayes method yields classifiers with excellent performance in some cases. [sent-608, score-0.213]
61 In Table 5, we present simulation studies for X j s with a heavy tailed distribution. [sent-609, score-0.162]
62 (∆1, l1, ∆2, l2) means that l1 coordinates of ν equal ∆1 and l2 coordinates equal ∆2. [sent-743, score-0.081]
63 However, compared to the results in Table 1, the EB method and CMLE have a worse performance, which is caused by some sensitivity to the heavy tailed distribution of the X_j’s. [sent-745, score-0.133]
64 This advantage does not come at the expense of being a good classifier also under moderately sparse configurations. [sent-748, score-0.107]
65 We consider three real data sets and compare the empirical Bayes approach with the nearest shrunken centroid method (henceforth NSC), and FAIR. [sent-752, score-0.169]
66 The first example is of a leukemia data set, which was previously analyzed in Golub et al. [sent-756, score-0.111]
67 (∆1, l1, ∆2, l2) means that l1 coordinates of ν equal ∆1 and l2 coordinates equal ∆2. [sent-882, score-0.081]
68 Table 6 shows the results of the nearest shrunken centroid, FAIR, and empirical Bayes methods. [sent-884, score-0.169]
69 The empirical Bayes approach misclassified 3 out of 34 test samples, which is the same result as NSC, but slightly worse than FAIR. [sent-885, score-0.116]
70 Figure 1 shows histograms of ∑_j â_j U_j corresponding to the two groups, under the training and under the test sets. [sent-886, score-0.21]
71 The second example is of lung cancer data which were previously analyzed by Gordon et al. [sent-887, score-0.22]
72 The training sample set consists of 32 samples (n1 = 16 from MPM and n2 = 16 from ADCA) and the test set has 149 samples (15 from MPM and 134 from ADCA). [sent-893, score-0.102]
73 As displayed in Table 7, the empirical Bayes method classified all the training samples correctly and 148 out of 149 test samples correctly, which is a significant improvement compared to NSC and FAIR. [sent-894, score-0.176]
74 In Figure 2, we show histograms of ∑_j â_j U_j under the two groups, for the training and for the test sets. [sent-895, score-0.21]
75 The last example is of prostate cancer data studied by Singh et al. [sent-896, score-0.266]
76 The training data set has 102 samples, n1 = 52 of which are prostate tumor samples and n2 = 50 of which are normal samples. [sent-902, score-0.416]
77 Table 6: Classification errors of the leukemia data set. Method: nearest shrunken centroids / FAIR / E.B.; Training error: 1/38 / 1/38 / 0/38; Test error: 3/34 / 1/34 / 3/34. [sent-1020, score-0.123]
78 Two panels in the first column are histograms for ALL and AML from the training set and two in the second column are for ALL and AML from the test set. [sent-1023, score-0.308]
79 Red vertical lines in all histograms represent the cut-off value, which is −â_0 = (θ̂_ALL + θ̂_AML)/2 = −15. [sent-1024, score-0.157]
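Putting the pieces together, the decision rule behind these histograms can be sketched as follows; the cut-off is taken halfway between the training-group averages of ∑_j â_j U_j, which is our reading of −â_0 = (θ̂_1 + θ̂_2)/2, not a verbatim reproduction of the authors' definition:

```python
import numpy as np

def fit_cutoff_and_predict(X1, X2, a_hat, S):
    """Given weights a_hat and scales S (e.g., from the earlier sketches), set the
    cut-off halfway between the two training-group means of sum_j a_hat_j * U_j
    and return a prediction function for new observations (rows of X)."""
    score = lambda X: (np.asarray(X) / S) @ a_hat   # sum_j a_hat_j * U_j
    theta1, theta2 = score(X1).mean(), score(X2).mean()
    cutoff = (theta1 + theta2) / 2.0
    # class 1 has the larger average score because a_hat points from class 2 toward class 1
    return lambda X: np.where(score(X) > cutoff, 1, 2)
```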
80 Figure 2: Histograms of ∑_j â_j U_j of ADCA and MPM for the training and test sets of the lung cancer data. [sent-1025, score-0.278]
81 Two panels in the first column are histograms for ADCA and MPM from the training set and two in the second column are for ADCA and MPM from the test set. [sent-1026, score-0.308]
82 Red vertical lines in all histograms represent the cut-off value, which is −â_0 = (θ̂_ADCA + θ̂_MPM)/2 = 27. [sent-1027, score-0.157]
84 Two panels in the first column are histograms for normal and tumor from the training set and two in the second column are for normal and tumor from the test set. [sent-1032, score-0.754]
85 Red vertical lines in all histograms represent the cut-off value, which is −â_0 = (θ̂_normal + θ̂_tumor)/2 = 213. [sent-1033, score-0.157]
86 The independent test data set, from a different experiment, has 25 tumor and 9 normal samples. [sent-1035, score-0.265]
87 As displayed in Table 8, for the prostate cancer data, the empirical Bayes approach has a very large training error compared to NSC and FAIR, but its test error is smaller than that of both NSC and FAIR. [sent-1037, score-0.398]
88 One is the difference in the proportion of tumor and normal samples in the training versus the test set. [sent-1039, score-0.325]
89 In the histograms of ∑_j â_j U_j corresponding to the normal and tumor groups from the training data, we may see that the two training sets look noisier. [sent-1043, score-0.429]
90 Figure 4: Histograms of â_j, j = 1, . . . , p, for the leukemia, lung cancer, and prostate cancer data sets. [sent-1073, score-0.331]
91 Figure 5: Histograms of the first 50 largest |â_j| for the leukemia, lung cancer, and prostate cancer data sets. [sent-1099, score-0.331]
92 Number of Selected Variables. Figure 4 shows the histograms of â_j, j = 1, . . . , p. [sent-1101, score-0.13]
93 Our empirical Bayes method uses many variables for the classification. [sent-1109, score-0.085]
94 Our suggested classifiers are meant only to produce good classification and thus use many variables if necessary. [sent-1113, score-0.084]
95 However, selecting a subset of variables following an empirical Bayes estimation of the means makes much sense for producing simpler classifiers. [sent-1115, score-0.085]
96 Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of means. [sent-1139, score-0.122]
97 Regularization through variable selection and conditional mle with application to classification in high dimensions. [sent-1159, score-0.149]
98 Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. [sent-1206, score-0.353]
99 General maximum likelihood empirical Bayes estimation of normal means. [sent-1212, score-0.148]
100 Diagnosis of multiple cancer types by shrunken centroids of gene expression. [sent-1255, score-0.256]
wordName wordTfidf (topN-words)
[('cmle', 0.416), ('eb', 0.325), ('ir', 0.26), ('greenshtein', 0.249), ('fair', 0.246), ('fan', 0.2), ('bayes', 0.199), ('reenshtein', 0.15), ('mprical', 0.133), ('prostate', 0.133), ('cancer', 0.133), ('histograms', 0.13), ('aml', 0.127), ('tumor', 0.127), ('mle', 0.119), ('adca', 0.116), ('aopt', 0.116), ('mpm', 0.116), ('igh', 0.113), ('nsc', 0.113), ('imensional', 0.101), ('misclassi', 0.099), ('normal', 0.096), ('ju', 0.089), ('leukemia', 0.089), ('park', 0.088), ('fisher', 0.085), ('shrunken', 0.085), ('gurations', 0.083), ('coordinates', 0.081), ('non', 0.081), ('parametric', 0.08), ('heavy', 0.076), ('estimator', 0.073), ('simulations', 0.071), ('compound', 0.07), ('lassification', 0.07), ('brown', 0.067), ('classi', 0.067), ('lung', 0.065), ('setups', 0.063), ('sparse', 0.057), ('tailed', 0.057), ('explanatory', 0.052), ('empirical', 0.052), ('eitan', 0.05), ('zi', 0.046), ('covariance', 0.045), ('estimators', 0.043), ('test', 0.042), ('squared', 0.04), ('centroids', 0.038), ('sequel', 0.038), ('corr', 0.038), ('panels', 0.038), ('sj', 0.038), ('xip', 0.038), ('training', 0.038), ('table', 0.035), ('estimating', 0.034), ('clt', 0.033), ('junyong', 0.033), ('zj', 0.033), ('ers', 0.033), ('variables', 0.033), ('variance', 0.033), ('virtually', 0.032), ('nearest', 0.032), ('er', 0.031), ('bickel', 0.03), ('plug', 0.03), ('correspondingly', 0.03), ('suggested', 0.03), ('conditional', 0.03), ('columns', 0.03), ('bandwidth', 0.029), ('moderately', 0.029), ('simulation', 0.029), ('plugging', 0.028), ('heteroscedastic', 0.028), ('homoscedastic', 0.028), ('cut', 0.027), ('averages', 0.027), ('procedures', 0.027), ('rates', 0.027), ('setup', 0.025), ('nh', 0.025), ('levina', 0.025), ('realizations', 0.025), ('rescale', 0.025), ('golub', 0.024), ('deviation', 0.024), ('con', 0.024), ('malignant', 0.023), ('tamayo', 0.023), ('samples', 0.022), ('microarray', 0.022), ('analyzed', 0.022), ('lehmann', 0.022), ('good', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008), suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction is since that only the means of a few selected variables should be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
2 0.11323473 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
3 0.064806864 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
Author: Jean Hausser, Korbinian Strimmer
Abstract: We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network from gene expression data. A computer program is available that implements the proposed shrinkage estimator. Keywords: entropy, shrinkage estimation, James-Stein estimator, “small n, large p” setting, mutual information, gene association network
4 0.052360337 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kind of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
5 0.051521465 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
Author: Cynthia Rudin
Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval
6 0.045682959 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
7 0.044056319 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.042766668 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
9 0.042019814 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
10 0.038193494 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
11 0.037805241 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
12 0.036402691 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
13 0.034737356 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
14 0.03209798 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
15 0.031473644 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
16 0.029653434 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
17 0.028712573 16 jmlr-2009-Classification with Gaussians and Convex Loss
18 0.025144197 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.025059316 29 jmlr-2009-Estimating Labels from Label Proportions
20 0.023137828 50 jmlr-2009-Learning When Concepts Abound
topicId topicWeight
[(0, 0.143), (1, -0.054), (2, 0.1), (3, 0.011), (4, 0.03), (5, -0.043), (6, -0.034), (7, 0.097), (8, -0.01), (9, 0.057), (10, 0.106), (11, -0.011), (12, -0.048), (13, -0.147), (14, 0.289), (15, -0.125), (16, -0.102), (17, -0.241), (18, 0.137), (19, -0.025), (20, -0.015), (21, -0.086), (22, -0.042), (23, 0.25), (24, 0.031), (25, 0.116), (26, -0.12), (27, -0.268), (28, -0.04), (29, 0.131), (30, 0.179), (31, 0.039), (32, -0.025), (33, 0.022), (34, -0.157), (35, -0.051), (36, -0.062), (37, 0.101), (38, -0.031), (39, -0.087), (40, -0.028), (41, -0.126), (42, -0.116), (43, 0.126), (44, -0.079), (45, -0.097), (46, -0.068), (47, 0.02), (48, -0.036), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.94370562 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008), suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction is since that only the means of a few selected variables should be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
2 0.592942 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
3 0.37469485 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
Author: Jean Hausser, Korbinian Strimmer
Abstract: We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network from gene expression data. A computer program is available that implements the proposed shrinkage estimator. Keywords: entropy, shrinkage estimation, James-Stein estimator, “small n, large p” setting, mutual information, gene association network
4 0.36323389 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kind of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
5 0.30532488 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
Author: Luciana Ferrer, Kemal Sönmez, Elizabeth Shriberg
Abstract: We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average class-conditional covariance between the resulting scores and the scores produced by the existing systems, introducing a trade-off between such covariance and the system’s individual performance. That is, the new system “takes one for the team”, falling somewhat short of its best possible performance in order to increase the diversity of the ensemble. We report results on the NIST 2005 and 2006 speaker recognition evaluations (SREs) for a variety of subsystems. We show a gain of 19% on the equal error rate (EER) of a combination of four systems when applying the proposed method with respect to the performance obtained when the four systems are trained independently of each other. Keywords: system combination, ensemble diversity, multiple classifier systems, support vector machines, speaker recognition, kernel methods
6 0.27963343 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
7 0.25658667 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
8 0.22155143 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
10 0.19703746 23 jmlr-2009-Discriminative Learning Under Covariate Shift
11 0.18831722 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
12 0.18780653 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
13 0.18587257 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
14 0.18042402 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
15 0.17162293 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
16 0.16922006 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
17 0.16734099 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
18 0.15565713 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
19 0.15320635 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
20 0.15120572 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
topicId topicWeight
[(8, 0.03), (11, 0.032), (21, 0.011), (26, 0.017), (38, 0.026), (43, 0.012), (47, 0.02), (52, 0.051), (55, 0.068), (58, 0.024), (66, 0.093), (68, 0.036), (90, 0.051), (94, 0.426), (96, 0.021)]
simIndex simValue paperId paperTitle
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
same-paper 2 0.70899838 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
Author: Eitan Greenshtein, Junyong Park
Abstract: We consider the problem of classification using high dimensional features’ space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008), suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means. This reduction is since that only the means of a few selected variables should be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain non parametric empirical Bayes procedure. In sparse setups the empirical Bayes implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable selection type of classification procedures give up on using. We compare our method with FAIR and other classification methods in simulation for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
3 0.31196702 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.31179947 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
6 0.30808553 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
7 0.30733854 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
8 0.30011213 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
9 0.29986569 48 jmlr-2009-Learning Nondeterministic Classifiers
10 0.29958892 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
11 0.29905683 38 jmlr-2009-Hash Kernels for Structured Data
12 0.29780751 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
13 0.2969932 13 jmlr-2009-Bounded Kernel-Based Online Learning
14 0.29621109 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
15 0.29582572 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
16 0.29517987 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
17 0.29412055 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
18 0.29251716 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
19 0.29193324 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
20 0.29175866 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence