jmlr jmlr2006 jmlr2006-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar
Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
Reference: text
sentIndex sentText sentNum sentScore
1 EDU Computer and Information Science University of Pennsylvania Philadelphia, PA 19104, USA Editor: Isabelle Guyon Abstract In streamwise feature selection, new features are sequentially considered for addition to a predictive model. [sent-12, score-1.063]
2 When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. [sent-13, score-1.485]
3 Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. [sent-14, score-0.53]
4 In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. [sent-15, score-1.659]
5 We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. [sent-16, score-0.998]
6 Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. [sent-19, score-1.36]
7 Figure 1 gives the basic framework of streamwise feature selection. [sent-32, score-0.804]
8 In stepwise regression, there is no order on the features; all features must be known in advance, since all features are evaluated at each iteration and the best feature is added to the model. [sent-36, score-0.751]
9 In contrast, in streamwise regression, since potential features are tested one by one, they can be generated dynamically. [sent-39, score-0.888]
10 By modeling the candidate feature set as a dynamically generated stream, we can handle candidate feature sets of unknown, or even infinite size, since not all potential features need to be generated and tested. [sent-40, score-0.592]
11 Traditional regularization and feature selection settings assume that all features are pre-computed and presented to a learner before any feature selection begins. [sent-48, score-0.597]
12 The algorithms select features and add these features into regression models. [sent-52, score-0.514]
13-14 {initialize}
      model = {}                     // initially no features in model
      i = 1                          // index of features
      while CPU time used < max CPU time do
          xi ← get next feature()    {Is xi a “good” feature?}
          if fit of(xi, model) > threshold then
              model ← model ∪ xi     // add xi to the model
              decrease threshold
          else
              increase threshold
          end if
          i ← i + 1
      end while
      Figure 1: Algorithm: general framework of streamwise feature selection. [sent-57, score-0.472] [sent-58, score-0.993]
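A minimal Python sketch of the loop in Figure 1. The names feature_stream and fit_gain and the additive threshold step are illustrative placeholders, not part of the original pseudocode, which leaves the fit measure and the threshold update abstract; the information-investing and α-investing algorithms described later replace the fixed step with a wealth-based update.

# Sketch of the general streamwise framework in Figure 1 (not the paper's
# exact implementation). `feature_stream` yields candidate features one at a
# time; `fit_gain(x, model)` is a placeholder for any goodness-of-fit measure.
def streamwise_select(feature_stream, fit_gain, threshold, step=0.5, budget=100000):
    model = []                        # initially no features in the model
    for i, x in enumerate(feature_stream):
        if i >= budget:               # stand-in for the CPU-time limit
            break
        if fit_gain(x, model) > threshold:
            model.append(x)           # add x_i to the model
            threshold -= step         # decrease threshold
        else:
            threshold += step         # increase threshold
    return model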
15 As feature set size becomes larger, streamwise regression offers significant computational savings and higher prediction accuracy. [sent-65, score-0.915]
16 If features which are cheaper to collect are placed early in the feature stream, they will be preferentially selected over redundant expensive features later in the stream. [sent-73, score-0.563]
17 Alternatively, features can be sorted so as to place potentially higher signal content features earlier in the feature stream, making it easier to discover the useful features. [sent-75, score-0.544]
18 A combination of domain knowledge and use of the different sizes of the feature sets can be used to provide a partial order on the features, and thus to take full advantage of streamwise feature selection. [sent-80, score-0.942]
19 Many commercial statistical packages offer variants of a greedy method, stepwise feature selection, an iterative procedure in which all candidate features are tested at each iteration and the single best feature is selected and added to the model. [sent-86, score-0.705]
20 Stepwise selection is terminated when either all candidate features have been added, or none of the remaining features lead to increased expected benefit according to some measure, such as a p-value threshold. [sent-88, score-0.491]
21 Variants of stepwise selection abound, including forward (adding features deemed helpful), backward (removing features no longer deemed helpful), and mixed methods (alternating between forward and backward). [sent-90, score-0.633]
22 We speak of the set of beneficial features in a stream as those which would have improved the prediction accuracy of the model at the time they were considered for addition if all prior beneficial features had been added. [sent-109, score-0.559]
23 Using the fact that the log-likelihood of the data given a model gives the number of bits needed to code the model residual error leads to the criterion for feature selection: accept a new feature xi only if the change in log-likelihood from adding the feature is greater than the penalty F. [sent-120, score-0.692]
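Written out under the usual convention that the penalty is compared against twice the log-likelihood improvement, the acceptance rule takes the form

    2\,\big[\ell(\mathrm{model}\cup\{x_i\}) - \ell(\mathrm{model})\big] \;>\; F,
    \qquad F_{\mathrm{AIC}} = 2,\quad F_{\mathrm{BIC}} = \log n,\quad F_{\mathrm{RIC}} = 2\log m,

where ℓ denotes the log-likelihood, n the number of observations, and m the number of candidate features; the per-feature penalties shown for AIC, BIC, and RIC are the standard ones and are listed only for orientation.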
24 They thus are expected to perform poorly as m grows larger than n, a situation common in streamwise regression settings. [sent-131, score-0.756]
25 RIC can be problematic for streamwise feature selection since RIC requires that we know m in advance, which is often not the case (see Section 3). [sent-139, score-0.863]
26 Bonferroni methods are known to be overly stringent (Benjamini and Hochberg, 1995), a problem exacerbated in streamwise feature selection applications when m should technically be chosen to be the largest number of features that might be examined. [sent-144, score-1.083]
27 Streamwise feature selection is closer in spirit to an alternate class of feature selection methods that control the false discovery rate (FDR), the fraction of the features added to the model that reduce predictive accuracy (Benjamini and Hochberg, 1995). [sent-146, score-0.743]
28 Interleaving Feature Generation and Testing In streamwise feature selection, candidate features are sequentially presented to the modeling code for potential inclusion in the model. [sent-150, score-1.098]
29 We refer to the interaction terms as generated features; they are examples of a more general class of features formed from transformations of the original features (square root, log, etc.). [sent-161, score-0.475]
30 Statistical relational learning (SRL) methods can easily generate millions of potentially predictive features as they “crawl” through a database or other relational structure and generate features by building increasingly complex compound relations or SQL queries (Popescul and Ungar, 2004). [sent-167, score-0.526]
31 Both stepwise regression and standard shrinkage methods require knowing all features in advance, and are poorly suited for the feature sets generated by SRL. [sent-169, score-0.599]
32 Since stepwise regression tests all features for inclusion at each iteration, it is computationally infeasible on large data sets. [sent-170, score-0.477]
33 Some other strategy such as streamwise feature selection is required. [sent-172, score-0.863]
34 One cannot use the coefficients of the features that were not added to the model, since streamwise regression does not include the cost of coding these coefficients, and so this would lead to overfitting. [sent-174, score-1.105]
35 The main feature stream used in streamwise regression is dynamically constructed by taking the next feature from the sub-stream that has so far had the highest success rate in having its features accepted. [sent-182, score-1.475]
36 To ensure that all streams are eventually tried, one can use a score for each stream defined as (number of features selected + a)/(number of features tried + b). [sent-184, score-0.541]
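A small Python sketch of that scheduling rule. The sub-stream interface and the prior counts a and b are illustrative; the text does not fix their values.

# Pick the next candidate from the sub-stream with the best smoothed
# acceptance rate (selected + a) / (tried + b), so that every sub-stream is
# eventually sampled. `streams` maps a stream id to its remaining features.
def next_from_streams(streams, selected, tried, a=1.0, b=2.0):
    live = [k for k in streams if streams[k]]      # sub-streams with features left
    if not live:
        return None, None
    best = max(live, key=lambda k: (selected[k] + a) / (tried[k] + b))
    tried[best] += 1
    return best, streams[best].pop(0)              # caller bumps selected[best] on acceptance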
37 We first present streamwise regression in an information-investing setting. [sent-189, score-0.756]
38 Therefore, bits saved is the decrease in the bits required to code the model error minus the increase in the bits required to code the model. [sent-196, score-0.47]
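In symbols, and leaving the coding scheme for the coefficient abstract, this is

    \mathrm{bits\_saved}(x_i) \;=\;
    \big[\text{bits to code the residual error without } x_i\big]
    - \big[\text{bits to code the residual error with } x_i\big]
    - \big[\text{bits to code the coefficient of } x_i\big].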
39 Streamwise regression with information-investing consists of three components: • Wealth Updating: a method for adaptively adjusting the bits available to code the features which will not be added to the model. [sent-205, score-0.492]
40-41 {initialize}
      model = {}                     // initially no features in model
      i = 1                          // index of features
      w1 = W0                        // initial bits available for coding
      while CPU time used < max CPU time do
          xi ← get next feature()
          εi ← wi / 2i               // select bid amount {see Section 4.3 for the calculation of bits saved}
          if bits saved(xi, εi, model) > WΔ then
              model ← model ∪ xi     // add xi to the model
              wi+1 ← wi + WΔ         // increase wealth
          else
              wi+1 ← wi − εi         // reduce wealth
          end if
          i ← i + 1
      end while
      Figure 2: Algorithm: streamwise regression using information-investing. [sent-212, score-0.786] [sent-213, score-1.531]
42 2 Bid Selection The selection of εi as wi /2i gives the slowest possible decrease in wealth such that all wealth is used; that is, so that as many features as possible are included in the model without systematically overfitting. [sent-224, score-0.702]
43 Therefore, the probability of adding a feature is 1 − e^(−ε) = 1 − (1 − ε + O(ε²)) ≈ ε, and the cost in bits of coding the fact that the feature is added is roughly −log(ε) bits. [sent-237, score-0.579]
44 It can be generalized to add WΔ bits to the wealth each time a feature is added to the model. [sent-268, score-0.486]
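Pulling the pieces of Figure 2 together, a compact Python sketch follows. Here bits_saved stands in for the model-specific routine referenced in Section 4.3, and the default values for W0 and WΔ are illustrative only.

# Sketch of streamwise regression with information-investing (Figure 2).
# `bits_saved(x, eps, model)` must account for the roughly -log2(eps) bits
# needed to code the fact that a feature was added, as discussed above.
def information_investing(feature_stream, bits_saved, w0=0.5, w_delta=0.5):
    model, wealth = [], w0
    for i, x in enumerate(feature_stream, start=1):
        eps = wealth / (2 * i)            # bid: slowest spend that uses all wealth
        if bits_saved(x, eps, model) > w_delta:
            model.append(x)
            wealth += w_delta             # adding a feature earns back W_delta bits
        else:
            wealth -= eps                 # a rejected feature costs its bid
    return model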
45 Streamwise Regression using Alpha-investing One can define an alternate form of streamwise regression, α-investing (Zhou et al. [sent-271, score-0.666]
46 Of the three components of streamwise regression using information-investing, in α-investing, wealth updating is similar, bid selection is identical, and feature coding is not required. [sent-274, score-1.304]
47 The two different streamwise regression algorithms are asymptotically identical (the wealth update of αΔ − αi approaches the update of WΔ as αi becomes small), but differ slightly when the initial features in the stream are considered. [sent-275, score-1.233]
48-49 {initialize}
      model = {}                     // initially no features in model
      i = 1                          // index of features
      ...                            // initial prob.
      ...
      if get p-value(xi, model) < αi then
          model ← model ∪ xi         // add xi to the model
          wi+1 ← wi + αΔ − αi        // increase wealth
      else
          wi+1 ← wi − αi             // reduce wealth
      end if
      i ← i + 1
      end while
      Figure 4: Algorithm: streamwise regression with α-investing. [sent-280, score-0.472] [sent-282, score-1.299]
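The same loop in its α-investing form, as a Python sketch. Here get_p_value stands in for whatever incremental regression code returns the p-value of the candidate coefficient; the bid αi = wi/(2i) mirrors the information-investing bid, since bid selection is identical in the two methods, and the defaults for W0 and αΔ are illustrative only.

# Sketch of streamwise regression with alpha-investing (Figure 4).
# `get_p_value(x, model)` is a placeholder for the routine that fits x given
# the current model and returns the p-value of its coefficient.
def alpha_investing(feature_stream, get_p_value, w0=0.5, alpha_delta=0.5):
    model, wealth = [], w0
    for i, x in enumerate(feature_stream, start=1):
        alpha_i = wealth / (2 * i)             # current bid / p-value threshold
        if get_p_value(x, model) < alpha_i:
            model.append(x)
            wealth += alpha_delta - alpha_i    # invest: net change when a feature enters
        else:
            wealth -= alpha_i                  # pay alpha_i for a rejected feature
    return model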
50 The idea of α-investing is to adaptively control the threshold for adding features, so that when new (probably predictive) features are added to the model, one “invests” α, increasing the wealth, raising the threshold, and allowing a slightly higher future chance of incorrect inclusion of features. [sent-294, score-0.527]
51 Experimental Evaluation We compared streamwise feature selection using α-investing against both streamwise and stepwise feature selection (see Section 2) using the AIC, BIC and RIC penalties on a battery of synthetic and real data sets. [sent-301, score-1.913]
52 The artificially simple structure of the data (the features are uncorrelated and have relatively strong signal) allows us to easily see which feature selection methods are adding spurious features or failing to find features that should be in the model. [sent-311, score-0.918]
53 As one would also expect, if all of the beneficial features in the model occur at the beginning of the stream, α-investing does better, giving the same error as RIC, while if all of the beneficial features in the model are last, α-investing does (two times) worse than RIC. [sent-316, score-0.472]
54 Stepwise regression gave noticeably better results than streamwise regression for this problem when the penalty is AIC or BIC. [sent-318, score-0.874]
55 Stepwise regression with RIC gave the same error as its streamwise counterpart. [sent-320, score-0.756]
56 However, using standard code from R, the stepwise regression was much slower than streamwise regression. [sent-321, score-0.951]
57 One might hope that adding more spurious features to the end of a feature stream would not severely harm an algorithm’s performance. [sent-323, score-0.552]
58 6 However, AIC and BIC, since their penalty is not a function of m, will add even more spurious features (if they haven’t already added a feature for every observation! [sent-324, score-0.497]
60 Table 6 shows that when the number of potential features goes up to 1,000,000, RIC puts in one fewer beneficial feature, while streamwise regression puts in the same four beneficial features plus, on average, half of a spurious feature. [sent-389, score-1.252]
61 Thus, streamwise regression is able to find the extra beneficial feature even when that feature lies far out among the 1,000,000 candidate features. [sent-390, score-1.032]
62 Since the gene expression data sets have large feature sets, we shuffled their original features five times (in addition to the cross validations), applied streamwise regression on each feature [sent-401, score-0.487]
63                        cleve   internet   ionosphere   spect   wdbc   wpbc
   features, m               13       1558           34      22     30     33
   nominal features           7       1555            0      22      0      0
   continuous features        6          3           34       0     30      0
   observations, n          296       2359          351     267    569    194
   baseline accuracy        54%        84%          64%     79%    63%    76%
   Table 7: Description of the UCI data sets. [sent-434, score-0.487]
64 The second interleaved feature selection and generation, initially testing PCA components and the original features, and then generating interaction terms between any of the features which had been selected and any of the original features. [sent-440, score-0.485]
65 On UCI data sets (Figure 5)7 , when only the original feature set is used, paired two-sample t-tests show that α-investing has better performance than streamwise AIC and BIC only on two of the six UCI data sets: the internet and wpbc data sets. [sent-446, score-0.855]
66 On the other data sets, which have relatively few features, the less stringent penalties do as well as or better than streamwise regression. [sent-447, score-0.702]
67 When interaction terms and PCA components are included, α-investing gives better performance than streamwise AIC on five data sets, than streamwise BIC on three data sets, and than streamwise RIC on two data sets. [sent-448, score-2.064]
68 On the UCI data sets (Figure 5), we also compared streamwise regression with α-investing8 with stepwise regression. [sent-451, score-0.924]
69 We did not tune any parameters in streamwise regression to particular problems either. [sent-460, score-0.756]
70 On the other data sets, streamwise regression doesn’t have 7. [sent-462, score-0.756]
71 We were unable to compute the stepwise regression results on the internet data set using the software at hand when interaction terms and PCA components were included, giving millions of potential features with thousands of observations. [sent-466, score-0.588]
72 These tests show that the performance of streamwise regression is at least comparable to that of SVM, NNET, and TREE. [sent-481, score-0.772]
73 But when interaction terms and PCA components are included, RIC is often too conservative to select even one feature, whereas α-investing has stable performance, and the t-tests show that α-investing has significantly better prediction accuracies than streamwise RIC on 5 out of 7 data sets. [sent-483, score-0.801]
74 Note that, regardless of whether or not interaction terms and PCA components are included, α-investing always has much higher accuracy than streamwise AIC and BIC. [sent-484, score-0.732]
75 Streamwise RIC has SEs similar to those of α-investing, but streamwise AIC and BIC usually have SEs one percent higher than those of α-investing and streamwise RIC. [sent-488, score-1.332]
76 Also note that, for streamwise AIC, BIC, and RIC, adding interaction terms and PCA components often hurts. [sent-493, score-0.773]
77 10 Streamwise and stepwise feature selection are one step less greedy, sequentially adding features by computing I(y; xi |Model). [sent-514, score-0.628]
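For reference, the quantity being computed at each step is the standard conditional mutual information,

    I(y;\, x_i \mid \mathrm{Model}) \;=\; H(y \mid \mathrm{Model}) - H(y \mid \mathrm{Model},\, x_i),

that is, the expected reduction in the bits needed to code y once x_i joins the model, which is what ties this view to the bits-saved criterion above.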
78 This richer feature set of transformed variables could, of course, be used within the streamwise feature selection setting, or streamwise regression could be used to find an initial set of features to be provided to the Bayesian model. [sent-518, score-1.96]
79 When SVM is used on the features selected by streamwise regression, the errors on arcene, gisette, and madelon are reduced further to 0. [sent-524, score-0.935]
80 There is only one data set, madelon, where streamwise regression gives substantially higher error than the other methods. [sent-529, score-0.756]
81 This may be partly due to the fact that madelon has substantially more observations than features, thus making streamwise regression (when not augmented with sophisticated feature generators) less competitive with more complex models. [sent-530, score-0.965]
82 Discussion: Statistical Feature Selection Recent developments in statistical feature selection take into account the size of the feature space, but only allow for finite, fixed feature spaces, and do not support streamwise feature selection. [sent-533, score-1.277]
83 The black-dot and solid black bars are the average accuracies using streamwise regressions. [sent-553, score-0.695]
84 Table 10: Comparison of streamwise regression and other methods on UCI Data. [sent-612, score-0.756]
85 The number in parentheses is the average number of features used by the streamwise regression, and these features include PCA components, raw features, and interaction terms (see Footnote 7 for the details of this kind of feature stream). [sent-615, score-1.295]
86 [Table header fragment] features | greatest-hits-one error | greatest-hits-one features | BayesNN-DFT error | BayesNN-DFT features; first row begins with arcene 0. [sent-621, score-0.636]
87 The black-dot bars are the average accuracies using streamwise regressions on raw features. [sent-641, score-0.73]
88 The solid black bars are the average accuracy using streamwise regressions with PCA components, raw features, and interaction terms (see Footnote 7 for the details of this kind of feature stream). [sent-642, score-0.889]
89 The Benjamini-Hochberg method for controlling the FDR suggests the α-investing rule used in streamwise regression, which keeps the FDR below α: order the p-values of the independent tests of H1, H2, . . . [sent-658, score-0.682]
90 The α-investing rule of streamwise regression controls a similar characteristic. [sent-671, score-0.756]
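For comparison, a short Python sketch of the Benjamini-Hochberg step-up rule alluded to above. It is the textbook form and presumes that all m p-values are available at once, which is exactly the assumption that streamwise α-investing drops.

# Standard Benjamini-Hochberg step-up procedure: reject the hypotheses with
# the k smallest p-values, where k is the largest rank with
# p_(k) <= alpha * k / m. Returns the indices of the rejected hypotheses.
def benjamini_hochberg(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])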
91 As a streamwise feature selector, if one has added, say, 20 features to the model, then with high probability (tending to 1 as the number of accepted features grows) no more than 5% (that is, one feature in the case of 20 features) are false positives. [sent-674, score-1.385]
92 Algorithms such as the streamwise regression presented in this paper, which are online in the features being used, are much less common. [sent-677, score-0.959]
93 In such cases, regularization or smoothing methods work well and streamwise feature selection does not make sense. [sent-679, score-0.863]
94 For the problems in citation prediction and bankruptcy prediction that we have looked at, generating potential features (for example, by querying a database or by computing transformations or combinations of the raw features) takes far more time than the streamwise feature selection. [sent-683, score-1.122]
95 Thus, the flexibility that streamwise regression offers to dynamically decide which features to generate and add to the feature stream provides potentially large savings in computation. [sent-684, score-1.256]
96 Empirical tests show that for the smaller UCI data sets where stepwise regression can be done, streamwise regression gives comparable results to stepwise regression or techniques such as decision trees, neural networks, or SVMs. [sent-685, score-1.288]
97 Fortunately, given any software which incrementally considers features for addition and calculates their p-value or entropy reduction, streamwise regression using information-investing or α-investing is extremely easy to implement. [sent-688, score-0.959]
98 For linear and logistic regression, we have found that streamwise regression can easily handle millions of features. [sent-689, score-0.779]
99 The results presented here show that streamwise feature selection is highly competitive even when there is no prior knowledge about the structure of the feature space. [sent-690, score-1.001]
100 Our expectation is that in real problems where we do know more about the different kinds of features that can be generated, streamwise regression will provide even greater benefit. [sent-691, score-0.959]
wordName wordTfidf (topN-words)
[('streamwise', 0.666), ('ric', 0.406), ('features', 0.203), ('aic', 0.199), ('wealth', 0.175), ('stepwise', 0.168), ('bic', 0.16), ('feature', 0.138), ('bits', 0.116), ('coding', 0.107), ('stream', 0.099), ('foster', 0.096), ('fdr', 0.09), ('regression', 0.09), ('alpha', 0.088), ('ngar', 0.08), ('oster', 0.08), ('tine', 0.08), ('treamwise', 0.08), ('stine', 0.073), ('spurious', 0.071), ('eature', 0.068), ('nnet', 0.06), ('selection', 0.059), ('bid', 0.053), ('hou', 0.052), ('interaction', 0.05), ('election', 0.049), ('bene', 0.047), ('madelon', 0.047), ('ungar', 0.047), ('dynamically', 0.042), ('adding', 0.041), ('pca', 0.039), ('added', 0.039), ('wi', 0.038), ('false', 0.037), ('predictive', 0.037), ('gene', 0.036), ('saved', 0.035), ('raw', 0.035), ('bonferroni', 0.035), ('popescul', 0.033), ('model', 0.033), ('interactions', 0.031), ('mdl', 0.031), ('relational', 0.03), ('accuracies', 0.029), ('benjamini', 0.028), ('penalty', 0.028), ('code', 0.027), ('arcene', 0.027), ('hung', 0.027), ('ses', 0.027), ('wpbc', 0.027), ('candidate', 0.026), ('uci', 0.025), ('coef', 0.024), ('george', 0.024), ('paired', 0.024), ('zhou', 0.024), ('log', 0.024), ('observations', 0.024), ('threshold', 0.024), ('receiver', 0.023), ('millions', 0.023), ('nips', 0.023), ('sql', 0.023), ('dzeroski', 0.023), ('shuf', 0.023), ('cpu', 0.022), ('ha', 0.022), ('prediction', 0.021), ('upenn', 0.02), ('ation', 0.02), ('batches', 0.02), ('dexter', 0.02), ('gisette', 0.02), ('pcancer', 0.02), ('wdbc', 0.02), ('included', 0.019), ('potential', 0.019), ('selected', 0.019), ('penalties', 0.019), ('transformations', 0.019), ('sequentially', 0.019), ('add', 0.018), ('guyon', 0.018), ('adjusting', 0.017), ('streams', 0.017), ('stringent', 0.017), ('cleve', 0.017), ('diffusion', 0.017), ('invested', 0.017), ('spect', 0.017), ('chance', 0.017), ('generation', 0.017), ('advance', 0.016), ('components', 0.016), ('tests', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 88 jmlr-2006-Streamwise Feature Selection
Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar
Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
2 0.067318663 12 jmlr-2006-Active Learning with Feedback on Features and Instances
Author: Hema Raghavan, Omid Madani, Rosie Jones
Abstract: We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorization tasks indicate that there is significant potential in improving classifier performance by feature re-weighting, beyond that achieved via membership queries alone (traditional active learning) if we have access to an oracle that can point to the important (most predictive) features. Our experiments on human subjects indicate that human feedback on feature relevance can identify a sufficient proportion of the most relevant features (over 50% in our experiments). We find that on average, labeling a feature takes much less time than labeling a document. We devise an algorithm that interleaves labeling features and documents which significantly accelerates standard active learning in our simulation experiments. Feature feedback can complement traditional active learning in applications such as news filtering, e-mail classification, and personalization, where the human teacher can have significant knowledge on the relevance of features. Keywords: active learning, feature selection, relevance feedback, term feedback, text classification
3 0.041779716 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
Author: Don Hush, Patrick Kelly, Clint Scovel, Ingo Steinwart
Abstract: We describe polynomial–time algorithms that produce approximate solutions with guaranteed accuracy for a class of QP problems that are used in the design of support vector machine classifiers. These algorithms employ a two–stage process where the first stage produces an approximate solution to a dual QP problem and the second stage maps this approximate dual solution to an approximate primal solution. For the second stage we describe an O(n log n) algorithm that maps an √ √ approximate dual solution with accuracy (2 2Kn + 8 λ)−2 λε2 to an approximate primal solution p with accuracy ε p where n is the number of data samples, Kn is the maximum kernel value over the data and λ > 0 is the SVM regularization parameter. For the first stage we present new results for decomposition algorithms and describe new decomposition algorithms with guaranteed accuracy and run time. In particular, for τ–rate certifying decomposition algorithms we establish the optimality of τ = 1/(n − 1). In addition we extend the recent τ = 1/(n − 1) algorithm of Simon (2004) to form two new composite algorithms that also achieve the τ = 1/(n − 1) iteration bound of List and Simon (2005), but yield faster run times in practice. We also exploit the τ–rate certifying property of these algorithms to produce new stopping rules that are computationally efficient and that guarantee a specified accuracy for the approximate dual solution. Furthermore, for the dual QP problem corresponding to the standard classification problem we describe operational conditions for which the Simon and composite algorithms possess an upper bound of O(n) on the number of iterations. For this same problem we also describe general conditions for which a matching lower bound exists for any decomposition algorithm that uses working sets of size 2. For the Simon and composite algorithms we also establish an O(n2 ) bound on the overall run time for the first stage. Combining the first and second stages gives an overall run time of O(n2 (c
4 0.041754279 83 jmlr-2006-Sparse Boosting
Author: Peter Bühlmann, Bin Yu
Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression
5 0.037991464 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
Author: Janez Demšar
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests
6 0.035670504 45 jmlr-2006-Learning Coordinate Covariances via Gradients
7 0.035496816 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
8 0.033192649 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
9 0.030778987 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
10 0.026965 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
11 0.026355071 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection
12 0.022932062 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
14 0.022337947 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
16 0.021791289 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
17 0.021113379 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity (Special Topic on Machine Learning and Optimization)
18 0.020979173 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
19 0.020974575 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)
20 0.020848922 47 jmlr-2006-Learning Image Components for Object Recognition
topicId topicWeight
[(0, 0.117), (1, -0.056), (2, -0.02), (3, 0.013), (4, -0.039), (5, -0.03), (6, -0.038), (7, -0.119), (8, 0.067), (9, -0.026), (10, -0.005), (11, 0.042), (12, 0.053), (13, 0.06), (14, 0.0), (15, 0.018), (16, -0.116), (17, 0.092), (18, 0.211), (19, -0.005), (20, 0.041), (21, -0.029), (22, 0.04), (23, -0.078), (24, 0.016), (25, -0.138), (26, 0.253), (27, -0.045), (28, -0.205), (29, -0.193), (30, 0.069), (31, 0.207), (32, 0.197), (33, 0.106), (34, -0.178), (35, -0.097), (36, 0.344), (37, 0.101), (38, 0.017), (39, -0.031), (40, 0.273), (41, -0.116), (42, 0.023), (43, -0.152), (44, 0.064), (45, -0.04), (46, -0.058), (47, 0.149), (48, 0.017), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.96668059 88 jmlr-2006-Streamwise Feature Selection
Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar
Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
2 0.57717234 12 jmlr-2006-Active Learning with Feedback on Features and Instances
Author: Hema Raghavan, Omid Madani, Rosie Jones
Abstract: We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorization tasks indicate that there is significant potential in improving classifier performance by feature re-weighting, beyond that achieved via membership queries alone (traditional active learning) if we have access to an oracle that can point to the important (most predictive) features. Our experiments on human subjects indicate that human feedback on feature relevance can identify a sufficient proportion of the most relevant features (over 50% in our experiments). We find that on average, labeling a feature takes much less time than labeling a document. We devise an algorithm that interleaves labeling features and documents which significantly accelerates standard active learning in our simulation experiments. Feature feedback can complement traditional active learning in applications such as news filtering, e-mail classification, and personalization, where the human teacher can have significant knowledge on the relevance of features. Keywords: active learning, feature selection, relevance feedback, term feedback, text classification
3 0.2510888 83 jmlr-2006-Sparse Boosting
Author: Peter Bühlmann, Bin Yu
Abstract: We propose Sparse Boosting (the SparseL2 Boost algorithm), a variant on boosting with the squared error loss. SparseL2 Boost yields sparser solutions than the previously proposed L2 Boosting by minimizing some penalized L2 -loss functions, the FPE model selection criteria, through smallstep gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with SparseL2 Boost. We prove an equivalence of SparseL2 Boost to Breiman’s nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of SparseL2 Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in SparseL2 Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2 Boosting. Consequently, we can select between SparseL2 Boost and L2 Boosting by comparing their gMDL scores. Keywords: lasso, minimum description length (MDL), model selection, nonnegative garrote, regression
4 0.23886181 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
Author: Don Hush, Patrick Kelly, Clint Scovel, Ingo Steinwart
Abstract: We describe polynomial–time algorithms that produce approximate solutions with guaranteed accuracy for a class of QP problems that are used in the design of support vector machine classifiers. These algorithms employ a two–stage process where the first stage produces an approximate solution to a dual QP problem and the second stage maps this approximate dual solution to an approximate primal solution. For the second stage we describe an O(n log n) algorithm that maps an √ √ approximate dual solution with accuracy (2 2Kn + 8 λ)−2 λε2 to an approximate primal solution p with accuracy ε p where n is the number of data samples, Kn is the maximum kernel value over the data and λ > 0 is the SVM regularization parameter. For the first stage we present new results for decomposition algorithms and describe new decomposition algorithms with guaranteed accuracy and run time. In particular, for τ–rate certifying decomposition algorithms we establish the optimality of τ = 1/(n − 1). In addition we extend the recent τ = 1/(n − 1) algorithm of Simon (2004) to form two new composite algorithms that also achieve the τ = 1/(n − 1) iteration bound of List and Simon (2005), but yield faster run times in practice. We also exploit the τ–rate certifying property of these algorithms to produce new stopping rules that are computationally efficient and that guarantee a specified accuracy for the approximate dual solution. Furthermore, for the dual QP problem corresponding to the standard classification problem we describe operational conditions for which the Simon and composite algorithms possess an upper bound of O(n) on the number of iterations. For this same problem we also describe general conditions for which a matching lower bound exists for any decomposition algorithm that uses working sets of size 2. For the Simon and composite algorithms we also establish an O(n2 ) bound on the overall run time for the first stage. Combining the first and second stages gives an overall run time of O(n2 (c
5 0.21449511 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
Author: Janez Demšar
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests
6 0.18522881 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
7 0.18058632 93 jmlr-2006-Universal Kernels
8 0.17036584 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
9 0.16623256 69 jmlr-2006-One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
10 0.15975498 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
11 0.15298806 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
12 0.14666075 53 jmlr-2006-Learning a Hidden Hypergraph
13 0.14181441 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)
15 0.12099294 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
16 0.11324456 45 jmlr-2006-Learning Coordinate Covariances via Gradients
17 0.10969367 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
18 0.106187 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
19 0.10533679 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
20 0.10302535 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
topicId topicWeight
[(8, 0.017), (35, 0.448), (36, 0.085), (45, 0.017), (46, 0.021), (50, 0.043), (61, 0.013), (63, 0.025), (68, 0.016), (70, 0.014), (76, 0.014), (78, 0.026), (81, 0.024), (84, 0.023), (90, 0.018), (91, 0.018), (96, 0.089)]
simIndex simValue paperId paperTitle
Author: Olvi L. Mangasarian
Abstract: Support vector machines utilizing the 1-norm, typically set up as linear programs (Mangasarian, 2000; Bradley and Mangasarian, 1998), are formulated here as a completely unconstrained minimization of a convex differentiable piecewise-quadratic objective function in the dual space. The objective function, which has a Lipschitz continuous gradient and contains only one additional finite parameter, can be minimized by a generalized Newton method and leads to an exact solution of the support vector machine problem. The approach here is based on a formulation of a very general linear program as an unconstrained minimization problem and its application to support vector machine classification problems. The present approach which generalizes both (Mangasarian, 2004) and (Fung and Mangasarian, 2004) is also applied to nonlinear approximation where a minimal number of nonlinear kernel functions are utilized to approximate a function from a given number of function values.
same-paper 2 0.74812037 88 jmlr-2006-Streamwise Feature Selection
Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar
Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features. Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
3 0.32117951 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
Author: Greg Hamerly, Erez Perelman, Jeremy Lau, Brad Calder, Timothy Sherwood
Abstract: An essential step in designing a new computer architecture is the careful examination of different design options. It is critical that computer architects have efficient means by which they may estimate the impact of various design options on the overall machine. This task is complicated by the fact that different programs, and even different parts of the same program, may have distinct behaviors that interact with the hardware in different ways. Researchers use very detailed simulators to estimate processor performance, which models every cycle of an executing program. Unfortunately, simulating every cycle of a real program can take weeks or months. To address this problem we have created a tool called SimPoint that uses data clustering algorithms from machine learning to automatically find repetitive patterns in a program’s execution. By simulating one representative of each repetitive behavior pattern, simulation time can be reduced to minutes instead of weeks for standard benchmark programs, with very little cost in terms of accuracy. We describe this important problem, the data representation and preprocessing methods used by SimPoint, the clustering algorithm at the core of SimPoint, and we evaluate different options for tuning SimPoint. Keywords: k-means, random projection, Bayesian information criterion, simulation, SimPoint
Author: Matthias Heiler, Christoph Schnörr
Abstract: We exploit the biconvex nature of the Euclidean non-negative matrix factorization (NMF) optimization problem to derive optimization schemes based on sequential quadratic and second order cone programming. We show that for ordinary NMF, our approach performs as well as existing stateof-the-art algorithms, while for sparsity-constrained NMF, as recently proposed by P. O. Hoyer in JMLR 5 (2004), it outperforms previous methods. In addition, we show how to extend NMF learning within the same optimization framework in order to make use of class membership information in supervised learning problems. Keywords: non-negative matrix factorization, second-order cone programming, sequential convex optimization, reverse-convex programming, sparsity
5 0.31392112 44 jmlr-2006-Large Scale Transductive SVMs
Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP
6 0.30475095 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
7 0.3027159 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals
8 0.30144191 83 jmlr-2006-Sparse Boosting
13 0.28180724 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
14 0.28102273 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
17 0.27681243 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
18 0.27581424 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
19 0.27488014 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
20 0.2746723 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples