JMLR 2010, paper 94 (jmlr2010-94). Source: PDF.
Author: Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, Carlos Santa Cruz
Abstract: Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets. We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem. In order to limit the computational complexity of solving the optimization problem, QPFS uses the Nyström method for approximate matrix diagonalization. QPFS is thus capable of dealing with very large data sets, for which the use of other methods is computationally expensive. In experiments with small and medium data sets, the QPFS method leads to classification accuracy similar to that of other successful techniques. For large data sets, QPFS is superior in terms of computational efficiency.

Keywords: feature selection, quadratic programming, Nyström method, large data set, high-dimensional data
Table 2: Description of the data sets used in experiments. N is the number of examples, M is the number of variables, and C is the number of classes. The baseline error rate is the rate obtained taking all variables into account. The last column cites papers where the data sets have been mentioned (Peng et al., 2005; Zhang et al., 2008; Li et al., 2004; Zhu et al., 2008; Hua et al., 2009; Lauer et al., 2007; LeCun et al., 1998).

Figure 2: Classification error as a function of the number of features for the ARR data set and different regularization parameter values c in a linear SVM (mRMR and QPFS with MI, α = 0.412, at c ∈ {0.01, 1.0, 10.0}). The figure shows that for c = 0.01 the SVM is too regularized. The effect when c = 10.0 is the opposite, and the SVM overfits the training data. A value of c = 1.0 is a good tradeoff.

The value of the α parameter chosen for each data set is shown in Table 3. This value is obtained according to Equations 3 and 9. Our hypothesis is that high values of α are better for data sets with high redundancy among variables; if there is low redundancy, then small values of α should yield better results. The Nyström sampling rate p is chosen as large as possible while still yielding a reasonable running time, since larger values reduce the error in the approximation of the Q matrix. Other values of the α parameter, α ∈ {0.0, 0.1, 0.3, 0.5, 0.7, 0.9}, were also considered in all experiments, in order to verify that the proposed method of setting α provides good results.
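Equations 3 and 9 are outside this excerpt; consistent with the values tabulated in Table 3, the estimate appears to reduce to α̂ = q̄ / (q̄ + f̄), where q̄ is the mean pairwise feature similarity and f̄ the mean feature–class similarity. A minimal sketch under that assumption:

```python
def estimate_alpha(q_bar, f_bar):
    """Hypothetical reading of the paper's heuristic: the balance
    between the redundancy (quadratic) term and the relevance
    (linear) term, from the mean entries of Q and F."""
    return q_bar / (q_bar + f_bar)

# ARR with correlation: q_bar = 0.0889, f_bar = 0.0958 (Table 3).
alpha = estimate_alpha(0.0889, 0.0958)
print(round(alpha, 3))  # 0.481, matching the tabulated value
```

The same expression reproduces the other Table 3 entries (e.g., 0.618 for NCI60 with correlation and 0.649 for RAT), which is why it is offered here as the likely form of the heuristic.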
Table 3: Values of the α parameter for each data set. Correlation (cor) and mutual information (MI) were used as similarity measures for the ARR and NCI60 data sets; only mutual information was used for the SRBCT, GCM, RAT and MNIST data sets. p is the subsampling rate in the Nyström method, q̄ is the mean value of the elements of the matrix Q (similarity between each pair of features), and f̄ is the mean value of the elements of the F vector (similarity of each feature with the target class). For the MNIST data set, only nonzero values were considered for the statistics, due to the high level of sparsity of its features (80.78% sparsity on average).

Data Set      p      q̄        f̄        α̂
ARR (cor)     -      0.0889    0.0958    0.481
NCI60 (cor)   -      0.267     0.165     0.618
ARR (MI)      -      0.0106    0.0152    0.411
NCI60 (MI)    -      0.0703    0.253     0.217
SRBCT (MI)    -      0.0188    0.0861    0.179
GCM (MI)      0.05   0.0284    0.158     0.152
RAT (MI)      0.1    0.0346    0.0187    0.649
MNIST (MI)    -      0.0454    0.0515    0.469

3.2 Classification Accuracy Results

The aim of the experiments described in this section is to compare the classification accuracy achieved with mRMR and with QPFS, with and without the Nyström approximation. The MaxRel algorithm (Peng et al., 2005) is also included in the comparison. Two similarity measures, mutual information (MI) and correlation, are considered. Classification error is measured as a function of the number of features. We also give results from a baseline method that selects features at random, in order to determine the absolute advantage of using any feature selection method.

Figure 3 shows the average classification error rate for the ARR data set as a function of the number of features. In Figure 3a, correlation is the similarity measure, while mutual information (MI) is applied in Figure 3b. In both cases, the best accuracy is obtained with α = 0.5, which means that an equal tradeoff between relevance and redundancy is best.
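The relevance–redundancy tradeoff that α controls can be sketched as a small quadratic program. The following is a minimal illustration, assuming an objective of the form (1 − α)·xᵀQx − α·fᵀx minimized over the simplex; the exact scaling and the projected-gradient solver are simplifications (the paper solves the program with standard QP software):

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def qpfs_weights(Q, f, alpha, steps=5000, lr=0.05):
    # Minimize (1 - alpha) * x'Qx - alpha * f'x over the simplex
    # by projected gradient descent.
    x = np.full(len(f), 1.0 / len(f))
    for _ in range(steps):
        grad = 2.0 * (1.0 - alpha) * Q.dot(x) - alpha * f
        x = project_to_simplex(x - lr * grad)
    return x

# Toy problem: features 0 and 1 are highly redundant with each other;
# feature 2 is nearly independent and only slightly less relevant.
Q = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
f = np.array([0.5, 0.5, 0.4])
x = qpfs_weights(Q, f, alpha=0.5)
# With an equal tradeoff, weight shifts away from the redundant pair
# toward feature 2.
```

With α = 1.0 the quadratic term vanishes and the solver simply ranks features by relevance, which matches the remark later in the text that QPFS with α = 1.0 is equivalent to MaxRel.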
However, accuracies using the values of α specified by our heuristic are similar. Better accuracy is obtained when MI is used, in which case (Figure 3b) the error rate curve for α = 0.5 is similar to that obtained with mRMR. The random selection method yields results significantly worse than those obtained with the other algorithms. Comparison with this method shows that the other methods provide a significant benefit up to about 150 features. For the NCI60 data set (Figure 4), the best accuracy is obtained when mutual information is used (Figure 4b) and α is set to 0.217 according to Table 3. In this case, the accuracy of QPFS is slightly better than that of mRMR. The value of α close to zero indicates that it is appropriate to give more weight to the quadratic term in QPFS. When correlation is used (Figure 4a), the best accuracy is obtained when α is set according to Equation 3. Generally, MI as similarity measure leads to better accuracy than correlation. This finding is reasonable given that MI can capture nonlinear relationships between variables. MI is used in the experiments described in the remainder of this section.

Figure 3: Classification error as a function of the number of features for the ARR data set. (a) QPFS results using correlation as similarity measure with different α values (α = 0.481 and α = 0.5), plus the baseline error and random selection. (b) MaxRel, mRMR and QPFS results using mutual information as similarity measure and different values of α for QPFS (α = 0.5 and α = 0.412).

Average error rate for the SRBCT data set and different sampling rates as a function of the number of features is shown in Figure 5.
Results for the best α value in the grid {0, 0.1, 0.3, 0.5, 0.7, 0.9}, α = 0.1, and the estimated α̂ = 0.179 are shown in Figure 5a. Accuracies for both α values are similar. The fact that a low value of α is best indicates low redundancy among variables compared to their relevance to the target class. QPFS classification accuracy is similar to that of mRMR. As shown in Figure 5b, when the QPFS+Nyström method is used, the higher the parameter p, the closer the Nyström approximation is to complete diagonalization. QPFS+Nyström gives classification accuracy similar to that of QPFS when p > 0.1.

Figure 4: Classification error as a function of the number of features for the NCI60 data set. (a) QPFS results using correlation as similarity measure (α = 0.619), plus the baseline error and random selection. (b) MaxRel, mRMR and QPFS results using mutual information as similarity measure (α = 0.217).

Figure 6 shows error rates for the GCM data set using the algorithms MaxRel, mRMR, and QPFS+Nyström with α = 0.1 and α̂ = 0.152. When the number of features is over 60, the accuracy achieved with QPFS+Nyström is better than with mRMR. A sampling rate of 3% is adequate for this data set, which represents a major reduction in time complexity given a feature space of 16063 variables. Another data set with many features is the RAT data set, for which Figure 7 shows results. In this case, QPFS+Nyström gives classification accuracy similar to that of mRMR when the subset size is over 80 and the sampling rate is 10%.
Given the good performance of the MaxRel algorithm for this data set, it is not surprising that a large α value (α = 0.9 or α̂ = 0.649) is best, considering also that QPFS with α = 1.0 is equivalent to MaxRel.

Figure 5: Error rates using MaxRel, mRMR and QPFS+Nyström methods, with mutual information as similarity measure, for the SRBCT data set. (a) MaxRel, mRMR, QPFS (α = 0.179 and α = 0.1) and QPFS+Nyström (α = 0.1, p = 0.5). (b) QPFS+Nyström with α = 0.1 and p ∈ {0.01, 0.04, 0.1, 0.5}.

The MNIST data set has a high number of training examples. Results for it are shown in Figure 8 for QPFS with α = 0.3, the estimate α̂ = 0.469, and QPFS+Nyström with α̂ and p ∈ {0.1, 0.2, 0.5}. Our C code of mRMR is used instead of the code on the mRMR web site (Peng et al., 2005), which takes a long time to read the training file. The error rate for all algorithms reaches a minimum when about 350 features are selected. This is not a surprising fact: analyzing the sparsity of the MNIST features, approximately 400 of them have a level of sparsity higher than 70%. But if the feature space needs a greater reduction, significant differences appear between the studied methods, as shown in Figure 8. mRMR and QPFS with α̂ = 0.469 have similar performance, close to the best results obtained by QPFS with α = 0.3. For this data set, the number of samples is much greater than the number of features, N ≫ M, and therefore the time complexity of mRMR and QPFS is the same (O(NM²)).
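The per-feature sparsity statistic used in the analysis above can be computed in a few lines. This is a toy illustration with made-up data (the paper computes the same statistic over the 780 MNIST pixel features):

```python
def feature_sparsity(column):
    """Fraction of zero entries in one feature column."""
    return sum(1 for v in column if v == 0) / len(column)

# Hypothetical mini-dataset: 3 features observed over 4 examples.
features = [
    [0, 0, 0, 1],   # 75% sparse
    [0, 5, 0, 2],   # 50% sparse
    [1, 2, 3, 4],   # dense
]
rates = [feature_sparsity(col) for col in features]
print(rates)  # [0.75, 0.5, 0.0]
```

Counting how many features exceed a 70% sparsity threshold on MNIST gives the "approximately 400" figure cited in the text.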
When QPFS+Nyström is applied with p = 0.2, the error rate is competitive, and MNIST provides an example of the ability of QPFS+Nyström to handle large data sets, reducing the computational cost of mRMR and QPFS by a factor of 5. Note that the error rates shown for the MNIST data set are obtained using a linear kernel. The radial basis function kernel for SVM classifiers is known to lead to lower error rates for the full MNIST data set, but the choice of kernel is an issue separate from feature selection, which is the focus of this paper.

Figure 6: Error rates using MaxRel, mRMR and QPFS+Nyström methods, with mutual information as similarity measure, for the GCM data set (QPFS+Nyström with α = 0.1, p = 0.03 and α = 0.152, p = 0.05).

Figure 7: Error rates using MaxRel, mRMR and QPFS+Nyström methods, with mutual information as similarity measure, for the RAT data set (QPFS+Nyström with α = 0.9, p = 0.1 and α = 0.649, p = 0.1).

Figure 9 shows a grid of 780 pixels arrayed in the same way as the images in the MNIST data sets. A pixel is black if it corresponds to one of the top 100 (Figure 9a) or 350 (Figure 9b) selected features, and white otherwise. Black pixels are denser towards the middle of the grid, because that is where the most informative features are. Pixels sometimes appear in a black/white/black checkerboard pattern, because neighboring pixels tend to make each other redundant.

Table 4 evaluates the statistical significance of error rate differences. For each data set, 100 classifiers were trained using the stated number M of selected features.
The 100 classifiers arise from 10 repetitions of 10-fold cross-validation, so which M features are used may be different for each classifier. The one-tailed paired t-test for equal means is applied to the two sets of error rates, one set for mRMR and one set for QPFS. The test is one-tailed because the null hypothesis is that the mRMR method is as good as or better than the QPFS method. The test is paired because both methods were applied to the same 100 data set versions. Results of the test are given in the row labeled "significant?".

Figure 8: Error rates using MaxRel, mRMR and QPFS+Nyström methods, with mutual information as similarity measure, for the MNIST data set (mRMR MID; QPFS with α = 0.3 and α = 0.469; QPFS+Nyström with α = 0.469 and p ∈ {0.1, 0.2, 0.5}).

Figure 9: First (a) 100 and (b) 350 features selected by QPFS+Nyström (α̂ = 0.469 and p = 0.5) for the MNIST data set (black pixels).

For the NCI60 and SRBCT data sets, the best result is obtained when QPFS is used, and it is statistically significantly better than mRMR. When 200 to 400 variables are considered, mRMR and QPFS are not statistically significantly different, but the accuracy is not as good as in the case of 100 features, probably due to overfitting. In the case of the GCM data set, the mRMR method is statistically significantly better when fewer than 50 variables are considered. If the number of features is over 100, the accuracy with QPFS is significantly better than with mRMR, and the best performance is obtained in this case.
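The significance test described above can be sketched in a few lines. This is a toy illustration with made-up per-fold error rates (10 folds rather than the paper's 100 paired runs, so the degrees of freedom and critical value differ):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    # Paired differences; positive values favor method b (lower error).
    d = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical per-fold error rates (%). H0: mRMR is as good as or
# better than QPFS; H1: QPFS has lower error.
mrmr = [22.1, 21.5, 23.0, 22.4, 21.9, 22.8, 21.2, 22.5, 23.1, 21.7]
qpfs = [21.0, 20.9, 21.8, 21.5, 20.7, 21.9, 20.4, 21.3, 22.2, 20.8]

t = paired_t_statistic(mrmr, qpfs)
# One-tailed 5% critical value for df = 9 is about 1.833, so H0 is
# rejected when t exceeds it.
print(t > 1.833)  # True for this made-up sample
```

Pairing matters here: the two methods are evaluated on the same cross-validation splits, so differencing removes the split-to-split variation before testing.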
For the ARR data set, mRMR is statistically significantly better than QPFS if fewer than 10 features are considered, but the error rate obtained can be improved if more features are taken into account. When more than 50 features are selected, the two methods are not statistically significantly different. The RAT data set behaves quite similarly. When fewer than 100 features are used, the mRMR algorithm is statistically better than QPFS, but the error rate can be reduced by adding more features. The two algorithms are not statistically significantly different in the other cases, except if more than 400 features are involved, in which case QPFS is statistically significantly better than mRMR. Note that the error rates shown for QPFS are obtained with the proposed estimate α̂. In some cases, as shown in Figures 3 to 7, this α value is not the best choice.

Beyond simple binary statistical significance, Table 4 indicates that the QPFS method is statistically significantly better when the value of α̂ is small. A possible explanation for this finding is the following. When α̂ is small, features are highly correlated with the label (f̄ ≫ q̄). The mRMR method is greedy, and only takes into account redundancy among features selected in previous iterations. When features are highly correlated with the label, mRMR selects features with high relevance and mostly ignores redundancy. In contrast, QPFS evaluates all variables simultaneously, and always balances relevance and redundancy.

3.2.1 Comparison with Other Feature Selection Methods

The experiments of this work focus on comparing QPFS with the greedy filter-type method mRMR (difference form, named MID), which also takes into account the difference between redundancy and relevance. Nevertheless, other feature selection methods independent of the classifier have been considered in the described experiments:

• mRMR (quotient form, named MIQ) (Ding and Peng, 2005).
While mRMR (MID form) considers the difference between the estimates of redundancy and relevance, mRMR (MIQ form) uses the quotient of the two.

• ReliefF (Robnik-Šikonja and Kononenko, 2003). The main idea of ReliefF is to evaluate the quality of a feature according to how well it distinguishes between instances that are near each other. This algorithm is efficient in problems with strong dependencies between attributes.

• Streamwise Feature Selection (SFS) (Zhou et al., 2006). SFS selects a feature if the benefit of adding it to the model is greater than the increase in model complexity. The algorithm scales well to large feature sets, and considers features sequentially for addition to a model, making it unnecessary to know all the features in advance.

Average error rates for MaxRel, mRMR (MID), mRMR (MIQ), ReliefF and QPFS using linear SVM (c = 1.0) and different numbers of features are shown in Table 5. Table 6 shows the error rate and the average number of features selected by Streamwise Feature Selection.

Table 4: Average error rates (%) using the mRMR and QPFS methods, for classifiers based on M features. The parameter α̂ of the QPFS method is indicated; rows are ordered according to this value. The Nyström approximation was used for the GCM and RAT data sets.

RAT (α̂ = 0.65)     M = 10           M = 50           M = 100          M = 200          M = 400
  mRMR             21.15 ± 0.31     16.18 ± 0.27     14.88 ± 0.24     12.81 ± 0.25     10.95 ± 0.23
  QPFS             27.13 ± 0.33     18.16 ± 0.29     15.24 ± 0.27     12.85 ± 0.26     10.51 ± 0.21
  significant?     no               no               no               no               yes
  p value          1.00             1.00             0.89             0.56             1.7 × 10⁻²

ARR (α̂ = 0.41)
  mRMR             25.19 ± 0.65     20.76 ± 0.63     21.71 ± 0.61     21.64 ± 0.61     -
  QPFS             28.05 ± 0.65     21.30 ± 0.65     21.52 ± 0.65     21.76 ± 0.58     -
  significant?     no               no               no               no               -
  p value          1.00             0.96             0.69             0.39             -

NCI60 (α̂ = 0.22)
  mRMR             53.50 ± 2.17     34.33 ± 1.74     32.00 ± 1.93     32.83 ± 1.84     33.64 ± 1.80
  QPFS             46.33 ± 2.19     29.83 ± 1.68     29.00 ± 1.83     34.67 ± 1.81     35.17 ± 1.95
  significant?     yes              yes              yes              no               no
  p value          1.6 × 10⁻³       7.5 × 10⁻³       3.3 × 10⁻²       0.95             0.96

SRBCT (α̂ = 0.18)
  mRMR             9.38 ± 1.06      2.31 ± 0.51      0.47 ± 0.23      0.24 ± 0.17      0.49 ± 0.30
  QPFS             3.89 ± 0.75      0.11 ± 0.11      0.05 ± 0.11      0.11 ± 0.11      0.35 ± 0.25
  significant?     yes              yes              yes              no               no
  p value          5.4 × 10⁻⁹       5.6 × 10⁻⁵       2.3 × 10⁻²       0.27             0.36

GCM (α̂ = 0.15)
  mRMR             54.26 ± 1.19     43.38 ± 1.18     41.38 ± 1.08     38.26 ± 1.06     38.50 ± 1.10
  QPFS             65.66 ± 1.03     44.11 ± 1.11     39.57 ± 1.24     38.06 ± 1.16     35.23 ± 1.17
  significant?     no               no               yes              no               yes
  p value          1.00             0.81             0.037            0.40             1.42 × 10⁻⁴

SFS was applied to the binary data sets ARR and RAT, and was used only as a feature selection method (a feature generation step was not included). Table 5 shows that for the ARR, NCI60, SRBCT and GCM data sets, the best selector is mRMR or QPFS. A statistical study of the performance of both methods is given in Table 4. In the case of the RAT data set, the best methods are MaxRel and ReliefF. The fact that the best results are obtained with methods which only consider relevance to the target class fits with the analysis of Figure 7. Finally, for the MNIST data set the best choice is the mRMR (MIQ) algorithm. Nevertheless, the
Nevertheless, the 1510 Q UADRATIC P ROGRAMMING F EATURE S ELECTION performance of MIQ in some data sets is not competitive (see, for instance, the ARR and NCI60 results). The accuracy of QPFS+Nystr¨ m (p = 0.2) is good if a high enough number of features is o used, and it has lower computational cost than mRMR and QPFS. Regarding SFS, Table 6 shows that SFS provides a competitive error rate for the ARR data set with few features (around 11) but its effectiveness in the RAT data set is improved by other feature selection algorithms when more than 6 attributes are considered. It is noticeable the efficiency of SFS getting acceptable accuracies using a small number of features. ReliefF and SFS are feature selection methods which need to establish the value of some parameters like in QPFS. In ReliefF all instances were used (not random subsampling) and the number of neighbors was set to 3 for all data sets, except for MNIST where 10 neighbors were considered. In the case of the SFS algorithm, the default values (wealth = 0.5 and △ α = 0.5) were used. 3.3 Time Complexity Results Since the previous subsection has established the effectiveness of the QPFS method, it is useful now to compare mRMR and QPFS experimentally with respect to time complexity. As stated in Table 1 in Section 2.4, the running times of mRMR and QPFS with and without Nystr¨ m all depend linearly o on N when M and p are fixed. In order to confirm experimentally this theoretical dependence, time consumption as a function of the number of training examples is measured on the SRBCT data set. Figure 10a shows the time consumed for the modified SRBCT data set, averaged over 50 runs, as a function of the number of samples, N, for the mRMR, QPFS and QPFS+Nystr¨ m methods. o As expected, both mRMR and QPFS show a linear dependence on the number of patterns. For QPFS+Nystr¨ m, Table 1 shows that the slope of this linear dependence is proportional to the o sampling rate p. 
Over the range p = 0.01 to p = 0.5, a decrease in p leads to a decrease in the slope of the linear dependence on N. Therefore, although all the algorithms depend linearly on N, QPFS+Nyström is computationally the most efficient. The time cost advantage increases with the number of training examples, because the slope is greater for mRMR than for QPFS.

The next question is the impact of the number of features, M, on performance. Table 1 shows that mRMR and QPFS have quadratic and cubic dependence on M, respectively. However, the QPFS+Nyström cubic coefficient is proportional to the square of the sampling rate. When small values of p are sufficient, which is the typical case, the cubic terms are not dominant. These results are illustrated in the experiments shown in Figure 10b, which shows the average time cost for the SRBCT data set as a function of the problem dimension, M, for the mRMR, QPFS, and QPFS+Nyström methods. As expected from Table 1, mRMR and QPFS empirically show quadratic and cubic dependence on problem dimension. QPFS+Nyström shows only quadratic dependence on problem dimension, with a decreasing coefficient for decreasing p values. In all cases, a t-test has been used to verify the order of the polynomial that best fits each curve by least-squares fitting (Neter and Wasserman, 1974). Overall, for small Nyström sampling rates, QPFS+Nyström is computationally the most efficient.

Last but not least, Table 1 shows there should be a quadratic dependence on sampling rate for the QPFS+Nyström algorithm. Figure 10c shows the empirical average time cost for the SRBCT data set as a function of the sampling rate p. As expected, there is quadratic dependence on p and cubic dependence on the problem dimension M.
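The Nyström step that makes these savings possible can be sketched generically as follows, assuming the standard column-sampling form Q ≈ C W⁺ Cᵀ for a symmetric positive semidefinite Q (the paper's variant, which feeds an approximate diagonalization, may differ in detail):

```python
import numpy as np

def nystrom_approximation(Q, idx):
    """Approximate a symmetric PSD matrix Q from the columns in idx:
    Q ≈ C W⁺ Cᵀ, where C = Q[:, idx] and W = Q[np.ix_(idx, idx)]."""
    C = Q[:, idx]
    W = Q[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

# Toy PSD similarity matrix of rank 2: sampling any 2 linearly
# independent columns recovers Q exactly. In QPFS+Nyström the sample
# size is p * M for a sampling rate p, so only a p-fraction of the
# similarity matrix ever needs to be computed and factorized.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 2))
Q = B @ B.T
Q_hat = nystrom_approximation(Q, idx=[0, 1])
```

For a rank-r matrix, sampling r independent columns reconstructs Q exactly; smaller sampling rates trade approximation error for the run-time savings measured in Figure 10.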
Table 5: Error rates (%) for different feature selection methods and linear SVM. In the original, the best result in each case is marked in bold. QPFS+N indicates that the Nyström approximation is used in the QPFS method, and p represents the subsampling rate in the Nyström method. In all cases, the α parameter of QPFS is set to α̂.

                      M = 10   20      40      50      100     200     400
ARR    MaxRel         27.48    24.68   21.70   20.82   20.31   21.73   -
       MID            25.19    22.99   20.64   20.76   21.71   21.64   -
       MIQ            29.79    27.78   23.89   23.32   21.53   21.74   -
       reliefF        30.64    24.48   21.54   21.34   20.90   21.66   -
       QPFS           28.05    23.72   22.39   21.30   21.52   21.76   -
NCI60  MaxRel         61.33    49.83   40.00   38.67   34.83   35.50   34.17
       MID            53.50    41.50   36.33   34.33   32.00   32.83   33.67
       MIQ            56.50    47.50   38.83   38.17   32.83   35.50   35.17
       reliefF        56.93    54.17   48.49   48.49   38.07   32.13   34.36
       QPFS           46.33    36.00   33.00   29.83   29.00   34.67   35.17
SRBCT  MaxRel         21.58    14.33   6.36    4.51    2.19    0.24    0.13
       MID            9.39     3.33    2.01    2.31    0.47    0.24    0.49
       MIQ            10.11    2.18    0.47    0.72    0.24    0.25    0.72
       reliefF        6.38     4.18    1.65    1.79    0.96    0.40    0.40
       QPFS           3.89     1.57    0.97    0.11    0.05    0.11    0.35
GCM    MaxRel         79.32    60.78   48.46   45.58   40.98   39.98   38.77
       MID            54.26    48.45   44.16   43.38   41.38   38.26   35.50
       MIQ            79.32    56.48   46.64   43.96   41.80   38.46   38.05
       reliefF        61.25    51.61   46.36   43.83   39.35   39.75   37.08
       QPFS+N p=0.05  65.66    54.72   46.09   44.11   39.57   38.06   35.26
RAT    MaxRel         19.95    17.32   15.40   15.16   14.34   13.54   11.97
       MID            21.15    18.46   16.53   16.18   14.88   12.81   10.95
       MIQ            23.69    19.62   17.23   16.61   15.07   12.46   10.96
       reliefF        22.16    20.40   17.44   16.45   13.68   11.43   9.85
       QPFS+N p=0.1   27.13    21.89   19.02   18.16   15.24   12.85   10.51
MNIST  MaxRel         59.19    40.98   25.77   22.5    12.09   7.64    6.72
       MID            53.39    29.37   19.56   17.40   11.72   7.55    6.66
       MIQ            51.69    25.98   11.79   10.87   7.78    6.90    6.33
       reliefF        50.91    40.20   23.81   19.56   12.31   8.47    6.86
       QPFS+N p=0.2   57.00    35.39   23.62   20.48   11.31   7.71    6.54

4. Conclusions

This paper has presented and studied a new feature selection method for multiclass classifier learning problems.
The new method, named Quadratic Programming Feature Selection (QPFS), is based on the optimization of a quadratic function that is reformulated in a lower-dimensional space using the Nyström approximation (QPFS+Nyström). The QPFS+Nyström method, using either the Pearson correlation coefficient or mutual information as the underlying similarity measure, is computationally more efficient than the leading previous methods, mRMR and MaxRel. With respect to classification accuracy, the QPFS method is similar to MaxRel and mRMR when mutual information is used, and yields slightly better results if there is high redundancy. In all experiments, mutual information yields better classification accuracy than correlation, presumably because mutual information better captures nonlinear dependencies. Small sampling rates in the Nyström method still lead to reasonable approximations of exact matrix diagonalization, sharply reducing the time complexity of QPFS.

Table 6: Streamwise Feature Selection error rates.

Data Set   Number of Selected Features (average)   Error rate (%)
ARR        10.75 ± 0.155                           23.34 ± 0.63
RAT        6.12 ± 0.13                             22.87 ± 0.33

Figure 10: Time cost in seconds for mRMR and QPFS as a function of: (a) the number of patterns, N; (b) the dimension, M; and (c) the sampling rate, p. QPFS+N indicates that the Nyström approximation is used in the QPFS method.
In summary, the new QPFS+Nyström method for selecting a subset of features is a competitive and efficient filter-type feature selection algorithm for high-dimensional classifier learning problems.

Acknowledgments

I.R.-L. is supported by an FPU grant from Universidad Autónoma de Madrid, and partially supported by the Universidad Autónoma de Madrid-IIC Chair. R.H. acknowledges partial support by ONR N00014-07-1-0741.

References

E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. H. Bischof, and D. C. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. In SC, pages 2–11, 1990.

R. Bekkerman, N. Tishby, Y. Winter, I. Guyon, and A. Elisseeff. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3:1183–1208, 2003.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, September 1999.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, January 1984.

C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

T. M. Cover. The best two independent measurements are not the two best. IEEE Trans. Systems, Man, and Cybernetics, 4:116–117, 1974.

C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(2):185–205, April 2005.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, November 2000.

S. Fine, K. Scheinberg, N. Cristianini, J. Shawe-Taylor, and B. Williamson. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

G. Forman.
BNS feature scaling: an improved representation over TF-IDF for SVM text classification. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 263–270, New York, NY, USA, 2008. ACM.

G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003.

C. Fowlkes, S. Belongie, and J. Malik. Efficient spatiotemporal grouping using the Nyström method. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 231–238, 2001.

D. Goldfarb and A. Idnani. A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27(1):1–33, 1983.

I. Guyon. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

M. A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the International Conference on Machine Learning, pages 359–366. Morgan Kaufmann, 2000.

J. Hua, W. D. Tembe, and E. R. Dougherty. Performance of feature-selection methods in the classification of high-dimension data. Pattern Recognition, 42(3):409–424, 2009.

A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1):4–37, January 2000.

G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, San Francisco, 1994.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

S. Kumar, M. Mohri, and A. Talwalkar. Sampling techniques for the Nyström method. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

P. Langley. Selection of relevant features in machine learning.
In Proceedings of the AAAI Fall Symposium on Relevance, pages 140–144. AAAI Press, 1994.

F. Lauer, C. Y. Suen, and G. Bloch. A trainable feature extractor for handwritten digit recognition. Pattern Recognition, 40(6):1816–1824, 2007.

Y. LeCun, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

R. Leitner, H. Mairer, and A. Kercek. Real-time classification of polymers with NIR spectral imaging and blob analysis. Real-Time Imaging, 9:245–251, 2003.

T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20:2429–2437, 2004.

K. Momen, S. Krishnan, and T. Chau. Real-time classification of forearm electromyographic signals corresponding to user-selected intentional movements for multifunction prosthesis control. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 15(4):535–542, December 2007.

J. Neter and W. Wasserman. Applied Linear Statistical Models. Richard D. Irwin, Inc., 1974.

H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27:1226–1238, 2005. Software available at http://research.janelia.org/peng/proj/mRMR/.

M. Robnik-Šikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23–69, 2003.

J. Rodriguez, A. Goni, and A. Illarramendi. Real-time classification of ECGs on a PDA. IEEE Transactions on Information Technology in Biomedicine, 9(1):23–34, 2005.

P. Shenoy, K. J. Miller, B. Crawford, and R. N. Rao. Online electromyographic control of a robotic prosthesis. IEEE Transactions on Biomedical Engineering, 55(3):1128–1135, March 2008.

B. A. Turlach and A. Weingessel. The quadprog package, version 1.4-11, 2000.
Available electronically at cran.r-project.org/web/packages/quadprog/index.html.

J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, pages 668–674. MIT Press, 2001.

C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), 2003.

Y. Zhang, C. Ding, and T. Li. Gene selection algorithm by combining reliefF and mRMR. BMC Genomics, 9(Suppl 2), 2008.

J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar. Streamwise feature selection. Journal of Machine Learning Research, 7:1861–1885, 2006.

S. Zhu, D. Wang, and T. Li. Feature selection for gene expression using model-based entropy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 99(1), 2008.