jmlr jmlr2007 jmlr2007-7 knowledge-graph by maker-knowledge-mining

7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition


Source: pdf

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Center for Imaging Science, The Johns Hopkins University, 3400 N. Charles Street, Baltimore MD 21218-2686, USA. Editor: Isabelle Guyon. Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. [sent-6, score-0.159]

2 We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variables randomly chosen according to P. [sent-9, score-0.355]

3 The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. [sent-11, score-0.134]

4 Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection 1. [sent-13, score-0.26]

5 Introduction Most of the recent instances of pattern recognition problems (whether in computer vision, image understanding, biology, text interpretation, or spam detection) involve highly complex data sets with a huge number of possible explanatory variables. [sent-14, score-0.134]

6 A wrapper algorithm explores the space of feature subsets to optimize the induction algorithm that uses the subset for classification. [sent-38, score-0.158]

7 These methods based on penalization face a combinatorial challenge when the set of variables has no specific order and when the search must be done over its subsets since many problems related to feature extraction have been shown to be NP-hard (Blum and Rivest, 1992). [sent-39, score-0.148]

8 (2002) construct another recursive selection method to optimize generalization ability with a gradient descent algorithm on the margin of Support Vector classifiers. [sent-46, score-0.293]

9 This weighting technique is not so new in the context of feature selection since it has been used in Sun and Li (2006). [sent-54, score-0.178]

10 Next, in Sections 3 and 4, we define stochastic algorithms which solve the optimization problem and discuss its convergence properties. [sent-60, score-0.135]

11 We denote δ(I) the complete set of features extracted from an input signal I. [sent-71, score-0.149]

12 This modified function ĝ will be used later in combination with a stochastic algorithm which will replace the expectation over the training and validation subsets by empirical averages. [sent-132, score-0.17]
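
A minimal numpy-only sketch of this empirical estimate, assuming random train/validation splits and a nearest-centroid classifier as an illustrative stand-in for the base classifier; the names g_hat and nearest_centroid_error are ours, not the paper's:

```python
import numpy as np

def nearest_centroid_error(X_tr, y_tr, X_va, y_va):
    """Validation error of a nearest-centroid classifier (an illustrative stand-in for A)."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_va[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    y_hat = classes[dists.argmin(axis=1)]
    return float(np.mean(y_hat != y_va))

def g_hat(X, y, omega, n_splits=10, val_frac=0.3, rng=None):
    """Empirical average, over random train/validation splits, of the validation error
    obtained when only the features indexed by omega are used."""
    rng = np.random.default_rng(rng)
    n = len(y)
    n_val = int(val_frac * n)
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        va, tr = perm[:n_val], perm[n_val:]
        errs.append(nearest_centroid_error(X[np.ix_(tr, omega)], y[tr],
                                           X[np.ix_(va, omega)], y[va]))
    return float(np.mean(errs))
```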

13 We address this with a soft selection procedure that assigns weights to the features in F . [sent-141, score-0.19]

14 The relevant features will then be the set of variables δ ∈ F for which P(δ) is large. [sent-147, score-0.151]

15 Remark: We use P here as a control parameter: we first make a random selection of features before setting the learning algorithm. [sent-150, score-0.19]

16 Our first option will be to use projected gradient descent to minimize E , taking only constraint (3) into account. [sent-164, score-0.22]

17 This implies solving the gradient descent equation dP_t/dt = −π_{H_F}(∇E(P_t)) (5), which is well-defined as long as P_t ∈ S_F. [sent-165, score-0.253]
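
For illustration, a minimal numpy sketch of one discrete Euler step of equation (5), under the assumption that π_{H_F} is the orthogonal projection onto the zero-sum hyperplane, so the step preserves constraint (3) only (positivity (4) is not enforced, which is exactly the issue discussed below):

```python
import numpy as np

def project_tangent(grad):
    """Orthogonal projection onto {u : sum(u) = 0}, so a step keeps sum(P) = 1."""
    return grad - grad.mean()

def projected_gradient_step(P, grad, step):
    """One Euler step for dP_t/dt = -pi_{H_F}(grad E(P_t)); positivity is not enforced."""
    return P - step * project_tangent(grad)

# Usage: P stays on the affine hyperplane sum(P) = 1 but may leave the open simplex.
P = np.full(5, 0.2)
grad = np.array([0.5, -0.1, 0.3, 0.0, -0.2])
print(projected_gradient_step(P, grad, step=0.1).sum())  # 1.0 (up to floating point)
```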

18 We will later propose two new strategies to deal with the positivity constraint (4), the first one using the change of variables P → log P, and the second being a constrained optimization algorithm, that we will implement as a constrained stochastic diffusion on SF . [sent-168, score-0.421]

19 It would also be possible to design a constrained optimization algorithm, exploring the faces of the simplex when needed, but this is a rather complex procedure, which is harder to reconcile with the stochastic approximations we will describe in Section 4. [sent-182, score-0.238]

20 This approach will in fact be computationally easier to handle with a constrained stochastic diffusion algorithm, as described in Section 3. [sent-183, score-0.339]

21 Then, we have Proposition 2: the gradient of Ẽ with respect to these new variables is given by ∇_y Ẽ(δ) = ∑_{ω∈F^k} e^{y(ω_1)+···+y(ω_k)} C(ω, δ) g(ω). (9) [sent-188, score-0.158]

22 We can interpret this gradient on the variables y as a gradient on the variables P with the Riemannian metric ⟨u, v⟩_P = ∑_δ u(δ)v(δ)/P(δ). [sent-189, score-0.316]

23 This yields the evolution equation in the y variables: dy_t(δ)/dt = −∇_{y_t} Ẽ(δ) + κ_t e^{y_t(δ)}, (10) where κ_t = ∑_{δ'∈F} ∇_{y_t} Ẽ(δ') e^{y_t(δ')} / … does not depend on δ. [sent-194, score-0.156]

24 This provides an alternative scheme of gradient descent on E which has the advantage of satisfying the positivity constraints (4) by construction. [sent-197, score-0.22]
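
A hedged sketch of the corresponding multiplicative update: the step is taken in the y = log P variables, so positivity holds by construction, and an explicit renormalization stands in for the κ_t term of (10). This is a simplification for illustration, not the paper's exact scheme, and grad below stands for ∇_y Ẽ:

```python
import numpy as np

def exponential_step(P, grad, step):
    """Gradient step in y = log P followed by renormalization: every weight stays
    strictly positive and P remains on the simplex S_F."""
    y = np.log(P) - step * grad
    P_new = np.exp(y - y.max())          # subtract the max for numerical stability
    return P_new / P_new.sum()

P = np.full(4, 0.25)
grad = np.array([1.0, -0.5, 0.2, 0.0])
for _ in range(3):
    P = exponential_step(P, grad, step=0.1)
print(P)  # all entries remain > 0 and sum to 1
```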

25 We implement this using a stochastic diffusion process constrained to the simplex S F . [sent-205, score-0.339]
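
A minimal Euler-Maruyama sketch of such a noisy step; for simplicity it keeps the iterate on the simplex with a Euclidean projection rather than the reflected (Skorokhod) dynamics used in the paper, so it should be read as an approximation for illustration only:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {P >= 0, sum(P) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def noisy_simplex_step(P, grad, step, noise_scale, rng):
    """One Euler-Maruyama step of a noisy gradient flow, projected back onto the simplex."""
    noise = rng.normal(size=P.shape)
    return project_simplex(P - step * grad + np.sqrt(step) * noise_scale * noise)

rng = np.random.default_rng(0)
P = np.full(6, 1 / 6)
P = noisy_simplex_step(P, rng.normal(size=6), step=0.05, noise_scale=0.1, rng=rng)
print(P, P.sum())  # non-negative weights summing to 1
```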

26 This however can be handled using a stochastic approximation, as described in the next section. [sent-213, score-0.135]

27 1 Stochastic Gradient Descent We first recall general facts on stochastic approximation algorithms. [sent-215, score-0.135]

28 Approximation terms: to implement our gradient descent equations in this framework, we therefore need to identify two random variables d_n and d̃_n such that E[d_n] = π_{H_F}[∇_{P_n} E] and E[d̃_n] = π_{y_n} ∇_{y_n} Ẽ. [sent-226, score-0.307]

29 This would yield the stochastic algorithms P_{n+1} = P_n − ε_n d_n or P_{n+1} = P_n e^{−ε_n d̃_n}. From (7), we have: ∇_P E(δ) = E_ω[…] [sent-227, score-0.241]

30 Consequently, following (14), it is natural to define the approximation term of the gradient descent (5) by C(ω_n, ·) …, [sent-232, score-0.22]

31 where the set of k features ω_n is a random variable drawn from F with law P^⊗k and T_1^n, T_2^n are independently sampled from the training set T. [sent-234, score-0.149]

32 In a similar way, we can compute the approximation term of the gradient descent based on (9), since ∇_y Ẽ(δ) = E_ω[g(ω)C(ω, δ)], yielding d̃_n = π_{y_n}(g(ω_n) C(ω_n, ·)). [sent-235, score-0.22]

33 By construction, we therefore have Proposition 3: the mean effect of the random variables d_n and d̃_n is the global gradient descent, in other words E[d_n] = π_{H_F}(∇E(P_n)) and E[d̃_n] = π_{y_n}(∇Ẽ(y_n)). [sent-237, score-0.211]

34 For instance, in the first case, at step n, one can see that for every feature δ ∈ ω_n we subtract from P_n(δ) an amount proportional to the error obtained with ω_n and inversely proportional to P_n(δ), whereas for the features outside ω_n the weights are slightly increased. [sent-248, score-0.234]
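
A hedged numpy sketch of one such step with our own names and normalizations: the drawn features lose weight proportionally to the observed error and inversely to P_n(δ), and the remaining features gain a little mass so the total stays 1. The exact definitions of d_n and C(ω, δ) in the paper may differ, and the subset is drawn here without replacement rather than with the product law P^⊗k:

```python
import numpy as np

def stochastic_euclidean_step(P, X, y, k, step, error_fn, rng):
    """One Robbins-Monro style update of the feature weights P (illustrative version)."""
    F = len(P)
    omega = rng.choice(F, size=k, replace=False, p=P)   # draw k features according to P
    err = error_fn(X, y, omega, rng)                    # empirical error on this subset
    d = np.zeros(F)
    d[omega] = err / np.maximum(P[omega], 1e-12)        # penalize the drawn features
    d -= d.mean()                                       # zero-sum: other weights increase a bit
    P_new = np.clip(P - step * d, 1e-12, None)          # crude guard against negative weights
    return P_new / P_new.sum()
```

Here error_fn can be the g_hat sketch given above, wrapped to match this four-argument signature.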

35 Remark: we provide the Euclidean gradient algorithm, even though it is subject to failure (a weight might become nonpositive), because it may converge for some applications, and in these cases it is much faster than the exponential formulation. [sent-261, score-0.162]

36 We briefly describe in which sense the stochastic approximations we have designed converge to their associated ODEs. [sent-265, score-0.135]
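
For reference, a typical Robbins-Monro step-size schedule satisfying ∑ε_n = ∞ and ∑ε_n² < ∞, the standard condition for such stochastic approximations to track their ODE; the constants below are illustrative, not the paper's:

```python
def robbins_monro_steps(a=0.5, b=10.0):
    """Yield step sizes eps_n = a / (n + b): the series diverges while the series of
    squares converges, the classical Robbins-Monro conditions."""
    n = 0
    while True:
        yield a / (n + b)
        n += 1

steps = robbins_monro_steps()
print([round(next(steps), 3) for _ in range(5)])  # [0.05, 0.045, 0.042, 0.038, 0.036]
```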

37 Thus, we can compare the stochastic gradient algorithms with the exact gradient algorithm and evaluate the speed of decay of E . [sent-351, score-0.383]

38 Moreover, one can see in the construction of our signals that some features are relevant for several classes (reusable features), some features are important for only one class, and others are simply noise on the input. [sent-352, score-0.234]

39 stochastic exponential gradient descent (dashed line) classification rates on the training set. [sent-373, score-0.393]

40 stochastic exponential gradient descent (dashed line) classification rates on the training set. [sent-375, score-0.393]

41 Results: We provide in Figure 6 the evolution of the mean error E on our training set against the computation time for the exact and stochastic exponential gradient descent algorithms. [sent-376, score-0.482]

42 The exact algorithm is faster but is quickly trapped in a local minimum, whereas the exponential stochastic descent avoids more traps. [sent-377, score-0.269]

43 Also shown in Figure 6 is the fact that the stochastic Euclidean method achieved better results faster than the exponential stochastic approach and avoided more traps than the exponential algorithm, reaching lower error rates. [sent-378, score-0.346]

44 Finally, Figure 7 shows that the efficiencies of the stochastic gradient descent and of the reflected diffusion are very similar in our synthetic example. [sent-381, score-0.511]

45 This has in fact always been so in our experiments: the diffusion is slightly better than the gradient when the latter converges. [sent-382, score-0.28]

46 For this reason, we will only compare the exponential gradient and the diffusion in the experiments which follow. [sent-383, score-0.318]

47 Note that in this toy example, the exact gradient descent and the Euclidean stochastic gradient (first algorithm of Section 4. [sent-385, score-0.479]

48 Interpretation: We observe that the features which are preferentially selected are those which lie in several subspaces F_i^j and which bring information for at least two classes. [sent-390, score-0.151]

49 Figure 7: Stochastic Euclidean gradient descent (dashed line) vs. [sent-406, score-0.22]

50 reflected diffusion (full line): classification rate on the training set. [sent-407, score-0.199]

51 Remark finally that the features selected by OFW after the reusable ones are still relevant for the classification task. [sent-410, score-0.151]

52 Even though our framework is designed to select features from a large dictionary of variables, it is interesting to look at the behavior of our algorithm on IRIS, since feature selection results are already known for this classical example. [sent-414, score-0.242]

53 We extract 2 variables at each step of the algorithm, and 100 samples out of 150 are used to train our feature weighting procedure. [sent-416, score-0.173]
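
A hedged end-to-end toy run on IRIS that strings together the sketches given earlier in this summary (it assumes the g_hat and stochastic_euclidean_step helpers defined above; scikit-learn is used only to load the data, and the step schedule and iteration count are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
train_idx = rng.permutation(len(y))[:100]          # 100 of the 150 samples
X_tr, y_tr = X[train_idx], y[train_idx]

P = np.full(X.shape[1], 1.0 / X.shape[1])          # uniform start over the 4 variables
for n in range(200):
    step = 0.2 / (1.0 + 0.1 * n)                   # decreasing Robbins-Monro steps
    P = stochastic_euclidean_step(
        P, X_tr, y_tr, k=2, step=step,
        error_fn=lambda X_, y_, om, r: g_hat(X_, y_, om, rng=r), rng=rng)

print(dict(zip(load_iris().feature_names, np.round(P, 3))))
# the petal measurements typically end up with the largest weights
```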

54 This result is consistent with the selection performed by CART on this database since we obtain similar results as seen in Figure 12. [sent-419, score-0.143]

55 The classification on the test set is improved by selecting two features (with OFW as with Fisher scoring), since we obtain an error rate of 2. [sent-421, score-0.16]

56 Figure 8: Comparison of the mean error rate computed on the test set with the 4 exact or stochastic gradient descents. [sent-432, score-0.201]

57 In our experiments, we arbitrarily fixed the number of features per classifier (to 100 for the Faces, Handwritten Digits and Leukemia data and to 15 for the email database). [sent-445, score-0.156]

58 Figure 10: Evolution with n of the distribution on the 4 variables using a stochastic Euclidean algorithm. [sent-462, score-0.169]

59 Figure 11: Evolution with n of the distribution on the 4 variables using a stochastic Euclidean diffusion algorithm. [sent-467, score-0.325]

60 We have remarked in our experiments that the estimation of P was fairly robust to variations of the number of features extracted at each step (k in our notation). [sent-469, score-0.149]

61 Face Detection. Experimental framework: We use in this section the face database from MIT, which contains 19 × 19 gray-level images; samples extracted from the database are represented in Figure 13. [sent-482, score-0.172]

62 Results: We first show the improvement in the mean performance of our extraction method, learned on the training set and computed on the test set, over a random uniform sampling of features (Figure 14). [sent-490, score-0.21]

63 We remark again that the constrained diffusion performs better than the stochastic exponential gradient. [sent-495, score-0.414]

64 Figure 16 shows a comparison of the efficiency (computed on the test set) of Fisher, RFE, L0SVM and our weighting procedure to select features; in addition, we show the performance of A without any selection and the best performance of Random Forests (as an asymptote). [sent-500, score-0.126]

65 We observe that our method is here less efficient with a small number of features (for 20 features selected, we obtain 7. [sent-501, score-0.234]

66 Figure 16: Efficiency of several feature extraction methods on the faces database test set (legend: Random Forest with 1000 or 100 trees, Fisher, L0, OFW, RFE, and no selection; y-axis: rate of misclassification; x-axis: number of features). [sent-505, score-0.206]

67 As the problem is quite simple when using the last 3 features of the previous list, we choose to remove these 3 variables (which depend on the number of capital letters in an email); we consequently start with a list of 54 features. [sent-521, score-0.151]

68 We use here a 4-nearest neighbor algorithm and we extract 15 features at each step. [sent-522, score-0.151]

69 The database is composed of 4601 messages; we use 75% of the email database to learn our probability P∞, representing our extraction method, while the remaining 25% of the data forms the test set. [sent-523, score-0.241]

70 Newman and Merz (1998) computed on the test set, using a stochastic gradient descent with an exponential parameterization (left) and with a constrained diffusion (right). [sent-559, score-0.628]

71 On our test set, the method based on the exponential parameterization achieves better results than those obtained by the reflected diffusion, which is slower because of the Brownian noise. [sent-561, score-0.225]

72 The words which are useful for spam recognition (left columns) are not surprising (“business”, “remove”, “receive” or “free” are highly weighted). [sent-567, score-0.134]

73 Figure 18: Words mostly selected by P∞ (exponential gradient learning procedure) for the spam/email classification. [sent-592, score-0.158]

74 Without any selection, the linear SVM has an error rate of more than 15%, while each of the former feature selection algorithms achieves better results using barely 5 words. [sent-597, score-0.168]

75 Figure 21 provides the variation of the mean error rate as a function of the number of features k used in each ω. [sent-620, score-0.16]

76 The features are extracted first with the uniform distribution U_F on F, then using the final P∞. [sent-625, score-0.18]

77 Since the features we consider can be computed at every location in the image, it is interesting to visualize where the selection has occurred. [sent-637, score-0.19]

78 Figure 23: Representation of the selected features after stochastic exponential gradient learning for USPS digits (panels: horizontal edges, vertical edges, diagonal edges “+π/4”, diagonal edges “−π/4”). [sent-639, score-0.448]

79 Gene Selection for Leukemia AML-ALL Recognition. Experimental framework: We carry on our experiments with feature selection and classification for microarray data. [sent-642, score-0.152]

80 We run a preselection method to obtain the database used by Deb and Reddy (2003) that contains 3859 genes. [sent-646, score-0.127]

81 Figure 24: Evolution with time of the mean energy E computed by the constrained diffusion method on the training set, for k = 100. [sent-656, score-0.234]

82 Results: Figure 24 shows the efficiency of our method of weighting features in reducing the mean error E on the training set. [sent-663, score-0.17]

83 We remark that with a random uniform selection of 100 features, linear support vector machines obtain a poor error rate larger than 15%, whereas after learning P∞ we achieve a mean error rate of less than 1%. [sent-664, score-0.227]

84 We then perform our Optimal Feature Weighting algorithm on the new set of features obtained by the Fisher preselection, using for A a support vector machine with a linear kernel. [sent-695, score-0.174]
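
A hedged sketch of how the error estimate can be instantiated with a linear-kernel SVM as the classifier A, using scikit-learn; the split ratio and hyperparameters are our own choices, not the paper's:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def svm_error(X, y, omega, rng):
    """Validation error of a linear SVM trained only on the feature subset omega
    (one possible instantiation of the base classifier A)."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X[:, omega], y, test_size=0.3,
        random_state=int(rng.integers(0, 2**31 - 1)))
    clf = LinearSVC(C=1.0, dual=False, max_iter=5000).fit(X_tr, y_tr)
    return float(np.mean(clf.predict(X_va) != y_va))
```

This function has the same four-argument signature as error_fn in the update sketch above, so it can be plugged in directly.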

85 Figure 27 shows the decrease of the mean BER on the training set for each data set of the feature selection challenge. [sent-696, score-0.214]

86 We select the number of features used for the classification on the validation set on the basis of a 10-fold cross-validation. [sent-700, score-0.152]

87 Table 2 summarizes the results obtained by our OFW algorithm, by a linear SVM, and by other feature selection algorithms (the transductive SVM of Wu and Li (2006), combined filter methods with SVM such as F+SVM of Chen and Lin (2006) and FS+SVM of Lal et al. [sent-747, score-0.226]

88 FCBF, which, unlike a wrapper algorithm, does not directly aim to increase the accuracy of any classifier, selects features by identifying redundancy between features and by relevance analysis of the variables. [sent-768, score-0.275]

89 stochastic algorithm has been temporarily trapped in a neighborhood of a local minimum of our energy E . [sent-791, score-0.165]

90 Note also that all the results obtained require larger feature subsets than OFW and use a larger number of probes in the set of selected features. [sent-796, score-0.173]

91 Thus, a good feature selection algorithm should obtain a small fraction of probes in the final selection. [sent-923, score-0.212]

92 In the first synthetic example, one can make the important remark that reusable features are mainly favored by OFW. [sent-930, score-0.217]

93 The learning time of OFW mostly depends on the initial number of variables in the feature space and on the step size of our stochastic scheme; for the Leukemia database, which contains 3859 genes, learning took about one hour. [sent-938, score-0.291]

94 However, OFW can easily be implemented with parallel techniques, since at each step of the stochastic procedure one can test several subsets of the feature set while still using the same update formula for P_n. [sent-939, score-0.187]
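
A hedged sketch of this parallelization idea: draw several subsets per step, evaluate their errors concurrently, and average the resulting update directions before applying the usual formula for P_n. A thread pool is used here for simplicity (for heavy, pure-Python error functions a process pool or joblib would parallelize better); all names are ours:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def averaged_direction(P, X, y, k, error_fn, rng, n_subsets=8, n_workers=4):
    """Average the update direction over several independently drawn subsets omega;
    each evaluation is independent, so they can be dispatched to separate workers."""
    F = len(P)
    omegas = [rng.choice(F, size=k, replace=False, p=P) for _ in range(n_subsets)]
    seeds = rng.integers(0, 2**31 - 1, size=n_subsets)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        errs = list(pool.map(
            lambda arg: error_fn(X, y, arg[0], np.random.default_rng(int(arg[1]))),
            zip(omegas, seeds)))
    d = np.zeros(F)
    for om, err in zip(omegas, errs):
        contrib = np.zeros(F)
        contrib[om] = err / np.maximum(P[om], 1e-12)
        d += contrib - contrib.mean()
    return d / n_subsets
```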

95 Moreover, we note that, in terms of computation time, it can be effective to first filter out very irrelevant features (identified for instance by a Fisher score) and then run the OFW procedure. [sent-940, score-0.154]
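
A hedged numpy sketch of such a prefilter for two-class data, using a common definition of the Fisher score (between-class separation over within-class spread); this specific formula is an assumption on our part, not taken from the paper:

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score for a two-class problem: (m1 - m2)^2 / (s1^2 + s2^2)."""
    c0, c1 = np.unique(y)[:2]
    X0, X1 = X[y == c0], X[y == c1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den

def fisher_prefilter(X, y, n_keep):
    """Indices of the n_keep highest-scoring features, to be passed on to OFW."""
    return np.sort(np.argsort(fisher_scores(X, y))[::-1][:n_keep])
```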

96 Our selection of features is done by learning a probability distribution on the original feature set, based on a gradient descent of an energy E within the simplex SF . [sent-944, score-0.492]

97 Our future work will enable the feature space F to also be modified during the algorithm, by allowing for combination rules and the creation and deletion of tests, involving a hybrid evolution in the set of probability measures and in the feature space. [sent-947, score-0.193]

98 This will be implemented as a generalization of our constrained diffusion algorithm to include jumps in the underlying feature space. [sent-948, score-0.256]

99 Another development would be to speed up the learning procedure using stochastic algorithm techniques to handle even larger databases without using a combination of filter and wrapper methods. [sent-949, score-0.176]

100 Convergence with probability one of stochastic approximation algorithms whose average is cooperative. [sent-963, score-0.135]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ofw', 0.388), ('pn', 0.287), ('younes', 0.239), ('ber', 0.229), ('adat', 0.217), ('tochastic', 0.217), ('hf', 0.205), ('ecognition', 0.203), ('sf', 0.176), ('diffusion', 0.156), ('eature', 0.147), ('election', 0.136), ('stochastic', 0.135), ('rfe', 0.125), ('gradient', 0.124), ('features', 0.117), ('lgorithm', 0.106), ('svm', 0.101), ('dorothea', 0.097), ('gisette', 0.097), ('spam', 0.096), ('descent', 0.096), ('evolution', 0.089), ('probes', 0.087), ('petal', 0.077), ('dexter', 0.077), ('selection', 0.073), ('database', 0.07), ('arcene', 0.068), ('extraction', 0.062), ('usps', 0.059), ('sepal', 0.058), ('amit', 0.058), ('fcbf', 0.058), ('madelon', 0.058), ('gadat', 0.057), ('preselection', 0.057), ('uf', 0.057), ('guyon', 0.056), ('kn', 0.056), ('faces', 0.055), ('weighting', 0.053), ('dn', 0.053), ('geman', 0.053), ('fisher', 0.052), ('performances', 0.052), ('lter', 0.052), ('feature', 0.052), ('ey', 0.051), ('leukemia', 0.051), ('euclidean', 0.049), ('constrained', 0.048), ('fleuret', 0.048), ('dupuis', 0.046), ('kushner', 0.046), ('ttrain', 0.046), ('classi', 0.046), ('iterates', 0.044), ('pt', 0.043), ('rate', 0.043), ('ciency', 0.042), ('wrapper', 0.041), ('forests', 0.039), ('email', 0.039), ('qn', 0.039), ('cachan', 0.039), ('recognition', 0.038), ('yn', 0.038), ('exponential', 0.038), ('remark', 0.037), ('detectors', 0.037), ('forest', 0.037), ('cart', 0.035), ('validation', 0.035), ('rf', 0.035), ('xik', 0.035), ('extract', 0.034), ('ode', 0.034), ('benveniste', 0.034), ('dpt', 0.034), ('fsc', 0.034), ('reusable', 0.034), ('skorokhod', 0.034), ('stochastics', 0.034), ('variables', 0.034), ('selected', 0.034), ('dt', 0.033), ('fs', 0.032), ('width', 0.032), ('extracted', 0.032), ('iris', 0.031), ('ow', 0.031), ('parameterization', 0.031), ('uniform', 0.031), ('xn', 0.03), ('energy', 0.03), ('favored', 0.029), ('extractions', 0.029), ('tsvm', 0.029), ('microarray', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

2 0.22951017 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

3 0.081535555 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér

Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics

4 0.064359136 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

Author: Marc Boullé

Abstract: The naive Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a Bayesian regularization technique to select the most probable subset of variables compliant with the naive Bayes assumption. We also study the limits of Bayesian model averaging in the case of the naive Bayes assumption and introduce a new weighting scheme based on the ability of the models to conditionally compress the class labels. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a naive Bayes classifier with “soft variable selection”. Extensive experiments show that the compressionbased averaged classifier outperforms the Bayesian model averaging scheme. Keywords: naive Bayes, Bayesian, model selection, model averaging

5 0.058282297 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data

Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin

Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds

6 0.057162978 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

7 0.055704866 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

8 0.054716375 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

9 0.053638358 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

10 0.052895959 52 jmlr-2007-Margin Trees for High-dimensional Classification

11 0.048228126 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

12 0.047962654 70 jmlr-2007-Ranking the Best Instances

13 0.046693802 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

14 0.045527663 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

15 0.042964987 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

16 0.042672113 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

17 0.039852098 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

18 0.038616158 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

19 0.038565505 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

20 0.037135106 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.251), (1, 0.045), (2, -0.04), (3, 0.014), (4, 0.049), (5, 0.093), (6, -0.026), (7, -0.237), (8, 0.323), (9, 0.301), (10, -0.2), (11, 0.169), (12, 0.178), (13, 0.027), (14, -0.104), (15, 0.011), (16, 0.124), (17, -0.014), (18, -0.08), (19, 0.064), (20, -0.019), (21, -0.174), (22, -0.024), (23, -0.057), (24, -0.016), (25, -0.045), (26, -0.124), (27, 0.001), (28, 0.066), (29, -0.024), (30, -0.017), (31, -0.082), (32, 0.111), (33, -0.082), (34, 0.049), (35, 0.019), (36, 0.037), (37, 0.037), (38, 0.058), (39, -0.06), (40, 0.017), (41, -0.024), (42, 0.035), (43, -0.012), (44, 0.038), (45, 0.095), (46, 0.006), (47, -0.103), (48, -0.048), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93315047 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

2 0.78663582 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

3 0.35167074 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data

Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin

Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds

4 0.34541956 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan

Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations

5 0.31066021 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

Author: Marc Boullé

Abstract: The naive Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a Bayesian regularization technique to select the most probable subset of variables compliant with the naive Bayes assumption. We also study the limits of Bayesian model averaging in the case of the naive Bayes assumption and introduce a new weighting scheme based on the ability of the models to conditionally compress the class labels. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a naive Bayes classifier with “soft variable selection”. Extensive experiments show that the compressionbased averaged classifier outperforms the Bayesian model averaging scheme. Keywords: naive Bayes, Bayesian, model selection, model averaging

6 0.30863708 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

7 0.29064932 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

8 0.27516976 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

9 0.24355897 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

10 0.24142408 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

11 0.23353104 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

12 0.21624443 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

13 0.21357755 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

14 0.21068993 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

15 0.20648798 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

16 0.2054548 15 jmlr-2007-Bilinear Discriminant Component Analysis

17 0.20029831 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

18 0.19993275 70 jmlr-2007-Ranking the Best Instances

19 0.19304408 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

20 0.1901236 52 jmlr-2007-Margin Trees for High-dimensional Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.021), (8, 0.021), (10, 0.025), (12, 0.034), (15, 0.036), (28, 0.056), (34, 0.372), (40, 0.029), (45, 0.026), (48, 0.067), (60, 0.027), (80, 0.017), (85, 0.076), (98, 0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67377293 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

2 0.38188559 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Author: Masashi Sugiyama

Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called localitypreserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix

3 0.38101524 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

4 0.37741107 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

5 0.37302887 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

Author: Rie Johnson, Tong Zhang

Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian

6 0.37166822 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

7 0.37043115 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

8 0.37019849 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

9 0.36632067 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

10 0.36541632 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

11 0.3641026 39 jmlr-2007-Handling Missing Values when Applying Classification Models

12 0.3638227 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

13 0.36357176 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

14 0.36333597 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

15 0.36319521 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

16 0.36085624 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

17 0.35938042 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

18 0.3590309 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

19 0.35788593 72 jmlr-2007-Relational Dependency Networks

20 0.35748047 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes