nips nips2002 nips2002-89 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nuno Vasconcelos
Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides being efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We address the question of feature selection in the context of visual recognition. [sent-3, score-0.439]
2 It is shown that, besides being efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. [sent-4, score-0.737]
3 The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. [sent-5, score-1.293]
4 The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. [sent-6, score-1.533]
5 Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it. [sent-7, score-0.201]
6 1 Introduction It has long been recognized that feature extraction and feature selection are important problems in statistical learning. [sent-8, score-0.652]
7 Given a classification or regression task in some observation space (typically high-dimensional), the goal is to find the best transform into a feature space (typically lower dimensional) where learning is easier. [sent-9, score-0.266]
8 While in the case of feature extraction there are few constraints on the transformation, for feature selection it is constrained to be a projection, i.e. [sent-12, score-0.645]
9 the components of a feature vector in the feature space are a subset of the components of the associated vector in the observation space. [sent-14, score-0.246]
10 Both feature extraction and selection can be formulated as optimization problems where the goal is to find the transform that best satisfies a given criterion for “feature goodness”. [sent-15, score-0.551]
11 In this paper we concentrate on visual recognition, a subset of the classification problem for which various optimality criteria have been proposed throughout the years. [sent-16, score-0.197]
12 In this context, the best feature spaces are those that maximize discrimination, i.e. [sent-17, score-0.196]
13 the separation between the different image classes to be recognized. [sent-19, score-0.097]
14 However, classical discriminant criteria such as linear discriminant analysis make very specific assumptions regarding the class densities, e.g. Gaussianity. [sent-20, score-0.21]
15 Recently, various authors have advocated the use of information theoretic measures for feature extraction or selection [15, 3, 9, 11, 1]. [sent-23, score-0.42]
16 These can be seen as instantiations of the infomax principle of neural organization (footnote 1) proposed by Linsker [7], which also encompasses information theoretic approaches for independent component analysis and blind-source separation [2]. [sent-24, score-0.683]
17 In the classification context, infomax recommends the selection of the feature transform that maximizes the mutual information (MI) between features and class labels. [sent-25, score-1.155]
18 By noting that maximizing the MI between features and class labels is the same as minimizing the entropy of the labels given the features, it is possible to establish a connection through Fano’s inequality: the class-posterior entropy (CPE) yields a lower bound on the probability of error (PE) [11, 4]. [sent-27, score-0.314]
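These quantities can be made explicit with standard identities (a sketch in generic notation, which is an assumption: Y is the class label, Z the observation, T the feature transformation, M the number of classes, and logarithms are in bits; the paper's own symbols may differ):

```latex
% Since H(Y) does not depend on T, maximizing the mutual information is
% equivalent to minimizing the class-posterior entropy (CPE):
I(T(Z);Y) = H(Y) - H(Y \mid T(Z))
\;\Rightarrow\;
\arg\max_{T} I(T(Z);Y) = \arg\min_{T} H(Y \mid T(Z)).

% Standard form of Fano's inequality for an M-class problem, which lower
% bounds the probability of error P_e by a function of the CPE:
H(Y \mid T(Z)) \le H_b(P_e) + P_e \log_2(M-1)
\;\Rightarrow\;
P_e \ge \frac{H(Y \mid T(Z)) - 1}{\log_2 M}.
```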
19 This connection is, however, weak in the sense that there is little insight into how tight the bound is, or whether minimizing it has any relationship to minimizing the PE. [sent-28, score-0.219]
20 In fact, among all lower bounds on PE, it is not clear that CPE is the most relevant. [sent-29, score-0.065]
21 An obvious alternative is the Bayes error (BE), which 1) is the tightest possible classifier-independent lower bound, 2) is an intrinsic measure of the complexity of the discrimination problem and, 3) like CPE, depends on the feature transformation and class labels alone. [sent-30, score-0.264]
22 Minimizing BE has been recently proposed for feature extraction in speech problems [10]. [sent-31, score-0.3]
23 In particular, it is shown that 1) CPE is a lower bound on BE and 2) this bound is tight, in the sense that the former is a good approximation to the latter. [sent-33, score-0.222]
24 It follows that infomax solutions are near-optimal in the minimum BE sense. [sent-34, score-0.641]
25 While for feature extraction both infomax and BE appear to be difficult to optimize directly, we show that infomax has clear computational advantages for feature selection, particularly in the context of the sequential procedures that are prevalent in the feature selection literature [6]. [sent-35, score-2.032]
26 The analysis of some simple classification problems reveals that a quantity which plays an important role in infomax solutions is the marginal diversity: the average distance between each of the marginal class-conditional densities and their mean. [sent-36, score-1.063]
27 This serves as inspiration for a generic principle for feature selection, the principle of maximum marginal diversity (MMD), which only requires marginal density estimates and can therefore be implemented with extreme computational simplicity. [sent-37, score-1.137]
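One way to formalize the marginal diversity of an individual feature, consistent with the description above (a sketch: the symbols X_k for the k-th feature, Y for the class label, and M for the number of classes are assumptions rather than the paper's exact notation):

```latex
% Marginal diversity of feature X_k: prior-weighted average KL divergence
% between each marginal class-conditional density and their mean (the
% overall marginal density of X_k).
\mathrm{md}(X_k) = \sum_{i=1}^{M} P_Y(i)\,
  \mathrm{KL}\!\left[\, p_{X_k \mid Y}(x \mid i) \,\big\|\, p_{X_k}(x) \,\right],
\qquad
p_{X_k}(x) = \sum_{i=1}^{M} P_Y(i)\, p_{X_k \mid Y}(x \mid i).

% With this definition, md(X_k) coincides with the mutual information
% I(X_k; Y) between the single feature and the class label.
```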
28 While heuristics that are close to the MMD principle have been proposed in the past, very little is known regarding their optimality. [sent-38, score-0.104]
29 In this paper we summarize the main results of a theoretical characterization of the problems for which the principle is guaranteed to be optimal in the infomax sense (see [13] for further details). [sent-39, score-0.812]
30 First, it shows that there is a family of classification problems for which a near-optimal solution, in the BE sense, can be achieved with a computational procedure that does not involve combinatorial search. [sent-41, score-0.098]
31 This is a major improvement, from a computational standpoint, over previous solutions for which some guarantee of optimality (branch and bound search) or near optimality (forward or backward search) is available [6]. [sent-42, score-0.208]
32 Second, when combined with recent studies on the statistics of biologically plausible image transformations [8, 5], it suggests that in the context of visual recognition, MMD feature selection will lead to solutions that are optimal in the infomax sense. [sent-43, score-1.194]
33 2 Infomax vs minimum Bayes error In this section we show that, for classification problems, the infomax principle is closely related to the minimization of Bayes error. [sent-46, score-0.725]
34 1 Under the infomax principle, the optimal organization for a complex multi-layered perceptual system is one where the information that reaches each layer is processed so that the maximum amount of information is preserved for subsequent layers. [sent-48, score-0.624]
35 Theorem 1 Given a classification problem with a set of classes defined on a feature space, the decision function which minimizes the probability of classification error is the Bayes classifier, which assigns each feature vector to the class of largest posterior probability. [sent-49, score-0.241]
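In generic notation (the symbols X for the feature vector, Y for the class label, and M for the number of classes are assumptions, not necessarily the paper's), the Bayes decision rule and the associated Bayes error read:

```latex
% Bayes classifier: assign x to the class of largest posterior probability.
g^{*}(x) = \arg\max_{i \in \{1,\dots,M\}} P_{Y \mid X}(i \mid x).

% Bayes error (BE): the error of g^*, the lowest attainable by any
% classifier operating on X.
L^{*} = 1 - \mathbb{E}_{X}\!\left[ \max_{i} P_{Y \mid X}(i \mid X) \right].
```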
36 Principle 1 (infomax) Consider a multi-class classification problem with observations drawn from a random variable, and a set of candidate feature transformations. [sent-55, score-0.253]
37 The best feature space is the one that maximizes the mutual information between the transformed features and the class indicator variable defined above. [sent-56, score-0.353]
38 Since the class entropy does not depend on the feature transformation, infomax is equivalent to the minimization of the CPE; we next derive a bound that plays a central role in the relationship between this quantity and BE. [sent-59, score-0.685]
39 Furthermore, the bound is tight, in the sense that equality can be attained. [sent-62, score-0.166]
40 The following theorem follows from this bound (equation (3)). [sent-64, score-0.05]
41 Equation (4) is stated in terms of the random vector from which the features are drawn. [sent-65, score-0.066]
42 The two bounds are equal up to an additive constant that quickly decreases to zero as the number of classes grows. [sent-68, score-0.045]
43 It follows that, at least when the number of classes is large, Fano’s inequality really provides a lower bound on the BE, not only on the PE. [sent-69, score-0.152]
44 First, since constants do not change the location of the bound’s extrema, it shows that infomax minimizes a lower bound on BE. [sent-71, score-0.661]
45 Second, unlike Fano’s bound, it sheds considerable light on the relationship between the extrema of the bound and those of the BE. [sent-72, score-0.134]
46 In fact, it is clear from the derivation of the theorem that the only reason why the right-hand (RHS) and left-hand (LHS) sides of (4) differ is the application of (2). [sent-73, score-0.083]
47 First, bound (2) is tight in the sense defined in the lemma. [sent-148, score-0.166]
48 It follows that infomax solutions will, in general, be very similar to those that minimize the BE . [sent-152, score-0.603]
49 3 Feature selection For feature extraction, both infomax and minimum BE are complicated problems that can only be solved up to approximations [9, 11, 10]. [sent-154, score-0.98]
50 We now show that the opposite holds for feature selection, where the minimization of CPE is significantly simpler than that of BE. [sent-156, score-0.25]
51 We start by recalling that, because the number of possible feature subsets in a feature selection problem is combinatorial, feature selection techniques rely on sequential search methods [6]. [sent-157, score-0.939]
52 These methods proceed in a sequence of steps, each adding a set of features to the current best subset, with the goal of optimizing a given cost function (footnote 2). [sent-158, score-0.095]
53 We denote the current best subset, the set of features added at the current step, and the resulting new subset accordingly (a minimal sketch of such a search follows). [sent-159, score-0.166]
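A minimal sketch of this kind of sequential forward search, assuming a generic scoring function to be maximized; the function and parameter names are illustrative, not taken from the paper:

```python
def forward_search(all_features, cost, k):
    """Greedy forward feature selection.

    all_features: iterable of candidate feature indices.
    cost: callable mapping a tuple of selected indices to a score
          (e.g. an estimate of the mutual information with the class
          label); higher is better.
    k: number of features to select.
    """
    selected = []
    remaining = list(all_features)
    for _ in range(k):
        # Add the single feature that most improves the current subset.
        best = max(remaining, key=lambda f: cost(tuple(selected) + (f,)))
        selected.append(best)
        remaining.remove(best)
    return selected


# Usage sketch with a toy additive cost (purely illustrative).
if __name__ == "__main__":
    import random
    random.seed(0)
    weights = [random.random() for _ in range(10)]
    print(forward_search(range(10), lambda s: sum(weights[f] for f in s), k=3))
```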
54 Theorem 3 Consider a classification problem with observations drawn from a random variable and a feature transformation. [sent-160, score-0.225]
55 The theorem gives the condition under which the subset obtained at each step is an infomax feature set. (Footnote 2) These methods are called forward search techniques. [sent-161, score-0.789]
56 There is also an alternative set of backward search techniques, where features are successively removed from an initial set containing all features. [sent-162, score-0.105]
57 Noting that the mutual information can be written as an average of KL divergences between the class-conditional densities and their mean, it clearly favors feature spaces where each class-conditional density is as distant as possible (in the KL sense) from the average among all classes. [sent-167, score-0.196]
58 This is a sensible way to quantify the intuition that optimal discriminant transforms are the ones that best separate the different classes. [sent-168, score-0.13]
59 Equation (6), in turn, leads to an optimal rule for finding the features to merge with the current optimal solution: the set which minimizes the resulting cost. [sent-169, score-0.148]
60 The equation also leads to a straightforward procedure for updating the optimal cost once this set is determined. [sent-170, score-0.07]
61 For this reason, infomax is a better principle for feature selection problems than direct minimization of BE. [sent-174, score-1.075]
62 4 Maximum marginal diversity To gain some intuition for infomax solutions, we next consider the Gaussian problem of Figure 3. [sent-176, score-1.07]
63 Assuming that the two classes have equal prior probabilities, the class-conditional marginals of one of the two features are equal, and that feature does not contain any useful information for classification. [sent-177, score-0.294]
64 On the other hand, because the classes are clearly separated along the other axis, the second feature contains all the information available for discriminating between them. [sent-178, score-0.273]
65 The different discriminating powers of the two variables are reflected by the infomax costs: the uninformative feature leaves the class entropy unchanged, while the discriminative feature drives the conditional entropy close to zero, and (5) recommends the selection of the latter. [sent-179, score-0.789]
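For a two-class problem with equal priors, these two costs work out as follows (a worked sketch with logarithms in bits; the labels for the uninformative and discriminative features are generic placeholders):

```latex
% Uninformative feature: its class-conditional marginals coincide, so the
% posterior equals the prior and no uncertainty about Y is removed.
H(Y \mid X_{\mathrm{uninf}}) = H(Y) = 1 \text{ bit}
\;\Rightarrow\; I(X_{\mathrm{uninf}}; Y) = 0.

% Discriminative feature: the classes are well separated along this axis,
% so the posterior is nearly deterministic.
H(Y \mid X_{\mathrm{disc}}) \approx 0
\;\Rightarrow\; I(X_{\mathrm{disc}}; Y) \approx 1 \text{ bit},

% and minimizing the conditional entropy, as in (5), selects the latter.
```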
66 The key advantage of infomax is that it emphasizes marginal diversity. [sent-181, score-0.746]
67 The intuition conveyed by the example above can be easily transformed into a generic principle for feature selection. [sent-185, score-0.36]
68 Principle 2 (Maximum marginal diversity) The best solution for a feature selection problem is to select the subset of features that leads to a set of maximally diverse marginal densities. [sent-186, score-0.852]
69 Figure (right panel): marginals of the two features. [sent-204, score-0.053]
70 First, it is inherently discriminant, recommending the elimination of the dimensions along which the projections of the class densities are most similar. [sent-208, score-0.079]
71 Algorithm 1 (MMD feature selection) For a classification problem with a given set of features, classes, and class priors, the following procedure returns the top MMD features. [sent-210, score-0.346]
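A minimal sketch of such a procedure, assuming histogram estimates of the per-feature marginal densities; the function names, binning choices, and interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def marginal_diversity(X, y, n_bins=32):
    """Estimate the marginal diversity of each feature.

    X: (n_samples, n_features) array of feature values.
    y: (n_samples,) array of integer class labels.
    Returns md with md[k] = sum_i P(Y=i) * KL(p(x_k|Y=i) || p(x_k)),
    estimated from per-feature histograms.
    """
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / n
    md = np.zeros(d)
    eps = 1e-12
    for k in range(d):
        edges = np.histogram_bin_edges(X[:, k], bins=n_bins)
        # Class-conditional marginal histograms for feature k.
        p_cond = np.array(
            [np.histogram(X[y == c, k], bins=edges)[0] + eps for c in classes],
            dtype=float,
        )
        p_cond /= p_cond.sum(axis=1, keepdims=True)
        # Overall marginal = prior-weighted mixture of the conditionals.
        p_marg = priors @ p_cond
        kl = (p_cond * np.log(p_cond / p_marg)).sum(axis=1)
        md[k] = priors @ kl
    return md

def mmd_select(X, y, n_select, n_bins=32):
    """Indices of the n_select features with largest marginal diversity."""
    return np.argsort(marginal_diversity(X, y, n_bins))[::-1][:n_select]
```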
72 In general, there are no guarantees that MMD will lead to the infomax solution. [sent-216, score-0.554]
73 In [13] we seek a precise characterization of the problems where MMD is indeed equivalent to infomax. [sent-217, score-0.073]
74 Theorem 4 Consider a classification problem with class labels drawn from a random variable and features drawn from a random vector, and let a given subset of features be the optimal feature subset of its size in the infomax sense. [sent-219, score-0.946]
75 The formal conditions are given by (8) and (9); the theorem states that the MMD and infomax solutions will be identical when the mutual information between features is not affected by knowledge of the class label. [sent-221, score-0.817]
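One way to write this condition, offered as a hedged reading of the sentence above rather than the paper's exact equation (8): conditioning on the class label should leave the dependence between a candidate feature and the already selected ones unchanged,

```latex
I\big(X_k ; X_S\big) \;=\; I\big(X_k ; X_S \mid Y\big)
\qquad \text{for the selected subset } X_S \text{ and each candidate feature } X_k,
```

in which case ranking features by their marginal diversities recovers the infomax ordering.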
76 This is an interesting condition in light of recent studies that have reported the observation of consistent patterns of dependence between the features of various biologically plausible image transformations [8, 5]. [sent-222, score-0.177]
77 Even though the details of feature dependence will vary from one image class to the next, these studies suggest that the coarse structure of the patterns of dependence between such features follows universal statistical laws that hold for all types of images. [sent-223, score-0.355]
78 The potential implications of this conjecture are quite significant. [sent-224, score-0.052]
79 Figure axis labels: cumulative marginal diversity, classification rate, Jain/Zongker score. [sent-231, score-0.484]
80 Figure axis labels: sample size (panel a) and number of features. [sent-245, score-0.066]
81 First, it suggests that, in the context of visual processing, (8) will be approximately true and the MMD principle will consequently lead to solutions that are very close to optimal in the minimum BE sense. [sent-248, score-0.304]
82 Given the simplicity of MMD feature selection, this is quite remarkable. [sent-249, score-0.196]
83 Second, it implies that when combined with such transformations, the marginal diversity is a close predictor for the CPE (and consequently the BE) achievable in a given feature space. [sent-250, score-0.742]
84 This enables quantifying the goodness of the transformation without even having to build the classifier. [sent-251, score-0.061]
85 5 Experimental results In this section we present results showing that 1) MMD feature selection outperforms combinatorial search when (8) holds, and 2) in the context of visual recognition problems, marginal diversity is a good predictor of PE. [sent-253, score-1.1]
86 We start by reporting results on a synthetic problem, introduced by Trunk to illustrate the curse of dimensionality [12], and used by Jain and Zongker (JZ) to evaluate various feature selection procedures [6]. [sent-254, score-0.42]
87 It consists of two Gaussian classes with identity covariance and prescribed means, and is an interesting benchmark for feature selection because it has a clear optimal solution for the best subset of features (the first ones) for any subset size. [sent-255, score-0.587]
88 JZ exploited this property to propose an automated procedure for testing the performance of feature selection algorithms across variations in dimensionality of the feature space and sample size. [sent-256, score-0.548]
89 We repeated their experiments, simply replacing the cost function they used (the Mahalanobis distance, MDist, between the means) by the marginal diversity. [sent-257, score-0.221]
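A sketch of this synthetic benchmark and the MMD ranking on it, assuming the commonly cited form of Trunk's problem (two Gaussian classes with identity covariance and means ±μ, with μ_k = 1/√k); the constants and sample sizes are illustrative, and `mmd_select` refers to the sketch given after Algorithm 1 above:

```python
import numpy as np

def trunk_data(n_per_class=200, d=20, seed=0):
    """Sample the (assumed) Trunk problem: two Gaussians with identity
    covariance and opposite means whose k-th component is 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    mu = 1.0 / np.sqrt(np.arange(1, d + 1))
    X0 = rng.normal(loc=+mu, scale=1.0, size=(n_per_class, d))
    X1 = rng.normal(loc=-mu, scale=1.0, size=(n_per_class, d))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class, int), np.ones(n_per_class, int)])
    return X, y

# The optimal size-m subset is the first m features (largest mean separation),
# so a good selector should rank the low feature indices first.
X, y = trunk_data()
print(mmd_select(X, y, n_select=5))  # ideally close to [0, 1, 2, 3, 4]
```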
90 Figure 4 a) presents the JZ score obtained with MMD as a function of the sample size. [sent-258, score-0.087]
91 A comparison with Figure 5 of [6] shows that these results are superior to all those obtained by JZ, including the ones relying on branch and bound. [sent-259, score-0.059]
92 This is remarkable, since branch and bound is guaranteed to find the optimal solution and the Mdist is inversely proportional to the PE for Gaussian classes. [sent-260, score-0.175]
93 We believe that the superiority of MMD is due to the fact that it only requires estimates of the marginals, while the MDist requires estimates of joint densities and is therefore much more susceptible to the curse of dimensionality. [sent-261, score-0.068]
94 In any case, this problem is a good example of situations where, because (8) holds, MMD will find the optimal solution at a computational cost that is several orders of magnitude smaller than standard procedures based on combinatorial search (e.g., branch and bound). [sent-263, score-0.211]
95 Figures 4 b) and c) show that, for problems involving commonly used image transformations, marginal diversity is indeed a good predictor of classification accuracy. [sent-266, score-0.583]
96 The figures compare, for each space dimension, the recognition accuracy of a complete texture recognition system with the predictions provided by marginal diversity. [sent-267, score-0.315]
97 Recognition accuracy was measured on the Brodatz texture database, using a feature space consisting of the coefficients of a multiresolution decomposition computed over pixel regions. [sent-268, score-0.282]
98 Three transformations were considered: the discrete cosine transform, principal component analysis, and a three-level wavelet decomposition (see [14] for a detailed description of the experimental setup). [sent-269, score-0.116]
99 The classifier was based on Gauss mixtures and marginal diversity was computed with Algorithm 1. [sent-270, score-0.484]
100 Note that the curves of cumulative marginal diversity are qualitatively very similar to those of recognition accuracy. [sent-271, score-0.586]
wordName wordTfidf (topN-words)
[('infomax', 0.554), ('mmd', 0.372), ('diversity', 0.292), ('feature', 0.196), ('marginal', 0.192), ('lhs', 0.185), ('selection', 0.156), ('cpe', 0.142), ('dd', 0.13), ('classi', 0.106), ('principle', 0.104), ('jz', 0.101), ('fano', 0.093), ('rhs', 0.093), ('cd', 0.087), ('pe', 0.082), ('bound', 0.075), ('mdist', 0.07), ('extraction', 0.068), ('features', 0.066), ('uh', 0.065), ('combinatorial', 0.062), ('branch', 0.059), ('wavelet', 0.059), ('mutual', 0.059), ('bayes', 0.058), ('criteria', 0.057), ('transformations', 0.057), ('discriminant', 0.057), ('gh', 0.057), ('cation', 0.056), ('marginals', 0.053), ('tight', 0.051), ('theorem', 0.05), ('subset', 0.05), ('solutions', 0.049), ('px', 0.049), ('visual', 0.048), ('cdd', 0.047), ('cddd', 0.047), ('feffc', 0.047), ('foreach', 0.047), ('fort', 0.047), ('nuno', 0.047), ('recommends', 0.047), ('vasconcelos', 0.047), ('classes', 0.045), ('texture', 0.043), ('optimality', 0.042), ('optimal', 0.041), ('dct', 0.041), ('wf', 0.041), ('brodatz', 0.041), ('sense', 0.04), ('densities', 0.04), ('procedures', 0.04), ('recognition', 0.04), ('context', 0.039), ('search', 0.039), ('class', 0.039), ('transform', 0.038), ('minimum', 0.038), ('entropy', 0.038), ('characterization', 0.037), ('bm', 0.037), ('denver', 0.037), ('cumulative', 0.036), ('predictor', 0.036), ('problems', 0.036), ('standpoint', 0.034), ('clear', 0.033), ('pop', 0.032), ('extrema', 0.032), ('discriminating', 0.032), ('goodness', 0.032), ('intuition', 0.032), ('lower', 0.032), ('colorado', 0.031), ('jain', 0.031), ('visualization', 0.031), ('cost', 0.029), ('maximum', 0.029), ('minimization', 0.029), ('transformation', 0.029), ('curse', 0.028), ('generic', 0.028), ('middle', 0.028), ('blind', 0.027), ('collins', 0.027), ('image', 0.027), ('relationship', 0.027), ('studies', 0.027), ('curves', 0.026), ('plots', 0.026), ('connection', 0.026), ('consequently', 0.026), ('implications', 0.026), ('conjecture', 0.026), ('holds', 0.025), ('separation', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 89 nips-2002-Feature Selection by Maximum Marginal Diversity
Author: Nuno Vasconcelos
Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.
2 0.15657136 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
3 0.13150051 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
4 0.13079387 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.086740419 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
Author: Elad Schneidman, William Bialek, Michael Ii
Abstract: A population of neurons typically exhibits a broad diversity of responses to sensory inputs. The intuitive notion of functional classification is that cells can be clustered so that most of the diversity is captured by the identity of the clusters rather than by individuals within clusters. We show how this intuition can be made precise using information theory, without any need to introduce a metric on the space of stimuli or responses. Applied to the retinal ganglion cells of the salamander, this approach recovers classical results, but also provides clear evidence for subclasses beyond those identified previously. Further, we find that each of the ganglion cells is functionally unique, and that even within the same subclass only a few spikes are needed to reliably distinguish between cells. 1
6 0.086091027 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
7 0.085994579 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
8 0.084886305 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
9 0.083225064 92 nips-2002-FloatBoost Learning for Classification
10 0.081369825 181 nips-2002-Self Supervised Boosting
11 0.081003316 10 nips-2002-A Model for Learning Variance Components of Natural Images
12 0.079688482 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
13 0.079518117 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
14 0.079017311 114 nips-2002-Information Regularization with Partially Labeled Data
15 0.077890255 55 nips-2002-Combining Features for BCI
16 0.075679787 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
17 0.072335601 124 nips-2002-Learning Graphical Models with Mercer Kernels
18 0.070802703 53 nips-2002-Clustering with the Fisher Score
19 0.068924151 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
20 0.064968124 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
topicId topicWeight
[(0, -0.212), (1, -0.083), (2, 0.029), (3, 0.039), (4, 0.111), (5, -0.007), (6, -0.052), (7, -0.075), (8, -0.047), (9, 0.04), (10, 0.021), (11, 0.018), (12, 0.033), (13, 0.077), (14, 0.056), (15, -0.047), (16, 0.002), (17, -0.079), (18, -0.046), (19, 0.061), (20, -0.041), (21, 0.049), (22, 0.014), (23, -0.063), (24, -0.083), (25, -0.017), (26, -0.117), (27, -0.069), (28, 0.151), (29, 0.021), (30, 0.007), (31, -0.059), (32, -0.017), (33, -0.273), (34, -0.037), (35, -0.013), (36, 0.057), (37, 0.029), (38, 0.045), (39, 0.038), (40, 0.127), (41, -0.116), (42, -0.055), (43, -0.084), (44, 0.017), (45, -0.075), (46, -0.076), (47, 0.003), (48, -0.083), (49, 0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.96136576 89 nips-2002-Feature Selection by Maximum Marginal Diversity
Author: Nuno Vasconcelos
Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.
2 0.79463619 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
3 0.60201037 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
4 0.56381273 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.54005581 55 nips-2002-Combining Features for BCI
Author: Guido Dornhege, Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Recently, interest is growing to develop an effective communication interface connecting the human brain to a computer, the ’Brain-Computer Interface’ (BCI). One motivation of BCI research is to provide a new communication channel substituting normal motor output in patients with severe neuromuscular disabilities. In the last decade, various neurophysiological cortical processes, such as slow potential shifts, movement related potentials (MRPs) or event-related desynchronization (ERD) of spontaneous EEG rhythms, were shown to be suitable for BCI, and, consequently, different independent approaches of extracting BCI-relevant EEG-features for single-trial analysis are under investigation. Here, we present and systematically compare several concepts for combining such EEG-features to improve the single-trial classification. Feature combinations are evaluated on movement imagination experiments with 3 subjects where EEG-features are based on either MRPs or ERD, or both. Those combination methods that incorporate the assumption that the single EEG-features are physiologically mutually independent outperform the plain method of ’adding’ evidence where the single-feature vectors are simply concatenated. These results strengthen the hypothesis that MRP and ERD reflect at least partially independent aspects of cortical processes and open a new perspective to boost BCI effectiveness.
6 0.52384812 124 nips-2002-Learning Graphical Models with Mercer Kernels
7 0.51934713 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
8 0.48005942 53 nips-2002-Clustering with the Fisher Score
9 0.47090825 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
10 0.46172017 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
11 0.45781627 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
12 0.45608976 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
13 0.4478251 114 nips-2002-Information Regularization with Partially Labeled Data
14 0.42277354 92 nips-2002-FloatBoost Learning for Classification
15 0.41128674 142 nips-2002-Maximum Likelihood and the Information Bottleneck
16 0.40835929 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
17 0.40168026 178 nips-2002-Robust Novelty Detection with Single-Class MPM
18 0.39163524 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
19 0.38925385 181 nips-2002-Self Supervised Boosting
20 0.38716811 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
topicId topicWeight
[(3, 0.02), (11, 0.012), (23, 0.017), (38, 0.247), (42, 0.056), (54, 0.128), (55, 0.049), (67, 0.022), (68, 0.02), (74, 0.147), (92, 0.057), (98, 0.124)]
simIndex simValue paperId paperTitle
1 0.85699141 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
Author: Jean-philippe Vert, Minoru Kanehisa
Abstract: We present an algorithm to extract features from high-dimensional gene expression profiles, based on the knowledge of a graph which links together genes known to participate to successive reactions in metabolic pathways. Motivated by the intuition that biologically relevant features are likely to exhibit smoothness with respect to the graph topology, the algorithm involves encoding the graph and the set of expression profiles into kernel functions, and performing a generalized form of canonical correlation analysis in the corresponding reproducible kernel Hilbert spaces. Function prediction experiments for the genes of the yeast S. Cerevisiae validate this approach by showing a consistent increase in performance when a state-of-the-art classifier uses the vector of features instead of the original expression profile to predict the functional class of a gene.
same-paper 2 0.84096253 89 nips-2002-Feature Selection by Maximum Marginal Diversity
Author: Nuno Vasconcelos
Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.
3 0.68572211 2 nips-2002-A Bilinear Model for Sparse Coding
Author: David B. Grimes, Rajesh P. Rao
Abstract: Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformationinvariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.
4 0.68175554 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik
Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.
5 0.68166208 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
6 0.68016267 27 nips-2002-An Impossibility Theorem for Clustering
7 0.6787374 74 nips-2002-Dynamic Structure Super-Resolution
8 0.67779535 53 nips-2002-Clustering with the Fisher Score
9 0.67736173 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
10 0.67645836 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
11 0.676117 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.67602098 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
13 0.67438114 135 nips-2002-Learning with Multiple Labels
14 0.67382759 124 nips-2002-Learning Graphical Models with Mercer Kernels
15 0.67352843 39 nips-2002-Bayesian Image Super-Resolution
16 0.67259842 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
17 0.67167711 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
18 0.67156523 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
19 0.67129678 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
20 0.67028487 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond