
90 nips-2002-Feature Selection in Mixture-Based Clustering


Source: pdf

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-information-based feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Figueiredo, Instituto de Telecomunicações, Instituto Superior Técnico, 1049-001 Lisboa, Portugal. Abstract: There exist many approaches to clustering, but the important issue of feature selection, i. [sent-10, score-0.345]

2 , selecting the data attributes that are relevant for clustering, is rarely addressed. [sent-12, score-0.144]

3 Feature selection for clustering is difficult due to the absence of class labels. [sent-13, score-0.453]

4 We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. [sent-14, score-0.494]

5 In the first one, instead of making hard selections, we estimate feature saliencies. [sent-15, score-0.345]

6 The second approach extends Koller and Sahami’s mutual-information-based feature relevance criterion to the unsupervised case. [sent-17, score-0.584]

7 Feature selection is then carried out by a backward search scheme. [sent-18, score-0.228]

8 This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. [sent-19, score-0.516]

9 However, not all the features are useful in constructing the partitions: some features may be just noise, thus not contributing to (or even degrading) the clustering process. [sent-22, score-0.688]

10 The task of selecting the “best” feature subset, known as feature selection (FS), is therefore an important task. [sent-23, score-0.885]

11 In addition, FS may lead to more economical clustering algorithms (both in storage and computation) and, in many cases, it may contribute to the interpretability of the models. [sent-24, score-0.236]

12 FS is particularly relevant for data sets with large numbers of features; e. [sent-25, score-0.06]

13 , on the order of thousands as seen in some molecular biology [22] and text clustering applications [21]. [sent-27, score-0.236]

14 In supervised learning, FS has been widely studied, with most methods falling into two classes: filters, which work independently of the subsequent learning algorithm; and wrappers, which use the learning algorithm to evaluate feature subsets [12]. [sent-28, score-0.49]

15 In contrast, FS has received little attention in clustering, mainly because, without class labels, it is unclear how to assess feature relevance. [sent-29, score-0.424]

16 The problem is even more difficult when the number of clusters is unknown, since the number of clusters and the best feature subset are inter-related [6]. [sent-30, score-0.544]

17 Some approaches to FS in clustering have been proposed. [sent-31, score-0.236]

18 Dy and Brodley [6] suggested a heuristic to compare feature subsets, using cluster separability. [sent-45, score-0.345]

19 A Bayesian approach for multinomial mixtures was proposed in [21]; another Bayesian approach using a shrinkage prior was considered in [8]. [sent-46, score-0.102]

20 Dash and Liu [4] assess the clustering tendency of each feature by an entropy index. [sent-47, score-0.788]

21 Finally, Devaney and Ram [5] use a notion of “category utility” for FS in conceptual clustering, and Modha and Scott-Spangler [17] assign weights to feature groups with a score similar to Fisher discrimination. [sent-50, score-0.407]

22 In this paper, we introduce two new FS approaches for mixture-based clustering [10, 15]. [sent-51, score-0.236]

23 The first is based on a feature saliency measure which is obtained by an EM algorithm; unlike most FS methods, this does not involve any explicit search. [sent-52, score-0.704]

24 The second approach extends the mutual-information based criterion of [13] to the unsupervised context; it is a wrapper, since FS is wrapped around a basic mixture estimation algorithm. [sent-53, score-0.38]

25 Each is a -dimensional feature and all components have the same form (e. [sent-59, score-0.387]

26 In the sequel, we will use the indices , and to run through data points, mixture components, and features, respectively. [sent-72, score-0.106]

27 Assume now that some features are irrelevant, in the following sense: if feature is irrelevant, then , for , where is the common (i. [sent-73, score-0.571]

28 Let be a set of binary parameters, such that if feature is relevant and otherwise; then [equation (5)]. [sent-76, score-0.405]

29 Our approach consists of: (i) treating the ’s as missing variables rather than as parameters; (ii) estimating from the data; is the probability that the -th feature is useful, which we call its saliency. [sent-78, score-0.437]

30 The resulting mixture model (see proof in [14]) is given in equation (6). The form of reflects our prior knowledge about the distribution of the non-salient features. [sent-79, score-0.106]
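
A plausible reconstruction of the saliency-based mixture density referred to as equation (6), written as a sketch with assumed notation (alpha_j: mixing weights, rho_l: saliency of feature l, p(.|theta_jl): component-specific density, q(.|lambda_l): common density); the paper's exact expression may differ:

    p(\mathbf{y}\mid\Theta) \;=\; \sum_{j=1}^{K} \alpha_j \prod_{l=1}^{d}
        \Bigl[\, \rho_l\, p(y_l \mid \theta_{jl}) \;+\; (1-\rho_l)\, q(y_l \mid \lambda_l) \,\Bigr]

Under this form, rho_l = 1 means feature l is always generated by its component-specific density, while rho_l = 0 means it is always generated by the common density and so carries no cluster information.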

31 As in a standard finite mixture, we first select the component label by sampling from a multinomial distribution with parameters . [sent-84, score-0.096]

32 Then, for each feature , we flip a biased coin whose probability of getting a head is ; if we get a head, we use the mixture component to generate the -th feature; otherwise, the common component is used. [sent-85, score-0.639]
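
The generative process just described (pick a component, then per feature flip a biased coin to choose between that component's density and the common density) can be sketched as follows. This is only an illustration: the mixing weights, component means, saliencies, and common density below are hypothetical, not the paper's values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical model: 4 components, 10 features, per-feature saliencies rho.
    alpha = np.array([0.25, 0.25, 0.25, 0.25])           # mixing weights
    means = rng.normal(0.0, 5.0, size=(4, 10))           # component means (hypothetical)
    rho = np.array([0.9, 0.9] + [0.1] * 8)               # saliency of each feature
    common_mean, common_std = 0.0, 3.0                   # common density for non-salient features

    def sample_point():
        j = rng.choice(len(alpha), p=alpha)              # 1) sample the component label
        y = np.empty(len(rho))
        for l, r in enumerate(rho):
            if rng.random() < r:                         # 2) biased coin, P(head) = rho_l
                y[l] = rng.normal(means[j, l], 1.0)      #    head: use component j's density
            else:
                y[l] = rng.normal(common_mean, common_std)  # tail: use the common density
        return j, y

    labels, data = zip(*(sample_point() for _ in range(800)))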

33 1 Model Selection Standard EM for mixtures exhibits some weaknesses which also affect the EM algorithm just mentioned: it requires knowledge of , and a good initialization is essential for reaching a good local optimum. [sent-89, score-0.166]

34 To overcome these difficulties, we adopt the approach in [9], which is based on the MML criterion [23, 24]. [sent-90, score-0.074]

35 The MML criterion for the proposed model (see details in [14]) consists of minimizing, with respect to , the following cost function. [sent-91, score-0.114]

36 From a parameter estimation viewpoint, this is equivalent to a MAP estimate with conjugate (improper) Dirichlet-type priors on the ’s and ’s (see details in [14]); thus, the EM algorithm undergoes a minor modification in the M-step, which still has a closed form. [sent-97, score-0.115]
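
The "minor modification in the M-step" mentioned here can be illustrated by the component-annihilation update used in the MML approach of [9]; the following is a hedged sketch (assuming w_ij are the posterior responsibilities and N_j is the number of parameters of the j-th component), not necessarily the paper's exact update:

    \hat{\alpha}_j \;=\;
        \frac{\max\!\left\{0,\ \sum_{i=1}^{n} w_{ij} - \tfrac{N_j}{2}\right\}}
             {\sum_{k=1}^{K}\max\!\left\{0,\ \sum_{i=1}^{n} w_{ik} - \tfrac{N_k}{2}\right\}}

Components whose numerator reaches zero are annihilated, which is what allows the algorithm to be started from a deliberately large number of components; presumably an analogous penalized update drives non-salient feature saliencies to zero.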

37 For the -th feature in the -th component, the “effective” number of data points for estimating is . [sent-100, score-0.381]

38 Similarly, for the -th feature in the common component, the number of effective data points for estimation is . [sent-102, score-0.416]

39 When goes to zero, the -th feature is no longer salient and and are removed. [sent-107, score-0.345]

40 Finally, since the model selection algorithm determines the number of components, it can be initialized with a large value of , thus alleviating the need for a good initialization [9]. [sent-109, score-0.308]

41 2 Experiments and Results The first data set considered consists of 800 points from a mixture of 4 equiprobable Gaussians with mean vectors , , , , and identity covariance matrices. [sent-113, score-0.175]

42 Eight “noisy” features (sampled from a density) were appended to this data, yielding a set of 800 10-D patterns. [sent-114, score-0.277]

43 The proposed algorithm was run 10 times, each initialized with ; the common component is initialized to cover all data, and the feature saliencies are initialized at 0. [sent-115, score-0.796]

44 The saliencies of all the ten features, together with their standard deviations (error bars), are shown in Fig. [sent-118, score-0.103]

45 We conclude that, in this case, the algorithm successfully locates the clusters and correctly assigns the feature saliencies. [sent-120, score-0.478]

46 Figure 1: Feature saliency for the 10-D, 4-component data (x-axis: feature number). Figure 2: Feature saliency for the Trunk Gaussian mixture. [sent-135, score-0.718]

47 The smaller the feature number, the more important is the feature. [sent-139, score-0.345]

48 Note that these features have a descending order of relevance. [sent-142, score-0.268]

49 The values of the feature saliencies are shown in Fig. [sent-145, score-0.448]

50 We see the general trend that as the feature number increases, the saliency decreases, following the true characteristics of the data. [sent-147, score-0.746]

51 Feature saliency values were also computed for the “wine” data set (available at the UCI repository at www. [sent-150, score-0.39]

52 After standardizing all features to zero mean and unit variance, we applied the LNKnet supervised feature selection algorithm (available at www. [sent-155, score-0.83]

53 The nine features selected by LNKnet are 7, 13, 1, 5, 10, 2, 12, 6, 9. [sent-159, score-0.226]

54 Our feature saliency algorithm (with no class labels) yielded the values in Table 1 (feature saliency of the wine data). [sent-160, score-1.217]

55 Ranking the features in descending order of saliency, we get the ordering: 7, 12, 6, 1, 9, 11, 10, 13, 2, 8, 4, 5, 3. [sent-174, score-0.268]

56 The top 5 features (7, 12, 6, 1, 9) are all in the subset selected by LNKnet. [sent-175, score-0.299]

57 If we skip the sixth feature (11), the next three features (10, 13, 2) are also among those selected by LNKnet. [sent-176, score-0.571]

58 Thus we can see that for this data set, our algorithm, though totally unsupervised, performs comparably with a supervised feature selection algorithm. [sent-177, score-0.564]
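
The overlap claimed in the last three sentences can be checked directly from the two lists quoted above; a small sketch (1-based wine feature indices as quoted):

    # Supervised selection (LNKnet) and the unsupervised saliency ranking, as quoted above.
    lnknet = {7, 13, 1, 5, 10, 2, 12, 6, 9}
    saliency_order = [7, 12, 6, 1, 9, 11, 10, 13, 2, 8, 4, 5, 3]   # descending saliency

    for k in (5, 9):
        top_k = saliency_order[:k]
        overlap = sum(f in lnknet for f in top_k)
        print(f"top-{k} by saliency: {top_k} -> {overlap} also chosen by LNKnet")
    # top-5: all 5 are in the LNKnet subset; top-9: 8 of 9 (feature 11 is the exception)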

59 4 A Feature Selection Wrapper Our second approach is more traditional in the sense that it selects a feature subset, instead of estimating feature saliency. [sent-178, score-0.69]

60 The number of mixture components is assumed known a priori, though no restriction on the covariance of the Gaussian components is imposed. [sent-179, score-0.223]

61 1 Irrelevant Features and Conditional Independence Assume that the class labels, , and the full feature vector, , follow some joint probability . [sent-181, score-0.383]

62 Recall that the (see (3)) are posterior class probabilities, Prob class . [sent-184, score-0.076]

63 If is a completely irrelevant feature subset, then equals exactly, because of the conditional independence in (9) applied to (3). [sent-186, score-0.427]

64 In practice, such features rarely exist, though they do exhibit different degrees of irrelevance. [sent-187, score-0.264]

65 As both and are probabilities, a natural criterion for assessing their closeness is the expected value of the Kullback-Leibler divergence (KLD, [3]). [sent-189, score-0.074]

66 This criterion is computed as a sample mean, equation (11), in our case. [sent-190, score-0.074]

67 A low value of indicates that the features in are “almost” conditionally independent from the expected class labels, given the features in . [sent-191, score-0.566]
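
A hedged sketch of how such a sample-mean KLD criterion can be computed, given two N x K posterior-probability matrices (rows summing to one), one from the full feature set and one from a reduced set; the exact estimator and KL direction used in the paper may differ:

    import numpy as np

    def mean_kld(post_full, post_reduced, eps=1e-12):
        """Sample mean over data points of KL(post_full[i] || post_reduced[i])."""
        p = np.clip(post_full, eps, 1.0)
        q = np.clip(post_reduced, eps, 1.0)
        return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))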

68 At each stage, we find the feature such that is smallest and add it to . [sent-193, score-0.345]

69 EM is then run again, using the features not in , to update the posterior probabilities . [sent-194, score-0.226]

70 The process is then repeated until only one feature remains, in what can be considered as a backward search algorithm that yields a sorting of the features by decreasing order of irrelevance. [sent-195, score-0.723]
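
A minimal sketch of the backward search described in the last three sentences. The helper fit_mixture_posteriors is assumed (it fits the mixture by EM on the given columns and returns the N x K posterior matrix), and mean_kld is the sketch above. For simplicity the candidate posteriors are obtained by refitting, whereas the paper appears to derive them from the same fitted mixture via the conditional-independence factorization, so this is an approximation of the scheme, not the authors' implementation.

    def backward_feature_sort(X, fit_mixture_posteriors, mean_kld):
        """Return feature indices sorted from most irrelevant to most relevant."""
        remaining = list(range(X.shape[1]))
        irrelevant = []
        post = fit_mixture_posteriors(X[:, remaining])        # EM on all current features
        while len(remaining) > 1:
            # Pick the feature whose removal perturbs the posteriors the least (smallest KLD).
            scores = []
            for f in remaining:
                keep = [g for g in remaining if g != f]
                scores.append((mean_kld(post, fit_mixture_posteriors(X[:, keep])), f))
            _, worst = min(scores)
            remaining.remove(worst)
            irrelevant.append(worst)
            post = fit_mixture_posteriors(X[:, remaining])    # re-run EM without the removed feature
        return irrelevant + remaining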

71 2 The assignment entropy Given a method to sort the features in the order of relevance, we now require a method to measure how good each subset is. [sent-197, score-0.528]

72 Unlike in supervised learning, we cannot resort to classification accuracy. [sent-198, score-0.07]

73 We adopt the criterion that a clustering is good if the clusters are “crisp”, i. [sent-199, score-0.404]

74 A natural way to formalize this is to consider the mean entropy of the ; that is, the clustering is considered to be good if is small. [sent-202, score-0.466]

75 In the sequel, we call “the entropy of the assignment”. [sent-203, score-0.166]
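
The assignment entropy defined here is straightforward to compute; a minimal sketch for an N x K posterior matrix w (rows summing to one):

    import numpy as np

    def assignment_entropy(w, eps=1e-12):
        """Mean Shannon entropy of the per-point component posteriors; small values mean crisp clusters."""
        w = np.clip(w, eps, 1.0)
        return float(-np.mean(np.sum(w * np.log(w), axis=1)))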

76 An important characteristic of the entropy is that it cannot increase when more features are used (because, for any random variables , , and , , a fundamental inequality of information theory [3]; note that is a conditional entropy ). [sent-204, score-0.558]

77 Moreover, exhibits a diminishing returns behavior (decreasing abruptly as the most relevant features are included, but changing little when less relevant features are used). [sent-205, score-0.633]

78 Of course, during the backward search, one can also consider picking the next feature whose removal least increases , rather than the one yielding the smallest KLD; both options are explored in the experiments. [sent-207, score-0.482]

79 Finally, we mention that other minimum-entropy-type criteria have been recently used for clustering [7], [18], but not for feature selection. [sent-208, score-0.581]

80 3 Experiments We have conducted experiments on data sets commonly used for supervised learning tasks. [sent-210, score-0.07]

81 Since we are doing unsupervised learning, the class labels are, of course, withheld and only used for evaluation. [sent-211, score-0.234]

82 The two heuristics for selecting the next feature to be removed (based on minimum KLD and minimum entropy) are considered in different runs. [sent-212, score-0.539]

83 To assess clustering quality, we assign each data point to the Gaussian component that most likely generated it and then compare this labelling with the ground-truth. [sent-213, score-0.367]
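
The evaluation just described (hard-assign each point to its most probable component, then compare with the ground truth) can be sketched as below. The mapping from components to classes is done here by majority vote within each component, which is an assumption; the paper does not spell out its matching rule.

    import numpy as np

    def clustering_error(posteriors, true_labels):
        """Error rate after hard assignment and majority-vote matching of components to classes."""
        hard = np.argmax(posteriors, axis=1)                  # most likely component per point
        true_labels = np.asarray(true_labels)                 # assumes integer class labels from 0
        errors = 0
        for j in np.unique(hard):
            members = true_labels[hard == j]
            errors += len(members) - np.bincount(members).max()  # points disagreeing with the majority
        return errors / len(true_labels)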

84 3 reveal that the general trend of the error rate agrees well with . [sent-216, score-0.088]

85 The error rates either have a minimum close to the “knee” of the H curve, or the curve becomes flat. [sent-217, score-0.086]

86 The two heuristics for selecting the feature to be removed perform comparably. [sent-218, score-0.426]

87 For the cover type data set, the DKL heuristic yields lower error rates than the one based on , while the contrary happens for image segmentation and WBC datasets. [sent-219, score-0.169]

88 5 Concluding Remarks and Future Work The two approaches for unsupervised feature selection herein proposed have different advantages and drawbacks. [sent-220, score-0.629]

89 The first approach avoids explicit feature search and does not require a pre-specified number of clusters; however, it assumes that the features are conditionally independent, given the components. [sent-221, score-0.68]

90 Several issues require further work: weakly relevant features (in the sense of [12]) are not removed by the first algorithm while the second approach relies on a good initial clustering. [sent-224, score-0.392]

91 [Table of data sets — no. of points, no. of features, no. of classes: cover type 2000, 10, 4; image segmentation 1000, 18, 7.] [sent-231, score-0.123]

95 Figure 3 (plots of Entropy and % Error against No. of features): (a) and (b): cover type; (c) and (d): image segmentation; (e) and (f): WBC; (g) and (h): wine. [sent-239, score-0.3]

96 Feature removal by minimum KLD (left column) and minimum entropy (right column). [sent-240, score-0.12]

97 Feature subset selection and order identification for unsupervised learning. [sent-282, score-0.357]

98 Bayesian feature weighting for unsupervised learning, with application to object recognition. [sent-297, score-0.48]

99 Generalized model selection for unsupervised learning in high dimensions. [sent-388, score-0.284]

100 MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. [sent-413, score-0.236]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('saliency', 0.359), ('feature', 0.345), ('fs', 0.331), ('clustering', 0.236), ('features', 0.226), ('em', 0.167), ('entropy', 0.166), ('selection', 0.149), ('kld', 0.138), ('wbc', 0.138), ('unsupervised', 0.135), ('mixture', 0.106), ('pami', 0.105), ('mml', 0.103), ('saliencies', 0.103), ('figueiredo', 0.09), ('wrapper', 0.09), ('irrelevant', 0.082), ('wine', 0.076), ('conditionally', 0.076), ('criterion', 0.074), ('cover', 0.074), ('subset', 0.073), ('gi', 0.073), ('supervised', 0.07), ('devaney', 0.069), ('lnknet', 0.069), ('michigan', 0.069), ('modha', 0.069), ('jain', 0.068), ('hi', 0.066), ('clusters', 0.063), ('icml', 0.063), ('labels', 0.061), ('component', 0.06), ('trunk', 0.06), ('dash', 0.06), ('instituto', 0.06), ('wallace', 0.06), ('wrappers', 0.06), ('relevant', 0.06), ('initialized', 0.058), ('aa', 0.054), ('missing', 0.052), ('yielding', 0.051), ('segmentation', 0.049), ('backward', 0.046), ('error', 0.046), ('selecting', 0.046), ('symbolic', 0.046), ('dy', 0.046), ('bars', 0.045), ('pdf', 0.044), ('descending', 0.042), ('trend', 0.042), ('sequel', 0.042), ('components', 0.042), ('gaussians', 0.041), ('assess', 0.041), ('removal', 0.04), ('treating', 0.04), ('algorithm', 0.04), ('details', 0.04), ('minimum', 0.04), ('koller', 0.039), ('gaussian', 0.039), ('class', 0.038), ('course', 0.038), ('rarely', 0.038), ('head', 0.038), ('law', 0.037), ('points', 0.036), ('multinomial', 0.036), ('subsets', 0.035), ('estimation', 0.035), ('removed', 0.035), ('posteriori', 0.034), ('mixtures', 0.033), ('search', 0.033), ('ml', 0.033), ('restriction', 0.033), ('considered', 0.033), ('assignment', 0.032), ('conceptual', 0.032), ('covariances', 0.032), ('repository', 0.031), ('good', 0.031), ('exhibits', 0.031), ('uci', 0.031), ('extends', 0.03), ('assign', 0.03), ('absence', 0.03), ('carbonetto', 0.03), ('locates', 0.03), ('coin', 0.03), ('wraps', 0.03), ('alleviating', 0.03), ('genomic', 0.03), ('abruptly', 0.03), ('mises', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 90 nips-2002-Feature Selection in Mixture-Based Clustering

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-information-based feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

2 0.15868884 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

3 0.15657136 89 nips-2002-Feature Selection by Maximum Marginal Diversity

Author: Nuno Vasconcelos

Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.

4 0.14910606 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

5 0.14559725 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

6 0.13609096 181 nips-2002-Self Supervised Boosting

7 0.13447809 188 nips-2002-Stability-Based Model Selection

8 0.12904529 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

9 0.12527156 83 nips-2002-Extracting Relevant Structures with Side Information

10 0.11802655 135 nips-2002-Learning with Multiple Labels

11 0.10576769 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

12 0.10535555 124 nips-2002-Learning Graphical Models with Mercer Kernels

13 0.10262294 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

14 0.10206522 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

15 0.095531434 142 nips-2002-Maximum Likelihood and the Information Bottleneck

16 0.094803818 55 nips-2002-Combining Features for BCI

17 0.087737866 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

18 0.087442607 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

19 0.084655315 98 nips-2002-Going Metric: Denoising Pairwise Data

20 0.081118658 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.255), (1, -0.101), (2, 0.031), (3, 0.056), (4, -0.031), (5, 0.072), (6, -0.112), (7, -0.176), (8, -0.107), (9, 0.102), (10, 0.064), (11, 0.028), (12, -0.048), (13, 0.169), (14, -0.062), (15, -0.072), (16, -0.038), (17, -0.149), (18, -0.001), (19, 0.169), (20, -0.012), (21, -0.011), (22, 0.04), (23, 0.031), (24, 0.005), (25, -0.073), (26, -0.033), (27, -0.147), (28, 0.138), (29, 0.029), (30, -0.017), (31, -0.04), (32, -0.003), (33, -0.239), (34, -0.071), (35, 0.013), (36, 0.063), (37, 0.023), (38, -0.045), (39, -0.002), (40, 0.083), (41, -0.049), (42, -0.095), (43, 0.019), (44, -0.107), (45, -0.062), (46, -0.07), (47, -0.035), (48, 0.015), (49, 0.155)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97886336 90 nips-2002-Feature Selection in Mixture-Based Clustering

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-information-based feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

2 0.78078616 89 nips-2002-Feature Selection by Maximum Marginal Diversity

Author: Nuno Vasconcelos

Abstract: We address the question of feature selection in the context of visual recognition. It is shown that, besides efficient from a computational standpoint, the infomax principle is nearly optimal in the minimum Bayes error sense. The concept of marginal diversity is introduced, leading to a generic principle for feature selection (the principle of maximum marginal diversity) of extreme computational simplicity. The relationships between infomax and the maximization of marginal diversity are identified, uncovering the existence of a family of classification procedures for which near optimal (in the Bayes error sense) feature selection does not require combinatorial search. Examination of this family in light of recent studies on the statistics of natural images suggests that visual recognition problems are a subset of it.

3 0.67678022 188 nips-2002-Stability-Based Model Selection

Author: Tilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann

Abstract: Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is crossvalidation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.

4 0.64758009 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

5 0.6211524 142 nips-2002-Maximum Likelihood and the Information Bottleneck

Author: Noam Slonim, Yair Weiss

Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution , this method constructs a new variable that defines partitions over the values of that are informative about . Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related ? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the or problems are strongly related. In fact, for uniform input distribution over for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems, induces a solution for the other.   ©§ ¥£ ¨¦¤¢   

6 0.60088348 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm

7 0.52896971 83 nips-2002-Extracting Relevant Structures with Side Information

8 0.50803894 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

9 0.47286332 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

10 0.46611232 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

11 0.46287125 124 nips-2002-Learning Graphical Models with Mercer Kernels

12 0.45166814 27 nips-2002-An Impossibility Theorem for Clustering

13 0.43929982 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

14 0.43584061 135 nips-2002-Learning with Multiple Labels

15 0.43570304 181 nips-2002-Self Supervised Boosting

16 0.43259379 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.42765275 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

18 0.42719311 55 nips-2002-Combining Features for BCI

19 0.42397353 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

20 0.39285153 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.01), (23, 0.014), (42, 0.041), (54, 0.085), (55, 0.02), (67, 0.013), (68, 0.017), (74, 0.615), (92, 0.022), (98, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98862475 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning

Author: Stella X. Yu, Ralph Gross, Jianbo Shi

Abstract: Segmentation and recognition have long been treated as two separate processes. We propose a mechanism based on spectral graph partitioning that readily combine the two processes into one. A part-based recognition system detects object patches, supplies their partial segmentations as well as knowledge about the spatial configurations of the object. The goal of patch grouping is to find a set of patches that conform best to the object configuration, while the goal of pixel grouping is to find a set of pixels that have the best low-level feature similarity. Through pixel-patch interactions and between-patch competition encoded in the solution space, these two processes are realized in one joint optimization problem. The globally optimal partition is obtained by solving a constrained eigenvalue problem. We demonstrate that the resulting object segmentation eliminates false positives for the part detection, while overcoming occlusion and weak contours for the low-level edge detection.

2 0.96926701 114 nips-2002-Information Regularization with Partially Labeled Data

Author: Martin Szummer, Tommi S. Jaakkola

Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.

3 0.96681219 78 nips-2002-Efficient Learning Equilibrium

Author: Ronen I. Brafman, Moshe Tennenholtz

Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1

4 0.9656955 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity

Author: Terry Elliott, Jörg Kramer

Abstract: A neurotrophic model for the co-development of topography and ocular dominance columns in the primary visual cortex has recently been proposed. In the present work, we test this model by driving it with the output of a pair of neuronal vision sensors stimulated by disparate moving patterns. We show that the temporal correlations in the spike trains generated by the two sensors elicit the development of refined topography and ocular dominance columns, even in the presence of significant amounts of spontaneous activity and fixed-pattern noise in the sensors.

same-paper 5 0.9593401 90 nips-2002-Feature Selection in Mixture-Based Clustering

Author: Martin H. Law, Anil K. Jain, Mário Figueiredo

Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-information-based feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.

6 0.90872896 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text

7 0.76291108 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

8 0.72980034 39 nips-2002-Bayesian Image Super-Resolution

9 0.70078641 89 nips-2002-Feature Selection by Maximum Marginal Diversity

10 0.69761866 152 nips-2002-Nash Propagation for Loopy Graphical Games

11 0.69527614 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

12 0.69197947 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

13 0.69120705 182 nips-2002-Shape Recipes: Scene Representations that Refer to the Image

14 0.68651569 74 nips-2002-Dynamic Structure Super-Resolution

15 0.67746133 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

16 0.66236877 173 nips-2002-Recovering Intrinsic Images from a Single Image

17 0.64947718 135 nips-2002-Learning with Multiple Labels

18 0.6434809 2 nips-2002-A Bilinear Model for Sparse Coding

19 0.63709509 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

20 0.63591862 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm