nips nips2005 nips2005-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either a supervised or an unsupervised fashion. The proposed method is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality preserving power, which we call its Laplacian Score. We compare our method with data variance (unsupervised) and the Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In supervised learning scenarios, feature selection has been studied widely in the literature. [sent-4, score-0.196]
2 Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. [sent-5, score-0.229]
3 Moreover, almost all previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. [sent-6, score-0.351]
4 In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. [sent-7, score-0.16]
5 Our method can be performed in either supervised or unsupervised fashion. [sent-8, score-0.114]
6 The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. [sent-9, score-0.082]
7 The importance of a feature is evaluated by its locality preserving power, which we call its Laplacian Score. [sent-10, score-0.186]
8 We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. [sent-11, score-0.532]
9 The wrapper model techniques evaluate the features using the learning algorithm that will ultimately be employed. [sent-14, score-0.183]
10 Most of the feature selection methods are wrapper methods. [sent-16, score-0.222]
11 Algorithms based on the filter model examine intrinsic properties of the data to evaluate the features prior to the learning tasks. [sent-17, score-0.138]
12 The filter based approaches almost always rely on the class labels, most commonly assessing correlations between features and the class label. [sent-18, score-0.193]
13 Data variance might be the simplest unsupervised evaluation of the features. [sent-22, score-0.182]
14 The variance along a dimension reflects its representative power. [sent-23, score-0.12]
15 Data variance can be used as a criterion for feature selection and extraction. [sent-24, score-0.28]
16 For example, Principal Component Analysis (PCA) is a classical feature extraction method which finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data. [sent-25, score-0.249]
17 Although the data variance criterion finds features that are useful for representing the data, there is no reason to assume that these features must be useful for discriminating between data in different classes. [sent-26, score-0.36]
18 Fisher score seeks features that are efficient for discrimination. [sent-27, score-0.5]
19 It assigns the highest score to the feature on which the data points of different classes are far from each other while requiring data points of the same class to be close to each other. [sent-28, score-0.643]
20 The Fisher criterion can also be used for feature extraction, as in Linear Discriminant Analysis (LDA). [sent-29, score-0.111]
21 In this paper, we introduce a novel feature selection algorithm called Laplacian Score (LS). [sent-30, score-0.16]
22 For each feature, its Laplacian score is computed to reflect its locality preserving power. [sent-31, score-0.483]
23 LS seeks those features that respect this graph structure. [sent-35, score-0.174]
24 The basic idea of LS is to evaluate the features according to their locality preserving power. [sent-37, score-0.226]
25 Let fri denote the i-th sample of the r-th feature, i = 1, · · · , m. [sent-40, score-0.23]
26 Construct a nearest neighbor graph G with m nodes. [sent-42, score-0.116]
27 We put an edge between nodes i and j if xi and xj are "close", i.e. [sent-44, score-0.084]
28 xi is among k nearest neighbors of xj or xj is among k nearest neighbors of xi . [sent-46, score-0.198]
29 When the label information is available, one can put an edge between two nodes sharing the same label. [sent-47, score-0.089]
30 The weight matrix S of the graph models the local structure of the data space. [sent-51, score-0.114]
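As a concrete illustration, the following Python sketch builds such a k-nearest-neighbor weight matrix. This is not the authors' code: it uses simple 0/1 edge weights (any other weighting scheme the paper may use is not described in the extracted text), and the names `knn_weight_matrix`, `X`, and `k` are ours.

```python
import numpy as np

def knn_weight_matrix(X, k=5):
    """Binary kNN affinity: S[i, j] = 1 if x_i is among the k nearest
    neighbors of x_j or vice versa, else 0 (a sketch, not the paper's code)."""
    m = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(dist2, np.inf)          # exclude self-neighbors
    # indices of the k nearest neighbors of each point
    nn = np.argsort(dist2, axis=1)[:, :k]
    S = np.zeros((m, m))
    rows = np.repeat(np.arange(m), k)
    S[rows, nn.ravel()] = 1.0
    S = np.maximum(S, S.T)                   # symmetrize: the "or" rule in the text
    return S
```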
31 For the $r$-th feature, we define $f_r = [f_{r1}, f_{r2}, \cdots, f_{rm}]^T$, $D = \mathrm{diag}(S\mathbf{1})$, $\mathbf{1} = [1, \cdots, 1]^T$, and $L = D - S$, where the matrix $L$ is often called the graph Laplacian [2]. [sent-53, score-0.625]
32 Compute the Laplacian Score of the $r$-th feature as follows: $\tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}, \quad L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r} \quad (1)$ [sent-55, score-2.861]
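A minimal sketch of Eq. (1) is given below, assuming the weight matrix S from the previous sketch and numpy arrays; the small `eps` guarding against a zero denominator is our addition, not part of the paper.

```python
import numpy as np

def laplacian_score(X, S, eps=1e-12):
    """Laplacian Score L_r (Eq. 1) for each column (feature) of X; smaller is better.
    A sketch, not the authors' original implementation."""
    m, n_features = X.shape
    d = S.sum(axis=1)                         # diagonal of D = diag(S 1)
    L = np.diag(d) - S                        # graph Laplacian L = D - S
    one = np.ones(m)
    scores = np.empty(n_features)
    for r in range(n_features):
        fr = X[:, r]
        # remove the D-weighted mean: fr_tilde = fr - (fr^T D 1 / 1^T D 1) 1
        fr_t = fr - (fr @ d) / (one @ d) * one
        num = fr_t @ L @ fr_t                 # fr_tilde^T L fr_tilde
        den = fr_t @ (d * fr_t)               # fr_tilde^T D fr_tilde
        scores[r] = num / (den + eps)         # eps is our guard, not in the paper
    return scores
```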
33 3.1 Justification: Objective Function. Recall that given a data set we construct a weighted graph G with edges connecting nearby points to each other. [sent-56, score-0.106]
34 Thus, the importance of a feature can be thought of as the degree to which it respects the graph structure. [sent-58, score-0.179]
35 To be specific, a "good" feature should be one on which two data points are close to each other if and only if there is an edge between these two points. [sent-59, score-0.149]
36 A reasonable criterion for choosing a good feature is to minimize the following objective function: $L_r = \frac{\sum_{ij} (f_{ri} - f_{rj})^2 S_{ij}}{\mathrm{Var}(f_r)} \quad (2)$, where $\mathrm{Var}(f_r)$ is the estimated variance of the $r$-th feature. [sent-60, score-0.388]
37 By minimizing $\sum_{ij} (f_{ri} - f_{rj})^2 S_{ij}$, we prefer those features respecting the pre-defined graph structure. [sent-61, score-0.333]
38 For a good feature, the bigger $S_{ij}$, the smaller $(f_{ri} - f_{rj})$, and thus the Laplacian Score tends to be small. [sent-62, score-0.124]
39 Following some simple algebraic steps, we see that $\sum_{ij} (f_{ri} - f_{rj})^2 S_{ij} = \sum_{ij} (f_{ri}^2 + f_{rj}^2 - 2 f_{ri} f_{rj}) S_{ij} = 2 \sum_{ij} f_{ri}^2 S_{ij} - 2 \sum_{ij} f_{ri} S_{ij} f_{rj} = 2 f_r^T D f_r - 2 f_r^T S f_r = 2 f_r^T L f_r$. By maximizing $\mathrm{Var}(f_r)$, we prefer those features with large variance, which have more representative power. [sent-63, score-1.562]
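This identity is easy to verify numerically; the short check below is illustrative only (random symmetric weights, arbitrary feature vector).

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=6)                        # an arbitrary feature vector
S = rng.random((6, 6))
S = (S + S.T) / 2                             # any symmetric weight matrix
L = np.diag(S.sum(axis=1)) - S                # L = D - S
lhs = np.sum(S * (f[:, None] - f[None, :]) ** 2)
print(np.isclose(lhs, 2 * f @ L @ f))         # True: sum_ij (f_i - f_j)^2 S_ij = 2 f^T L f
```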
40 Recall that the variance of a random variable $a$ can be written as follows: $\mathrm{Var}(a) = \int_{\mathcal{M}} (a - \mu)^2 \, dP(a), \quad \mu = \int_{\mathcal{M}} a \, dP(a)$, where $\mathcal{M}$ is the data manifold, $\mu$ is the expected value of $a$, and $dP$ is the probability measure. [sent-64, score-0.137]
41 By spectral graph theory [2], dP can be estimated by the diagonal matrix D on the sample points. [sent-65, score-0.092]
42 It is important to note that, if we do not remove the mean, the vector $f_r$ could be a nonzero constant vector such as $\mathbf{1}$. [sent-69, score-0.55]
43 Unfortunately, this feature is clearly of no use since it contains no information. [sent-72, score-0.111]
44 With the mean removed, the new vector $\tilde{f}_r$ is orthogonal to $\mathbf{1}$ with respect to $D$, i.e. $\tilde{f}_r^T D \mathbf{1} = 0$. [sent-73, score-0.55]
45 Therefore, $\tilde{f}_r$ cannot be a constant vector other than $\mathbf{0}$. [sent-76, score-0.55]
46 Thus, the Laplacian Score Lr becomes a trivial solution and the r-th feature is excluded from selection. [sent-78, score-0.111]
47 We can also simply replace $D$ by the identity matrix $I$, in which case the weighted variance becomes the standard variance. [sent-80, score-0.177]
48 To be specific, $\tilde{f}_r = f_r - \frac{f_r^T I \mathbf{1}}{\mathbf{1}^T I \mathbf{1}} \mathbf{1} = f_r - \frac{f_r^T \mathbf{1}}{n} \mathbf{1} = f_r - \mu \mathbf{1}$, where $\mu$ is the mean of $f_{ri}$, $i = 1, \cdots, n$. [sent-81, score-2.43]
49 Thus, $\mathrm{Var}(f_r) = \tilde{f}_r^T I \tilde{f}_r = \frac{1}{n} (f_r - \mu \mathbf{1})^T (f_r - \mu \mathbf{1}) \quad (3)$, which is just the standard variance. [sent-82, score-1.1]
50 In fact, the Laplacian scores can be thought of as Rayleigh quotients for the features with respect to the graph G; see [2] for details. [sent-83, score-0.181]
51 Let $n_i$ denote the number of data points in class $i$. [sent-87, score-0.316]
52 Let $\mu_i$ and $\sigma_i^2$ be the mean and variance of class $i$, $i = 1, \cdots, c$, corresponding to the $r$-th feature. [sent-88, score-0.165]
53 Let $\mu$ and $\sigma^2$ denote the mean and variance of the whole data set. [sent-89, score-0.137]
54 The Fisher score is defined below: $F_r = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2} \quad (4)$. In the following, we show that the Fisher score is equivalent to the Laplacian score with a special graph structure. [sent-90, score-1.652]
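For comparison, Eq. (4) can be sketched in a few lines; the function name and the assumption of integer class labels in `y` are ours.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score F_r of Eq. (4) for each feature (column of X); larger is better.
    Illustrative sketch assuming integer class labels in y."""
    y = np.asarray(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - mu) ** 2
        den += n_c * Xc.var(axis=0)           # population variance sigma_i^2 per class
    return num / den
```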
55 Without loss of generality, we assume that the data points are ordered according to which class they are in, so that $\{x_1, \cdots, x_{n_1}\}$ are in the first class, $\{x_{n_1+1}, \cdots, x_{n_1+n_2}\}$ are in the second class, etc. [sent-92, score-0.083]
56 Define the block-diagonal weight matrix $S = \mathrm{diag}(S_1, S_2, \cdots, S_c) \quad (5)$, where $S_i = \frac{1}{n_i} \mathbf{1}\mathbf{1}^T$ is an $n_i \times n_i$ matrix. [sent-96, score-0.699]
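The matrix of Eq. (5) is easy to construct explicitly; the sketch below is illustrative and the function name is ours. It indexes by class membership, so it also works when the data are not ordered by class (the ordering assumption above is only for notational convenience).

```python
import numpy as np

def supervised_weight_matrix(y):
    """Weight matrix of Eq. (5): S_ij = 1/n_c if x_i and x_j both belong to class c,
    and 0 otherwise. Block-diagonal when the points are ordered by class."""
    y = np.asarray(y)
    m = len(y)
    S = np.zeros((m, m))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        S[np.ix_(idx, idx)] = 1.0 / len(idx)
    return S
```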
57 Observation 1: With the weight matrix $S$ defined in (5), we have $\tilde{f}_r^T L \tilde{f}_r = f_r^T L f_r = \sum_i n_i \sigma_i^2$, where $L = D - S$. [sent-100, score-0.828]
58 To see this, define $L_i = D_i - S_i = I_i - S_i$, where $I_i$ is the $n_i \times n_i$ identity matrix. [sent-101, score-0.484]
59 We have $f_r^T L f_r = \sum_{i=1}^{c} (f_r^i)^T L_i f_r^i = \sum_{i=1}^{c} (f_r^i)^T \big(I_i - \frac{1}{n_i} \mathbf{1}\mathbf{1}^T\big) f_r^i = \sum_{i=1}^{c} n_i \, \mathrm{cov}(f_r^i, f_r^i) = \sum_{i=1}^{c} n_i \sigma_i^2$, where $f_r^i$ denotes the restriction of $f_r$ to the points in class $i$. Note that, since $u^T L \mathbf{1} = \mathbf{1}^T L u = 0, \; \forall u \in \mathbb{R}^n$, the value of $f_r^T L f_r$ remains unchanged by subtracting a constant vector $\alpha \mathbf{1}$ from $f_r$. [sent-102, score-1.419]
60 This shows that $\tilde{f}_r^T L \tilde{f}_r = f_r^T L f_r = \sum_i n_i \sigma_i^2$. [sent-103, score-0.783]
61 Observation 2: With the weight matrix $S$ defined in (5), we have $\tilde{f}_r^T D \tilde{f}_r = n \sigma^2$. [sent-104, score-0.595]
62 Observation 3: With the weight matrix $S$ defined in (5), we have $\sum_{i=1}^{c} n_i (\mu_i - \mu)^2 = \tilde{f}_r^T D \tilde{f}_r - \tilde{f}_r^T L \tilde{f}_r$. [sent-107, score-1.695]
63 We therefore get the following relationship between the Laplacian score and the Fisher score. Theorem 1: Let $F_r$ denote the Fisher score of the $r$-th feature; then $L_r = \frac{1}{1 + F_r}$. [sent-109, score-0.756]
64 Proof: From Observations 1, 2, and 3, we see that $F_r = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2} = \frac{\tilde{f}_r^T D \tilde{f}_r - \tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T L \tilde{f}_r} = \frac{1}{L_r} - 1$. Thus, $L_r = \frac{1}{1 + F_r}$. [sent-111, score-0.466]
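Theorem 1 can be checked numerically by combining the sketches above: build the supervised weight matrix of Eq. (5), compute both scores, and compare $L_r$ with $1/(1+F_r)$ feature by feature. This is a sanity check of ours, not part of the paper; it assumes the helper functions defined earlier are in scope and uses the Iris data purely as an example.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumes laplacian_score, fisher_score, and supervised_weight_matrix
# from the earlier sketches are in scope.
X, y = load_iris(return_X_y=True)
S_sup = supervised_weight_matrix(y)           # the graph of Eq. (5)
Ls = laplacian_score(X, S_sup)
Fs = fisher_score(X, y)
print(np.max(np.abs(Ls - 1.0 / (1.0 + Fs))))  # ~0, i.e. L_r = 1/(1+F_r)
```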
65 Therefore, we compared our algorithm with data variance, which can be computed in an unsupervised fashion. [sent-114, score-0.215]
66 The four features are sepal length, sepal width, petal length, and petal width. [sent-120, score-0.212]
67 One class is linearly separable from the other two, but the other two are not linearly separable from each other. [sent-121, score-0.081]
68 Out of the four features it is known that the features F3 (petal length) and F4 (petal width) are more important for the underlying clusters. [sent-122, score-0.229]
69 The classification error rates for the four features are 0. [sent-130, score-0.126]
70 Variance, Fisher score and Laplacian Score for feature selection. [sent-139, score-0.489]
71 However, Fisher score is supervised, while the other two are unsupervised. [sent-141, score-0.378]
72 By using variance, the four features are sorted as F3, F1, F4, F2. [sent-143, score-0.144]
73 Laplacian score (with k ≥ 15) sorts these four features as F3, F4, F1, F2. [sent-144, score-0.53]
74 Laplacian score (with 3 ≤ k < 15) sorts these four features as F4, F3, F1, F2. [sent-145, score-0.53]
75 Therefore, the feature F3 is ranked above F4 since the variance of F3 is greater than that of F4. [sent-147, score-0.231]
76 By using Fisher score, the four features are sorted as F3, F4, F1, F2. [sent-148, score-0.144]
77 This indicates that Laplacian score (unsupervised) achieved the same result as Fisher score (supervised). [sent-149, score-0.779]
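A hedged usage sketch of this Iris comparison, reusing the helper functions from the sketches above; the exact rankings may differ slightly from the paper's depending on k and the edge-weighting scheme.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumes knn_weight_matrix, laplacian_score, and fisher_score from the sketches above.
X, y = load_iris(return_X_y=True)             # features F1..F4 are columns 0..3
S = knn_weight_matrix(X, k=15)
print("Variance ranking :", np.argsort(-X.var(axis=0)) + 1)         # larger variance first
print("Laplacian ranking:", np.argsort(laplacian_score(X, S)) + 1)  # smaller score first
print("Fisher ranking   :", np.argsort(-fisher_score(X, y)) + 1)    # larger score first
```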
78 Face Clustering on PIE: In this section, we apply our feature selection algorithm to face clustering. [sent-151, score-0.21]
79 By using Laplacian score, we select a subset of features which are the most useful for discrimination. [sent-152, score-0.103]
80 For each given number k, k classes were randomly selected from the face database. [sent-166, score-0.104]
81 Feature selection using variance and the Laplacian score is used to select the features. [sent-170, score-0.658]
82 K-means was then performed in the selected feature subspace. [sent-171, score-0.148]
83 Evaluation Metrics: The clustering result is evaluated by comparing the obtained label of each data point with that provided by the data corpus. [sent-175, score-0.144]
84 Two metrics, the accuracy (AC) and the normalized mutual information metric (MI), are used to measure the clustering performance [6]. [sent-176, score-0.155]
85 Given a data point xi , let ri and si be the obtained cluster label and the label provided by the data corpus, respectively. [sent-177, score-0.214]
86 [Figure 2: Clustering performance versus number of features; panels include (c) 30 classes and (d) 68 classes, with the number of features on the x-axis.]
88 In computing AC, each cluster label ri is mapped to the equivalent label from the data corpus. [sent-237, score-0.504]
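A common recipe for these two metrics is sketched below; the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) stands in for the best-map step, and sklearn's normalized mutual information is used for MI. This is a standard recipe of ours, not necessarily the exact implementation of [6].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """AC: fraction of points whose cluster label, after the best one-to-one
    re-labeling (Hungarian algorithm), matches the ground-truth label."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    true_ids = np.unique(true_labels)
    clus_ids = np.unique(cluster_labels)
    # negative contingency table as a cost matrix for minimization
    cost = np.zeros((len(clus_ids), len(true_ids)))
    for i, c in enumerate(clus_ids):
        for j, t in enumerate(true_ids):
            cost[i, j] = -np.sum((cluster_labels == c) & (true_labels == t))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(true_labels)

def clustering_mi(true_labels, cluster_labels):
    """MI: normalized mutual information between the two labelings."""
    return normalized_mutual_info_score(true_labels, cluster_labels)
```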
89 Results: We compared the Laplacian score with data variance for clustering. [sent-246, score-0.515]
90 Note that we did not compare with the Fisher score because it is supervised and the label information is not available in the clustering experiments. [sent-247, score-0.524]
91 As can be seen, in all these cases, our algorithm performs much better than using variance for feature selection. [sent-251, score-0.231]
92 This indicates that feature selection is capable of enhancing clustering performance. [sent-254, score-0.251]
93 In Figure 3, we show the selected features in the image domain for each test (k=5, 10, 30, 68), using our algorithm, data variance and Fisher score. [sent-255, score-0.261]
94 As can be seen, the Laplacian score provides a better approximation to the Fisher score than data variance does. [sent-258, score-0.773]
95 [Figure 3: Selected features in the image domain, k = 5, 10, 30, 68; (a) Variance, (b) Laplacian Score, (c) Fisher Score.] [sent-259, score-0.481]
96 Both the Laplacian score and the Fisher score have the brightest pixels in the areas of the two eyes, nose, mouth, and face contour. [sent-373, score-0.451]
97 This indicates that even though our algorithm is unsupervised, it can discover the most discriminative features to some extent. [sent-374, score-0.126]
98 5 Conclusions: In this paper, we propose a new filter method for feature selection which is independent of any learning task. [sent-375, score-0.16]
99 It can be performed in either supervised or unsupervised fashion. [sent-376, score-0.114]
100 Experiments on the Iris data set and the PIE face data set demonstrate the effectiveness of our algorithm. [sent-378, score-0.106]
wordName wordTfidf (topN-words)
[('fr', 0.55), ('score', 0.378), ('laplacian', 0.311), ('lfr', 0.248), ('ni', 0.233), ('fri', 0.23), ('fisher', 0.174), ('ft', 0.146), ('sij', 0.129), ('frj', 0.124), ('variance', 0.12), ('lr', 0.112), ('feature', 0.111), ('features', 0.103), ('iris', 0.092), ('dfr', 0.088), ('ls', 0.088), ('fi', 0.085), ('ar', 0.075), ('dii', 0.074), ('petal', 0.071), ('clustering', 0.068), ('cj', 0.068), ('ci', 0.065), ('lter', 0.064), ('mutual', 0.062), ('unsupervised', 0.062), ('wrapper', 0.062), ('locality', 0.059), ('graph', 0.052), ('si', 0.05), ('face', 0.05), ('selection', 0.049), ('preserving', 0.046), ('class', 0.045), ('label', 0.042), ('pie', 0.042), ('nearest', 0.042), ('clusters', 0.04), ('ac', 0.039), ('supervised', 0.036), ('sepal', 0.035), ('sfr', 0.035), ('niyogi', 0.034), ('classes', 0.033), ('ij', 0.033), ('xiaofei', 0.031), ('dp', 0.03), ('put', 0.03), ('lf', 0.028), ('ri', 0.028), ('please', 0.026), ('sorts', 0.026), ('accuracy', 0.025), ('cropped', 0.025), ('matrix', 0.023), ('four', 0.023), ('pixels', 0.023), ('indicates', 0.023), ('eigenmaps', 0.022), ('effectiveness', 0.022), ('neighbor', 0.022), ('weight', 0.022), ('brightness', 0.022), ('selected', 0.021), ('eyes', 0.021), ('prefer', 0.021), ('points', 0.021), ('neighbors', 0.02), ('observation', 0.02), ('scenarios', 0.019), ('xj', 0.019), ('follows', 0.019), ('uci', 0.019), ('seeks', 0.019), ('evaluate', 0.018), ('sorted', 0.018), ('metrics', 0.018), ('separable', 0.018), ('images', 0.018), ('extraction', 0.018), ('classi', 0.018), ('xi', 0.018), ('identity', 0.018), ('corpus', 0.017), ('spectral', 0.017), ('data', 0.017), ('visualization', 0.017), ('belongs', 0.017), ('tests', 0.017), ('nodes', 0.017), ('check', 0.016), ('importance', 0.016), ('performed', 0.016), ('weighted', 0.016), ('li', 0.015), ('mouth', 0.015), ('sc', 0.015), ('holland', 0.015), ('rayleigh', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either a supervised or an unsupervised fashion. The proposed method is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality preserving power, which we call its Laplacian Score. We compare our method with data variance (unsupervised) and the Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
2 0.11394371 166 nips-2005-Robust Fisher Discriminant Analysis
Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd
Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1
3 0.10601389 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: We investigate the problem of automatically constructing efficient representations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particular, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigenfunctions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying representation and policies are simultaneously learned.
4 0.094244108 189 nips-2005-Tensor Subspace Analysis
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: Previous work has demonstrated that the image variations of many objects (human faces in particular) under variable lighting can be effectively modeled by low dimensional linear spaces. The typical linear subspace learning algorithms include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locality Preserving Projection (LPP). All of these methods consider an n1 × n2 image as a high dimensional vector in Rn1 ×n2 , while an image represented in the plane is intrinsically a matrix. In this paper, we propose a new algorithm called Tensor Subspace Analysis (TSA). TSA considers an image as the second order tensor in Rn1 ⊗ Rn2 , where Rn1 and Rn2 are two vector spaces. The relationship between the column vectors of the image matrix and that between the row vectors can be naturally characterized by TSA. TSA detects the intrinsic local geometrical structure of the tensor space by learning a lower dimensional tensor subspace. We compare our proposed approach with PCA, LDA and LPP methods on two standard databases. Experimental results demonstrate that TSA achieves better recognition rate, while being much more efficient. 1
5 0.08593031 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.084299944 33 nips-2005-Bayesian Sets
7 0.079763524 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
8 0.076263383 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.073191196 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
10 0.072547741 79 nips-2005-Fusion of Similarity Data in Clustering
11 0.07046359 102 nips-2005-Kernelized Infomax Clustering
12 0.06822861 117 nips-2005-Learning from Data of Variable Quality
13 0.068184875 110 nips-2005-Learning Depth from Single Monocular Images
14 0.066462085 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
15 0.066204384 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
16 0.060297694 126 nips-2005-Metric Learning by Collapsing Classes
17 0.060224432 45 nips-2005-Conditional Visual Tracking in Kernel Space
18 0.058278956 178 nips-2005-Soft Clustering on Graphs
19 0.055682819 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
20 0.055380844 75 nips-2005-Fixing two weaknesses of the Spectral Method
topicId topicWeight
[(0, 0.163), (1, 0.079), (2, -0.05), (3, 0.013), (4, -0.152), (5, 0.019), (6, 0.091), (7, 0.007), (8, -0.002), (9, -0.024), (10, 0.033), (11, -0.046), (12, 0.041), (13, -0.057), (14, -0.018), (15, -0.07), (16, -0.066), (17, -0.042), (18, -0.12), (19, -0.082), (20, 0.003), (21, 0.07), (22, 0.042), (23, 0.04), (24, 0.11), (25, -0.066), (26, 0.145), (27, 0.077), (28, -0.023), (29, -0.087), (30, -0.058), (31, 0.027), (32, -0.002), (33, 0.095), (34, -0.193), (35, 0.034), (36, -0.066), (37, -0.174), (38, 0.037), (39, 0.062), (40, 0.1), (41, 0.142), (42, -0.175), (43, -0.077), (44, 0.014), (45, -0.0), (46, 0.229), (47, 0.075), (48, 0.13), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95986992 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either a supervised or an unsupervised fashion. The proposed method is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality preserving power, which we call its Laplacian Score. We compare our method with data variance (unsupervised) and the Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
2 0.52979696 166 nips-2005-Robust Fisher Discriminant Analysis
Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd
Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1
3 0.51582712 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
4 0.49222773 189 nips-2005-Tensor Subspace Analysis
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: Previous work has demonstrated that the image variations of many objects (human faces in particular) under variable lighting can be effectively modeled by low dimensional linear spaces. The typical linear subspace learning algorithms include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locality Preserving Projection (LPP). All of these methods consider an n1 × n2 image as a high dimensional vector in Rn1 ×n2 , while an image represented in the plane is intrinsically a matrix. In this paper, we propose a new algorithm called Tensor Subspace Analysis (TSA). TSA considers an image as the second order tensor in Rn1 ⊗ Rn2 , where Rn1 and Rn2 are two vector spaces. The relationship between the column vectors of the image matrix and that between the row vectors can be naturally characterized by TSA. TSA detects the intrinsic local geometrical structure of the tensor space by learning a lower dimensional tensor subspace. We compare our proposed approach with PCA, LDA and LPP methods on two standard databases. Experimental results demonstrate that TSA achieves better recognition rate, while being much more efficient. 1
5 0.41394731 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove a finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movies rating as attributes. 1
6 0.36784032 59 nips-2005-Dual-Tree Fast Gauss Transforms
7 0.35731211 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
8 0.3469272 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
9 0.34683612 121 nips-2005-Location-based activity recognition
10 0.34677294 117 nips-2005-Learning from Data of Variable Quality
11 0.32903382 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
12 0.32075348 126 nips-2005-Metric Learning by Collapsing Classes
13 0.31780368 79 nips-2005-Fusion of Similarity Data in Clustering
14 0.30075127 102 nips-2005-Kernelized Infomax Clustering
15 0.29093385 71 nips-2005-Fast Krylov Methods for N-Body Learning
16 0.28781936 158 nips-2005-Products of ``Edge-perts
17 0.28618696 45 nips-2005-Conditional Visual Tracking in Kernel Space
18 0.26398516 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
19 0.2601243 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
20 0.2547507 151 nips-2005-Pattern Recognition from One Example by Chopping
topicId topicWeight
[(3, 0.061), (10, 0.023), (27, 0.017), (31, 0.02), (34, 0.076), (39, 0.011), (41, 0.017), (50, 0.016), (55, 0.023), (69, 0.049), (73, 0.509), (88, 0.05), (91, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.9251492 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either a supervised or an unsupervised fashion. The proposed method is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality preserving power, which we call its Laplacian Score. We compare our method with data variance (unsupervised) and the Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
2 0.90777832 71 nips-2005-Fast Krylov Methods for N-Body Learning
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
3 0.86473471 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
4 0.81379414 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.8083365 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
6 0.6026758 102 nips-2005-Kernelized Infomax Clustering
7 0.56600112 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
8 0.54672158 189 nips-2005-Tensor Subspace Analysis
9 0.52427214 84 nips-2005-Generalization in Clustering with Unobserved Features
10 0.52033657 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
11 0.48900557 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
12 0.48531669 178 nips-2005-Soft Clustering on Graphs
13 0.48389366 77 nips-2005-From Lasso regression to Feature vector machine
14 0.47574672 98 nips-2005-Infinite latent feature models and the Indian buffet process
15 0.47426707 177 nips-2005-Size Regularized Cut for Data Clustering
16 0.46106023 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.44663605 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
18 0.43927166 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
19 0.4345364 133 nips-2005-Nested sampling for Potts models
20 0.43404841 79 nips-2005-Fusion of Similarity Data in Clustering