nips nips2003 nips2003-59 knowledge-graph by maker-knowledge-mining

59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion


Source: pdf

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. [sent-4, score-0.274]

2 This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). [sent-5, score-0.281]

3 We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . [sent-6, score-0.366]

4 Such a dependence is a major drawback of LDA especially when the sample size is small. [sent-7, score-0.066]

5 The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. [sent-8, score-0.195]

6 Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. [sent-9, score-0.187]

7 1 Introduction In statistical pattern recognition, high dimensionality is a major cause of the practical limitations of many pattern recognition technologies. [sent-10, score-0.086]

8 In the past several decades, many dimensionality reduction techniques have been proposed. [sent-11, score-0.065]

9 Linear discriminant analysis (LDA, also called Fisher’s Linear Discriminant) [1] is one of the most popular linear dimensionality reduction methods. [sent-12, score-0.152]

10 To maximize (1), the transformation matrix W must be constituted by the largest eigenvectors of Sw^-1 Sb. [sent-15, score-0.16]

11 The purpose of LDA is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. [sent-16, score-0.238]

12 In the two-class case, the transformation matrix W is just a vector, which is in the same direction as the discriminant in the corresponding optimal Bayes classifier. [sent-18, score-0.127]

13 A major drawback of LDA is that it cannot be applied when Sw is singular due to the small sample size problem [3]. [sent-20, score-0.117]

14 The small sample size problem arises whenever the number of samples is smaller than the dimensionality of samples. [sent-21, score-0.108]

15 For example, a 64 × 64 image in a face recognition system has 4096 dimensions, which requires more than 4096 training data to ensure that Sw is nonsingular. [sent-22, score-0.072]
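
To make the dimensionality argument concrete, here is a small sketch (ours, not from the paper) showing that the within-class scatter of n samples from c classes has rank at most n − c, and is therefore singular whenever n − c is smaller than the dimension D:

```python
import numpy as np

# Illustrative only: 10 samples in 50 dimensions make Sw rank-deficient.
rng = np.random.default_rng(0)
D, n_per_class, c = 50, 5, 2
X = rng.normal(size=(c * n_per_class, D))
y = np.repeat(np.arange(c), n_per_class)

Sw = np.zeros((D, D))
for k in range(c):
    Xk = X[y == k]
    Xc = Xk - Xk.mean(axis=0)          # center each class
    Sw += Xc.T @ Xc                    # within-class scatter contribution

print(np.linalg.matrix_rank(Sw))       # at most n - c = 8, far below D = 50
```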

16 [4] used the pseudo-inverse matrix Sw^+ instead of the inverse matrix Sw^-1. [sent-26, score-0.089]

17 For the same purpose, Hong and Yang [5] tried to add a singular value perturbation to Sw to make it nonsingular. [sent-27, score-0.067]

18 Neither of these methods is theoretically sound because Fisher’s criterion is not valid when Sw is singular. [sent-28, score-0.105]

19 When Sw is singular, any positive Sb makes Fisher’s criterion infinitely large. [sent-29, score-0.105]

20 Besides, it is also known that an eigenvector could be very sensitive to small perturbation if its corresponding eigenvalue is close to another eigenvalue of the same matrix [6]. [sent-31, score-0.095]

21 [7] modified Fisher’s criterion by using the total scatter matrix St = Sb + Sw as the denominator instead of Sw. [sent-33, score-0.318]

22 It has been proven that the modified criterion is exactly equivalent to Fisher’s criterion. [sent-34, score-0.119]

23 However, when Sw is singular, the modified criterion reaches its maximum value (i.e., 1) for any transformation W in the null space of Sw. [sent-35, score-0.125]

24 Such an arbitrary transformation cannot guarantee the maximum class separability unless WT Sb W is maximized. [sent-38, score-0.119]

25 When Sw is of full rank, the LDA+PCA method just calculates the largest eigenvectors of St^-1 Sb to form the transformation matrix. [sent-42, score-0.078]

26 First, the data are transformed into the null space V0 of Sw . [sent-44, score-0.062]

27 Second, it tries to maximize the between-class scatter in V0 , which is accomplished by performing principal component analysis (PCA) on the between-class scatter matrix in V0 . [sent-45, score-0.46]

28 Although this method solves the small sample size problem, it is obviously suboptimal because it maximizes the between-class scatter in the null space of Sw instead of the original input space. [sent-46, score-0.305]

29 Besides, the performance of the LDA+PCA method drops significantly when n − c is close to the dimensionality D, where n is the number of samples and c is the number of classes. [sent-47, score-0.098]

30 The reason is that the dimensionality of the null space V0 is too small in this situation, and too much information is lost when we try to extract the discriminant vectors in V0 . [sent-48, score-0.2]

31 The instability problem is more severe for KFD because Sw in the (nonlinear) feature space F is always singular (the rank of Sw is n − c). [sent-52, score-0.129]

32 Of course, it has the same stability problem as that in [5] because eigenvectors are sensitive to small perturbation. [sent-54, score-0.054]

33 In this paper, a simpler, more efficient, and stable method is proposed to calculate the most discriminant vectors based on a new feature extraction criterion, the maximum margin criterion (MMC). [sent-59, score-0.378]

34 Based on MMC, new linear and nonlinear feature extractors are established. [sent-60, score-0.189]

35 It can be shown that MMC represents class separability better than PCA. [sent-61, score-0.069]

36 On the other hand, the new feature extractors derived above (based on MMC) do not suffer from the small sample size problem, which is known to cause serious stability problems for LDA (based on Fisher’s criterion). [sent-63, score-0.183]

37 Different from LDA+PCA, the new feature extractors based on MMC maximize the between-class scatter in the input space instead of the null space of Sw . [sent-64, score-0.423]

38 If some distance metric is used to measure the dissimilarity, we would hope that a pattern is close to those in the same class but far from those in different classes. [sent-83, score-0.099]

39 So, a good feature extractor should maximize the distances between classes after the transformation. [sent-84, score-0.22]

40 Therefore, we may define the feature extraction criterion as J = (1/2) Σ_{i=1}^{c} Σ_{j=1}^{c} pi pj d(Ci, Cj) (2). We call (2) the maximum margin criterion (MMC). [sent-85, score-0.596]

41 Like the weighted pairwise Fisher’s criteria in [2], one may also define a weighted maximum margin criterion. [sent-87, score-0.07]

42 d(Ci , Cj ) = d(mi , mj ) (3) where mi and mj are the mean vectors of the class Ci and the class Cj , respectively. [sent-91, score-0.604]

43 However, (3) is not suitable since it neglects the scatter of classes. [sent-92, score-0.18]

44 Even if the distance between the mean vectors is large, it is not easy to separate two classes that have a large spread and overlap with each other. [sent-93, score-0.071]

45 By considering the scatter of classes, we define the interclass distance (or margin) as d(Ci , Cj ) = d(mi , mj ) − s(Ci ) − s(Cj ) (4) where s(Ci ) is some measure of the scatter of the class Ci . [sent-94, score-0.568]

46 In statistics, we usually use the generalized variance |Si | or overall variance tr(Si ) to measure the scatter of data. [sent-95, score-0.232]

47 In this paper, we use the overall variance tr(Si ) because it is easy to analyze. [sent-96, score-0.052]
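
As a concrete illustration of (2)–(4), the following sketch evaluates J for labeled data, taking d(mi, mj) to be the squared Euclidean distance and s(Ci) = tr(Si); the function name and implementation details are ours, not the authors':

```python
import numpy as np

def mmc_score(X, y):
    """Evaluate J = 1/2 * sum_ij p_i p_j (||m_i - m_j||^2 - tr(S_i) - tr(S_j))."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)                                     # class priors p_i
    means = [X[y == k].mean(axis=0) for k in classes]
    traces = [np.trace(np.cov(X[y == k].T, bias=True)) for k in classes]
    J = 0.0
    for i in range(len(classes)):
        for j in range(len(classes)):
            d_ij = np.sum((means[i] - means[j]) ** 2)       # squared distance of means
            J += 0.5 * p[i] * p[j] * (d_ij - traces[i] - traces[j])
    return J
```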

48 Note that, by employing the overall/generalized variance, the expression (4) measures the “average margin” between two classes while the minimum margin is used in support vector machines (SVMs) [10]. [sent-98, score-0.085]
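
Filling in a step that the summary leaves implicit: with the two choices above (squared Euclidean distance for d(·,·) and s(Ci) = tr(Si), as in the sketch), the criterion collapses to the trace form used in Section 3 below,

$$
J \;=\; \frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c} p_i\,p_j\bigl(\lVert m_i-m_j\rVert^2-\operatorname{tr}(S_i)-\operatorname{tr}(S_j)\bigr)
\;=\;\operatorname{tr}(S_b)-\operatorname{tr}(S_w)\;=\;\operatorname{tr}(S_b-S_w),
$$

since \(\tfrac{1}{2}\sum_{i,j} p_i p_j \lVert m_i-m_j\rVert^2=\sum_i p_i\lVert m_i-m\rVert^2=\operatorname{tr}(S_b)\) and \(\tfrac{1}{2}\sum_{i,j} p_i p_j(\operatorname{tr}(S_i)+\operatorname{tr}(S_j))=\sum_i p_i\operatorname{tr}(S_i)=\operatorname{tr}(S_w)\).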

49 On the other hand, a small tr(Sw) implies that every class has a small spread. [sent-101, score-0.068]

50 Thus, a large J indicates that patterns are close to each other if they are from the same class but are far from each other if they are from different classes. [sent-102, score-0.062]

51 Thus, this criterion may represent class separability better than PCA. [sent-103, score-0.174]

52 Recall that PCA tries to maximize the total scatter after a linear transformation. [sent-104, score-0.27]

53 But a data set with a large within-class scatter can also have a large total scatter even when it has a small between-class scatter, because St = Sb + Sw. [sent-105, score-0.553]

54 Compared with LDA+PCA, we maximize the between-class scatter in input space rather than the null space of Sw when Sw is singular. [sent-107, score-0.301]

55 3 Linear Feature Extraction When performing dimensionality reduction, we want to find a (linear or nonlinear) mapping from the measurement space M to some feature space F such that J is maximized after the transformation. [sent-109, score-0.16]

56 In this section, we discuss how to find an optimal linear feature extractor. [sent-110, score-0.084]

57 We would like to maximize J(W) = tr(Sb^W − Sw^W), where Sb^W and Sw^W are the between-class scatter matrix and within-class scatter matrix in the feature space F. [sent-113, score-0.546]

58 So, we have J(W) = tr(W^T (Sb − Sw) W) (8). In this formulation, we have the freedom to multiply W by some nonzero constant. [sent-115, score-0.154]

59 This means that we need to solve the following constrained optimization: max Σ_{k=1}^{d} wk^T (Sb − Sw) wk subject to wk^T wk − 1 = 0, k = 1, ..., d. [sent-122, score-1.182]

60 For example, we may require tr(W^T Sw W) = 1 and then maximize tr(W^T Sb W). [sent-126, score-0.353]

61 The motivation for using the constraint wk^T wk = 1 is that it allows us to avoid calculating the inverse of Sw and thus the potential small sample size problem. [sent-129, score-0.859]

62 To solve the above optimization problem, we may introduce a Lagrangian L(wk, λk) = Σ_{k=1}^{d} [wk^T (Sb − Sw) wk − λk (wk^T wk − 1)] (9) with multipliers λk. [sent-130, score-0.788]

63 The Lagrangian L has to be maximized with respect to λk and wk . [sent-131, score-0.422]

64 The condition that, at the stationary point, the derivatives of L with respect to wk must vanish, ∂L(wk, λk)/∂wk = ((Sb − Sw) − λk I) wk = 0, k = 1, ..., d (10) [sent-132, score-0.394]

65 leads to (Sb − Sw) wk = λk wk, k = 1, ..., d (11) [sent-135, score-0.394]

66 which means that the λk ’s are the eigenvalues of Sb − Sw and the wk ’s are the corresponding eigenvectors. [sent-138, score-0.394]

67 Thus J(W) = Σ_{k=1}^{d} wk^T (Sb − Sw) wk = Σ_{k=1}^{d} λk wk^T wk = Σ_{k=1}^{d} λk (12). Therefore, J(W) is maximized when W is composed of the first d largest eigenvectors of Sb − Sw. [sent-139, score-1.238]

68 Here, we need not calculate the inverse of Sw , which allows us to avoid the small sample size problem easily. [sent-140, score-0.093]
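
A minimal sketch of the resulting linear MMC extractor, as we read equations (8)–(12): build Sb and Sw from labeled data and keep the d largest eigenvectors of Sb − Sw, with no inverse of Sw ever formed. The function name mmc_fit and the NumPy details are our illustration, not the authors' code:

```python
import numpy as np

def mmc_fit(X, y, d):
    """Return W whose columns are the d largest eigenvectors of Sb - Sw."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)                          # class priors p_i
    m = X.mean(axis=0)                           # overall mean
    D = X.shape[1]
    Sb, Sw = np.zeros((D, D)), np.zeros((D, D))
    for pi, k in zip(p, classes):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sb += pi * np.outer(mk - m, mk - m)      # between-class scatter
        Sw += pi * np.cov(Xk.T, bias=True)       # within-class scatter
    evals, evecs = np.linalg.eigh(Sb - Sw)       # symmetric, so eigh suffices
    return evecs[:, np.argsort(evals)[::-1][:d]]
```

New samples are projected as X @ W; training amounts to a single eigendecomposition of a D × D symmetric matrix, which is why it stays stable even when Sw is singular.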

69 4 Nonlinear Feature Extraction with Kernel In this section, we follow the approach of nonlinear SVMs [10] to kernelize the above linear feature extractor. [sent-142, score-0.128]

70 More precisely, we first reformulate the maximum margin criterion in terms of only dot-product Φ(x), Φ(y) of input patterns. [sent-143, score-0.175]

71 Consider the maximum margin criterion in the feature space F: J^Φ(W) = Σ_{k=1}^{d} wk^T (Sb^Φ − Sw^Φ) wk, where Sb^Φ and Sw^Φ are the between-class scatter matrix and within-class scatter matrix in F. [sent-147, score-1.037]

72 That is, Sb^Φ = Σ_{i=1}^{c} pi (mi^Φ − m^Φ)(mi^Φ − m^Φ)^T and Sw^Φ = Σ_{i=1}^{c} pi Si^Φ, with Si^Φ = (1/ni) Σ_{j=1}^{ni} (Φ(xj^(i)) − mi^Φ)(Φ(xj^(i)) − mi^Φ)^T, mi^Φ = (1/ni) Σ_{j=1}^{ni} Φ(xj^(i)), m^Φ = Σ_{i=1}^{c} pi mi^Φ, and xj^(i) a pattern of class Ci, which has ni samples. [sent-149, score-2.556]

73 For us, an important fact is that each wk lies in the span of Φ(x1), Φ(x2), ..., Φ(xn). [sent-150, score-0.394]

74 Therefore, we can find an expansion for wk in the form wk = Σ_{l=1}^{n} αl^(k) Φ(xl). [sent-154, score-0.788]

75 Using this expansion and the definition of mi^Φ, we have wk^T mi^Φ = Σ_{l=1}^{n} αl^(k) (1/ni) Σ_{j=1}^{ni} ⟨Φ(xl), Φ(xj^(i))⟩. Replacing the dot-product by some kernel function k(x, y) and defining (m̃i)l = (1/ni) Σ_{j=1}^{ni} k(xl, xj^(i)), we get wk^T mi^Φ = αk^T m̃i with (αk)l = αl^(k). [sent-155, score-2.363]

76 Similarly, we have wk^T m^Φ = wk^T Σ_{i=1}^{c} pi mi^Φ = αk^T m̃ with m̃ = Σ_{i=1}^{c} pi m̃i. [sent-156, score-1.337]

77 It follows that Σ_{k=1}^{d} wk^T Sb^Φ wk = Σ_{k=1}^{d} Σ_{i=1}^{c} pi (wk^T (mi^Φ − m^Φ))(wk^T (mi^Φ − m^Φ))^T = Σ_{k=1}^{d} Σ_{i=1}^{c} pi αk^T (m̃i − m̃)(m̃i − m̃)^T αk = Σ_{k=1}^{d} αk^T S̃b αk, where S̃b = Σ_{i=1}^{c} pi (m̃i − m̃)(m̃i − m̃)^T. [sent-158, score-1.08]

78 Similarly, one can simplify W^T Sw^Φ W. [sent-159, score-0.516]

79 First, we have wk^T (Φ(xj^(i)) − mi^Φ) = αk^T (kj^(i) − m̃i) with (kj^(i))l = k(xl, xj^(i)). [sent-160, score-0.691]

80 Thus, we obtain Σ_{k=1}^{d} wk^T Sw^Φ wk = Σ_{k=1}^{d} Σ_{i=1}^{c} pi (1/ni) Σ_{j=1}^{ni} αk^T (kj^(i) − m̃i)(kj^(i) − m̃i)^T αk = Σ_{k=1}^{d} αk^T S̃w αk, where S̃w = Σ_{i=1}^{c} pi (1/ni) Σ_{j=1}^{ni} (kj^(i) − m̃i)(kj^(i) − m̃i)^T. [sent-162, score-2.734]

81 So the maximum margin criterion in the feature space F is J(W) = Σ_{k=1}^{d} αk^T (S̃b − S̃w) αk (13). Similar to the observations in Section 3, the above criterion is maximized by the largest eigenvectors of S̃b − S̃w. [sent-163, score-0.347]
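
Below is a hedged sketch of how the kernelized extractor (13) could be assembled from the training Gram matrix alone, following the simplifications above; kmmc_fit and its variable names are ours, and the per-class centering mirrors the outer-product form of S̃w given in the previous sentence:

```python
import numpy as np

def kmmc_fit(K, y, d):
    """K: (n, n) training Gram matrix, y: labels; returns A with columns alpha_k."""
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    p = counts / n                                   # class priors p_i
    m_tilde = np.zeros((len(classes), n))
    S_w = np.zeros((n, n))
    for idx, (pi, k, nk) in enumerate(zip(p, classes, counts)):
        Ki = K[:, y == k]                            # columns are the k_j^(i) vectors
        m_tilde[idx] = Ki.mean(axis=1)               # tilde m_i
        diff = Ki - m_tilde[idx][:, None]            # k_j^(i) - tilde m_i
        S_w += pi / nk * diff @ diff.T
    m_bar = p @ m_tilde                              # tilde m
    S_b = sum(pi * np.outer(mi - m_bar, mi - m_bar) for pi, mi in zip(p, m_tilde))
    evals, alphas = np.linalg.eigh(S_b - S_w)
    return alphas[:, np.argsort(evals)[::-1][:d]]
```

A new point x is then mapped to its kernel row (k(x, x1), ..., k(x, xn)) times A, i.e. K_test @ A for a batch of test points.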

82 Figure 1: Experimental results obtained using a linear SVM on the original data (RAW), and the data extracted by LDA+PCA, the linear feature extractor based on MMC (MMC) and the nonlinear feature extractor based on MMC (KMMC), which employs the Gaussian kernel with γ = 0. [sent-173, score-0.415]

83 5 Experiments To evaluate the performance of our new methods (both linear and nonlinear feature extractors), we ran both LDA+PCA and our methods on the ORL face dataset [11]. [sent-175, score-0.157]

84 First, we resized the images to 28 × 23 to save the experimental time. [sent-178, score-0.082]

85 As a control, we also trained and tested a linear SVM on the original data before its dimensionality was reduced. [sent-181, score-0.066]

86 Note that our methods can even achieve lower error rates than a linear SVM on the original data (without dimensionality reduction). [sent-188, score-0.066]

87 Fig. 1(a) also shows that the kernelized (nonlinear) feature extractor based on MMC is significantly better than the linear one, in particular when the number of classes c is large. [sent-191, score-0.217]

88 Fig. 1(b) shows that our linear feature extractor is about 4 times faster than LDA+PCA. [sent-194, score-0.176]

89 Note that our nonlinear feature extractor is also faster than LDA+PCA in this case although it is very time-consuming to calculate the kernel matrix in general. [sent-196, score-0.271]

90 An explanation of the speedup is that the kernel matrix size equals the number of samples, which is pretty small in this case. [sent-197, score-0.098]

91 Furthermore, our method performs much better than LDA+PCA when n − c is close to the dimensionality D. [sent-198, score-0.063]

92 Because the amount of training data was limited, we resized the images to 168 dimensions to create such a situation. [sent-199, score-0.068]

93 In this situation, the performance of LDA+PCA drops significantly because the null space of Sw has a small dimensionality. [sent-202, score-0.093]

94 When LDA+PCA tries to maximize the between-class scatter in this small null space, it loses a lot of information. [sent-203, score-0.308]

95 On the other hand, our method tries to maximize the between-class scatter in the original input space. [sent-204, score-0.247]

96 (a) Each class contains three training samples. [sent-223, score-0.06]

97 (b) Each class contains four training samples. [sent-225, score-0.06]

98 6 Conclusion In this paper, we proposed both linear and nonlinear feature extractors based on the maximum margin criterion. [sent-230, score-0.259]

99 Optimal discriminant plane for a small number of samples and design method of classifier on the plane. [sent-265, score-0.094]

100 A new LDA-based face recognition system which can solve the small sample size problem. [sent-284, score-0.102]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sw', 0.462), ('wk', 0.394), ('lda', 0.336), ('sb', 0.276), ('mi', 0.256), ('ni', 0.249), ('mmc', 0.249), ('scatter', 0.18), ('tr', 0.154), ('pi', 0.146), ('pca', 0.135), ('mj', 0.123), ('si', 0.115), ('criterion', 0.105), ('extractor', 0.092), ('fisher', 0.081), ('pj', 0.071), ('ini', 0.069), ('kmmc', 0.069), ('discriminant', 0.064), ('extractors', 0.061), ('feature', 0.061), ('wt', 0.053), ('ci', 0.053), ('margin', 0.05), ('null', 0.048), ('xl', 0.048), ('maximize', 0.045), ('nonlinear', 0.044), ('dimensionality', 0.043), ('class', 0.042), ('cj', 0.042), ('kfd', 0.042), ('xj', 0.041), ('extraction', 0.038), ('singular', 0.038), ('kj', 0.037), ('rd', 0.033), ('matrix', 0.033), ('ej', 0.031), ('besides', 0.031), ('transformation', 0.03), ('perturbation', 0.029), ('face', 0.029), ('maximized', 0.028), ('interclass', 0.028), ('orl', 0.028), ('orleans', 0.028), ('resized', 0.028), ('tian', 0.028), ('eigenvectors', 0.028), ('separability', 0.027), ('recognition', 0.025), ('constituted', 0.024), ('inverse', 0.023), ('linear', 0.023), ('tries', 0.022), ('reduction', 0.022), ('pattern', 0.022), ('classes', 0.022), ('calculate', 0.022), ('images', 0.022), ('sj', 0.021), ('sample', 0.021), ('overall', 0.02), ('maximum', 0.02), ('close', 0.02), ('speedup', 0.019), ('kernelized', 0.019), ('kernel', 0.019), ('save', 0.018), ('chen', 0.018), ('hong', 0.018), ('drops', 0.018), ('cc', 0.018), ('lagrangian', 0.018), ('vectors', 0.018), ('training', 0.018), ('simplify', 0.018), ('svm', 0.017), ('samples', 0.017), ('major', 0.017), ('variance', 0.016), ('rank', 0.016), ('liu', 0.016), ('easy', 0.016), ('suboptimal', 0.015), ('distance', 0.015), ('experimental', 0.014), ('proven', 0.014), ('drawback', 0.014), ('space', 0.014), ('situation', 0.014), ('size', 0.014), ('ef', 0.013), ('small', 0.013), ('stability', 0.013), ('employing', 0.013), ('raw', 0.013), ('purpose', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

2 0.12531817 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

Author: David Barber, Felix V. Agakov

Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1

3 0.10328972 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng

Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1

4 0.095288724 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

Author: David Kauchak, Sanjoy Dasgupta

Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1

5 0.085145742 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

Author: Xavier Carreras, Lluís Màrquez

Abstract: This work presents an architecture based on perceptrons to recognize phrase structures, and an online learning algorithm to train the perceptrons together and dependently. The recognition strategy applies learning in two layers: a filtering layer, which reduces the search space by identifying plausible phrase candidates, and a ranking layer, which recursively builds the optimal phrase structure. We provide a recognition-based feedback rule which reflects to each local function its committed errors from a global point of view, and allows to train them together online as perceptrons. Experimentation on a syntactic parsing problem, the recognition of clause hierarchies, improves state-of-the-art results and evinces the advantages of our global training method over optimizing each function locally and independently. 1

6 0.083062828 193 nips-2003-Variational Linear Response

7 0.082648203 73 nips-2003-Feature Selection in Clustering Problems

8 0.080385156 117 nips-2003-Linear Response for Approximate Inference

9 0.072746642 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

10 0.068640836 100 nips-2003-Laplace Propagation

11 0.068368837 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

12 0.061754692 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

13 0.058935851 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

14 0.058890618 66 nips-2003-Extreme Components Analysis

15 0.056576464 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

16 0.053391445 120 nips-2003-Locality Preserving Projections

17 0.053306181 128 nips-2003-Minimax Embeddings

18 0.05278904 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

19 0.051514916 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

20 0.051363092 151 nips-2003-PAC-Bayesian Generic Chaining


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.153), (1, -0.076), (2, -0.011), (3, -0.005), (4, 0.04), (5, 0.066), (6, 0.05), (7, -0.037), (8, -0.021), (9, -0.084), (10, -0.03), (11, -0.177), (12, -0.104), (13, -0.066), (14, 0.099), (15, 0.023), (16, 0.014), (17, 0.026), (18, -0.008), (19, -0.025), (20, 0.146), (21, -0.163), (22, 0.038), (23, 0.053), (24, 0.031), (25, 0.184), (26, -0.127), (27, -0.168), (28, -0.02), (29, -0.1), (30, 0.032), (31, -0.024), (32, 0.037), (33, 0.04), (34, -0.143), (35, -0.079), (36, 0.017), (37, 0.051), (38, -0.014), (39, -0.081), (40, -0.166), (41, -0.119), (42, 0.11), (43, 0.115), (44, -0.012), (45, -0.057), (46, -0.022), (47, 0.016), (48, -0.035), (49, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9694382 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

2 0.57072282 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng

Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1

3 0.47019607 66 nips-2003-Extreme Components Analysis

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

4 0.45023391 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

Author: Xavier Carreras, Lluís Màrquez

Abstract: This work presents an architecture based on perceptrons to recognize phrase structures, and an online learning algorithm to train the perceptrons together and dependently. The recognition strategy applies learning in two layers: a filtering layer, which reduces the search space by identifying plausible phrase candidates, and a ranking layer, which recursively builds the optimal phrase structure. We provide a recognition-based feedback rule which reflects to each local function its committed errors from a global point of view, and allows to train them together online as perceptrons. Experimentation on a syntactic parsing problem, the recognition of clause hierarchies, improves state-of-the-art results and evinces the advantages of our global training method over optimizing each function locally and independently. 1

5 0.43689343 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

6 0.4225103 193 nips-2003-Variational Linear Response

7 0.39786366 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

8 0.39714116 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

9 0.33883312 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

10 0.3372671 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

11 0.32407531 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

12 0.30971068 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

13 0.28881654 112 nips-2003-Learning to Find Pre-Images

14 0.28857082 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering

15 0.28455138 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

16 0.27384031 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

17 0.26739052 73 nips-2003-Feature Selection in Clustering Problems

18 0.26463795 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

19 0.26457182 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

20 0.2601018 120 nips-2003-Locality Preserving Projections


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (11, 0.02), (29, 0.012), (35, 0.023), (53, 0.576), (71, 0.07), (76, 0.031), (85, 0.058), (91, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99403512 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations

Author: Maneesh Sahani, Srikantan S. Nagarajan

Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1

2 0.99203753 111 nips-2003-Learning the k in k-means

Author: Greg Hamerly, Charles Elkan

Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1

same-paper 3 0.98681283 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

4 0.98611838 45 nips-2003-Circuit Optimization Predicts Dynamic Networks for Chemosensory Orientation in Nematode C. elegans

Author: Nathan A. Dunn, John S. Conery, Shawn R. Lockery

Abstract: The connectivity of the nervous system of the nematode Caenorhabditis elegans has been described completely, but the analysis of the neuronal basis of behavior in this system is just beginning. Here, we used an optimization algorithm to search for patterns of connectivity sufficient to compute the sensorimotor transformation underlying C. elegans chemotaxis, a simple form of spatial orientation behavior in which turning probability is modulated by the rate of change of chemical concentration. Optimization produced differentiator networks with inhibitory feedback among all neurons. Further analysis showed that feedback regulates the latency between sensory input and behavior. Common patterns of connectivity between the model and biological networks suggest new functions for previously identified connections in the C. elegans nervous system. 1

5 0.98350632 92 nips-2003-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σx|y Σ−1 , which x is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1

6 0.97378099 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

7 0.87089258 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

8 0.84324515 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

9 0.84305251 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

10 0.82009494 66 nips-2003-Extreme Components Analysis

11 0.8110559 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

12 0.81043798 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface

13 0.79186875 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

14 0.78833979 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

15 0.78550965 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

16 0.78389233 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

17 0.77887928 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

18 0.77444625 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

19 0.77343541 120 nips-2003-Locality Preserving Projections

20 0.77222919 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data