nips nips2006 nips2006-83 knowledge-graph by maker-knowledge-mining

83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning


Source: pdf

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was recently proposed and has shown promising performance [1, 2]. It extends the theory of support vector machines to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origin, which makes it unsuitable for clustering unbalanced datasets. Third, it is sensitive to the choice of kernel functions, and requires an external procedure to determine the appropriate values for the kernel parameters. In this paper, we propose a “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundary, including those not passing through the origin. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Maximum margin clustering was recently proposed and has shown promising performance [1, 2]. [sent-4, score-1.082]

2 It extends the theory of support vector machines to unsupervised learning. [sent-5, score-0.137]

3 Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. [sent-6, score-1.161]

4 First, it is computationally expensive and difficult to scale to large datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. [sent-7, score-1.265]

5 Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origin, which makes it unsuitable for clustering unbalanced datasets. [sent-8, score-1.362]

6 Third, it is sensitive to the choice of kernel functions, and requires an external procedure to determine the appropriate values for the parameters of the kernel functions. [sent-9, score-0.631]

7 In this paper, we propose a “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. [sent-10, score-0.558]

8 The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundary, including those not passing through the origin. [sent-11, score-1.858]

9 Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. [sent-13, score-0.356]

10 Finally, we show a formal connection between maximum margin clustering and spectral clustering. [sent-14, score-1.324]

11 We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. [sent-15, score-1.503]

12 A large number of algorithms have been developed for data clustering, including the k-means algorithm [3], mixture models [4], and spectral clustering [5, 6, 7, 8, 9]. [sent-17, score-0.766]

13 More recently, maximum margin clustering [1, 2] was proposed for data clustering and has shown promising performance. [sent-18, score-1.799]

14 The key idea of maximum margin clustering is to extend the theory of support vector machines to unsupervised learning. [sent-19, score-1.32]
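
As a toy illustration of this idea (not the algorithm of [1, 2]; the data, the balance constraint, and the brute-force search are purely illustrative), one can enumerate balanced ±1 labelings of a few points and keep the labeling whose linear SVM attains the largest margin:

```python
# Toy illustration: pick the +/-1 labeling whose (approximately hard-margin)
# linear SVM has the largest margin, i.e. the largest 1 / ||w||.
# Brute force is only feasible for a handful of points.
import itertools

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, (4, 2)),   # two small, well-separated blobs
               rng.normal(+2.0, 0.3, (4, 2))])

best_margin, best_labels = -np.inf, None
for signs in itertools.product([-1, 1], repeat=len(X)):
    y = np.array(signs)
    if abs(y.sum()) > 2:                         # crude class-balance constraint
        continue
    svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
    margin = 1.0 / np.linalg.norm(svm.coef_)
    if margin > best_margin:
        best_margin, best_labels = margin, y

print("labels:", best_labels, "margin:", round(best_margin, 3))
```

The formulations discussed below replace this combinatorial search over y with convex relaxations.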

15 However, despite its success, the following three major problems with maximum margin clustering have prevented it from being applied to real-world applications: • High computational cost. [sent-20, score-1.182]

16 The number of parameters in maximum margin clustering is quadratic in the number of examples. [sent-21, score-1.161]

17 Figure 1 shows the computational time (in seconds) of the maximum margin clustering algorithm with respect to different numbers of examples. [sent-23, score-1.182]

18 Figure 2: Clustering error of spectral clustering using the RBF kernel with different kernel width; panel (b) plots clustering error versus kernel width (% of data range) in the RBF function. [sent-39, score-1.665]

19 (i.e., the difference between the maximum and the minimum distance) that is used for the kernel width. [sent-42, score-0.412]

20 We can clearly see that the computational time increases dramatically when we apply the maximum margin clustering algorithm to even modest numbers of examples. [sent-43, score-1.182]

21 • Requiring clustering boundaries to pass through the origin. [sent-44, score-0.766]

22 One important assumption made by maximum margin clustering in [1] is that the clustering boundaries will pass through the origin. [sent-45, score-1.927]

23 To this end, maximum margin clustering requires centering the data points at the origin before clustering the data. [sent-46, score-1.922]

24 It is important to note that centering data points at the origin does not guarantee that clustering boundaries go through the origin, particularly when cluster sizes are unbalanced, with one cluster significantly more populous than the other. [sent-47, score-0.905]

25 Figure 2(b) shows the clustering error of maximum margin clustering for the synthesized data of two overlapping Gaussian clusters (Figure 2(a)) using the RBF kernel with different kernel widths. [sent-49, score-2.373]

26 We see that the performance of maximum margin clustering depends critically on the choice of kernel width. [sent-50, score-1.455]

27 The same problem is also observed in spectral clustering [10]. [sent-51, score-0.796]
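
A minimal sketch of this sensitivity, assuming scikit-learn's spectral clustering and an illustrative two-Gaussian dataset (not the data of Figure 2): the RBF width is swept as a fraction of the range of pairwise distances and the resulting clustering error is reported.

```python
# Sweep the RBF kernel width and watch the spectral clustering error change.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # two overlapping Gaussian clusters
               rng.normal(2.5, 1.0, (100, 2))])
y_true = np.repeat([0, 1], 100)

d = pdist(X)
d_range = d.max() - d.min()
for frac in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    sigma = frac * d_range
    pred = SpectralClustering(n_clusters=2, affinity="rbf",
                              gamma=1.0 / (2.0 * sigma ** 2),
                              random_state=0).fit_predict(X)
    # cluster labels are only defined up to permutation
    err = min(np.mean(pred != y_true), np.mean(pred == y_true))
    print(f"width = {frac:.0%} of distance range -> error = {err:.2f}")
```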

28 Although a number of studies [8, 9, 10, 6] are devoted to automatically identifying appropriate kernel matrices in clustering, they are either heuristic approaches or require additional labeled data. [sent-52, score-0.386]

29 In this paper, we propose a “generalized maximum margin clustering” framework that resolves the above three problems simultaneously. [sent-53, score-0.558]

30 In particular, the proposed framework reformulates the problem of maximum margin clustering to include the bias term in the classification boundary, and therefore removes the assumption that clustering boundaries have to pass through the origin. [sent-54, score-1.98]

31 Furthermore, the new formulation reduces the number of parameters to be linear in the number of examples, and therefore significantly reduces the computational cost. [sent-55, score-0.206]

32 Finally, it is equipped with the capability of unsupervised kernel learning, and is therefore able to determine the appropriate kernel matrix and clustering memberships simultaneously. [sent-56, score-1.378]

33 More interestingly, we will show that spectral clustering, such as the normalized cut algorithm, can be viewed as a special case of the generalized maximum margin clustering. [sent-57, score-0.95]

34 The remainder of the paper is organized as follows: Section 2 reviews the work of maximum margin clustering and kernel learning. [sent-58, score-1.427]

35 Section 3 presents the framework of generalized maximum margin clustering. [sent-59, score-0.657]

36 2 Related Work The key idea of maximum margin clustering is to extend the theory of support vector machines to unsupervised learning. [sent-62, score-1.32]

37 $0 \le \alpha \le C$, $\alpha^T y = 0$ (1), where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix and $\mathrm{diag}(y)$ stands for the diagonal matrix that uses the vector $y$ as its diagonal elements. [sent-71, score-0.438]
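
For reference, the constraints above are those of the standard soft-margin SVM dual; a small sketch (function names are ours) evaluates its objective and checks feasibility for a given α, y, and K:

```python
# W(alpha) = e^T alpha - 1/2 * alpha^T diag(y) K diag(y) alpha,
# subject to 0 <= alpha <= C and alpha^T y = 0.
import numpy as np

def svm_dual_objective(alpha, y, K):
    v = alpha * y                       # diag(y) @ alpha, done elementwise
    return alpha.sum() - 0.5 * v @ K @ v

def is_dual_feasible(alpha, y, C, tol=1e-8):
    return (np.all(alpha >= -tol) and np.all(alpha <= C + tol)
            and abs(alpha @ y) <= tol)
```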

38 To apply the above formulation to unsupervised learning, the maximum margin clustering approach relaxes the class labels y to continuous variables, and searches for both y and α that maximize the classification margin. [sent-72, score-1.552]

39 The first one relaxes $yy^T$ into a positive semi-definite (PSD) matrix $M \succeq 0$ whose diagonal elements are set to 1. [sent-77, score-0.199]

40 The second relaxation sets λ = 0, which is equivalent to assuming that there is no bias term b in the expression of the classification boundaries, or in other words, that classification boundaries have to pass through the origin of the data. [sent-78, score-0.304]

41 s.t. $\begin{bmatrix} M \circ K & e + \nu - \delta \\ (e + \nu - \delta)^T & t - 2C\,\delta^T e \end{bmatrix} \succeq 0$, $\nu \ge 0$, $\delta \ge 0$, $M \succeq 0$ (2). Finally, a few additional constraints on M are added to the above optimization problem to prevent skewed clustering sizes [1]. [sent-81, score-0.656]
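
A minimal cvxpy sketch of an SDP with the shape of (2), assuming a kernel matrix K and a constant C are given; the class-balance constraints mentioned above are omitted, so this is an illustration rather than a faithful re-implementation of [1]:

```python
# SDP relaxation sketch: relax y y^T to a PSD matrix M with unit diagonal and
# require the block matrix of (2) to be positive semi-definite.
import cvxpy as cp
import numpy as np

def mmc_sdp_relaxation(K, C):
    n = K.shape[0]
    M = cp.Variable((n, n), PSD=True)            # relaxation of y y^T
    nu = cp.Variable(n, nonneg=True)
    delta = cp.Variable(n, nonneg=True)
    t = cp.Variable()
    Z = cp.Variable((n + 1, n + 1), PSD=True)    # plays the role of the block matrix in (2)
    e = np.ones(n)

    constraints = [
        Z[:n, :n] == cp.multiply(M, K),          # M o K (elementwise product)
        Z[:n, n] == e + nu - delta,
        Z[n, n] == t - 2 * C * cp.sum(delta),
        cp.diag(M) == 1,
    ]
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve(solver=cp.SCS)
    return M.value, prob.value
```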

42 Furthermore, by setting λ = 0, the maximum margin clustering algorithm requires clustering boundaries to pass through the origin of the data, which is unsuitable for clustering data with unbalanced clusters. [sent-83, score-2.722]

43 Another important problem with the above maximum margin clustering is the difficulty in determining the appropriate kernel similarity matrix K. [sent-84, score-1.567]

44 Although many kernel-based clustering algorithms set the kernel parameters manually, there are several studies devoted to automatic selection of kernel functions, in particular the kernel width σ of the RBF kernel $\exp(-\|x_i - x_j\|^2/(2\sigma^2))$. [sent-85, score-1.8]

45 [8] recommended choosing the kernel width as 10% to 20% of the range of the distance between samples. [sent-89, score-0.367]
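
A small sketch of that heuristic, assuming a Euclidean RBF kernel of the form exp(−‖x_i − x_j‖² / (2σ²)) (the exact parameterization in [8] may differ):

```python
# Set the RBF width to a fraction (10-20%) of the range of pairwise distances.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_from_distance_range(X, frac=0.15):
    D = squareform(pdist(X))                     # pairwise Euclidean distances
    off_diag = D[D > 0]
    sigma = frac * (off_diag.max() - off_diag.min())
    return np.exp(-D ** 2 / (2.0 * sigma ** 2)), sigma
```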

46 [9] chose the kernel width that provides the least distorted clusters by running the same clustering algorithm several times for each kernel width. [sent-92, score-1.236]

47 Although this approach seems to generate good results, it requires running separate experiments for each kernel width, and therefore could be computationally intensive. [sent-93, score-0.266]

48 The work in [10] proposed a self-tuning spectral clustering algorithm that computes a different local kernel width for each data point $x_i$. [sent-95, score-1.133]

49 In particular, the local kernel width for each xi is computed as the distance of xi to its kth nearest neighbor. [sent-96, score-0.367]
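
A sketch of that construction (the function name and the default k are ours): σ_i is the distance from x_i to its k-th nearest neighbour, and the affinity between x_i and x_j is exp(−d_ij² / (σ_i σ_j)).

```python
# Self-tuning (locally scaled) affinity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def self_tuning_affinity(X, k=7):
    D = squareform(pdist(X))
    sigma = np.sort(D, axis=1)[:, k]             # column 0 is the self-distance 0,
                                                 # so column k is the k-th nearest neighbour
    A = np.exp(-D ** 2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)                     # no self-loops
    return A
```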

50 Although empirical study seems to show the effectiveness of this approach, it is unclear how to find the optimal k in computing the local kernel width. [sent-97, score-0.299]

51 As we will see in the experiment section, the clustering accuracy depends heavily on the choice of k. [sent-98, score-0.659]

52 Finally, we will briefly overview the existing work on kernel learning. [sent-99, score-0.266]

53 The representative approaches in this category include kernel alignment [11, 12], semi-definite programming [13], and spectral graph partitioning [6]. [sent-101, score-0.516]

54 Unlike these approaches, the proposed framework is designed for unsupervised kernel learning. [sent-102, score-0.403]

55 3 Generalized Maximum Margin Clustering and Unsupervised Kernel Learning We will first present the proposed clustering algorithm for hard margin, followed by the extension to soft margin and unsupervised kernel learning. [sent-103, score-1.483]

56 1 Hard Margin In the case of hard margin, the dual problem of SVM is almost identical to the problem in Eqn. [sent-105, score-0.147]

57 Following [13], we further convert the problem in (1) into its dual form: $\min_{\nu, y, \lambda} \; \tfrac{1}{2}(e + \nu + \lambda y)^T \mathrm{diag}(y)\, K^{-1}\, \mathrm{diag}(y)\,(e + \nu + \lambda y)$ s.t. [sent-107, score-0.15]
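
For concreteness, the quadratic form above can be evaluated as follows (a sketch; a linear solve is used rather than forming K⁻¹ explicitly, which also anticipates Remark II below):

```python
# 1/2 * (e + nu + lam*y)^T diag(y) K^{-1} diag(y) (e + nu + lam*y)
import numpy as np

def hard_margin_objective(y, nu, lam, K):
    e = np.ones(len(y))
    u = y * (e + nu + lam * y)          # diag(y) @ (e + nu + lam * y)
    return 0.5 * u @ np.linalg.solve(K, u)
```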

58 Thus, the optimal solution found by the dual problem in (7) is not necessarily the optimal solution for the primal problem in (5). [sent-155, score-0.314]

59 Our hope is that although the solution found by the dual problem is not optimal for the primal problem, it is still a good solution for the primal problem in (5). [sent-156, score-0.34]

60 This is similar to the SDP relaxation made by the maximum margin clustering algorithm in (2) that relaxes a non-convex programming problem into a convex one. [sent-157, score-1.327]

61 However, unlike the relaxation made in (2), which increases the number of variables from n to $n^2$, the new maximum margin formulation does not increase the number of parameters (i.e., it remains linear in the number of examples). [sent-158, score-0.779]

62 This is shown in Figure 1, in which the computational time of generalized maximum margin clustering increases much more slowly than that of the maximum margin clustering algorithm. [sent-161, score-1.839]

63 Remark II To avoid the high computational cost in estimating $K^{-1}$, we replace $K^{-1}$ with the normalized graph Laplacian $L(K)$ [15], which is defined as $L(K) = I - D^{-1/2} K D^{-1/2}$, where $D$ is a diagonal matrix whose diagonal elements are computed as $D_{i,i} = \sum_{j=1}^{n} K_{i,j}$, $i = 1, 2, \ldots, n$. [sent-162, score-0.233]

64 This is equivalent to defining a kernel matrix $\tilde{K} = L(K)^{\dagger}$, where $\dagger$ stands for the pseudo-inverse operator. [sent-166, score-0.333]
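
A sketch of Remark II, assuming the standard normalized Laplacian L(K) = I − D^(−1/2) K D^(−1/2):

```python
# Replace K^{-1} by the normalized graph Laplacian of K and use its pseudo-inverse
# as the effective kernel, K~ = L(K)^+.
import numpy as np

def laplacian_pseudo_kernel(K):
    d = K.sum(axis=1)                            # degrees D_ii = sum_j K_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(K.shape[0]) - (d_inv_sqrt[:, None] * K) * d_inv_sqrt[None, :]
    return np.linalg.pinv(L)
```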

65 More interestingly, we have the following theorem showing the relationship between generalized maximum margin clustering and the normalized cut. [sent-167, score-1.312]

66 The normalized cut algorithm is a special case of the generalized maximum margin clustering in (7) if the following conditions hold, i. [sent-169, score-1.39]

67 2 Soft Margin We extend the formulation in (7) to the case of soft margin by considering the following problem: $\min_{\nu, y, \lambda, \delta} \; \tfrac{1}{2}(e + \nu - \delta + \lambda y)^T \mathrm{diag}(y)\, K^{-1}\, \mathrm{diag}(y)\,(e + \nu - \delta + \lambda y) + C_\delta \sum_{i=1}^{n} \delta_i$ s.t. [sent-181, score-0.689]

68 $\nu \ge 0$, $\delta \ge 0$, $y \in \{+1, -1\}^n$ (8), where $C_\delta$ weights the importance of the clustering errors against the clustering margin. [sent-183, score-1.206]

69 By approximating $(z_i + \delta_i)^2$ as $z_i^2 + \delta_i^2$, we have the dual form of the above problem written as: $\max_{\gamma \in \mathbb{R}^n} \sum_{i=1}^{n} \gamma_i$ s.t. [sent-189, score-0.19]

70 The main difference between the above formulation (10) and the formulation in (7) is the introduction of the upper bound $C_\delta$ on $\gamma$ in the case of soft margin. [sent-194, score-0.413]

71 3 Unsupervised Kernel Learning As already pointed out, the performance of many clustering algorithms depends on the right choice of the kernel similarity matrix. [sent-197, score-0.94]

72 To address this problem, we extend the formulation in (10) by including a kernel learning mechanism. [sent-198, score-0.473]

73 In particular, we assume that a set of m kernel similarity matrices $K_1, K_2, \ldots, K_m$ is given. [sent-199, score-0.334]

74 Our goal is to identify the linear combination of kernel matrices, i.e. [sent-203, score-0.266]

75 $K = \sum_{i=1}^{m} \beta_i K_i$, that leads to the optimal clustering accuracy. [sent-205, score-0.636]
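
A trivial sketch of the combination step (β is assumed given and nonnegative; the joint optimization over β and γ is not reproduced):

```python
# K = sum_i beta_i * K_i
import numpy as np

def combine_kernels(kernels, beta):
    beta = np.asarray(beta, dtype=float)
    assert len(beta) == len(kernels) and np.all(beta >= 0)
    return sum(b * K for b, K in zip(beta, kernels))
```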

76 Each Laplacian Li is constructed from the kernel similarity matrix Ki . [sent-220, score-0.336]

77 Figure 3: Data distribution of the three synthesized datasets; panel (c) shows the Two Connected Circles dataset. By solving the above problem, we are able to resolve both γ (corresponding to clustering memberships) and β (corresponding to kernel learning) simultaneously. [sent-261, score-1.008]

78 4 Experiment We tested the generalized maximum margin clustering algorithm on both synthetic datasets and real datasets from the UCI repository. [sent-262, score-1.503]

79 These four datasets comprise 218, 180, 351, and 285 examples, respectively, and each example in these four datasets is represented by 17, 64, 35, and 32 features. [sent-265, score-0.252]

80 Since the “Digits” dataset consists of multiple classes, we further decompose it into four datasets of binary classes that include pairs of digits difficult to distinguish. [sent-266, score-0.205]

81 Both the normalized cut algorithm [8] and the maximum margin clustering algorithm [1] are used as baselines. [sent-267, score-1.291]

82 The RBF kernel is used throughout this study to construct the kernel similarity matrices. [sent-268, score-0.575]

83 In our first experiment, we examine the optimal performance of each clustering algorithm by using the optimal kernel width that is acquired through an exhaustive search. [sent-269, score-1.036]

84 The optimal clustering errors of these three algorithms are summarized in the first three columns of Table 1. [sent-270, score-0.636]

85 It is clear that the generalized maximum margin clustering algorithm achieves similar or better performance than both maximum margin clustering and normalized cut for most datasets when they are given the optimal kernel matrices. [sent-271, score-2.902]

86 Note that the results of maximum margin clustering are reported for a subset of samples (80 instances) from the UCI datasets due to out-of-memory problems. [sent-272, score-1.265]

87 Table 1: Clustering error (%) of normalized cut (NC), maximum margin clustering (MMC), generalized maximum margin clustering (GMMC) and self-tuning spectral clustering (ST). [sent-273, score-3.317]

88 In the second experiment, we evaluate the effectiveness of unsupervised kernel learning. [sent-300, score-0.403]

89 Ten kernel matrices are created by using the RBF kernel with the kernel width varied from 10% to 100% of the range of distance between any two examples. [sent-301, score-0.924]
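
A sketch of how such a kernel dictionary can be built (the RBF parameterization is an assumption, not taken from the paper):

```python
# Ten candidate RBF kernel matrices with widths spanning the distance range.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def candidate_rbf_kernels(X, fracs=np.linspace(0.1, 1.0, 10)):
    D = squareform(pdist(X))
    off_diag = D[D > 0]
    d_range = off_diag.max() - off_diag.min()
    return [np.exp(-D ** 2 / (2.0 * (frac * d_range) ** 2)) for frac in fracs]
```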

90 We compare the proposed unsupervised kernel learning to the self-tuning spectral clustering algorithm in [10]. [sent-302, score-1.169]

91 One of the problems with the self-tuning spectral clustering algorithm is that its clustering error usually depends on the parameter k, i.e. [sent-303, score-1.399]

92 the number of nearest neighbors used for computing the kernel width. [sent-305, score-0.266]

93 To provide a full picture of self-tuning spectral clustering, we vary k from 1 to 15, and calculate both the best and the worst performance over the different values of k. [sent-306, score-0.215]
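
A compact sketch of that sensitivity check for the binary case (names are ours; the clustering error treats labels as defined only up to permutation):

```python
# Best and worst self-tuning spectral clustering error over k = 1..15.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def self_tuning_error_range(X, y_true, ks=range(1, 16)):
    D = squareform(pdist(X))
    errors = []
    for k in ks:
        sigma = np.sort(D, axis=1)[:, k]             # distance to k-th nearest neighbour
        A = np.exp(-D ** 2 / np.outer(sigma, sigma))
        np.fill_diagonal(A, 0.0)
        pred = SpectralClustering(n_clusters=2, affinity="precomputed",
                                  random_state=0).fit_predict(A)
        errors.append(min(np.mean(pred != y_true), np.mean(pred == y_true)))
    return min(errors), max(errors)
```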

94 The last three columns of Table 1 summarize the clustering errors of generalized maximum margin clustering and self-tuning spectral clustering with both the best and the worst k. [sent-307, score-2.681]

95 First, observe the large gap between the best and worst performance of self-tuning spectral clustering across different choices of k, which implies that this algorithm is sensitive to the parameter k. [sent-308, score-0.877]

96 Second, for most datasets, generalized maximum margin clustering achieves similar performance as self-tuning spectral clustering with the best k. [sent-309, score-2.026]

97 Furthermore, for a number of datasets, the unsupervised kernel learning method achieves the performance close to the one using the optimal kernel width. [sent-310, score-0.702]

98 Both results indicate that the proposed algorithm for unsupervised kernel learning is effective in identifying appropriate kernels. [sent-311, score-0.443]

99 5 Conclusion In this paper, we proposed a framework for the generalized maximum margin clustering. [sent-312, score-0.657]

100 Our empirical study with three synthetic datasets and four UCI datasets shows the promising performance of our proposed algorithm. [sent-314, score-0.3]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clustering', 0.603), ('margin', 0.412), ('kernel', 0.266), ('formulism', 0.185), ('spectral', 0.163), ('ce', 0.156), ('maximum', 0.146), ('unsupervised', 0.137), ('di', 0.115), ('origins', 0.105), ('datasets', 0.104), ('width', 0.101), ('generalized', 0.099), ('boundaries', 0.094), ('erent', 0.087), ('gmmc', 0.079), ('nitive', 0.079), ('digits', 0.079), ('cut', 0.078), ('zi', 0.073), ('diag', 0.069), ('pass', 0.069), ('relaxes', 0.069), ('dual', 0.065), ('uci', 0.063), ('rbf', 0.062), ('prime', 0.059), ('centralizing', 0.053), ('mmc', 0.053), ('worst', 0.052), ('normalized', 0.052), ('ki', 0.05), ('unbalanced', 0.05), ('circles', 0.049), ('ectiveness', 0.046), ('laplacian', 0.044), ('soft', 0.043), ('similarity', 0.043), ('yy', 0.042), ('lansing', 0.042), ('overlapped', 0.042), ('stands', 0.04), ('appropriate', 0.04), ('diagonal', 0.039), ('breast', 0.039), ('ding', 0.039), ('memberships', 0.039), ('wi', 0.038), ('ionosphere', 0.037), ('unsuitable', 0.037), ('jin', 0.037), ('michigan', 0.037), ('relaxation', 0.036), ('vote', 0.035), ('east', 0.035), ('synthesized', 0.035), ('erence', 0.035), ('synthetic', 0.035), ('promising', 0.035), ('remark', 0.033), ('graph', 0.033), ('optimal', 0.033), ('solution', 0.032), ('studies', 0.032), ('programming', 0.031), ('sensitive', 0.031), ('problem', 0.03), ('st', 0.03), ('nc', 0.029), ('cristianini', 0.029), ('shi', 0.029), ('choice', 0.028), ('experiment', 0.028), ('convert', 0.028), ('min', 0.027), ('matrix', 0.027), ('xu', 0.027), ('furthermore', 0.027), ('li', 0.026), ('objective', 0.026), ('invariance', 0.026), ('ciency', 0.026), ('matrices', 0.025), ('advances', 0.024), ('automatically', 0.023), ('larson', 0.023), ('neufeld', 0.023), ('comparision', 0.023), ('elissee', 0.023), ('punishment', 0.023), ('reformulates', 0.023), ('optimization', 0.023), ('partitioning', 0.023), ('extend', 0.022), ('elements', 0.022), ('translation', 0.022), ('four', 0.022), ('hard', 0.022), ('max', 0.022), ('computational', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999827 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

2 0.35443521 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

3 0.31899336 80 nips-2006-Fundamental Limitations of Spectral Clustering

Author: Boaz Nadler, Meirav Galun

Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1

4 0.19992158 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

Author: Fei Sha, Lawrence K. Saul

Abstract: We study the problem of parameter estimation in continuous density hidden Markov models (CD-HMMs) for automatic speech recognition (ASR). As in support vector machines, we propose a learning algorithm based on the goal of margin maximization. Unlike earlier work on max-margin Markov networks, our approach is specifically geared to the modeling of real-valued observations (such as acoustic feature vectors) using Gaussian mixture models. Unlike previous discriminative frameworks for ASR, such as maximum mutual information and minimum classification error, our framework leads to a convex optimization, without any spurious local minima. The objective function for large margin training of CD-HMMs is defined over a parameter space of positive semidefinite matrices. Its optimization can be performed efficiently with simple gradient-based methods that scale well to large problems. We obtain competitive results for phonetic recognition on the TIMIT speech corpus.

5 0.17895012 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus

Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.

6 0.1767651 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

7 0.17523476 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

8 0.17251666 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

9 0.16458993 65 nips-2006-Denoising and Dimension Reduction in Feature Space

10 0.13892318 130 nips-2006-Max-margin classification of incomplete data

11 0.13664719 181 nips-2006-Stability of $K$-Means Clustering

12 0.1320999 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

13 0.11299483 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

14 0.11274765 79 nips-2006-Fast Iterative Kernel PCA

15 0.10888092 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

16 0.097780123 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding

17 0.087783135 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

18 0.081061617 186 nips-2006-Support Vector Machines on a Budget

19 0.079679891 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

20 0.079172038 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.273), (1, 0.179), (2, 0.022), (3, 0.393), (4, 0.059), (5, -0.091), (6, -0.0), (7, 0.12), (8, 0.227), (9, -0.011), (10, -0.124), (11, 0.228), (12, -0.175), (13, 0.097), (14, 0.019), (15, 0.065), (16, 0.048), (17, 0.093), (18, 0.074), (19, 0.041), (20, 0.064), (21, 0.05), (22, -0.034), (23, 0.041), (24, 0.057), (25, 0.074), (26, -0.033), (27, 0.012), (28, 0.016), (29, 0.038), (30, 0.043), (31, 0.067), (32, 0.036), (33, -0.04), (34, -0.082), (35, 0.038), (36, -0.016), (37, 0.047), (38, -0.005), (39, -0.041), (40, -0.101), (41, 0.05), (42, 0.028), (43, 0.078), (44, 0.017), (45, 0.091), (46, -0.028), (47, -0.048), (48, -0.003), (49, -0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98735613 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

2 0.88187188 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

3 0.80968076 80 nips-2006-Fundamental Limitations of Spectral Clustering

Author: Boaz Nadler, Meirav Galun

Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1

4 0.72435743 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

Author: Ron Zass, Amnon Shashua

Abstract: In this paper we focus on the issue of normalization of the affinity matrix in spectral clustering. We show that the difference between N-cuts and Ratio-cuts is in the error measure being used (relative-entropy versus L1 norm) in finding the closest doubly-stochastic matrix to the input affinity matrix. We then develop a scheme for finding the optimal, under Frobenius norm, doubly-stochastic approximation using Von-Neumann’s successive projections lemma. The new normalization scheme is simple and efficient and provides superior clustering performance over many of the standardized tests. 1

5 0.63145357 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

Author: Jason V. Davis, Inderjit S. Dhillon

Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1

6 0.5645442 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding

7 0.54275608 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

8 0.50034755 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

9 0.49323955 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

10 0.48785555 181 nips-2006-Stability of $K$-Means Clustering

11 0.45956895 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

12 0.44646031 82 nips-2006-Gaussian and Wishart Hyperkernels

13 0.44347212 79 nips-2006-Fast Iterative Kernel PCA

14 0.3973619 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets

15 0.39669776 65 nips-2006-Denoising and Dimension Reduction in Feature Space

16 0.395771 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

17 0.39361325 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

18 0.39256835 130 nips-2006-Max-margin classification of incomplete data

19 0.38346156 158 nips-2006-PG-means: learning the number of clusters in data

20 0.34801036 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.112), (3, 0.033), (7, 0.109), (9, 0.12), (12, 0.016), (20, 0.018), (22, 0.094), (34, 0.15), (44, 0.063), (57, 0.063), (65, 0.1), (69, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93140745 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

Author: Michael C. Mozer, Michael Shettel, Michael P. Holmes

Abstract: Categorization is a central activity of human cognition. When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Specifically, when experimental subjects are shown an exemplar of some target category, the category prototype appears to be pulled toward the exemplar, and the prototypes of all nontarget categories appear to be pushed away. These push and pull effects diminish with experience, and likely reflect long-term learning of category boundaries. We propose and evaluate four principled probabilistic (Bayesian) accounts of context effects in categorization. In all four accounts, the probability of an exemplar given a category is encoded as a Gaussian density in feature space, and categorization involves computing category posteriors given an exemplar. The models differ in how the uncertainty distribution of category prototypes is represented (localist or distributed), and how it is updated following each experience (using a maximum likelihood gradient ascent, or a Kalman filter update). We find that the distributed maximum-likelihood model can explain the key experimental phenomena. Further, the model predicts other phenomena that were confirmed via reanalysis of the experimental data.
In a classification paradigm, experimental subjects are given an exemplar drawn from one category and are asked to respond with the correct category label (Zotov, Jones, & Mewhort, 2003). After making their response, subjects receive feedback as to the correct label, which we’ll refer to as the target. In a production paradigm, subjects are given a target category label and asked to produce an exemplar of that category, e.g., using a computer mouse to indicate the circle diameter (Jones & Mewhort, 2003). Once a response is made, subjects receive feedback as to the correct or true category label for the exemplar they produced. Neither classification nor production task has sequential structure, because the order of trial is random in both experiments. The production task provides direct information about the subjects’ internal representations, because subjects are producing exemplars that they consider to be prototypes of a category, whereas the categorization task requires indirect inferences to be made about internal representations from reaction time and accuracy data. Nonetheless, the findings in the production and classification tasks mirror one another nicely, providing converging evidence as to the nature of learning. The production task reveals how mental representations shift as a function of trial-to-trial sequences, and these shifts cause the sequential pattern of errors and response times typically observed in the classification task. We focus on the production task in this paper because it provides a richer source of data. However, we address the categorization task with our models as well. Figure 1 provides a schematic depiction of the key sequential effects in categorization. The horizontal line represents the stimulus dimension, e.g., circle diameter. The dimension is cut into four regions labeled with the corresponding category. The category center, which we’ll refer to as the prototype, is indicated by a vertical dashed line. The long solid vertical line marks the current exemplar—whether it is an exemplar presented to subjects in the classification task or an exemplar generated by subjects in the production task. Following an experimental trial with this exemplar, category prototypes appear to shift: the target-category prototype moves toward the exemplar, which we refer to as a pull effect, and all nontarget-category prototypes move away from the exemplar, which we refer to as a push effect. Push and pull effects are assessed in the production task by examining the exemplar produced on the following trial, and in the categorization task by examining the likelihood of an error response near category boundaries. The set of phenomena to be explained are as follows, described in terms of the production task. All numerical results referred to are from Jones and Mewhort (2003). This experiment consisted of 12 blocks of 40 trials, with each category label given as target 10 times within a block. • Within-category pull: When a target category is repeated on successive trials, the exemplar generated on the second trial moves toward the exemplar generated on the first trial, with respect to the true category prototype. Across the experiment, a correlation coefficient of 0.524 is obtained, and remains fairly constant over trials. 
• Between-category push: When the target category changes from one trial to the next, the exemplar generated on the second trial moves away from the exemplar generated on the first trial (or equivalently, from the prototype of the target category on the first trial). Figure 2a summarizes the sequential push effects from Jones and Mewhort. The diameter of the circle produced on trial t is plotted as a function of the target category on trial t − 1, with one line for each of the four trial t targets. The mean diameter for each target category is subtracted out, so the absolute vertical offset of each line is unimportant. The main feature of the data to note is that all four curves have a negative slope, which has the following meaning: the smaller that target t − 1 is (i.e., the further to the left on the x axis in Figure 1), the larger the response to target t is (further to the right in Figure 1), and vice versa, reflecting a push away from target t − 1. Interestingly and importantly, the magnitude of the push increases with the ordinal distance between targets t − 1 and t. Figure 2a is based on data from only eight subjects and is therefore noisy, though the effect is statistically reliable. As further evidence, Figure 2b shows data from a categorization task (Zotov et al., 2003), where the y-axis is a different dependent measure, but the negative slope has the same interpretation as in Figure 2a. example Figure 1: Schematic depiction of sequential effects in categorization A B C D stimulus dimension C 0 B −0.02 −0.04 −0.06 −0.08 −0.1 0.06 0.04 0.04 response deviation D 0.02 0.02 0 −0.02 −0.04 −0.06 −0.08 A B C −0.1 D previous category label 0.02 0 −0.02 −0.04 −0.06 −0.08 A B C −0.1 D previous category label (b) humans: classification (d) KFU−distrib 0.08 D previous category label C D 0.06 0.04 0.04 response deviation response deviation C B (f) MLGA−distrib 0.08 0.02 0 −0.02 −0.04 −0.06 −0.08 B A previous category label 0.06 A (e) MLGA−local 0.08 0.06 A 0.04 (c) KFU−local 0.08 response deviation response deviation 0.06 response bias from (a) production task of Jones and Mewhort (2003), (b) classification task of Zotov et al. (2003), and (c)-(f) the models proposed in this paper. The y axis is the deviation of the response from the mean, as a proportion of the total category width. The response to category A is solid red, B is dashed magenta, C is dash-dotted blue, and D is dotted green. (a) humans: production 0.08 Figure 2: Push effect data −0.1 0.02 0 −0.02 −0.04 −0.06 −0.08 A B C D previous category label −0.1 A B C D previous category label • Push and pull effects are not solely a consequence of errors or experimenter feedback. In quantitative estimation of push and pull effects, trial t is included in the data only if the response on trial t − 1 is correct. Thus, the effects follow trials in which no error feedback is given to the subjects, and therefore the adjustments are not due to explicit error correction. • Push and pull effects diminish over the course of the experiment. The magnitude of push effects can be measured by the slope of the regression lines fit to the data in Figure 2a. The slopes get shallower over successive trial blocks. The magnitude of pull effects can be measured by the standard deviation (SD) of the produced exemplars, which also decreases over successive trial blocks. • Accuracy increases steadily over the course of the experiment, from 78% correct responses in the first block to 91% in the final block. 
This improvement occurs despite the fact that error feedback is relatively infrequent and becomes even less frequent as performance improves. 2 Four Models In this paper, we explore four probabilistic (Bayesian) models to explain data described in the previous section. The key phenomenon to explain turns out to be the push effect, for which three of the four models fail to account. Modelers typically discard the models that they reject, and present only their pet model. In this work, we find it useful to report on the rejected models for three reasons. First, they help to set up and motivate the one successful model. Second, they include several obvious candidates, and we therefore have the imperative to address them. Third, in order to evaluate a model that can explain certain data, one needs to know the degree to which the the data constrain the space of models. If many models exist that are consistent with the data, one has little reason to prefer our pet candidate. Underlying all of the models is a generative probabilistic framework in which a category i is represented by a prototype value, di , on the dimension that discriminates among the categories. In the example used throughout this paper, the dimension is the diameter of a circle (hence the notation d for the prototype). An exemplar, E, of category i is drawn from a Gaussian distribution with mean di and variance vi , denoted E ∼ N (di , vi ). Category learning involves determining d ≡ {di }. In this work, we assume that the {vi } are fixed and given. Because d is unknown at the start of the experiment, it is treated as the value of a random vector, D ≡ {Di }. Figure 3a shows a simple graphical model representing the generative framework, in which E is the exemplar and C the category label. To formalize our discussion so far, we adopt the following notation: P (E|C = c, D = d) ∼ N (hc d, vc ), (1) where, for the time being, hc is a unary column vector all of whose elements are zero except for element c which has value 1. (Subscripts may indicate either an index over elements of a vector, or an index over vectors. Boldface is used for vectors and matrices.) Figure 3: (a) Graphical model depicting selection of an exemplar, E, of a category, C, based on the prototype vector, D; (b) Dynamic version of model indexed by trials, t (a) D (b) C Dt-1 Ct-1 E Dt Ct Et-1 Et We assume that the prototype representation, D, is multivariate Gaussian, D ∼ N (Ψ, Σ), where Ψ and Σ encode knowledge—and uncertainty in the knowledge—of the category prototype structure. Given this formulation, the uncertainty in D can be integrated out: P (E|C) ∼ N (hc Ψ, hc ΣhT + vc ). c (2) For the categorization task, a category label can be assigned by evaluating the category posterior, P (C|E), via Bayes rule, Equation 1, and the category priors, P (C). In this framework, learning takes place via trial-to-trial adaptation of the category prototype distribution, D. In Figure 3b, we add the subscript t to each random variable to denote the trial, yielding a dynamic graphical model for the sequential updating of the prototype vector, Dt . (The reader should be attentive to the fact that we use subscripted indices to denote both trials and category labels. We generally use the index t to denote trial, and c or i to denote a category label.) The goal of our modeling work is to show that the sequential updating process leads to context effects, such as the push and pull effects discussed earlier. We propose four alternative models to explore within this framework. 
The four models are obtained via the Cartesian product of two binary choices: the learning rule and the prototype representation. 2.1 Learning rule The first learning rule, maximum likelihood gradient ascent (MLGA), attempts to adjust the prototype representation so as to maximize the log posterior of the category given the exemplar. (The category, C = c, is the true label associated with the exemplar, i.e., either the target label the subject was asked to produce, or—if an error was made—the actual category label the subject did produce.) Gradient ascent is performed in all parameters of Ψ and Σ: ∆ψi = ψ ∂ log(P (c|e)) and ∂ψi ∆σij = σ ∂ log(P (c|e)), ∂σij (3) where ψ and σ are step sizes. To ensure that Σ remains a covariance matrix, constrained gradient 2 steps are applied. The constraints are: (1) diagonal terms are nonnegative, i.e., σi ≥ 0; (2) offdiagonal terms are symmetric, i.e., σij = σji ; and (3) the matrix remains positive definite, ensured σ by −1 ≤ σiijj ≤ 1. σ The second learning rule, a Kalman filter update (KFU), reestimates the uncertainty distribution of the prototypes given evidence provided by the current exemplar and category label. To draw the correspondence between our framework and a Kalman filter: the exemplar is a scalar measurement that pops out of the filter, the category prototypes are the hidden state of the filter, the measurement noise is vc , and the linear mapping from state to measurement is achieved by hc . Technically, the model is a measurement-switched Kalman filter, where the switching is determined by the category label c, i.e., the measurement function, hc , and noise, vc , are conditioned on c. The Kalman filter also allows temporal dynamics via the update equation, dt = Adt−1 , as well as internal process noise, whose covariance matrix is often denoted Q in standard Kalman filter notation. We investigated the choice of A and R, but because they did not impact the qualitative outcome of the simulations, we used A = I and R = 0. Given the correspondence we’ve established, the KFU equations—which specify Ψt+1 and Σt+1 as a function of ct , et , Ψt , and Σt —can be found in an introductory text (e.g., Maybeck, 1979). Change to a category prototype for each category following a trial of a given category. Solid (open) bars indicate trials in which the exemplar is larger (smaller) than the prototype. 2.2 prototype mvt. Figure 4: 0.2 trial t −1: A 0 −0.2 0.2 trial t −1: B 0 A B C trial t D −0.2 0.2 trial t −1: C 0 A B C trial t D −0.2 0.2 trial t −1: D 0 A B C trial t D −0.2 A B C trial t D Representation of the prototype The prototype representation that we described is localist: there is a one-to-one correspondence between the prototype for each category i and the random variable Di . To select the appropriate prototype given a current category c, we defined the unary vector hc and applied hc as a linear transform on D. The identical operations can be performed in conjunction with a distributed representation of the prototype. But we step back momentarily to motivate the distributed representation. The localist representation suffers from a key weakness: it does not exploit interrelatedness constraints on category structure. The task given to experimental subjects specifies that there are four categories, and they have an ordering; the circle diameters associated with category A are smaller than the diameters associated with B, etc. Consequently, dA < dB < dC < dD . One might make a further assumption that the category prototypes are equally spaced. 
Exploiting these two sources of domain knowledge leads to the distributed representation of category structure. A simple sort of distributed representation involves defining the prototype for category i not as di but as a linear function of an underlying two-dimensional state-space representation of structure. In this state space, d1 indicates the distance between categories and d2 an offset for all categories. This representation of state can be achieved by applying Equation 1 and defining hc = (nc , 1), where nc is the ordinal position of the category (nA = 1, nB = 2, etc.). We augment this representation with a bit of redundancy by incorporating not only the ordinal positions but also the reverse ordinal positions; this addition yields a symmetry in the representation between the two ends of the ordinal category scale. As a result of this augmentation, d becomes a three-dimensional state space, and hc = (nc , N + 1 − nc , 1), where N is the number of categories. To summarize, both the localist and distributed representations posit the existence of a hidden-state space—unknown at the start of learning—that specifies category prototypes. The localist model assumes one dimension in the state space per prototype, whereas the distributed model assumes fewer dimensions in the state space—three, in our proposal—than there are prototypes, and computes the prototype location as a function of the state. Both localist and distributed representations assume a fixed, known {hc } that specify the interpretation of the state space, or, in the case of the distributed model, the subject’s domain knowledge about category structure. 3 Simulation Methodology We defined a one-dimensional feature space in which categories A-D corresponded to the ranges [1, 2), [2, 3), [3, 4), and [4, 5), respectively. In the human experiment, responses were considered incorrect if they were smaller than A or larger than D; we call these two cases out-of-bounds-low (OOBL) and out-of-bounds-high (OOBH). OOBL and OOBH were treated as two additional categories, resulting in 6 categories altogether for the simulation. Subjects and the model were never asked to produce exemplars of OOBL or OOBH, but feedback was given if a response fell into these categories. As in the human experiment, our simulation involved 480 trials. We performed 100 replications of each simulation with identical initial conditions but different trial sequences, and averaged results over replications. All prototypes were initialized to have the same mean, 3.0, at the start of the simulation. Because subjects had some initial practice on the task before the start of the experimental trials, we provided the models with 12 initial trials of a categorization (not production) task, two for each of the 6 categories. (For the MLGA models, it was necessary to use a large step size on these trials to move the prototypes to roughly the correct neighborhood.) To perform the production task, the models must generate an exemplar given a category. It seems natural to draw an exemplar from the distribution in Equation 2 for P (E|C). However, this distribu- tion reflects the full range of exemplars that lie within the category boundaries, and presumably in the production task, subjects attempt to produce a prototypical exemplar. Consequently, we exclude the intrinsic category variance, vc , from Equation 2 in generating exemplars, leaving variance only via uncertainty about the prototype. Each model involved selection of various parameters and initial conditions. 
We searched the parameter space by hand, attempting to find parameters that satisfied basic properties of the data: the accuracy and response variance in the first and second halves of the experiment. We report only parameters for the one model that was successful, the MLGA-Distrib: εψ = 0.0075, εσ = 1.5 × 10⁻⁶ for off-diagonal terms and 1.5 × 10⁻⁷ for diagonal terms (the gradient for the diagonal terms was relatively steep), Σ0 = 0.01I, and for all categories c, vc = 0.42.

4 Results

4.1 Push effect

The phenomenon that most clearly distinguishes the models is the push effect. The push effect is manifested in sequential-dependency functions, which plot the (relative) response on trial t as a function of trial t − 1. As we explained using Figures 2a,b, the signature of the push effect is a negatively sloped line for each of the different trial-t target categories. The sequential-dependency functions for the four models are presented in Figures 2c–f.

KFU-Local (Figure 2c) produces a flat line, indicating no push whatsoever. The explanation for this result is straightforward: the Kalman filter update alters only the variable that is responsible for the measurement (exemplar) obtained on that trial. That variable is the prototype of the target class c, Dc. We thought the lack of an interaction among the category prototypes might be overcome with KFU-Distrib, because with a distributed prototype representation, all of the state variables jointly determine the target category prototype. However, our intuition turned out to be incorrect. We experimented with many different representations and parameter settings, but KFU-Distrib consistently obtained flat or shallow positively sloped lines (Figure 2d).

MLGA-Local (Figure 2e) obtains a push effect for neighboring classes, but not distant classes. For example, examining the dashed magenta line, note that B is pushed away by A and C, but is not affected by D. MLGA-Local maximizes the likelihood of the target category both by pulling the class-conditional density of the target category toward the exemplar and by pushing the class-conditional densities of the other categories away from the exemplar. However, if a category has little probability mass at the location of the exemplar, the increase in likelihood that results from pushing it further away is negligible, and consequently, so is the push effect.

MLGA-Distrib obtains a lovely result (Figure 2f)—a negatively sloped line, diagnostic of the push effect. The effect magnitude matches that in the human data (Figure 2a), and captures the key property that the push effect increases with the ordinal distance of the categories. We did not build a mechanism into MLGA-Distrib to produce the push effect; it is somewhat of an emergent property of the model. The state representation of MLGA-Distrib has three components: d1, the weight of the ordinal position of a category prototype, d2, the weight of the reverse ordinal position, and d3, an offset. The last term, d3, cannot be responsible for a push effect, because it shifts all prototypes equally, and therefore can only produce a flat sequential dependency function. Figure 4 helps provide an intuition for how d1 and d2 work together to produce the push effect. Each graph shows the average movement of the category prototype (units on the y-axis are arbitrary) observed on trial t, for each of the four categories, following presentation of a given category on trial t − 1.
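Before walking through Figure 4, note that a sequential-dependency analysis of the kind plotted in Figure 2 can be reconstructed from a trial sequence in a few lines. The sketch below is not the analysis code used for the figures: it measures responses relative to nominal category centers (an assumption), bins them by the previous trial's category, and reports a pooled regression slope whose negative sign is the signature of the push effect (the paper's figures instead plot a separate line per trial-t target category).

```python
import numpy as np

def sequential_dependency(categories, responses, centers):
    """Mean relative response on trial t, keyed by (trial-t category, trial-(t-1) category).

    `categories` is an integer array of target categories, `responses` the produced
    exemplars, and `centers` the nominal category centers used for relative responses.
    """
    rel = responses - centers[categories]
    table = {}
    for t in range(1, len(categories)):
        table.setdefault((categories[t], categories[t - 1]), []).append(rel[t])
    return {k: np.mean(v) for k, v in table.items()}

def push_slope(dep_table):
    """Pooled regression slope of mean relative response against the previous
    category's ordinal position; a negative slope indicates a push effect."""
    xs, ys = zip(*[(c_prev, m) for (c_t, c_prev), m in dep_table.items()])
    return np.polyfit(xs, ys, 1)[0]
```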
In Figure 4, positive values on the y-axis indicate increases in the prototype (movement to the right in Figure 1), and negative values indicate decreases. Each solid vertical bar represents the movement of a given category prototype following a trial in which the exemplar is larger than its current prototype; each open vertical bar represents movement when the exemplar is to the left of its prototype. Notice that all category prototypes get larger or smaller on a given trial. But over the course of the experiment, the exemplar should be larger than the prototype as often as it is smaller, and the two shifts should sum together and partially cancel out. The result is the value indicated by the small horizontal bar along each line. The balance between the shifts in the two directions exactly corresponds to the push effect. Thus, the model produces a push-effect graph, but it is not truly producing a push effect as was originally conceived by the experimentalists. We are currently considering empirical consequences of this simulation result.

Figure 5 shows a trial-by-trial trace from MLGA-Distrib. Figure 5: Trial-by-trial trace of MLGA-Distrib. (a) exemplars generated on one run of the simulation; (b) the mean and (c) variance of the class prototype distribution for the 6 classes on one run; (d) mean proportion correct over 100 replications of the simulation; (e) push and pull effects, as measured by changes to the prototype means: the upper (green) curve is the pull of the target prototype mean toward the exemplar, and the lower (red) curve is the push of the nontarget prototype means away from the exemplar, over 100 replications; (f) category posterior of the generated exemplar over 100 replications, reflecting gradient ascent in the posterior. (Panel y-axes: example; class prototype; log(class variance); P(correct); shift, + = toward, − = away; posterior. x-axes: trial number, 0–450.)

4.2 Other phenomena accounted for

MLGA-Distrib captures the other phenomena we listed at the outset of this paper. Like all of the other models, MLGA-Distrib readily produces a pull effect, which is shown in the movement of category prototypes in Figure 5e. More observably, a pull effect is manifested as a positive correlation between two successive trials of the same category: when trial t − 1 is to the left of the true category prototype, trial t is likely to be to the left as well. In the human data, the correlation coefficient over the experiment is 0.524; in the model, the coefficient is 0.496. The explanation for the pull effect is apparent: moving the category prototype to the exemplar increases the category likelihood. Although many learning effects in humans are based on error feedback, the experimental studies showed that push and pull effects occur even in the absence of errors, as they do in MLGA-Distrib. The model simply assumes that the target category it used to generate an exemplar is the correct category when no feedback to the contrary is provided. As long as the likelihood gradient is nonzero, category prototypes will be shifted. Pull and push effects shrink over the course of the experiment in human studies, as they do in the simulation. Figure 5e shows a reduction in both pull and push, as measured by the shift of the prototype means toward or away from the exemplar.
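The repeat-trial correlation quoted above (0.524 in humans, 0.496 in the model) can be computed directly from a trial sequence. A minimal sketch, again measuring responses relative to assumed category centers rather than by the paper's exact procedure:

```python
import numpy as np

def repeat_trial_correlation(categories, responses, centers):
    """Correlation between relative responses on successive trials of the same
    category -- the observable signature of the pull effect."""
    rel = responses - centers[categories]
    pairs = [(rel[t - 1], rel[t]) for t in range(1, len(categories))
             if categories[t] == categories[t - 1]]
    prev, curr = zip(*pairs)
    return np.corrcoef(prev, curr)[0, 1]
```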
We measured the slope of MLGA-Distrib’s push function (Figure 2f) for trials in the first and second half of the simulation. The slope dropped from −0.042 to −0.025, as one would expect from Figure 5e. (These slopes are obtained by combining responses from 100 replications of the simulation. Consequently, each point on the push function was an average over 6000 trials, and therefore the regression slopes are highly reliable.) A quantitative, observable measure of pull is the standard deviation (SD) of responses. As push and pull effects diminish, SDs should decrease. In human subjects, the response SDs in the first and second half of the experiment are 0.43 and 0.33, respectively. In the simulation, the response SDs are 0.51 and 0.38. Shrink reflects the fact that the model is approaching a local optimum in log likelihood, causing gradients—and learning steps—to become smaller. Not all model parameter settings lead to shrink; as in any gradient-based algorithm, step sizes that are too large do not lead to convergence. However, such parameter settings make little sense in the context of the learning objective.

4.3 Model predictions

MLGA-Distrib produces greater pull of the target category toward the exemplar than push of the neighboring categories away from the exemplar. In the simulation, the magnitude of the target pull—measured by the movement of the prototype mean—is 0.105, contrasted with the neighbor push, which is 0.017. After observing this robust result in the simulation, we found pertinent experimental data. Using the categorization paradigm, Zotov et al. (2003) found that if the exemplar on trial t is near a category border, subjects are more likely to produce an error if the category on trial t − 1 is repeated (i.e., a pull effect just took place) than if the previous trial is of the neighboring category (i.e., a push effect), even when the distance between exemplars on t − 1 and t is matched. The greater probability of error translates to a greater magnitude of pull than push.

The experimental studies noted a phenomenon termed snap back. If the same target category is presented on successive trials, and an error is made on the first trial, subjects perform very accurately on the second trial, i.e., they generate an exemplar near the true category prototype. It appears as if subjects, realizing they have been slacking, reawaken and snap the category prototype back to where it belongs. We tested the model, but observed a sort of anti-snap-back. If the model made an error on the first trial, the mean deviation was larger—not smaller—on the second trial: 0.40 versus 0.32. Thus, MLGA-Distrib fails to explain this phenomenon. However, the phenomenon is not inconsistent with the model. One might suppose that on an error trial, subjects become more attentive, and increased attention might correspond to a larger learning rate on an error trial, which should yield a more accurate response on the following trial.

McLaren et al. (1995) studied a phenomenon in humans known as peak shift, in which subjects are trained to categorize unidimensional stimuli into one of two categories. Subjects are faster and more accurate when presented with exemplars far from the category boundary than with those near the boundary. In fact, they respond more efficiently to far exemplars than they do to the category prototype. The results are characterized in terms of the prototype of one category being pushed away from the prototype of the other category.
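The shrink measure used above (response SD in each half of the session) is simple to reproduce; the sketch below takes the SD of responses about nominal category centers, which is one reasonable reading of "response SD" but is an assumption rather than the paper's stated procedure.

```python
import numpy as np

def half_session_sds(categories, responses, centers):
    """Response SD about the category centers in the first and second halves of the
    session; shrink predicts the second value is smaller."""
    rel = responses - centers[categories]
    half = len(rel) // 2
    return float(np.std(rel[:half])), float(np.std(rel[half:]))
```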
It seems straightforward to explain the peak-shift data in MLGA-Distrib as a type of long-term push effect.

5 Related Work and Conclusions

Stewart, Brown, and Chater (2002) proposed an account of categorization context effects in which responses are based solely on the relative difference between the previous and present exemplars. No representation of the category prototype is maintained. However, classification based solely on relative difference cannot account for diminished bias effects as a function of experience. A long-term stable prototype representation, of the sort incorporated into our models, seems necessary.

We considered four models in our investigation, and the fact that only one accounts for the experimental data suggests that the data are nontrivial. All four models have principled theoretical underpinnings, and the space they define may suggest other elegant frameworks for understanding mechanisms of category learning. The successful model, MLGA-Distrib, offers a deep insight into understanding multiple-category domains: category structure must be considered. MLGA-Distrib exploits knowledge available to subjects performing the task concerning the ordinal relationships among categories. A model without this knowledge, MLGA-Local, fails to explain the data. Thus, the interrelatedness of categories appears to provide a source of constraint that individuals use in learning about the structure of the world.

Acknowledgments
This research was supported by NSF BCS 0339103 and NSF CSE-SMA 0509521. Support for the second author comes from an NSERC fellowship.

References
Jones, M. N., & Mewhort, D. J. K. (2003). Sequential contrast and assimilation effects in categorization of perceptual stimuli. Poster presented at the 44th Meeting of the Psychonomic Society, Vancouver, B.C.
Maybeck, P. S. (1979). Stochastic models, estimation, and control, Volume I. Academic Press.
McLaren, I. P. L., et al. (1995). Prototype effects and peak shift in categorization. JEP:LMC, 21, 662–673.
Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. JEP:LMC, 28, 3–11.
Zotov, V., Jones, M. N., & Mewhort, D. J. K. (2003). Trial-to-trial representation shifts in categorization. Poster presented at the 13th Meeting of the Canadian Society for Brain, Behaviour, and Cognitive Science, Hamilton, Ontario.

2 0.89567089 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data

Author: Johanna M. Zumer, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan

Abstract: We have developed a novel algorithm for integrating source localization and noise suppression based on a probabilistic graphical model of stimulus-evoked MEG/EEG data. Our algorithm localizes multiple dipoles while suppressing noise sources with the computational complexity equivalent to a single dipole scan, and is therefore more efficient than traditional multidipole fitting procedures. In simulation, the algorithm can accurately localize and estimate the time course of several simultaneously-active dipoles, with rotating or fixed orientation, at noise levels typical for averaged MEG data. Furthermore, the algorithm is superior to beamforming techniques, which we show to be an approximation to our graphical model, in estimation of temporally correlated sources. Success of this algorithm for localizing auditory cortex in a tumor patient and for localizing an epileptic spike source are also demonstrated. 1

same-paper 3 0.87259716 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

4 0.81180805 80 nips-2006-Fundamental Limitations of Spectral Clustering

Author: Boaz Nadler, Meirav Galun

Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1

5 0.80529916 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

6 0.80284494 65 nips-2006-Denoising and Dimension Reduction in Feature Space

7 0.8007369 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

8 0.79390299 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

9 0.79297757 158 nips-2006-PG-means: learning the number of clusters in data

10 0.78914297 163 nips-2006-Prediction on a Graph with a Perceptron

11 0.78674668 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks

12 0.7845493 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.78404635 117 nips-2006-Learning on Graph with Laplacian Regularization

14 0.78308403 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

15 0.78101361 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

16 0.78051949 79 nips-2006-Fast Iterative Kernel PCA

17 0.77807862 61 nips-2006-Convex Repeated Games and Fenchel Duality

18 0.77781099 167 nips-2006-Recursive ICA

19 0.77740777 203 nips-2006-implicit Online Learning with Kernels

20 0.77119356 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment