nips nips2003 nips2003-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 An iterative improvement procedure for hierarchical clustering David Kauchak Department of Computer Science University of California, San Diego dkauchak@cs. [sent-1, score-0.416]
2 edu Abstract We describe a procedure which finds a hierarchical clustering by hillclimbing. [sent-5, score-0.373]
3 The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. [sent-6, score-0.741]
4 1 Introduction A hierarchical clustering of n data points is a recursive partitioning of the data into 2, 3, 4, . [sent-8, score-0.397]
5 Each intermediate clustering is made more fine-grained by splitting one of its clusters. [sent-12, score-0.208]
6 It is natural to depict this process as a tree whose leaves are the data points and whose interior nodes represent intermediate clusters. [sent-13, score-0.679]
7 Such hierarchical representations are very popular – they depict a data set at multiple levels of granularity, simultaneously; they require no prior specification of the number of the clusters; and there are several simple heuristics for constructing them [2, 3]. [sent-14, score-0.281]
8 However, to the best of our knowledge, there is so far no procedure which specifically hillclimbs the space of hierarchical clusterings according to a precise objective function. [sent-16, score-0.38]
9 In particular, we seek an analogue of k-means for hierarchical clustering. [sent-18, score-0.241]
10 Taken literally this is possible only to a certain extent – the basic object we are dealing with is a tree rather than a partition – but k-means has closely informed many aspects of our procedure, and has determined our choice of objective function. [sent-19, score-0.151]
11 We use a canonical tree representation of a hierarchical clustering, in which the leaves are data points, and the interior nodes are ordered; such a clustering is specified completely by a tree structure and by an ordering of nodes. [sent-20, score-1.134]
12 Our cost function is a hierarchical extension of the k-means cost function, and is the same cost function which motivates average-linkage schemes. [sent-21, score-0.953]
13 The ordering of nodes is kept fixed, and one subtree is relocated. [sent-23, score-0.565]
14 This is the natural generalization of a standard heuristic clustering move in which a data point is transferred from one cluster to another. [sent-24, score-0.276]
15 The tree structure is kept fixed, and its interior nodes are reordered optimally. [sent-26, score-0.41]
16 We show that by exploiting properties of Euclidean distance (which underlies the k-means cost function and therefore ours as well), these tasks can be performed efficiently. [sent-27, score-0.314]
17 1 The model The space of trees A hierarchical clustering of n points contains n different clusterings, nested within each other. [sent-32, score-0.432]
18 Instead of a dendogram, it is convenient to use a rooted binary tree (shown below on the right) in which the leaves are data points and internal nodes have exactly two children, so there are 2n − 1 nodes overall. [sent-36, score-0.794]
19 These satisfy the property that the split number of a parent is less than that of its children; so the root is numbered 1. [sent-38, score-0.249]
20 The k-clustering is produced by removing the internal nodes numbered 1, 2, 3, . [sent-39, score-0.262]
21 , k − 1; each cluster consists of (the leaves in) one of the resulting connected components. [sent-42, score-0.224]
22 1 • 2-clustering: {a, b, e}, {c, d} 4 2 • 3-clustering: {a, b}, {e}, {c, d} a b e c d • 4-clustering: {a}, {b}, {e}, {c, d} e 3 a c d b Henceforth we will use “node i” to mean “the internal node with split number i”. [sent-43, score-0.26]
23 The maximal subtree rooted at this node is Ti ; the mean of its data points (leaves) is called µi . [sent-44, score-0.543]
24 To summarize, a hierarchical clustering is specified by: a binary tree with the data points at the leaves; and an ordering of the internal nodes. [sent-45, score-0.73]
25 , Sk , then the k-means cost function is k x − µ(Sj ) 2 , cost(Ck ) = j=1 x∈Sj where µ(S) is the mean of set S. [sent-50, score-0.249]
26 To evaluate a hierarchical clustering, we need to combine the costs of all n intermediate clusterings, and we do so in the most obvious way, by a linear combination. [sent-51, score-0.257]
27 We take the overall cost of the hierarchical clustering to be n wk · cost(Ck ), k=1 where the wk are non-negative weights which add up to one. [sent-52, score-1.084]
28 The default choice is to make all wk = 1/n, but in general the specific application will dictate the choice of weights. [sent-53, score-0.208]
29 A decreasing schedule w1 > w2 > w3 > · · · > wn places more emphasis upon coarser clusterings (ie. [sent-54, score-0.248]
30 small k); a setting wk = 1 singles out a particular intermediate clustering. [sent-55, score-0.259]
31 Although many features of our cost function are familiar from the simpler k-means setting, there is one which is worth pointing out. [sent-56, score-0.282]
32 Consider the set of six points shown here: Under the k-means cost function, it is clear what the best 2-clustering is (three points in each cluster). [sent-57, score-0.373]
33 In other words, the imposition of a hierarchical structure forces certain tradeoffs between the intermediate clusterings. [sent-59, score-0.257]
34 This particular feature is fundamental to hierarchical clustering, and in our cost function it is laid bare. [sent-60, score-0.455]
35 By adjusting the weights wk , the user can bias this tradeoff according to his or her particular needs. [sent-61, score-0.292]
36 This means that even when all the weights w k are identical, the smaller values of k contribute more to the cost function, and therefore, a procedure for minimizing this function must implicitly focus a little more on smaller k than on larger k. [sent-63, score-0.345]
37 wk = c · αk , where α < 1 and where c is a normalization constant. [sent-66, score-0.208]
38 Notice that any given subtree Tj can appear as an individual cluster in many of the clusterings Ck . [sent-67, score-0.526]
39 If π(j) denotes the parent of j, then Tj first appears as its own cluster in Cπ(j)+1 , and is part of all the successive clusterings up to and including Cj . [sent-68, score-0.311]
40 3 Relation to previous work The most commonly used heuristics for hierarchical clustering are agglomerative. [sent-71, score-0.373]
41 They work bottom-up, starting with each data point in its own cluster, and then repeatedly merging the two “closest” clusters until finally all the points are grouped together in one cluster. [sent-72, score-0.174]
42 Single linkage – the distance between two clusters S and T is taken to be the distance between their closest pair of points, ie. [sent-75, score-0.479]
43 Complete linkage uses the distance between the farthest pair of points, ie. [sent-78, score-0.354]
44 Average linkage seems to have now become a generic term encompassing at least three different measures of distance between clusters. [sent-81, score-0.354]
45 (a) (Sokal-Michener) µ(S) − µ(T ) 1 (b) |S|·|T | x∈S,y∈T x − y 2 (c) (Ward’s method) |S|·|T | |S|+|T | 2 µ(S) − µ(T ) 2 Average linkage appears to be the most widely used of these; for instance, it is a standard tool for analyzing gene expression data [1]. [sent-82, score-0.358]
46 The three average linkage distance functions are all trying to minimize something very much like our cost function. [sent-83, score-0.688]
47 In particular, Ward’s measure of the distance between two clusters is exactly the increase in k-means cost occasioned by merging those clusters. [sent-84, score-0.402]
48 3 Local moves Each element of the search space is a tree structure in which the data points are leaves and in which the interior nodes are ordered. [sent-86, score-0.664]
49 )2 /2n−1 (consider the sequence of n − 1 merge operations which create the tree from the data set). [sent-88, score-0.18]
50 keep the structure fixed and reorder the internal nodes optimally; 2. [sent-90, score-0.259]
51 keep the ordering of the internal nodes fixed and alter the structure by relocating some subtree. [sent-91, score-0.368]
52 For our first move – reordering internal nodes – we show that a previously-known scheduling algorithm [4] can be adapted to solve this task (in the case of uniform weights) in just O(n log n) time. [sent-95, score-0.512]
53 For the second move, we show that any given subtree can be relocated optimally in O(n) time, using just a single pass through the tree. [sent-96, score-0.394]
54 In particular, we write our cost function in three different, but completely equivalent, ways; and we switch back and forth between these: 1. [sent-98, score-0.273]
55 We define the cost of a subtree Ti to be cost(Ti ) = x∈Ti x − µi 2 (where the sum is over leaf nodes), that is, the cost of the single cluster rooted at point i. [sent-101, score-0.999]
56 Then the overall cost is a linear combination of subtree costs. [sent-102, score-0.586]
57 Specifically, it is n−1 Wπ(j),j · cost(Tj ), (1) j=1 where π(j) is the parent of node j and Wij = wi+1 + wi+2 + · · · + wj . [sent-103, score-0.158]
58 We annotate each tree edge (i, j) (i is the parent of j > i) by µi − µj 2 ; the overall cost is also a linear combination of these edge weights, specifically, Wk · nl · µk − µl 2 , (2) (k,l)∈T where Wk = w1 + w2 + · · · + wk and nl is the number of leaves in subtree Tl . [sent-105, score-1.42]
59 To give a hint for why these alternative formulations of the cost function are true, we briefly mention a simple “bias-variance” decomposition of squared Euclidean distance: Suppose S is a set of points with mean µS . [sent-107, score-0.336]
60 x∈S The graft In a graft move, an entire subtree is moved to a different location, as shown below. [sent-110, score-0.733]
61 denote split numbers of interior nodes; here the subtree Tj is moved. [sent-114, score-0.522]
62 Any two hierarchical clusterings are connected by a sequence of graft operations. [sent-118, score-0.553]
63 Suppose we want to move a subtree T j ; what is the best place for it? [sent-120, score-0.379]
64 Evaluating the cost of a hierarchical clustering takes O(n) time using equation (1) and doing a single, bottom-up pass. [sent-121, score-0.584]
65 The precise change in cost of any given subtree Tl on this path is easy to compute: Claim. [sent-126, score-0.586]
66 If subtree Tj is merged into Tl , then the cost of Tl goes up by nl nj ∆+ = cost(Tl ∪ Tj ) − cost(Tl ) = cost(Tj ) + · µl − µj l nl + n j Claim. [sent-127, score-0.794]
67 If subtree Tj ⊂ Tl is removed from Tl , then the cost of Tl changes by ni nl ∆− = cost(Tl − Tj ) − cost(Tl ) = −cost(Tj ) − · µl − µj l nl − n j 2 . [sent-128, score-0.818]
68 Using (1), the total change in cost from grafting Tj between a, b (as depicted above) can be found by adding terms of the form Wπ(l),l ∆± , for nodes l on the path between j and l a. [sent-130, score-0.408]
69 This suggests a two-pass algorithm for optimally relocating Tj : in the first pass over the tree, for each Tl , the potential cost change from adding/removing Tj is computed. [sent-131, score-0.385]
70 2 Reordering internal nodes Let Vint be the interior nodes of the tree; if there are n data points (leaves), then |Vint | = n − 1. [sent-135, score-0.515]
71 For any x ∈ Vint , let Tx be the maximal subtree rooted at x, which contains all the descendants of x. [sent-136, score-0.419]
72 If x has children y and z, then the goodness of split at x is the reduction in cost obtained by splitting cluster T x , cost(Tx ) − (cost(Ty ) + cost(Tz )), which we henceforth denote g(x) (for leaves g(x) = 0). [sent-138, score-0.71]
73 , n − 1} which – respects the precedence constraints of the tree: if x is the parent of y then σ(x) < σ(y). [sent-147, score-0.193]
74 – minimizes the overall cost of the hierarchical clustering. [sent-148, score-0.481]
75 Assuming uniform weights wk = 1/n, this cost can be seen (by manipulating equation (2)) to be 1 n σ(x)g(x). [sent-149, score-0.515]
76 We would like to schedule the good tasks (with high g(x)) early on; in the language of clustering, if there are particularly useful splits (which lead to well separated clusters), we would like to perform them early in the hierarchy. [sent-152, score-0.183]
77 The naive greedy solution – always pick the node with highest g(x), subject to precedence constraints – doesn’t work. [sent-154, score-0.23]
78 The reason: it is quite possible that a particular split has low g(x)-value, but that it leads to other splits of very high value. [sent-155, score-0.151]
79 A greedy algorithm would schedule this split very late; an algorithm with some “lookahead” capability would realize the value of this split and schedule it early. [sent-156, score-0.421]
80 Horn[4] has a scheduling algorithm which obtains the optimal ordering, in the case where all the weights wk are equal, and can be implemented in O(n log n) time. [sent-157, score-0.353]
81 wk = c · αk , where α < 1 and c is some normalization constant. [sent-159, score-0.208]
82 For each node x ∈ V , define r(x) to be 1 the maximum, over all subtrees T (not necessarily maximal) rooted at x, of |T | z∈T g(z) (in words, the average of g(·) over nodes of T ). [sent-161, score-0.419]
83 Once these r(x) are known, the optimal numbering is easy to find: pick nodes in decreasing order of r(·) while respecting the precedence constraints. [sent-163, score-0.343]
84 4 Experiments In the experiments, we used uniform weights wk = 1/n. [sent-174, score-0.266]
85 In each iteration of our procedure, we did a reordering of the nodes, and performed one graft – by trying each possible subtree (all O(n) of them), determining the optimal move for that subtree, and greedily picking the best move. [sent-175, score-0.763]
86 We would prefer a more efficient, randomized way to pick which subtree to graft – either completely randomly, or biased by a simple criterion like “amount it deviates from the center of its parent cluster”; this is future work. [sent-176, score-0.71]
87 The initial tree (b) is random and has a cost of 62. [sent-179, score-0.4]
88 A reordering (d), swapping 2 and 3, reduces the cost to 25. [sent-182, score-0.386]
89 5, and a further graft (e) and reordering (f) result in the final tree, which is optimal and has cost 21. [sent-183, score-0.597]
90 The initial greedy merger of points b, c gives a small early benefit but later turns out to be a bad idea; yet the resulting tree is only one graft away from being optimal. [sent-185, score-0.518]
91 Really bad cases for average linkage can be constructed by recursively compounding this simple instance. [sent-186, score-0.393]
92 Average linkage is often used in the analysis of gene expression data. [sent-188, score-0.358]
93 20 5600 18 5500 5400 14 12 5300 cost % improvement over average linkage 16 10 5200 8 6 5100 4 5000 2 0 4900 0 50 100 150 200 250 k 300 350 400 450 500 0 10 20 30 40 50 iterations 60 70 80 Figure 4: (a) On the left, a comparison with average linkage. [sent-189, score-0.703]
94 (b) On the right, the behavior of the cost function over the 80 iterations required for convergence. [sent-190, score-0.249]
95 We randomly chose clean subsets (no missing entries) of varying sizes from this data set, and tried the following on it: average linkage, our method initialized randomly, and our method initialized with average linkage. [sent-192, score-0.265]
96 First of all: our method, whether initialized randomly or with average linkage, systematically did better than average linkage, not only for the particular aggregate cost function we are using, but across the whole spectrum of values of k. [sent-194, score-0.451]
97 Figure 4(a), obtained on a 500-point data set, shows for each k, the percent by which the (induced) k-clustering found in our method (initialized with average linkage) improved upon that found by average linkage; the metric here is the k-means cost function. [sent-195, score-0.347]
98 This experiment also indicates that, in general, the output of average linkage has real scope for improvement. [sent-199, score-0.362]
99 Second, our method often took an order of magnitude (ten or more times) longer to converge if initialized randomly than if initialized with average linkage, even though better solutions were often found with random initialization. [sent-200, score-0.183]
100 Single-machine job sequencing with treelike precedence ordering and linear delay penalties. [sent-221, score-0.227]
wordName wordTfidf (topN-words)
[('linkage', 0.313), ('subtree', 0.311), ('tj', 0.279), ('cost', 0.249), ('tl', 0.233), ('graft', 0.211), ('wk', 0.208), ('hierarchical', 0.206), ('vint', 0.158), ('tree', 0.151), ('leaves', 0.145), ('reordering', 0.137), ('clusterings', 0.136), ('nodes', 0.133), ('makequeue', 0.132), ('clustering', 0.129), ('nl', 0.117), ('split', 0.111), ('ck', 0.104), ('interior', 0.1), ('precedence', 0.097), ('parent', 0.096), ('ordering', 0.095), ('subtrees', 0.092), ('internal', 0.087), ('scheduling', 0.087), ('clusters', 0.084), ('schedule', 0.083), ('rooted', 0.083), ('deletemax', 0.079), ('cluster', 0.079), ('moves', 0.073), ('move', 0.068), ('initialized', 0.067), ('ward', 0.063), ('points', 0.062), ('node', 0.062), ('weights', 0.058), ('children', 0.056), ('dendogram', 0.053), ('grafts', 0.053), ('kauchak', 0.053), ('relocating', 0.053), ('intermediate', 0.051), ('euclidean', 0.05), ('pass', 0.049), ('average', 0.049), ('ti', 0.048), ('dasgupta', 0.046), ('numbering', 0.046), ('gene', 0.045), ('improvement', 0.043), ('numbered', 0.042), ('henceforth', 0.042), ('distance', 0.041), ('splits', 0.04), ('priority', 0.039), ('reorder', 0.039), ('procedure', 0.038), ('heuristics', 0.038), ('pick', 0.038), ('aggregate', 0.037), ('depict', 0.037), ('trying', 0.036), ('queue', 0.035), ('analogue', 0.035), ('job', 0.035), ('nested', 0.035), ('tx', 0.035), ('list', 0.034), ('optimally', 0.034), ('greedy', 0.033), ('tried', 0.033), ('pointing', 0.033), ('diego', 0.033), ('notice', 0.032), ('horn', 0.032), ('bad', 0.031), ('prefer', 0.03), ('early', 0.03), ('decreasing', 0.029), ('operations', 0.029), ('linked', 0.028), ('merging', 0.028), ('leaf', 0.028), ('splitting', 0.028), ('return', 0.028), ('tradeoff', 0.026), ('sj', 0.026), ('path', 0.026), ('overall', 0.026), ('ef', 0.026), ('kept', 0.026), ('maximal', 0.025), ('squared', 0.025), ('removed', 0.024), ('union', 0.024), ('wi', 0.024), ('exploiting', 0.024), ('completely', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
2 0.16915534 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
3 0.11215192 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown it works well in moderately high dimensions. Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that using a simple, inexpensive linear projection preserves many of the properties of data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data. Our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation and they require k to be specified. Thus we expect G-means will be useful in combination with spectral clustering. References [1] Horst Bischof, Aleˇ Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern analysis and applications, 2:59–72, s 1999. [2] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html. [3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers. [4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997. [5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002. [6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000. [7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985. [8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. [11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learing Theory (COLT), pages 21–30, 1995. [12] Yann LeCun, L´ on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the e IEEE, 86(11):2278–2324, 1998. [13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Neural Information Processing Systems, 14, 2002. [14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. [15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001. [16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. American Statistical Association, 69(347):730–737, September 1974.
4 0.11207295 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
5 0.10813574 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
6 0.095288724 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
7 0.094918169 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
8 0.093589701 189 nips-2003-Tree-structured Approximations by Expectation Propagation
9 0.090950377 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
10 0.086323291 73 nips-2003-Feature Selection in Clustering Problems
11 0.086095937 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
12 0.082387693 87 nips-2003-Identifying Structure across Pre-partitioned Data
13 0.080633096 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
14 0.078919046 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
15 0.073696323 99 nips-2003-Kernels for Structured Natural Language Data
16 0.071904644 172 nips-2003-Semi-Supervised Learning with Trees
17 0.063983418 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
18 0.062600642 152 nips-2003-Pairwise Clustering and Graphical Models
19 0.062570602 169 nips-2003-Sample Propagation
20 0.059344795 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
topicId topicWeight
[(0, -0.188), (1, -0.059), (2, -0.063), (3, 0.143), (4, -0.07), (5, 0.021), (6, -0.122), (7, 0.155), (8, -0.03), (9, 0.134), (10, 0.015), (11, -0.051), (12, -0.035), (13, -0.041), (14, 0.028), (15, 0.032), (16, -0.051), (17, 0.003), (18, 0.106), (19, -0.039), (20, -0.002), (21, -0.098), (22, -0.044), (23, 0.072), (24, 0.064), (25, 0.059), (26, 0.031), (27, 0.008), (28, 0.058), (29, -0.011), (30, -0.025), (31, 0.037), (32, 0.185), (33, 0.077), (34, -0.22), (35, -0.147), (36, -0.082), (37, 0.037), (38, -0.056), (39, -0.019), (40, -0.033), (41, -0.047), (42, 0.053), (43, 0.034), (44, 0.087), (45, -0.105), (46, 0.074), (47, -0.032), (48, -0.021), (49, -0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.96418893 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
2 0.58349389 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 106 dimensions and 105 records, and show non-trivial speedups while giving exact answers. 1
3 0.58324492 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
4 0.57597172 46 nips-2003-Clustering with the Connectivity Kernel
Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann
Abstract: Clustering aims at extracting hidden structure in dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally N Phard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1
5 0.54269964 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
6 0.53344107 111 nips-2003-Learning the k in k-means
7 0.47312367 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
8 0.466952 73 nips-2003-Feature Selection in Clustering Problems
9 0.42379427 107 nips-2003-Learning Spectral Clustering
10 0.40859762 99 nips-2003-Kernels for Structured Natural Language Data
11 0.38857245 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
12 0.38118631 87 nips-2003-Identifying Structure across Pre-partitioned Data
13 0.36513391 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
14 0.35739416 189 nips-2003-Tree-structured Approximations by Expectation Propagation
15 0.3535718 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
16 0.34337044 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
17 0.32141227 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
18 0.30730471 152 nips-2003-Pairwise Clustering and Graphical Models
19 0.30267578 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems
20 0.29798123 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
topicId topicWeight
[(0, 0.059), (11, 0.02), (29, 0.026), (30, 0.02), (35, 0.056), (53, 0.117), (69, 0.021), (71, 0.057), (76, 0.046), (84, 0.279), (85, 0.09), (91, 0.121), (99, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.83339304 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
2 0.76141423 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
3 0.6140697 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
4 0.61090946 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
5 0.60690993 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.60466897 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
7 0.60297215 107 nips-2003-Learning Spectral Clustering
8 0.59830171 126 nips-2003-Measure Based Regularization
9 0.5981335 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
10 0.59724778 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
11 0.59720212 143 nips-2003-On the Dynamics of Boosting
12 0.59624708 81 nips-2003-Geometric Analysis of Constrained Curves
13 0.59432781 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
14 0.59404558 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
15 0.59384131 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
16 0.59371966 30 nips-2003-Approximability of Probability Distributions
17 0.59365523 68 nips-2003-Eye Movements for Reward Maximization
18 0.59363574 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
19 0.59349537 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
20 0.59223354 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data