nips nips2008 nips2008-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. [sent-5, score-0.992]
2 The algorithms work by maximizing the dependence between the taxonomy and the original data. [sent-6, score-0.446]
3 1 Introduction We address the problem of finding taxonomies in data: that is, to cluster the data, and to specify in a systematic way how the clusters relate. [sent-9, score-0.347]
4 One of the simpler methods that has been used extensively is agglomerative clustering [18]. [sent-11, score-0.258]
5 One specifies a distance metric and a linkage function that encodes the cost of merging two clusters, and the algorithm greedily agglomerates clusters, forming a hierarchy until at last the final two clusters are merged into the tree root. [sent-12, score-0.362]
6 A related alternative approach is divisive clustering, in which clusters are split at each level, beginning with a single partition containing all of the data. [sent-13, score-0.238]
7 More recently, hierarchical topic models [7, 23] have been proposed to model the hierarchical cluster structure of data. [sent-17, score-0.286]
8 On the other hand, many kinds of data can be easily compared using a kernel function, which encodes the measure of similarity between objects based on their features. [sent-20, score-0.122]
9 Spectral clustering algorithms represent one important subset of clustering techniques based on kernels [24, 21]: the spectrum of an appropriately normalized similarity matrix is used as a relaxed solution to a partition problem. [sent-21, score-0.792]
10 Spectral techniques have the advantage of capturing global cluster structure of the data, but generally do not give a global solution to the problem of discovering taxonomic structure. [sent-22, score-0.256]
11 In the present work, we propose a novel unsupervised clustering algorithm, numerical taxonomy clustering, which both clusters the data and learns a taxonomy relating the clusters. [sent-23, score-1.183]
12 Our method works by maximizing a kernel measure of dependence between the observed data, and a product of the partition matrix that defines the clusters with a structure matrix that defines the relationship between individual clusters. [sent-24, score-0.65]
13 Aside from its simplicity and computational efficiency, our method has two important advantages over previous clustering approaches. [sent-26, score-0.258]
14 First, it represents a more informative visualization of the data than simple clustering, since the relationship between the clusters is also represented. [sent-27, score-0.237]
15 Second, we find the clustering performance is improved over methods that do not take cluster structure into account, and over methods that impose a cluster distance structure rather than learning it. [sent-28, score-0.703]
16 Several objectives that have been used for clustering are related to the objective employed here. [sent-29, score-0.322]
17 Bach and Jordan [3] proposed a modified spectral clustering objective that they then maximize either with respect to the kernel parameters or the data partition. [sent-30, score-0.529]
18 The authors of [10] proposed a normalized inner product between a kernel matrix and a matrix constructed from the labels, which can be used to learn kernel parameters. [sent-32, score-0.353]
19 The objective we use here is also a normalized inner product between a similarity matrix and a matrix constructed from the partition, but importantly, we include a structure matrix that represents the relationship between clusters. [sent-33, score-0.485]
20 Our approach is most closely related to that of [22], who used an objective that includes a fixed structure matrix and is based on the Hilbert-Schmidt Independence Criterion. [sent-35, score-0.257]
21 Their objective is not normalized, however, and they do not maximize with respect to the structure matrix. [sent-36, score-0.121]
22 In Section 2, we introduce a family of dependence measures with which one can interpret the objective function of the clustering approach. [sent-38, score-0.431]
23 The dependence maximization objective is presented in Section 3, and its relation to classical spectral clustering algorithms is explained in Section 3. [sent-39, score-0.701]
24 Important results for the optimization of the objective are presented in Section 3. [sent-41, score-0.088]
25 The problem of numerical taxonomy and its relation to the proposed objective function is presented in Section 4, as well as the numerical taxonomy clustering algorithm. [sent-44, score-1.2]
26 2 Hilbert-Schmidt Independence Criterion In this section, we give a brief introduction to the Hilbert-Schmidt Independence Criterion (HSIC), which is a measure of the strength of dependence between two variables (in our case, following [22], these are the data before and after clustering). [sent-46, score-0.109]
27 Let F be a reproducing kernel Hilbert space of functions from X to R, where X is a separable metric space (our input domain). [sent-48, score-0.113]
28 We also define a second RKHS G with respect to the separable metric space Y, with feature map ψ and kernel ⟨ψ(y), ψ(y′)⟩G = l(y, y′). [sent-50, score-0.113]
29 A measure of dependence is then the Hilbert-Schmidt norm of this operator (the sum of the squared singular values), ||Cxy||^2_HS. [sent-53, score-0.109]
30 HSIC := Tr [Hn KHn L], where Hn = I − (1/n)11T is the centering matrix. 3 Dependence Maximization We now specify how the dependence criteria introduced in the previous section can be used in clustering. [sent-57, score-0.109]
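As a concrete reference for the HSIC expression above, the following is a minimal numpy sketch of the empirical statistic Tr[Hn K Hn L]; the usual 1/(n−1)^2 scaling is omitted since it does not change which clustering maximizes the criterion, and the function name is ours, not the paper's.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC between two Gram matrices: Tr[H K H L] (scaling constant omitted)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H_n
    return np.trace(H @ K @ H @ L)
```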
31 We represent our data via an n × n Gram matrix M ⪰ 0: in the simplest case, this is the centered kernel matrix (M = Hn KHn), but we also consider a Gram matrix corresponding to normalized cuts clustering (see Section 3). [sent-58, score-0.565]
32 Following [22], we define our output Gram matrix to be L = ΠY ΠT, where Π is an n × k partition matrix, k is the number of clusters, and Y is a positive definite matrix that encodes the relationship between clusters. [sent-60, score-0.443]
33 Our clustering quality is measured according to Tr [M Hn ΠY ΠT Hn] / sqrt(Tr [ΠY ΠT Hn ΠY ΠT Hn]). (1) [sent-63, score-0.258]
34 This criterion is very similar to the criterion introduced for use in kernel target alignment [10], the difference being the addition of centering matrices, Hn , as required by definition of the covariance. [sent-65, score-0.164]
35 We remark that the normalizing term ||Hn ΠY ΠT Hn||HS was not needed in the structured clustering objective of [22]. [sent-66, score-0.34]
36 The authors of [22] were interested only in solving for the partition matrix, Π, whereas we also wish to solve for Y: without normalization, the objective can always be improved by scaling Y arbitrarily. [sent-68, score-0.204]
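To make the normalized objective of Equation (1) concrete, here is a small sketch that evaluates it for a given Gram matrix M, partition matrix Π, and structure matrix Y; the function name and interface are illustrative assumptions, not from the paper.

```python
import numpy as np

def clustering_quality(M, Pi, Y):
    """Equation (1): Tr[M H Pi Y Pi^T H] / ||H Pi Y Pi^T H||_HS."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    L = H @ Pi @ Y @ Pi.T @ H            # centered output Gram matrix
    return np.trace(M @ L) / np.linalg.norm(L, 'fro')
```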
37 In the remainder of this section, we address the maximization of Equation (1) under various simplifying assumptions: these results will then be used in our main algorithm in Section 4. [sent-69, score-0.125]
38 In order to more efficiently solve this difficult combinatorial problem, we make use of a spectral relaxation. [sent-72, score-0.2]
39 If we choose M = D^{-1/2} A D^{-1/2}, where A is a similarity matrix and D is the diagonal matrix such that Dii = Σj Aij, we can recover a centered version of the spectral clustering of [21]. [sent-78, score-0.508]
40 In fact, we wish to ignore the eigenvector with constant entries [24], so the centering matrix Hn does not alter the clustering solution. [sent-79, score-0.376]
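For reference, a standard normalized-cuts style relaxation along the lines of [21, 24] can be sketched as follows; this is a generic spectral clustering recipe under the stated choice of M (using scipy and scikit-learn for brevity), not the paper's exact Equation (4) relaxation.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_partition(A, k):
    """Spectral relaxation with M = D^{-1/2} A D^{-1/2}: embed with the top-k
    eigenvectors, then round to a partition with k-means."""
    d = A.sum(axis=1)
    M = A / np.sqrt(np.outer(d, d))                 # elementwise D^{-1/2} A D^{-1/2}
    _, vecs = eigsh(M, k=k, which='LA')             # leading eigenvectors
    V = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(V)
    Pi = np.eye(k)[labels]                          # n x k partition matrix
    return Pi, labels
```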
41 3.2 Solving for the Optimal Y ⪰ 0 Given Π We now address the subproblem of solving for the optimal structure matrix, Y, subject only to positive semi-definiteness, for any Π. [sent-81, score-0.116]
42 We note that the maximization of Equation (1) is equivalent to the constrained optimization problem maxY Tr [M Hn ΠY ΠT Hn], s.t. ||Hn ΠY ΠT Hn||HS = 1. [sent-82, score-0.177]
43 Because (ΠT Hn Π)† ΠT Hn = Hk (ΠT Π)−1 ΠT Hn (see [6, 20]), we note that Equation (8) computes a normalized set kernel between the elements in each cluster. [sent-86, score-0.091]
44 Up to a constant normalization factor, Y∗ is equivalent to Hk Ỹ∗ Hk, where Ỹ∗ij = (1/(Ni Nj)) Σι∈Ci Σκ∈Cj M̃ικ, (9) Ni is the number of elements in cluster i, Ci is the set of indices of samples assigned to cluster i, and M̃ = Hn M Hn. [sent-87, score-0.321]
45 Therefore, for the problem of simultaneously optimizing the structure matrix, Y ⪰ 0, and the partition matrix, one can use the same spectral relaxation as in Equation (4), and use the resulting partition matrix to solve for the optimal assignment for Y using Equation (8). [sent-93, score-0.52]
46 This indicates that the optimal partition of the data is the same for Y given by Equation (8) and for Y = I. [sent-94, score-0.098]
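A direct way to read Equation (9) is as block averaging of the centered Gram matrix over pairs of clusters; a small sketch is given below (the helper name is ours, and the Hk centering and constant factor mentioned in the text are omitted since they do not affect the induced partition).

```python
import numpy as np

def optimal_Y(M, labels, k):
    """Equation (9): Y~*_ij is the mean of M~ = H M H over samples in clusters i and j."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    M_tilde = H @ M @ H
    Y = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            Y[i, j] = M_tilde[np.ix_(labels == i, labels == j)].mean()
    return Y
```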
47 4 Numerical Taxonomy In this section, we consolidate the results developed in Section 3 and introduce the numerical taxonomy clustering algorithm. [sent-96, score-0.687]
48 The algorithm allows us to simultaneously cluster data and learn a tree structure that relates the clusters. [sent-97, score-0.315]
49 The tree structure imposes constraints on the solution, which in turn affect the data partition selected by the clustering algorithm. [sent-98, score-0.544]
50 In the interests of interpretability, as well as the ability to influence clustering solutions by prior knowledge, we wish to explore the problem where additional constraints are imposed on the structure of Y . [sent-101, score-0.367]
51 In particular, we consider the case that Y is constrained to be generated by a tree metric. [sent-102, score-0.126]
52 By this, we mean that the distance between any two clusters is consistent with the path length along some fixed tree whose leaves are identified with the clusters. [sent-103, score-0.293]
53 For any positive semi-definite matrix Y , we can compute the distance matrix, D, given by the norm implied by the inner product that computes Y , by assigning Dij = Yii + Yjj − 2Yij . [sent-104, score-0.152]
54 It is sufficient, then, to reformulate the optimization problem given in Equation (1) to add the following constraints that characterize distances generated by a tree metric Dab + Dcd ≤ max (Dac + Dbd , Dad + Dbc ) ∀a, b, c, d, (11) where D is the distance matrix generated from Y . [sent-105, score-0.338]
55 The constraints in Equation (11) are known as the 4-point condition, and were proven in [8] to be necessary and sufficient for D to be a tree metric. [sent-106, score-0.127]
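The 4-point condition can be checked directly for the distance matrix induced by Y; the sketch below uses the equivalent formulation that, for every quadruple, the two largest of the three pairwise sums must coincide (for small numbers of clusters the exhaustive check is cheap).

```python
import numpy as np
from itertools import combinations

def satisfies_four_point_condition(Y, tol=1e-9):
    """Check Equation (11) for D_ij = Y_ii + Y_jj - 2*Y_ij."""
    diag = np.diag(Y)
    D = diag[:, None] + diag[None, :] - 2 * Y
    for a, b, c, d in combinations(range(D.shape[0]), 4):
        sums = sorted((D[a, b] + D[c, d], D[a, c] + D[b, d], D[a, d] + D[b, c]))
        if sums[2] - sums[1] > tol:   # the largest sum must be attained at least twice
            return False
    return True
```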
56 Optimization problems incorporating these constraints are combinatorial and generally difficult to solve. [sent-108, score-0.063]
57 The problem of numerical taxonomy, or fitting additive trees, is as follows: given a fixed distance matrix, D, that fulfills metric constraints, find the solution to minDT ||D − DT|| (12) over tree metrics DT, with respect to some norm (e.g., an Lp norm). [sent-109, score-0.186]
58 While numerical taxonomy is in general NP hard, a great variety of approximation algorithms with feasible computational complexity have been developed [1, 2, 11, 15]. [sent-112, score-0.429]
59 Given a distance matrix that satisfies the 4-point condition, the associated unrooted tree that generated the matrix can be found in O(k^2) time, where k is equal to the number of clusters [25]. [sent-113, score-0.437]
60 We propose the following iterative algorithm to incorporate the 4-point condition into the optimization of Equation (1): Require: M ⪰ 0; Ensure: (Π, Y) ≈ (Π∗, Y∗) that solve Equation (1) with the constraints given in Equation (11); Initialize Y = I; Initialize Π using the relaxation in Section 3. [sent-114, score-0.08]
61 By iterating the procedure, we can allow Π to reflect the fact that it should best fit the current estimate of the tree metric. [sent-116, score-0.093]
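The printed pseudocode above is truncated in this extraction; the sketch below reconstructs one plausible reading of the alternating loop from the surrounding text (solve for Y, fit a tree metric to the induced cluster distances, re-fit Π). The sub-solvers `refit_partition` and `fit_tree_metric` are caller-supplied placeholders, and the conversion from the fitted tree metric back to a Gram-type matrix is our assumption, not the authors' stated procedure.

```python
import numpy as np

def numerical_taxonomy_clustering(M, k, refit_partition, fit_tree_metric, n_iter=5):
    """Hedged sketch of the alternating scheme described in the text.
    refit_partition(M, Y) -> cluster labels maximizing Equation (1) for fixed Y;
    fit_tree_metric(D)    -> nearest tree-metric distance matrix (any
                             numerical-taxonomy approximation)."""
    n = M.shape[0]
    H_n = np.eye(n) - np.ones((n, n)) / n
    M_t = H_n @ M @ H_n
    Y = np.eye(k)                                   # Initialize Y = I
    labels = refit_partition(M, Y)                  # Initialize Pi via the relaxation
    for _ in range(n_iter):
        # Equation (9): block means of the centered Gram matrix
        Y = np.array([[M_t[np.ix_(labels == i, labels == j)].mean()
                       for j in range(k)] for i in range(k)])
        d = np.diag(Y)
        D = d[:, None] + d[None, :] - 2 * Y         # induced cluster distances D_ij
        D_T = fit_tree_metric(D)                    # enforce the 4-point condition (Eq. 11)
        H_k = np.eye(k) - np.ones((k, k)) / k
        Y = -0.5 * H_k @ D_T @ H_k                  # Gram matrix consistent with the tree metric (assumption)
        labels = refit_partition(M, Y)              # let Pi track the current tree-metric estimate
    return labels, Y
```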
62 5 Experimental Results To illustrate the effectiveness of the proposed algorithm, we have performed clustering on two benchmark datasets. [sent-117, score-0.258]
63 The face dataset presented in [22] consists of 185 images of three different people, each with three different facial expressions. [sent-118, score-0.132]
64 The authors posited that this would be best represented by a ternary tree structure, where the first level would decide which subject was represented, and the second level would be based on facial expression. [sent-119, score-0.162]
65 In fact, their clustering algorithm roughly partitioned the data in this way when the appropriate structure matrix was imposed. [sent-120, score-0.406]
66 We will show that our algorithm is able to find a similar structure without supervision, which better represents the empirical structure of the data. [sent-121, score-0.114]
67 A Gaussian kernel was used with the normalization parameter set to the median squared distance between points in input space. [sent-123, score-0.13]
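For reproducibility, the kernel choice described here corresponds to the usual median heuristic; a sketch follows, with the caveat that the exact constant convention used in the paper may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernel_median(X):
    """Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / sigma),
    with sigma set to the median squared pairwise distance."""
    sq_dists = squareform(pdist(X, 'sqeuclidean'))
    sigma = np.median(sq_dists[sq_dists > 0])   # exclude the zero diagonal
    return np.exp(-sq_dists / sigma)
```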
68 5.1 Performance Evaluation on the Face Dataset We first describe a numerical comparison on the face dataset [22] of the approach presented in Section 4 (where M = Hn KHn is assigned as in an HSIC objective). [sent-125, score-0.174]
69 We considered two alternative approaches: a classic spectral clustering algorithm [21], and the dependence maximization approach of Song et al. [sent-126, score-0.617]
70 Because the approach in [22] is not able to learn the structure of Y from the data, we have optimized the partition matrix for 8 different plausible hierarchical structures (Figure 1). [sent-128, score-0.25]
71 These have been constructed by truncating n-ary trees to the appropriate number of leaf nodes. [sent-129, score-0.071]
72 For the evaluation, we have made use of the fact that the desired partition of the data is known for the face dataset, which allows us to compare the predicted clusters to the ground truth labels. [sent-130, score-0.279]
73 For each predicted partition matrix, we compute the conditional entropy of the true labels, l, given the cluster ids, c, H(l|c), which is related to mutual information by I(l; c) = H(l) − H(l|c). [sent-135, score-0.242]
74 As H(l) is fixed for a given dataset, argmaxc I(l; c) = argminc H(l|c), and H(l|c) ≥ 0 with equality only in the case that the clusters are pure [9]. [sent-136, score-0.159]
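The evaluation measure H(l|c) can be computed directly from the label and cluster assignments; a small sketch (helper name is ours):

```python
import numpy as np

def conditional_entropy(labels, clusters):
    """H(l | c): conditional entropy of ground-truth labels given cluster ids (in bits)."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    H = 0.0
    for c in np.unique(clusters):
        mask = clusters == c
        p_c = mask.mean()                                   # p(c)
        _, counts = np.unique(labels[mask], return_counts=True)
        p_lc = counts / counts.sum()                        # p(l | c)
        H -= p_c * np.sum(p_lc * np.log2(p_lc))
    return H
```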
75 Our results show we have indeed recovered an appropriate tree structure without having to pre-specify the cluster similarity relations. [sent-139, score-0.324]
76 The clusters are identified with leaf nodes, and distances between the clusters are given by the minimum path length from one leaf to another. [sent-141, score-0.401]
77 Table 1: Conditional entropy scores for spectral clustering [21], the clustering algorithm of [22], and the method presented here (last column). [sent-153, score-0.683]
78 The structure for spectral clustering is implicitly equivalent to that in Figure 1(h), as is apparent from the analysis in Section 3. [sent-155, score-0.464]
79 5.2 NIPS Paper Dataset For the NIPS dataset, we partitioned the documents into k = 8 clusters using the numerical taxonomy clustering algorithm. [sent-159, score-0.897]
80 To allow us to verify the clustering performance, we labeled each cluster using twenty informative words, as listed in Table 2. [sent-161, score-0.424]
81 We observe that not only are the clusters themselves well defined, but also that similar clusters lie close together on the learned tree (see Figure 3 and Table 2). [sent-163, score-0.159]
82 Only cluster c appears to be indistinct, and shows no clear theme. [sent-165, score-0.145]
83 Given its placement, we anticipate that it would merge with the remaining clusters for smaller k. [sent-166, score-0.159]
84 6 Conclusions and Future Work We have introduced a new algorithm, numerical taxonomy clustering, for simultaneously clustering data and discovering a taxonomy that relates the clusters. [sent-167, score-1.063]
85 The algorithm is based on a dependence maximization approach, with the Hilbert-Schmidt Independence Criterion as our measure of dependence. [sent-168, score-0.518]
86 Figure 2: The face dataset and the resulting taxonomy discovered by the algorithm. [sent-169, score-0.19]
87 Figure 3: The taxonomy discovered for the NIPS dataset; words that represent the clusters are given in Table 2. [sent-171, score-0.101]
88 We have shown several interesting theoretical results regarding dependence maximization clustering. [sent-172, score-0.21]
89 First, we established the relationship between dependence maximization and spectral clustering. [sent-173, score-0.385]
90 Second, we showed the optimal positive definite structure matrix takes the form of a set kernel, where sets are defined by cluster membership. [sent-174, score-0.293]
91 This result applied to the original dependence maximization objective indicates that the inclusion of an unconstrained structure matrix does not affect the optimal partition matrix. [sent-175, score-0.524]
92 Numerical taxonomy clustering allows us to optimize the constrained problem efficiently. [sent-177, score-0.628]
93 In our experiments on grouping facial expressions, numerical taxonomy clustering is more accurate than the existing approaches of spectral clustering and clustering with a fixed predefined structure. [sent-178, score-1.423]
94 We were also able to fit a taxonomy to NIPS papers that resulted in a reasonable and interpretable clustering by subject matter. [sent-179, score-0.656]
95 In both the facial expression and NIPS datasets, similar clusters are close together on the resulting tree. [sent-180, score-0.209]
96 We conclude that numerical taxonomy clustering is a useful tool both for improving the accuracy of clusterings and for the visualization of complex data. [sent-181, score-0.718]
97 Likewise, automatic selection of the number of clusters is an interesting area of future work. [sent-185, score-0.159]
98 We cannot expect to use the criterion in Equation (1) to select the number of clusters because increasing the size of Π and Y can never decrease the objective. [sent-186, score-0.198]
99 Another interesting line of work is to consider optimizing a clustering objective derived from the Hilbert-Schmidt Normalized Independence Criterion (HSNIC) [13]. [sent-188, score-0.345]
100 On the approximability of numerical taxonomy (fitting distances by tree metrics). [sent-198, score-0.543]
wordName wordTfidf (topN-words)
[('hn', 0.724), ('taxonomy', 0.337), ('clustering', 0.258), ('clusters', 0.159), ('spectral', 0.149), ('cluster', 0.145), ('dependence', 0.109), ('maximization', 0.101), ('tree', 0.093), ('numerical', 0.092), ('cxy', 0.087), ('hs', 0.081), ('partition', 0.079), ('tr', 0.078), ('matrix', 0.072), ('hsic', 0.071), ('objective', 0.064), ('equation', 0.062), ('kernel', 0.058), ('structure', 0.057), ('gram', 0.055), ('khn', 0.052), ('facial', 0.05), ('gretton', 0.05), ('hk', 0.05), ('nips', 0.05), ('dt', 0.048), ('taxonomies', 0.043), ('hierarchical', 0.042), ('papers', 0.042), ('face', 0.041), ('dataset', 0.041), ('distance', 0.041), ('blaschko', 0.04), ('farach', 0.04), ('yii', 0.04), ('yjj', 0.04), ('song', 0.039), ('criterion', 0.039), ('relaxed', 0.036), ('encodes', 0.035), ('prx', 0.035), ('taxonomic', 0.035), ('independence', 0.035), ('distant', 0.035), ('metric', 0.034), ('constraints', 0.034), ('normalized', 0.033), ('constrained', 0.033), ('documents', 0.032), ('discovered', 0.031), ('visualization', 0.031), ('normalization', 0.031), ('leaf', 0.031), ('dij', 0.03), ('similarity', 0.029), ('combinatorial', 0.029), ('centering', 0.028), ('kannan', 0.028), ('ist', 0.027), ('kernels', 0.027), ('relationship', 0.026), ('bach', 0.026), ('fukumizu', 0.026), ('np', 0.025), ('jordan', 0.025), ('simplifying', 0.024), ('optimization', 0.024), ('affect', 0.023), ('evolutionary', 0.023), ('optimizing', 0.023), ('solve', 0.022), ('planck', 0.021), ('inner', 0.021), ('informative', 0.021), ('solving', 0.021), ('distances', 0.021), ('grouping', 0.021), ('separable', 0.021), ('word', 0.021), ('constructed', 0.021), ('simultaneously', 0.02), ('relation', 0.02), ('discovering', 0.019), ('max', 0.019), ('additive', 0.019), ('optimal', 0.019), ('trees', 0.019), ('partitioned', 0.019), ('ni', 0.019), ('subject', 0.019), ('entropy', 0.018), ('normalizing', 0.018), ('expressions', 0.018), ('smith', 0.018), ('wish', 0.018), ('pages', 0.018), ('metrics', 0.018), ('smola', 0.018), ('product', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
2 0.14992157 218 nips-2008-Spectral Clustering with Perturbed Data
Author: Ling Huang, Donghui Yan, Nina Taft, Michael I. Jordan
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance. 1
3 0.13856837 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
Author: Kilian Q. Weinberger, Olivier Chapelle
Abstract: Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond “flat” classification through incorporation of class hierarchies [4]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization. 1
4 0.13156213 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
Author: Shai Ben-David, Margareta Ackerman
Abstract: Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kleinberg, ([1]) that showed an impossibility result for such axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in [1]. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the object to be axiomatized. We show that principles like those formulated in Kleinberg’s axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure (CQM) is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how strong or conclusive the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (axioms) for such measures. Our axioms capture the principles expressed by Kleinberg’s axioms while retaining consistency. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we analyze the computational complexity of evaluating the quality of a given clustering and show that, for the proposed CQMs, it can be computed in polynomial time. 1
5 0.13030499 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
Author: Silvia Chiappa, Jens Kober, Jan R. Peters
Abstract: Motor primitives or motion templates have become an important concept for both modeling human motor control as well as generating robot behaviors using imitation learning. Recent impressive results range from humanoid robot movement generation to timing models of human motions. The automatic generation of skill libraries containing multiple motion templates is an important step in robot learning. Such a skill learning system needs to cluster similar movements together and represent each resulting motion template as a generative model which is subsequently used for the execution of the behavior by a robot system. In this paper, we show how human trajectories captured as multi-dimensional time-series can be clustered using Bayesian mixtures of linear Gaussian state-space models based on the similarity of their dynamics. The appropriate number of templates is automatically determined by enforcing a parsimonious parametrization. As the resulting model is intractable, we introduce a novel approximation method based on variational Bayes, which is especially designed to enable the use of efficient inference algorithms. On recorded human Balero movements, this method is not only capable of finding reasonable motion templates but also yields a generative model which works well in the execution of this complex task on a simulated anthropomorphic SARCOS arm.
6 0.12417263 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
7 0.12098478 48 nips-2008-Clustering via LP-based Stabilities
8 0.11730103 113 nips-2008-Kernelized Sorting
9 0.11622795 112 nips-2008-Kernel Measures of Independence for non-iid Data
10 0.11565187 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
11 0.10320012 170 nips-2008-Online Optimization in X-Armed Bandits
12 0.099414743 4 nips-2008-A Scalable Hierarchical Distributed Language Model
13 0.095442623 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
14 0.093906589 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
15 0.092584074 178 nips-2008-Performance analysis for L\ 2 kernel classification
16 0.086458057 193 nips-2008-Regularized Co-Clustering with Dual Supervision
17 0.080791764 107 nips-2008-Influence of graph construction on graph-based clustering measures
18 0.073766753 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
19 0.071260698 29 nips-2008-Automatic online tuning for fast Gaussian summation
20 0.070433743 143 nips-2008-Multi-label Multiple Kernel Learning
topicId topicWeight
[(0, -0.211), (1, -0.051), (2, 0.006), (3, 0.034), (4, 0.013), (5, -0.079), (6, 0.07), (7, -0.061), (8, 0.009), (9, -0.179), (10, 0.091), (11, -0.105), (12, -0.032), (13, 0.14), (14, -0.093), (15, 0.179), (16, -0.227), (17, 0.006), (18, 0.09), (19, -0.053), (20, -0.037), (21, 0.138), (22, -0.022), (23, 0.209), (24, 0.074), (25, -0.068), (26, 0.003), (27, -0.022), (28, -0.052), (29, 0.029), (30, 0.024), (31, 0.024), (32, 0.089), (33, 0.081), (34, 0.036), (35, 0.04), (36, -0.021), (37, 0.018), (38, 0.037), (39, 0.053), (40, 0.003), (41, -0.022), (42, 0.068), (43, 0.108), (44, 0.012), (45, -0.026), (46, -0.011), (47, -0.03), (48, -0.02), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.96768862 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
2 0.89609206 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
Author: Shai Ben-David, Margareta Ackerman
Abstract: Aiming towards the development of a general clustering theory, we discuss abstract axiomatization for clustering. In this respect, we follow up on the work of Kleinberg, ([1]) that showed an impossibility result for such axiomatization. We argue that an impossibility result is not an inherent feature of clustering, but rather, to a large extent, it is an artifact of the specific formalism used in [1]. As opposed to previous work focusing on clustering functions, we propose to address clustering quality measures as the object to be axiomatized. We show that principles like those formulated in Kleinberg’s axioms can be readily expressed in the latter framework without leading to inconsistency. A clustering-quality measure (CQM) is a function that, given a data set and its partition into clusters, returns a non-negative real number representing how strong or conclusive the clustering is. We analyze what clustering-quality measures should look like and introduce a set of requirements (axioms) for such measures. Our axioms capture the principles expressed by Kleinberg’s axioms while retaining consistency. We propose several natural clustering quality measures, all satisfying the proposed axioms. In addition, we analyze the computational complexity of evaluating the quality of a given clustering and show that, for the proposed CQMs, it can be computed in polynomial time. 1
3 0.81267887 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
Author: Deli Zhao, Xiaoou Tang
Abstract: Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we tackle the problem of clustering complex data of multiple distributions and multiple scales. To this end, we develop an algorithm named Zeta l-links (Zell) which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local l-links of samples. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A new mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of a cluster in determinantal form. The popularity character of a cluster is conceptualized as the global fusion of variations of such a structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the hierarchical agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data clustering, imagery pattern clustering, and image segmentation show the competitive performance of Zell. The 98.1% accuracy, in the sense of the normalized mutual information (NMI), is obtained on the FRGC face data of 16028 samples and 466 facial clusters. 1
4 0.74718738 48 nips-2008-Clustering via LP-based Stabilities
Author: Nikos Komodakis, Nikos Paragios, Georgios Tziritas
Abstract: A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and the duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain, and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, can automatically determine the number of clusters, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a centerbased clustering algorithm (selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well defined LP-based quantity that plays a key role to our algorithm’s success. Furthermore, we also introduce, what we call, the margins (another key ingredient in our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potentials of our method.
5 0.73375034 218 nips-2008-Spectral Clustering with Perturbed Data
Author: Ling Huang, Donghui Yan, Nina Taft, Michael I. Jordan
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance. 1
6 0.59123254 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
7 0.53571719 29 nips-2008-Automatic online tuning for fast Gaussian summation
8 0.50078464 107 nips-2008-Influence of graph construction on graph-based clustering measures
9 0.45749015 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
10 0.44606799 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
11 0.43078467 4 nips-2008-A Scalable Hierarchical Distributed Language Model
12 0.42270994 193 nips-2008-Regularized Co-Clustering with Dual Supervision
13 0.41805223 113 nips-2008-Kernelized Sorting
14 0.37086609 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
15 0.36343586 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
16 0.35984102 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
17 0.34529689 112 nips-2008-Kernel Measures of Independence for non-iid Data
18 0.33300632 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
19 0.33101398 61 nips-2008-Diffeomorphic Dimensionality Reduction
20 0.32219312 219 nips-2008-Spectral Hashing
topicId topicWeight
[(6, 0.033), (7, 0.061), (12, 0.03), (28, 0.556), (57, 0.054), (59, 0.014), (62, 0.012), (63, 0.019), (71, 0.024), (77, 0.046), (83, 0.049)]
simIndex simValue paperId paperTitle
1 0.9973858 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking
Author: Nir Ailon
Abstract: The problem of ranking arises ubiquitously in almost every aspect of life, and in particular in Machine Learning/Information Retrieval. A statistical model for ranking predicts how humans rank subsets V of some universe U . In this work we define a statistical model for ranking that satisfies certain desirable properties. The model automatically gives rise to a logistic regression based approach to learning how to rank, for which the score and comparison based approaches are dual views. This offers a new generative approach to ranking which can be used for IR. There are two main contexts for this work. The first is the theory of econometrics and study of statistical models explaining human choice of alternatives. In this context, we will compare our model with other well known models. The second context is the problem of ranking in machine learning, usually arising in the context of information retrieval. Here, much work has been done in the discriminative setting, where different heuristics are used to define ranking risk functions. Our model is built rigorously and axiomatically based on very simple desirable properties defined locally for comparisons, and automatically implies the existence of a global score function serving as a natural model parameter which can be efficiently fitted to pairwise comparison judgment data by solving a convex optimization problem. 1
same-paper 2 0.99714041 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
3 0.99708331 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while also allowing for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. 1
4 0.99587536 72 nips-2008-Empirical performance maximization for linear rank statistics
Author: Stéphan J. Clémençcon, Nicolas Vayatis
Abstract: The ROC curve is known to be the golden standard for measuring performance of a test/scoring statistic regarding its capacity of discrimination between two populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be seen as summaries of the ROC curve. This paper highlights the fact that many of these empirical criteria can be expressed as (conditional) linear rank statistics. We investigate the properties of empirical maximizers of such performance criteria and provide preliminary results for the concentration properties of a novel class of random variables that we will call a linear rank process. 1
5 0.99544108 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
Author: Lavi Shpigelman, Hagai Lalazar, Eilon Vaadia
Abstract: Using machine learning algorithms to decode intended behavior from neural activity serves a dual purpose. First, these tools allow patients to interact with their environment through a Brain-Machine Interface (BMI). Second, analyzing the characteristics of such methods can reveal the relative significance of various features of neural activity, task stimuli, and behavior. In this study we adapted, implemented and tested a machine learning method called Kernel Auto-Regressive Moving Average (KARMA), for the task of inferring movements from neural activity in primary motor cortex. Our version of this algorithm is used in an online learning setting and is updated after a sequence of inferred movements is completed. We first used it to track real hand movements executed by a monkey in a standard 3D reaching task. We then applied it in a closed-loop BMI setting to infer intended movement, while the monkey’s arms were comfortably restrained, thus performing the task using the BMI alone. KARMA is a recurrent method that learns a nonlinear model of output dynamics. It uses similarity functions (termed kernels) to compare between inputs. These kernels can be structured to incorporate domain knowledge into the method. We compare KARMA to various state-of-the-art methods by evaluating tracking performance and present results from the KARMA based BMI experiments. 1
6 0.99541521 206 nips-2008-Sequential effects: Superstition or rational behavior?
7 0.99248564 126 nips-2008-Localized Sliced Inverse Regression
8 0.99178457 174 nips-2008-Overlaying classifiers: a practical approach for optimal ranking
9 0.99004972 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning
10 0.9833945 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
11 0.97392225 159 nips-2008-On Bootstrapping the ROC Curve
12 0.96690947 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields
13 0.95990622 211 nips-2008-Simple Local Models for Complex Dynamical Systems
14 0.95781839 107 nips-2008-Influence of graph construction on graph-based clustering measures
15 0.95770431 101 nips-2008-Human Active Learning
16 0.95702517 223 nips-2008-Structure Learning in Human Sequential Decision-Making
17 0.95520931 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
18 0.95507097 112 nips-2008-Kernel Measures of Independence for non-iid Data
19 0.95336998 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
20 0.94960088 40 nips-2008-Bounds on marginal probability distributions