nips nips2002 nips2002-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Many algorithms rely critically on being given a good metric over their inputs. [sent-6, score-0.53]
2 For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. [sent-7, score-0.92]
3 For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” [sent-8, score-0.076]
4 In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. [sent-10, score-0.853]
5 Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. [sent-11, score-0.543]
6 We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance. [sent-12, score-0.553]
7 1 Introduction. The performance of many learning and data-mining algorithms depends critically on their being given a good metric over the input space. [sent-13, score-0.53]
8 For instance, K-means, nearest-neighbors classifiers and kernel algorithms such as SVMs all need to be given good metrics that reflect reasonably well the important relationships between the data. [sent-14, score-0.271]
9 Worse, if an algorithm were to have clustered by topic, and if we instead wanted it to cluster by writing style, there are relatively few systematic mechanisms for us to convey this to a clustering algorithm, and we are often left tweaking distance metrics by hand. [sent-16, score-0.729]
10 In this paper, we are interested in the following problem: Suppose a user indicates that certain points in an input space (say, $\mathbb{R}^n$) are considered by them to be “similar.” [sent-17, score-0.126]
11 Can we automatically learn a distance metric over $\mathbb{R}^n$ that respects these relationships, i.e., one under which these “similar” points end up close to each other? [sent-18, score-0.689]
12 For instance, in the documents example, we might hope that, by giving it pairs of documents judged to be written in similar styles, it would learn to recognize the critical features for determining style. [sent-21, score-0.292]
13 One important family of algorithms that (implicitly) learn metrics are the unsupervised ones that take an input dataset, and find an embedding of it in some space. [sent-22, score-0.348]
14 One feature distinguishing our work from these is that we will learn a full metric over the input space, rather than focusing only on (finding an embedding for) the points in the training set. [sent-24, score-0.719]
15 Our learned metric thus generalizes more easily to previously unseen data. [sent-25, score-0.564]
16 More importantly, methods such as LLE and MDS also suffer from the “no right answer” problem: For example, if MDS finds an embedding that fails to capture the structure important to a user, it is unclear what systematic corrective actions would be available. [sent-26, score-0.083]
17 As in our motivating clustering example, the methods we propose can also be used in a pre-processing step to help any of these unsupervised algorithms to find better solutions. [sent-28, score-0.243]
18 In the supervised learning setting, for instance nearest neighbor classification, numerous attempts have been made to define or learn either local or global metrics for classification. [sent-29, score-0.403]
19 This literature is too wide to survey here, but some relevant examples include [10, 5, 3, 6], and [1] also gives a good overview of some of this work. [sent-32, score-0.1]
20 If told that certain pairs are “similar” or “dissimilar,” they search for a clustering that puts the similar pairs into the same, and dissimilar pairs into different, clusters. [sent-36, score-0.64]
21 This gives a way of using similarity side-information to find clusters that reflect a user’s notion of meaningful clusters. [sent-37, score-0.156]
22 But similar to MDS and LLE, the (“instance-level”) constraints that they use do not generalize to previously unseen data whose similarity/dissimilarity to the training set is not known. [sent-38, score-0.054]
23 Suppose we have some set of points $\{x_i\}_{i=1}^m \subseteq \mathbb{R}^n$, and are given information that certain pairs of them are “similar”: $(x_i, x_j) \in S$ if $x_i$ and $x_j$ are similar (1). How can we learn a distance metric $d(x, y)$ between points $x$ and $y$ that respects this; specifically, so that “similar” points end up close to each other? [sent-42, score-0.759]
24 Consider learning a distance metric of the form $d(x, y) = d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^\top A (x - y)}$ (2). [sent-43, score-0.786]
25 To ensure that this be a metric—satisfying non-negativity and the triangle inequality—we require that $A$ be positive semi-definite, i.e., $A \succeq 0$. [sent-47, score-0.036]
26 Setting $A = I$ gives Euclidean distance; if we restrict $A$ to be diagonal, this corresponds to learning a metric in which the different axes are given different “weights”; more generally, $A$ parameterizes a family of Mahalanobis distances over $\mathbb{R}^n$. [sent-48, score-0.489]
27 Learning such a distance metric is also equivalent to finding a rescaling of the data that replaces each point $x$ with $A^{1/2}x$ and applying the standard Euclidean metric to the rescaled data; this will later be useful in visualizing the learned metrics. [sent-49, score-0.675]
28 (Technically, this also allows pseudometrics, where $d_A(x, y) = 0$ does not imply $x = y$.) [sent-50, score-0.423]
29 (Note that, by putting the original dataset through a non-linear basis function $\phi$ and considering $d_A(\phi(x), \phi(y))$, non-linear distance metrics can also be learned.) [sent-51, score-0.536]
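The Mahalanobis form (2) and the $A^{1/2}$ rescaling just described are easy to compute directly. The following minimal NumPy sketch is not from the paper; the function names (`mahalanobis`, `rescale`) are hypothetical, and it simply illustrates that the two views agree.

```python
import numpy as np

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return np.sqrt(d @ A @ d)

def rescale(X, A):
    """Map each row x of X to A^{1/2} x, so plain Euclidean distance on the
    rescaled data equals d_A on the original data."""
    eigvals, eigvecs = np.linalg.eigh(A)          # symmetric square root of A
    A_half = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T
    return X @ A_half.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), rng.normal(size=3)
    A = np.diag([1.0, 4.0, 0.25])                 # diagonal A: per-axis weights
    d1 = mahalanobis(x, y, A)
    d2 = np.linalg.norm(rescale(x[None], A)[0] - rescale(y[None], A)[0])
    print(d1, d2)                                 # the two values agree
```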
30 A simple way of defining a criterion for the desired metric is to demand that pairs of points $(x_i, x_j)$ in $S$ have, say, small squared distance between them: $\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$. [sent-52, score-0.734]
31 This is trivially solved with $A = 0$, which is not useful, and we add the constraint $\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$ to ensure that $A$ does not collapse the dataset into a single point. [sent-53, score-0.103]
32 Here, $D$ can be a set of pairs of points known to be “dissimilar” if such information is explicitly available; otherwise, we may take it to be all pairs not in $S$. [sent-54, score-0.261]
33 This gives the optimization problem: $\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$ (3), subject to $\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$ (4) and $A \succeq 0$ (5). [sent-55, score-0.029]
34 We also note that, while one might consider various alternatives to (4), “$\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A^2 \ge 1$” would not be a good choice despite its giving a simple linear constraint; it would result in $A$ always being rank 1 (i.e., the data being projected onto a line). [sent-62, score-0.071]
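Since the objective in (3) is linear in $A$ and the constraints (4)-(5) define a convex set, the problem can in principle be handed to an off-the-shelf convex solver. The sketch below uses cvxpy, which is my assumption rather than anything the paper uses (the paper derives its own special-purpose algorithms); `similar_pairs` and `dissimilar_pairs` are hypothetical lists of index pairs into the data matrix.

```python
import cvxpy as cp
import numpy as np

def learn_metric_convex(X, similar_pairs, dissimilar_pairs):
    """Solve (3)-(5): minimize the summed squared d_A over S, subject to the
    summed d_A over D being at least 1 and A being positive semi-definite."""
    n = X.shape[1]
    A = cp.Variable((n, n), PSD=True)

    # sum_{(i,j) in S} ||x_i - x_j||_A^2 = <A, sum of outer products>: linear in A.
    S_outer = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in similar_pairs)
    objective = cp.Minimize(cp.trace(A @ S_outer))

    # Each ||x_i - x_j||_A is sqrt of an affine function of A, hence concave,
    # so requiring their sum to be >= 1 is a valid convex constraint.
    dissim_sum = sum(
        cp.sqrt(cp.trace(A @ np.outer(X[i] - X[j], X[i] - X[j])))
        for i, j in dissimilar_pairs
    )
    prob = cp.Problem(objective, [dissim_sum >= 1])
    prob.solve(solver=cp.SCS)
    return A.value
```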
35 2.1 The case of diagonal $A$. In the case that we want to learn a diagonal $A = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{nn})$, we can derive an efficient algorithm using the Newton-Raphson method. [sent-66, score-0.027]
36 Define $g(A) = g(A_{11}, \ldots, A_{nn}) = \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 - \log\Big(\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A\Big)$. [sent-67, score-0.177]
38 It is straightforward to show that minimizing $g$ (subject to $A \succeq 0$) is equivalent, up to a multiplication of $A$ by a positive constant, to solving the original problem (3–5). [sent-71, score-0.038]
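For the diagonal case, $g$ can be written down directly and minimized over the non-negative orthant. The paper uses Newton-Raphson with a line search; the sketch below substitutes an off-the-shelf bound-constrained quasi-Newton routine (SciPy's L-BFGS-B), which is a simplification on my part, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def learn_diagonal_metric(X, similar_pairs, dissimilar_pairs):
    """Minimize g(a) = sum_S sum_k a_k (x_ik - x_jk)^2
                     - log( sum_D sqrt(sum_k a_k (x_ik - x_jk)^2) )
    over a >= 0, and return A = diag(a)."""
    # Precompute squared coordinate-wise differences for each pair.
    S_sq = np.array([(X[i] - X[j]) ** 2 for i, j in similar_pairs])     # |S| x n
    D_sq = np.array([(X[i] - X[j]) ** 2 for i, j in dissimilar_pairs])  # |D| x n

    def g(a):
        sim_term = np.sum(S_sq @ a)
        dis_term = np.sum(np.sqrt(np.maximum(D_sq @ a, 1e-12)))
        return sim_term - np.log(dis_term)

    a0 = np.ones(X.shape[1])
    res = minimize(g, a0, method="L-BFGS-B",
                   bounds=[(0.0, None)] * X.shape[1])
    return np.diag(res.x)
```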
39 2.2 The case of full $A$. In the case of learning a full matrix $A$, the constraint that $A \succeq 0$ becomes slightly trickier to enforce, and Newton's method often becomes prohibitively expensive (requiring $O(n^6)$ time to invert the Hessian over the $n^2$ parameters). [sent-74, score-0.146]
40 Using gradient descent and the idea of iterative projections, we derive a different algorithm for this setting. [sent-75, score-0.057]
41 Decomposing $A = ww^\top$ (always possible since $A \succeq 0$), this gives a Rayleigh-quotient-like quantity whose solution is given by (say) solving a generalized eigenvector problem for the principal eigenvector, and setting $A = ww^\top$. [sent-82, score-0.105]
42 To ensure that $A \succeq 0$, which for diagonal $A$ holds iff the diagonal elements are non-negative, we actually replace the Newton update $\Delta A = -H^{-1}\nabla g$ by $\Delta A = -\alpha H^{-1}\nabla g$, where $\alpha$ is a step-size parameter optimized via a line-search to give the largest downhill step subject to $A \succeq 0$. [sent-83, score-0.196]
43 [Figure 1 (Gradient ascent + iterative projection algorithm): iterate { iterate { $A := \arg\min_{A'}\{\|A'-A\|_F : A' \in C_1\}$; $A := \arg\min_{A'}\{\|A'-A\|_F : A' \in C_2\}$ } until $A$ converges; take a gradient ascent step on $g(A)$ } until convergence.] [sent-87, score-0.236]
44 (Here $\|\cdot\|_F$ is the Frobenius norm on matrices.) We pose the equivalent problem: $\max_A\ g(A) = \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A$ (6), subject to $f(A) = \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \le 1$ (7) and $A \succeq 0$ (8). [sent-89, score-0.256]
45 We will use a gradient ascent step on $g(A)$ to optimize (6), followed by the method of iterative projections to ensure that the constraints (7) and (8) hold. [sent-97, score-0.174]
46 Specifically, we will repeatedly take a gradient step $A := A + \alpha \nabla_A g(A)$, and then repeatedly project $A$ into the sets $C_1 = \{A : \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \le 1\}$ and $C_2 = \{A : A \succeq 0\}$. [sent-98, score-0.112]
47 This gives the algorithm shown in Figure 1. [sent-99, score-0.029]
48 The motivation for the specific choice of the problem formulation (6–8) is that projecting onto $C_1$ or $C_2$ can be done inexpensively. [sent-100, score-0.036]
49 Specifically, the first projection step involves minimizing a quadratic objective subject to a single linear constraint; the solution to this is easily found by solving a sparse system of linear equations. [sent-101, score-0.131]
50 The second projection step, onto $C_2$, the space of all positive semi-definite matrices, is done by first finding the diagonalization $A = X \Lambda X^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix of $A$'s eigenvalues and the columns of $X$ contain $A$'s corresponding eigenvectors, and taking $A' = X \Lambda' X^\top$ with $\Lambda' = \mathrm{diag}(\max(\lambda_1, 0), \ldots, \max(\lambda_n, 0))$. [sent-102, score-0.246]
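Putting the pieces together, the following sketch mirrors the structure of Figure 1: a gradient ascent step on $g(A)$, then alternating Frobenius-norm projections onto $C_1$ (treated here as a half-space, since $f(A)$ is linear in $A$) and onto the PSD cone $C_2$ via eigenvalue clipping. The step size, iteration counts, ordering of steps, and the half-space treatment of $C_1$ are my simplifications under those assumptions, not the paper's exact choices.

```python
import numpy as np

def outer_sum(X, pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over the given index pairs."""
    M = np.zeros((X.shape[1], X.shape[1]))
    for i, j in pairs:
        d = X[i] - X[j]
        M += np.outer(d, d)
    return M

def project_psd(A):
    """Projection onto C2: clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

def project_halfspace(A, W):
    """Projection onto C1 = {A : <A, W> <= 1} under the Frobenius inner product."""
    viol = np.sum(A * W) - 1.0
    if viol <= 0:
        return A
    return A - (viol / np.sum(W * W)) * W

def grad_g(A, X, dissimilar_pairs, eps=1e-12):
    """Gradient of g(A) = sum_D ||x_i - x_j||_A."""
    G = np.zeros_like(A)
    for i, j in dissimilar_pairs:
        d = X[i] - X[j]
        G += np.outer(d, d) / (2.0 * np.sqrt(max(d @ A @ d, eps)))
    return G

def learn_full_metric(X, similar_pairs, dissimilar_pairs,
                      step=0.1, outer_iters=200, proj_iters=50):
    W = outer_sum(X, similar_pairs)      # f(A) = <A, W>
    A = np.eye(X.shape[1])
    for _ in range(outer_iters):
        A = A + step * grad_g(A, X, dissimilar_pairs)   # gradient ascent on g
        for _ in range(proj_iters):                     # iterative projections
            A = project_halfspace(A, W)
            A = project_psd(A)
    return A
```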
51 3 Experiments and Examples. We begin by giving some examples of distance metrics learned on artificial data, and then show how our methods can be used to improve clustering performance. [sent-109, score-0.682]
52 3.1 Examples of learned distance metrics. Consider the data shown in Figure 2(a), which is divided into two classes (shown by the different symbols and, where available, colors). [sent-111, score-0.394]
53 Suppose that points in each class are “similar” to each other, and we are given $S$ reflecting this. [sent-112, score-0.107]
54 Depending on whether we learn a diagonal or a full $A$, we obtain: [learned matrix entries not recoverable from the extraction]. [sent-113, score-0.25]
55 In the experiments with synthetic data, $S$ was a randomly sampled 1% of all pairs of similar points. [sent-126, score-0.094]
56 (b) Rescaling of data corresponding to learned diagonal $A$. [sent-128, score-0.183]
57 [Figure 3 residue: panels “3-class data (original)” and “3-class data projection (Newton)”, with x, y, z axes.] Figure 3: (a) Original data. [sent-130, score-0.078]
58 (b) Rescaling corresponding to learned diagonal $A$; (c) rescaling corresponding to full $A$. [sent-131, score-0.256]
59 As we see, the algorithm has successfully brought together the similar points, while keeping dissimilar ones apart. [sent-135, score-0.102]
60 Figure 3 shows a similar result for a case of three clusters whose centroids differ only in the x and y directions. [sent-136, score-0.171]
61 As we see in Figure 3(b), the learned diagonal metric correctly ignores the z direction. [sent-137, score-0.643]
62 Interestingly, in the case of a full $A$, the algorithm finds a surprising projection of the data onto a line that still maintains the separation of the clusters well. [sent-138, score-0.291]
63 3.2 Application to clustering. One application of our methods is “clustering with side information,” in which we learn a distance metric using similarity information, and cluster data using that metric. [sent-140, score-0.93]
64 Specifically, suppose we are given $S$, and told that each pair $(x_i, x_j) \in S$ means $x_i$ and $x_j$ belong to the same cluster. [sent-141, score-0.065]
65 Constrained K-means: K-means but subject to points $(x_i, x_j) \in S$ always being assigned to the same cluster [12]. [sent-145, score-0.225]
66 K-means using the default Euclidean metric between points to define distortion (and ignoring $S$). [sent-147, score-0.563]
68 K-means + metric: K-means but with distortion defined using the distance metric learned from $S$. [sent-149, score-0.649]
69 Constrained K-means + metric: Constrained K-means using the distance metric learned from $S$; a short code sketch of clustering with the learned metric under these constraints is given below. [sent-151, score-0.619]
70 This is implemented as the usual K-means, except if $(x_i, x_j) \in S$, then during the step in which points are assigned to cluster centroids, we assign both $x_i$ and $x_j$ to the same cluster. [sent-152, score-0.388]
71 More generally, if we imagine drawing an edge between each pair of points in $S$, then all the points in each resulting connected component are constrained to lie in the same cluster, which we pick to be the cluster minimizing the total distortion of the component's points. [sent-153, score-0.287]
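As a rough illustration of “K-means + metric” and of the connected-component handling of the constraints, the sketch below rescales the data by $A^{1/2}$, runs scikit-learn's KMeans, and then re-assigns each constraint-induced connected component as a block to its nearest centroid. scikit-learn and SciPy are my assumptions here, and the one-shot block re-assignment is a simplification of the paper's constrained K-means, not the exact procedure.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def cluster_with_metric(X, A, k, similar_pairs=()):
    # "K-means + metric": rescale by A^{1/2} and run ordinary K-means.
    w, V = np.linalg.eigh(A)
    A_half = V @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ V.T
    Z = X @ A_half
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    labels, centers = km.labels_.copy(), km.cluster_centers_

    if similar_pairs:
        # Points joined by edges in S must share a cluster: assign each
        # connected component as a block to the centroid nearest its mean.
        m = X.shape[0]
        rows = [i for i, _ in similar_pairs]
        cols = [j for _, j in similar_pairs]
        graph = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(m, m))
        n_comp, comp = connected_components(graph, directed=False)
        for c in range(n_comp):
            members = np.where(comp == c)[0]
            mean = Z[members].mean(axis=0)
            best = np.argmin(np.linalg.norm(centers - mean, axis=1))
            labels[members] = best
    return labels
```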
72 [Figure 4 residue: panels “Original 2-class data” (a) and “Projected 2-class data” (b), with x, y, z axes.] K-means: Accuracy = 0.5060 [sent-155, score-0.039]
73 K-means + metric: Accuracy = 1; Constrained K-means + metric: Accuracy = 1. Figure 4: (a) Original dataset (b) Data scaled according to learned metric. [sent-157, score-0.17]
74 Let $\hat{c}_i$ be the cluster to which point $x_i$ is assigned by an automatic clustering algorithm, and let $c_i$ be some “correct” or desired clustering of the data; we measure $\mathrm{Accuracy} = \sum_{i > j} \mathbf{1}\{\mathbf{1}\{c_i = c_j\} = \mathbf{1}\{\hat{c}_i = \hat{c}_j\}\} \big/ \big(0.5\,m(m-1)\big)$. [sent-163, score-0.584]
75 This is equivalent to the probability that, for two points $x_i$, $x_j$ drawn randomly from the dataset, our clustering agrees with the “true” clustering on whether $x_i$ and $x_j$ belong to the same or different clusters. [sent-168, score-0.509]
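A direct implementation of this pair-counting accuracy is short; the sketch below is a straightforward reading of the measure described above (the helper name is hypothetical, and the code is not taken from the paper).

```python
import numpy as np

def pair_accuracy(true_labels, pred_labels):
    """Fraction of point pairs (i > j) on which the predicted clustering agrees
    with the true one about same-cluster vs. different-cluster membership."""
    c = np.asarray(true_labels)
    chat = np.asarray(pred_labels)
    m = len(c)
    same_true = c[:, None] == c[None, :]
    same_pred = chat[:, None] == chat[None, :]
    iu = np.triu_indices(m, k=1)        # all pairs with i < j, each counted once
    return np.mean(same_true[iu] == same_pred[iu])
```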
76 As shown by the accuracy scores given in the figure, both K-means and constrained K-means failed to find good clusterings. [sent-170, score-0.245]
77 But by first learning a distance metric and then clustering according to that metric, we easily find the correct clustering separating the true clusters from each other. [sent-171, score-1.11]
78 Here, the “true clustering” is given by the data’s class labels. [sent-174, score-0.034]
79 We see that, in almost every problem, using a learned diagonal or full metric leads to significantly improved performance over naive K-means. [sent-177, score-0.716]
80 In this setting, we therefore modified the measure, averaging not only over pairs $x_i$, $x_j$ drawn uniformly at random, but over pairs drawn from the same cluster (as determined by $\hat{c}$) with chance 0.5. [sent-179, score-0.124]
81 $S$ was generated by picking a random subset of all pairs of points sharing the same class $c_i$. [sent-183, score-0.201]
82 In the case of “little” side-information, the size of the subset was chosen so that the number of resulting connected components (see footnote 7) would be very roughly 90% of the size of the original dataset. [sent-184, score-0.063]
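One way to realize this construction is to keep adding randomly chosen same-class pairs to $S$ until the number of connected components falls to the desired fraction of the dataset size; the sketch below does exactly that with a small union-find structure. The paper's exact sampling protocol may differ, so treat this as an illustrative reading of the description above.

```python
import random

def sample_side_info(labels, target_fraction=0.9, seed=0):
    """Add random same-class pairs to S until the number of connected
    components is roughly target_fraction * len(labels)."""
    m = len(labels)
    parent = list(range(m))

    def find(i):                          # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rng = random.Random(seed)
    same_class = [(i, j) for i in range(m) for j in range(i + 1, m)
                  if labels[i] == labels[j]]
    rng.shuffle(same_class)

    S, n_components = [], m
    for i, j in same_class:
        if n_components <= target_fraction * m:
            break
        ri, rj = find(i), find(j)
        S.append((i, j))
        if ri != rj:                      # merging two components
            parent[ri] = rj
            n_components -= 1
    return S
```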
83 [Accuracy scores for Figure 5 (first entries truncated): … = 0.5701; K-means + metric: Accuracy = 1; Constrained K-means + metric: Accuracy = 1.] Figure 5: (a) Original dataset (b) Data scaled according to learned metric. [sent-193, score-0.17]
84 [Figure 6 residue: per-dataset panels annotated with Kc values for the “little” and “much” side-information settings; recoverable panel titles include wine (N=168, C=3, d=12), balance (N=625, C=3, d=4), breast cancer (N=569, C=2, d=30), soy bean (N=47, C=4, d=35), protein (N=116, C=6, d=20), and diabetes (N=768, C=2, d=8).] [sent-207, score-0.076]
86 Figure 6: Clustering accuracy on 9 UCI datasets. [sent-231, score-0.069]
87 In each panel, the six bars on the left correspond to an experiment with “little” side-information $S$, and the six on the right to “much” side-information. [sent-232, score-0.108]
88 From left to right, the six bars in each set are respectively K-means, K-means + diagonal metric, K-means + full metric, Constrained K-means (C-Kmeans), C-Kmeans + diagonal metric, and C-Kmeans + full metric. [sent-233, score-0.431]
89 [Figure 7 residue: panels “Performance on Protein dataset” and “Performance on Wine dataset”; legend: kmeans, c-kmeans, kmeans + metric (diag A), c-kmeans + metric (diag A), kmeans + metric (full A), c-kmeans + metric (full A); x-axis labels: (a) ratio of constraints, (b) s.] [sent-242, score-0.134]
92 Figure 7: Plots of accuracy vs. amount of side-information. [sent-256, score-0.095]
93 Here, the $x$-axis gives the fraction of all pairs of points in the same class that are randomly sampled to be included in $S$. [sent-258, score-0.23]
94 Not surprisingly, we also see that having more side-information in $S$ typically leads to metrics giving better clusterings. [sent-260, score-0.27]
95 Figure 7 also shows two typical examples of how the quality of the clusterings found increases with the amount of side-information. [sent-261, score-0.035]
96 For some problems (e.g., wine), our algorithm learns good diagonal and full metrics quickly with only a very small amount of side-information. [sent-264, score-0.483]
97 For some others (e.g., protein), the distance metric, particularly the full metric, appears harder to learn and provides less benefit over constrained K-means. [sent-266, score-0.342]
98 4 Conclusions. We have presented an algorithm that, given examples of similar pairs of points in $\mathbb{R}^n$, learns a distance metric that respects these relationships. [sent-267, score-0.853]
99 Our method is based on posing metric learning as a convex optimization problem, which allowed us to derive efficient, local-optima-free algorithms. [sent-268, score-0.543]
100 We also showed examples of diagonal and full metrics learned from simple artificial examples, and demonstrated on artificial and on UCI datasets how our methods can be used to improve clustering performance. [sent-269, score-0.744]
wordName wordTfidf (topN-words)
[('metric', 0.46), ('dbb', 0.385), ('kc', 0.367), ('kmeans', 0.268), ('metrics', 0.235), ('clustering', 0.218), ('frr', 0.128), ('constrained', 0.116), ('diagonal', 0.107), ('clusters', 0.104), ('dissimilar', 0.102), ('cluster', 0.099), ('rescaling', 0.098), ('bb', 0.094), ('pairs', 0.094), ('distance', 0.083), ('projection', 0.078), ('learned', 0.076), ('respects', 0.076), ('wine', 0.076), ('points', 0.073), ('full', 0.073), ('learn', 0.07), ('accuracy', 0.069), ('centroids', 0.067), ('dataset', 0.067), ('mds', 0.065), ('pe', 0.06), ('newton', 0.058), ('user', 0.053), ('posing', 0.051), ('wagstaff', 0.051), ('lle', 0.049), ('diag', 0.047), ('tw', 0.045), ('embedding', 0.043), ('protein', 0.041), ('indistinguishable', 0.041), ('systematic', 0.04), ('classi', 0.039), ('vs', 0.039), ('told', 0.038), ('original', 0.038), ('six', 0.037), ('ensure', 0.036), ('good', 0.036), ('bar', 0.036), ('neighbor', 0.036), ('cally', 0.036), ('onto', 0.036), ('examples', 0.035), ('giving', 0.035), ('documents', 0.034), ('class', 0.034), ('uw', 0.034), ('colors', 0.034), ('critically', 0.034), ('fe', 0.034), ('bars', 0.034), ('gradient', 0.033), ('nearest', 0.033), ('learns', 0.032), ('convex', 0.032), ('euclidean', 0.032), ('iterate', 0.031), ('answer', 0.03), ('ascent', 0.03), ('visually', 0.03), ('distortion', 0.03), ('style', 0.029), ('instance', 0.029), ('gives', 0.029), ('unseen', 0.028), ('writing', 0.028), ('xy', 0.028), ('subject', 0.028), ('gf', 0.027), ('locally', 0.027), ('say', 0.027), ('according', 0.027), ('repeatedly', 0.027), ('suppose', 0.027), ('constraints', 0.026), ('nding', 0.026), ('principal', 0.026), ('clustered', 0.026), ('topic', 0.026), ('chance', 0.025), ('connected', 0.025), ('assigned', 0.025), ('recognize', 0.025), ('berkeley', 0.025), ('eigenvector', 0.025), ('step', 0.025), ('iterative', 0.024), ('scores', 0.024), ('empirically', 0.024), ('desired', 0.024), ('ip', 0.023), ('meaningful', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
2 0.20406401 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
3 0.13931715 98 nips-2002-Going Metric: Denoising Pairwise Data
Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1
4 0.13655664 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
5 0.11917821 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.11316101 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
7 0.10773071 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
8 0.10766652 124 nips-2002-Learning Graphical Models with Mercer Kernels
9 0.10626076 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
10 0.10349868 83 nips-2002-Extracting Relevant Structures with Side Information
11 0.10262294 90 nips-2002-Feature Selection in Mixture-Based Clustering
12 0.099602222 188 nips-2002-Stability-Based Model Selection
13 0.092466101 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
14 0.083863825 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
15 0.079544306 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
16 0.078234367 113 nips-2002-Information Diffusion Kernels
17 0.074733183 115 nips-2002-Informed Projections
18 0.071093827 87 nips-2002-Fast Transformation-Invariant Factor Analysis
19 0.06669192 40 nips-2002-Bayesian Models of Inductive Generalization
20 0.058700748 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
topicId topicWeight
[(0, -0.196), (1, -0.096), (2, 0.031), (3, 0.002), (4, -0.093), (5, 0.078), (6, -0.063), (7, -0.21), (8, -0.163), (9, 0.205), (10, 0.065), (11, 0.044), (12, -0.101), (13, 0.032), (14, 0.001), (15, 0.038), (16, -0.031), (17, -0.039), (18, -0.028), (19, 0.055), (20, 0.015), (21, -0.058), (22, 0.04), (23, 0.051), (24, 0.072), (25, -0.03), (26, 0.051), (27, 0.093), (28, 0.021), (29, -0.08), (30, -0.012), (31, -0.048), (32, 0.101), (33, 0.188), (34, 0.048), (35, 0.043), (36, 0.005), (37, -0.004), (38, 0.025), (39, -0.006), (40, -0.002), (41, -0.086), (42, 0.072), (43, -0.024), (44, 0.031), (45, -0.043), (46, -0.002), (47, 0.032), (48, -0.122), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96542341 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
2 0.82029915 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
3 0.71595728 98 nips-2002-Going Metric: Denoising Pairwise Data
Author: Volker Roth, Julian Laub, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multi-dimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence data bases, visibly improving performance upon existing automatic methods. 1
4 0.61756432 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
Author: Nicholas P. Hughes, David Lowe
Abstract: We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SS TRESS measure) gives rise to an annular structure when the input data is drawn from a highdimensional isotropic distribution, and we provide a theoretical justification for this observation.
5 0.61354345 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
Author: Dmitry Y. Pavlov, David M. Pennock
Abstract: We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic— conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the apriori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
6 0.58459061 188 nips-2002-Stability-Based Model Selection
7 0.56528926 83 nips-2002-Extracting Relevant Structures with Side Information
8 0.52926099 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
9 0.51376599 53 nips-2002-Clustering with the Fisher Score
10 0.44005254 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
11 0.41549763 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
12 0.40926871 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
13 0.39985043 142 nips-2002-Maximum Likelihood and the Information Bottleneck
14 0.39661705 107 nips-2002-Identity Uncertainty and Citation Matching
15 0.39555299 124 nips-2002-Learning Graphical Models with Mercer Kernels
16 0.37452817 40 nips-2002-Bayesian Models of Inductive Generalization
17 0.37373206 190 nips-2002-Stochastic Neighbor Embedding
18 0.36752898 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
19 0.35327128 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
20 0.34526879 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
topicId topicWeight
[(1, 0.246), (14, 0.034), (23, 0.015), (42, 0.09), (54, 0.176), (55, 0.042), (57, 0.015), (67, 0.012), (68, 0.025), (74, 0.099), (92, 0.032), (98, 0.109)]
simIndex simValue paperId paperTitle
1 0.90908486 87 nips-2002-Fast Transformation-Invariant Factor Analysis
Author: Anitha Kannan, Nebojsa Jojic, Brendan J. Frey
Abstract: Dimensionality reduction techniques such as principal component analysis and factor analysis are used to discover a linear mapping between high dimensional data samples and points in a lower dimensional subspace. In [6], Jojic and Frey introduced mixture of transformation-invariant component analyzers (MTCA) that can account for global transformations such as translations and rotations, perform clustering and learn local appearance deformations by dimensionality reduction. However, due to enormous computational requirements of the EM algorithm for learning the model, O( ) where is the dimensionality of a data sample, MTCA was not practical for most applications. In this paper, we demonstrate how fast Fourier transforms can reduce the computation to the order of log . With this speedup, we show the effectiveness of MTCA in various applications - tracking, video textures, clustering video sequences, object recognition, and object detection in images. ¡ ¤ ¤ ¤ ¤
2 0.87022626 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
Author: Jakob Heinzle, Alan Stocker
Abstract: We report a system that classifies and can learn to classify patterns of visual motion on-line. The complete system is described by the dynamics of its physical network architectures. The combination of the following properties makes the system novel: Firstly, the front-end of the system consists of an aVLSI optical flow chip that collectively computes 2-D global visual motion in real-time [1]. Secondly, the complexity of the classification task is significantly reduced by mapping the continuous motion trajectories to sequences of ’motion events’. And thirdly, all the network structures are simple and with the exception of the optical flow chip based on a Winner-Take-All (WTA) architecture. We demonstrate the application of the proposed generic system for a contactless man-machine interface that allows to write letters by visual motion. Regarding the low complexity of the system, its robustness and the already existing front-end, a complete aVLSI system-on-chip implementation is realistic, allowing various applications in mobile electronic devices.
same-paper 3 0.8285988 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
Author: Eric P. Xing, Michael I. Jordan, Stuart Russell, Andrew Y. Ng
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many “plausible” ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider “similar.” For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
4 0.70892596 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
5 0.70695561 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.70622796 137 nips-2002-Location Estimation with a Differential Update Network
7 0.70265466 124 nips-2002-Learning Graphical Models with Mercer Kernels
8 0.70038575 190 nips-2002-Stochastic Neighbor Embedding
9 0.70028615 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
10 0.70016778 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions
11 0.69899529 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
12 0.69837892 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.69791812 169 nips-2002-Real-Time Particle Filters
14 0.69780529 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.69727629 3 nips-2002-A Convergent Form of Approximate Policy Iteration
16 0.69723374 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
17 0.69613534 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
18 0.69590831 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
19 0.6955781 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
20 0.69510525 96 nips-2002-Generalized² Linear² Models