nips nips2000 nips2000-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Naftali Tishby, Noam Slonim
Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provides a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Data clustering by Markovian relaxation and the Information Bottleneck Method Naftali Tishby and Noam Slonim School of Computer Science and Engineering and Center for Neural Computation * The Hebrew University, Jerusalem, 91904 Israel email: {tishby. [sent-1, score-0.643]
2 il Abstract We introduce a new, non-parametric and principled, distance based clustering method. [sent-5, score-0.248]
3 This method combines a pairwise based approach with a vector-quantization method which provides a meaningful interpretation to the resulting clusters. [sent-6, score-0.43]
4 The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual-information during the relaxation of this process. [sent-7, score-0.691]
5 The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. [sent-8, score-0.52]
6 These clusters capture the information about the initial point of the relaxation in the most effective way. [sent-9, score-0.868]
7 The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. [sent-10, score-0.183]
8 1 Introduction Data clustering is one of the most fundamental pattern recognition problems, with numerous algorithms and applications. [sent-11, score-0.164]
9 Yet, the problem itself is ill-defined: the goal is to find a "reasonable" partition of data points into classes or clusters. [sent-12, score-0.161]
10 What is meant by "reasonable" depends on the application, the representation of the data, and the assumptions about the origins of the data points, among other things. [sent-13, score-0.042]
11 One important class of clustering methods is for cases where the data is given as a matrix of pairwise distances or (dis)similarity measures. [sent-14, score-0.72]
12 Often these distances come from empirical measurement or some complex process, and there is no direct access to, or even a precise definition of, the distance function. [sent-15, score-0.289]
13 In many cases this distance does not form a metric, or it may even be non-symmetric. [sent-16, score-0.084]
14 Such data does not necessarily come as a sample of some meaningful distribution and even the issue of generalization and sample to sample fluctuations is not well defined. [sent-17, score-0.168]
15 Algorithms that use only the pairwise distances, without explicit use of the distance measure itself, employ statistical mechanics analogies [3] or collective graph theoretical properties [6], etc. [sent-18, score-0.429]
16 The points are then grouped based on some global criteria, such as connected components, small cuts, or minimum alignment energy. [sent-19, score-0.119]
17 Such algorithms are sometimes computationally inefficient and in most cases it is difficult to interpret the resulting clusters. (*Work supported in part by the US-Israel Binational Science Foundation (BSF) and by the Human Frontier Science Project (HFSP).) [sent-20, score-0.066]
18 That is, it is hard to determine a common property of all the points in one cluster - other than that the clusters "look reasonable". [sent-25, score-0.435]
19 A second class of clustering methods is represented by the generalized vector quantization (VQ) algorithm. [sent-26, score-0.164]
20 One fits representative models (e.g. Gaussian distributions) to the points in each cluster, such that an average (known) distortion between the data points and their corresponding representative is minimized. [sent-29, score-0.375]
21 This type of algorithm may rely on theoretical frameworks, such as rate distortion theory, and provide a much better interpretation for the resulting clusters. [sent-30, score-0.259]
22 VQ type algorithms can also be more computationally efficient since they require the calculation of distances, or distortion, between the data and the centroid models only, not between every pair of data points. [sent-31, score-0.123]
23 On the other hand, they require the knowledge of the distortion function and thus make specific assumptions about the underlying structure or model of the data. [sent-32, score-0.095]
24 In this paper we present a new, information-theoretic combination of pairwise clustering with a meaningful and intuitive interpretation for the resulting clusters. [sent-33, score-0.753]
25 In addition, our algorithm provides a clear and objective figure of merit for the clusters - without making any assumption on the origin or structure of the data points. [sent-34, score-0.253]
26 2 Pairwise distances and Markovian relaxation The first step of our algorithm is to turn the pairwise distance matrix into a Markov process, through the following simple intuition. [sent-35, score-1.001]
27 Assign a state of a Markov chain to each of the data points and transition probabilities between the states/points as a function of their pairwise distances. [sent-36, score-0.549]
28 Thus the data can be considered as a directed graph with the points as nodes and the pairwise distances, which need not be symmetric or form a metric, on the arcs of the graph. [sent-37, score-0.506]
29 Distances are additive; i.e., the length of a trajectory on the graph is the sum of the arc-lengths. [sent-40, score-0.195]
30 Probabilities, on the other hand, are multiplicative for independent events, so if we want the probability of a (random) trajectory on the graph to be naturally related to its length, the transition probabilities between points should be exponential in their distance. [sent-41, score-0.576]
31 Denoting by $d(x_i, x_j)$ the pairwise distance between the points $x_i$ and $x_j$,¹ the transition probability that our Markov chain moves from the point $x_j$ at time $t$ to the point $x_i$ at time $t+1$, $P_{i,j} \equiv p(x_i(t+1) \mid x_j(t))$, is chosen as $p(x_i(t+1) \mid x_j(t)) \propto \exp(-\lambda\, d(x_i, x_j))$, (1) [sent-42, score-0.744]
32 where $\lambda^{-1}$ is a length scaling factor that equals the mean pairwise distance of the $k$ nearest neighbors to the point $x_i$. [sent-44, score-0.397]
33 The details of this rescaling are not so important for the final results, and a similar exponentiation of the distances, without our probabilistic interpretation, was performed in other clustering works (see e. [sent-45, score-0.197]
34 A proper normalization of each row is required to turn this matrix into a stochastic transition matrix. [sent-48, score-0.227]
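To make this construction concrete, here is a minimal sketch (not the authors' code; the function name, the numpy-only implementation, and the convention that rows index the starting point are my assumptions) of turning a pairwise distance matrix into the row-stochastic transition matrix of Eq. (1), with the local scale $\lambda_i^{-1}$ taken as the mean distance to the k nearest neighbors:

```python
import numpy as np

def markov_from_distances(D, k=5):
    """Row-stochastic transition matrix built from a pairwise distance matrix.

    D need not be symmetric or form a metric; row i of the result is the
    1-step transition distribution out of point x_i. Assumes D has a zero diagonal.
    """
    D = np.asarray(D, dtype=float)
    # lambda_i^{-1}: mean distance of the k nearest neighbors of x_i
    # (column 0 of each sorted row is the point itself, at distance 0).
    nearest = np.sort(D, axis=1)[:, 1:k + 1]
    scale = nearest.mean(axis=1)
    W = np.exp(-D / scale[:, None])          # exp(-lambda_i * d(x_i, x_j))
    return W / W.sum(axis=1, keepdims=True)  # normalize each row, as in Eq. (1)
```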
35 Given this transition matrix, one can imagine a random walk starting at every point on the graph. [sent-49, score-0.437]
36 Specifically, the probability distribution of the positions of a random walk starting at $x_j$, after $t$ time steps, is given by the $j$-th row of the $t$-th iteration of the 1-step transition matrix. [sent-50, score-0.276]
37 Denoting by $P^t$ the $t$-step transition matrix, $P^t = (P)^t$ is indeed the $t$-th power of the 1-step transition probability matrix. [sent-51, score-0.594]
38 The probability of a random walk starting at $x_j$ at time 0 to be at $x_i$ at time $t$ is thus $p(x_i(t) \mid x_j(0)) = P^t_{i,j}$. (2) [sent-52, score-0.326]
39 ¹Henceforth we take the number of data points to be $n$ and the point indices run implicitly from 1 to $n$ unless stated otherwise. [sent-53, score-0.206]
40 If we assume that all the given pairwise distances are finite, we obtain in this way an ergodic Markov process with a single stationary distribution, denoted by $\pi$. [sent-54, score-0.561]
41 This distribution is a right-eigenvector of the $t$-step transition matrix (for every $t$), since $\pi_i = \sum_j P_{i,j}\,\pi_j$. [sent-55, score-0.234]
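A short sketch of the relaxation itself (again a hypothetical helper, not the paper's code): $P^t$ is obtained by matrix powers, and $\pi$ can be found by power iteration. Note that with the row-stochastic convention of the sketch above, $\pi$ enters as a left eigenvector; the text's right-eigenvector statement uses the column convention $P_{i,j} = p(x_i(t+1) \mid x_j(t))$.

```python
import numpy as np

def relax(P, t):
    """t-step transition matrix P^t; its rows relax toward pi as t grows."""
    return np.linalg.matrix_power(P, t)

def stationary_distribution(P, tol=1e-12, max_iter=100000):
    """Stationary distribution of a row-stochastic P, by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi
```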
42 During the dynamics of the Markov process any initial state distribution is going to relax to this final stationary distribution and the information about the initial point of a random walk is completely lost. [sent-59, score-0.659]
43 [Figure 1 panel titles: rate of information loss for example 1; colon data, rate of information loss vs. cluster accuracy.] [sent-60, score-0.478]
44 Figure 1: On the left is shown an example of data, consisting of 150 points in 2D. [sent-66, score-0.119]
45 In the middle, we plot the rate of information loss, $-\frac{dI(t)}{dt}$, during the relaxation. [sent-67, score-0.151]
46 Notice that the algorithm has no prior information about circles or ellipses. [sent-68, score-0.096]
47 The rate of the information loss is slow when the "random walks" stabilize on some substructures of the data - our proposed clusters. [sent-69, score-0.445]
48 On the right we plot the rate of information loss for the colon cancer data, and the accuracy of the obtained clusters, for different relaxation times, with respect to the original classes. [sent-70, score-1.012]
49 2.1 Relaxation of the mutual information The natural way to quantify the information loss during this relaxation process is by the mutual information between the initial point variable, $X(0) = \{x_j(0)\}$, and the point of the random walk at time $t$, $X(t) = \{x_i(t)\}$. [sent-72, score-1.754]
50 The $D_{KL}$ is the Kullback-Leibler divergence [4], defined as $D_{KL}[p\|q] \equiv \sum_y p(y)\log\frac{p(y)}{q(y)}$, which is the information theoretic measure of similarity of distributions. [sent-74, score-0.254]
51 Since all the rows $j$ relax to $\pi$, this divergence goes to zero as $t \to \infty$. [sent-75, score-0.188]
52 While it is clear that the information about the initial point, I(t), decays monotonically (exponentially asymptotically) to zero, the rate of this decay at finite t conveys much information on the structure of the data points. [sent-76, score-0.433]
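The following sketch computes $I(t) = I(X(0); X(t))$ for a uniform prior over the starting points and its finite-difference decay rate; the dyadic time grid and the helper names are my choices, not the paper's. Plateaus of the returned rates correspond to the quasi-stable relaxation times discussed below and shown in Figure 1.

```python
import numpy as np

def mutual_information(Pt, eps=1e-300):
    """I(X(0); X(t)) in bits, with rows of Pt = p(x(t)|x(0)) and uniform p(x(0))."""
    marginal = Pt.mean(axis=0)                            # p(x_i(t)) under a uniform start
    terms = Pt * np.log2(Pt / (marginal[None, :] + eps) + eps)
    return terms.sum() / Pt.shape[0]

def information_decay(P, t_max=256):
    """I(t) on a dyadic grid t = 1, 2, 4, ... and the decay rates between grid points."""
    ts = [2 ** k for k in range(int(np.log2(t_max)) + 1)]
    I = [mutual_information(np.linalg.matrix_power(P, t)) for t in ts]
    rates = [-(I[i + 1] - I[i]) / (ts[i + 1] - ts[i]) for i in range(len(ts) - 1)]
    return ts, I, rates
```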
53 Consider, as a simple example, the planar data points shown in figure 1, with $d(x_i, x_j) = (x_i - x_j)^2 + (y_i - y_j)^2$. [sent-77, score-0.161]
54 As can be seen, the rate of information loss about the initial point of the random walk, $-\frac{dI(t)}{dt}$, while always positive, slows down at specific times during the relaxation. [sent-78, score-0.494]
55 These relaxation locations indicate the formation of quasi-stable structures on the graph. [sent-79, score-0.499]
56 At these relaxation times the transition probability matrix is approximately a projection matrix (satisfying $P^{2t} \approx P^t$) where the almost invariant subgraphs correspond to the clusters. [sent-80, score-0.837]
57 These approximate stationary transitions correspond to slow information loss, which can be identified by derivatives of the information loss at time t. [sent-81, score-0.537]
58 Another way to see this phenomenon is by observing the rows of $P^t$, which are the conditional distributions $p(x_i(t) \mid x_j(0))$. [sent-82, score-0.209]
59 The rows that are almost indistinguishable, following the partial relaxation, correspond to points $x_j$ with similar conditional distributions on the rest of the graph at time $t$. [sent-83, score-0.893]
60 Such points should belong to the same structure, or cluster on the graph. [sent-84, score-0.26]
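As a quick illustration of this observation (and only that: the principled grouping is the information bottleneck procedure of Section 3), one can greedily merge rows of $P^t$ whose conditional distributions are close, e.g. under the Jensen-Shannon divergence. The threshold and the greedy scheme below are arbitrary choices of this sketch, not part of the paper's method.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in bits) between two distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def naive_row_grouping(Pt, threshold=0.1):
    """Greedily label rows of P^t that are nearly indistinguishable."""
    n = Pt.shape[0]
    labels = np.full(n, -1, dtype=int)
    current = 0
    for j in range(n):
        if labels[j] >= 0:
            continue
        labels[j] = current
        for i in range(j + 1, n):
            if labels[i] < 0 and js_divergence(Pt[j], Pt[i]) < threshold:
                labels[i] = current
        current += 1
    return labels
```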
61 This can be seen directly by observing the matrix $P^t$ during the relaxation, as shown in figure 2. [sent-85, score-0.672]
62 The darker colors correspond to higher probability density in every row. [sent-87, score-0.152]
63 Since the points are ordered by the 3 ellipses, 50 in each ellipse, it is easy to see the clear emergence of 3 blocks of conditional distributions - the rows of the matrix - during the relaxation process. [sent-88, score-0.847]
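A hypothetical end-to-end usage in the spirit of Figures 1-2, reusing the helpers sketched above; three well-separated Gaussian blobs of 50 points each stand in for the paper's three ellipses, and all parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
pts = np.concatenate([rng.normal(c, 0.3, size=(50, 2)) for c in centers])

# squared Euclidean pairwise distances, as in the planar example above
D = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=2)

P = markov_from_distances(D, k=5)   # sketch of Eq. (1)
Pt = relax(P, 64)                   # partial relaxation
print(np.unique(naive_row_grouping(Pt), return_counts=True))  # expected: ~3 groups of 50
```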
64 For very large t there is complete relaxation and all the rows equal the stationary distribution of the process. [sent-89, score-0.595]
65 The best correlation between the resulting clusters and the original ellipses (i. [sent-90, score-0.268]
66 the relaxation process, are precisely the desirable meaningful clusters. [sent-93, score-0.446]
67 The remaining question pertains to the correct way to group the initial points into clusters that capture the information about the position on the graph after $t$ steps. [sent-94, score-0.647]
68 In other words, can we replace the initial point with an initial cluster that enables prediction of the location on the graph at time $t$ with similar accuracy? [sent-95, score-0.371]
69 The answer to this question is naturally provided via the recently introduced information bottleneck method [12, 11]. [sent-96, score-0.36]
70 3 Clusters that preserve information The problem of self-organization of the members of a set $X$ based on the similarity of the conditional distributions of the members of another set $Y$, $\{p(y|x)\}$, was first introduced in [9] and was termed "distributional clustering". [sent-97, score-0.35]
71 This question was recently shown in [12] to be a specific case of a much more fundamental problem: What are the features of the variable $X$ that are relevant to the prediction of another, relevance, variable $Y$? [sent-98, score-0.181]
72 Surprisingly, this variational principle yields an exact formal solution for the conditional distributions $p(y|\hat{x})$, $p(\hat{x}|x)$, and $p(\hat{x})$. [sent-100, score-0.14]
73 This constrained information optimization problem was called in [12] The Information Bottleneck Method. [sent-101, score-0.096]
74 The original approach to the solution of the resulting equations, used already in [9], was based on an analogy with the "deterministic annealing" (DA) approach to clustering (see [10, 8]). [sent-102, score-0.197]
75 This is a top-down hierarchical algorithm that starts from a single cluster and undergoes a cascade of cluster splits which are determined stochastically (as phase transitions) into a "soft" (fuzzy) tree of clusters. [sent-103, score-0.282]
76 3.1 The information bottleneck method Given any two non-independent random variables, $X$ and $Y$, the objective of the information bottleneck method is to extract a compact representation of the variable $X$, denoted here by $\hat{X}$, with minimal loss of mutual information to another, relevance, variable $Y$. [sent-107, score-1.208]
77 More specifically, we want to find a (possibly stochastic) map, $p(\hat{x}|x)$, that maximizes the mutual information to the relevance variable, $I(\hat{X};Y)$, under a constraint on the (lossy) coding length of $X$ via $\hat{X}$, $I(X;\hat{X})$. [sent-108, score-0.637]
78 In other words, we want to find an efficient representation of the variable $X$, $\hat{X}$, such that the predictions of $Y$ from $X$ through $\hat{X}$ will be as close as possible to the direct prediction of $Y$ from $X$. [sent-109, score-0.127]
79 This minimization yields directly the following (self-consistent) equations for the map $p(\hat{x}|x)$, and for $p(y|\hat{x})$ and $p(\hat{x})$: $p(\hat{x}|x) = \frac{p(\hat{x})}{Z(\beta,x)} \exp\left(-\beta\, D_{KL}\left[p(y|x)\,\|\,p(y|\hat{x})\right]\right)$, $p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})$, $p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)$, (6) where $Z(\beta, x)$ is a normalization function. [sent-111, score-0.182]
80 The familiar Kullback-Leibler divergence, $D_{KL}\left[p(y|x)\,\|\,p(y|\hat{x})\right]$, emerges here from the variational principle. [sent-112, score-0.101]
81 These equations can be solved by iterations that are proved to converge for any finite value of $\beta$ (see [12]). [sent-113, score-0.032]
82 The Lagrange multiplier $\beta$ has the natural interpretation of an inverse temperature, which suggests deterministic annealing to explore the hierarchy of solutions in $\hat{X}$. [sent-114, score-0.263]
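A compact sketch of one way to iterate the self-consistent equations (6) at a fixed $\beta$ and a fixed number of clusters; the paper itself follows the deterministic-annealing / agglomerative route, so this fixed-point solver is an illustrative alternative and all variable names are mine. In the clustering application, p(y|x) would be the rows of $P^t$ and p(x) the uniform prior over the n points.

```python
import numpy as np

def kl_matrix(p, q, eps=1e-12):
    """D_KL(p_i || q_j) in nats for every row i of p against every row j of q."""
    return np.sum(p[:, None, :] * np.log((p[:, None, :] + eps) / (q[None, :, :] + eps)), axis=2)

def iterative_ib(pyx, px, n_clusters, beta, n_iter=200, seed=0):
    """Fixed-point iteration of Eq. (6) for p(xhat|x), p(y|xhat), p(xhat)."""
    rng = np.random.default_rng(seed)
    n_x = pyx.shape[0]
    q = rng.random((n_x, n_clusters))            # soft assignments p(xhat|x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xhat = q.T @ px                                        # p(xhat)
        p_x_given_xhat = (q * px[:, None]) / (p_xhat[None, :] + 1e-12)
        py_xhat = p_x_given_xhat.T @ pyx                         # p(y|xhat)
        d = kl_matrix(pyx, py_xhat)                              # D_KL[p(y|x) || p(y|xhat)]
        logits = np.log(p_xhat + 1e-12)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)              # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)                        # the 1/Z(beta, x) factor
    return q, py_xhat, p_xhat
```

As $\beta$ grows, the assignments harden toward the cluster whose $p(y|\hat{x})$ is closest in $D_{KL}$ to $p(y|x)$, which is read off directly from the first equation in (6).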
wordName wordTfidf (topN-words)
[('relaxation', 0.446), ('pairwise', 0.23), ('ylx', 0.223), ('bottleneck', 0.196), ('mutual', 0.195), ('clusters', 0.175), ('distances', 0.17), ('clustering', 0.164), ('lxj', 0.155), ('xlx', 0.155), ('walk', 0.142), ('cluster', 0.141), ('dkl', 0.133), ('loss', 0.124), ('transition', 0.124), ('pt', 0.121), ('points', 0.119), ('lx', 0.118), ('graph', 0.115), ('annealing', 0.1), ('information', 0.096), ('xj', 0.095), ('distortion', 0.095), ('meaningful', 0.091), ('distance', 0.084), ('rows', 0.084), ('xi', 0.081), ('colon', 0.077), ('interpretation', 0.076), ('markov', 0.075), ('initial', 0.074), ('matrix', 0.071), ('llp', 0.067), ('vq', 0.067), ('stationary', 0.065), ('theoretic', 0.063), ('members', 0.06), ('ixj', 0.06), ('ellipses', 0.06), ('variable', 0.057), ('relevance', 0.057), ('process', 0.056), ('py', 0.056), ('rate', 0.055), ('random', 0.055), ('structures', 0.053), ('relax', 0.052), ('markovian', 0.052), ('divergence', 0.052), ('conditional', 0.05), ('denoting', 0.05), ('multiplier', 0.05), ('variational', 0.049), ('correspond', 0.047), ('times', 0.045), ('point', 0.045), ('da', 0.045), ('lagrange', 0.043), ('similarity', 0.043), ('slow', 0.042), ('trajectory', 0.042), ('data', 0.042), ('reasonable', 0.041), ('distributions', 0.041), ('denoted', 0.04), ('every', 0.039), ('accuracy', 0.039), ('want', 0.039), ('length', 0.038), ('metric', 0.038), ('constraint', 0.037), ('deterministic', 0.037), ('clear', 0.036), ('question', 0.036), ('come', 0.035), ('transitions', 0.035), ('decay', 0.034), ('observing', 0.034), ('probabilities', 0.034), ('distributional', 0.033), ('binational', 0.033), ('colors', 0.033), ('aft', 0.033), ('ellipse', 0.033), ('exponentiation', 0.033), ('frontier', 0.033), ('henceforth', 0.033), ('hfsp', 0.033), ('ptj', 0.033), ('stabilize', 0.033), ('unconditioned', 0.033), ('probability', 0.033), ('resulting', 0.033), ('equations', 0.032), ('time', 0.032), ('normalization', 0.032), ('capture', 0.032), ('naturally', 0.032), ('starting', 0.032), ('prediction', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
Author: Naftali Tishby, Noam Slonim
Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provides a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
2 0.15108427 12 nips-2000-A Support Vector Method for Clustering
Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik
Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.
3 0.13016097 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
4 0.10373721 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
Author: Koby Crammer, Yoram Singer
Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.
5 0.093710072 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
Author: Christopher K. I. Williams
Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1
6 0.090024792 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
7 0.089206263 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
8 0.082418524 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
9 0.080926985 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
10 0.078062087 62 nips-2000-Generalized Belief Propagation
11 0.075966708 117 nips-2000-Shape Context: A New Descriptor for Shape Matching and Object Recognition
12 0.074773699 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
13 0.074123643 36 nips-2000-Constrained Independent Component Analysis
14 0.073005907 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
15 0.071550429 74 nips-2000-Kernel Expansions with Unlabeled Examples
16 0.071301185 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
17 0.071123071 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
18 0.070757031 134 nips-2000-The Kernel Trick for Distances
19 0.070432566 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations
20 0.070251428 120 nips-2000-Sparse Greedy Gaussian Process Regression
topicId topicWeight
[(0, 0.236), (1, 0.005), (2, 0.057), (3, -0.002), (4, 0.029), (5, 0.087), (6, 0.01), (7, 0.049), (8, -0.099), (9, 0.192), (10, -0.098), (11, 0.145), (12, 0.125), (13, 0.036), (14, -0.017), (15, 0.008), (16, -0.108), (17, -0.05), (18, 0.139), (19, 0.035), (20, -0.114), (21, 0.117), (22, -0.119), (23, 0.118), (24, 0.107), (25, 0.113), (26, -0.168), (27, -0.104), (28, 0.014), (29, -0.12), (30, 0.092), (31, 0.066), (32, 0.209), (33, 0.2), (34, 0.187), (35, 0.201), (36, 0.077), (37, -0.03), (38, 0.057), (39, 0.0), (40, -0.055), (41, 0.116), (42, 0.035), (43, -0.028), (44, 0.067), (45, 0.001), (46, -0.048), (47, -0.106), (48, 0.011), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.96807927 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
Author: Naftali Tishby, Noam Slonim
Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provides a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
2 0.70046675 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
3 0.61623502 12 nips-2000-A Support Vector Method for Clustering
Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik
Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.
4 0.34906429 62 nips-2000-Generalized Belief Propagation
Author: Jonathan S. Yedidia, William T. Freeman, Yair Weiss
Abstract: Belief propagation (BP) was only supposed to work for tree-like networks but works surprisingly well in many applications involving networks with loops, including turbo codes. However, there has been little understanding of the algorithm or the nature of the solutions it finds for general graphs. We show that BP can only converge to a stationary point of an approximate free energy, known as the Bethe free energy in statistical physics. This result characterizes BP fixed-points and makes connections with variational approaches to approximate inference. More importantly, our analysis lets us build on the progress made in statistical physics since Bethe's approximation was introduced in 1935. Kikuchi and others have shown how to construct more accurate free energy approximations, of which Bethe's approximation is the simplest. Exploiting the insights from our analysis, we derive generalized belief propagation (GBP) versions of these Kikuchi approximations. These new message passing algorithms can be significantly more accurate than ordinary BP, at an adjustable increase in complexity. We illustrate such a new GBP algorithm on a grid Markov network and show that it gives much more accurate marginal probabilities than those found using ordinary BP. 1
5 0.3450616 148 nips-2000-`N-Body' Problems in Statistical Learning
Author: Alexander G. Gray, Andrew W. Moore
Abstract: We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1
6 0.34454092 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
7 0.33264205 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
8 0.31801641 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
9 0.29938093 74 nips-2000-Kernel Expansions with Unlabeled Examples
10 0.29447895 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
11 0.29369634 44 nips-2000-Efficient Learning of Linear Perceptrons
12 0.29218203 117 nips-2000-Shape Context: A New Descriptor for Shape Matching and Object Recognition
13 0.28795525 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
14 0.28277367 138 nips-2000-The Use of Classifiers in Sequential Inference
15 0.28100431 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
16 0.28047228 22 nips-2000-Algorithms for Non-negative Matrix Factorization
17 0.27855381 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
18 0.27151197 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
19 0.26892251 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
20 0.26609728 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
topicId topicWeight
[(10, 0.018), (17, 0.076), (32, 0.028), (33, 0.045), (35, 0.012), (55, 0.034), (62, 0.056), (65, 0.016), (67, 0.097), (75, 0.371), (76, 0.042), (81, 0.036), (90, 0.045), (91, 0.01), (97, 0.039)]
simIndex simValue paperId paperTitle
1 0.8818323 30 nips-2000-Bayesian Video Shot Segmentation
Author: Nuno Vasconcelos, Andrew Lippman
Abstract: Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.
same-paper 2 0.87539071 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
Author: Naftali Tishby, Noam Slonim
Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provides a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
3 0.78579921 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
Author: Gert Cauwenberghs, Tomaso Poggio
Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the Kuhn-Tucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental
4 0.53956473 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
5 0.45695651 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
Author: Predrag Neskovic, Philip C. Davis, Leon N. Cooper
Abstract: In this work, we introduce an Interactive Parts (IP) model as an alternative to Hidden Markov Models (HMMs). We tested both models on a database of on-line cursive script. We show that implementations of HMMs and the IP model, in which all letters are assumed to have the same average width, give comparable results. However, in contrast to HMMs, the IP model can handle duration modeling without an increase in computational complexity. 1
6 0.43085638 12 nips-2000-A Support Vector Method for Clustering
7 0.4193351 146 nips-2000-What Can a Single Neuron Compute?
8 0.41339341 74 nips-2000-Kernel Expansions with Unlabeled Examples
9 0.40207583 52 nips-2000-Fast Training of Support Vector Classifiers
10 0.40206105 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
11 0.40121835 62 nips-2000-Generalized Belief Propagation
12 0.40080488 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
13 0.40032986 125 nips-2000-Stability and Noise in Biochemical Switches
14 0.39955366 134 nips-2000-The Kernel Trick for Distances
15 0.39939305 122 nips-2000-Sparse Representation for Gaussian Process Models
16 0.39401737 22 nips-2000-Algorithms for Non-negative Matrix Factorization
17 0.3939763 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
18 0.39208925 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
19 0.38976833 48 nips-2000-Exact Solutions to Time-Dependent MDPs
20 0.38940641 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition