nips nips2006 nips2006-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. [sent-7, score-0.32]
2 However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. [sent-8, score-0.398]
3 Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. [sent-9, score-0.433]
4 Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. [sent-10, score-0.281]
5 In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. [sent-11, score-0.576]
6 We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. [sent-13, score-0.644]
7 Many learning algorithms (k-means, principal components analysis, linear discriminant analysis, etc.) model each input object as a single sample drawn from a multivariate Gaussian. [sent-16, score-0.398]
8 Multiple samples are also ubiquitous in time-series data such as sensor networks, where each sensor device continually monitors its environmental conditions (e.g., temperature, humidity, light, and voltage). [sent-23, score-0.713]
9 For example, clustering sensor network devices has been used for optimizing routing of the network and also for discovering trends between sensor nodes. [sent-27, score-1.118]
10 In this paper, we consider the problem of clustering input objects, each of which can be represented by a multivariate Gaussian distribution. [sent-30, score-0.576]
11 The “distance” between two Gaussians can be quantified in an information theoretic manner, in particular by their differential relative entropy. [sent-31, score-0.341]
12 Interestingly, the differential relative entropy between two multivariate Gaussians can be expressed as the convex combination of two Bregman divergences—a Mahalanobis distance between mean vectors and a Burg matrix divergence between the covariance matrices. [sent-32, score-1.229]
13 We develop an EM style clustering algorithm and show that the optimal cluster parameters can be cheaply determined via a simple, closed-form solution. [sent-33, score-0.511]
14 Our algorithm is a Bregman-like clustering method that clusters both means and covariances of the distributions in a unified framework. [sent-34, score-0.517]
15 First, we present results from synthetic data experiments, and show that incorporating second order information can dramatically increase clustering accuracy. [sent-36, score-0.29]
16 Next, we apply our algorithm to a real-world sensor network dataset comprising 52 sensor devices that measure temperature, humidity, light, and voltage. [sent-37, score-0.739]
17 Finally, we use our algorithm as a statistical debugging tool by clustering the behavior of functions in a program across a set of known software bugs. [sent-38, score-0.589]
18 The multivariate Gaussian distribution is the multivariate generalization of the standard univariate case. [sent-40, score-0.57]
19 The probability density function (pdf) of a d-dimensional multivariate Gaussian is parameterized by mean vector µ and positive definite covariance matrix Σ: p(x|µ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2)(x − µ)T Σ−1 (x − µ) ), where |Σ| is the determinant of Σ. [sent-41, score-0.551]
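For concreteness, the log of this density can be evaluated directly from µ and Σ; the following is a minimal NumPy sketch (the function name and array conventions are our own, not from the paper):

    import numpy as np

    def gaussian_log_pdf(x, mu, sigma):
        """Log-density of a d-dimensional Gaussian N(mu, sigma) at x."""
        d = mu.shape[0]
        diff = x - mu
        # Solve sigma^{-1} (x - mu) rather than forming an explicit inverse.
        maha = diff @ np.linalg.solve(sigma, diff)
        _, logdet = np.linalg.slogdet(sigma)
        return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)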
20 The Bregman divergence [2] with respect to φ is defined as: Dφ (x, y) = φ(x) − φ(y) − (x − y)T ∇φ(y), where φ is a real-valued, strictly convex function defined over a convex set Q = dom(φ) ⊂ Rd such that φ is differentiable on the relative interior of Q. [sent-42, score-0.375]
21 Similarly, if φ(x) = xT AT Ax, for some arbitrary non-singular matrix A, then the resulting divergence is the Mahalanobis distance MS −1 (x, y) = (x−y)T S −1 (x−y), parameterized by the covariance matrix S, S −1 = AT A. [sent-44, score-0.498]
22 Alternately, if φ(x) = Σi (xi log xi − xi), then the resulting divergence is the (unnormalized) relative entropy. [sent-45, score-0.304]
23 Then the corresponding Bregman matrix divergence is the squared Frobenius norm, ||X − Y||^2_F. [sent-49, score-0.229]
24 Now let φ be defined via the eigenvalues λ1, . . . , λd of the positive definite matrix X: φ(X) = −Σi log λi = −log |X|, the Burg entropy of the eigenvalues. [sent-53, score-0.315]
25 The resulting Burg matrix divergence is: B(X, Y) = tr(XY−1) − log |XY−1| − d. (1) [sent-54, score-0.286]
26 As we shall see later, the Burg matrix divergence will arise naturally in our application. [sent-55, score-0.26]
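A direct transcription of the Burg matrix divergence in equation (1), together with the Mahalanobis distance used alongside it, might look as follows (a hedged NumPy sketch; the helper names are ours):

    import numpy as np

    def burg_matrix_divergence(X, Y):
        """B(X, Y) = tr(X Y^{-1}) - log|X Y^{-1}| - d, for positive definite X, Y."""
        d = X.shape[0]
        XYinv = X @ np.linalg.inv(Y)
        _, logdet = np.linalg.slogdet(XYinv)
        return np.trace(XYinv) - logdet - d

    def mahalanobis_sq(x, y, S):
        """M_{S^{-1}}(x, y) = (x - y)^T S^{-1} (x - y)."""
        diff = x - y
        return diff @ np.linalg.solve(S, diff)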
27 The Burg matrix divergence can also be written as B(X, Y) = Σi Σj (λi/γj)(viT wj)² − Σi log(λi/γi) − d, where λi, vi are the eigenvalues and eigenvectors of X and γj, wj those of Y. [sent-69, score-0.286]
28 γi From the first term above, we see that the Burg matrix divergence is a function of the eigenvalues as well as of the eigenvectors of X and Y . [sent-70, score-0.306]
29 The differential entropy of a continuous random variable x with probability density function f is defined as h(f) = −∫ f(x) log f(x) dx. [sent-71, score-0.438]
30 It can be shown [3] that an n-bit quantization of a continuous random variable with pdf f has Shannon entropy approximately equal to h(f ) + n. [sent-72, score-0.207]
31 The continuous analog of the discrete relative entropy is the differential relative entropy. [sent-73, score-0.507]
32 Given a random variable x with pdf's f and g, the differential relative entropy is defined as D(f||g) = ∫ f(x) log( f(x)/g(x) ) dx. [sent-74, score-0.501]
33 Clustering Multivariate Gaussians via Differential Relative Entropy: Given a set of n multivariate Gaussians parameterized by mean vectors m1, . . . , mn and covariances S1, . . . , Sn, we consider the problem of partitioning them into k clusters. [sent-75, score-0.391]
34 Each cluster j can be represented by a multivariate Gaussian parameterized by mean µj and covariance Σj . [sent-85, score-0.709]
35 Using differential relative entropy as the distance measure between Gaussians, the problem of clustering may be posed as the minimization (over all clusterings) of Σ_{j=1}^{k} Σ_{i in cluster j} D(p(x|mi, Si) || p(x|µj, Σj)). [sent-86, score-0.724]
36 1 Differential Relative Entropy and Multivariate Gaussians: We first show that the differential relative entropy between two multivariate Gaussians can be expressed as a convex combination of a Mahalanobis distance between means and the Burg matrix divergence between covariance matrices. [sent-88, score-1.176]
37 Consider two multivariate Gaussians, parameterized by mean vectors m and µ, and covariances S and Σ, respectively. [sent-89, score-0.527]
38 We first note that the differential relative entropy can be expressed as D(f||g) = ∫ f log f − ∫ f log g = −h(f) − ∫ f log g. [sent-90, score-0.641]
39 The first term is just the negative differential entropy of p(x|m, S), which can be shown [3] to be: h(p(x|m, S)) = d/2 + (1/2) log( (2π)^d |S| ). [sent-91, score-0.381]
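Combining the terms gives the decomposition described above, D(p(x|m,S) || p(x|µ,Σ)) = (1/2) B(S, Σ) + (1/2) MΣ−1(m, µ). A small sketch that computes it and cross-checks it against the standard closed-form Gaussian KL divergence (assuming the burg_matrix_divergence and mahalanobis_sq helpers from the earlier sketch are in scope):

    import numpy as np

    def gaussian_relative_entropy(m, S, mu, sigma):
        """D( N(m, S) || N(mu, sigma) ) via the Burg / Mahalanobis decomposition."""
        return 0.5 * (burg_matrix_divergence(S, sigma) + mahalanobis_sq(m, mu, sigma))

    def gaussian_kl_direct(m, S, mu, sigma):
        """Standard closed-form Gaussian KL, used only as a sanity check."""
        d = m.shape[0]
        diff = mu - m
        _, logdet_S = np.linalg.slogdet(S)
        _, logdet_sigma = np.linalg.slogdet(sigma)
        return 0.5 * (np.trace(np.linalg.solve(sigma, S))
                      + diff @ np.linalg.solve(sigma, diff)
                      - d + logdet_sigma - logdet_S)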
40 We now consider the problem of finding the optimal representative Gaussian for a set of c Gaussians with means m1, . . . , mc, covariances S1, . . . , Sc, and weights α1, . . . , αc. [sent-96, score-0.203]
41 Given weights such that Σi αi = 1, the optimal representative minimizes the cumulative differential relative entropy: p(x|µ∗, Σ∗) = arg min_{p(x|µ,Σ)} Σi αi D(p(x|mi, Si) || p(x|µ, Σ)) = arg min_{p(x|µ,Σ)} Σi αi [ (1/2) B(Si, Σ) + (1/2) MΣ−1(mi, µ) ]. (6) [sent-106, score-0.443]
42 (i.e., the Mahalanobis distance parameterized by some covariance matrix Σ). [sent-109, score-0.269]
43 Σ∗ = Σi αi ( Si + (mi − µ∗)(mi − µ∗)T ). (9) Figure 1 illustrates optimal representatives of two 2-dimensional Gaussians with means marked by points A and B, and covariances outlined with solid lines. [sent-114, score-0.279]
44 The optimal Gaussian representatives are denoted with dotted covariances; the representative on the left uses weights (αA = 2/3, αB = 1/3), while the representative on the right uses weights (αA = 1/3, αB = 2/3). [sent-115, score-0.321]
45 As we can see from equation (8), the optimal representative mean is the weighted average of the means of the constituent Gaussians. [sent-116, score-0.279]
46 Interestingly, the optimal covariance turns out to be the average of the constituent covariances plus rank one updates. [sent-117, score-0.331]
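In code, these closed-form updates (the weighted mean of equation (8) and the covariance of equation (9), as described above) can be written compactly; the following sketch assumes the inputs are stacked NumPy arrays, which is our own convention:

    import numpy as np

    def optimal_representative(means, covs, weights):
        """Closed-form representative Gaussian for weighted input Gaussians.

        means:   (n, d) array of mean vectors m_i
        covs:    (n, d, d) array of covariances S_i
        weights: (n,) array of alpha_i summing to one
        """
        mu_star = np.einsum('i,id->d', weights, means)           # eq. (8): weighted mean
        diffs = means - mu_star                                   # (n, d)
        rank_one = np.einsum('id,ie->ide', diffs, diffs)          # (m_i - mu*)(m_i - mu*)^T
        sigma_star = np.einsum('i,ide->de', weights, covs + rank_one)  # eq. (9)
        return mu_star, sigma_star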
47 2 Algorithm: Algorithm 1 presents our clustering algorithm for the case where each Gaussian has equal weight αi = 1/n. [sent-120, score-0.239]
48 Initially, cluster assignments are chosen (these can be assigned randomly). [sent-122, score-0.24]
49 First, the mean and covariance parameters for the cluster representative distributions are optimally computed given the cluster assignments. [sent-124, score-0.672]
50 Next, the cluster assignments π are updated for each input Gaussian. [sent-126, score-0.292]
51 This is done by assigning the ith Gaussian to the cluster j whose representative Gaussian is closest in differential relative entropy. [sent-127, score-0.604]
52 While the optimal mean of each representative is the average of the individual means, the optimal covariance is the average of the individual covariances plus rank-one corrections. [sent-129, score-0.486]
53 Lines 6 and 9 compute the optimal means and covariances for each cluster, which requires O(nd) and O(nd²) total work, respectively. [sent-133, score-0.226]
54 Line 12 computes the differential relative entropy between each input Gaussian and each cluster representative Gaussian. [sent-134, score-0.812]
55 As only the arg min over all Σj is needed, we can reduce the Burg matrix divergence computation (equation (1)) to tr(Si Σj−1) − log |Σj−1|. [sent-135, score-0.286]
56 Once the inverse of each cluster covariance is computed (for a cost of O(kd³)), the first term can be computed in O(d²) time. [sent-136, score-0.318]
57 The second term can similarly be computed once for each cluster for a total cost of O(kd³). [sent-137, score-0.203]
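Putting the pieces together, a minimal sketch of the full alternating procedure might look like the following. The loop structure, stopping rule, and empty-cluster handling are our own assumptions; the weights are the equal weights αi = 1/n from the text, the assignment step uses the reduced score tr(Si Σj−1) − log |Σj−1| plus the Mahalanobis term, and the optimal_representative helper from the earlier sketch is assumed to be in scope:

    import numpy as np

    def cluster_gaussians(means, covs, k, n_iters=50, seed=0):
        """EM-style clustering of n input Gaussians (means, covs) into k clusters."""
        rng = np.random.default_rng(seed)
        n, d = means.shape
        assign = rng.integers(0, k, size=n)              # random initial assignments

        for _ in range(n_iters):
            # Step 1: closed-form representatives per cluster (equal weights within cluster).
            reps = []
            for j in range(k):
                idx = np.flatnonzero(assign == j)
                if idx.size == 0:                        # keep empty clusters alive
                    idx = rng.integers(0, n, size=1)
                w = np.full(idx.size, 1.0 / idx.size)
                reps.append(optimal_representative(means[idx], covs[idx], w))

            # Step 2: reassign each Gaussian to its closest representative.
            scores = np.empty((n, k))
            for j, (mu_j, sig_j) in enumerate(reps):
                sig_inv = np.linalg.inv(sig_j)
                _, logdet_inv = np.linalg.slogdet(sig_inv)
                diffs = means - mu_j
                maha = np.einsum('id,de,ie->i', diffs, sig_inv, diffs)
                trace_term = np.einsum('ide,ed->i', covs, sig_inv)   # tr(S_i Sigma_j^{-1})
                scores[:, j] = trace_term - logdet_inv + maha
            new_assign = scores.argmin(axis=1)
            if np.array_equal(new_assign, assign):
                break
            assign = new_assign
        return assign, reps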
58 1 Synthetic Data: Our synthetic datasets consist of 200 objects, each comprising 30 samples drawn from one of k randomly generated d-dimensional multivariate Gaussians. [sent-148, score-0.371]
59 k-means incorporates only first-order information (i.e., the mean), whereas our Gaussian clustering algorithm also incorporates second-order covariance information. [sent-165, score-0.354]
60 Here, we see that our algorithm achieves higher clustering quality for datasets composed of four-dimensional Gaussians with a varied number of clusters (left), as well as for varied dimensionality of the input Gaussians with k = 5 (right). [sent-166, score-0.414]
61 Each Gaussian is generated by choosing a mean vector uniformly at random from the unit simplex and randomly selecting a covariance matrix from the set of matrices with eigenvalues 1, 2, . . . [sent-167, score-0.288]
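One plausible reading of this generation procedure is sketched below (the use of a QR factorization to obtain a random orthogonal eigenbasis, and the eigenvalue range 1, . . . , d, are our own assumptions):

    import numpy as np

    def make_synthetic_objects(k=5, d=4, n_objects=200, n_samples=30, seed=0):
        """Generate objects, each summarized by the sample mean/covariance of
        n_samples draws from one of k random d-dimensional Gaussians."""
        rng = np.random.default_rng(seed)
        true_means, true_covs = [], []
        for _ in range(k):
            mean = rng.dirichlet(np.ones(d))                  # uniform on the unit simplex
            Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
            cov = Q @ np.diag(np.arange(1, d + 1).astype(float)) @ Q.T
            true_means.append(mean)
            true_covs.append(cov)

        labels = rng.integers(0, k, size=n_objects)
        sample_means = np.empty((n_objects, d))
        sample_covs = np.empty((n_objects, d, d))
        for i, j in enumerate(labels):
            X = rng.multivariate_normal(true_means[j], true_covs[j], size=n_samples)
            sample_means[i] = X.mean(axis=0)
            sample_covs[i] = np.cov(X, rowvar=False)
        return sample_means, sample_covs, labels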
62 Figure 2 (left) shows the clustering quality as a function of the number of clusters when the dimensionality of the input Gaussians is fixed (d = 4). [sent-173, score-0.414]
63 Figure 2 (right) gives clustering quality for five clusters across a varying number of dimensions. [sent-174, score-0.39]
64 As can be seen in Figure 2, our multivariate Gaussian clustering algorithm yields significantly higher NMI values than k-means for all experiments. [sent-176, score-0.524]
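Clustering quality here is reported as NMI (normalized mutual information) against the true generating labels; a typical way to score a run, using the hypothetical helpers sketched earlier and scikit-learn's NMI implementation:

    from sklearn.metrics import normalized_mutual_info_score

    means, covs, labels = make_synthetic_objects(k=5, d=4)
    pred, _ = cluster_gaussians(means, covs, k=5)
    print("NMI:", normalized_mutual_info_score(labels, pred))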
65 An open question in sensor networks research is how to minimize communication costs between the sensors and the base station: wireless communication requires a relatively large amount of power, a limited resource on current sensor devices (which are usually battery powered). [sent-179, score-0.946]
66 A recently proposed sensor network system, BBQ [4], reduces communication costs by modelling sensor network data at each sensor device using a time-varying multivariate Gaussian and transmitting only model parameters to the base station. [sent-180, score-1.448]
67 We apply our multivariate Gaussian clustering algorithm to cluster sensor devices from the Intel Lab at Berkeley [8]. [sent-181, score-1.106]
68 Clustering has been used in sensor network applications to determine efficient routing schemes, as well as for discovering trends between groups of sensor devices. [sent-182, score-0.756]
69 The Intel sensor network consists of 52 working sensors, each of which monitors ambient temperature, humidity, light levels, and voltage every thirty seconds. [sent-183, score-0.444]
70 Conditioned on time, the sensor readings can be fit quite well by a multivariate Gaussian. [sent-184, score-0.631]
71 Figure 3 shows the results of our multivariate Gaussian clustering algorithm applied to this sensor network data. [sent-185, score-0.884]
72 For each device, we compute the sample mean and covariance from sensor readings between noon and 2pm each day, for 36 total days. [sent-186, score-0.499]
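The per-device summary statistics amount to a sample mean and covariance over the noon-to-2pm readings; a tiny sketch (the array layout of the readings is hypothetical):

    import numpy as np

    def summarize_device(readings):
        """readings: (n_obs, 4) array with columns [temperature, humidity, light, voltage]."""
        return readings.mean(axis=0), np.cov(readings, rowvar=False)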
73 The second cluster (denoted by ‘2’ in figure 3) has the largest variance among all clusters: many of the sensors for this cluster are located in high traffic areas, including the large conference room at the top of the lab, and the smaller tables in the bottom of the lab. [sent-188, score-0.47]
74 Interestingly, this cluster shows very high co-variation between humidity and voltage. [sent-190, score-0.305]
75 Finally, cluster three has a moderate level of total variation, with relatively low light levels. [sent-193, score-0.243]
76 The cluster is primarily located in the center and the right of the lab, away from outside windows. [sent-194, score-0.203]
77 Figure 3: To reduce communication costs in sensor networks, each sensor device may be modelled by a multivariate Gaussian. [sent-195, score-1.036]
78 The above plot shows the results of applying our algorithm to cluster sensors into three groups, denoted by labels ‘1’, ‘2’, and ‘3’. [sent-196, score-0.267]
79 3 Statistical Debugging Leveraging program runtime statistics for the purpose of software debugging has received recent research attention [12]. [sent-198, score-0.322]
80 Here we apply our algorithm to cluster functional behavior patterns over software bugs in the LaTeX document preparation program. [sent-199, score-0.484]
81 We model this distribution as a 4-dimensional multivariate Gaussian, one dimension for each bug. [sent-206, score-0.285]
82 Clustering these function counts can yield important debugging insight to assist a software engineer in understanding error-dependent program behavior. [sent-211, score-0.322]
83 Figure 4 shows three covariance matrices from a sample clustering of eight clusters. [sent-212, score-0.396]
84 Cluster (b) shows a high dependency between bugs 1 and 4, while cluster (c) exhibits high covariation between bugs 1 and 3, and between bugs 2 and 4. [sent-218, score-0.784]
85 Figure 4: Covariance matrices for three clusters discovered by clustering functional behavior of the LaTeX document preparation program. [sent-267, score-0.437]
86 5 Related Work In this work, we showed that the differential relative entropy between two multivariate Gaussian distributions can be expressed as a convex combination of the Mahalanobis distance between their mean vectors and the Burg matrix divergence between their covariances. [sent-269, score-1.114]
87 This is in contrast to information theoretic clustering [5], where each input is taken to be a probability distribution over some finite set. [sent-270, score-0.344]
88 The differential relative entropy between two multivariate Gaussians was considered in [10] in the context of solving Gaussian mixture models. [sent-274, score-0.666]
89 Although an algebraic expression for this differential relative entropy was given in [10], no connection to the Burg matrix divergence was made there. [sent-275, score-0.61]
90 Our algorithm is based on the standard expectation-maximization style clustering algorithm [6]. [sent-276, score-0.266]
91 Although the closed-form updates used by our algorithm are similar to those employed by a Bregman clustering algorithm [1], we note that the computation of the optimal covariance matrix (equation (9)) involves the optimal mean vector. [sent-277, score-0.521]
92 In [9], the problem of clustering Gaussians with respect to the symmetric differential relative entropy, D(f||g) + D(g||f), is considered in the context of learning HMM parameters for speech recognition. [sent-278, score-0.556]
93 The resulting algorithm, however, is much more computationally expensive than ours, whereas in our method the optimal means and covariance parameters can be computed via a simple closed-form solution. [sent-279, score-0.205]
94 Here, only one source is assumed, and thus clustering is not needed. [sent-282, score-0.239]
95 6 Conclusions We have presented a new algorithm for the problem of clustering multivariate Gaussian distributions. [sent-283, score-0.524]
96 Our algorithm is derived in an information theoretic context, which leads to interesting connections with the differential relative entropy between multivariate Gaussians and with Bregman divergences. [sent-284, score-0.719]
97 Unlike existing clustering algorithms, our algorithm optimizes both first and second order information in the data. [sent-285, score-0.239]
98 We have demonstrated the use of our method on sensor network data and a statistical debugging application. [sent-286, score-0.565]
99 A divisive information-theoretic feature clustering algorithm for text classification. [sent-318, score-0.239]
100 On divergence based clustering of normal distributions and its application to HMM adaptation. [sent-348, score-0.423]
wordName wordTfidf (topN-words)
[('sensor', 0.308), ('burg', 0.289), ('multivariate', 0.285), ('clustering', 0.239), ('gaussians', 0.227), ('differential', 0.225), ('bregman', 0.22), ('debugging', 0.205), ('cluster', 0.203), ('divergence', 0.184), ('bugs', 0.179), ('entropy', 0.156), ('ltex', 0.153), ('mahalanobis', 0.151), ('covariances', 0.136), ('mi', 0.124), ('covariance', 0.115), ('representative', 0.113), ('tr', 0.107), ('humidity', 0.102), ('clusters', 0.094), ('divergences', 0.076), ('si', 0.073), ('devices', 0.071), ('software', 0.068), ('parameterized', 0.068), ('nmi', 0.067), ('gaussian', 0.066), ('sensors', 0.064), ('relative', 0.063), ('dhillon', 0.061), ('lab', 0.06), ('austin', 0.057), ('log', 0.057), ('representatives', 0.053), ('device', 0.053), ('theoretic', 0.053), ('network', 0.052), ('input', 0.052), ('communication', 0.051), ('compilation', 0.051), ('navel', 0.051), ('pdf', 0.051), ('synthetic', 0.051), ('convex', 0.051), ('program', 0.049), ('means', 0.048), ('eigenvalues', 0.048), ('temperature', 0.047), ('matrix', 0.045), ('covariation', 0.044), ('davis', 0.044), ('monitors', 0.044), ('intel', 0.043), ('matrices', 0.042), ('optimal', 0.042), ('distance', 0.041), ('entropic', 0.041), ('pervasive', 0.041), ('light', 0.04), ('mean', 0.038), ('xy', 0.038), ('optima', 0.038), ('readings', 0.038), ('constituent', 0.038), ('assignments', 0.037), ('rated', 0.036), ('texas', 0.036), ('alternately', 0.036), ('drawn', 0.035), ('preparation', 0.034), ('wireless', 0.034), ('routing', 0.032), ('le', 0.031), ('naturally', 0.031), ('speaker', 0.031), ('traf', 0.031), ('movie', 0.031), ('costs', 0.031), ('interestingly', 0.03), ('users', 0.03), ('discovering', 0.03), ('mutual', 0.029), ('quality', 0.029), ('speech', 0.029), ('eigenvectors', 0.029), ('discovered', 0.028), ('messages', 0.028), ('networks', 0.028), ('across', 0.028), ('style', 0.027), ('domains', 0.027), ('end', 0.027), ('mn', 0.026), ('hmm', 0.026), ('expressed', 0.026), ('object', 0.026), ('strictly', 0.026), ('trends', 0.026), ('sn', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
2 0.17523476 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
3 0.17323683 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
4 0.16262867 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
5 0.15539931 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes. 1
6 0.13805406 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
7 0.13790795 158 nips-2006-PG-means: learning the number of clusters in data
8 0.13339823 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
9 0.09320353 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
10 0.092611551 69 nips-2006-Distributed Inference in Dynamical Systems
11 0.090575784 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.090417154 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
13 0.086228691 116 nips-2006-Learning from Multiple Sources
14 0.086017251 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
15 0.081167623 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
16 0.073533572 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
17 0.071924269 181 nips-2006-Stability of $K$-Means Clustering
18 0.065051384 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
19 0.061605059 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
20 0.055686437 96 nips-2006-In-Network PCA and Anomaly Detection
topicId topicWeight
[(0, -0.223), (1, 0.063), (2, 0.03), (3, 0.182), (4, 0.024), (5, -0.122), (6, 0.129), (7, 0.054), (8, 0.095), (9, 0.06), (10, -0.039), (11, 0.161), (12, -0.027), (13, 0.172), (14, -0.031), (15, -0.098), (16, 0.104), (17, -0.035), (18, 0.164), (19, 0.093), (20, 0.037), (21, 0.006), (22, 0.037), (23, 0.127), (24, -0.136), (25, 0.007), (26, 0.002), (27, -0.011), (28, 0.033), (29, 0.054), (30, 0.049), (31, 0.04), (32, 0.155), (33, 0.045), (34, 0.026), (35, 0.063), (36, 0.085), (37, -0.031), (38, 0.035), (39, 0.028), (40, 0.013), (41, -0.077), (42, 0.09), (43, -0.069), (44, 0.026), (45, -0.077), (46, 0.117), (47, -0.096), (48, 0.027), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.9706158 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
2 0.64331198 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
3 0.62672478 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby
Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1
4 0.61965054 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
5 0.61354375 158 nips-2006-PG-means: learning the number of clusters in data
Author: Yu Feng, Greg Hamerly
Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1
6 0.55435556 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
7 0.55036348 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
8 0.54449868 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
9 0.45419815 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
10 0.44601735 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
11 0.40806919 129 nips-2006-Map-Reduce for Machine Learning on Multicore
12 0.40274045 181 nips-2006-Stability of $K$-Means Clustering
13 0.39555746 69 nips-2006-Distributed Inference in Dynamical Systems
14 0.39222035 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
15 0.36675492 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
16 0.32944101 5 nips-2006-A Kernel Method for the Two-Sample-Problem
17 0.32223237 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
18 0.32022935 98 nips-2006-Inferring Network Structure from Co-Occurrences
19 0.31771174 175 nips-2006-Simplifying Mixture Models through Function Approximation
20 0.30853787 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
topicId topicWeight
[(1, 0.06), (3, 0.058), (7, 0.076), (9, 0.073), (20, 0.015), (22, 0.081), (25, 0.261), (44, 0.112), (57, 0.062), (65, 0.049), (69, 0.052), (90, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.83511102 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
2 0.80953962 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
3 0.72371781 154 nips-2006-Optimal Change-Detection and Spiking Neurons
Author: Angela J. Yu
Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1
4 0.57447398 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes. 1
5 0.57211071 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1
6 0.57159662 175 nips-2006-Simplifying Mixture Models through Function Approximation
7 0.56762481 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
8 0.56694746 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
9 0.56601799 139 nips-2006-Multi-dynamic Bayesian Networks
10 0.56443191 158 nips-2006-PG-means: learning the number of clusters in data
11 0.56403792 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
12 0.56326246 167 nips-2006-Recursive ICA
13 0.56239128 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
14 0.55977279 65 nips-2006-Denoising and Dimension Reduction in Feature Space
15 0.55942708 98 nips-2006-Inferring Network Structure from Co-Occurrences
16 0.55830783 117 nips-2006-Learning on Graph with Laplacian Regularization
17 0.55761975 20 nips-2006-Active learning for misspecified generalized linear models
18 0.55740196 159 nips-2006-Parameter Expanded Variational Bayesian Methods
19 0.55663657 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
20 0.55603588 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis