nips nips2005 nips2005-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. [sent-9, score-1.097]
2 The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. [sent-10, score-0.25]
3 The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. [sent-11, score-0.337]
4 The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. [sent-12, score-0.4]
5 Rather than learning a density model of the observations, our goal here will be to learn a mapping x → y from the observations to the latent codes (cluster labels) by optimizing a formal measure of coding efficiency. [sent-17, score-0.108]
6 The fundamental measure in this context is the mutual information I(x, y) := H(x) − H(x|y) ≡ H(y) − H(y|x), (1) which indicates the decrease in uncertainty about the pattern x due to the knowledge of the underlying cluster label y (e. [sent-19, score-0.596]
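As a concrete reading of (1) in the empirical setting used throughout the paper (a uniform distribution over the M training patterns), the short sketch below computes I(x, y) = H(y) − H(y|x) from a matrix of encoder responsibilities. The NumPy implementation and the function name are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def mutual_information(P, eps=1e-12):
    """I(x, y) = H(y) - H(y|x) for an empirical encoder.

    P : (M, C) array with P[m, j] = p(y_j | x^(m)); rows sum to 1.
    The empirical distribution puts mass 1/M on each training pattern.
    """
    q = P.mean(axis=0)                       # marginal p(y_j) = (1/M) sum_m p(y_j | x^(m))
    H_y = -np.sum(q * np.log(q + eps))       # entropy of the cluster labels, H(y)
    H_y_x = -np.mean(np.sum(P * np.log(P + eps), axis=1))  # conditional entropy H(y|x)
    return H_y - H_y_x
```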
7 In our case the encoder model is defined as p(x, y) ∝ Σ_{m=1}^M δ(x − x^(m)) p(y|x), (2) where {x^(m) | m = 1, . [sent-26, score-0.325]
8 Our goal is to maximize (1) with respect to parameters of a constrained encoding distribution p(y|x). [sent-30, score-0.139]
9 In contrast to most applications of the infomax principle (Linsker (1988)) in stochastic channels (e. [sent-31, score-0.229]
10 Indeed, had the code space been high-dimensional, computation of I(x, y) would have required evaluation of the generally intractable entropy of the mixture H(y), and approximations would have needed to be considered (e. [sent-34, score-0.122]
11 Maximization of the mutual information with respect to parameters of the encoder model effectively defines a discriminative unsupervised optimization framework, where the model is parameterized similarly to a conditionally trained classifier, but where the cluster allocations are generally unknown. [sent-37, score-1.03]
12 Training such models p(y|x) by maximizing the likelihood p(x) would be meaningless, as the cluster variables would simply marginalize out; this further motivates our information-theoretic approach. [sent-38, score-0.41]
13 In this way we may extract soft cluster allocations directly from the training set, with no additional information about class labels, relevance patterns, etc. [sent-39, score-0.639]
14 This is an important difference from other clustering techniques that make recourse to information theory, which consider different channels and generally require additional information about relevance or irrelevance variables (cf. Tishby et al. [sent-41, score-0.382]
15 Our infomax approach is in contrast with probabilistic methods based on likelihood maximization. [sent-43, score-0.229]
16 There the task of finding an optimal cluster allocation y for an observed pattern x may be viewed as an inference problem in generative models y → x, where the probability of the data p(x) = Σ_y p(y)p(x|y) is defined as a mixture of |y| processes. [sent-44, score-0.531]
17 The key idea of fitting such models to data is to find a constrained probability distribution p(x) which would be likely to generate the visible patterns {x(1) , . [sent-45, score-0.144]
18 , x(M ) } (this is commonly achieved by maximizing the marginal likelihood for deterministic parameters of the constrained distribution). [sent-48, score-0.137]
19 The unknown clusters y corresponding to each pattern x may then be assigned according to the posterior p(y|x) ∝ p(y)p(x|y). [sent-49, score-0.209]
20 In high dimensions |x| this restricts the class of generative distributions usually to (mixtures of) Gaussians whose mean is dependent (in a linear or non-linear way) on the latent cluster y. [sent-51, score-0.386]
21 If we are restricted to using mixtures of Gaussians to model this curved manifold, typically a very large number of mixture components will be required. [sent-53, score-0.175]
22 No such restrictions apply in the infomax case so that the mappings p(y|x) may be very complex, subject only to sensible clustering constraints. [sent-54, score-0.57]
23 2 Clustering in Nonlinear Encoder Models Arguably, there are at least two requirements which a meaningful cluster allocation procedure should satisfy. [sent-55, score-0.31]
24 Firstly, clusters should be, in some sense, locally smooth. [sent-56, score-0.141]
25 For example, each pair of source vectors should have a high probability of being assigned to the same cluster if the vectors satisfy specific geometric constraints. [sent-57, score-0.379]
26 Secondly, we may wish to avoid assigning unique cluster labels to outliers (or other constrained regions in the data space), so that under-represented regions in the data space are not over-represented in the code space. [sent-58, score-0.438]
27 Note that degenerate cluster allocations are generally suboptimal under the objective (1), as they would lead to a reduction in the marginal entropy H(y). [sent-59, score-0.62]
28 On the other hand, it is intuitive that maximization of the mutual information I(x, y) favors hard assignments of cluster labels to equiprobable data regions, as this would result in the growth in H(y) and reduction in H(y|x). [sent-60, score-0.47]
29 1 Learning Optimal Parameters Local smoothness and “softness” of the clusters may be enforced by imposing appropriate constraints on p(y|x). [sent-62, score-0.139]
30 A simple choice of the encoder is p(y_j|x^(i)) ∝ exp{−‖x^(i) − w_j‖²/s_j + b_j}, (3) where the cluster centers w_j ∈ R^|x|, the dispersions s_j, and the biases b_j are the encoder parameters to be learned. [sent-63, score-1.727]
31 Clearly, under the encoding distribution (3) patterns x lying close to specific centers wj in the data space will tend to be clustered similarly. [sent-64, score-0.377]
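A minimal sketch of evaluating the encoder (3), with the proportionality resolved by normalizing over the |y| clusters (a softmax over the exponents); the array layout and function name are our own assumptions.

```python
import numpy as np

def encoder_responsibilities(X, W, s, b):
    """p(y_j | x^(m)) ∝ exp{-||x^(m) - w_j||^2 / s_j + b_j}, normalized over clusters j.

    X : (M, D) training patterns, W : (C, D) cluster centers w_j,
    s : (C,) positive dispersions s_j, b : (C,) biases b_j.
    Returns an (M, C) matrix of responsibilities whose rows sum to 1.
    """
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # (M, C) squared distances
    logits = -sq_dists / s[None, :] + b[None, :]
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```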
32 In principle, we could consider other choices of p(y|x); however (3) will prove to be particularly convenient for the kernelized extensions. [sent-65, score-0.228]
33 Learning the optimal cluster allocations corresponds to maximizing (1) with respect to the encoder parameters (3). [sent-66, score-0.995]
34 The gradients are given by ∂I(x, y)/∂w_j = (1/M) Σ_{m=1}^M p(y_j|x^(m)) α_j^(m) (x^(m) − w_j)/s_j (4) and ∂I(x, y)/∂s_j = (1/M) Σ_{m=1}^M p(y_j|x^(m)) α_j^(m) ‖x^(m) − w_j‖²/(2s_j²) (5); analogously, we get ∂I(x, y)/∂b_j = (1/M) Σ_{m=1}^M p(y_j|x^(m)) α_j^(m). [sent-67, score-0.595]
35 Expressions (4) and (5) have the form of the weighted EM updates for isotropic Gaussian mixtures, with the weighting coefficients α_j^(m) defined as α_j^(m) := α_j(x^(m)) = log[ p(y_j|x^(m)) / p(y_j) ] − KL[ p(y|x^(m)) ‖ ⟨p(y|x)⟩_p̃(x) ], (6) where KL defines the Kullback-Leibler divergence (e. [sent-69, score-0.344]
36 , |y|, the gradients (4) are identical to those obtained by maximizing the log-likelihood of a Gaussian mixture model (up to irrelevant constant pre-factors). [sent-78, score-0.203]
37 Generally, however, the coefficients αj will be functions of wl , sl , and bl for all cluster labels l = 1, . [sent-79, score-0.359]
38 In practice, we may impose a simple construction ensuring that s_j > 0, for example by assuming that s_j = exp{s̃_j}, where s̃_j ∈ R. [sent-83, score-0.585]
39 For this case, we may re-express the gradients for the variances as ∂I(x, y)/∂s̃_j = s_j ∂I(x, y)/∂s_j. [sent-84, score-0.29]
40 Expressions (4) and (5) may then be used to perform gradient ascent on I(x, y) for w_j, s̃_j, and b_j, where j = 1, . [sent-85, score-0.466]
41 After training, the optimal cluster allocations may be assigned according to the encoding distribution p(y|x). [sent-89, score-0.7]
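To make the updates of Section 2.1 concrete, the sketch below performs gradient ascent on I(x, y) for the encoder (3), reusing the encoder_responsibilities helper sketched above (mutual_information from the earlier sketch can be used to monitor convergence). Rather than transcribing the constant pre-factors of (4)–(5), which are partly garbled in this extract, it obtains the gradients by the chain rule through the encoder logits, with the weighting coefficients α computed as in (6); the step size, iteration count, and initialization are illustrative assumptions.

```python
def infomax_cluster(X, n_clusters, n_iter=500, lr=0.1, eps=1e-12, seed=0):
    """Gradient ascent on I(x, y) for the encoder of Eq. (3)."""
    rng = np.random.default_rng(seed)
    M, D = X.shape
    W = X[rng.choice(M, n_clusters, replace=False)].copy()   # centers w_j, init at data points
    s_tilde = np.zeros(n_clusters)                            # s_j = exp(s_tilde_j) > 0
    b = np.zeros(n_clusters)                                  # biases b_j
    for _ in range(n_iter):
        s = np.exp(s_tilde)
        P = encoder_responsibilities(X, W, s, b)              # (M, C)
        q = P.mean(axis=0)                                    # p(y_j)
        # alpha_j(x^(m)) = log[p(y_j|x)/p(y_j)] - KL(p(y|x^(m)) || p(y)), cf. Eq. (6)
        log_ratio = np.log(P + eps) - np.log(q + eps)[None, :]
        alpha = log_ratio - (P * log_ratio).sum(axis=1, keepdims=True)
        G = P * alpha / M                                     # dI / d(logits), shape (M, C)
        diff = X[:, None, :] - W[None, :, :]                  # (M, C, D)
        grad_W = 2.0 * np.einsum('mc,mcd->cd', G, diff) / s[:, None]
        grad_s_tilde = (G * (diff ** 2).sum(-1) / s[None, :]).sum(axis=0)
        grad_b = G.sum(axis=0)
        W += lr * grad_W                                      # ascend I(x, y)
        s_tilde += lr * grad_s_tilde
        b += lr * grad_b
    P = encoder_responsibilities(X, W, np.exp(s_tilde), b)
    return P.argmax(axis=1), (W, np.exp(s_tilde), b)          # hard allocations + parameters
```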
42 2 Infomax Clustering with Kernelized Encoder Models We now extend (3) by considering a kernelized parameterization of a nonlinear encoder. [sent-91, score-0.228]
43 Let us assume that the source patterns x(i) , x(j) have a high probability of being assigned to the same cluster if they lie close to a specific cluster center in some feature space. [sent-92, score-0.882]
44 One choice of the encoder distribution for this case is p(y_j|x^(i)) ∝ exp{−‖φ(x^(i)) − w_j‖²/s_j + b_j}, (7) where φ(x^(i)) ∈ R^|φ| is the feature vector corresponding to the source pattern x^(i), and w_j ∈ R^|φ| is the (unknown) cluster center in the feature space. [sent-93, score-1.23]
45 The feature space may be very high- or even infinite-dimensional. [sent-94, score-0.087]
46 Since each cluster center w_j ∈ R^|φ| lives in the same space as the projected sources φ(x^(i)), it is representable in the basis of the projections as w_j = Σ_{m=1}^M α_mj φ(x^(m)) + w_j^⊥, (8) where w_j^⊥ ∈ R^|φ| is orthogonal to the span of φ(x^(1)), . [sent-95, score-0.648]
47 , φ(xM ), and {αmj } is a set of coefficients (here j and m index |y| codes and M patterns respectively). [sent-98, score-0.135]
48 Additionally, we will ensure positivity of the dispersions s_j by considering a construction constraint s_j = exp{s̃_j}, where s̃_j ∈ R. [sent-101, score-0.597]
49 Objective (1) should be optimized with respect to the log-dispersions s̃_j ≡ log(s_j), biases c_j, and coordinates A ∈ R^{M×|y|} in the space spanned by the feature vectors {φ(x^(i)) | i = 1, . [sent-106, score-0.323]
50 From (9) we get ∂I(x, y)/∂a_j = (1/s_j) ⟨ p(y_j|x) α_j(x) (k(x) − K a_j) ⟩_p̃(x) ∈ R^M (10) and ∂I(x, y)/∂s̃_j = (1/(2s_j)) ⟨ p(y_j|x) f_j(x) α_j(x) ⟩_p̃(x), (11) where p̃(x) ∝ Σ_{m=1}^M δ(x − x^(m)) is the empirical distribution. [sent-110, score-0.184]
51 For a known Gram matrix K ∈ R^{M×M}, the gradients ∂I/∂a_j, ∂I/∂s̃_j, and ∂I/∂c_j given by expressions (10) – (12) may be used in numerical optimization for the model parameters. [sent-112, score-0.184]
52 Note that the matrix multiplication in (10) is performed once for each a_j, so that the complexity of computing the gradient is ∼ O(M²|y|) per iteration. [sent-113, score-0.114]
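The kernelized encoder (7) can be evaluated from the Gram matrix alone: expanding ‖φ(x^(i)) − w_j‖² with w_j = Σ_m α_mj φ(x^(m)) (the orthogonal component w_j^⊥ dropped) gives K_ii − 2 k(x^(i))^T a_j + a_j^T K a_j. The sketch below is our own rendering of this expansion, not the paper's code; the K @ A product is the O(M²|y|) multiplication referred to above, and NumPy is reused as before.

```python
def kernelized_responsibilities(K, A, s, c):
    """p(y_j | x^(i)) ∝ exp{-||phi(x^(i)) - w_j||^2 / s_j + c_j}, w_j = sum_m A[m, j] phi(x^(m)).

    K : (M, M) Gram matrix, A : (M, C) coordinates a_j in the span of the feature vectors,
    s : (C,) dispersions, c : (C,) biases.  Returns an (M, C) matrix of responsibilities.
    """
    KA = K @ A                                              # (M, C): column j holds K a_j
    sq_dists = (np.diag(K)[:, None]                         # ||phi(x^(i))||^2
                - 2.0 * KA                                  # -2 k(x^(i))^T a_j
                + np.einsum('mc,mc->c', A, KA)[None, :])    # + a_j^T K a_j
    logits = -sq_dists / s[None, :] + c[None, :]
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```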
53 We also note that one could potentially optimize (1) by applying the iterative Arimoto-Blahut algorithm for maximizing the channel capacity (see e. [sent-114, score-0.103]
54 However, for any given constrained encoder it is generally difficult to derive closed-form updates for the parameters of p(y|x), which motivates a numerical optimization. [sent-117, score-0.454]
55 Learning Optimal Kernels Since we presume that explicit computations in R^|φ| are expensive, we cannot compute the Gram matrix by trivially applying its definition K = {φ(x^(i))^T φ(x^(j))}. [sent-118, score-0.118]
56 Instead, we may interpret scalar products in feature spaces as kernel functions φ(x(i) )T φ(x(j) ) = KΘ (x(i) , x(j) ; Θ), ∀x(i) , x(j) ∈ Rx , (13) where KΘ : Rx × Rx → R satisfies Mercer’s kernel properties (e. [sent-119, score-0.245]
57 We may now apply our unsupervised framework to implicitly learn the optimal nonlinear features by optimizing I(x, y) with respect to the parameters Θ of the kernel function KΘ . [sent-122, score-0.221]
58 The computational complexity of computing the updates for Θ is O(M|y|²), where M is the number of training patterns and |y| is the number of clusters (which is assumed to be small). [sent-124, score-0.237]
59 Note that in contrast to spectral methods (see e. [sent-125, score-0.11]
60 (2001)) neither the objective (1) nor its gradients require inversion of the Gram matrix K ∈ R^{M×M} or computation of its eigenvalue decomposition. [sent-128, score-0.187]
61 By substituting (16) into the general expression (14), we obtain the gradient of the mutual information with respect to the RBF kernel parameters. [sent-130, score-0.158]
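For concreteness, a sketch of an RBF Gram matrix and its derivative with respect to the kernel parameter, assuming the common parameterization K_ij = exp(−β‖x^(i) − x^(j)‖²) with inverse width β; equations (14) and (16) are not reproduced in this extract, so the exact form and constant factors should be treated as assumptions. The kernel-parameter gradient then follows from the chain rule ∂I/∂β = Σ_ij (∂I/∂K_ij)(∂K_ij/∂β).

```python
def rbf_gram(X, beta):
    """K_ij = exp(-beta * ||x^(i) - x^(j)||^2) together with dK/dbeta."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-beta * sq)
    return K, -sq * K                                      # (K, dK/dbeta)

def mi_grad_beta(dI_dK, dK_dbeta):
    """Chain rule: dI/dbeta = sum_ij (dI/dK_ij) * (dK_ij/dbeta)."""
    return np.sum(dI_dK * dK_dbeta)
```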
62 3 Demonstrations We have empirically compared our kernelized information-theoretic clustering approach with Gaussian mixture, k-means, feature-space k-means, non-kernelized information-theoretic clustering (see Section 2. [sent-131, score-0.844]
63 1), and a multi-class spectral clustering method optimizing the normalized cuts. [sent-132, score-0.451]
64 The kernel parameters β of the RBF-kernelized encoding distribution were initialized at β0 = 2. [sent-136, score-0.172]
65 , |y|} (along with the RBF kernel parameter β) were optimized by applying scaled conjugate gradients. [sent-147, score-0.079]
66 We found that Gaussian mixtures trained by maximizing the likelihood usually resulted in highly stochastic cluster allocations; additionally, they led to a large variation in cluster sizes. [sent-148, score-0.77]
67 The Gaussian mixtures were initialized using k-means – other choices usually led to worse performance. [sent-149, score-0.099]
68 We also see that k-means effectively breaks down, as the similarly clustered points lie close to each other in R² (according to the L2-norm), but the allocated clusters are not locally smooth in t. [sent-150, score-0.239]
69 On the other hand, our method with the RBF-kernelized encoders typically led to locally smooth cluster allocations. [sent-151, score-0.436]
(Figure 1 panel titles: Gaussian mixture clustering for |y|=3; K-means clustering for |y|=3; KMI clustering.) [sent-152, score-0.695]
71 Left: Gaussian mixtures; middle: K-means; right: information-maximization for the (RBF-)kernelized encoder (the learned parameter β ≈ 0. [sent-186, score-0.325]
72 Light, medium, and dark-gray squares show the cluster colors corresponding to deterministic cluster allocations. [sent-188, score-0.62]
73 The color intensity of each training point x(m) is the average of the pure cluster intensities, weighted by the responsibilities p(yj |x(m) ). [sent-189, score-0.384]
74 Nearly indistinguishable dark colors of the Gaussian mixture clustering indicate soft cluster assignments. [sent-190, score-0.727]
75 Figure 2 shows typical results for spatially translated letters with |x| = 2, M = 150, and |y| = 2 (or |y| = 3), where we compare Gaussian mixture, feature-space k-means, the spectral method of Ng et al. [sent-191, score-0.141]
76 The results produced by our kernelized infomax method were generally stable under different initializations, provided that β0 was not too large or too small. [sent-194, score-0.529]
77 In contrast to Gaussian mixture, spectral, and feature-space k-means clustering, the clusters produced by kernelized infomax for the cases considered are arguably more anthropomorphically appealing. [sent-195, score-0.633]
78 Note that feature-space k-means, as well as the spectral method, presume that the kernel matrix K ∈ RM ×M is fixed and known (in the latter case, the Gram matrix defines the edge weights of the graph). [sent-196, score-0.307]
79 For illustration purposes, we show the results for the fixed Gram matrices with kernel parameters β set to the initial values β0 = 1 or the learned values β ≈ 0. [sent-197, score-0.119]
80 One may potentially improve the performance of these methods by running the algorithms several times (with different kernel parameters β) and choosing the β which results in the tightest clusters (Ng et al. [sent-199, score-0.341]
81 We were indeed able to apply the spectral method to obtain clusters for TA and T (for β ≈ 1. [sent-201, score-0.216]
82 In contrast, the kernelized infomax method typically resulted in meaningful cluster allocations (TT and A) after a single run of the algorithm (see Figure 2 (c)), with the results qualitatively consistent under a variety of initializations. [sent-204, score-1.0]
83 Additionally, we note that in situations when we used simpler encoder models (see expression (3)) or did not adapt parameters of the kernel functions, the extracted clusters were often more intuitive than those produced by rival methods, but inferior [sent-205, score-0.579]
(Figure 2 panel titles: Gaussian mixture clustering for |y|=2; feature-space K-means.) [sent-206, score-0.387]
85 5 −3 −2 −1 0 1 2 3 (f) Figure 2: Learning cluster allocations for |y| = 2 and |y| = 3. [sent-248, score-0.543]
86 604 (the only pattern clustered differently (under identical initializations) is shown by ⊚); (c) kernelized infomax clustering for |y| = 2 (the inverse variance β of the RBF kernel varied from β0 = 1 (at the initialization) to β ≈ 0. [sent-252, score-0.936]
87 604 after convergence); (d) spectral clustering for |y| = 2 and β ≈ 0. [sent-253, score-0.418]
88 604; (e) kernelized infomax clustering for |y| = 3 with a fixed Gram matrix; (f ) kernelized infomax clustering for |y| = 3 started at β0 = 1 and reaching β ≈ 0. [sent-254, score-1.53]
89 Our results suggest that by learning kernel parameters we may often obtain higher values of the objective I(x, y), as well as more appealing cluster labeling (e. [sent-257, score-0.496]
90 Undoubtedly, a careful choice of the kernel function could potentially lead to an even better visualization of the locally smooth, non-degenerate structure. [sent-262, score-0.166]
91 4 Discussion The proposed information-theoretic clustering framework is fundamentally different from the generative latent variable clustering approaches. [sent-263, score-0.692]
92 Instead of explicitly parameterizing the data-generating process, we impose constraints on the encoder distributions, transforming the clustering problem to learning optimal discrete encodings of the unlabeled data. [sent-264, score-0.669]
93 Many possible parameterizations of such distributions may potentially be considered. [sent-265, score-0.085]
94 Our method suggests a formal information-theoretic procedure for learning optimal cluster allocations. [sent-267, score-0.346]
95 Moreover, the results suggest that in the cases considered the method compares favorably with the common generative clustering techniques, k-means, feature-space k-means, and the variants of the method which do not use nonlinearities or do not learn parameters of the kernel functions. [sent-269, score-0.465]
96 A number of interesting interpretations of clustering approaches in feature spaces are possible. [sent-270, score-0.362]
97 (2004)) that spectral clustering methods optimizing normalized cuts (Shi and Malik (2000); Ng et al. [sent-272, score-0.482]
98 We are currently relating our method to the common spectral clustering approaches and a form of annealed weighted feature-space k-means. [sent-274, score-0.418]
99 We stress, however, that our information-maximizing framework suggests a principled way of learning optimal similarity matrices by adapting parameters of the kernel functions. [sent-275, score-0.155]
100 Finally, we expect that the proper information-theoretic interpretation of the encoder framework may facilitate extensions of the information-theoretic clustering method to richer families of encoder distributions. [sent-277, score-0.991]
wordName wordTfidf (topN-words)
[('encoder', 0.325), ('cluster', 0.31), ('clustering', 0.308), ('allocations', 0.233), ('infomax', 0.229), ('kernelized', 0.228), ('sj', 0.184), ('yj', 0.177), ('def', 0.172), ('wj', 0.169), ('gram', 0.154), ('agakov', 0.129), ('barber', 0.129), ('spectral', 0.11), ('coe', 0.108), ('clusters', 0.106), ('kmi', 0.103), ('erent', 0.103), ('patterns', 0.098), ('dhillon', 0.096), ('di', 0.087), ('cj', 0.085), ('bj', 0.08), ('rm', 0.08), ('kernel', 0.079), ('mutual', 0.079), ('mixture', 0.079), ('kaj', 0.078), ('aj', 0.074), ('gradients', 0.073), ('initializations', 0.072), ('guan', 0.067), ('ectively', 0.067), ('tishby', 0.063), ('mixtures', 0.06), ('thomas', 0.059), ('clustered', 0.057), ('fj', 0.055), ('feature', 0.054), ('rx', 0.054), ('encoding', 0.053), ('potentially', 0.052), ('encoders', 0.052), ('linsker', 0.052), ('nadal', 0.052), ('torkkola', 0.052), ('maximizing', 0.051), ('rbf', 0.05), ('theoretic', 0.049), ('labels', 0.049), ('cients', 0.047), ('cover', 0.046), ('constrained', 0.046), ('ng', 0.046), ('malik', 0.046), ('dispersions', 0.045), ('shi', 0.044), ('generally', 0.043), ('campbell', 0.041), ('responsibilities', 0.041), ('chechik', 0.041), ('principe', 0.041), ('arguably', 0.041), ('brunel', 0.041), ('lie', 0.041), ('matrix', 0.04), ('fisher', 0.04), ('computations', 0.04), ('parameters', 0.04), ('gaussian', 0.04), ('led', 0.039), ('additionally', 0.039), ('generative', 0.038), ('scholkopf', 0.038), ('presume', 0.038), ('expressions', 0.038), ('latent', 0.038), ('codes', 0.037), ('curved', 0.036), ('optimal', 0.036), ('locally', 0.035), ('assigned', 0.035), ('kl', 0.035), ('pattern', 0.035), ('source', 0.034), ('mj', 0.034), ('bach', 0.034), ('objective', 0.034), ('training', 0.033), ('may', 0.033), ('edinburgh', 0.033), ('optimizing', 0.033), ('maximization', 0.032), ('analogously', 0.031), ('kij', 0.031), ('exp', 0.031), ('et', 0.031), ('soft', 0.03), ('produced', 0.029), ('su', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
2 0.23503104 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
3 0.18760087 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
4 0.18213373 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
5 0.13802166 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
6 0.13525568 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
7 0.13521758 127 nips-2005-Mixture Modeling by Affinity Propagation
8 0.12470331 159 nips-2005-Q-Clustering
9 0.102084 84 nips-2005-Generalization in Clustering with Unobserved Features
10 0.099976279 177 nips-2005-Size Regularized Cut for Data Clustering
11 0.090728641 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
12 0.077554554 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
13 0.076040432 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
14 0.070982486 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
15 0.070611037 45 nips-2005-Conditional Visual Tracking in Kernel Space
16 0.07046359 104 nips-2005-Laplacian Score for Feature Selection
17 0.069838852 126 nips-2005-Metric Learning by Collapsing Classes
18 0.066400349 133 nips-2005-Nested sampling for Potts models
19 0.063958332 114 nips-2005-Learning Rankings via Convex Hull Separation
20 0.063147381 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
topicId topicWeight
[(0, 0.237), (1, 0.12), (2, -0.111), (3, -0.038), (4, -0.287), (5, -0.131), (6, 0.072), (7, -0.066), (8, -0.159), (9, -0.013), (10, -0.128), (11, 0.01), (12, 0.045), (13, 0.054), (14, 0.066), (15, 0.109), (16, 0.081), (17, 0.034), (18, -0.036), (19, -0.094), (20, 0.035), (21, -0.047), (22, 0.063), (23, 0.003), (24, 0.109), (25, -0.034), (26, -0.031), (27, 0.005), (28, 0.016), (29, -0.023), (30, -0.038), (31, 0.009), (32, -0.062), (33, -0.036), (34, -0.065), (35, 0.013), (36, 0.135), (37, -0.137), (38, 0.054), (39, 0.101), (40, 0.043), (41, -0.158), (42, 0.014), (43, 0.001), (44, -0.002), (45, -0.004), (46, -0.02), (47, -0.044), (48, 0.029), (49, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.96869433 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
2 0.70212972 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
3 0.697052 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
4 0.68465889 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove a finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movies rating as attributes. 1
5 0.67857128 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
6 0.6454019 79 nips-2005-Fusion of Similarity Data in Clustering
7 0.64135504 127 nips-2005-Mixture Modeling by Affinity Propagation
8 0.51606917 177 nips-2005-Size Regularized Cut for Data Clustering
9 0.49534369 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
10 0.4715398 71 nips-2005-Fast Krylov Methods for N-Body Learning
11 0.44865146 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
12 0.38219532 33 nips-2005-Bayesian Sets
13 0.35384005 104 nips-2005-Laplacian Score for Feature Selection
14 0.34435061 45 nips-2005-Conditional Visual Tracking in Kernel Space
15 0.32695618 103 nips-2005-Kernels for gene regulatory regions
16 0.32318896 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.32116795 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
18 0.30738902 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
19 0.30578896 4 nips-2005-A Bayesian Spatial Scan Statistic
20 0.29978344 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
topicId topicWeight
[(3, 0.072), (10, 0.035), (27, 0.034), (31, 0.049), (34, 0.104), (50, 0.011), (54, 0.211), (55, 0.03), (65, 0.02), (69, 0.049), (73, 0.123), (77, 0.018), (88, 0.083), (91, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.84961671 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
2 0.77645165 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
3 0.69332898 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
4 0.68490934 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.67513037 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
6 0.670196 71 nips-2005-Fast Krylov Methods for N-Body Learning
7 0.66013187 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
8 0.65946031 104 nips-2005-Laplacian Score for Feature Selection
9 0.6480757 177 nips-2005-Size Regularized Cut for Data Clustering
10 0.64430177 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
11 0.64323938 178 nips-2005-Soft Clustering on Graphs
12 0.64319015 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
13 0.64096534 98 nips-2005-Infinite latent feature models and the Indian buffet process
14 0.63944471 84 nips-2005-Generalization in Clustering with Unobserved Features
15 0.63913077 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
16 0.63645446 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
17 0.63440365 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
18 0.63381565 77 nips-2005-From Lasso regression to Feature vector machine
19 0.63249016 184 nips-2005-Structured Prediction via the Extragradient Method
20 0.62773997 133 nips-2005-Nested sampling for Potts models