nips nips2005 nips2005-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes. 1
Reference: text
sentIndex sentText sentNum sentScore
1 il Abstract We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. [sent-4, score-0.577]
2 Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. [sent-5, score-0.407]
3 We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. [sent-6, score-0.133]
4 The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes. [sent-7, score-0.351]
5 1 Introduction Data clustering is the unsupervised classification of objects into groups based on their similarity [1]. [sent-8, score-0.185]
6 Often, it is desirable for the clusters to match some labels that are unknown to the clustering algorithm. [sent-9, score-0.313]
7 In this context, a good data clustering is expected to have homogeneous labels in each cluster, under some constraints on the number or complexity of the clusters. [sent-10, score-0.24]
8 Since the clustering algorithm has no access to the labels, it is unclear how the algorithm can optimize the quality of the clustering. [sent-14, score-0.225]
9 Even worse, the clustering quality depends on the specific choice of the unobserved labels. [sent-15, score-0.444]
10 For example, a good document clustering with respect to topics is very different from a clustering with respect to authors. [sent-16, score-0.34]
11 In our setting, instead of trying to cluster by some “arbitrary” labels, we try to predict unobserved features from observed ones. [sent-17, score-0.526]
12 In this sense our target “labels” are yet other features that “happened” to be unobserved. [sent-18, score-0.123]
13 For example, when clustering fruits based on their observed features, such as shape, color and size, the target of clustering is to match unobserved features, such as nutritional value and toxicity. [sent-19, score-0.773]
14 We assume that the random selection of features is done uniformly and independently. [sent-23, score-0.173]
15 After the clustering, one of the unobserved features is randomly and uniformly selected to be a target label. [sent-25, score-0.447]
16 That is, clustering performance is measured with respect to this feature. [sent-27, score-0.17]
17 Obviously, the clustering algorithm cannot be directly optimized for this specific feature. [sent-28, score-0.184]
18 The question is whether we can optimize the expected performance on the unobserved feature, based on the observed features alone. [sent-29, score-0.489]
19 In other words, can we find clusters that match as many unobserved features as possible? [sent-31, score-0.49]
20 We show that for any clustering algorithm, the average performance of the clustering with respect to the observed and unobserved features is similar. [sent-33, score-0.687]
21 Hence we can indirectly optimize clustering performance with respect to the unobserved features, in analogy to generalization in supervised learning. [sent-34, score-0.537]
22 In order to quantify these results, we define two terms: the average observed information and the expected unobserved information. [sent-36, score-0.388]
23 The average observed information, denoted by Iob , is the average mutual information between T and each of the observed features. [sent-41, score-0.245]
24 In other words, if the observed features are {X1, ..., Xn}, then Iob = (1/n) Σ_{i=1}^{n} I(T; Xi). [sent-42, score-0.183]
25 The expected unobserved information, denoted by Iun, is the expected value of the mutual information between T and a randomly selected unobserved feature, i.e., Iun = Ej{I(T; Xj)}. [sent-46, score-0.577]
26 It also states uniform convergence in probability of |Iob − Iun| to zero as the number of observed features increases. [sent-53, score-0.197]
27 Conceptually, the observed mean information, Iob , is analogous to the training error in standard supervised learning [3], whereas the unobserved information, Iun , is similar to the generalization error. [sent-54, score-0.441]
28 The second theorem states that under a constraint on the number of clusters, and a large enough number of observed features, one can achieve nearly the best possible performance in terms of Iun. [sent-55, score-0.122]
29 The key difference is that in supervised learning, the set of features is fixed and the training instances (samples) are assumed to be randomly drawn from some distribution. [sent-58, score-0.221]
30 In our setting, the set of instances is fixed, but the set of observed features is assumed to be randomly selected. [sent-59, score-0.252]
31 Related work The idea of an information tradeoff between complexity and information on target variables is similar to the idea of the information bottleneck [4]. [sent-63, score-0.125]
32 But unlike the bottleneck method, here we are trying to maximize information on unobserved variables, using finite samples. [sent-64, score-0.349]
33 These variables are the observed features and their indexes are denoted by {q1, ..., qn}. [sent-72, score-0.233]
34 The remaining L − n variables are the unobserved features. [sent-76, score-0.288]
35 A clustering algorithm has access only to the observed features over m instances {x[1], ..., x[m]}. [sent-77, score-0.431]
36 Shannon’s mutual information between two variables is a function of their joint distribution, defined as I(T; Xj) = Σ_{t,xj} P(t, xj) log [P(t, xj) / (P(t) P(xj))]. [sent-86, score-0.248]
37 For a random j, this empirical mutual information is a random variable on its own. [sent-88, score-0.14]
38 The expected unobserved information, Iun , is defined as Iun = Ej {I(T ; Xj )}. [sent-93, score-0.295]
39 We can assume that, with high probability, a randomly selected feature is from the unobserved set. [sent-94, score-0.575]
40 Equivalently, Iun can be defined as the mean mutual information between the clusters and each of the unobserved features, Iun = (1/(L−n)) Σ_{j ∉ {q1,...,qn}} I(T; Xj). [sent-95, score-0.443]
41 The goal of the clustering algorithm is to find cluster labels {t1, ..., tm} that maximize Iun. [sent-99, score-0.29]
42 The maximization is subject to a constraint on their complexity, henceforth considered as the number of clusters (k ≤ D) for simplicity, where D is an integer bound. [sent-102, score-0.172]
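To make these definitions concrete, the following is a minimal sketch (our own illustration, not code from the paper; the helper names and the toy data are assumptions) that computes plug-in estimates of Iob and Iun from a matrix of discrete feature values and a vector of cluster labels.

```python
# Minimal sketch (not from the paper): plug-in estimates of I_ob and I_un.
import numpy as np

def empirical_mi(t, x):
    """Empirical mutual information I(T; X) for two discrete vectors."""
    t_vals, t_idx = np.unique(t, return_inverse=True)
    x_vals, x_idx = np.unique(x, return_inverse=True)
    joint = np.zeros((len(t_vals), len(x_vals)))
    np.add.at(joint, (t_idx, x_idx), 1.0)          # joint counts
    joint /= joint.sum()
    pt = joint.sum(axis=1, keepdims=True)          # marginal P(t)
    px = joint.sum(axis=0, keepdims=True)          # marginal P(x)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pt @ px)[nz])))

def mean_info(labels, X):
    """Average MI between cluster labels and each column (feature) of X."""
    return float(np.mean([empirical_mi(labels, X[:, j]) for j in range(X.shape[1])]))

# Toy usage: I_ob over observed columns, I_un over held-out columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 40))            # m = 1000 instances, L = 40 features
labels = rng.integers(0, 3, size=1000)             # some clustering into k = 3 clusters
i_ob = mean_info(labels, X[:, :20])                # observed half
i_un = mean_info(labels, X[:, 20:])                # unobserved half
print(i_ob, i_un)
```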
43 Similar to the generalization error in supervised learning, Iun cannot be estimated directly in the learning algorithm, but we may be able to bound the difference between the observed information Iob - our “training error” - and Iun - the “generalization error”. [sent-104, score-0.196]
44 To obtain generalization, this bound should be uniform over all possible clusterings, with high probability over the randomly selected features. [sent-105, score-0.161]
45 Theorem 1: Pr{ sup_{t1,...,tm} |Iob − Iun| > ε } ≤ 2 e^(−2nε²/(log k)² + m log k) for all ε > 0, where the probability is over the random selection of the observed features. [sent-110, score-0.17]
46 For any cluster labels {t1, ..., tm} and a random feature j, the mutual information I(T; Xj) is a function of the random variable j, and hence I(T; Xj) is itself a random variable. [sent-114, score-0.198]
47 Unlike the standard bounds in supervised learning, this bound increases with the number of instances (m) and decreases as the number of observed features (n) increases. [sent-122, score-0.33]
48 This is because in our scheme the training size is not the number of instances, but rather the number of observed features (see Table 1). [sent-123, score-0.207]
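As a quick numerical illustration of this behavior, one can evaluate the Theorem 1 tail bound for a few values of n. The exponent below follows our reading of the garbled display above, so treat the exact constants as an assumption; this is our own sketch, not material from the paper.

```python
# Sketch: evaluate the reconstructed Theorem 1 bound
#   Pr{ sup |I_ob - I_un| > eps } <= 2 * exp(-2 n eps^2 / (log k)^2 + m log k)
# It grows with the number of instances m and shrinks as the number of
# observed features n grows, as discussed above.
import math

def theorem1_bound(n, m, k, eps):
    return 2.0 * math.exp(-2.0 * n * eps**2 / math.log(k) ** 2 + m * math.log(k))

k, eps, m = 20, 0.1, 100
for n in (10_000, 200_000, 1_000_000):
    print(f"n={n:>9,}  bound={theorem1_bound(n, m, k, eps):.3e}")
```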
49 Again, the probability is over the random selection of the observed features. [sent-131, score-0.138]
50 Proof outline: From the given m instances and any given cluster labels {t1, ..., tm}, one can compute the empirical joint distribution of the cluster label T and each feature Xj. [sent-136, score-0.154]
51 This distribution is denoted by P̂(t, xj) and the corresponding mutual information by IP̂(T; Xj). [sent-144, score-0.246]
52 Using Paninski [9] (Proposition 1), it is easy to show that the bias between I(T; Xj) and its maximum likelihood estimate, based on P̂(t, xj), is bounded as follows. [sent-156, score-0.145]
53 Note that the expectations E(Îob), E(Îun) are taken over the random selection of the subset of m′ instances, for a set of features that were randomly selected once. [sent-174, score-0.212]
54 Pr{ sup_{t1,...,tm} (Îob − Îun) > ε1 } ≤ (4 log k / ε1) e^(−nε1²/(2(log k)²) + m′ log k) (3). Lemma 2 is used, where Z represents the random selection of features, Y represents the random selection of the m′ instances, and f(y, z) = sup_{t1,...,tm} (Îob − Îun). [sent-185, score-0.166]
55 Pr{ sup_{t1,...,tm} |Iob − Iun| > ε1 + 2k maxj |Xj| / m′ } ≤ (4 log k / ε1) e^(−nε1²/(2(log k)²) + m′ log k). By selecting ε1 = ε/2 and m′ = 4k maxj |Xj| / ε, we obtain Theorem 1. [sent-193, score-0.194]
56 We can now return to the problem of specifying a clustering that maximizes Iun , using only the observed features. [sent-196, score-0.243]
57 Definition 1 (maximally achievable unobserved information): Let I*un,D be the maximum value of Iun that can be achieved by any clustering {t1, ..., tm} with k ≤ D. [sent-198, score-0.468]
58 The clustering that achieves this value is called the best clustering. [sent-204, score-0.17]
59 The average observed information of this clustering is denoted by I*ob,D. [sent-205, score-0.285]
60 Definition 2 (observed information maximization algorithm): Let IobMax be any clustering algorithm that, based on the values of the observed features alone, selects the cluster labels {t1, ..., tm} that maximize Iob, subject to k ≤ D. [sent-206, score-0.493]
61 Let Ĩun,D be the expected unobserved information achieved by the IobMax algorithm. [sent-211, score-0.326]
62 Theorem 2: With the definitions above, Pr{ Ĩun,D ≤ I*un,D − ε } ≤ 8(log k) e^(−nε²/(32(log k)²) + (8k maxj |Xj| / ε) log k − log(ε/2)) for all ε > 0 (4), where the probability is over the random selection of the observed features. [sent-213, score-0.217]
63 Proof: We now define a bad clustering as a clustering whose expected unobserved information satisfies Iun ≤ I*un,D − ε. [sent-214, score-0.654]
64 Hence the probability that a bad clustering has a higher average observed information than the best clustering is upper bounded as in Theorem 2. [sent-217, score-0.51]
65 As a result of this theorem, when n is large enough, even an algorithm that knows the values of all the features (observed and unobserved) cannot find a clustering with the same complexity (k) that is significantly better than the clustering found by the IobMax algorithm. [sent-218, score-0.49]
66 We examine the difference between Iob and Iun as a function of the number of observed features and the number of clusters used. [sent-220, score-0.275]
67 For example, collaborative filtering for movie ratings could make predictions about a user's ratings of movies, given a partial list of ratings from this user and many other users. [sent-224, score-0.527]
68 Clustering methods are used for collaborative filtering by clustering users based on the similarity of their ratings (see e.g. [sent-225, score-0.249]
69 The rating of each movie is regarded as a feature. [sent-229, score-0.117]
70 We cluster users based on the set of observed features, i.e., the ratings of a subset of the movies. [sent-230, score-0.185]
71 In our context, the goal of the clustering is to maximize the information between the clusters and the unobserved features. [sent-233, score-0.585]
72 These are movies that have not yet been rated by any of the users. [sent-235, score-0.209]
73 By Theorem 2, given a large enough number of rated movies, we can achieve the best possible clustering of users with respect to unseen movies. [sent-236, score-0.267]
74 In this region, no additional information (such as user age, taste, or ratings of more movies) beyond the observed features can improve Iun by more than some small ε. [sent-237, score-0.298]
75 The purpose of this section is not to suggest a new algorithm for collaborative filtering or compare it to other methods, but simply to illustrate our new theorems on empirical data. [sent-238, score-0.13]
76 It contains approximately 1 million ratings for 3900 movies by 6040 users. [sent-245, score-0.243]
77 We used only a subset consisting of 2400 movies by 4000 users. [sent-247, score-0.181]
78 Each movie is viewed as a feature, where the rating is the value of the feature. [sent-252, score-0.117]
79 We randomly split the 2400 movies into two groups, denoted by “A” and “B”, of 1200 movies (features) each. [sent-254, score-0.379]
80 We used a subset of the movies from group “A” as observed features and all movies from group “B” as the unobserved features. [sent-255, score-0.852]
81 We estimated Iun by the mean mutual information between the clusters and the ratings of movies from group “B”. [sent-257, score-0.378]
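A sketch of this protocol (our own reconstruction under stated assumptions: the rating matrix, the clustering routine, and the mean_info helper from the earlier sketch are placeholders, not the authors' code) could look like this.

```python
# Sketch (assumed protocol): split movie columns into groups "A" and "B",
# cluster users on a random subset of "A", and estimate I_un as the mean MI
# between the resulting clusters and the held-out "B" columns.
# Assumes mean_info() from the earlier sketch is in scope.
import numpy as np

def run_experiment(ratings, n_observed, cluster_fn, rng):
    """ratings: (num_users, num_movies) integer matrix of discretized ratings.
    cluster_fn: any routine mapping an observed sub-matrix to user labels."""
    num_movies = ratings.shape[1]
    perm = rng.permutation(num_movies)
    group_a, group_b = perm[: num_movies // 2], perm[num_movies // 2 :]
    observed = rng.choice(group_a, size=n_observed, replace=False)
    labels = cluster_fn(ratings[:, observed])
    i_ob = mean_info(labels, ratings[:, observed])   # "training" information
    i_un = mean_info(labels, ratings[:, group_b])    # "generalization" information
    return i_ob, i_un
```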
82 In (a) and (b) the number of movies is variable, and the number of clusters is fixed. [sent-272, score-0.26]
83 In (c), the number of observed movies is fixed (1200) and the number of clusters is variable. [sent-273, score-0.333]
84 We maximize the mutual information based on the empirical distribution of the values that are present, and weight it by the probability of presence of that feature. [sent-281, score-0.137]
85 The weighting prevents ’overfitting’ to movies with few ratings. [sent-283, score-0.168]
86 Since the observed features were selected at random, the statistics of missing values of the observed and unobserved features are the same. [sent-284, score-0.676]
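One plausible way to implement this weighting (an assumption on our part; the paper does not give an explicit formula for handling missing ratings beyond the description above, and empirical_mi is the helper from the earlier sketch) is to compute the MI over users who rated the movie and multiply it by the empirical probability that a rating is present.

```python
# Sketch: presence-weighted mutual information for a feature with missing
# values (encoded here as -1). The MI is estimated over present entries only
# and down-weighted by the fraction of users who actually rated the movie,
# which damps the contribution of sparsely rated movies.
import numpy as np

def weighted_mi(labels, x, missing=-1):
    labels, x = np.asarray(labels), np.asarray(x)
    present = x != missing
    if present.sum() < 2:
        return 0.0
    return float(present.mean()) * empirical_mi(labels[present], x[present])
```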
87 Greedy IobMax Algorithm: We cluster the users using a simple greedy clustering algorithm. [sent-286, score-0.296]
88 In order to estimate I*un,D (see Definition 1), we also ran the same algorithm with all the features available to the algorithm. [sent-290, score-0.137]
89 The algorithm finds clusters that maximize the mean mutual information on features from group "B". [sent-293, score-0.345]
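The paper does not spell out the greedy procedure, so the following is only one plausible IobMax-style loop (every detail here, including the single-pass reassignment scheme and the use of weighted_mi from the sketch above, is an assumption): users are reassigned one at a time whenever the move increases the average observed information. Running the same loop on the observed columns gives the IobMax clustering, while running it on all columns gives the oracle reference described above.

```python
# Sketch of a plausible (and computationally naive) greedy IobMax loop:
# start from random cluster labels and accept single-user reassignments
# that increase the average observed information I_ob.
# Assumes weighted_mi() from the earlier sketch is in scope.
import numpy as np

def greedy_iobmax(X_obs, k, n_sweeps=3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    m = X_obs.shape[0]
    labels = rng.integers(0, k, size=m)

    def i_ob(lbl):
        return np.mean([weighted_mi(lbl, X_obs[:, j]) for j in range(X_obs.shape[1])])

    best = i_ob(labels)
    for _ in range(n_sweeps):
        for i in range(m):
            for c in range(k):
                if c == labels[i]:
                    continue
                old, labels[i] = labels[i], c
                score = i_ob(labels)
                if score > best:
                    best = score            # keep the move
                else:
                    labels[i] = old         # revert the move
    return labels
```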
90 For small n, the clustering ’overfits’ to the observed features. [sent-296, score-0.243]
91 For large n, Iun approaches I*un,D, which means the IobMax algorithm found nearly the best possible clustering, as expected from Theorem 2. [sent-298, score-0.255]
92 4 Discussion and Summary We introduce a new learning paradigm: clustering based on observed features that generalizes to unobserved features. [sent-300, score-0.638]
93 Our results are summarized by two theorems that tell us how, without knowing the value of the unobserved features, one can estimate and maximize information between the clusters and the unobserved features. [sent-301, score-0.726]
94 The key assumption that enables us to prove the theorems is the random independent selection of the observed features. [sent-302, score-0.161]
95 The difference between the observed and unobserved information is large only for a small portion of all possible partitions into observed and unobserved features. [sent-304, score-0.73]
96 The importance of clustering that preserves information on unobserved features is that it enables us to learn new, previously unobserved, attributes from a small number of examples. [sent-306, score-0.879]
97 Suppose that after clustering fruits based on their observed features, we eat a chinaberry and thus “observe” (by getting sick) the previously unobserved attribute of toxicity. [sent-307, score-0.592]
98 Assuming that in each cluster all fruits have similar unobserved attributes, we can conclude that all fruits in the same cluster, i.e., other chinaberries, are likely to be toxic as well. [sent-308, score-0.392]
99 We can even relate the IobMax principle to cognitive clustering in sensory information processing. [sent-311, score-0.19]
100 Cognitive clustering (e.g., assigning object names in language) may be based on a similar principle: find a representation (clusters) that contains significant information on as many observed features as possible, while still remaining simple. [sent-314, score-0.203]
wordName wordTfidf (topN-words)
[('iun', 0.67), ('iob', 0.529), ('unobserved', 0.274), ('clustering', 0.17), ('movies', 0.168), ('xj', 0.125), ('features', 0.11), ('iobmax', 0.094), ('clusters', 0.092), ('ratings', 0.075), ('observed', 0.073), ('cluster', 0.069), ('rating', 0.065), ('collaborative', 0.062), ('fruits', 0.059), ('mutual', 0.057), ('movie', 0.052), ('instances', 0.048), ('sup', 0.047), ('maxj', 0.047), ('users', 0.043), ('pr', 0.041), ('rated', 0.041), ('tm', 0.039), ('generalization', 0.038), ('clusterings', 0.037), ('ob', 0.037), ('labels', 0.037), ('lemma', 0.037), ('theorems', 0.037), ('theorem', 0.036), ('bound', 0.034), ('log', 0.032), ('attributes', 0.031), ('supervised', 0.031), ('user', 0.03), ('maximize', 0.029), ('selection', 0.028), ('ltering', 0.028), ('bottleneck', 0.026), ('upper', 0.024), ('chinaberries', 0.024), ('grouplens', 0.024), ('iobm', 0.024), ('movielens', 0.024), ('random', 0.023), ('group', 0.023), ('denoted', 0.022), ('expected', 0.021), ('randomly', 0.021), ('ip', 0.021), ('ej', 0.021), ('erm', 0.02), ('krupka', 0.02), ('information', 0.02), ('nitions', 0.02), ('bounded', 0.02), ('bad', 0.019), ('missing', 0.019), ('preferences', 0.017), ('alphabet', 0.017), ('amir', 0.017), ('empirical', 0.017), ('selected', 0.017), ('partitions', 0.016), ('access', 0.016), ('attribute', 0.016), ('objects', 0.015), ('proof', 0.015), ('ax', 0.014), ('indexes', 0.014), ('tishby', 0.014), ('variables', 0.014), ('label', 0.014), ('analogous', 0.014), ('probability', 0.014), ('match', 0.014), ('algorithm', 0.014), ('analogy', 0.013), ('un', 0.013), ('feature', 0.013), ('enough', 0.013), ('target', 0.013), ('hebrew', 0.013), ('subset', 0.013), ('scheme', 0.013), ('achievable', 0.013), ('outline', 0.013), ('ran', 0.013), ('uniformly', 0.012), ('bounds', 0.012), ('complexity', 0.012), ('achieved', 0.011), ('optimize', 0.011), ('generalizes', 0.011), ('nition', 0.011), ('decreases', 0.011), ('training', 0.011), ('increases', 0.011), ('labeled', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes. 1
2 0.102084 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
3 0.098746106 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
Author: Rong Jin, Feng Kang, Chris H. Ding
Abstract: Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm. 1
4 0.087995686 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
5 0.082982704 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
6 0.068243369 33 nips-2005-Bayesian Sets
7 0.065668106 79 nips-2005-Fusion of Similarity Data in Clustering
8 0.052635055 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.050527498 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
10 0.048138384 177 nips-2005-Size Regularized Cut for Data Clustering
11 0.047470246 85 nips-2005-Generalization to Unseen Cases
12 0.042801775 127 nips-2005-Mixture Modeling by Affinity Propagation
13 0.042714704 104 nips-2005-Laplacian Score for Feature Selection
14 0.040314279 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
15 0.035023209 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
16 0.032266483 41 nips-2005-Coarse sample complexity bounds for active learning
17 0.031651948 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
18 0.03022738 98 nips-2005-Infinite latent feature models and the Indian buffet process
19 0.028331488 160 nips-2005-Query by Committee Made Real
20 0.027705248 95 nips-2005-Improved risk tail bounds for on-line algorithms
topicId topicWeight
[(0, 0.108), (1, 0.053), (2, -0.038), (3, -0.025), (4, -0.122), (5, -0.016), (6, 0.021), (7, 0.03), (8, -0.144), (9, 0.018), (10, -0.076), (11, 0.019), (12, 0.1), (13, 0.024), (14, 0.077), (15, 0.086), (16, -0.003), (17, -0.04), (18, -0.059), (19, -0.049), (20, 0.051), (21, -0.04), (22, 0.043), (23, -0.005), (24, -0.006), (25, -0.048), (26, -0.077), (27, 0.011), (28, 0.014), (29, -0.032), (30, -0.094), (31, -0.02), (32, 0.089), (33, 0.044), (34, -0.087), (35, -0.034), (36, 0.041), (37, -0.088), (38, 0.022), (39, 0.193), (40, 0.139), (41, 0.004), (42, -0.014), (43, -0.072), (44, -0.024), (45, 0.039), (46, -0.091), (47, 0.083), (48, 0.14), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.94501865 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes. 1
2 0.6791535 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
3 0.58941472 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
4 0.52097011 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
5 0.51785415 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
6 0.46730661 178 nips-2005-Soft Clustering on Graphs
7 0.45508221 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
8 0.43577117 104 nips-2005-Laplacian Score for Feature Selection
9 0.43290406 85 nips-2005-Generalization to Unseen Cases
10 0.37399355 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
11 0.31968743 127 nips-2005-Mixture Modeling by Affinity Propagation
12 0.30412063 160 nips-2005-Query by Committee Made Real
13 0.30199549 117 nips-2005-Learning from Data of Variable Quality
14 0.29928288 71 nips-2005-Fast Krylov Methods for N-Body Learning
15 0.29044405 112 nips-2005-Learning Minimum Volume Sets
16 0.28093421 177 nips-2005-Size Regularized Cut for Data Clustering
17 0.27822986 95 nips-2005-Improved risk tail bounds for on-line algorithms
18 0.25717086 59 nips-2005-Dual-Tree Fast Gauss Transforms
19 0.25198272 121 nips-2005-Location-based activity recognition
20 0.24750175 77 nips-2005-From Lasso regression to Feature vector machine
topicId topicWeight
[(3, 0.1), (10, 0.051), (11, 0.01), (27, 0.03), (31, 0.026), (34, 0.075), (43, 0.308), (55, 0.023), (69, 0.028), (73, 0.079), (88, 0.053), (91, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.75873613 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes. 1
2 0.46674186 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
3 0.46398735 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1
4 0.44918588 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
Author: Zhenyue Zhang, Hongyuan Zha
Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1
5 0.44689983 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire
Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.
6 0.44401219 102 nips-2005-Kernelized Infomax Clustering
7 0.44391933 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
8 0.44169718 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
9 0.43806025 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
10 0.43779376 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
11 0.43606028 159 nips-2005-Q-Clustering
12 0.43493673 41 nips-2005-Coarse sample complexity bounds for active learning
13 0.43415713 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
14 0.43345451 160 nips-2005-Query by Committee Made Real
15 0.43293083 74 nips-2005-Faster Rates in Regression via Active Learning
16 0.43292019 178 nips-2005-Soft Clustering on Graphs
17 0.43265203 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
18 0.43225512 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
19 0.43200266 58 nips-2005-Divergences, surrogate loss functions and experimental design
20 0.43167534 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning