nips nips2006 nips2006-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby
Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method, which has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need for a sample of the joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . [sent-6, score-0.773]
2 This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. [sent-9, score-0.141]
3 The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. [sent-10, score-0.153]
4 For the prediction of missing values (collaborative filtering) it improves the currently best known results. [sent-11, score-0.141]
5 For instance, in text analysis data, the rows of a matrix correspond to words, the columns to different documents, and entries indicate the number of occurrences of a particular word in a specific document. [sent-15, score-0.505]
6 In a matrix of gene expression data, rows correspond to genes, columns to various experimental conditions, and entries indicate expression levels of given genes in given conditions. [sent-16, score-0.868]
7 In movie rating data, rows correspond to viewers, columns to movies, and entries indicate ratings made by the viewers. [sent-17, score-0.637]
8 Finally, for financial data, rows correspond to stocks, columns to different time points, and each entry indicates a price change of a particular stock at a given time point. [sent-18, score-0.5]
9 Typically, a normalized words-documents table is used as an estimator of a words-documents joint probability distribution, where each entry estimates the probability of finding a given word in a given document, whereas the words are assumed to be independent of each other [1, 2]. [sent-20, score-0.135]
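The normalization step described above can be sketched in a few lines; a minimal illustration, where the count table is hypothetical data of our own:

```python
import numpy as np

# Hypothetical word-document count table: rows = words, columns = documents.
counts = np.array([[4, 0, 1],
                   [2, 3, 0],
                   [0, 1, 5]], dtype=float)

# Normalizing the table yields an estimate of the joint distribution
# p(word, document); each entry estimates the probability of finding
# a given word in a given document.
p_joint = counts / counts.sum()
p_word = p_joint.sum(axis=1)   # marginal over documents
p_doc = p_joint.sum(axis=0)    # marginal over words
```

By contrast, as the text notes, a matrix of stock price changes cannot be normalized into a joint distribution this way, since its entries are general (possibly negative) function values rather than counts.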
10 By contrast, values in a financial data matrix are a general function of stocks and days and in particular might include negative numerical values. [sent-21, score-0.24]
11 Here, the data cannot be regarded as a sample from a joint probability distribution of stocks and days, even if a normalization is applied. [sent-22, score-0.183]
12 Although gene expression data can be considered a repeatable experiment, very often different experimental conditions correspond to only a single column in the matrix. [sent-27, score-0.274]
13 Thus once again a single data point represents the joint statistics of three variables: gene, condition, and expression level. [sent-28, score-0.135]
14 Nevertheless, in most such cases there is a statistical relationship within rows and/or within columns of a matrix. [sent-29, score-0.29]
15 For instance, people with similar interests typically give similar ratings to movies and movies with similar characteristics give rise to similar rating profiles. [sent-30, score-0.273]
16 Such relationships can be exploited by clustering algorithms that group together similar rows and/or columns, and furthermore make it possible to complete the missing entries in the matrix [3]. [sent-31, score-0.555]
17 The existing clustering techniques can be classified into three major categories. [sent-32, score-0.119]
18 Instead of defining the clustering objective through a distortion measure or data generation process, the approach suggests using relevant variables. [sent-38, score-0.158]
19 A tradeoff between compression of the irrelevant and prediction of the relevant variables is then optimized using information theoretic principles. [sent-39, score-0.264]
20 Since the original work in [4], multiple studies have highlighted the theoretical and practical importance of the IB method, in particular in the context of cluster analysis [2, 5, 6, 7, 8]. [sent-41, score-0.099]
21 In practice a given co-occurrence table is treated as a finite sample from a joint distribution of the variables, namely the row and column indices. [sent-43, score-0.171]
22 To address this issue, in [9] a random walk over the data points is defined, serving to transform non co-occurrence data into a transition probability matrix that can be further analyzed via an IB algorithm. [sent-45, score-0.14]
23 In a more recent work [10] the suggestion is made to use the mutual information between different data points as part of a general information-theoretic treatment of the clustering problem. [sent-46, score-0.16]
24 The resulting algorithm, termed the Iclust algorithm, was demonstrated to be superior or comparable to 18 other commonly used clustering techniques over a wide range of applications where the input data cannot be interpreted as co-occurrences [10]. [sent-47, score-0.155]
25 However, both approaches have limitations – Iclust requires a sufficient number of columns in the matrix for reliable estimation of the information relations between rows, and the Markovian Relaxation algorithm involves various non-trivial data pre-processing steps [9]. [sent-48, score-0.232]
26 Here, we suggest an alternative approach, inspired by the multivariate IB framework [8]. [sent-49, score-0.116]
27 The multivariate IB principle expands the original IB work to handle situations where multiple systems of clusters are constructed simultaneously with respect to different variables and the input data may correspond to more than two variables. [sent-50, score-0.257]
28 While the multivariate IB was originally proposed for co-occurrence data, we argue here that this framework is rich enough to be rigorously applicable in the new situation. [sent-51, score-0.163]
29 The idea is simple and intuitive: we look for a compact grouping of rows and/or columns such that the product space defined by the resulting clusters is maximally informative about the matrix content, i. [sent-52, score-0.536]
30 We show that this problem can be posed and solved within the original multivariate IB framework. [sent-55, score-0.116]
31 Moreover, when missing values are present, the analysis suggests an information theoretic technique for their completion. [sent-57, score-0.104]
32 For gene expression data and financial data we obtain clusters of comparable quality (measured as coherence with manual labeling) to those obtained by state-of-the-art methods [10]. [sent-59, score-0.49]
33 For movie rating matrix completion, performance is superior to the best known alternatives in the collaborative filtering literature [11]. [sent-60, score-0.339]
34 1 Theory Problem Setting We henceforth denote the rows of a matrix by X, the columns by Y , and the matrix entries by Z (and small x, y and z for specific instances). [sent-62, score-0.532]
35 The number of rows is denoted by n, and the number of columns by m. [sent-63, score-0.29]
36 The partitions are defined by grouping of rows into clusters of rows C and grouping of columns into clusters of columns D. [sent-69, score-1.022]
37 The complexity of such partitions is measured by the sum of the weighted mutual information values nI(X; C) + mI(Y ; D). [sent-70, score-0.111]
38 For the hard partitions considered in the paper this sum is the number of bits required to describe the partition (see [12]). [sent-71, score-0.158]
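For hard partitions with uniform weight on the n rows, I(X; C) reduces to H(C), so the description length n·I(X; C) follows directly from the cluster sizes. A small sketch under that assumption (the helper name is ours):

```python
import numpy as np

def partition_bits(assign):
    """Bits needed to describe a hard partition: n * H(C).

    For a hard clustering with uniform weight 1/n on the n items,
    I(X; C) = H(C) - H(C|X) = H(C), the entropy of the cluster sizes.
    """
    assign = np.asarray(assign)
    n = len(assign)
    counts = np.bincount(assign)
    p = counts[counts > 0] / n
    entropy = -(p * np.log2(p)).sum()
    return n * entropy
```

For example, splitting four rows evenly into two clusters costs 4 · 1 = 4 bits, while a single cluster costs 0 bits; the column partition contributes m·H(D) analogously.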
39 In these terms, the goal is to find minimally complex partitions that preserve a given information level about the matrix values Z. [sent-73, score-0.13]
40 We first derive the relations between the quantities in the above optimization problem and then describe a sequential algorithm for its minimization. [sent-75, score-0.119]
41 We assume Z is a categorical variable, thus: p̂(z) = (1/N) Σ_{x,y: z(x,y)=z} 1_{x,y} = (number of entries equal to z) / (total number of populated entries). [sent-80, score-0.684]
42 The section distributions are: q̂(c, d) = (1/N) Σ_{x,y} q(c|x) q(d|y) 1_{x,y} = (number of populated entries in section (c, d)) / (total number of populated entries), and q̂(z|c, d) = [Σ_{x,y: z(x,y)=z} q(c|x) q(d|y)] / [Σ_{x,y} q(c|x) q(d|y) 1_{x,y}] = (number of entries equal to z in section (c, d)) / (number of populated entries in section (c, d)). [sent-81, score-1.099]
43 In the special case of complete data matrices 1_{x,y} is identically 1 and q̂(c, d) may be decomposed as q̂(c, d) = q̂(c) q̂(d). [sent-82, score-0.448]
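For hard assignments these empirical section statistics amount to counting populated entries per section. A hypothetical sketch (the function name and the dense-loop implementation are ours, not the paper's):

```python
import numpy as np

def section_stats(Z, mask, c_of, d_of, n_z):
    """Empirical section distributions for hard assignments.

    Z holds categorical values 0..n_z-1, mask marks populated entries
    (the indicator 1_{x,y}), c_of/d_of map rows/columns to clusters.
    Returns q(c,d) = fraction of populated entries in section (c,d),
    and q(z|c,d) = histogram of z values inside each section
    (NaN for empty sections).
    """
    c_of, d_of = np.asarray(c_of), np.asarray(d_of)
    n_c, n_d = c_of.max() + 1, d_of.max() + 1
    counts = np.zeros((n_c, n_d, n_z))
    for x, y in zip(*np.nonzero(mask)):
        counts[c_of[x], d_of[y], Z[x, y]] += 1
    N = mask.sum()
    q_cd = counts.sum(axis=2) / N
    with np.errstate(invalid='ignore'):
        q_z_cd = counts / counts.sum(axis=2, keepdims=True)
    return q_cd, q_z_cd
```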
44 2 x,y N = Sequential Optimization Given q(c|x) and q(d|y), one can calculate all the quantities defined above, and in particular the minimization functional Lmin = nI(X; C) + mI(Y ; D) − βI(C, D; Z) defined in equation (1). [sent-86, score-0.098]
45 Iterate until convergence (i.e., until step (b) makes no changes), traversing all rows x and columns y of the matrix in random order. [sent-91, score-0.35]
46 (b) Reassign it to a new cluster c∗ (or d∗ ), so that Lmin is minimized. [sent-93, score-0.099]
47 The new cluster may turn out to be the old one, in which case no change is counted. [sent-94, score-0.099]
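A naive sketch of this sequential loop for the row dimension, assuming hard assignments and recomputing the functional from scratch for each candidate cluster (an efficient implementation would update the quantities incrementally; all names here are ours):

```python
import numpy as np

def l_min(Z, mask, c_of, d_of, beta, n_c, n_d, n_z):
    """L = n*I(X;C) + m*I(Y;D) - beta*I(C,D;Z) for hard assignments.

    With uniform weights, I(X;C) = H(C) and I(Y;D) = H(D);
    I(C,D;Z) = H(Z) - H(Z|C,D), computed from section histograms
    over the populated entries.
    """
    def H(counts, total):
        p = counts[counts > 0] / total
        return -(p * np.log2(p)).sum()
    n, m = Z.shape
    L = n * H(np.bincount(c_of, minlength=n_c), n)
    L += m * H(np.bincount(d_of, minlength=n_d), m)
    joint = np.zeros((n_c, n_d, n_z))
    for x, y in zip(*np.nonzero(mask)):
        joint[c_of[x], d_of[y], Z[x, y]] += 1
    N = mask.sum()
    Hz = H(joint.sum(axis=(0, 1)), N)
    Hz_cd = 0.0
    for c in range(n_c):
        for d in range(n_d):
            Ncd = joint[c, d].sum()
            if Ncd > 0:
                Hz_cd += (Ncd / N) * H(joint[c, d], Ncd)
    return L - beta * (Hz - Hz_cd)

def sequential_step(Z, mask, c_of, d_of, beta, n_c, n_d, n_z, rng):
    """One pass: traverse rows in random order, reassigning each row
    to the cluster that minimizes the functional (columns analogous)."""
    changed = False
    for x in rng.permutation(len(c_of)):
        old = c_of[x]
        scores = []
        for c in range(n_c):
            c_of[x] = c
            scores.append(l_min(Z, mask, c_of, d_of, beta, n_c, n_d, n_z))
        best = int(np.argmin(scores))
        c_of[x] = best
        changed = changed or (best != old)
    return changed
```

Since the current cluster is always among the candidates, each reassignment can only decrease the functional, so the passes converge to a local minimum.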
48 The complexity of the algorithm is analyzed in the supplementary material, where it is shown to be O(M (n + m)|C||D|), where M is the number of iterations required for convergence (usually 10-40) and |C|, |D| are the cardinalities of the corresponding variables. [sent-98, score-0.115]
49 3 Minimal Description Length (MDL) Formulation The minimization functional Lmin has three free parameters that have to be externally determined: the tradeoff (or resolution) parameter β, and the cardinalities |C|, and |D|. [sent-100, score-0.238]
50 Although in many applications these parameters are dictated externally (e.g. the desired number of clusters), there are cases when they also require optimization (as in the example of matrix completion in the next section). [sent-103, score-0.177]
51 The idea behind MDL is that models achieving better compression of the training data - when the compression includes a model description - also achieve better generalization on the test data. [sent-105, score-0.234]
52 The following compression scheme is defined: |C| row and |D| column clusters define |C||D| sections, each receiving roughly N/(|C||D|) samples. [sent-106, score-0.277]
53 Given the partition and the distributions q̂(z|c, d), the number of bits required to code the matrix entries is N·H(Z|C, D) [12]. [sent-109, score-0.27]
54 Since H(Z|C, D) = H(Z) − I(C, D; Z) and H(Z) is constant, the latter can be omitted from the optimization, resulting in the total minimization functional Fmdl = nI(X; C) + mI(Y ; D) − N·I(C, D; Z) + (|Z||C||D|/2)·log N. [sent-111, score-0.098]
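Given the mutual-information terms, the MDL functional is a one-liner; a sketch assuming logarithms in bits (the function name and argument layout are ours):

```python
import numpy as np

def f_mdl(n, m, N, I_xc, I_yd, I_cdz, n_z, n_c, n_d):
    """MDL functional: partition description length nI(X;C) + mI(Y;D),
    minus the N*I(C,D;Z) bits saved when coding the entries given the
    sections, plus the (|Z||C||D|/2) log N model-description penalty
    for the section distributions."""
    return (n * I_xc + m * I_yd - N * I_cdz
            + 0.5 * n_z * n_c * n_d * np.log2(N))
```

Scanning |C| and |D| and picking the pair minimizing this value trades off partition complexity, model size, and compression of the entries.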
55 Multivariate IB searches for a meaningful structured partition of the data, defined by compression variables (in our case these are C and D). [sent-120, score-0.176]
56 The former specifies the relations between the input (data) variables – in our case, X, Y , and Z – and the compression variables. [sent-122, score-0.161]
57 A tradeoff between the multi-information preserved in the input structure I Gin (which we want to minimize) and the multi-information expressed by the target structure I Gout (which we want to maximize) is then optimized. [sent-124, score-0.093]
58 (The dashed link between X and Y in Gin appears when missing values are present in the matrix and may be chosen in any direction. [sent-129, score-0.164]
59 The corresponding optimization functional is: min over q(c|x), q(d|y) of I^{Gin}(X; Y; Z; C; D) − βI^{Gout}(X; Y; Z; C; D) = min over q(c|x), q(d|y) of I(X; Y) + I(X, Y; Z) + I(X; C) + I(Y; D) − βI(C, D; Z). [sent-130, score-0.104]
60 By observing that I(X, Y ; Z) and I(X; Y ) are independent of q(c|x) and q(d|y) we can eliminate them from the above optimization and obtain exactly the optimization functional defined in (1), only with equal weighting of I(X; C) and I(Y ; D). [sent-131, score-0.15]
61 Thus the approach may be seen as a special case of the multivariate IB for the graphs Gin and Gout defined in Figure 1. [sent-132, score-0.116]
62 An important distinction should be made though: unlike the multivariate IB we do not require the existence of the joint probability distribution p(x, y, z). [sent-133, score-0.165]
63 5 Relation with Information Based Clustering A recent information-theoretic approach for cluster analysis is given in [10], and is known as information based clustering, abbreviated as Iclust. [sent-136, score-0.099]
64 The table provides the coherence of the achieved solutions for Nc = 5, 10, 15 and 20 row clusters. [sent-140, score-0.148]
65 For the ESR data an average coherence according to the three GOs is shown. [sent-142, score-0.114]
66 Importantly, in order to evaluate I(Z1 ; Z2 |x1 , x2 ), Iclust requires a sufficient number of columns to be available, whereas our approach can operate with any number of columns. [sent-146, score-0.28]
67 When the data are sparse, i.e., have many missing values, evaluating I(Z1 ; Z2 |x1 , x2 ) might be prohibitive, as it requires a large enough intersection of non-missing observations for z1 and z2 . [sent-149, score-0.104]
68 By contrast, our approach is designed to cope with this kind of data and resolves the problem by simultaneously grouping rows and columns of the matrix to amplify the statistics. [sent-151, score-0.429]
69 3 Applications We first compare our algorithm to I-Clust, as it was shown to be superior/comparable to 18 other commonly used clustering techniques over a wide range of application domains [10]. [sent-152, score-0.119]
70 Another application to a small dataset is provided in the supplementary material1 . [sent-154, score-0.113]
71 The multivariate IB is not directly applicable to all the provided examples. [sent-156, score-0.116]
72 For purposes of comparison we restrict our algorithm to cluster only the rows dimension of the matrix by setting the number of column clusters, |D|, equal to the number of columns, m. [sent-159, score-0.35]
73 ) For both applications we use exactly the same setting as [10], including row-wise quantization of the input data into five equally populated bins and choosing the same values for the β parameter. [sent-162, score-0.192]
74 The first dataset consists of gene expression levels of yeast genes in 173 various forms of environmental stress [16]. [sent-163, score-0.616]
75 Previous analysis identified a group of ≈ 300 stress-induced and ≈ 600 stress-repressed genes with “nearly identical but opposite patterns of expression in response to the environmental shifts” [17]. [sent-164, score-0.287]
76 These 900 genes were termed the yeast environmental stress response (ESR) module. [sent-165, score-0.365]
77 Following [10] we cluster the genes into |C| = 5, 10, 15, and 20 clusters. [sent-166, score-0.176]
78 To evaluate the biological significance of the results we consider the coherence [18] of the obtained clusters with respect to three Gene Ontologies (GOs) [19]. [sent-171, score-0.221]
79 Specifically, the coherence of a cluster is defined as the percentage of elements within this cluster that are given an annotation that was found to be significantly enriched in the cluster [18]. [sent-172, score-0.411]
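This coherence measure can be sketched with a hypergeometric enrichment test, a common choice for GO enrichment; the exact test and significance threshold used in [18] may differ, so treat this as an illustration with names of our own:

```python
from math import comb

def enrichment_pvalue(k, K, n, N):
    """Hypergeometric tail P(X >= k): probability of seeing at least k
    annotated elements in a cluster of size n, when K of the N elements
    overall carry the annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def coherence(cluster_annots, all_annots, alpha=0.05):
    """Fraction of cluster elements carrying at least one annotation
    that is significantly enriched in the cluster.

    cluster_annots: list of annotation sets, one per cluster element;
    all_annots: the same for the whole dataset.
    """
    N, n = len(all_annots), len(cluster_annots)
    if n == 0:
        return 0.0
    terms = set().union(*cluster_annots)
    enriched = set()
    for t in terms:
        k = sum(t in a for a in cluster_annots)
        K = sum(t in a for a in all_annots)
        if enrichment_pvalue(k, K, n, N) <= alpha:
            enriched.add(t)
    return sum(bool(a & enriched) for a in cluster_annots) / n
```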
80 The results achieved by our algorithm on this dataset are comparable to the results achieved by I-Clust in all the verified settings - see Table 1. [sent-173, score-0.092]
81 The second dataset is the day-to-day fractional changes in the price of the stocks in the Standard & Poor's (S&P) 500 list, during 273 trading days of 2003. [sent-174, score-0.299]
82 As with the gene expression data we take exactly the same setting used by [10] and cluster the stocks into |C| = 5, 10, 15 and 20 clusters. [sent-175, score-0.466]
83 To evaluate the coherence of the ensuing clusters we use the Global Industry Classification Standard, which classifies companies into four different levels, organized in a hierarchical tree: sector, industry group, industry, and sub-industry. [sent-176, score-0.291]
84 As with the ESR dataset our results are comparable with the results of I-Clust for all the configurations - see Table 1. [sent-177, score-0.092]
85 2 Matrix Completion and Collaborative Filtering Here, we explore the full power of our algorithm in simultaneous grouping of rows and columns of a matrix. [sent-179, score-0.369]
86 A highly relevant application is matrix completion - given a matrix with missing values we would like to be able to complete it by utilizing similarities between rows and columns. [sent-180, score-0.484]
87 This problem is at the core of collaborative filtering applications, but may also appear in other fields. [sent-181, score-0.136]
88 The dataset consists of 100,000 ratings on a five-star scale for 1,682 movies by 943 users. [sent-183, score-0.203]
89 We stress that with this division the training data are extremely sparse - only 5% of the training matrix entries are populated, whereas 95% of the values are missing. [sent-185, score-0.253]
90 To find a “good” bi-clustering of the ratings matrix, minimization of Fmdl defined in (2) is done by scanning cluster cardinalities |C| and |D| and optimizing Lmin as defined in (3) for each fixed pair of |C|, |D|. [sent-186, score-0.302]
91 To convert the distributions q̂(z|c, d) obtained in our clustering procedure into concrete predictions, we take the median of the z values within each section (c, d). [sent-191, score-0.119]
92 The root mean squared error (RMSE) measured for the same clustering with a mean of z values within each section c, d taken for prediction yields 0. [sent-198, score-0.156]
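The median-per-section prediction rule and the MAE/RMSE evaluation can be sketched as follows; the function name and the fallback to the global median for empty sections are our assumptions:

```python
import numpy as np

def predict_and_score(Z, train_mask, test_mask, c_of, d_of):
    """Complete missing entries from a hard bi-clustering: collect the
    training z values in each section (c,d), predict the per-section
    median, and report MAE and RMSE on the held-out entries."""
    sections = {}
    for x, y in zip(*np.nonzero(train_mask)):
        sections.setdefault((c_of[x], d_of[y]), []).append(Z[x, y])
    global_med = np.median([Z[x, y] for x, y in zip(*np.nonzero(train_mask))])
    errs = []
    for x, y in zip(*np.nonzero(test_mask)):
        vals = sections.get((c_of[x], d_of[y]))
        pred = np.median(vals) if vals else global_med
        errs.append(pred - Z[x, y])
    errs = np.array(errs, dtype=float)
    return np.abs(errs).mean(), np.sqrt((errs ** 2).mean())
```

Replacing `np.median` with `np.mean` gives the mean-based variant whose RMSE is quoted in the text.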
93 We also improve on the best known results for the prediction of missing values (collaborative filtering). [sent-208, score-0.141]
94 Possible directions for further research include generalization to continuous data values, such as those obtained in gene expression and stock price data, and relaxation of the algorithm to "soft" clustering solutions. [sent-218, score-0.546]
95 The proposed framework also provides a natural platform for derivation of generalization bounds for missing values prediction that will be discussed elsewhere. [sent-220, score-0.141]
96 Document clustering using word clusters via the information bottleneck method. [sent-222, score-0.431]
97 Data clustering by markovian relaxation and the information bottleneck method. [sent-249, score-0.328]
98 Genomic expression programs in the response of yeast cells to environmental changes. [sent-296, score-0.303]
99 The environmental stress response: a common yeast response to environmental stresses. [sent-302, score-0.374]
100 Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. [sent-312, score-0.233]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
2 0.15674938 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
3 0.14669651 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis
Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. The modelling assumption in such a case is that movies come in à types and viewers in Ä types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) 
We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . (B) BMF shown pictorally. (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. 2 BMF model description Binary matrix factorization is a model of an Á ¢  dyadic data matrix with exchangeable rows and columns. The entries of can be real-valued, binary, or categorial; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector ; similarly each column has an unobserved binary vector . The primary parameters are represented of interaction weights. 
is generated by a fixed observation process ´¡µ applied by a matrix (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data: Ù Ú Ï ÍÎÏ ´ÍÏÎ ¢µ (1) where ¢ are extra parameters specific to the model variant. Three possible parametric forms for and covariance ´½ µ ; the noise (observation) distribution are: Gaussian, with mean logistic, with mean ½ ´½ · ÜÔ´ µµ; and Poisson, with mean (and variance) . Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5]. ÍÏÎ ÍÏÎ ÍÏÎ Á To complete the description of the model, we need to specify prior distributions over the feature matrices and the weights . We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices with Á rows and à columns, we generate a bias independently for each column using a Beta prior (denoted ) and then conditioned on this bias generate the entries in column independently from a Bernoulli with mean . ÍÎ Ï «Ã Í È Á Í ´« à ¬ µ à ½ ½ Ù « ´½ µ½ Ù « « à ½ ´ Ò « «µ ´½ µÁ Ò where Ò Ù . The hyperprior on the concentration « is a Gamma distribution (denoted ), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number Ò of active features in each column. An identical prior is used on , with  rows and Ä columns, but with different concentration prior . The variable ¬ was set to ½ for all experiments. Î The appropriate prior distribution over weights depends on the observation distribution is a matrix normal with prior mean the linear-Gaussian variant, a convenient prior on Ï ´¡µ. For ÏÓ and µ Á. 
covariance ´½ hyperpriors: The scale of the weights and output precision Ï ÏÓ (if needed) have Gamma Æ ´ÏÓ ´½ µ Áµ ´ µ ´ µ In certain cases, when the prior on the weights is conjugate to the output distribution model , the weights may be analytically integrated out, expressing the marginal distribution of the data only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. ÍÎ Í Î Remarkably, the Beta-Bernoulli prior distribution over (and similarly ) can easily be extended to the case where à ½, creating a distribution over binary matrices with a fixed number Á of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP) was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version. 3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov Chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples. Î 3.1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. 
The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò on « à and Í Î ½ Í Î Ï µ ´ « à · Ò µ È ´ Í Ù ½ Î Ï µ (2) ¼ Í Î Ï µ ´¬ · ´Á ½µ Ò µ È ´ Í Ù ¼ Πϵ (3) È Ù , Í excludes entry , and is a normalizing constant. (Conditioning is implicit.) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . (Note that this is not the case when the weights are integrated out.) È ½) and This ratio is a simple function of the model’s predictions Ü· Ð Ù Ú Ð Û Ð (when Ù È Ü Ù Ú Ð Û Ð (when Ù ¼). In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í Î Ï Í Î Ï µ µ ÐÓ ´« à · Ò µ ¬ · ´Á ½µ Ò ´ µ ½ ¾ ¢ Ü ´ Ü· µ¾ ´Ü Ü µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a “vectorized” representation of our variables. Let be an Á column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a Á ¢ ÃÄ binary matrix which is the kronecker product ª . (In “Matlab notation”, ´µ ´ µ and ÖÓÒ´ µ.) In this notation, the data distribution is written as: Æ´ ´½ µ µ. Given values for and , samples can be drawn for , , and using the following posterior distributions (where conditioning on Ó is implicit): Æ ´ · µ ½ ´ · Óµ ´ · µ ½ Ï Î Í Û Ü Û Û Ü Ï Ü Ü Û Û Ï Û Á Û ÎÍ Á Ü Û Í Á Î Û Û Ü · ÃÄ ¾ · Á ¾ · ½ ´Û ÛÓ µ ´Û ÛÓ µ ¾ ½ ´Ü Ûµ ´Ü Ûµ ·¾ Note that we do not have to explicitly compute the matrix . For computing the posterior of linearGaussian weights, the matrix can be computed as ÖÓÒ´ µ. Similarly, the expression is constructed by computing and taking the elements column-wise. 
3.2 Infinite binary latent feature matrices

One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case.

To derive these updates (e.g. for row i of the matrix U), it is useful to consider partitioning the columns of U into two sets as shown below. Let set A have at least one non-zero entry in rows other than i. Let set B be all other columns, including the set of columns where the only non-zero entries are found in row i, and the countably infinite number of all-zero columns.

          set A       set B
        0 1 0 0 1 | 0 0 0 0 0 ...
        0 0 1 0 0 | 0 0 0 0 0 ...
        1 1 0 0 1 | 0 0 0 0 0 ...
        1 0 0 1 1 | 0 0 0 0 0 ...
  row i 1 1 0 0 1 | 0 1 0 1 0 ...
        0 1 0 0 0 | 0 0 0 0 0 ...
        0 0 0 1 0 | 0 0 0 0 0 ...
        1 0 0 0 1 | 0 0 0 0 0 ...

Sampling values for elements in row i of set A given everything else is straightforward, and involves Gibbs updates almost identical to those in the finite case handled by equations (2) and (3); as K → ∞, for columns in set A we get:

P(u_ik = 1 | U_−ik, V, W, X) ∝ n_−i,k P(X | u_ik = 1, U_−ik, V, W)    (4)
P(u_ik = 0 | U_−ik, V, W, X) ∝ (β + I − 1 − n_−i,k) P(X | u_ik = 0, U_−ik, V, W)    (5)

When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries n_i in set B which will be turned on in row i. Sampling the number of entries set to 1 can be done with Metropolis-Hastings updates. Let J(n_i' | n_i) = Poisson(n_i' | α/(β + I − 1)) be the proposal distribution for a move which replaces the current n_i active entries with n_i' active entries in set B. The reverse proposal is J(n_i | n_i').
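A sketch of the two ingredients just described — the set-A conditional probability of equations (4)-(5) and the Poisson proposal over the number of new set-B features — with the likelihood terms passed in as opaque values (the helper names here are illustrative, not from the paper):

```python
import numpy as np

def set_a_prob_on(n_minus_i_k, beta, I, lik_on, lik_off):
    """P(u_ik = 1) for a set-A column under equations (4)-(5): the prior
    weights n_{-i,k} and (beta + I - 1 - n_{-i,k}) are each multiplied by
    the data likelihood of the corresponding setting, then normalized."""
    on = n_minus_i_k * lik_on
    off = (beta + I - 1 - n_minus_i_k) * lik_off
    return on / (on + off)

def propose_num_new_features(alpha, beta, I, rng):
    """Metropolis-Hastings proposal for the number of set-B features active
    in row i: draw n' from Poisson(alpha / (beta + I - 1))."""
    return int(rng.poisson(alpha / (beta + I - 1)))
```

With equal likelihoods the update reduces to the prior: for example, with β = 1, I = 10 and n_−i,k = 3 the probability of keeping the feature on is 3/10, and a feature active in no other row (n_−i,k = 0) is switched off with certainty, which is exactly why set-B columns need the separate birth proposal.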
The acceptance probability is min(1, r_{n_i → n_i'}), where

r_{n_i → n_i'} = [ P(n_i' | X) J(n_i | n_i') ] / [ P(n_i | X) J(n_i' | n_i) ]
             = [ P(X | n_i') Poisson(n_i' | α/(β + I − 1)) J(n_i | n_i') ] / [ P(X | n_i) Poisson(n_i | α/(β + I − 1)) J(n_i' | n_i) ]
             = P(X | n_i') / P(X | n_i)    (6)

since the Poisson prior terms cancel against the Poisson proposal terms. This assumes a conjugate situation in which the weights W are explicitly integrated out of the model to compute the marginal likelihood P(X | n_i). In the non-conjugate case, a more complicated proposal is required. Instead of proposing n_i', we jointly propose n_i' and associated feature parameters w' from their prior distributions. In the linear-Gaussian model, where w' is a set of weights for features in set B, the proposal distribution is:

J(n_i', w' | n_i, w) = Poisson(n_i' | α/(β + I − 1)) Normal(w' | n_i')    (7)

We actually need to sample only the finite portion of w' where u_ik = 1. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

r_{(n_i, w) → (n_i', w')} = P(X | n_i', w') / P(X | n_i, w)    (8)

3.3 Faster mixing transition proposals

The Gibbs updates described above for the entries of U, V and W are the simplest moves we could make in a Markov chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row i simultaneously. For example, we can propose new values for several columns in row i of matrix U by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row i and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices.
Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two non-zero entries in U are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of U to increase the acceptance probability.

3.4 Predictions

A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices U and V, and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, non-linear function of the observed variables.
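This averaging of per-iteration Gaussian predictive distributions amounts to forming an equally weighted mixture over MCMC samples. A minimal sketch with hypothetical per-sample means and variances:

```python
import numpy as np

def mixture_predictive(means, variances, x_grid):
    """Average the per-iteration Gaussian predictive densities from S MCMC
    samples into one (mixture) predictive density evaluated on x_grid."""
    means = np.asarray(means, dtype=float)[:, None]
    variances = np.asarray(variances, dtype=float)[:, None]
    dens = np.exp(-0.5 * (x_grid[None, :] - means) ** 2 / variances)
    dens /= np.sqrt(2.0 * np.pi * variances)
    return dens.mean(axis=0)      # equal weight on each posterior sample

# Two hypothetical posterior samples predicting means 0 and 2 with unit variance
means, variances = [0.0, 2.0], [1.0, 1.0]
grid = np.linspace(-10.0, 12.0, 4001)
p = mixture_predictive(means, variances, grid)
```

The mixture mean is the average of the per-sample means (here 1.0), but the mixture itself can be multimodal, which is exactly why the averaged predictive captures posterior uncertainty that any single Gaussian sample would miss.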
By averaging predictive distributions, our algorithm implicitly integrates over U, V, and W. In our experiments, we show samples from the posteriors of U and V to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of the concentration parameters (higher values will result in more features) and precision values (higher weight precision results in less variation in weights).

4 Experiments

4.1 Modified "bars" problem

A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. The data consist of I vectors, each of which can be reshaped into a square image. The generation process is as follows: since V has the same number of rows as the dimension of the images, V is fixed to be a set of vertical and horizontal bars (when reshaped into an image). U is sampled from the IBP, and the global precisions σ_w and σ_x are set to fixed values. The weights W are sampled from zero-mean Gaussians. Model estimates of U and V were initialized from an IBP prior.

In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of U, V, and W.
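A sketch of how such bars data could be generated, with illustrative image size, noise, and sparsity values, and with independent Bernoulli features standing in for the IBP draw of U:

```python
import numpy as np

def make_bars_data(I=200, side=6, sigma_x=2.0, alpha=2.0, seed=0):
    """Generate data for the modified bars problem (a sketch; the image
    size, precision and concentration values here are illustrative).
    V is fixed to vertical and horizontal bars; U is drawn from a simple
    Bernoulli stand-in for the IBP; W is zero-mean Gaussian."""
    rng = np.random.default_rng(seed)
    J = side * side                       # pixels per image
    L = 2 * side                          # one column feature per bar
    V = np.zeros((J, L), dtype=int)
    for b in range(side):
        img = np.zeros((side, side), dtype=int)
        img[:, b] = 1                     # vertical bar b
        V[:, b] = img.flatten(order="F")
        img = np.zeros((side, side), dtype=int)
        img[b, :] = 1                     # horizontal bar b
        V[:, side + b] = img.flatten(order="F")
    K = L
    U = (rng.random((I, K)) < alpha / K).astype(int)   # Bernoulli stand-in for the IBP
    W = rng.standard_normal((K, L))
    X = U @ W @ V.T + rng.standard_normal((I, J)) / np.sqrt(sigma_x)
    return X, U, V, W
```

Labeling examples "missing" then amounts to masking the bottom half of each returned row before handing X to the sampler.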
Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

Figure 3: Bars features. The top row shows the values of V and WV used to generate the data. The second row shows a sample of V and WV from the Markov chain. WV can be thought of as a set of basis images which can be added together with binary coefficients (U) to create images.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of V and WV along with one sample of those basis features from the Markov chain. Because the data are generated by adding together multiple basis images, multiple bars appear in each basis image shown on the right of Figure 3. This is reflected in the captured features. The learned V are fairly similar to the generating V, but the former are composed of overlapping bar structure (learned WV).

4.2 Digits

In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model.
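For instance, in the logistic variant the reconstruction mean is the elementwise sigmoid of UWV, and a "mode" reconstruction thresholds P(x_ij = 1) at 0.5. A minimal sketch with made-up matrices:

```python
import numpy as np

def logistic_reconstruction(U, W, V):
    """Mean and mode of a logistic-BMF reconstruction: the mean is the
    elementwise sigmoid of U W V', and the mode thresholds P(x=1) at 0.5."""
    eta = U @ W @ V.T
    mean = 1.0 / (1.0 + np.exp(-eta))
    mode = (mean > 0.5).astype(int)
    return mean, mode

# Tiny illustrative example (matrices are made up, not from the paper)
U = np.array([[1, 0], [1, 1]])
V = np.array([[1, 0], [0, 1], [1, 1]])
W = np.array([[2.0, -2.0], [0.0, 4.0]])
mean, mode = logistic_reconstruction(U, W, V)
```

Here the (1,3) entry has logit exactly 0, so its predictive mean is 0.5 and the strict threshold assigns it mode 0.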
We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train the logistic BMF with 100 examples each of digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (P(x_ij = 1) > 0.5) of the BMF reconstruction. In the bottom row we show the nearest neighbours, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data cases which have each feature in U turned on. It is clear that some features have distinct digit forms and others are overlapping. In row G, the basis images WV are shown. By adjusting which features are non-zero in each row of U, images are composed by adding basis images together. Finally, in row H we show V. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are K features in rows F and G, and L features in row H.

Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The basis images WV reshaped into the form of digits.
By adding these basis images together, as dictated by the binary features in U, reconstructions of the digits are possible. (H) V reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data

Gene expression data can exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for the analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix W is diagonal and the number of binary features is fixed.

Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consist of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in U and V were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of U or V with knowledge gained from experiments or the literature, and to allow the model to add new features that help explain the data in more detail.

5 Conclusion

We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data.
The model explains each row and column entity using a componential code composed of multiple binary latent features, along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations. This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

Figure 5: Gene expression results. (A) The top-left is X sorted according to contiguous features in the final U and V in the Markov chain. The bottom-left is V and the top-right is U. The bottom-right is W. (B) The same as (A), but showing the expected value of X, UWV. We have highlighted regions that have both u_ik and v_jl on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.

References

[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E.
S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.
132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis
Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.

1 Distributed representations for dyadic data

One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do "bi-clustering" (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns [6, 13, 9]. The modelling assumption in such a case is that movies come in K types and viewers in L types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive — their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of "dramas" or "comedies"; a viewer as a "single male" or as a "young mother".)
We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of K binary features; similarly each attribute (column) has a hidden vector of L binary features. Knowing the features of the item and the features of the attribute is sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) matrix X into (a distribution defined by) the product UWV, where U and V are binary feature matrices and W is a real-valued weight matrix. Below, we develop this binary matrix factorization (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights.

Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameters and the Beta weights for the columns of U and V are represented by the symbols α and θ. (B) BMF shown pictorially.

2 BMF model description

Binary matrix factorization is a model of an I × J dyadic data matrix X with exchangeable rows and columns. The entries of X can be real-valued, binary, or categorical; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector u_i; similarly each column has an unobserved binary vector v_j. The primary parameters are represented by a matrix W of interaction weights. The data matrix X
is generated by a fixed observation process f(·) applied (elementwise) to the linear inner product of the features and weights, which is the "factorization" or approximation of the data:

X ~ f(UWV', Θ)    (1)

where Θ are extra parameters specific to the model variant. Three possible parametric forms for the noise (observation) distribution f are: Gaussian, with mean UWV' and covariance (1/σ_x)I; logistic, with mean 1/(1 + exp(−UWV')); and Poisson, with mean (and variance) UWV'. Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5].

To complete the description of the model, we need to specify prior distributions over the feature matrices U, V and the weights W. We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices with I rows and K columns, we generate a bias θ_k independently for each column k using a Beta prior,

θ_k ~ Beta(α/K, β),

and then, conditioned on this bias, generate the entries in column k independently from a Bernoulli with mean θ_k:

u_ik | θ_k ~ Bernoulli(θ_k).

The hyperprior on the concentration α is a Gamma distribution Gamma(a, b), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases θ_k are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number n_k = Σ_i u_ik of active features in each column. An identical prior is used on V, with J rows and L columns, but with a different concentration prior. The variable β was set to 1 for all experiments.

The appropriate prior distribution over weights depends on the observation distribution f(·). For the linear-Gaussian variant, a convenient prior on W is a matrix normal with prior mean W_o and
covariance ´½ hyperpriors: The scale of the weights and output precision Ï ÏÓ (if needed) have Gamma Æ ´ÏÓ ´½ µ Áµ ´ µ ´ µ In certain cases, when the prior on the weights is conjugate to the output distribution model , the weights may be analytically integrated out, expressing the marginal distribution of the data only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. ÍÎ Í Î Remarkably, the Beta-Bernoulli prior distribution over (and similarly ) can easily be extended to the case where à ½, creating a distribution over binary matrices with a fixed number Á of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP) was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version. 3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov Chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples. Î 3.1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. 
The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò on « à and Í Î ½ Í Î Ï µ ´ « à · Ò µ È ´ Í Ù ½ Î Ï µ (2) ¼ Í Î Ï µ ´¬ · ´Á ½µ Ò µ È ´ Í Ù ¼ Πϵ (3) È Ù , Í excludes entry , and is a normalizing constant. (Conditioning is implicit.) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . (Note that this is not the case when the weights are integrated out.) È ½) and This ratio is a simple function of the model’s predictions Ü· Ð Ù Ú Ð Û Ð (when Ù È Ü Ù Ú Ð Û Ð (when Ù ¼). In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í Î Ï Í Î Ï µ µ ÐÓ ´« à · Ò µ ¬ · ´Á ½µ Ò ´ µ ½ ¾ ¢ Ü ´ Ü· µ¾ ´Ü Ü µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a “vectorized” representation of our variables. Let be an Á column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a Á ¢ ÃÄ binary matrix which is the kronecker product ª . (In “Matlab notation”, ´µ ´ µ and ÖÓÒ´ µ.) In this notation, the data distribution is written as: Æ´ ´½ µ µ. Given values for and , samples can be drawn for , , and using the following posterior distributions (where conditioning on Ó is implicit): Æ ´ · µ ½ ´ · Óµ ´ · µ ½ Ï Î Í Û Ü Û Û Ü Ï Ü Ü Û Û Ï Û Á Û ÎÍ Á Ü Û Í Á Î Û Û Ü · ÃÄ ¾ · Á ¾ · ½ ´Û ÛÓ µ ´Û ÛÓ µ ¾ ½ ´Ü Ûµ ´Ü Ûµ ·¾ Note that we do not have to explicitly compute the matrix . For computing the posterior of linearGaussian weights, the matrix can be computed as ÖÓÒ´ µ. Similarly, the expression is constructed by computing and taking the elements column-wise. 
Ü Î ÎÍ Í Í Î 3.2 Infinite binary latent feature matrices One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row of the matrix ), it is useful to consider partitioning the columns of into two sets as shown below. Let set A have at least one non-zero entry in rows other than . Let set B be all other set A set B columns, including the set of columns where 0 1 0 0 1 0 0 0 0 0 ¡¡¡ the only non-zero entries are found in row 0 0 1 0 0 0 0 0 0 0 ¡¡¡ and the countably infinite number of all-zero 1 1 0 0 1 0 0 0 0 0 ¡¡¡ 1 0 0 1 1 0 0 0 0 0 ¡¡¡ columns. Sampling values for elements in row 1 1 0 0 1 0 1 0 1 0 row of set A given everything else is straightfor0 1 0 0 0 0 0 0 0 0 ¡¡¡ ward, and involves Gibbs updates almost iden0 0 0 1 0 0 0 0 0 0 ¡¡¡ tical to those in the finite case handled by equa1 0 0 0 1 0 0 0 0 0 ¡¡¡ tions (2) and (3); as à ½ and in set A we get: Í Í ½ Í Î Ï µ ¼ Í Î Ï µ È ´Ù È ´Ù ¡ Ò È ´ Í Ù ½ Î Ï µ ¡ ´¬ · Á ½ Ò µ È ´ Í Ù ¼ Πϵ (4) (5) When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries Ò in set B which will be turned on in row . Sampling the number of entries set to ½ can be done with Metropolis-Hastings updates. Let  ´Ò Ò µ Poisson ´Ò « ´¬ · Á ½µµ be the proposal distribution for a move which replaces the current Ò active entries with Ò active entries in set B. The reverse proposal is  ´Ò Ò µ. 
The acceptance probability is min(1, r_{n_B → n_B′}), where

r_{n_B → n_B′} = [ P(n_B′ | X) J(n_B | n_B′) ] / [ P(n_B | X) J(n_B′ | n_B) ]
             = [ P(X | n_B′) Poisson(n_B′ | α/(β + I − 1)) J(n_B | n_B′) ] / [ P(X | n_B) Poisson(n_B | α/(β + I − 1)) J(n_B′ | n_B) ]
             = P(X | n_B′) / P(X | n_B)    (6)

This assumes a conjugate situation in which the weights W are explicitly integrated out of the model to compute the marginal likelihood P(X | n_B). In the non-conjugate case, a more complicated proposal is required. Instead of proposing n_B′ alone, we jointly propose n_B′ and associated feature parameters w′ from their prior distributions. In the linear-Gaussian model, where w′ is a set of weights for features in set B, the proposal distribution is:

J(n_B′, w′ | n_B, w) = Poisson(n_B′ | α/(β + I − 1)) Normal(w′ | w_o, (1/τ_w) I)    (7)

We actually need to sample only the finite portion of w′ where u_ik = 1. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

r_{n_B,w → n_B′,w′} = P(X | n_B′, w′) / P(X | n_B, w)    (8)

3.3 Faster mixing transition proposals

The Gibbs updates described above for the entries of U, V and W are the simplest moves we could make in a Markov chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row i simultaneously. For example, we can propose new values for several columns in row i of matrix U by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row i and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices.
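The conjugate set-B move of Eq. (6) can be sketched as follows, using only the standard library. Here `loglik(n)` stands for the log marginal likelihood log P(X | n) with the weights integrated out, and the names are ours. Because the proposal is the Poisson prior itself, the prior and proposal terms cancel and only the likelihood ratio remains.

```python
import math, random

def sample_num_new_features(n_cur, loglik, alpha, beta, I, rng=random):
    """One Metropolis-Hastings update of n_B, the number of set-B
    features switched on in row i (conjugate case, Eq. 6)."""
    rate = alpha / (beta + I - 1)
    # Draw the Poisson(rate) proposal by inversion (stdlib only).
    u, p, n_prop = rng.random(), math.exp(-rate), 0
    cdf = p
    while u > cdf:
        n_prop += 1
        p *= rate / n_prop
        cdf += p
    # Accept with probability min(1, likelihood ratio).
    log_r = loglik(n_prop) - loglik(n_cur)
    if log_r >= 0 or rng.random() < math.exp(log_r):
        return n_prop
    return n_cur
```

In the non-conjugate move of Eqs. (7)-(8), the same skeleton applies, but the proposal additionally draws the new feature weights from their prior and the ratio uses the conditional data likelihood.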
Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two non-zero entries in U are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of U to increase the acceptance probability.

3.4 Predictions

A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices U and V, and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, non-linear function of the observed variables.
By averaging predictive distributions, our algorithm implicitly integrates over U and V. In our experiments, we show samples from the posteriors of U and V to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of α and β (higher values will result in more features) and on the precision values (higher weight precision results in less variation in weights).

4 Experiments

4.1 Modified "bars" problem

A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. The data consist of I vectors, each of which can be reshaped into a square image. The generation process is as follows: since V has the same number of rows as the dimension of the images, V is fixed to be a set of vertical and horizontal bars (when reshaped into an image). U is sampled from the IBP, and the global precisions are set to fixed constants. The weights W are sampled from zero-mean Gaussians. Model estimates of U and V were initialized from an IBP prior.

In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of U, V, and W.
Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars was removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

Figure 3: Bars features. The top row shows the values of V and WV used to generate the data. The second row shows a sample of V and WV from the Markov chain. WV can be thought of as a set of basis images which can be added together with binary coefficients (U) to create images.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of V and WV along with one sample of those basis features from the Markov chain. Because the data are generated by adding together multiple of the basis images shown on the right of Figure 3, multiple bars are used in each image. This is reflected in the captured features. The learned WV are fairly similar to the generating WV, but the former are composed of overlapping bar structure (learned V).

4.2 Digits

In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model.
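A minimal sketch of what such an alternative observation model might look like: a logistic-likelihood BMF for binary data, where the linear predictor (U W Vᵀ)_ij is passed through a sigmoid. This is our illustration, not the paper's code; the mean reconstruction is the sigmoid probability and the mode reconstruction thresholds it at 0.5.

```python
import numpy as np

def logistic_bmf_mean_mode(U, W, V):
    """Mean and mode reconstructions under a logistic-likelihood BMF:
    P(x_ij = 1) = sigmoid((U W V^T)_ij)."""
    eta = U @ W @ V.T
    p = 1.0 / (1.0 + np.exp(-eta))       # mean reconstruction
    return p, (p > 0.5).astype(int)      # mode thresholds at 0.5
```

The Gibbs and Metropolis-Hastings machinery for U and V carries over unchanged; only the likelihood ratios inside the updates use this Bernoulli model instead of the Gaussian one.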
We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train the logistic BMF with 100 examples each of the digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (P(x_ij = 1) > 0.5) of the BMF reconstruction. In the bottom row we show the nearest neighbours, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in U on. It is clear that some features have distinct digit forms and others are overlapping. In row G, the basis images WV are shown. By adjusting the features that are non-zero in each row of U, images are composed by adding basis images together. Finally, in row H we show V. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are K features in rows F and G, and L features in row H.

Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits was removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data, shown in the bottom half, found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The features WV reshaped in the form of digits. By adding these features together, as the indicators U do, reconstructions of the digits are possible. (H) V reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data

Gene expression data can exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for the analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix W is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consist of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in X were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of U or V with knowledge gained from experiments or the literature, and to allow the model to add new features that help explain the data in more detail.

5 Conclusion

We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data.
The model explains each row and column entity using a componential code composed of multiple binary latent features, along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations. This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

Figure 5: Gene expression results. (A) The top-left is X, sorted according to contiguous features in the final U and V in the Markov chain. The bottom-left is U and the top-right is WV. (B) The same as (A), but showing the expected value of X, UWV. We have highlighted regions that have both u_ik and v_jl on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.

References

[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.
3 0.58886051 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
4 0.5589081 158 nips-2006-PG-means: learning the number of clusters in data
Author: Yu Feng, Greg Hamerly
Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1
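To make the projection idea in this abstract concrete, here is a stdlib-only sketch of the core test: project the data (and each model component) onto a direction and compare the empirical distribution against the projected mixture CDF via a Kolmogorov-Smirnov distance. We assume spherical components with scalar variances and a unit-norm direction for simplicity; PG-means' actual test statistics and acceptance thresholds differ, so this is an illustration, not the published algorithm.

```python
import math

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between a 1-D sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(i / n - cdf(x), cdf(x) - (i - 1) / n)
               for i, x in enumerate(xs, 1))

def projected_mixture_cdf(weights, means, variances, direction):
    """CDF of a Gaussian mixture projected onto a unit direction.
    A spherical component N(m, v I) projects to N(direction . m, v)."""
    mus = [sum(d * mi for d, mi in zip(direction, m)) for m in means]
    def cdf(x):
        return sum(w * 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * v)))
                   for w, mu, v in zip(weights, mus, variances))
    return cdf
```

A large KS distance on some projection is evidence that the current mixture, and hence the current number of clusters, does not fit the data.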
5 0.48148477 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
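For reference, the "local to global" pipeline this abstract critiques can be sketched in a few lines of NumPy. The naming is ours, and real spectral clustering implementations typically add row normalization of the embedding and a k-means step on it.

```python
import numpy as np

def spectral_embedding(S, k):
    """Top-k eigenvectors of the normalized similarity matrix
    D^{-1/2} S D^{-1/2}, built only from local pairwise similarities;
    each row of the result is the spectral embedding of one point."""
    d = S.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    M = Dinv @ S @ Dinv
    vals, vecs = np.linalg.eigh(M)       # eigh returns ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]
```

The abstract's argument is precisely that these first few global eigenvectors cannot separate structures living at different scales, however the local similarities S are chosen.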
6 0.45407626 7 nips-2006-A Local Learning Approach for Clustering
7 0.41510949 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
8 0.41136375 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
9 0.38316405 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
10 0.38090852 108 nips-2006-Large Scale Hidden Semi-Markov SVMs
11 0.37874609 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
12 0.37341633 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
13 0.35851753 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
14 0.35708499 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
15 0.35298011 4 nips-2006-A Humanlike Predictor of Facial Attractiveness
16 0.3459585 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
17 0.3390685 166 nips-2006-Recursive Attribute Factoring
18 0.3388786 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
19 0.33692893 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
20 0.33358732 174 nips-2006-Similarity by Composition
topicId topicWeight
[(1, 0.095), (3, 0.027), (7, 0.06), (9, 0.06), (12, 0.01), (20, 0.053), (22, 0.047), (25, 0.011), (44, 0.071), (54, 0.318), (57, 0.068), (65, 0.036), (69, 0.03), (71, 0.014), (90, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.76043302 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby
Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1
2 0.6904363 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
3 0.49361372 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1
4 0.4783695 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
5 0.47467959 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.47378483 65 nips-2006-Denoising and Dimension Reduction in Feature Space
7 0.47078732 158 nips-2006-PG-means: learning the number of clusters in data
8 0.47018346 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
9 0.46751541 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
10 0.46719092 175 nips-2006-Simplifying Mixture Models through Function Approximation
11 0.46667877 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
12 0.46503663 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
13 0.46437663 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
14 0.46340212 154 nips-2006-Optimal Change-Detection and Spiking Neurons
15 0.46287513 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
16 0.46266469 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
17 0.46160275 117 nips-2006-Learning on Graph with Laplacian Regularization
18 0.46148828 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements
19 0.46061048 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
20 0.45999962 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights