
132 nips-2006-Modeling Dyadic Data with Binary Latent Factors


Source: pdf

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.

1 Distributed representations for dyadic data

One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do "bi-clustering" (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns [6, 13, 9]. The modelling assumption in such a case is that movies come in K types and viewers in L types, and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive: their major disadvantage is that they do not admit a componential or distributed representation, because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of "dramas" or "comedies"; a viewer as a "single male" or as a "young mother".) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum.

In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of K binary features; similarly, each attribute (column) has a hidden vector of L binary features. Knowing the features of the item and the features of the attribute is sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) matrix X into (a distribution defined by) the product UWV', where U and V are binary feature matrices and W is a real-valued weight matrix. Below, we develop this binary matrix factorization (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights.

Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameters and the Beta weights for the columns of U and V are represented by the symbols alpha and pi. (B) BMF shown pictorially.
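As a concrete illustration of the factorization just described, the following sketch generates a data matrix from the linear-Gaussian BMF form X = UWV' + noise. It is not taken from the paper; the dimensions, feature probabilities, and noise precision are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J = 50, 40   # number of row entities (items) and column entities (attributes)
K, L = 3, 4     # number of binary features per row and per column (assumed here)

# Binary feature matrices; a fixed Bernoulli(0.3) stands in for the
# Beta-Bernoulli / IBP priors introduced in Section 2.
U = (rng.random((I, K)) < 0.3).astype(float)
V = (rng.random((J, L)) < 0.3).astype(float)

# Real-valued interaction weights and elementwise Gaussian observation noise.
W = rng.normal(0.0, 1.0, size=(K, L))
theta_x = 10.0  # illustrative output precision
X = U @ W @ V.T + rng.normal(0.0, 1.0 / np.sqrt(theta_x), size=(I, J))
```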
2 BMF model description

Binary matrix factorization is a model of an I x J dyadic data matrix X with exchangeable rows and columns. The entries of X can be real-valued, binary, or categorical; BMF models suitable for each type are described below. Associated with each row i is a latent binary feature vector u_i; similarly, each column j has an unobserved binary vector v_j. The primary parameters are represented by a matrix W of interaction weights. X is generated by a fixed observation process f(.) applied (elementwise) to the linear inner product of the features and weights, which is the "factorization" or approximation of the data:

    X ~ f(UWV', Theta)    (1)

where Theta are extra parameters specific to the model variant. Three possible parametric forms for the noise (observation) distribution f are: Gaussian, with mean UWV' and covariance (1/theta_x)I, where theta_x denotes the output precision; logistic, with mean 1/(1 + exp(-UWV')); and Poisson, with mean (and variance) UWV'. Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5].

To complete the description of the model, we need to specify prior distributions over the feature matrices U, V and the weights W. We adopt the same priors over binary matrices as previously described in [5]. For finite-sized matrices U with I rows and K columns, we generate a bias pi_k independently for each column k using a Beta prior (denoted Beta(alpha/K, beta)) and then, conditioned on this bias, generate the entries in column k independently from a Bernoulli with mean pi_k. Integrating out the biases gives

    P(U | alpha, K, beta) = prod_{k=1}^{K} int prod_{i=1}^{I} pi_k^{u_ik} (1 - pi_k)^{1 - u_ik} Beta(pi_k; alpha/K, beta) dpi_k
                          = prod_{k=1}^{K} B(n_k + alpha/K, I - n_k + beta) / B(alpha/K, beta),

where n_k = sum_i u_ik and B(., .) is the Beta function. The hyperprior on the concentration alpha is a Gamma distribution (denoted Gamma(., .)), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number n_k of active features in each column. An identical prior is used on V, with J rows and L columns, but with a different concentration prior. The variable beta was set to 1 for all experiments.

The appropriate prior distribution over the weights depends on the observation distribution f(.). For the linear-Gaussian variant, a convenient prior on W is a matrix normal with prior mean W_0 and covariance (1/theta_w)I. The scale of the weights and the output precision (if needed) have Gamma hyperpriors:

    W ~ N(W_0, (1/theta_w)I),    theta_w ~ Gamma(a_w, b_w),    theta_x ~ Gamma(a_x, b_x)

In certain cases, when the prior on the weights is conjugate to the output distribution model, the weights W may be analytically integrated out, expressing the marginal distribution of the data X only in terms of the binary features U and V. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process.

Remarkably, the Beta-Bernoulli prior distribution over U (and similarly V) can easily be extended to the case where K -> infinity, creating a distribution over binary matrices with a fixed number I of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP), was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version.
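As a small sketch (not the authors' code), the integrated-out finite prior above can be evaluated directly, since it depends on U only through the column counts n_k; the Beta-function form assumes the Beta(alpha/K, beta) construction just described.

```python
import numpy as np
from scipy.special import betaln

def log_prior_binary_matrix(U, alpha, beta):
    """Log of the finite Beta-Bernoulli prior on U with the column biases integrated out.

    Assumes pi_k ~ Beta(alpha/K, beta) and u_ik | pi_k ~ Bernoulli(pi_k), so the
    marginal depends only on the column counts n_k = sum_i u_ik.
    """
    I, K = U.shape
    n = U.sum(axis=0)
    a = alpha / K
    return float(np.sum(betaln(n + a, I - n + beta) - betaln(a, beta)))

# Example: a small random binary matrix.
U = (np.random.rand(10, 5) < 0.2).astype(int)
print(log_prior_binary_matrix(U, alpha=2.0, beta=1.0))
```

The same function applies to V, with its own concentration parameter.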
3 Inference of features and parameters

As with many other complex hierarchical Bayesian models, exact inference of the latent variables U and V in the BMF model is intractable (i.e. there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples.

3.1 Finite binary latent feature matrices

The posterior distribution of a single entry in U (or V) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional to the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of U (or V) can be done using the following updates:

    P(u_ik = 1 | U_-ik, V, W, X) = (1/C) (alpha/K + n_-i,k) P(X | U_-ik, u_ik = 1, V, W)    (2)
    P(u_ik = 0 | U_-ik, V, W, X) = (1/C) (beta + (I - 1) - n_-i,k) P(X | U_-ik, u_ik = 0, V, W)    (3)

where n_-i,k = sum_{i' != i} u_i'k, U_-ik excludes entry ik, and C is a normalizing constant. (Conditioning on alpha, K and beta is implicit.) When conditioning on W, we only need to calculate the ratio of likelihoods corresponding to row i. (Note that this is not the case when the weights are integrated out.) This ratio is a simple function of the model's predictions x+_ij = sum_{k',l} u_ik' v_jl w_k'l computed with u_ik = 1, and x-_ij computed with u_ik = 0. In the linear-Gaussian case:

    log [ P(u_ik = 1 | U_-ik, V, W, X) / P(u_ik = 0 | U_-ik, V, W, X) ]
        = log [ (alpha/K + n_-i,k) / (beta + (I - 1) - n_-i,k) ]
          - (theta_x / 2) sum_j [ (x_ij - x+_ij)^2 - (x_ij - x-_ij)^2 ]

In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights W and the hyperparameters. To simplify the presentation, we consider a "vectorized" representation of our variables. Let x be an IJ column vector taken column-wise from X, w be a KL column vector taken column-wise from W, and E be an IJ x KL binary matrix which is the Kronecker product of V and U. (In "Matlab notation", x = X(:), w = W(:) and E = kron(V, U).) In this notation, the data distribution is written as x ~ N(Ew, (1/theta_x)I). Given values for U and V, samples can be drawn for w, theta_x, and theta_w using the following posterior distributions (where conditioning on the data, W_0 and the Gamma hyperparameters is implicit):

    w ~ N( (theta_x E'E + theta_w I)^-1 (theta_x E'x + theta_w w_0), (theta_x E'E + theta_w I)^-1 )
    theta_w ~ Gamma( a_w + KL/2, b_w + (1/2) (w - w_0)'(w - w_0) )
    theta_x ~ Gamma( a_x + IJ/2, b_x + (1/2) (x - Ew)'(x - Ew) )

Note that we do not have to explicitly compute the matrix E. For computing the posterior of the linear-Gaussian weights, the matrix E'E can be computed as kron(V'V, U'U). Similarly, the expression E'x is constructed by computing U'XV and taking the elements column-wise.
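The single-entry update of equations (2)-(3) is easy to implement when conditioning on W. The sketch below resamples one entry u_ik in the linear-Gaussian case using the log-ratio given above; it is an illustration under the notation reconstructed here (theta_x is the output precision), not the authors' code.

```python
import numpy as np

def gibbs_update_u_entry(X, U, V, W, i, k, alpha, beta, theta_x, rng):
    """Resample u_ik from its conditional posterior (finite model, linear-Gaussian likelihood)."""
    I, K = U.shape
    n_minus = U[:, k].sum() - U[i, k]            # active entries in column k, excluding row i

    # Model predictions for row i with u_ik forced to 1 and to 0.
    u_row = U[i].copy()
    u_row[k] = 1.0
    x_plus = u_row @ W @ V.T                     # length-J vector of predictions
    u_row[k] = 0.0
    x_minus = u_row @ W @ V.T

    log_ratio = (np.log(alpha / K + n_minus)
                 - np.log(beta + (I - 1) - n_minus)
                 - 0.5 * theta_x * np.sum((X[i] - x_plus) ** 2 - (X[i] - x_minus) ** 2))

    p_one = 1.0 / (1.0 + np.exp(-log_ratio))     # P(u_ik = 1 | everything else)
    U[i, k] = float(rng.random() < p_one)
```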
3.2 Infinite binary latent feature matrices

One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row i of the matrix U), it is useful to consider partitioning the columns of U into two sets, as shown below. Let set A be the columns with at least one non-zero entry in rows other than i; let set B be all other columns, including those whose only non-zero entries are found in row i and the countably infinite number of all-zero columns:

        set A       set B
    0 1 0 0 1 | 0 0 0 0 0 ...
    0 0 1 0 0 | 0 0 0 0 0 ...
    1 1 0 0 1 | 0 0 0 0 0 ...
    1 0 0 1 1 | 0 0 0 0 0 ...
    1 1 0 0 1 | 0 1 0 1 0 ...   <- row i
    0 1 0 0 0 | 0 0 0 0 0 ...
    0 0 0 1 0 | 0 0 0 0 0 ...
    1 0 0 0 1 | 0 0 0 0 0 ...

Sampling values for the elements of row i in set A given everything else is straightforward, and involves Gibbs updates almost identical to those in the finite case handled by equations (2) and (3); as K -> infinity, for columns in set A we get:

    P(u_ik = 1 | U_-ik, V, W, X) proportional to n_-i,k P(X | U_-ik, u_ik = 1, V, W)    (4)
    P(u_ik = 0 | U_-ik, V, W, X) proportional to (beta + I - 1 - n_-i,k) P(X | U_-ik, u_ik = 0, V, W)    (5)

When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries n in set B which will be turned on in row i. Sampling the number of entries set to 1 can be done with Metropolis-Hastings updates. Let J(n* | n) = Poisson(n*; alpha / (beta + I - 1)) be the proposal distribution for a move which replaces the current n active entries with n* active entries in set B. The reverse proposal is J(n | n*). The acceptance probability is min(1, r_{n -> n*}), where r_{n -> n*} is

    r_{n -> n*} = [ P(n* | X) J(n | n*) ] / [ P(n | X) J(n* | n) ]
                = [ P(X | n*) Poisson(n*; alpha/(beta + I - 1)) J(n | n*) ] / [ P(X | n) Poisson(n; alpha/(beta + I - 1)) J(n* | n) ]
                = P(X | n*) / P(X | n)    (6)

This assumes a conjugate situation in which the weights W are explicitly integrated out of the model to compute the marginal likelihood P(X | n). In the non-conjugate case, a more complicated proposal is required. Instead of proposing n*, we jointly propose n* and the associated feature parameters w* from their prior distributions. In the linear-Gaussian model, where w* is a set of weights for the features in set B, the proposal distribution is:

    J(n*, w* | n, w) = Poisson(n*; alpha / (beta + I - 1)) Normal(w*; w_0, (1/theta_w) I)    (7)

We actually need to sample only the finite portion of w* corresponding to columns where u_ik = 1. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

    r_{n,w -> n*,w*} = P(X | n*, w*) / P(X | n, w)    (8)
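A sketch of the non-conjugate Metropolis-Hastings move of equations (7)-(8) for the number of set-B features active in row i. The likelihood function is left abstract and a zero prior mean is assumed for the newly proposed weights, so treat this as an illustration rather than the authors' implementation.

```python
import numpy as np

def mh_new_features(log_lik, n_cur, w_cur, alpha, beta, I, L, theta_w, rng):
    """One MH move on the number n of set-B features turned on in row i.

    `log_lik(n, w)` is a placeholder returning log P(X | n set-B features with
    weights w). Both n* and w* are proposed from their priors (a Poisson and a
    zero-mean Gaussian with precision theta_w), so the acceptance ratio reduces
    to the ratio of data likelihoods, as in equation (8).
    """
    n_prop = rng.poisson(alpha / (beta + I - 1))
    w_prop = rng.normal(0.0, 1.0 / np.sqrt(theta_w), size=(n_prop, L))

    log_r = log_lik(n_prop, w_prop) - log_lik(n_cur, w_cur)
    if np.log(rng.random()) < log_r:
        return n_prop, w_prop
    return n_cur, w_cur
```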
3.3 Faster mixing transition proposals

The Gibbs updates described above for the entries of U, V and W are the simplest moves we could make in a Markov chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row i simultaneously. For example, we can propose new values for several columns in row i of matrix U by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row i and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in U are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of U to increase the acceptance probability.

3.4 Predictions

A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing the sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices U and V, and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, non-linear function of the observed variables.

By averaging predictive distributions, our algorithm implicitly integrates over U and V. In our experiments, we show samples from the posteriors of U and V to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of the concentration parameters (higher values will result in more features) and of the precisions (higher weight precision results in less variation in weights).

4 Experiments

4.1 Modified "bars" problem

A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. The data consist of I vectors, each of which can be reshaped into a square image. The generation process is as follows: since V has the same number of rows as the dimension of the images, V is fixed to be a set of vertical and horizontal bars (when reshaped into an image); U is sampled from the IBP; and the global precisions theta_w and theta_x are set to fixed values. The weights W are sampled from zero-mean Gaussians. Model estimates of U and V were initialized from an IBP prior.
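Here is a sketch of how the modified bars data could be generated. The image size, the Bernoulli stand-in for the IBP draw of U, and the noise level are assumptions for illustration, not the settings used in the paper; only the count of 200 training cases matches the experiment described below.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 6                   # images are D x D, so each data vector has D*D entries (assumed size)
I, J = 200, D * D       # 200 training cases; J pixels per case
L = 2 * D               # one column feature per vertical bar and per horizontal bar

# V is fixed: column l of V, reshaped to D x D, is a single vertical or horizontal bar.
V = np.zeros((J, L))
for d in range(D):
    img = np.zeros((D, D)); img[:, d] = 1.0      # vertical bar in column d
    V[:, d] = img.ravel()
    img = np.zeros((D, D)); img[d, :] = 1.0      # horizontal bar in row d
    V[:, D + d] = img.ravel()

# U would be sampled from the IBP; a fixed-K Bernoulli matrix stands in for it here.
K = 4
U = (rng.random((I, K)) < 0.3).astype(float)

# Zero-mean Gaussian weights and Gaussian observation noise (noise level chosen arbitrarily).
W = rng.normal(0.0, 1.0, size=(K, L))
X = U @ W @ V.T + rng.normal(0.0, 0.5, size=(I, J))
```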
In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of U, V, and W. Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars was removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

Figure 3: Bars features. The top row shows the values of V and WV' used to generate the data. The second row shows a sample of V and WV' from the Markov chain. WV' can be thought of as a set of basis images which can be added together with binary coefficients (U) to create images.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of V and WV' along with one sample of those basis features from the Markov chain. Because the data are generated by adding together the basis images shown on the right of Figure 3, multiple bars appear in each image. This is reflected in the captured features. The learned WV' are fairly similar to the generating WV', but the former are composed of overlapping bar structure (learned V).

4.2 Digits

In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (P(x_ij = 1) > 0.5) of the BMF reconstruction. In the bottom row we show the nearest neighbours, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data cases which have each feature in U turned on. It is clear that some features have distinct digit forms and others are overlapping. In row G, the basis images WV' are shown. By adjusting which features are non-zero in each row of U, images are composed by adding basis images together. Finally, in row H we show V. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are K features in rows F and G, and L features in row H.
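For the binarized digits the observation model is logistic, so the reconstructions reported above are simple functions of the logits UWV'. A brief sketch (illustrative only, not the authors' code) of the per-sample mean and mode reconstructions:

```python
import numpy as np

def logistic_reconstruction(U, W, V):
    """Mean and mode of X under the logistic BMF observation model.

    mean_ij = sigma((U W V')_ij) = P(x_ij = 1); the mode thresholds this at 0.5.
    Averaging the mean over many MCMC samples of (U, W, V) would give the
    reported reconstruction; this helper handles a single sample.
    """
    logits = U @ W @ V.T
    mean = 1.0 / (1.0 + np.exp(-logits))
    mode = (mean > 0.5).astype(float)
    return mean, mode
```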
Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits was removed and labeled missing during learning. (B) The data shown to the algorithm; the top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data, shown in the bottom half, found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The basis images WV' reshaped into the form of digits; by adding these basis images together, as selected by the features in U, reconstructions of the digits are possible. (H) V reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data

Gene expression data can exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for the analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix W is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consist of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in UWV' are observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of U or V with knowledge gained from experiments or the literature, and to allow the model to add new features that help explain the data in more detail.

Figure 5: Gene expression results. (A) The top-left panel shows the data sorted according to contiguous features in the final U and V of the Markov chain; the remaining panels show U, V, and W. (B) The same as (A), but showing the expected value of the data, UWV'. We have highlighted regions that have both u_ik and v_jl on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.

5 Conclusion

We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. The model explains each row and column entity using a componential code composed of multiple binary latent features, along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations.
This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

References

[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging components of a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture notes for the St. Flour course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. [sent-10, score-0.255]

2 The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. [sent-11, score-0.561]

3 Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. [sent-12, score-0.715]

4 We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. [sent-13, score-0.23]

5 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. [sent-14, score-0.438]

6 In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). [sent-15, score-0.202]

7 A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. [sent-17, score-0.188]

8 Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. [sent-19, score-0.19]

9 [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. [sent-23, score-0.335]

10 [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. [sent-26, score-0.207]

11 In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. [sent-27, score-0.441]

12 Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. [sent-28, score-0.314]

13 In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. [sent-29, score-0.211]

14 Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. [sent-30, score-0.259]

15 The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . [sent-31, score-0.258]

16 (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. [sent-33, score-0.322]

17 2 BMF model description Binary matrix factorization is a model of an Á ¢ Â dyadic data matrix with exchangeable rows and columns. [sent-34, score-0.622]

18 Associated with each row is a latent binary feature vector ; similarly each column has an unobserved binary vector . [sent-36, score-0.734]

19 is generated by a fixed observation process ´¡µ applied by a matrix (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data: Ù Ú Ï ÍÎÏ ´ÍÏÎ ¢µ (1) where ¢ are extra parameters specific to the model variant. [sent-38, score-0.191]

20 ÍÏÎ ÍÏÎ ÍÏÎ Á To complete the description of the model, we need to specify prior distributions over the feature matrices and the weights . [sent-42, score-0.273]

21 We adopt the same priors over binary matrices as previously described in [5]. [sent-43, score-0.188]

22 For finite sized matrices with Á rows and à columns, we generate a bias independently for each column using a Beta prior (denoted ) and then conditioned on this bias generate the entries in column independently from a Bernoulli with mean . [sent-44, score-0.583]

23 The resulting prior depends only on the number Ò of active features in each column. [sent-48, score-0.235]

24 An identical prior is used on , with  rows and Ä columns, but with different concentration prior . [sent-49, score-0.326]

25 Î The appropriate prior distribution over weights depends on the observation distribution is a matrix normal with prior mean the linear-Gaussian variant, a convenient prior on Ï ´¡µ. [sent-51, score-0.314]

26 This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. [sent-54, score-0.148]

27 3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). [sent-58, score-0.463]

28 1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. [sent-61, score-0.649]

29 The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. [sent-62, score-0.494]

30 Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò  on « à and Í Î ½ Í  Î Ï µ ´ « à · Ò   µ È ´ Í  Ù ½ Î Ï µ (2) ¼ Í  Î Ï µ ´¬ · ´Á   ½µ   Ò  µ È ´ Í  Ù ¼ Πϵ (3) È Ù , Í  excludes entry , and is a normalizing constant. [sent-63, score-0.175]

31 ) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . [sent-65, score-0.23]

32 In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í  Î Ï Í  Î Ï µ µ ÐÓ ´« Ã · Ò  µ ¬ · ´Á   ½µ   Ò  ´ µ  ½ ¾ ¢ Ü ´   Ü· µ¾   ´Ü   Ü  µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. [sent-68, score-0.193]

33 Let be an ÁÂ column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a ÁÂ ¢ ÃÄ binary matrix which is the kronecker product ª . [sent-70, score-0.294]

34 2 Infinite binary latent feature matrices One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. [sent-77, score-0.703]

35 The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. [sent-78, score-0.285]

36 for row of the matrix ), it is useful to consider partitioning the columns of into two sets as shown below. [sent-82, score-0.32]

37 Let set A have at least one non-zero entry in rows other than . [sent-83, score-0.187]

38 Let set B be all other set A set B columns, including the set of columns where 0 1 0 0 1 0 0 0 0 0 ¡¡¡ the only non-zero entries are found in row 0 0 1 0 0 0 0 0 0 0 ¡¡¡ and the countably infinite number of all-zero 1 1 0 0 1 0 0 0 0 0 ¡¡¡ 1 0 0 1 1 0 0 0 0 0 ¡¡¡ columns. [sent-84, score-0.426]

39 Let  ´Ò Ò µ Poisson ´Ò « ´¬ · Á   ½µµ be the proposal distribution for a move which replaces the current Ò active entries with Ò active entries in set B. [sent-87, score-0.346]

40 Instead of proposing Ò , we jointly propose Ò and associated feature parameters from their prior distributions. [sent-91, score-0.158]

41 In the linear-Gaussian model, where is a set of weights for features in set B, the proposal distribution is: Û Û « ´¬ · Á   ½µµ Normal ´Û Ò µ (7) We need actually sample only the finite portion of Û where Ù ½. [sent-92, score-0.279]

42 3 Faster mixing transition proposals are the simplest moves we could The Gibbs updates described above for the entries of , and make in a Markov Chain Monte Carlo inference procedure for the BMF model. [sent-94, score-0.277]

43 In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row simultaneously. [sent-96, score-0.398]

44 For example, we can propose new values for several columns in row of matrix by sampling feature values independently from their conditional priors. [sent-97, score-0.463]

45 To compute the reverse proposal, we imagine forgetting the current configuration of those features for row and compute the probability under the conditional prior of proposing the current configuration. [sent-98, score-0.41]

46 The acceptance probability of such a proposal is (the maximum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration. [sent-99, score-0.254]

47 Í Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. [sent-100, score-0.331]

48 We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. [sent-102, score-0.237]

49 The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. [sent-106, score-0.278]

50 Our split-merge algorithm also performs restricted Gibbs scans on columns of to increase acceptance probability. [sent-107, score-0.212]

51 4 Predictions A major reason for building generative models of data is to be able to impute missing data values given some observations. [sent-109, score-0.181]

52 The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. [sent-111, score-0.303]

53 Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices and , and all other nuisance parameters (e. [sent-112, score-0.33]

54 In our By averaging predictive distributions, our algorithm implicitly integrates over experiments, we show samples from the posteriors of and to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. [sent-118, score-0.255]

55 1 Modified “bars” problem A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). [sent-122, score-0.334]

56 Vertical and horizontal bars are combined in some way to generate data samples. [sent-123, score-0.251]

57 The goal of the illustration is to show recovery of the latent structure in the form of bars. [sent-124, score-0.171]

58 We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. [sent-125, score-0.171]

59 The generation process is as follows: since has the same number of rows as the dimension of the images, is fixed to be a set of vertical and horizontal bars (when reshaped into an image). [sent-127, score-0.487]

60 Î Î Í Ï Î Í In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. [sent-131, score-0.171]

61 We train the BMF with 200 training examples of the type shown in the top row in Figure 2. [sent-132, score-0.184]

62 Some examples have their bottom halves labeled missing and are shown in the Figure with constant grey values. [sent-133, score-0.237]

63 Despite the relatively high the expected reconstruction using MCMC samples of , , and ÍÎ Ï noise levels in the data, the model is able to capture the complex relationships between bars and weights. [sent-136, score-0.326]

64 The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half. [sent-138, score-0.63]

65 (A) Data samples (B) Noise-free data (C) Initial reconstruction (D) Mean reconstruction (E) Nearest neighbour Figure 2: Bars reconstruction. [sent-139, score-0.178]

66 The bottom half of these bars were removed and labeled missing during learning. [sent-141, score-0.368]

67 The missing values have been set to their expected value, ¼, to highlight the missing region. [sent-144, score-0.178]

68 The top row shows values of and used to generate the data. [sent-148, score-0.217]

69 The second row shows a sample of and from the Markov chain. [sent-149, score-0.151]

70 can be thought of as a set of basis images which can be added together with binary coefficients ( ) to create images. [sent-150, score-0.197]

71 Î ÏÎ ÏÎ Í By examining the features captured by the model, we can understand the performance just described. [sent-151, score-0.157]

72 Because the model is generated by adding multiple images shown on the right of Figure 3, multiple bars are used in each image. [sent-153, score-0.318]

73 In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. [sent-160, score-0.3]

74 The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. [sent-161, score-0.332]

75 In the third and fourth rows we show the mean and mode (È ´Ü ½µ ¼ ) of the BMF reconstruction. [sent-163, score-0.153]

76 In the bottom row we have shown the nearest neighbors, in pixel space, to the training examples based only on the top halves of the original digits. [sent-164, score-0.407]

77 In the last three rows of Figure 4 we show the features captured by the model. [sent-165, score-0.31]

78 In row F, we show the average image of the data which have each feature in on. [sent-166, score-0.208]

79 In row G, the basis images By adjusting the features that are non-zero in each row of , images are composed by adding basis images together. [sent-169, score-0.584]

80 These pixel features mask out different regions in Î Í Í ÏÎ pixel space, which are weighted together to create the basis images. [sent-171, score-0.235]

81 Note that there are à features in rows F and G, and Ä features in row H. [sent-172, score-0.55]

82 The bottom half of these digits were removed and labeled missing during learning. [sent-175, score-0.283]

83 (C) The mean of the reconstruction for the bottom halves. [sent-178, score-0.16]

84 (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. [sent-180, score-0.261]

85 By adding these features together, which the features do, reconstructions of the digits is possible. [sent-183, score-0.332]

86 3 Gene expression data Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). [sent-187, score-0.377]

87 The plaid model[10], originally introduced for analysis of gene expression data, can be thought of as a nonBayesian special case of our model in which the matrix is diagonal and the number of binary features is fixed. [sent-188, score-0.538]

88 Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. [sent-189, score-0.246]

89 The data in this experiment consists of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute Leukemia [4]. [sent-190, score-0.326]

90 The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in were observable. [sent-192, score-0.432]

91 One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of or with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail. [sent-195, score-0.495]

92 Ï Í Î 5 Conclusion We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. [sent-196, score-0.389]

93 BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. [sent-197, score-0.61]

94 The model explains each row and column entity using a componential code composed of multiple binary latent features along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. [sent-198, score-0.975]

95 We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and (A) (B) Figure 5: Gene expression results. [sent-200, score-0.287]

96 (A) The top-left is sorted according to contiguous features in the final and in the Markov chain. [sent-201, score-0.178]

97 BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation and from factorial models with continuous latent variables which interact linearly to produce the observations. [sent-208, score-0.44]

98 This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper. [sent-209, score-0.171]

99 Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. [sent-246, score-0.169]

100 Infinite latent feature models and the Indian buffet process. [sent-251, score-0.308]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bmf', 0.636), ('dyadic', 0.202), ('bars', 0.171), ('latent', 0.171), ('rows', 0.153), ('row', 0.151), ('columns', 0.135), ('features', 0.123), ('binary', 0.12), ('reshaped', 0.116), ('gene', 0.105), ('ibp', 0.092), ('mcmc', 0.09), ('entries', 0.09), ('missing', 0.089), ('reconstruction', 0.089), ('digits', 0.086), ('weights', 0.082), ('acceptance', 0.077), ('halves', 0.077), ('viewer', 0.075), ('componential', 0.075), ('proposal', 0.074), ('poisson', 0.073), ('factorization', 0.071), ('bottom', 0.071), ('column', 0.07), ('jain', 0.069), ('matrices', 0.068), ('prior', 0.066), ('factorial', 0.064), ('expression', 0.064), ('posterior', 0.064), ('exchangeable', 0.06), ('overlapping', 0.06), ('updates', 0.06), ('nite', 0.058), ('impute', 0.058), ('plaid', 0.058), ('toronto', 0.057), ('neal', 0.057), ('feature', 0.057), ('gibbs', 0.056), ('contiguous', 0.055), ('movie', 0.053), ('markov', 0.052), ('integrated', 0.052), ('sampling', 0.051), ('radford', 0.05), ('countably', 0.05), ('proposals', 0.049), ('horizontal', 0.047), ('active', 0.046), ('items', 0.046), ('buffet', 0.046), ('indian', 0.046), ('precisions', 0.046), ('unobserved', 0.045), ('conditioning', 0.044), ('neighbours', 0.043), ('concentration', 0.041), ('bayesian', 0.041), ('clusters', 0.041), ('images', 0.041), ('merging', 0.04), ('zoubin', 0.04), ('inference', 0.039), ('moves', 0.039), ('patients', 0.038), ('pixel', 0.038), ('dirichlet', 0.038), ('predictive', 0.037), ('half', 0.037), ('nearest', 0.037), ('composed', 0.036), ('multiple', 0.036), ('create', 0.036), ('hyperparameters', 0.036), ('analytically', 0.036), ('cause', 0.036), ('conditional', 0.035), ('proposing', 0.035), ('beta', 0.035), ('unity', 0.035), ('item', 0.035), ('mixture', 0.035), ('ratio', 0.035), ('models', 0.034), ('matrix', 0.034), ('entry', 0.034), ('captured', 0.034), ('model', 0.034), ('splitting', 0.034), ('top', 0.033), ('unsupervised', 0.033), ('generate', 0.033), ('logistic', 0.033), ('likelihoods', 0.033), ('complex', 0.032), ('precision', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. The modelling assumption in such a case is that movies come in à types and viewers in Ä types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . (B) BMF shown pictorally. (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. 
2 BMF model description Binary matrix factorization is a model of an Á ¢  dyadic data matrix with exchangeable rows and columns. The entries of can be real-valued, binary, or categorial; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector ; similarly each column has an unobserved binary vector . The primary parameters are represented of interaction weights. is generated by a fixed observation process ´¡µ applied by a matrix (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data: Ù Ú Ï ÍÎÏ ´ÍÏÎ ¢µ (1) where ¢ are extra parameters specific to the model variant. Three possible parametric forms for and covariance ´½ µ ; the noise (observation) distribution are: Gaussian, with mean logistic, with mean ½ ´½ · ÜÔ´  µµ; and Poisson, with mean (and variance) . Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5]. ÍÏÎ ÍÏÎ ÍÏÎ Á To complete the description of the model, we need to specify prior distributions over the feature matrices and the weights . We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices with Á rows and à columns, we generate a bias independently for each column using a Beta prior (denoted ) and then conditioned on this bias generate the entries in column independently from a Bernoulli with mean . ÍÎ Ï «Ã Í È Á Í ´« à ¬ µ à ½ ½ Ù « ´½   µ½ Ù « « à ½ ´ Ò « «µ ´½   µÁ  Ò where Ò Ù . The hyperprior on the concentration « is a Gamma distribution (denoted ), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number Ò of active features in each column. An identical prior is used on , with  rows and Ä columns, but with different concentration prior . The variable ¬ was set to ½ for all experiments. Î The appropriate prior distribution over weights depends on the observation distribution is a matrix normal with prior mean the linear-Gaussian variant, a convenient prior on Ï ´¡µ. For ÏÓ and µ Á. covariance ´½ hyperpriors: The scale of the weights and output precision Ï ÏÓ (if needed) have Gamma Æ ´ÏÓ ´½ µ Áµ ´ µ ´ µ In certain cases, when the prior on the weights is conjugate to the output distribution model , the weights may be analytically integrated out, expressing the marginal distribution of the data only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. ÍÎ Í Î Remarkably, the Beta-Bernoulli prior distribution over (and similarly ) can easily be extended to the case where à ½, creating a distribution over binary matrices with a fixed number Á of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP) was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version. 
3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov Chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples. Î 3.1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò  on « à and Í Î ½ Í  Î Ï µ ´ « à · Ò   µ È ´ Í  Ù ½ Î Ï µ (2) ¼ Í  Î Ï µ ´¬ · ´Á   ½µ   Ò  µ È ´ Í  Ù ¼ Πϵ (3) È Ù , Í  excludes entry , and is a normalizing constant. (Conditioning is implicit.) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . (Note that this is not the case when the weights are integrated out.) È ½) and This ratio is a simple function of the model’s predictions Ü· Ð Ù Ú Ð Û Ð (when Ù È Ü  Ù Ú Ð Û Ð (when Ù ¼). In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í  Î Ï Í  Î Ï µ µ ÐÓ ´« à · Ò  µ ¬ · ´Á   ½µ   Ò  ´ µ  ½ ¾ ¢ Ü ´   Ü· µ¾   ´Ü   Ü  µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a “vectorized” representation of our variables. Let be an Á column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a Á ¢ ÃÄ binary matrix which is the kronecker product ª . (In “Matlab notation”, ´µ ´ µ and ÖÓÒ´ µ.) In this notation, the data distribution is written as: Æ´ ´½ µ µ. Given values for and , samples can be drawn for , , and using the following posterior distributions (where conditioning on Ó is implicit): Æ ´ · µ ½ ´ · Óµ ´ · µ ½ Ï Î Í Û Ü Û Û Ü Ï Ü Ü Û Û Ï Û Á Û ÎÍ Á Ü Û Í Á Î Û Û Ü · ÃÄ ¾ · Á ¾ · ½ ´Û   ÛÓ µ ´Û   ÛÓ µ ¾ ½ ´Ü   Ûµ ´Ü   Ûµ ·¾ Note that we do not have to explicitly compute the matrix . For computing the posterior of linearGaussian weights, the matrix can be computed as ÖÓÒ´ µ. Similarly, the expression is constructed by computing and taking the elements column-wise. Ü Î ÎÍ Í Í Î 3.2 Infinite binary latent feature matrices One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row of the matrix ), it is useful to consider partitioning the columns of into two sets as shown below. Let set A have at least one non-zero entry in rows other than . 
3.2 Infinite binary latent feature matrices

One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row $i$ of the matrix $U$), it is useful to consider partitioning the columns of $U$ into two sets, as in the example below. Let set A have at least one non-zero entry in rows other than $i$. Let set B be all other columns, including the set of columns where the only non-zero entries are found in row $i$ and the countably infinite number of all-zero columns.

       set A        set B
    0 1 0 0 1  |  0 0 0 0 0 ...
    0 0 1 0 0  |  0 0 0 0 0 ...
    1 1 0 0 1  |  0 0 0 0 0 ...
    1 0 0 1 1  |  0 0 0 0 0 ...
    1 1 0 0 1  |  0 1 0 1 0 ...   (row i)
    0 1 0 0 0  |  0 0 0 0 0 ...
    0 0 0 1 0  |  0 0 0 0 0 ...
    1 0 0 0 1  |  0 0 0 0 0 ...

Sampling values for elements in row $i$ of set A given everything else is straightforward, and involves Gibbs updates almost identical to those in the finite case handled by equations (2) and (3); as $K \to \infty$, for columns in set A we get:

$P(u_{ik} = 1 \mid U_{-ik}, V, W, X) \propto n_{-i,k} \; P(X \mid U_{-ik}, u_{ik}=1, V, W)$   (4)
$P(u_{ik} = 0 \mid U_{-ik}, V, W, X) \propto (\beta + I - 1 - n_{-i,k}) \; P(X \mid U_{-ik}, u_{ik}=0, V, W)$   (5)

When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries $n_i$ in set B which will be turned on in row $i$. Sampling the number of entries set to 1 can be done with Metropolis-Hastings updates. Let $J(n_i^* \mid n_i) = \mathrm{Poisson}(n_i^* \mid \alpha/(\beta + I - 1))$ be the proposal distribution for a move which replaces the current $n_i$ active entries with $n_i^*$ active entries in set B. The reverse proposal is $J(n_i \mid n_i^*)$. The acceptance probability is $\min(1, r_{n_i \to n_i^*})$, where $r_{n_i \to n_i^*}$ is

$r_{n_i \to n_i^*} = \frac{P(n_i^* \mid X)\, J(n_i \mid n_i^*)}{P(n_i \mid X)\, J(n_i^* \mid n_i)} = \frac{P(X \mid n_i^*)\, \mathrm{Poisson}(n_i^* \mid \alpha/(\beta+I-1))\, J(n_i \mid n_i^*)}{P(X \mid n_i)\, \mathrm{Poisson}(n_i \mid \alpha/(\beta+I-1))\, J(n_i^* \mid n_i)} = \frac{P(X \mid n_i^*)}{P(X \mid n_i)}$   (6)

This assumes a conjugate situation in which the weights $W$ are explicitly integrated out of the model to compute the marginal likelihood $P(X \mid n_i)$. In the non-conjugate case, a more complicated proposal is required. Instead of proposing $n_i^*$ alone, we jointly propose $n_i^*$ and the associated feature parameters $w^*$ from their prior distributions. In the linear-Gaussian model, where $w^*$ is a set of weights for features in set B, the proposal distribution is:

$J(n_i^*, w^* \mid n_i, w) = \mathrm{Poisson}(n_i^* \mid \alpha/(\beta + I - 1))\; \mathrm{Normal}(w^* \mid n_i^*)$   (7)

We actually need to sample only the finite portion of $w^*$ corresponding to columns where $u_{ik} = 1$. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

$r_{n_i, w \to n_i^*, w^*} = \frac{P(X \mid n_i^*, w^*)}{P(X \mid n_i, w)}$   (8)
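As an illustration of the non-conjugate set-B move (equations (7) and (8)), here is a hedged Python/NumPy sketch: the number of features private to row $i$ and their weights are proposed jointly from the prior, and the move is accepted on the row-$i$ likelihood ratio alone. Function and argument names are ours, and the bookkeeping that attaches accepted features back onto U and W is omitted.

```python
import numpy as np

def mh_new_features_row(x_i, base_pred_i, V, alpha, beta, I, lam, gamma,
                        W_private, rng):
    """Metropolis-Hastings move for the 'set B' features of row i in the
    non-conjugate linear-Gaussian case.

    x_i         : observed row i of X, shape (J,)
    base_pred_i : prediction for row i using set-A features only, shape (J,)
    W_private   : current (n_i x L) weights of features active only in row i
    Returns the accepted weight block; its row count is the new n_i."""
    L = V.shape[1]
    # Joint proposal from the prior: n_i* ~ Poisson, new weights from their Normal prior.
    n_star = rng.poisson(alpha / (beta + I - 1.0))
    W_star = rng.normal(0.0, 1.0 / np.sqrt(lam), size=(n_star, L))

    def row_loglik(W_b):
        # Features private to row i are all 'on' in row i, so they add the
        # sum of their weight rows to the prediction for row i.
        pred = base_pred_i + W_b.sum(axis=0) @ V.T
        return -0.5 * gamma * np.sum((x_i - pred) ** 2)

    log_r = row_loglik(W_star) - row_loglik(W_private)   # equation (8)
    if rng.random() < np.exp(min(0.0, log_r)):
        return W_star        # accept: row i now has n_star private features
    return W_private         # reject: keep the current private features
```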
3.3 Faster mixing transition proposals

The Gibbs updates described above for the entries of $U$, $V$ and $W$ are the simplest moves we could make in a Markov Chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row $i$ simultaneously. For example, we can propose new values for several columns in row $i$ of matrix $U$ by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row $i$ and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in $U$ are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of $U$ to increase the acceptance probability.

3.4 Predictions

A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices $U$ and $V$, and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, non-linear function of the observed variables.

By averaging predictive distributions, our algorithm implicitly integrates over $U$ and $V$. In our experiments, we show samples from the posteriors of $U$ and $V$ to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of $\alpha$ and $\alpha_V$ (higher values will result in more features) and on the precision values (higher weight precision results in less variation in weights).
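A small Python/NumPy sketch of how the predictive machinery of Section 3.4 and the missing-value handling used in the experiments might look: missing entries are re-drawn from the current observation model at each iteration, and predictions are formed by averaging per-sample reconstructions over stored MCMC samples. This is an illustrative reading, with our own function names and the simplification that the weights are kept explicitly rather than integrated out.

```python
import numpy as np

def resample_missing(X, missing_mask, U, V, W, gamma, rng):
    """Impute missing entries at one MCMC iteration by drawing them from the
    current linear-Gaussian observation model N(U W V^T, 1/gamma)."""
    pred = U @ W @ V.T
    draw = pred + rng.normal(0.0, 1.0 / np.sqrt(gamma), size=X.shape)
    X_new = X.copy()
    X_new[missing_mask] = draw[missing_mask]
    return X_new

def averaged_prediction(samples):
    """Approximate predictive mean by averaging reconstructions U W V^T over a
    list of post-burn-in MCMC samples (U, V, W)."""
    recons = [U @ W @ V.T for U, V, W in samples]
    return np.mean(recons, axis=0)
```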
4 Experiments

4.1 Modified "bars" problem

A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. The data consists of $I$ vectors, each of which can be reshaped into a square image. The generation process is as follows: since $V$ has the same number of rows as the dimension of the images, $V$ is fixed to be a set of vertical and horizontal bars (when reshaped into an image). $U$ is sampled from the IBP, and the global precisions $\gamma$ and $\lambda$ are set to fixed values. The weights $W$ are sampled from zero-mean Gaussians. Model estimates of $U$ and $V$ were initialized from an IBP prior.

In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of $U$, $V$, and $W$. Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. Panels: (A) Data samples, (B) Noise-free data, (C) Initial reconstruction, (D) Mean reconstruction, (E) Nearest neighbour. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

Figure 3: Bars features. The top row shows the values of $V$ and $W V^\top$ used to generate the data. The second row shows a sample of $V$ and $W V^\top$ from the Markov chain. $W V^\top$ can be thought of as a set of basis images which can be added together with binary coefficients ($U$) to create images.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of $V$ and $W V^\top$ along with one sample of those basis features from the Markov chain. Because the data is generated by adding together multiple basis images, of the kind shown on the right of Figure 3, multiple bars are used in each image. This is reflected in the captured features. The learned basis images $W V^\top$ are fairly similar to the generating ones, but the former are composed of overlapping bar structure in the learned $V$.

4.2 Digits

In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and the mode ($P(\hat{x}_{ij} = 1) > 0.5$) of the BMF reconstruction. In the bottom row we show the nearest neighbours, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in $U$ on. It is clear that some features have distinct digit forms and others are overlapping. In row G, the basis images $W V^\top$ are shown. By adjusting the features that are non-zero in each row of $U$, images are composed by adding basis images together. Finally, in row H we show $V$. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are $K$ features in rows F and G, and $L$ features in row H.
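Since the logistic variant is only sketched in the text, here is a hedged Python/NumPy illustration of its observation model: the Bernoulli log-likelihood with mean $1/(1+\exp(-U W V^\top))$, plus the mean and thresholded-mode reconstructions of the kind reported for the digits. Function names and the threshold argument are ours.

```python
import numpy as np

def logistic_bmf_loglik(X_binary, U, V, W):
    """Bernoulli log-likelihood of binary data under the logistic BMF
    observation model, whose mean is 1 / (1 + exp(-U W V^T))."""
    eta = U @ W @ V.T                                        # natural parameters
    return np.sum(X_binary * eta - np.logaddexp(0.0, eta))   # stable log(1 + exp)

def logistic_bmf_reconstruction(U, V, W, threshold=0.5):
    """Mean reconstruction P(x_ij = 1) and its thresholded mode."""
    p = 1.0 / (1.0 + np.exp(-(U @ W @ V.T)))
    return p, (p > threshold).astype(float)
```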
Figure 4: Digits reconstruction. Panels (A)-(H). (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits having each feature of $U$ active. (G) The basis images $W V^\top$ reshaped into the form of digits; adding these basis images together, as dictated by the binary features in $U$, reconstructs the digits. (H) $V$ reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data

Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for the analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix $W$ is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consists of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in $U W V^\top$ were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of $U$ or $V$ with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail.

Figure 5: Gene expression results. (A) The top-left panel is the data matrix, with rows and columns sorted according to contiguous features in the final $U$ and $V$ of the Markov chain; the remaining panels show the corresponding $U$, $V$, and $W$. (B) The same as (A), but showing the expected value of the data, $U W V^\top$. We have highlighted regions that have both $u_{ik}$ and $v_{jl}$ on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.

5 Conclusion

We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. The model explains each row and column entity using a componential code composed of multiple binary latent features along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations.
This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

References

[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.

2 0.23030525 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

3 0.14669651 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby

Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1

4 0.13426118 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

5 0.10927062 130 nips-2006-Max-margin classification of incomplete data

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1

6 0.1043349 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

7 0.098274022 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

8 0.094698623 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

9 0.081940658 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

10 0.076438874 66 nips-2006-Detecting Humans via Their Pose

11 0.075251013 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

12 0.074098431 57 nips-2006-Conditional mean field

13 0.071517244 158 nips-2006-PG-means: learning the number of clusters in data

14 0.07123965 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

15 0.070439242 113 nips-2006-Learning Structural Equation Models for fMRI

16 0.069993243 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

17 0.069148481 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

18 0.067234002 115 nips-2006-Learning annotated hierarchies from relational data

19 0.063768663 41 nips-2006-Bayesian Ensemble Learning

20 0.060815148 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.229), (1, 0.029), (2, 0.119), (3, -0.086), (4, 0.049), (5, -0.004), (6, 0.159), (7, -0.086), (8, -0.027), (9, -0.022), (10, -0.09), (11, 0.179), (12, 0.04), (13, 0.024), (14, -0.132), (15, 0.027), (16, 0.01), (17, 0.004), (18, -0.03), (19, 0.049), (20, 0.066), (21, -0.148), (22, 0.183), (23, -0.07), (24, 0.029), (25, 0.117), (26, -0.014), (27, 0.143), (28, 0.048), (29, 0.058), (30, 0.122), (31, 0.028), (32, -0.114), (33, 0.117), (34, 0.085), (35, -0.038), (36, -0.08), (37, -0.141), (38, -0.112), (39, -0.126), (40, -0.016), (41, -0.011), (42, -0.068), (43, -0.054), (44, -0.118), (45, -0.088), (46, 0.012), (47, 0.034), (48, -0.007), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.956173 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis


2 0.86262298 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization


3 0.69108206 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments


4 0.64488482 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data


5 0.49278715 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin, Sean Cutler

Abstract: We present a hierarchical Bayesian model for sets of related, but different, classes of time series data. Our model performs alignment simultaneously across all classes, while detecting and characterizing class-specific differences. During inference the model produces, for each class, a distribution over a canonical representation of the class. These class-specific canonical representations are automatically aligned to one another — preserving common sub-structures, and highlighting differences. We apply our model to compare and contrast solenoid valve current data, and also, liquid-chromatography-ultraviolet-diode array data from a study of the plant Arabidopsis thaliana. 1 Aligning Time Series From Different Classes Many practical problems over a wide range of domains require synthesizing information from several noisy examples of one or more categories in order to build a model which captures common structure and also learns the patterns of variability between categories. In time series analysis, these modeling goals manifest themselves in the tasks of alignment and difference detection. These tasks have diverse applicability, spanning speech & music processing, equipment & industrial plant diagnosis/monitoring, and analysis of biological time series such as microarray & liquid/gas chromatography-based laboratory data (including mass spectrometry and ultraviolet diode arrays). Although alignment and difference detection have been extensively studied as separate problems in the signal processing and statistical pattern recognition communities, to our knowledge, no existing model performs both tasks in a unified way. Single class alignment algorithms attempt to align a set of time series all together, assuming that variability across different time series is attributable purely to noise. In many real-world situations, however, we have time series from multiple classes (categories) and our prior belief is that there is both substantial shared structure between the class distributions and, simultaneously, systematic (although often rare) differences between them. While in some circumstances (if differences are small and infrequent), single class alignment can be applied to multi-class data, it is much more desirable to have a model which performs true multi-class alignment in a principled way, allowing for more refined and accurate modeling of the data. In this paper, we introduce a novel hierarchical Bayesian model which simultaneously solves the multi-class alignment and difference detection tasks in a unified manner, as illustrated in Figure 1. The single-class alignment shown in this figure coerces the feature in region A for class 1 to be inappropriately collapsed in time, and the overall width of the main broad peak in class 2 to be inappropriately narrowed. In contrast, our multi-class model handles these features correctly. Furthermore, because our algorithm does inference for a fully probabilistic model, we are able to obtain quantitative measures of the posterior uncertainty in our results, which, unlike the point estimates produced by most current approaches, allow us to assess our relative confidence in differences learned by the model. Our basic setup for multi-class alignment assumes the class labels are known for each time series, as is the case for most difference detection problems. However, as we discuss at the end of the paper, our model can be extended to the completely unsupervised case. 
normal abnormal 3 common structure 3 1 1 20 0 3 −20 class−specific differences 20 1 0 −20 3 class−specific models 3 A 1 1 0 50 100 150 200 250 0 50 100 150 200 Figure 1: Nine time series from the NASA valve solenoid current data set [4]. Four belong to a ‘normal’ class, and five to an ‘abnormal’ class. On all figures, the horizontal axis is time, or latent time for figures of latent traces and observed time series aligned to latent traces. The vertical axis is current amplitude. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data (over MCMC samples) within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM. Top right: Parent trace. Middle right: Class-specific energy impulses with the topmost showing the class impulses for the less smooth class. Bottom right: Child traces superimposed. Note that if one generates more HB-CPM MCMC samples, the parent cycles between the two classes since the model has no preference for which class is seen as a modification of the other; the child classes remain stable however. 2 A Hierarchical Bayesian Continuous Profile Model Building on our previous Continuous Profile Model (CPM) [7], we propose a Hierarchical Bayesian Continuous Profile Model (HB-CPM) to address the problems of multi-class alignment and difference detection, together, for sets of sibling time series data — that is, replicate time series from several distinct, but related classes. The HB-CPM is a generative model that allows simultaneous alignment of time series and also provides aligned canonical representations of each class along with measures of uncertainty on these representations. Inference in the model can be used, for example, to detect and quantify similarities and differences in class composition. The HB-CPM extends the basic CPM in two significant ways: i) it addresses the multi-class rather than the single-class alignment problem, and ii) it uses a fully Bayesian framework rather than a maximum likelihood approach, allowing us to estimate uncertainty in both the alignments and the canonical representations. Our model, depicted in Figure 2, assumes that each observed time series is generated as a noisy transformation of a single, class-specific latent trace. Each latent trace is an underlying, noiseless representation of the set of replicated, observable time series belonging to a single class. An observed time series is generated from this latent trace exactly as in the original CPM, by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as with an HMM. Each hidden state corresponds to a ‘latent time’ in the latent trace. Thus different choices of hidden state sequences result in different nonlinear transformations of the underlying trace. The HB-CPM uses a separate latent trace for each class, which we call child traces. Crucially, each of these child traces is generated from a single parent trace (also unobserved), which 250 captures the common structure among all of the classes. 
The joint prior distribution for the child traces in the HB-CPM model can be realized by first sampling a parent trace, and then, for each class, sampling a sparse ‘difference vector’ which dictates how and where each child trace should differ from the common parent.

Figure 2: Core elements of the HB-CPM, illustrated with two-class data (hidden and observed) drawn from the model’s prior. (The graph depicts the parent trace and its impulses, the class-specific impulse nodes r^1 and r^2, the child traces z^1 and z^2, and the observed time series x^1, x^2, x^3 of class 1 and x^4, x^5, x^6 of class 2.)

2.1 The Prior on Latent Traces

Let the vector x^k = (x^k_1, x^k_2, ..., x^k_N) represent the kth observed scalar time series, and w^k ∈ 1..C be the class label of this time series. Also, let z = (z_1, z_2, ..., z_M) be the parent trace, and z^c = (z^c_1, z^c_2, ..., z^c_M) be the child trace for the cth class. During inference, posterior samples of z^c form a canonical representation of the observed time series in class c, and z contains their common sub-structure. Ideally, the length of the latent traces, M, would be very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + ε)N (double the resolution, plus some slack on each end) in our experiments, and found this to be sufficient with ε < 0.2. Because the resolution of the latent traces is higher than that of the observed time series, experimental time can be made to effectively speed up or slow down by advancing along the latent trace in larger or smaller jumps.

As mentioned previously, the child traces in the HB-CPM inherit most of their structure from a common parent trace. The differences between child and parent are encoded in a difference vector for each class, d^c = (d^c_1, d^c_2, ..., d^c_M); normally, most elements of d^c are close to zero. Child traces are obtained by adding this difference vector to the parent trace: z^c = z + d^c. We model both the parent trace and class-specific difference vectors with what we call an energy impulse chain, which is an undirected Markov chain in which neighbouring nodes are encouraged to be similar (i.e., smooth), and where this smoothness is perturbed by a set of marginally independent energy impulse nodes, with one energy impulse node attached to each node in the chain. For the difference vector of the cth class, the corresponding energy impulses are denoted r^c = (r^c_1, r^c_2, ..., r^c_M), and for the parent trace the energy impulses are denoted r = (r_1, r_2, ..., r_M). Conditioned on the energy impulses, the probability of a difference vector is

p(d^c | r^c, α^c, ρ^c) = (1/Z_{r^c}) exp{ −(1/2) [ Σ_{i=1}^{M−1} (d^c_i − d^c_{i+1})² / α^c + Σ_{i=1}^{M} (d^c_i − r^c_i)² / ρ^c ] }.   (1)

Here, Z_{r^c} is the normalizing constant for this probability density, α^c controls the smoothness of the chain, and ρ^c controls the influence of the energy impulses. Together, α^c and ρ^c also control the overall tightness of the distribution for d^c. Presently, we tie these parameters across classes, setting all α^c to a single shared value, and similarly all ρ^c — that is, they do not differ between classes. Similarly, the conditional probability of the parent trace is

p(z | r, α, ρ) = (1/Z_r) exp{ −(1/2) [ Σ_{i=1}^{M−1} (z_i − z_{i+1})² / α + Σ_{i=1}^{M} (z_i − r_i)² / ρ ] }.   (2)

These probability densities are each multivariate Gaussian with tridiagonal precision matrices (corresponding to the Markov nature of the interactions).
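As a sanity check on this last claim, the short NumPy sketch below (our own illustration, not from the paper, with arbitrary values for M, α and ρ) builds the tridiagonal precision implied by equation (2) and confirms numerically that the energy in the exponent equals the quadratic form (1/2) zᵀPz − zᵀr/ρ + rᵀr/(2ρ) with P = L/α + I/ρ, where L is the chain graph Laplacian; the same construction with α^c and ρ^c gives the precision of the difference vector in equation (1).

```python
import numpy as np

# Check that the energy in Eq. (2) is a Gaussian quadratic form with a tridiagonal
# precision P = L/alpha + I/rho. M, alpha, rho are arbitrary illustrative values.
M, alpha, rho = 50, 2.0, 0.5
rng = np.random.default_rng(0)
z, r = rng.standard_normal(M), rng.standard_normal(M)

# Energy exactly as written in Eq. (2).
energy = 0.5 * (np.sum((z[:-1] - z[1:]) ** 2) / alpha + np.sum((z - r) ** 2) / rho)

# Chain graph Laplacian: diag(1, 2, ..., 2, 1) with -1 on the off-diagonals.
lap = (np.diag(np.r_[1.0, 2.0 * np.ones(M - 2), 1.0])
       - np.diag(np.ones(M - 1), 1) - np.diag(np.ones(M - 1), -1))
P = lap / alpha + np.eye(M) / rho

# The same energy as a quadratic form in z, so z | r is N(P^{-1} r / rho, P^{-1}).
quadratic = 0.5 * z @ P @ z - z @ r / rho + 0.5 * r @ r / rho
assert np.isclose(energy, quadratic)
```

This matrix P is what the inference section below writes as S^{-1} (for the parent) or S†^{-1} (for a difference vector, built with α^c and ρ^c).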
Each component of the energy impulse for the parent, r_j, is drawn independently from a single univariate Gaussian, N(r_j | µ_par, s_par), whose mean and variance are in turn drawn from a Gaussian and inverse-gamma, respectively. The class-specific difference vector impulses, however, are drawn from a mixture of two zero-mean Gaussians — one ‘no difference’ (inlier) Gaussian, and one ‘class-difference’ (outlier) Gaussian. The means are zero so as to encourage difference vectors to be near zero (and thus child traces to be similar to the parent trace). Letting δ^c_j denote the binary latent mixture component indicator variable for each r^c_j,

p(δ^c_j) = Multinomial(δ^c_j | m^c_in, m^c_out) = (m^c_in)^{δ^c_j} (m^c_out)^{1−δ^c_j},   (3)

p(r^c_j | δ^c_j) = N(r^c_j | 0, s²_in) if δ^c_j = 1, and N(r^c_j | 0, s²_out) if δ^c_j = 0.   (4)

Each Gaussian mixture variance has an inverse-gamma prior, which for the ‘no difference’ variance, s²_in, is set to have very low mean (and not overly dispersed) so that ‘no difference’ regions truly have little difference from the parent class, while for the ‘class-difference’ variance, s²_out, the prior is set to have a larger mean, so as to model our belief that substantial class-specific differences do occasionally exist. The priors for α^c, ρ^c, α, ρ are each log-normal (inverse-gamma priors would not be conjugate in this model, so we use log-normals, which are easier to specify). Additionally, the mixing proportions, m^c_in, m^c_out, have a Dirichlet prior, which typically encodes our belief that the proportion that are ‘class differences’ is likely to be small.

2.2 The HMM Portion of the Model

Each observed time series x^k is modeled as being generated by an HMM conditioned on the appropriate child trace, z^{w^k}. The probability of an observed time series conditioned on a path of hidden time states, τ^k, and the child trace, is given by

p(x^k | z^{w^k}, τ^k) = Π_{i=1}^{N} N( x^k_i | z^{w^k}_{τ^k_i} u^k, ξ^k ),

where ξ^k is the emission variance for time series k, and the scale factor, u^k, allows for constant, global, multiplicative rescaling. The HMM transition probabilities T^k(τ^k_{i−1} → τ^k_i) are multinomial within a limited range, with p^k(τ_i = a | τ_{i−1} = b) = κ^k_{(a−b)} for (a − b) ∈ [1, J_τ], and p^k(τ_i = a | τ_{i−1} = b) = 0 for (a − b) < 1 or (a − b) > J_τ, where J_τ is the maximum allowable number of consecutive time states that can be advanced in a single transition. (Of course, Σ_{i=1}^{J_τ} κ^k_i = 1.) This multinomial distribution, in turn, has a Dirichlet prior. The HMM emission variances, ξ^k, have an inverse-gamma prior. Additionally, the prior over the first hidden time state is a uniform distribution over a constant number of states, 1..Q, where Q defines how large a shift can exist between any two observed time series. The prior over each global scaling parameter, u^k, is a log-normal with fixed variance and mean of zero, which encourages the scaling factors to remain near unity.
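To make the generative process of Sections 2.1 and 2.2 concrete, here is a small forward-sampling sketch of the prior. It is our own illustration under simplifying assumptions: the hyperparameter values are made up, the parent and child chains share a single α and ρ, the parent impulses are drawn from a standard normal rather than the hierarchical N(µ_par, s_par) prior, dense matrix inversion is used instead of the banded computations discussed below, and hidden-time indices are simply clipped at the end of the trace.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, C = 120, 50, 2                 # latent length, observed length, number of classes
alpha, rho = 2.0, 0.5                # smoothness / impulse coupling (shared here for brevity)
s2_in, s2_out = 1e-4, 1.0            # 'no difference' vs 'class difference' impulse variances
m_out = 0.05                         # prior probability that a position is a class difference
J_tau, Q = 3, 5                      # maximum jump size, maximum initial offset
kappa = np.array([0.3, 0.5, 0.2])    # jump-size probabilities kappa_1..kappa_Jtau (sum to 1)
u, xi = 1.0, 0.05                    # global scale factor and emission variance

# Tridiagonal precision of the energy impulse chain (see the previous snippet).
lap = (np.diag(np.r_[1.0, 2.0 * np.ones(M - 2), 1.0])
       - np.diag(np.ones(M - 1), 1) - np.diag(np.ones(M - 1), -1))
P = lap / alpha + np.eye(M) / rho
cov = np.linalg.inv(P)               # dense inverse is fine for a small illustration

# Parent trace from Eq. (2); its impulses are simplified to standard normals here.
r = rng.standard_normal(M)
z = rng.multivariate_normal(cov @ r / rho, cov)

series = []
for c in range(C):
    # Class-specific impulses from the inlier/outlier mixture, Eqs. (3)-(4).
    is_diff = rng.random(M) < m_out
    r_c = rng.normal(0.0, np.sqrt(np.where(is_diff, s2_out, s2_in)))
    # Difference vector from Eq. (1), then the child trace z_c = z + d_c.
    d_c = rng.multivariate_normal(cov @ r_c / rho, cov)
    z_c = z + d_c
    # Emit one observed series by advancing through latent time with bounded jumps.
    tau = int(rng.integers(1, Q + 1))
    x = np.empty(N)
    for i in range(N):
        x[i] = rng.normal(u * z_c[min(tau, M) - 1], np.sqrt(xi))   # clip at trace end
        tau += int(rng.choice(np.arange(1, J_tau + 1), p=kappa))
    series.append(x)
```

With a small m_out, the two generated classes agree almost everywhere and differ only where an outlier impulse fires, which is exactly the behaviour the model is designed to capture in real data.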
3 Posterior Inference of Alignments and Parameters by MCMC

Given a set of observed time series (and their associated class labels), the main computational operation to be performed in the HB-CPM is inference of the latent traces, alignment state paths and other model parameters. Exact inference is analytically intractable, but we are able to use Markov Chain Monte Carlo (MCMC) methods to create an iterative algorithm which, if run for sufficiently long, produces samples from the correct posterior distribution. This posterior provides simultaneous alignments of all observed time series in all classes, and also, crucially, aligned canonical representations of each class, along with error bars on these representations, allowing for a principled approach to difference detection in time series data from different classes. We may also wish to obtain a posterior estimate of some of our parameters conditioned on the data, and marginalized over the other parameters. In particular, we might be interested in obtaining the posterior over hidden time state vectors for each time series, τ^k, which together provide a simultaneous, multi-class alignment of our data. We may, in addition, or alternatively, be interested in the posterior of the child traces, z^c, which together characterize how the classes agree and disagree. The former may be more of interest for visualizing aligned observed time series, or in expanding out aligned scalar time series to a related vector time series, while the latter would be more of interest when looking to characterize differences in multi-class, scalar time series data.

We group our parameters into blocks, and sample these blocks conditioned on the values of the other parameters (as in Gibbs sampling) — however, when certain conditional distributions are not amenable to direct sampling, we use slice sampling [8]. The scalar conditional distributions for each of µ_par, s_par, m^c_in, m^c_out, δ^c_j, and κ^k_i are known distributions, amenable to direct sampling. The conditional distributions for the scalars α^c, ρ^c, α, ρ and u^k are not tractable, and for each of these we use slice sampling (doubling out and shrinking). The conditional distribution for each of r and r^c is multivariate Gaussian, and we sample directly from each using a Cholesky decomposition of the covariance matrix:

p(r | z, α, ρ) = (1/Z) p(z | r, α, ρ) p(r) = N(r | c, C),   (5)

p(r^c | d^c, α^c, ρ^c) = (1/Z) p(d^c | r^c, α^c, ρ^c) p(r^c) = N(r^c | b, B),   (6)

where, using I to denote the identity matrix and 1 the vector of ones,

C = ( S/ρ² + I/s_par )^{-1},   c = C ( z/ρ + µ_par 1/s_par ),   (7)

B = ( S†/(ρ^c)² + (v^c)^{-1} )^{-1},   b = B d^c/ρ^c.   (8)

The diagonal matrix v^c consists of the mixture component variances (s²_in or s²_out). S^{-1} [or S†^{-1}] is the tridiagonal precision matrix of the multivariate normal distribution p(z | r, α, ρ) [or p(d^c | r^c, α^c, ρ^c)], and has entries S^{-1}_{j,j} = 2/α + 1/ρ for j = 2..(M−1), S^{-1}_{j,j} = 1/α + 1/ρ for j = 1, M, and S^{-1}_{j,j+1} = S^{-1}_{j+1,j} = −1/α [and analogously for S†^{-1}, with α^c and ρ^c]. The computation of C and B can be made more efficient by using the Sherman-Morrison-Woodbury matrix inversion lemma. For example, B = (ρ^c)² [ S†^{-1} − S†^{-1} ( S†^{-1} + v^c/(ρ^c)² )^{-1} S†^{-1} ], so that we have S^{-1} [or S†^{-1}] almost for free and no longer need to invert S [or S†] to obtain it.

The conditional distributions of each of z and z^c are also multivariate Gaussian. However, because of the underlying Markov dependencies, their precision matrices are tridiagonal, and hence we can use belief propagation, in the style of Kalman filtering, followed by a stochastic traceback to sample from them efficiently. Thus each can be sampled in time proportional to M rather than M³, as required for a general multivariate Gaussian. Lastly, to sample from the conditional distribution of the hidden time vectors for each time series, τ^k, we run belief propagation (analogous to the HMM forward-backward algorithm) followed by a stochastic traceback.
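As one concrete example of these Gaussian block updates, the sketch below draws the parent energy impulses r from N(c, C), following equations (5) and (7) as reconstructed above, using a Cholesky factor of the covariance. It is illustrative only: the hyperparameters are arbitrary, a random vector stands in for the current parent trace, and the dense inversion of S would in practice be replaced by the Woodbury and banded computations described in the text.

```python
import numpy as np

# Illustrative block update for the parent energy impulses r given the current parent
# trace z (Eqs. 5 and 7). Dense algebra is used for clarity; all values are arbitrary.
M, alpha, rho, mu_par, s_par = 120, 2.0, 0.5, 0.0, 1.0
rng = np.random.default_rng(2)
z = rng.standard_normal(M)                       # stand-in for the current parent trace

# Tridiagonal precision S^{-1} of p(z | r), and its dense inverse S.
lap = (np.diag(np.r_[1.0, 2.0 * np.ones(M - 2), 1.0])
       - np.diag(np.ones(M - 1), 1) - np.diag(np.ones(M - 1), -1))
S_inv = lap / alpha + np.eye(M) / rho
S = np.linalg.inv(S_inv)

C_mat = np.linalg.inv(S / rho**2 + np.eye(M) / s_par)          # Eq. (7)
c_vec = C_mat @ (z / rho + np.full(M, mu_par / s_par))

# Draw r ~ N(c, C) via a Cholesky factor, as described in the text.
r_sample = c_vec + np.linalg.cholesky(C_mat) @ rng.standard_normal(M)
```

The analogous update for r^c only changes the prior term, replacing I/s_par with the diagonal inverse mixture variances (v^c)^{-1} and the trace z with the difference vector d^c.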
In our experiments, the parent trace was initialized by averaging one smoothed example from each class. The child traces were initialized to the initial parent trace. The HMM states were initialized by a Viterbi decoding with respect to the initial values of the other parameters. The scaling factors were initialized to unity, and the child energy impulses to zero. MCMC was run for 5000 iterations, with convergence generally realized in less than 1000 iterations.

4 Experiments and Results

We demonstrate use of the HB-CPM on two data sets. The first data set is part of the NASA shuttle valve data [4], which measures valve solenoid current against time for some ‘normal’ runs and some ‘abnormal’ runs. Measurements were taken at a rate of one sample per millisecond, with 1000 samples per time series. We subsampled the data by a factor of 7 in time since it was extremely dense. The results of performing posterior inference in our model on this two-class data set are shown in Figure 1. They nicely match our intuition of what makes a good solution. In our experiments, we also compared our model to a simple “single-class” version of the HB-CPM in which we simply remove the child trace level of the model, letting all observed data in both classes depend directly on one single parent trace. The single-class alignment, while doing a reasonable job, does so by coercing the two classes to look more similar than they should. This is evident in one particular region labeled on the graph and discussed in the legend. Essentially, a single-class alignment causes us to lose class-specific fine detail — the precise information we seek to retain for difference detection. The second data set is from a botany study which uses reverse-phase HPLC (high performance liquid chromatography) as a high-throughput screening method to identify genes involved in xenobiotic uptake and metabolism in the model plant Arabidopsis thaliana. Liquid-chromatography (LC) techniques are currently being developed and refined with the aim of providing a robust platform with which to detect differences in biological organisms — be they plants, animals or humans. Detected differences can reveal new fundamental biological insight, or can be applied in more clinical settings. LC-mass spectrometry technology has recently undergone explosive growth in tackling the problem of biomarker discovery — for example, detecting biological markers that can predict treatment outcome or severity of disease, thereby providing the potential for improved health care and better understanding of the mechanisms of drug and disease. In botany, LC-UV data is used to help understand the uptake and metabolism of compounds in plants by looking for differences across experimental conditions, and it is this type of data that we use here. LC separates mixtures of analytes on the basis of some chemical property — hydrophobicity, for the reverse-phase LC used to generate our data. Components of the analyte in our data set were detected as they came off the LC column with a Diode Array Detector (DAD), yielding UV-visible spectra collected at 540 time points (we used the 280 nm band, which is informative for these experiments). We performed background subtraction [2] and then subsampled this data by a factor of four. This is a three-class data set, where the first class is untreated plant extract, followed by two classes consisting of this same plant treated with compounds that were identified as possessing robust uptake in vivo and, hence, when metabolized, provide a differential LC-UV signal of interest.
Figure 3 gives an overview of the LC-UV results, while Figure 4 zooms in on a particular area of interest to highlight how subtle differences can be detected by the HB-CPM, but not by a single-class alignment scheme. As with the NASA data set, a single-class alignment coerces features across classes that are in fact different to look the same, thereby preventing us from detecting them. Recall that this data set consists of a ‘no treatment’ plant extract, and two ‘treatments’ of this same plant. Though our model was not informed of these special relationships, it nevertheless elegantly captures this structure by giving almost no energy impulses to the ‘no treatment’ class, meaning that this class is essentially the parent trace, and allowing the ‘treatment’ classes to diverge from it, thereby nicely matching the reality of the situation. All averaging over MCMC runs shown is over 4000 samples, after a 1000-sample burn-in period, which took around 3 hours for the NASA data, and 5 hours for the LC data set, on machines with dual 3 GHz Pentium 4 processors.

Figure 3: Seven time series from each of three classes of LC-UV data. On all figures, the horizontal axis is time, or latent time for figures of latent traces and observed time series aligned to latent traces. The vertical axis is log of UV absorbance. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM model. Top right: Parent trace. Middle right: Class-specific energy impulses, with the top-most showing the class impulses for the ‘no treatment’ class. Bottom right: Child traces superimposed. See Figure 4 for a zoom-in around the arrow.

5 Related Work

While much work has been done on time series alignment, and on comparison/clustering of time series, none of this work, to our knowledge, directly addresses the problem presented in this paper — simultaneously aligning and comparing sets of related time series in order to characterize how they differ from one another. The classical algorithm for aligning time series is Dynamic Time Warping (DTW) [10]. DTW works on pairs of time series, aligning one time series to a specified reference time series, in a non-probabilistic way, without explicit allowance for differences in related time series. More recently, Gaffney et al [5] jointly clustered and aligned time series data from different classes. However, their model does not attempt to put time series from different classes into correspondence with one another — only time series within a class are aligned to one another. Ziv Bar-Joseph et al [1] use a similar approach to cluster and align microarray time series data. Ramsay et al [9] have introduced a curve clustering model, in which a time warping function, h(t), for each time series is learned by way of learning its relative curvature, parameterized with order one B-spline coefficients. This model accounts for systematic changes in the range and domain of time series in a way that aligns curves with the same fundamental shape.
However, their method does not allow for class-specific differences between shapes to be taken into account. The anomaly detection (AD) literature deals with related, yet distinct problems. For example, Chan et al [3] build a model of one class of time series data (they use the same NASA valve data as in this paper), and then match test data, possibly belonging to another class (e.g. ‘abnormal’ shuttle valve data), to this model to obtain an anomaly score. Emphasis in the AD community is on detecting abnormal events relative to a normal baseline, in an on-line manner, rather than comparing and contrasting two or more classes from a dataset containing examples of all classes. The problem of ‘elastic curve matching’ is addressed in [6], where a target time series that best matches a query series is found by mapping the problem of finding the best matching subsequence to the problem of finding the cheapest path in a DAG (directed acyclic graph).

Figure 4: Left: A zoom-in of the data displayed in Figure 3, from the region of time 100-150 (labeled in that figure in latent time, not observed time). Top left: mean and standard deviation of the unaligned data. Middle left: mean and standard deviation of the single-class alignment. Bottom left: mean and standard deviation of the child traces from the HB-CPM. A case in point of a difference that could be detected with the HB-CPM and not in the raw or single-class aligned data is the difference occurring at time point 127. Right: The mean and standard deviation of the child energy impulses, with dashed lines showing correspondences with the child traces in the bottom left panel.

6 Discussion and Conclusion

We have introduced a hierarchical, Bayesian model to perform detection of rare differences between sets of related time series, a problem which arises across a wide range of domains. By training our model, we obtain the posterior distribution over a set of class-specific canonical representations of each class, which are aligned in a way that preserves their common sub-structures, yet retains and highlights important differences. This model can be extended in several interesting and useful ways. One small modification could be useful for the LC-UV data set presented in this paper, in which one of the classes was ‘no treatment’, while the other two were each a different ‘treatment’. We might model the ‘no treatment’ as the parent trace, and each of the treatments as a child trace, so that the direct comparison of interest would be made more explicit. Another direction would be to apply the HB-CPM in a completely unsupervised setting where we learn not only the canonical class representations, but also obtain the posterior over the class labels by introducing a latent class indicator variable. Lastly, one could use a model with cyclical latent traces to model cyclic data such as electrocardiogram (ECG) and climate data. In such a model, an observed trace being generated by the model would be allowed to cycle back to the start of the latent trace, and the smoothness constraints on the trace would be extended to apply to the beginning and end of the traces, coercing these to be similar. Such a model would allow one to do anomaly detection in cyclic data, as well as segmentation.

Acknowledgments: Thanks to David Ross and Roland Memisevic for useful discussions, and Ben Marlin for his Matlab slice sampling code.

References
[1] Z. Bar-Joseph, G. Gerber, D. K. Gifford, T. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In RECOMB, pages 39–48, 2002.
[2] H. Boelens, R. Dijkstra, P. Eilers, F. Fitzpatrick, and J. Westerhuis. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and Raman spectroscopic detection. Journal of Chromatography A, 1057:21–30, 2004.
[3] P. K. Chan and M. V. Mahoney. Modeling multiple time series for anomaly detection. In ICDM, 2005.
[4] B. Ferrell and S. Santuro. NASA shuttle valve data. http://www.cs.fit.edu/∼pkc/nasa/data/, 2005.
[5] S. J. Gaffney and P. Smyth. Joint probabilistic curve clustering and alignment. In Advances in Neural Information Processing Systems 17, 2005.
[6] L. Latecki, V. Megalooikonomou, Q. Wang, R. Lakaemper, C. Ratanamahatana, and E. Keogh. Elastic partial matching of time series, 2005.
[7] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems 17, 2005.
[8] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[9] J. Ramsay and X. Li. Curve registration. Journal of the Royal Statistical Society (B), 60, 1998.
[10] H. Sakoe and S. Chiba. Dynamic programming algorithm for spoken word recognition. Readings in Speech Recognition, pages 159–165, 1990.

6 0.4922193 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

7 0.45560786 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

8 0.41481036 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.40493158 130 nips-2006-Max-margin classification of incomplete data

10 0.39116877 113 nips-2006-Learning Structural Equation Models for fMRI

11 0.38546816 66 nips-2006-Detecting Humans via Their Pose

12 0.38431635 41 nips-2006-Bayesian Ensemble Learning

13 0.36883125 53 nips-2006-Combining causal and similarity-based reasoning

14 0.36865211 138 nips-2006-Multi-Task Feature Learning

15 0.36689118 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

16 0.36019644 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

17 0.35522604 98 nips-2006-Inferring Network Structure from Co-Occurrences

18 0.35471624 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

19 0.35303327 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data

20 0.34430078 115 nips-2006-Learning annotated hierarchies from relational data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.072), (3, 0.02), (7, 0.062), (9, 0.032), (20, 0.443), (22, 0.058), (44, 0.06), (54, 0.014), (57, 0.087), (65, 0.034), (69, 0.026), (71, 0.014), (90, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91617107 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

Author: Mark Johnson, Thomas L. Griffiths, Sharon Goldwater

Abstract: This paper introduces adaptor grammars, a class of probabilistic models of language that generalize probabilistic context-free grammars (PCFGs). Adaptor grammars augment the probabilistic rules of PCFGs with “adaptors” that can induce dependencies among successive uses. With a particular choice of adaptor, based on the Pitman-Yor process, nonparametric Bayesian models of language using Dirichlet processes and hierarchical Dirichlet processes can be written as simple grammars. We present a general-purpose inference algorithm for adaptor grammars, making it easy to define and use such models, and illustrate how several existing nonparametric Bayesian models can be expressed within this framework. 1

same-paper 2 0.89095467 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. The modelling assumption in such a case is that movies come in à types and viewers in Ä types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . (B) BMF shown pictorally. (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. 
2 BMF model description Binary matrix factorization is a model of an Á ¢  dyadic data matrix with exchangeable rows and columns. The entries of can be real-valued, binary, or categorial; BMF models suitable for each type are described below. Associated with each row is a latent binary feature vector ; similarly each column has an unobserved binary vector . The primary parameters are represented of interaction weights. is generated by a fixed observation process ´¡µ applied by a matrix (elementwise) to the linear inner product of the features and weights, which is the “factorization” or approximation of the data: Ù Ú Ï ÍÎÏ ´ÍÏÎ ¢µ (1) where ¢ are extra parameters specific to the model variant. Three possible parametric forms for and covariance ´½ µ ; the noise (observation) distribution are: Gaussian, with mean logistic, with mean ½ ´½ · ÜÔ´  µµ; and Poisson, with mean (and variance) . Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5]. ÍÏÎ ÍÏÎ ÍÏÎ Á To complete the description of the model, we need to specify prior distributions over the feature matrices and the weights . We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices with Á rows and à columns, we generate a bias independently for each column using a Beta prior (denoted ) and then conditioned on this bias generate the entries in column independently from a Bernoulli with mean . ÍÎ Ï «Ã Í È Á Í ´« à ¬ µ à ½ ½ Ù « ´½   µ½ Ù « « à ½ ´ Ò « «µ ´½   µÁ  Ò where Ò Ù . The hyperprior on the concentration « is a Gamma distribution (denoted ), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number Ò of active features in each column. An identical prior is used on , with  rows and Ä columns, but with different concentration prior . The variable ¬ was set to ½ for all experiments. Î The appropriate prior distribution over weights depends on the observation distribution is a matrix normal with prior mean the linear-Gaussian variant, a convenient prior on Ï ´¡µ. For ÏÓ and µ Á. covariance ´½ hyperpriors: The scale of the weights and output precision Ï ÏÓ (if needed) have Gamma Æ ´ÏÓ ´½ µ Áµ ´ µ ´ µ In certain cases, when the prior on the weights is conjugate to the output distribution model , the weights may be analytically integrated out, expressing the marginal distribution of the data only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process. ÍÎ Í Î Remarkably, the Beta-Bernoulli prior distribution over (and similarly ) can easily be extended to the case where à ½, creating a distribution over binary matrices with a fixed number Á of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP) was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version. 
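Because the extraction above garbles most of the mathematical symbols in this section, a small forward-sampling sketch of the finite linear-Gaussian BMF generative process may help as a companion. It is our own illustration, not the authors' code: the matrix sizes and hyperparameters are arbitrary, the column biases use the standard finite Beta-Bernoulli construction Beta(α/K, 1) (i.e. β = 1, as in the experiments), and the weights are given a zero prior mean rather than a general prior mean matrix.

```python
import numpy as np

# Forward sample from a finite linear-Gaussian BMF prior: X ~ U W V^T + noise.
# All sizes and hyperparameters below are illustrative, not the paper's settings.
rng = np.random.default_rng(3)
I, J, K, L = 100, 80, 6, 5          # rows, columns, row features, column features
alpha_u, alpha_v = 3.0, 3.0         # Beta-Bernoulli concentrations for U and V
sigma_w, sigma_x = 1.0, 0.1         # weight scale and observation noise std

def beta_bernoulli(n_rows, n_feats, alpha):
    # Column biases pi_k ~ Beta(alpha / K, 1); entries in column k ~ Bernoulli(pi_k).
    pi = rng.beta(alpha / n_feats, 1.0, size=n_feats)
    return (rng.random((n_rows, n_feats)) < pi).astype(float)

U = beta_bernoulli(I, K, alpha_u)                   # binary row features
V = beta_bernoulli(J, L, alpha_v)                   # binary column features
W = sigma_w * rng.standard_normal((K, L))           # real-valued interaction weights

X = U @ W @ V.T + sigma_x * rng.standard_normal((I, J))   # linear-Gaussian responses
```

For the logistic or Poisson variants mentioned in the text, the last line would instead pass U W Vᵀ through the corresponding link function (a sigmoid or an exponential) before sampling the observations.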
3 Inference of features and parameters Í As with many other complex hierarchical Bayesian models, exact inference of the latent variables and in the BMF model is intractable (ie there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov Chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples. Î 3.1 Finite binary latent feature matrices Í Î The posterior distribution of a single entry in (or ) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries of (or ) can be done using the following updates: È ´Ù È ´Ù where Ò  on « à and Í Î ½ Í  Î Ï µ ´ « à · Ò   µ È ´ Í  Ù ½ Î Ï µ (2) ¼ Í  Î Ï µ ´¬ · ´Á   ½µ   Ò  µ È ´ Í  Ù ¼ Πϵ (3) È Ù , Í  excludes entry , and is a normalizing constant. (Conditioning is implicit.) When conditioning on Ï, we only need to calculate the ratio of likeli- hoods corresponding to row . (Note that this is not the case when the weights are integrated out.) È ½) and This ratio is a simple function of the model’s predictions Ü· Ð Ù Ú Ð Û Ð (when Ù È Ü  Ù Ú Ð Û Ð (when Ù ¼). In the linear-Gaussian case: Ð ÐÓ È ´Ù È ´Ù ½ ¼ Í  Î Ï Í  Î Ï µ µ ÐÓ ´« à · Ò  µ ¬ · ´Á   ½µ   Ò  ´ µ  ½ ¾ ¢ Ü ´   Ü· µ¾   ´Ü   Ü  µ¾ £ In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a “vectorized” representation of our variables. Let be an Á column vector taken column-wise from , be a ÃÄ column vector taken column-wise from and be a Á ¢ ÃÄ binary matrix which is the kronecker product ª . (In “Matlab notation”, ´µ ´ µ and ÖÓÒ´ µ.) In this notation, the data distribution is written as: Æ´ ´½ µ µ. Given values for and , samples can be drawn for , , and using the following posterior distributions (where conditioning on Ó is implicit): Æ ´ · µ ½ ´ · Óµ ´ · µ ½ Ï Î Í Û Ü Û Û Ü Ï Ü Ü Û Û Ï Û Á Û ÎÍ Á Ü Û Í Á Î Û Û Ü · ÃÄ ¾ · Á ¾ · ½ ´Û   ÛÓ µ ´Û   ÛÓ µ ¾ ½ ´Ü   Ûµ ´Ü   Ûµ ·¾ Note that we do not have to explicitly compute the matrix . For computing the posterior of linearGaussian weights, the matrix can be computed as ÖÓÒ´ µ. Similarly, the expression is constructed by computing and taking the elements column-wise. Ü Î ÎÍ Í Í Î 3.2 Infinite binary latent feature matrices One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row of the matrix ), it is useful to consider partitioning the columns of into two sets as shown below. Let set A have at least one non-zero entry in rows other than . 
Let set B be all other set A set B columns, including the set of columns where 0 1 0 0 1 0 0 0 0 0 ¡¡¡ the only non-zero entries are found in row 0 0 1 0 0 0 0 0 0 0 ¡¡¡ and the countably infinite number of all-zero 1 1 0 0 1 0 0 0 0 0 ¡¡¡ 1 0 0 1 1 0 0 0 0 0 ¡¡¡ columns. Sampling values for elements in row 1 1 0 0 1 0 1 0 1 0 row of set A given everything else is straightfor0 1 0 0 0 0 0 0 0 0 ¡¡¡ ward, and involves Gibbs updates almost iden0 0 0 1 0 0 0 0 0 0 ¡¡¡ tical to those in the finite case handled by equa1 0 0 0 1 0 0 0 0 0 ¡¡¡ tions (2) and (3); as à ½ and in set A we get: Í Í ½ Í  Î Ï µ ¼ Í  Î Ï µ È ´Ù È ´Ù ¡ Ò   È ´ Í  Ù ½ Î Ï µ ¡ ´¬ · Á   ½   Ò  µ È ´ Í  Ù ¼ Πϵ (4) (5) When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries Ò in set B which will be turned on in row . Sampling the number of entries set to ½ can be done with Metropolis-Hastings updates. Let  ´Ò Ò µ Poisson ´Ò « ´¬ · Á   ½µµ be the proposal distribution for a move which replaces the current Ò active entries with Ò active entries in set B. The reverse proposal is  ´Ò Ò µ. The acceptance ¡ probability is Ñ Ò ½ ÖÒ Ò , where ÖÒ Ò is È ´Ò È ´Ò µ  ´Ò Ò µ µ  ´Ò Ò µ È´ Ò È´ Ò µ Poisson´Ò « ´¬ · Á   ½µµÂ ´Ò Ò µ µ Poisson´Ò « ´¬ · Á   ½µµÂ ´Ò Ò µ Ï È´ Ò È´ Ò µ µ (6) This assumes a conjugate situation in which the weights are explicitly integrated out of the model to compute the marginal likelihood È ´ Ò µ. In the non-conjugate case, a more complicated proposal is required. Instead of proposing Ò , we jointly propose Ò and associated feature parameters from their prior distributions. In the linear-Gaussian model, where is a set of weights for features in set B, the proposal distribution is: Û Û « ´¬ · Á   ½µµ Normal ´Û Ò µ (7) We need actually sample only the finite portion of Û where Ù ½. As in the conjugate case, the  ´Ò Û Ò Ûµ Poisson ´Ò acceptance ratio reduces to the ratio of data likelihoods: ÖÒ Û Ò Û È´ Ò È´ Ò Ûµ Ûµ ÍÎ Ï (8) 3.3 Faster mixing transition proposals are the simplest moves we could The Gibbs updates described above for the entries of , and make in a Markov Chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row simultaneously. For example, we can propose new values for several columns in row of matrix by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the maximum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration. Í Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in are selected uniformly at random. 
If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of to increase acceptance probability. Í Í 3.4 Predictions A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices and , and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is highly complex, potentially multimodal, and non-linear function of the observed variables. Í Î Í Î and . In our By averaging predictive distributions, our algorithm implicitly integrates over experiments, we show samples from the posteriors of and to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current value of « and (higher values will result in more features) and precision values (higher weight precision results in less variation in weights). Í Î 4 Experiments 4.1 Modified “bars” problem A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. Data consists of Á vectors of size ¾ where each vector can be reshaped into a square image. The generation process is as follows: since has the same number of rows as the dimension of the images, is fixed to be a set of vertical and horizontal bars (when reshaped into an image). is sampled from the IBP, and global precisions and are set to ½ ¾. The weights are sampled from zero mean Gaussians. Model estimates of and were initialized from an IBP prior. Î Î Í Ï Î Í In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row in Figure 2. Some examples have their bottom halves labeled missing and are shown in the Figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows . Despite the relatively high the expected reconstruction using MCMC samples of , , and ÍÎ Ï noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. 
The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half. (A) Data samples (B) Noise-free data (C) Initial reconstruction (D) Mean reconstruction (E) Nearest neighbour Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, ¼, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top-half of the original data, these are the noise-free nearest neighbours in pixel space. Î ÎÏ Î ÏÎ Figure 3: Bars features. The top row shows values of and used to generate the data. The second row shows a sample of and from the Markov chain. can be thought of as a set of basis images which can be added together with binary coefficients ( ) to create images. Î ÏÎ ÏÎ Í By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of and along with one sample of those basis features from the Markov chain. Because the model is generated by adding multiple images shown on the right of Figure 3, multiple bars are used in each image. This is reflected in the captured features. The learned are fairly similar to the generating , but the former are composed of overlapping bar structure (learned ). Î ÏÎ ÏÎ ÏÎ ÏÎ Î 4.2 Digits In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits ½, ¾, and ¿ from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled-in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (È ´Ü ½µ ¼ ) of the BMF reconstruction. In the bottom row we have shown the nearest neighbors, in pixel space, to the training examples based only on the top halves of the original digits. In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in on. It is clear that some row features are shown. have distinct digit forms and others are overlapping. In row G, the basis images By adjusting the features that are non-zero in each row of , images are composed by adding basis images together. Finally, in row H we show . These pixel features mask out different regions in Î Í Í ÏÎ pixel space, which are weighted together to create the basis images. Note that there are à features in rows F and G, and Ä features in row H. (A) (B) (C) (D) (E) (F) (G) (H) Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. 
(E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The feature reshaped in the form of digits. By adding these features together, which the features do, reconstructions of the digits is possible. (H) reshaped into the form of digits. The first image represents a bias feature. ÏÎ Í Î Í 4.3 Gene expression data Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model[10], originally introduced for analysis of gene expression data, can be thought of as a nonBayesian special case of our model in which the matrix is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consists of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute Leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in were observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of or with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail. Ï Í Î 5 Conclusion We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. The model explains each row and column entity using a componential code composed of multiple binary latent features along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and (A) (B) Figure 5: Gene expression results. (A) The top-left is sorted according to contiguous features in the final and in the Markov chain. The bottom-left is and the top-right is . The bottomright is . (B) The same as (A), but the expected value of , . We have highlighted regions that have both Ù and Ú Ð on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair. Í Ï Î Î ÍÏÎ Í demonstrated the model’s ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation and from factorial models with continuous latent variables which interact linearly to produce the observations. 
This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper. References [1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995. [2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990. [3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995. [4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999. [5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005. [6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972. [7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994. [8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis. [9] C. Kemp, J. B. Tenebaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006. [10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002. [11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002. [12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994. [13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Stanford University, 1999. Department of Statistics.

3 0.8696174 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

4 0.70960832 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

5 0.51633185 41 nips-2006-Bayesian Ensemble Learning

Author: Hugh A. Chipman, Edward I. George, Robert E. Mcculloch

Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy. 1

6 0.49635249 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

7 0.488222 192 nips-2006-Theory and Dynamics of Perceptual Bistability

8 0.4830364 121 nips-2006-Learning to be Bayesian without Supervision

9 0.48160404 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

10 0.47853258 53 nips-2006-Combining causal and similarity-based reasoning

11 0.47702423 42 nips-2006-Bayesian Image Super-resolution, Continued

12 0.4760552 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

13 0.45950201 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

14 0.4582561 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

15 0.45429224 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

16 0.4506205 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

17 0.45044106 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

18 0.44624758 115 nips-2006-Learning annotated hierarchies from relational data

19 0.44480982 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

20 0.44235563 113 nips-2006-Learning Structural Equation Models for fMRI