nips nips2006 nips2006-9 knowledge-graph by maker-knowledge-mining

9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments


Source: pdf

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. [sent-7, score-0.783]

2 This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. [sent-8, score-0.549]

3 We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. [sent-9, score-0.554]

4 A variety of solutions to this problem are based on the analysis of similarity judgments. [sent-11, score-0.197]

5 By defining a probabilistic model that accounts for the similarity between stimuli based on their representation, statistical methods can be used to infer underlying representations from human similarity judgments. [sent-12, score-0.592]

6 The particular methods used to infer representations from similarity judgments depend on the nature of the underlying representations. [sent-13, score-0.333]

7 For stimuli that are assumed to be represented as points in some psychological space, multidimensional scaling algorithms [1] can be used to translate similarity judgments into stimulus locations. [sent-14, score-0.54]

8 For stimuli that are assumed to be represented in terms of a set of latent features, additive clustering is the method of choice. [sent-15, score-0.407]

9 The original formulation of the additive clustering (ADCLUS) problem [2] is as follows. [sent-16, score-0.296]

10 Assume that we have data in the form of an n × n similarity matrix S = [sij], where sij is the judged similarity between the ith and jth of n objects. [sent-17, score-0.579]

11 Under these assumptions, a representation that uses m features to describe n objects is given by an n × m matrix F = [fik], where fik = 1 if the ith object possesses the kth feature, and fik = 0 if it does not. [sent-20, score-0.892]

12 Each feature has an associated non-negative saliency weight, collected in the vector w = (w1, . . . , wm). [sent-21, score-0.283]

13 When written in matrix form, the ADCLUS model seeks to uncover a feature matrix F and a weight vector w such that S ≈ FWF^T, where W = diag(w) is a diagonal matrix with nonzero elements corresponding to the saliency weights. [sent-25, score-0.382]
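
As a concrete illustration of the S ≈ FWF^T factorization above, here is a minimal Python/numpy sketch; it is not taken from the paper's code, and the toy feature matrix and weights are made up for the example.

```python
import numpy as np

def predicted_similarity(F, w):
    """Model similarities S_hat = F diag(w) F^T for a binary feature matrix F (n x m)
    and non-negative saliency weights w (length m)."""
    F = np.asarray(F, dtype=float)
    w = np.asarray(w, dtype=float)
    return F @ np.diag(w) @ F.T

# Toy example: 4 objects, 2 features plus a feature shared by all objects,
# which plays the role of the additive constant.
F = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 1],
              [0, 1, 1]])
w = np.array([0.6, 0.4, 0.2])
print(predicted_similarity(F, w))
```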

14 In most applications it is assumed that there is a fixed “additive constant”, a required feature possessed by all objects. [sent-26, score-0.227]

15 2 A Nonparametric Bayesian ADCLUS Model To formalize additive clustering as a statistical model, it is standard practice to assume that the error terms are i.i.d. [sent-27, score-0.296]

16 Equation 1 reveals that the additive clustering model is structurally similar to the better-known factor analysis model [4], although there are several differences: most notably the constraints that F is binary valued, W is necessarily diagonal and S is non-negative. [sent-36, score-0.296]

17 In any case, if we define µij = Σk wk fik fjk to be the similarity predicted by a particular choice of F and w, then sij | F, w, σ ∼ Normal(µij, σ²), (2) where σ² is the variance of the Gaussian error distribution. [sent-37, score-0.915]

18 However, self-similarities sii are not modeled in additive clustering, and are generally fixed to (the same) arbitrary values for both the model and data. [sent-38, score-0.198]
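
A hedged sketch of the likelihood in Equation 2, assuming a numpy representation of S, F and w and simply masking out the unmodelled self-similarities on the diagonal:

```python
import numpy as np

def log_likelihood(S, F, w, sigma):
    """Gaussian log-likelihood of an observed similarity matrix S under Equation 2,
    ignoring the (unmodelled) self-similarities s_ii."""
    F = np.asarray(F, dtype=float)
    mu = F @ np.diag(w) @ F.T
    mask = ~np.eye(S.shape[0], dtype=bool)       # drop the diagonal terms
    r = (S - mu)[mask]
    return -0.5 * np.sum(r ** 2) / sigma ** 2 - r.size * np.log(sigma * np.sqrt(2 * np.pi))
```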

19 In our approach, additive clustering is framed as a form of nonparametric Bayesian inference, in which Equation 2 provides the likelihood function, and the model is completed by placing priors over the weights w and the feature matrix F. [sent-40, score-0.533]

20 We assume a fixed Gamma prior over the feature saliencies, though it is straightforward to extend this to other, more flexible, priors. [sent-41, score-0.291]

21 Setting a prior over binary feature matrices F is more difficult, since there is generally no good reason to assume an upper bound on the number of features that might be relevant to a particular similarity matrix. [sent-42, score-0.621]

22 Each customer entering the restaurant samples a number of dishes from the buffet, with a preference for those dishes that other diners have tried. [sent-45, score-0.254]

23 For the kth dish sampled by at least one of the first n − 1 customers, the probability that the nth customer will also try that dish is p(fnk = 1 | Fn−1) = nk / n, (3) where Fn−1 records the choices of the previous customers, and nk denotes the number of previous customers that have sampled that dish. [sent-46, score-0.347]

24 Being adventurous, the new customer may try some hitherto untasted meals from the infinite buffet on offer. [sent-47, score-0.232]
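
The buffet metaphor in sentences 22-24 translates directly into a sampler for the IBP prior. The sketch below follows the standard construction (existing dish taken with probability nk/n, plus a Poisson(α/n) number of new dishes); it is an illustration, not the authors' implementation.

```python
import numpy as np

def sample_ibp(n, alpha, rng=None):
    """Draw a binary feature matrix from the Indian Buffet Process.
    Customer i takes existing dish k with probability n_k / i and then tries
    Poisson(alpha / i) brand-new dishes."""
    rng = np.random.default_rng(rng)
    columns = []          # each entry is one dish's 0/1 assignments so far
    counts = []           # n_k for each dish
    for i in range(1, n + 1):
        for k in range(len(columns)):
            take = rng.random() < counts[k] / i
            columns[k].append(int(take))
            counts[k] += int(take)
        for _ in range(rng.poisson(alpha / i)):
            columns.append([0] * (i - 1) + [1])   # new dish: only customer i has tried it
            counts.append(1)
    if not columns:
        return np.zeros((n, 0), dtype=int)
    return np.array(columns).T                    # shape (n objects, K dishes)
```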

25 The complete IBP-ADCLUS model becomes: sij | F, w, σ ∼ Normal(µij, σ²), wk | λ1, λ2 ∼ Gamma(λ1, λ2), F | α ∼ IBP(α). (4) [sent-49, score-0.356]
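
Putting the pieces of Equation 4 together, a forward sampler for the IBP-ADCLUS model might look as follows. It reuses the sample_ibp and predicted_similarity helpers sketched above; the hyperparameter values and the Gamma(shape, rate) parameterization are assumptions for the example, not values from the paper.

```python
import numpy as np

def sample_ibp_adclus(n, alpha=2.0, lam1=2.0, lam2=2.0, sigma=0.1, rng=None):
    """Forward-sample a similarity matrix from the IBP-ADCLUS model (Equation 4)."""
    rng = np.random.default_rng(rng)
    F = sample_ibp(n, alpha, rng)
    w = rng.gamma(shape=lam1, scale=1.0 / lam2, size=F.shape[1])   # Gamma(shape, rate)
    mu = predicted_similarity(F, w)
    S = mu + rng.normal(0.0, sigma, size=mu.shape)
    S = (S + S.T) / 2.0        # keep the observed matrix symmetric
    return S, F, w
```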

26 3 A Gibbs-Metropolis Sampling Scheme As a Bayesian formulation of additive clustering, statistical inference in Equation 4 is based on the posterior distribution over feature matrices and saliency vectors, p(F, w | S). [sent-51, score-0.705]

27 Unfortunately, this is generally quite difficult, so a natural alternative is to use Markov chain Monte Carlo (MCMC) methods to repeatedly sample from the posterior distribution: estimates of posterior quantities can be made using these samples as proxies for the full distribution. [sent-53, score-0.342]

28 If the current saliency is wk, a candidate w∗k is first generated from a Gaussian proposal distribution centered on wk. [sent-57, score-0.595]

29 The value of wk is then reassigned using the Metropolis update rule. [sent-59, score-0.234]

30 If w−k denotes the set of all saliencies except wk, this rule is: wk ← w∗k with probability a, and wk is retained with probability 1 − a, where a = min(1, p(S | F, w−k, w∗k) p(w∗k | λ) / [p(S | F, w−k, wk) p(wk | λ)]). (5) [sent-60, score-1.123]

31 With a Gamma prior, the Metropolis sampler automatically rejects all negative-valued candidates w∗k. [sent-61, score-0.347]
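
A rough sketch of the Metropolis step for a single saliency (Equation 5). The proposal step size and the Gamma(shape, rate) parameterization are assumptions, not values taken from the paper.

```python
import numpy as np

def metropolis_update_weight(S, F, w, k, sigma, lam1, lam2, step=0.05, rng=None):
    """One Metropolis step for saliency w_k: propose from a Gaussian centered on
    the current value and accept with probability min(1, a)."""
    rng = np.random.default_rng(rng)
    mask = ~np.eye(S.shape[0], dtype=bool)

    def log_post(wk):
        if wk <= 0:
            return -np.inf                                   # Gamma prior puts no mass on negatives
        w_try = w.copy()
        w_try[k] = wk
        mu = F @ np.diag(w_try) @ F.T
        log_lik = -0.5 * np.sum((S - mu)[mask] ** 2) / sigma ** 2
        log_prior = (lam1 - 1) * np.log(wk) - lam2 * wk      # Gamma(shape, rate), up to a constant
        return log_lik + log_prior

    proposal = rng.normal(w[k], step)
    if np.log(rng.random()) < log_post(proposal) - log_post(w[k]):
        w[k] = proposal
    return w
```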

32 For features currently possessed by at least one object, assignments are updated using a standard Gibbs sampler: the value of fik is drawn from the conditional posterior distribution over fik | S, F−ik, w. [sent-63, score-1.127]

33 Since feature assignments are discrete, it is easy to find this conditional probability by noting that p(fik |S, F−ik, w) ∝ p(S|F, w)p(fik|F−ik ), (6) where F−ik denotes the set of all feature assignments except fik . [sent-64, score-0.774]

34 Moreover, since feature assignments in the IBP are exchangeable, we can treat the kth assignment as if it were the last. [sent-66, score-0.277]

35 The Gibbs sampler deletes all single-stimulus features with probability 1, since n−ik will be zero for one of the stimuli. [sent-68, score-0.32]
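
The Gibbs step in Equation 6, combined with the exchangeability argument and the deletion of single-stimulus features described above, can be sketched as follows; the Gaussian likelihood and helper names are the same illustrative assumptions used earlier.

```python
import numpy as np

def gibbs_update_assignment(S, F, w, i, k, sigma, rng=None):
    """One Gibbs step for a single feature assignment f_ik (Equation 6): the IBP
    conditional prior n_{-i,k}/n is combined with the Gaussian likelihood."""
    rng = np.random.default_rng(rng)
    n = S.shape[0]
    n_minus = F[:, k].sum() - F[i, k]          # owners of feature k other than object i
    if n_minus == 0:
        F[i, k] = 0                            # single-stimulus features are deleted with probability 1
        return F
    mask = ~np.eye(n, dtype=bool)
    log_p = {}
    for val, prior_count in ((0, n - n_minus), (1, n_minus)):
        F_try = F.copy()
        F_try[i, k] = val
        mu = F_try @ np.diag(w) @ F_try.T
        log_p[val] = np.log(prior_count) - 0.5 * np.sum((S - mu)[mask] ** 2) / sigma ** 2
    diff = np.clip(log_p[0] - log_p[1], -700, 700)
    F[i, k] = int(rng.random() < 1.0 / (1.0 + np.exp(diff)))
    return F
```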

36 Since the IBP describes a prior over infinite feature matrices, the resampling procedure needs to accommodate the remaining (infinite) set of features that are not currently represented among the manifest features F. [sent-70, score-0.639]

37 When resampling feature assignments, some finite number of those currently-latent features will become manifest. [sent-71, score-0.336]

38-39 When sampling from the conditional prior over feature assignments for the ith stimulus, we hold the feature assignments fixed for all other stimuli, so this is equivalent to sampling some number of "singleton" features (i.e., features possessed only by stimulus i) from the conditional prior, which is Poisson(α/n), as noted previously. [sent-72, score-0.713] [sent-74, score-0.398]
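
One common way to handle the currently-latent features, consistent with the Poisson(α/n) conditional prior described here, is a Metropolis-style proposal of new singleton features with weights drawn from the prior. Everything in this sketch beyond the Poisson(α/n) proposal is an assumption rather than a detail taken from the paper.

```python
import numpy as np

def propose_singletons(S, F, w, i, sigma, alpha, lam1, lam2, rng=None):
    """Propose k_new ~ Poisson(alpha / n) new singleton features for stimulus i,
    with weights drawn from the Gamma prior, and accept on the likelihood ratio."""
    rng = np.random.default_rng(rng)
    n = S.shape[0]
    k_new = rng.poisson(alpha / n)
    if k_new == 0:
        return F, w
    F_new = np.hstack([F, np.zeros((n, k_new), dtype=int)])
    F_new[i, -k_new:] = 1                       # new features belong only to stimulus i
    w_new = np.concatenate([w, rng.gamma(lam1, 1.0 / lam2, size=k_new)])
    mask = ~np.eye(n, dtype=bool)

    def ll(Fm, wm):
        mu = Fm @ np.diag(wm) @ Fm.T
        return -0.5 * np.sum((S - mu)[mask] ** 2) / sigma ** 2

    if np.log(rng.random()) < ll(F_new, w_new) - ll(F, w):
        return F_new, w_new
    return F, w
```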

40 We did so by inspecting the time series plot formed by graphing the log posterior probability of successive samples. [sent-81, score-0.254]

41 To illustrate this, one of the chains used in our simulations (see Section 5) is displayed in Figure 2, with nine parallel chains used for comparison: the time series plot shows no long-term trends, and the different chains are visually indistinguishable from one another. [sent-82, score-0.279]
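
A very simple diagnostic in the spirit of this check, assuming each chain's log-posterior trace has been stored as an array, is to compare per-chain and per-half summaries:

```python
import numpy as np

def compare_chains(log_post_traces):
    """Crude convergence check: report each chain's mean and spread, plus the
    means of its first and second halves, to look for long-term trends."""
    for c, trace in enumerate(log_post_traces):
        trace = np.asarray(trace, dtype=float)
        half1, half2 = np.array_split(trace, 2)
        print(f"chain {c}: mean={trace.mean():.2f}, sd={trace.std():.2f}, "
              f"first-half mean={half1.mean():.2f}, second-half mean={half2.mean():.2f}")
```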

42 4 Four Estimators for the ADCLUS Model Since the introduction of the additive clustering model, a range of algorithms have been used to infer features, including “subset selection” [2], expectation maximization [3], continuous approximations [11] and stochastic hillclimbing [5] among others. [sent-84, score-0.328]

43 We will explore estimators based on computing the posterior distribution over F and w given S. [sent-88, score-0.334]

44 This includes estimators based on maximum a posteriori (MAP) estimation, corresponding to the value of a variable with highest posterior probability, and taking expectations over the posterior distribution. [sent-89, score-0.505]

45 Much of the literature defines an estimator conditional on the assumption that the number of features in the model m is fixed [3][11][12]. [sent-93, score-0.251]

46 If we take the posterior probability as our measure of utility, the estimators become F̂1, ŵ1 = arg max_{F,w} p(F, w | S, m). (7) Estimating the dimension is harder. [sent-95, score-0.41]
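
Given MCMC output, the estimator in Equation 7 can be approximated by picking the sampled pair with the highest unnormalized log posterior; the helper names below are illustrative.

```python
def map_estimate(samples, log_posterior):
    """Approximate (F_hat, w_hat) from Equation 7: `samples` is a list of (F, w)
    pairs from the chain and `log_posterior` scores one such pair."""
    scores = [log_posterior(F, w) for F, w in samples]
    best = max(range(len(samples)), key=lambda idx: scores[idx])
    return samples[best]
```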

47 The natural (MAP) estimate for m is easy to state: m̂1 = arg max_m p(m | S) = arg max_m Σ_{F ∈ Fm} ∫ p(F, w | S) dw, (8) where Fm denotes the set of feature matrices containing m unique features. [sent-96, score-0.334]
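
Equation 8 can likewise be approximated from the samples by counting the non-empty features in each sampled matrix and taking the empirical mode; this is a sketch, not the paper's procedure.

```python
import numpy as np
from collections import Counter

def posterior_over_m(samples):
    """Approximate p(m | S) from MCMC samples: count how many non-empty features
    each sampled F contains and return the empirical mode and distribution."""
    counts = Counter(int((np.asarray(F).sum(axis=0) > 0).sum()) for F, _ in samples)
    total = sum(counts.values())
    posterior = {m: c / total for m, c in sorted(counts.items())}
    m_hat = max(posterior, key=posterior.get)
    return m_hat, posterior
```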

48 The saliencies are estimated after F̂2 is chosen, via conditional MAP estimation: ŵ2 = arg max_w p(w | F̂2, S). (10) [sent-106, score-0.216]

49 This approach is typical of existing (parametric) Bayesian approaches to additive clustering [5][14], where analytic approximations to p(F | S) are used for expediency. [sent-107, score-0.296]

50 (b) Variance accounted for by the four similarity estimators Ŝ, where the target is either the observed training data So, a new test data set Sn, or the true similarity matrix St. [sent-115, score-0.68]

51 A fourth approach aims to summarize the posterior distribution by looking at the marginal posterior probabilities associated with particular features. [sent-117, score-0.342]

52 The probability that a particular feature fk belongs in the representation is given by p(fk | S) = Σ_{F : fk ∈ F} p(F | S). [sent-118, score-0.291]

53 Letting rk = p(fk | S) denote the posterior probability that feature fk is manifest, we can construct a vector r = [rk] that contains these probabilities for all 2^n possible features. [sent-120, score-0.461]

54 In practice, it is impossible to look at all 2^n features, so one would typically report only those features for which rk is large. [sent-122, score-0.212]

55 Since these tend to be the features that make the largest contributions to E[s∗ij | S], there is a sense in which this approach approximates the expected posterior similarities. [sent-123, score-0.347]
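
Both the marginal-feature estimator and the expected posterior similarities can be read off the samples. In this sketch a feature is identified with the set of stimuli that possess it, which is one reasonable but assumed convention.

```python
import numpy as np
from collections import defaultdict

def marginal_feature_probabilities(samples):
    """Estimate r_k = p(f_k | S) as the fraction of MCMC samples whose feature
    matrix contains that column (each feature identified by its stimulus set)."""
    counts = defaultdict(float)
    for F, _ in samples:
        seen = {tuple(col) for col in np.asarray(F).T if col.sum() > 0}
        for col in seen:
            counts[col] += 1.0
    return {col: c / len(samples) for col, c in counts.items()}

def expected_similarity(samples):
    """Posterior predictive similarities E[s*_ij | S], averaged over samples."""
    mats = [np.asarray(F, float) @ np.diag(w) @ np.asarray(F, float).T for F, w in samples]
    return np.mean(mats, axis=0)
```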

56 5 Recovering Noisy Feature Matrices By using the IBP-ADCLUS framework, we can compare the performance of the four estimators in a reasonable fashion. [sent-124, score-0.192]

57 Loosely following [12], we generated noisy similarity matrices with n = 8, 16 and 32 stimuli, based on “true” feature matrices Ft in which mt = 2 log2(n), where each object possessed each feature with probability 0. [sent-125, score-0.775]

58 We fixed α = 2 for all simulations: since the number of manifest features in an IBP model follows a Poisson(αHn) distribution (where Hn is the nth harmonic number) [6], the prior has a strong bias toward parsimony. [sent-130, score-0.309]

59 The prior expected number of features is approximately 5. [sent-131, score-0.215]
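
The Poisson(αHn) result quoted above gives the prior expected number of manifest features directly; a one-line computation (with α = 2 and n = 8 it comes out at roughly 5.4) is:

```python
import numpy as np

def expected_manifest_features(alpha, n):
    """Expected number of manifest features under the IBP prior: alpha * H_n,
    where H_n is the nth harmonic number."""
    return alpha * np.sum(1.0 / np.arange(1, n + 1))

print(expected_manifest_features(2.0, 8))
```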

60 For a given similarity matrix, 10 Gibbs-Metropolis chains were run from different start points, and 1000 samples were drawn from each. [sent-136, score-0.29]

61 The chains were burnt in for 1000 iterations, and a lag of 10 iterations was used between successive samples. [sent-137, score-0.185]

62 Figure 4: Posterior distributions over the number of features when the Bayesian ADCLUS model is applied to (a) the numbers data, (b) the countries data and (c) the letters data. [sent-150, score-0.263]

63 (a) The representation reported in [3], extracted using an EM algorithm with the number of features fixed at eight. [sent-152, score-0.249]

64 (b) The 10 most probable features extracted using the Bayesian ADCLUS model. [sent-153, score-0.254]

65 The first column gives the posterior probability that a particular feature belongs in the representation. [sent-154, score-0.329]

66 The second column displays the average saliency of a feature in the event that it is included. [sent-155, score-0.254]

67 (a) [Table 1(a): the features, listed as subsets of the digits 0-9 together with their weights and an additive constant.] [sent-156, score-0.198]

68 (b) [Table 1(b): the corresponding feature listing recovered by the Bayesian ADCLUS model, again with weights and an additive constant.] [sent-166, score-0.198]

69 Figure 3(a) shows the posterior distributions over the number of features m for each of the three simulation conditions. [sent-190, score-0.347]

70 There is a tendency to underestimate the number of features when provided with small similarity matrices, with the modal number being 3, 7 and 10. [sent-191, score-0.448]

71 However, since the posterior estimate of m is below the prior estimate when n = 8, it seems this effect is data-driven, as 79% of the variance in the data matrix So can be accounted for using only three features. [sent-192, score-0.341]

72 Since each approach allows the construction of an estimated similarity matrix Ŝ, a natural comparison is to look at the proportion of variance this estimate accounts for in the observed data So, the novel data set Sn, and the true matrix St. [sent-193, score-0.333]
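
One natural way to compute the variance-accounted-for comparison described here, restricted to the off-diagonal entries since self-similarities are not modelled, is sketched below; this is one plausible reading of the measure, not the paper's exact code.

```python
import numpy as np

def variance_accounted_for(S_hat, S_target):
    """Proportion of variance in a target similarity matrix accounted for by an
    estimate, computed over the off-diagonal entries only."""
    mask = ~np.eye(S_target.shape[0], dtype=bool)
    resid = (S_target - S_hat)[mask]
    centered = S_target[mask] - S_target[mask].mean()
    return 1.0 - np.sum(resid ** 2) / np.sum(centered ** 2)
```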

73 When n = 32, this profile is observed for all four estimators, suggesting that in this case all four estimators have converged appropriately. [sent-195, score-0.221]

74 A standard data set used in evaluating additive clustering models measures the conceptual similarity of the numbers 0 through 9 [17]. [sent-202, score-0.493]

75 9% of the variance, with features corresponding to arithmetic concepts. Table 2: Featural representation of the similarity between 16 countries. [sent-206, score-0.41]

76 The table shows the eight highest-probability features extracted by the Bayesian ADCLUS model. [sent-207, score-0.289]

77 The average weight associated with the additive constant is 0.311. [sent-209, score-0.227]

78 Table 3: Featural representation of the perceptual similarity between 26 capital letters. [sent-228, score-0.271]

79 The table shows the ten highest-probability features extracted by the Bayesian ADCLUS model. [sent-229, score-0.252]

80 The average weight associated with the additive constant is 0. [sent-231, score-0.227]

81 Although the posterior probability is spread over a large number of feature matrices, 92. [sent-258, score-0.329]

82 The modal number of represented features was m̂1 = 11, with 27. [sent-260, score-0.28]

83 The posterior distribution over the number of features is shown in Figure 4(a). [sent-262, score-0.347]

84 Since none of the existing literature has used the “approximate expectation” approach to find highly probable features, it is useful to note the strong similarities between Table 1(a) and Table 1(b), which reports the ten highest-probability features across the entire posterior distribution. [sent-263, score-0.546]

85 Applying this approach to obtain an estimate of the posterior predictive similarities Ŝ4 revealed that this matrix accounts for 97. [sent-264, score-0.365]

86 A second application is to human forced-choice judgments of the similarities between 16 countries [18]. [sent-267, score-0.319]

87 In this task, participants were shown lists of four countries and asked to pick out the two countries most similar to each other. [sent-268, score-0.203]

88 ... reveals that only eight features appear in the representation more than 25% of the time. [sent-270, score-0.25]

89 Given this, it is not surprising that the posterior distribution over the number of features, shown in Figure 4 (b), indicates that the modal number of features is eight. [sent-271, score-0.422]

90 The eight most probable features are listed in Table 2. [sent-272, score-0.255]

91 As a third example, we analyzed a somewhat larger data set, consisting of kindergarten children’s assessment of the perceptual similarity of the 26 capital letters [19]. [sent-278, score-0.234]

92 The posterior distribution over the number of represented features is shown in Figure 4(c). [sent-282, score-0.376]

93 7 Discussion Learning how similarity relations are represented is a difficult modeling problem. [sent-286, score-0.226]

94 Additive clustering provides a framework for learning featural representations of stimulus similarity, but remains underused due to the difficulties associated with the inference. [sent-287, score-0.425]

95 By adopting a Bayesian approach to additive clustering, we are able to obtain a richer characterization of the structure behind human similarity judgments. [sent-288, score-0.395]

96 Moreover, by using nonparametric Bayesian techniques to place a prior distribution over infinite binary feature matrices via the Indian Buffet Process, we can allow the data to determine the number of features that the algorithm recovers. [sent-289, score-0.501]

97 As noted by [16], people are capable of recognizing that individual stimuli possess an arbitrarily large number of characteristics, but in any particular context will make judgments using only a finite, usually small number of properties that form part of our current mental representation. [sent-291, score-0.267]

98 Infinite latent feature models and the Indian buffet process. [sent-333, score-0.285]

99 Extending the ALCOVE model of category learning to featural stimulus domains. [sent-382, score-0.276]

100 A measure of stimulus similarity and errors in some paired-associate learning tasks. [sent-425, score-0.273]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adclus', 0.375), ('fik', 0.275), ('wk', 0.234), ('featural', 0.2), ('additive', 0.198), ('similarity', 0.197), ('features', 0.176), ('posterior', 0.171), ('estimators', 0.163), ('buffet', 0.158), ('ibp', 0.158), ('similarities', 0.128), ('saliency', 0.127), ('feature', 0.127), ('saliencies', 0.125), ('sij', 0.122), ('indian', 0.119), ('sampler', 0.113), ('judgments', 0.104), ('possessed', 0.1), ('metropolis', 0.099), ('clustering', 0.098), ('fk', 0.096), ('chains', 0.093), ('bayesian', 0.089), ('ik', 0.087), ('countries', 0.087), ('assignments', 0.084), ('matrices', 0.082), ('stimuli', 0.082), ('nonparametric', 0.077), ('stimulus', 0.076), ('iraq', 0.075), ('libya', 0.075), ('modal', 0.075), ('prob', 0.075), ('zimbabwe', 0.075), ('customer', 0.074), ('kth', 0.066), ('adelaide', 0.065), ('dishes', 0.065), ('nigeria', 0.065), ('psychology', 0.065), ('sn', 0.063), ('accounted', 0.061), ('manifest', 0.059), ('customers', 0.055), ('psychological', 0.052), ('successive', 0.052), ('representations', 0.051), ('map', 0.051), ('diners', 0.05), ('fjk', 0.05), ('indonesia', 0.05), ('philippines', 0.05), ('conditional', 0.046), ('mental', 0.046), ('arg', 0.045), ('dish', 0.043), ('pragmatically', 0.043), ('shepard', 0.043), ('probable', 0.042), ('table', 0.04), ('navarro', 0.04), ('fwf', 0.04), ('lag', 0.04), ('rosenbluth', 0.04), ('prior', 0.039), ('gibbs', 0.039), ('st', 0.038), ('poisson', 0.038), ('representation', 0.037), ('capital', 0.037), ('hn', 0.037), ('eight', 0.037), ('variance', 0.037), ('extracted', 0.036), ('rk', 0.036), ('dw', 0.035), ('nth', 0.035), ('possess', 0.035), ('gamma', 0.034), ('accounts', 0.033), ('matrix', 0.033), ('resampling', 0.033), ('fm', 0.032), ('infer', 0.032), ('probability', 0.031), ('ij', 0.031), ('japan', 0.03), ('china', 0.03), ('ith', 0.03), ('literature', 0.029), ('mt', 0.029), ('four', 0.029), ('represented', 0.029), ('carlo', 0.029), ('monte', 0.029), ('weight', 0.029), ('cognitive', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

2 0.20680402 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

3 0.13426118 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data.

4 0.13225195 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

5 0.1257474 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf

Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1

6 0.11161844 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

7 0.10611068 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

8 0.10546398 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

9 0.10460953 86 nips-2006-Graph-Based Visual Saliency

10 0.092033297 121 nips-2006-Learning to be Bayesian without Supervision

11 0.087783135 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

12 0.08172702 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

13 0.078658663 192 nips-2006-Theory and Dynamics of Perceptual Bistability

14 0.078128189 130 nips-2006-Max-margin classification of incomplete data

15 0.074357815 115 nips-2006-Learning annotated hierarchies from relational data

16 0.072410613 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

17 0.072404034 41 nips-2006-Bayesian Ensemble Learning

18 0.071907833 53 nips-2006-Combining causal and similarity-based reasoning

19 0.069259785 57 nips-2006-Conditional mean field

20 0.061767373 80 nips-2006-Fundamental Limitations of Spectral Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.23), (1, 0.027), (2, 0.092), (3, -0.07), (4, 0.059), (5, -0.03), (6, 0.125), (7, -0.045), (8, -0.006), (9, -0.073), (10, 0.036), (11, 0.209), (12, 0.037), (13, -0.061), (14, -0.117), (15, 0.011), (16, 0.063), (17, 0.046), (18, -0.152), (19, -0.02), (20, 0.171), (21, -0.033), (22, 0.101), (23, -0.007), (24, -0.008), (25, 0.096), (26, 0.018), (27, 0.048), (28, -0.121), (29, 0.123), (30, 0.103), (31, -0.032), (32, -0.089), (33, 0.126), (34, 0.005), (35, 0.028), (36, -0.078), (37, -0.006), (38, -0.218), (39, 0.002), (40, 0.107), (41, 0.001), (42, 0.107), (43, 0.021), (44, -0.004), (45, 0.034), (46, -0.093), (47, -0.043), (48, -0.042), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94841212 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

2 0.73711312 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

3 0.68947816 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1 Distributed representations for dyadic data One of the major goals of probabilistic unsupervised learning is to discover underlying or hidden structure in a dataset by using latent variables to describe a complex data generation process. In this paper we focus on dyadic data: our domains have two finite sets of objects/entities and observations are made on dyads (pairs with one element from each set). Examples include sparse matrices of movie-viewer ratings, word-document counts or product-customer purchases. A simple way to capture structure in this kind of data is to do “bi-clustering” (possibly using mixture models) by grouping the rows and (independently or simultaneously) the columns[6, 13, 9]. The modelling assumption in such a case is that movies come in à types and viewers in Ä types and that knowing the type of movie and type of viewer is sufficient to predict the response. Clustering or mixture models are quite restrictive – their major disadvantage is that they do not admit a componential or distributed representation because items cannot simultaneously belong to several classes. (A movie, for example, might be explained as coming from a cluster of “dramas” or “comedies”; a viewer as a “single male” or as a “young mother”.) We might instead prefer a model (e.g. [10, 5]) in which objects can be assigned to multiple latent clusters: a movie might be a drama and have won an Oscar and have subtitles; a viewer might be single and female and a university graduate. Inference in such models falls under the broad area of factorial learning (e.g. [7, 1, 3, 12]), in which multiple interacting latent causes explain each observed datum. In this paper, we assume that both data items (rows) and attributes (columns) have this kind of componential structure: each item (row) has associated with it an unobserved vector of à binary features; similarly each attribute (column) has a hidden vector of Ä binary features. Knowing the features of the item and the features of the attribute are sufficient to generate (before noise) the response at that location in the matrix. In effect, we are factorizing a real-valued data (response) , where and are binary feature matrix into (a distribution defined by) the product is a real-valued weight matrix. Below, we develop this binary matrix factorization matrices, and Ï ÍÏÎ Í Î , ÛÓ « K L ÛÐ Ü Ù I Ð ÚÐ =f Ï Í Î J (A) (B) Figure 1: (A) The graphical model representation of the linear-Gaussian BMF model. The concentration parameter and Beta weights for the columns of are represented by the symbols and Ð . (B) BMF shown pictorally. (BMF) model using Bayesian non-parametric priors over the number and values of the unobserved binary features and the unknown weights. 
2 BMF model description
Binary matrix factorization is a model of an I × J dyadic data matrix X with exchangeable rows and columns. The entries of X can be real-valued, binary, or categorical; BMF models suitable for each type are described below. Associated with each row i is a latent binary feature vector u_i; similarly each column j has an unobserved binary vector v_j. The primary parameters are represented by a matrix W of interaction weights. X is generated by a fixed observation process f(·) applied (elementwise) to the linear inner product of the features and weights, which is the "factorization" or approximation of the data:

X ~ f(UWV'; Θ)   (1)

where Θ are extra parameters specific to the model variant. Three possible parametric forms for the noise (observation) distribution f are: Gaussian, with mean UWV' and covariance (1/σ_x) I; logistic, with mean 1/(1 + exp(−UWV')); and Poisson, with mean (and variance) UWV'. Other parametric forms are also possible. For illustrative purposes, we will use the linear-Gaussian model throughout this paper; this can be thought of as a two-sided version of the linear-Gaussian model found in [5].

To complete the description of the model, we need to specify prior distributions over the feature matrices U, V and the weights W. We adopt the same priors over binary matrices as previously described in [5]. For finite sized matrices U with I rows and K columns, we generate a bias π_k independently for each column k using a Beta prior (denoted B(α/K, β)) and then, conditioned on this bias, generate the entries u_ik in column k independently from a Bernoulli with mean π_k:

P(U | α, β) = ∏_{k=1..K} ∫ [ ∏_{i=1..I} π_k^{u_ik} (1 − π_k)^{1 − u_ik} ] B(π_k; α/K, β) dπ_k,

where n_k = Σ_i u_ik is the number of active entries in column k. The hyperprior on the concentration α is a Gamma distribution (denoted G), whose shape and scale hyperparameters control the expected fraction of zeros/ones in the matrix. The biases are easily integrated out, which creates dependencies between the rows, although they remain exchangeable. The resulting prior depends only on the number n_k of active features in each column. An identical prior is used on V, with J rows and L columns, but with its own concentration parameter. The variable β was set to 1 for all experiments.

The appropriate prior distribution over weights depends on the observation distribution f(·). For the linear-Gaussian variant, a convenient prior on W is a matrix normal with prior mean W_0 and covariance (1/σ_w) I; the scale σ_w of the weights and the output precision σ_x (if needed) have Gamma hyperpriors. In certain cases, when the prior on the weights is conjugate to the output distribution model, the weights may be analytically integrated out, expressing the marginal distribution of the data P(X | U, V) only in terms of the binary features. This is true, for example, when we place a Gaussian prior on the weights and use a linear-Gaussian output process.

Remarkably, the Beta-Bernoulli prior distribution over U (and similarly V) can easily be extended to the case where K → ∞, creating a distribution over binary matrices with a fixed number I of exchangeable rows and a potentially infinite number of columns (although the expected number of columns which are not entirely zero remains finite). Such a distribution, the Indian Buffet Process (IBP), was described by [5] and is analogous to the Dirichlet process and the associated Chinese restaurant process (CRP) [11]. Fortunately, as we will see, inference with this infinite prior is not only tractable, but is also nearly as efficient as the finite version.
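As a small illustration of the finite Beta-Bernoulli prior just described, the sketch below draws a bias π_k ~ Beta(α/K, β) for each column and then fills that column with independent Bernoulli(π_k) entries. It is a didactic sketch only; the function name and the particular values of I, K, α, and β are illustrative assumptions, not taken from the paper's experiments.

# Sketch of the finite Beta-Bernoulli prior over a binary feature matrix:
# pi_k ~ Beta(alpha/K, beta) per column, then u_ik ~ Bernoulli(pi_k).
import numpy as np

def sample_binary_matrix(I, K, alpha, beta, rng):
    pi = rng.beta(alpha / K, beta, size=K)        # per-column biases
    return (rng.random((I, K)) < pi).astype(int)  # Bernoulli draws; column k uses pi[k]

rng = np.random.default_rng(1)
U = sample_binary_matrix(I=10, K=5, alpha=2.0, beta=1.0, rng=rng)
print(U.sum(axis=0))  # n_k: number of active entries in each column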
3 Inference of features and parameters
As with many other complex hierarchical Bayesian models, exact inference of the latent variables U and V in the BMF model is intractable (i.e. there is no efficient way to sample exactly from the posterior nor to compute its exact marginals). However, as with many other non-parametric Bayesian models, we can employ Markov chain Monte Carlo (MCMC) methods to create an iterative procedure which, if run for sufficiently long, will produce correct posterior samples.

3.1 Finite binary latent feature matrices
The posterior distribution of a single entry in U (or V) given all other model parameters is proportional to the product of the conditional prior and the data likelihood. The conditional prior comes from integrating out the biases in the Beta-Bernoulli model and is proportional to the number of active entries in other rows of the same column plus a term for new activations. Gibbs sampling for single entries u_ik of U (or V) can be done using the following updates:

P(u_ik = 1 | U_−ik, V, W) ∝ (α/K + n_−i,k) P(X | U_−ik, u_ik = 1, V, W)   (2)
P(u_ik = 0 | U_−ik, V, W) ∝ (β + (I − 1) − n_−i,k) P(X | U_−ik, u_ik = 0, V, W)   (3)

where n_−i,k = Σ_{h≠i} u_hk, U_−ik excludes entry u_ik, and the constant of proportionality normalizes over the two cases. (Conditioning on the data and hyperparameters is implicit.) When conditioning on W, we only need to calculate the ratio of likelihoods corresponding to row i. (Note that this is not the case when the weights are integrated out.) This ratio is a simple function of the model's predictions x⁺_ij = Σ_{k,l} u_ik v_jl w_kl evaluated with u_ik = 1, and x⁻_ij, the same sum evaluated with u_ik = 0. In the linear-Gaussian case:

log [ P(u_ik = 1 | U_−ik, V, W) / P(u_ik = 0 | U_−ik, V, W) ]
  = log [ (α/K + n_−i,k) / (β + (I − 1) − n_−i,k) ] − (σ_x / 2) Σ_j [ (x_ij − x⁺_ij)² − (x_ij − x⁻_ij)² ]

In the linear-Gaussian case, we can easily derive analogous Gibbs sampling updates for the weights and hyperparameters. To simplify the presentation, we consider a "vectorized" representation of our variables. Let x be an IJ column vector taken column-wise from X, w be a KL column vector taken column-wise from W, and E be an IJ × KL binary matrix which is the Kronecker product V ⊗ U. (In "Matlab notation", x = X(:), w = W(:), and E = kron(V, U).) In this notation, the data distribution is written as x ~ N(Ew, (1/σ_x) I). Given values for U and V, samples can be drawn for w, σ_x, and σ_w using standard conjugate posterior distributions (conditioning on the observed data is implicit): w is drawn from a Gaussian with precision matrix σ_x E'E + σ_w I and mean (σ_x E'E + σ_w I)⁻¹ (σ_x E'x + σ_w w_0); σ_w is drawn from a Gamma whose shape parameter is incremented by KL/2 and whose rate parameter is incremented by (w − w_0)'(w − w_0)/2; and σ_x is drawn from a Gamma whose shape parameter is incremented by IJ/2 and whose rate parameter is incremented by (x − Ew)'(x − Ew)/2. Note that we do not have to explicitly compute the matrix E. For computing the posterior of the linear-Gaussian weights, the matrix E'E can be computed as kron(V'V, U'U). Similarly, the expression E'x is constructed by computing U'XV and taking the elements column-wise.
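The single-entry Gibbs update of equations (2)-(3) can be sketched directly from the log-odds expression above. The following is an illustrative re-implementation for the linear-Gaussian model, not the authors' code; tau_x stands for the output precision σ_x, and details such as the stabilized sigmoid are implementation choices of this sketch.

# Sketch of the Gibbs update for a single entry u_ik in the linear-Gaussian model,
# combining the conditional prior of eqs. (2)-(3) with the likelihood log-odds.
import numpy as np

def gibbs_update_u(X, U, V, W, i, k, alpha, beta, tau_x, rng):
    I, K = U.shape
    n_minus = U[:, k].sum() - U[i, k]      # active entries in column k, excluding row i
    u_row = U[i].copy()
    u_row[k] = 1
    x_plus = u_row @ W @ V.T               # row-i predictions when u_ik = 1
    u_row[k] = 0
    x_minus = u_row @ W @ V.T              # row-i predictions when u_ik = 0
    log_prior = np.log(alpha / K + n_minus) - np.log(beta + (I - 1) - n_minus)
    log_lik = -0.5 * tau_x * np.sum((X[i] - x_plus) ** 2 - (X[i] - x_minus) ** 2)
    logit = log_prior + log_lik
    p_one = np.exp(-np.logaddexp(0.0, -logit))   # numerically stable sigmoid
    U[i, k] = int(rng.random() < p_one)
    return U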
3.2 Infinite binary latent feature matrices
One of the most elegant aspects of non-parametric Bayesian modeling is the ability to use a prior which allows a countably infinite number of latent features. The number of instantiated features is automatically adjusted during inference and depends on the amount of data and how many features it supports. Remarkably, we can do MCMC sampling using such infinite priors with essentially no computational penalty over the finite case. To derive these updates (e.g. for row i of the matrix U), it is useful to consider partitioning the columns of U into two sets, as shown below. Let set A have at least one non-zero entry in rows other than i. Let set B be all other columns, including the set of columns where the only non-zero entries are found in row i and the countably infinite number of all-zero columns.

      set A     |  set B
   0 1 0 0 1    |  0 0 0 0 0 ...
   0 0 1 0 0    |  0 0 0 0 0 ...
   1 1 0 0 1    |  0 0 0 0 0 ...
   1 0 0 1 1    |  0 0 0 0 0 ...
   1 1 0 0 1    |  0 1 0 1 0 ...   <- row i
   0 1 0 0 0    |  0 0 0 0 0 ...
   0 0 0 1 0    |  0 0 0 0 0 ...
   1 0 0 0 1    |  0 0 0 0 0 ...

Sampling values for elements in row i of set A given everything else is straightforward, and involves Gibbs updates almost identical to those in the finite case handled by equations (2) and (3); as K → ∞, for columns in set A we get:

P(u_ik = 1 | U_−ik, V, W) ∝ n_−i,k P(X | U_−ik, u_ik = 1, V, W)   (4)
P(u_ik = 0 | U_−ik, V, W) ∝ (β + I − 1 − n_−i,k) P(X | U_−ik, u_ik = 0, V, W)   (5)

When sampling new values for set B, the columns are exchangeable, and so we are really only interested in the number of entries n_i in set B which will be turned on in row i. Sampling the number of entries set to 1 can be done with Metropolis-Hastings updates. Let J(n_i → n_i') = Poisson(n_i' | α/(β + I − 1)) be the proposal distribution for a move which replaces the current n_i active entries with n_i' active entries in set B. The reverse proposal is J(n_i' → n_i). The acceptance probability is min(1, r_{n_i → n_i'}), where r_{n_i → n_i'} is

r_{n_i → n_i'} = [ P(n_i' | X) J(n_i' → n_i) ] / [ P(n_i | X) J(n_i → n_i') ]
             = [ P(X | n_i') Poisson(n_i' | α/(β + I − 1)) J(n_i' → n_i) ] / [ P(X | n_i) Poisson(n_i | α/(β + I − 1)) J(n_i → n_i') ]
             = P(X | n_i') / P(X | n_i)   (6)

This assumes a conjugate situation in which the weights W are explicitly integrated out of the model to compute the marginal likelihood P(X | n_i). In the non-conjugate case, a more complicated proposal is required. Instead of proposing n_i', we jointly propose n_i' and associated feature parameters w' from their prior distributions. In the linear-Gaussian model, where w' is a set of weights for features in set B, the proposal distribution is:

J(n_i', w' | n_i, w) = Poisson(n_i' | α/(β + I − 1)) Normal(w' | n_i')   (7)

We actually need to sample only the finite portion of w' where u_ik = 1. As in the conjugate case, the acceptance ratio reduces to the ratio of data likelihoods:

r_{(n_i, w) → (n_i', w')} = P(X | n_i', w') / P(X | n_i, w)   (8)
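The non-conjugate move of equations (7)-(8) can be sketched as follows: the count of new set-B features for row i and their weights are proposed from their priors, so the acceptance ratio is just a ratio of data likelihoods. This is a simplified illustration, not the authors' implementation: log_lik is assumed to be any user-supplied function returning log P(X | U, V, W), and sigma_w is treated here as a standard deviation for the new weights.

# Sketch of the non-conjugate MH move over features active only in row i ("set B"):
# propose n_new ~ Poisson(alpha / (beta + I - 1)) plus fresh weights from their prior,
# and accept with the data-likelihood ratio (cf. eq. (8)).
import numpy as np

def resample_new_features(X, U, V, W, i, alpha, beta, sigma_w, log_lik, rng):
    I, K = U.shape
    L = W.shape[1]
    # Currently instantiated columns that are active only in row i.
    only_i = (U.sum(axis=0) == U[i, :]) & (U[i, :] == 1)
    keep = ~only_i
    n_keep = int(keep.sum())
    n_new = rng.poisson(alpha / (beta + I - 1))            # proposal equals the prior count
    U_prop = np.hstack([U[:, keep], np.zeros((I, n_new), dtype=int)])
    U_prop[i, n_keep:] = 1                                 # new columns active only in row i
    W_prop = np.vstack([W[keep], rng.normal(0.0, sigma_w, size=(n_new, L))])
    log_r = log_lik(X, U_prop, V, W_prop) - log_lik(X, U, V, W)
    if np.log(rng.random()) < log_r:                       # accept with prob min(1, exp(log_r))
        return U_prop, W_prop
    return U, W

In the conjugate case one would instead compare marginal likelihoods with the weights integrated out, as in equation (6).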
3.3 Faster mixing transition proposals
The Gibbs updates described above for the entries of U, V, and W are the simplest moves we could make in a Markov chain Monte Carlo inference procedure for the BMF model. However, these limited local updates may result in extremely slow mixing. In practice, we often implement larger moves in indicator space using, for example, Metropolis-Hastings proposals on multiple features for row i simultaneously. For example, we can propose new values for several columns in row i of matrix U by sampling feature values independently from their conditional priors. To compute the reverse proposal, we imagine forgetting the current configuration of those features for row i and compute the probability under the conditional prior of proposing the current configuration. The acceptance probability of such a proposal is (the minimum of unity and) the ratio of likelihoods between the new proposed configuration and the current configuration.

Split-merge moves may also be useful for efficiently sampling from the posterior distribution of the binary feature matrices. Jain and Neal [8] describe split-merge algorithms for Dirichlet process mixture models with non-conjugate component distributions. We have developed and implemented similar split-merge proposals for binary matrices with IBP priors. Due to space limitations, we present here only a sketch of the procedure. Two nonzero entries in U are selected uniformly at random. If they are in the same column, we propose splitting that column; if they are in different columns, we propose merging their columns. The key difference between this algorithm and the Jain and Neal algorithm is that the binary features are not constrained to sum to unity in each row. Our split-merge algorithm also performs restricted Gibbs scans on columns of U to increase acceptance probability.

3.4 Predictions
A major reason for building generative models of data is to be able to impute missing data values given some observations. In the linear-Gaussian model, the predictive distribution at each iteration of the Markov chain is a Gaussian distribution. The interaction weights can be analytically integrated out at each iteration, also resulting in a Gaussian posterior, removing sampling noise contributed by having the weights explicitly represented. Computing the exact predictive distribution, however, conditional only on the model hyperparameters, is analytically intractable: it requires integrating over all binary matrices U and V, and all other nuisance parameters (e.g., the weights and precisions). Instead we integrate over these parameters implicitly by averaging predictive distributions from many MCMC iterations. This posterior, which is conditional only on the observed data and hyperparameters, is a highly complex, potentially multimodal, non-linear function of the observed variables.

By averaging predictive distributions, our algorithm implicitly integrates over U and V. In our experiments, we show samples from the posteriors of U and V to help explain what the model is doing, but we stress that the posterior may have significant mass on many possible binary matrices. The number of features and their degrees of overlap will vary over MCMC iterations. Such variation will depend, for example, on the current values of the concentration parameters (higher values will result in more features) and on the precision values (higher weight precision results in less variation in weights).
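As a minimal sketch of the prediction strategy in Section 3.4, missing entries can be imputed by averaging the linear-Gaussian predictive means UWV' over MCMC samples. The container and argument names below are assumptions of this sketch, not part of the paper.

# Sketch of missing-value imputation by averaging predictive means over MCMC samples.
# `samples` is assumed to be a list of (U, V, W) triples collected from the Markov chain;
# `mask` is True where entries of X are missing.
import numpy as np

def impute_missing(X_obs, samples, mask):
    pred_mean = np.mean([U @ W @ V.T for (U, V, W) in samples], axis=0)
    X_filled = np.array(X_obs, dtype=float, copy=True)
    X_filled[mask] = pred_mean[mask]
    return X_filled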
4 Experiments
4.1 Modified "bars" problem
A toy problem commonly used to illustrate additive feature or multiple cause models is the bars problem ([2, 12, 1]). Vertical and horizontal bars are combined in some way to generate data samples. The goal of the illustration is to show recovery of the latent structure in the form of bars. We have modified the typical usage of bars to accommodate the linear-Gaussian BMF with infinite features. The data consist of I vectors whose length is a perfect square, so that each vector can be reshaped into a square image. The generation process is as follows: since V has the same number of rows as the dimension of the images, V is fixed to be a set of vertical and horizontal bars (when reshaped into an image); U is sampled from the IBP, and the global precisions σ_x and σ_w are set to 1/2. The weights W are sampled from zero-mean Gaussians. Model estimates of U and V were initialized from an IBP prior.

In Figure 2 we demonstrate the performance of the linear-Gaussian BMF on the bars data. We train the BMF with 200 training examples of the type shown in the top row of Figure 2. Some examples have their bottom halves labeled missing and are shown in the figure with constant grey values. To handle this, we resample their values at each iteration of the Markov chain. The bottom row shows the expected reconstruction using MCMC samples of U, V, and W. Despite the relatively high noise levels in the data, the model is able to capture the complex relationships between bars and weights. The reconstruction of vertical bars is very good. The reconstruction of horizontal bars is good as well, considering that the model has no information regarding the existence of horizontal bars on the bottom half.

Figure 2: Bars reconstruction. (A) Bars randomly sampled from the complete dataset. The bottom half of these bars were removed and labeled missing during learning. (B) Noise-free versions of the same data. (C) The initial reconstruction. The missing values have been set to their expected value, 0, to highlight the missing region. (D) The average MCMC reconstruction of the entire image. (E) Based solely on the information in the top half of the original data, these are the noise-free nearest neighbours in pixel space.

Figure 3: Bars features. The top row shows the values of V and WV' used to generate the data. The second row shows a sample of V and WV' from the Markov chain. WV' can be thought of as a set of basis images which can be added together with binary coefficients (U) to create images.

By examining the features captured by the model, we can understand the performance just described. In Figure 3 we show the generating, or true, values of V and WV' along with one sample of those basis features from the Markov chain. Because the data are generated by adding together multiple basis images (shown on the right of Figure 3), multiple bars are used in each image. This is reflected in the captured features. The learned WV' are fairly similar to the generating WV', but the former are composed of overlapping bar structure (learned V).
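For concreteness, the fixed bars basis used above can be built as follows: V gets one column per vertical or horizontal bar, so that reshaping any column gives a bar image. The image size here is an arbitrary choice for this sketch, not the setting used in the paper.

# Sketch of constructing the fixed bars basis V: columns are vertical and horizontal
# bars of a D x D image, flattened to length D*D.
import numpy as np

D = 6                                         # illustrative image side length
V = np.zeros((D * D, 2 * D), dtype=int)
for b in range(D):
    vert = np.zeros((D, D), dtype=int)
    vert[:, b] = 1                            # vertical bar b
    horz = np.zeros((D, D), dtype=int)
    horz[b, :] = 1                            # horizontal bar b
    V[:, b] = vert.ravel()
    V[:, D + b] = horz.ravel()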
4.2 Digits
In Section 2 we briefly stated that BMF can be applied to data models other than the linear-Gaussian model. We demonstrate this with a logistic BMF applied to binarized images of handwritten digits. We train logistic BMF with 100 examples each of digits 1, 2, and 3 from the USPS dataset. In the first five rows of Figure 4 we again illustrate the ability of BMF to impute missing data values. The top row shows all 16 samples from the dataset which had their bottom halves labeled missing. Missing values are filled in at each iteration of the Markov chain. In the third and fourth rows we show the mean and mode (P(x_ij = 1) > 0.5) of the BMF reconstruction. In the bottom row we have shown the nearest neighbors, in pixel space, to the training examples based only on the top halves of the original digits.

In the last three rows of Figure 4 we show the features captured by the model. In row F, we show the average image of the data which have each feature in U on. It is clear that some features have distinct digit forms and others are overlapping. In row G, the basis images WV' are shown. By adjusting the features that are non-zero in each row of U, images are composed by adding basis images together. Finally, in row H we show V. These pixel features mask out different regions in pixel space, which are weighted together to create the basis images. Note that there are K features in rows F and G, and L features in row H.

Figure 4: Digits reconstruction. (A) Digits randomly sampled from the complete dataset. The bottom half of these digits were removed and labeled missing during learning. (B) The data shown to the algorithm. The top half is the original data value. (C) The mean of the reconstruction for the bottom halves. (D) The mode reconstruction of the bottom halves. (E) The nearest neighbours of the original data are shown in the bottom half, and were found based solely on the information from the top halves of the images. (F) The average of all digits for each feature. (G) The features WV' reshaped in the form of digits. By adding these features together, as selected by the binary features U, reconstructions of the digits are possible. (H) V reshaped into the form of digits. The first image represents a bias feature.

4.3 Gene expression data
Gene expression data is able to exhibit multiple and overlapping clusters simultaneously; finding models for such complex data is an interesting and active research area ([10], [13]). The plaid model [10], originally introduced for analysis of gene expression data, can be thought of as a non-Bayesian special case of our model in which the matrix W is diagonal and the number of binary features is fixed. Our goal in this experiment is merely to illustrate qualitatively the ability of BMF to find multiple clusters in gene expression data, some of which are overlapping, others non-overlapping. The data in this experiment consist of rows corresponding to genes and columns corresponding to patients; the patients suffer from one of two types of acute leukemia [4]. In Figure 5 we show the factorization produced by the final state in the Markov chain. The rows and columns of the data and its expected reconstruction are ordered such that contiguous regions in U and V are observable. Some of the many feature pairings are highlighted. The BMF clusters consist of broad, overlapping clusters, and small, non-overlapping clusters. One of the interesting possibilities of using BMF to model gene expression data would be to fix certain columns of U or V with knowledge gained from experiments or literature, and to allow the model to add new features that help explain the data in more detail.

Figure 5: Gene expression results. (A) The top-left panel is the data, sorted according to contiguous features in the final U and V in the Markov chain; the remaining panels show U, V, and W. (B) The same as (A), but showing the expected value of the data, UWV'. We have highlighted regions that have both u_ik and v_jl on. For clarity, we have only shown the (at most) two largest contiguous regions for each feature pair.
5 Conclusion
We have introduced a new model, binary matrix factorization, for unsupervised decomposition of dyadic data matrices. BMF makes use of non-parametric Bayesian methods to simultaneously discover binary distributed representations of both rows and columns of dyadic data. The model explains each row and column entity using a componential code composed of multiple binary latent features along with a set of parameters describing how the features interact to create the observed responses at each position in the matrix. BMF is based on a hierarchical Bayesian model and can be naturally extended to make use of a prior distribution which permits an infinite number of features, at very little extra computational cost. We have given MCMC algorithms for posterior inference of both the binary factors and the interaction parameters conditioned on some observed data, and demonstrated the model's ability to capture overlapping structure and model complex joint distributions on a variety of data. BMF is fundamentally different from bi-clustering algorithms because of its distributed latent representation, and from factorial models with continuous latent variables which interact linearly to produce the observations. This allows a much richer latent structure, which we believe makes BMF useful for many applications beyond the ones we outlined in this paper.

References
[1] P. Dayan and R. S. Zemel. Competition and multiple cause models. Neural Computation, 7(3), 1995.
[2] P. Foldiak. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 1990.
[3] Z. Ghahramani. Factorial learning and the EM algorithm. In NIPS, volume 7. MIT Press, 1995.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999.
[5] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18. MIT Press, 2005.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 1972.
[7] G. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS, volume 6. Morgan Kaufmann, 1994.
[8] S. Jain and R. M. Neal. Splitting and merging for a nonconjugate Dirichlet process mixture model. To appear in Bayesian Analysis.
[9] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006.
[10] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12, 2002.
[11] J. Pitman. Combinatorial stochastic processes. Lecture Notes for St. Flour Course, 2002.
[12] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 1994.
[13] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Statistics, Stanford University, 1999.

4 0.56633604 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro

Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1

5 0.51375294 53 nips-2006-Combining causal and similarity-based reasoning

Author: Charles Kemp, Patrick Shafto, Allison Berke, Joshua B. Tenenbaum

Abstract: Everyday inductive reasoning draws on many kinds of knowledge, including knowledge about relationships between properties and knowledge about relationships between objects. Previous accounts of inductive reasoning generally focus on just one kind of knowledge: models of causal reasoning often focus on relationships between properties, and models of similarity-based reasoning often focus on similarity relationships between objects. We present a Bayesian model of inductive reasoning that incorporates both kinds of knowledge, and show that it accounts well for human inferences about the properties of biological species. 1

6 0.45906246 121 nips-2006-Learning to be Bayesian without Supervision

7 0.45838308 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

8 0.44258717 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.42960164 41 nips-2006-Bayesian Ensemble Learning

10 0.42355552 192 nips-2006-Theory and Dynamics of Perceptual Bistability

11 0.41991964 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

12 0.41396233 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

13 0.40381303 86 nips-2006-Graph-Based Visual Saliency

14 0.40241575 115 nips-2006-Learning annotated hierarchies from relational data

15 0.39147729 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

16 0.36920685 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

17 0.36498713 130 nips-2006-Max-margin classification of incomplete data

18 0.34575975 174 nips-2006-Similarity by Composition

19 0.34490383 98 nips-2006-Inferring Network Structure from Co-Occurrences

20 0.34192869 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.073), (3, 0.019), (7, 0.055), (9, 0.042), (20, 0.468), (22, 0.052), (44, 0.051), (57, 0.084), (65, 0.032), (69, 0.018), (71, 0.015), (90, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90027964 23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

Author: Mark Johnson, Thomas L. Griffiths, Sharon Goldwater

Abstract: This paper introduces adaptor grammars, a class of probabilistic models of language that generalize probabilistic context-free grammars (PCFGs). Adaptor grammars augment the probabilistic rules of PCFGs with “adaptors” that can induce dependencies among successive uses. With a particular choice of adaptor, based on the Pitman-Yor process, nonparametric Bayesian models of language using Dirichlet processes and hierarchical Dirichlet processes can be written as simple grammars. We present a general-purpose inference algorithm for adaptor grammars, making it easy to define and use such models, and illustrate how several existing nonparametric Bayesian models can be expressed within this framework. 1

2 0.87316489 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1

same-paper 3 0.85383725 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

4 0.68225127 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

5 0.48671991 41 nips-2006-Bayesian Ensemble Learning

Author: Hugh A. Chipman, Edward I. George, Robert E. Mcculloch

Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy. 1

6 0.46603608 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

7 0.45988345 192 nips-2006-Theory and Dynamics of Perceptual Bistability

8 0.45069578 121 nips-2006-Learning to be Bayesian without Supervision

9 0.44945365 53 nips-2006-Combining causal and similarity-based reasoning

10 0.44838357 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

11 0.44791809 42 nips-2006-Bayesian Image Super-resolution, Continued

12 0.44546077 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

13 0.43578753 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

14 0.43021438 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

15 0.42750394 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

16 0.42414641 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

17 0.42073587 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

18 0.41535163 115 nips-2006-Learning annotated hierarchies from relational data

19 0.41510928 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

20 0.41180518 113 nips-2006-Learning Structural Equation Models for fMRI