nips nips2008 nips2008-221 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). Xi,j denotes the (i, j)-th 2 entry of X. |X|2 is the squared sum of elements in set I, i.e. (i,j)∈I Xi,j . We fill the unobserved I elements in Y with 0 for simplicity in notation Algorithm 1 BSRM/F: Gibbs sampling for SRM with features 1: Draw K ∼ IW d+p (δ + m, αI(d+p) + (F, F◦ ) (F, F◦ )); 2: For each i ∈ U0 , draw f i ∼ Nd (K(i) (s−2 G Y i,· + K−1 K12 K−1 f ◦ ), K(i) ), 11·2 22 i −1 where K(i) = s−2 (GIi ,· ) GIi ,· + K−1 ; 11·2 3: Draw R ∼ IW d+q (δ + n, βI(d+q) + (G, G◦ ) (G, G◦ )); 4: For each j ∈ V0 , draw gj ∼ Nd (R(j) (s−2 F Y ·,j + R−1 R12 R−1 g◦ ), R(j) ), 11·2 22 j −1 where R(j) = s−2 (FIj ,· ) FIj ,· + R−1 ; 11·2 5: Draw s2 ∼ χ−2 (ν + r, σ 2 + |Y − FG |2 ). I Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features 1: ∆i,j ← Yi,j − k Fi,k Gj,k , for (i, j) ∈ I; 2: Draw Φ ∼ Wd+p (δ + m + d + p − 1, (αI(d+p) + (F, F◦ ) (F, F◦ ))−1 ); 3: for each (i, k) ∈ U0 × {1, · · · , d} do 4: Draw f ∼ N1 (φ−1 (s−2 ∆i,Ii GIi ,k − Fi,· Φ·,k ), φ−1 ), where φ = s−2 (GIi ,k ) GIi ,k + Φk,k ; 5: Update Fi,k ← Fi,k + f , and ∆i,j ← ∆i,j − f Gj,k , for j ∈ Ii ; 6: end for 7: Draw Ψ ∼ Wd+q (δ + n + d + q − 1, (βI(d+q) + (G, G◦ ) (G, G◦ ))−1 ); 8: for each (j, k) ∈ V0 × {1, · · · , d} do 9: Draw g ∼ N1 (ψ −1 (s−2 ∆Ij ,j FIj ,k −Gj,· Ψ·,k ), ψ −1 ), where ψ = s−2 (FIj ,k ) FIj ,k +Ψk,k ; 10: Update Gj,k ← Gj,k + g and ∆i,j ← ∆i,j − gFi,k , for i ∈ Ij ; 11: end for 12: Draw s2 ∼ χ−2 (ν + r, σ 2 + |∆|2 ). I 8
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. [sent-3, score-0.421]
2 The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. [sent-4, score-0.379]
3 Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. [sent-5, score-0.26]
4 In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. [sent-6, score-0.198]
5 Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. [sent-7, score-0.227]
6 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. [sent-8, score-0.731]
7 They model dyadic data by a multiplicative interaction of two Gaussian process priors. [sent-9, score-0.179]
8 A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. [sent-11, score-0.036]
9 A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. [sent-12, score-0.126]
10 GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. [sent-13, score-0.244]
11 A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. [sent-14, score-0.141]
12 In a similar way, this entity set is associated with another kernel function Ω. [sent-15, score-0.145]
13 For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. [sent-16, score-0.194]
14 Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. [sent-17, score-0.322]
15 Then an SRM aims to model y(u, v) by the following generative process, Model 1. [sent-18, score-0.03]
16 Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. [sent-20, score-0.036]
17 , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). [sent-24, score-0.444]
18 For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). [sent-27, score-0.23]
19 d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. [sent-28, score-0.071]
20 In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). [sent-33, score-0.317]
21 In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. [sent-34, score-0.09]
22 The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. [sent-35, score-0.227]
23 The advantage of this new notation is demonstrated by the following theorem [2]. [sent-37, score-0.04]
24 Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). [sent-39, score-0.048]
25 The new formulation of inverted Wishart is consistent under marginalization. [sent-40, score-0.06]
26 In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i. [sent-42, score-0.109]
27 , any of their sub kernel matrices is an identity matrix. [sent-44, score-0.084]
28 Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. [sent-45, score-0.283]
29 Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . [sent-48, score-0.3]
30 Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. [sent-49, score-0.067]
31 There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. [sent-50, score-0.092]
32 Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. [sent-51, score-0.247]
33 In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. [sent-52, score-0.078]
34 By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. [sent-53, score-0.096]
35 In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. [sent-54, score-0.029]
36 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. [sent-57, score-0.069]
37 Our contribution is in how to make the MCMC inference more efficient for large scale data. [sent-59, score-0.029]
38 Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. [sent-65, score-0.058]
39 Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. [sent-72, score-0.518]
40 Then the generative model is written as Model 2 and depicted in Figure 1. [sent-73, score-0.03]
41 The generative model of a matrix-variate SRM: Σ 1. [sent-75, score-0.03]
42 s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. [sent-80, score-0.058]
43 As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . [sent-82, score-0.096]
44 2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). [sent-85, score-0.252]
45 K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. [sent-89, score-0.09]
46 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. [sent-92, score-0.554]
47 This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). [sent-94, score-0.04]
48 The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. [sent-97, score-0.03]
49 s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. [sent-102, score-0.381]
50 The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). [sent-105, score-0.096]
51 We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . [sent-106, score-0.129]
52 Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . [sent-107, score-0.121]
53 3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. [sent-109, score-0.151]
54 Specifically, the prior information is encoded as the prior covariance parameters, i. [sent-110, score-0.06]
55 In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . [sent-113, score-0.154]
56 This prior knowledge can be obtained from some additional features of entities. [sent-117, score-0.036]
57 Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . [sent-118, score-0.301]
58 Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. [sent-137, score-0.029]
59 Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. [sent-141, score-0.161]
60 4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c. [sent-143, score-0.232]
61 However, blocking Gibbs sampling is not necessary to be computationally efficient. [sent-146, score-0.136]
62 To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. [sent-147, score-0.146]
63 The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. [sent-149, score-0.096]
64 We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. [sent-150, score-0.26]
65 Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. [sent-151, score-0.096]
66 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. [sent-154, score-0.382]
67 In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e. [sent-158, score-0.727]
68 A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e. [sent-161, score-0.158]
69 The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. [sent-164, score-0.416]
70 This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. [sent-166, score-0.245]
71 The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. [sent-167, score-0.535]
72 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. [sent-168, score-0.302]
73 We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0. [sent-169, score-0.09]
74 The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). [sent-174, score-0.198]
75 BSRM outperforms the variational approach of SRMs and fMMMF. [sent-175, score-0.067]
76 915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. [sent-191, score-0.069]
77 The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i. [sent-195, score-0.149]
78 We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. [sent-199, score-0.149]
79 Because of the data size, we cannot run the SRM-VB of [15]. [sent-202, score-0.029]
80 The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. [sent-204, score-0.213]
81 The 10 top factors of both the user side and the movie side are used as features, i. [sent-205, score-0.166]
82 The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. [sent-209, score-0.071]
83 8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. [sent-233, score-0.069]
84 RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1. [sent-236, score-0.125]
85 Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix. [sent-250, score-0.211]
86 The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. [sent-252, score-0.212]
87 In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. [sent-254, score-0.149]
88 The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. [sent-255, score-0.257]
89 The top 30 user-side factors are used as features, none of movie-side factors are used, i. [sent-256, score-0.084]
90 The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. [sent-259, score-0.071]
91 The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. [sent-263, score-0.085]
92 8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. [sent-272, score-0.069]
93 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. [sent-273, score-0.526]
94 We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. [sent-274, score-0.044]
95 The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. [sent-275, score-0.15]
96 In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. [sent-276, score-0.033]
97 Multiplicative latent factor models for description and prediction of social networks. [sent-310, score-0.093]
98 Learning systems of concepts with an infinite relational model. [sent-327, score-0.247]
99 Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. [sent-367, score-0.219]
100 A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). [sent-398, score-0.309]
wordName wordTfidf (topN-words)
[('iw', 0.479), ('srms', 0.348), ('relational', 0.247), ('bsrm', 0.199), ('srm', 0.196), ('draw', 0.178), ('zo', 0.174), ('rmse', 0.159), ('entities', 0.151), ('ratings', 0.149), ('dyadic', 0.141), ('variate', 0.14), ('gii', 0.124), ('mae', 0.124), ('fg', 0.109), ('fij', 0.109), ('entity', 0.109), ('factorization', 0.105), ('fmmmf', 0.1), ('sampling', 0.096), ('wishart', 0.093), ('matrix', 0.09), ('gibbs', 0.083), ('gj', 0.081), ('eachmovie', 0.075), ('ix', 0.071), ('hyper', 0.071), ('bayesian', 0.07), ('collaborative', 0.068), ('variational', 0.067), ('net', 0.067), ('gk', 0.064), ('users', 0.063), ('movie', 0.062), ('user', 0.062), ('covariance', 0.06), ('latent', 0.06), ('inverted', 0.06), ('gp', 0.059), ('yu', 0.055), ('uj', 0.055), ('vb', 0.053), ('fk', 0.052), ('dqn', 0.05), ('rmses', 0.05), ('unblocking', 0.05), ('mcmc', 0.049), ('sub', 0.048), ('pr', 0.046), ('monte', 0.044), ('rated', 0.044), ('scalability', 0.044), ('factors', 0.042), ('stochastic', 0.041), ('blocking', 0.04), ('dpm', 0.04), ('variates', 0.04), ('theorem', 0.04), ('priors', 0.04), ('root', 0.039), ('ui', 0.038), ('multiplicative', 0.038), ('carlo', 0.037), ('features', 0.036), ('kernel', 0.036), ('squared', 0.036), ('ciency', 0.035), ('dirac', 0.034), ('mean', 0.033), ('models', 0.033), ('rating', 0.032), ('tresp', 0.032), ('nite', 0.032), ('subscripts', 0.031), ('row', 0.03), ('indexed', 0.03), ('column', 0.03), ('generative', 0.03), ('wd', 0.03), ('ij', 0.029), ('run', 0.029), ('size', 0.029), ('normal', 0.029), ('inference', 0.029), ('ltering', 0.028), ('dr', 0.028), ('yi', 0.027), ('zhu', 0.027), ('informative', 0.027), ('tens', 0.026), ('accuracy', 0.026), ('simplicity', 0.026), ('fully', 0.025), ('ends', 0.025), ('iteration', 0.025), ('vt', 0.025), ('gaussian', 0.025), ('indicates', 0.025), ('chain', 0.024), ('indices', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). Xi,j denotes the (i, j)-th 2 entry of X. |X|2 is the squared sum of elements in set I, i.e. (i,j)∈I Xi,j . We fill the unobserved I elements in Y with 0 for simplicity in notation Algorithm 1 BSRM/F: Gibbs sampling for SRM with features 1: Draw K ∼ IW d+p (δ + m, αI(d+p) + (F, F◦ ) (F, F◦ )); 2: For each i ∈ U0 , draw f i ∼ Nd (K(i) (s−2 G Y i,· + K−1 K12 K−1 f ◦ ), K(i) ), 11·2 22 i −1 where K(i) = s−2 (GIi ,· ) GIi ,· + K−1 ; 11·2 3: Draw R ∼ IW d+q (δ + n, βI(d+q) + (G, G◦ ) (G, G◦ )); 4: For each j ∈ V0 , draw gj ∼ Nd (R(j) (s−2 F Y ·,j + R−1 R12 R−1 g◦ ), R(j) ), 11·2 22 j −1 where R(j) = s−2 (FIj ,· ) FIj ,· + R−1 ; 11·2 5: Draw s2 ∼ χ−2 (ν + r, σ 2 + |Y − FG |2 ). I Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features 1: ∆i,j ← Yi,j − k Fi,k Gj,k , for (i, j) ∈ I; 2: Draw Φ ∼ Wd+p (δ + m + d + p − 1, (αI(d+p) + (F, F◦ ) (F, F◦ ))−1 ); 3: for each (i, k) ∈ U0 × {1, · · · , d} do 4: Draw f ∼ N1 (φ−1 (s−2 ∆i,Ii GIi ,k − Fi,· Φ·,k ), φ−1 ), where φ = s−2 (GIi ,k ) GIi ,k + Φk,k ; 5: Update Fi,k ← Fi,k + f , and ∆i,j ← ∆i,j − f Gj,k , for j ∈ Ii ; 6: end for 7: Draw Ψ ∼ Wd+q (δ + n + d + q − 1, (βI(d+q) + (G, G◦ ) (G, G◦ ))−1 ); 8: for each (j, k) ∈ V0 × {1, · · · , d} do 9: Draw g ∼ N1 (ψ −1 (s−2 ∆Ij ,j FIj ,k −Gj,· Ψ·,k ), ψ −1 ), where ψ = s−2 (FIj ,k ) FIj ,k +Ψk,k ; 10: Update Gj,k ← Gj,k + g and ∆i,j ← ∆i,j − gFi,k , for i ∈ Ij ; 11: end for 12: Draw s2 ∼ χ−2 (ν + r, σ 2 + |∆|2 ). I 8
2 0.088568136 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence
Abstract: Identification and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many fields, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1
3 0.078983463 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
4 0.078232884 134 nips-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: In many settings, such as protein interactions and gene regulatory networks, collections of author-recipient email, and social networks, the data consist of pairwise measurements, e.g., presence or absence of links between pairs of objects. Analyzing such data with probabilistic models requires non-standard assumptions, since the usual independence or exchangeability assumptions no longer hold. In this paper, we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) with a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodel with applications to social networks and protein interaction networks. 1
5 0.077538028 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
Author: Neil D. Lawrence, Magnus Rattray, Michalis K. Titsias
Abstract: Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation. 1
6 0.074871406 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
7 0.068340339 31 nips-2008-Bayesian Exponential Family PCA
8 0.060792629 249 nips-2008-Variational Mixture of Gaussian Process Experts
9 0.058473628 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
10 0.057920959 216 nips-2008-Sparse probabilistic projections
11 0.057908151 32 nips-2008-Bayesian Kernel Shaping for Learning Control
12 0.05764275 236 nips-2008-The Mondrian Process
13 0.057471141 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
14 0.057470303 235 nips-2008-The Infinite Hierarchical Factor Regression Model
15 0.056358609 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
16 0.055693839 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
17 0.053935651 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
18 0.052996021 138 nips-2008-Modeling human function learning with Gaussian processes
19 0.050139498 62 nips-2008-Differentiable Sparse Coding
20 0.046837877 248 nips-2008-Using matrices to model symbolic relationship
topicId topicWeight
[(0, -0.163), (1, -0.01), (2, 0.036), (3, 0.03), (4, 0.045), (5, -0.055), (6, 0.026), (7, 0.155), (8, -0.05), (9, 0.011), (10, 0.023), (11, -0.0), (12, 0.093), (13, 0.008), (14, 0.065), (15, -0.005), (16, -0.016), (17, 0.008), (18, 0.062), (19, 0.007), (20, -0.04), (21, -0.044), (22, 0.067), (23, 0.008), (24, -0.031), (25, 0.033), (26, -0.048), (27, 0.009), (28, -0.037), (29, 0.056), (30, -0.035), (31, -0.017), (32, -0.024), (33, -0.03), (34, 0.041), (35, 0.073), (36, 0.067), (37, 0.052), (38, 0.012), (39, -0.032), (40, -0.036), (41, 0.043), (42, -0.015), (43, -0.101), (44, -0.116), (45, 0.189), (46, 0.069), (47, 0.083), (48, -0.005), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.90846866 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). Xi,j denotes the (i, j)-th 2 entry of X. |X|2 is the squared sum of elements in set I, i.e. (i,j)∈I Xi,j . We fill the unobserved I elements in Y with 0 for simplicity in notation Algorithm 1 BSRM/F: Gibbs sampling for SRM with features 1: Draw K ∼ IW d+p (δ + m, αI(d+p) + (F, F◦ ) (F, F◦ )); 2: For each i ∈ U0 , draw f i ∼ Nd (K(i) (s−2 G Y i,· + K−1 K12 K−1 f ◦ ), K(i) ), 11·2 22 i −1 where K(i) = s−2 (GIi ,· ) GIi ,· + K−1 ; 11·2 3: Draw R ∼ IW d+q (δ + n, βI(d+q) + (G, G◦ ) (G, G◦ )); 4: For each j ∈ V0 , draw gj ∼ Nd (R(j) (s−2 F Y ·,j + R−1 R12 R−1 g◦ ), R(j) ), 11·2 22 j −1 where R(j) = s−2 (FIj ,· ) FIj ,· + R−1 ; 11·2 5: Draw s2 ∼ χ−2 (ν + r, σ 2 + |Y − FG |2 ). I Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features 1: ∆i,j ← Yi,j − k Fi,k Gj,k , for (i, j) ∈ I; 2: Draw Φ ∼ Wd+p (δ + m + d + p − 1, (αI(d+p) + (F, F◦ ) (F, F◦ ))−1 ); 3: for each (i, k) ∈ U0 × {1, · · · , d} do 4: Draw f ∼ N1 (φ−1 (s−2 ∆i,Ii GIi ,k − Fi,· Φ·,k ), φ−1 ), where φ = s−2 (GIi ,k ) GIi ,k + Φk,k ; 5: Update Fi,k ← Fi,k + f , and ∆i,j ← ∆i,j − f Gj,k , for j ∈ Ii ; 6: end for 7: Draw Ψ ∼ Wd+q (δ + n + d + q − 1, (βI(d+q) + (G, G◦ ) (G, G◦ ))−1 ); 8: for each (j, k) ∈ V0 × {1, · · · , d} do 9: Draw g ∼ N1 (ψ −1 (s−2 ∆Ij ,j FIj ,k −Gj,· Ψ·,k ), ψ −1 ), where ψ = s−2 (FIj ,k ) FIj ,k +Ψk,k ; 10: Update Gj,k ← Gj,k + g and ∆i,j ← ∆i,j − gFi,k , for i ∈ Ij ; 11: end for 12: Draw s2 ∼ χ−2 (ν + r, σ 2 + |∆|2 ). I 8
2 0.70211524 134 nips-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: In many settings, such as protein interactions and gene regulatory networks, collections of author-recipient email, and social networks, the data consist of pairwise measurements, e.g., presence or absence of links between pairs of objects. Analyzing such data with probabilistic models requires non-standard assumptions, since the usual independence or exchangeability assumptions no longer hold. In this paper, we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) with a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodel with applications to social networks and protein interaction networks. 1
3 0.69934732 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
4 0.63839531 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
Author: Mauricio Alvarez, Neil D. Lawrence
Abstract: We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network. 1
5 0.6075055 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
Author: Alan S. Willsky, Erik B. Sudderth, Michael I. Jordan, Emily B. Fox
Abstract: Many nonlinear dynamical phenomena can be effectively modeled by a system that switches among a set of conditionally linear dynamical modes. We consider two such models: the switching linear dynamical system (SLDS) and the switching vector autoregressive (VAR) process. Our nonparametric Bayesian approach utilizes a hierarchical Dirichlet process prior to learn an unknown number of persistent, smooth dynamical modes. We develop a sampling algorithm that combines a truncated approximation to the Dirichlet process with efficient joint sampling of the mode and state sequences. The utility and flexibility of our model are demonstrated on synthetic data, sequences of dancing honey bees, and the IBOVESPA stock index.
6 0.59276229 236 nips-2008-The Mondrian Process
7 0.57318634 31 nips-2008-Bayesian Exponential Family PCA
8 0.53737515 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
9 0.5083037 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
10 0.50387132 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
11 0.50090754 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
12 0.50009239 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
13 0.49685481 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity
14 0.4871572 216 nips-2008-Sparse probabilistic projections
15 0.48034406 249 nips-2008-Variational Mixture of Gaussian Process Experts
16 0.47425056 234 nips-2008-The Infinite Factorial Hidden Markov Model
17 0.46459815 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
18 0.46406457 32 nips-2008-Bayesian Kernel Shaping for Learning Control
19 0.44783926 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
20 0.44728586 105 nips-2008-Improving on Expectation Propagation
topicId topicWeight
[(4, 0.017), (6, 0.054), (7, 0.116), (12, 0.04), (28, 0.142), (57, 0.108), (59, 0.017), (61, 0.295), (63, 0.023), (71, 0.012), (77, 0.027), (78, 0.013), (83, 0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.75847983 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. A matrix with subscripts indicates its submatrix, which consists its entries at the given indices in the subscripts, for example, XIj ,j is a subvector of the j-th column of X whose row indices are in set Ij , X·,j is the j-th column of X (· indicates the full set). Xi,j denotes the (i, j)-th 2 entry of X. |X|2 is the squared sum of elements in set I, i.e. (i,j)∈I Xi,j . We fill the unobserved I elements in Y with 0 for simplicity in notation Algorithm 1 BSRM/F: Gibbs sampling for SRM with features 1: Draw K ∼ IW d+p (δ + m, αI(d+p) + (F, F◦ ) (F, F◦ )); 2: For each i ∈ U0 , draw f i ∼ Nd (K(i) (s−2 G Y i,· + K−1 K12 K−1 f ◦ ), K(i) ), 11·2 22 i −1 where K(i) = s−2 (GIi ,· ) GIi ,· + K−1 ; 11·2 3: Draw R ∼ IW d+q (δ + n, βI(d+q) + (G, G◦ ) (G, G◦ )); 4: For each j ∈ V0 , draw gj ∼ Nd (R(j) (s−2 F Y ·,j + R−1 R12 R−1 g◦ ), R(j) ), 11·2 22 j −1 where R(j) = s−2 (FIj ,· ) FIj ,· + R−1 ; 11·2 5: Draw s2 ∼ χ−2 (ν + r, σ 2 + |Y − FG |2 ). I Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features 1: ∆i,j ← Yi,j − k Fi,k Gj,k , for (i, j) ∈ I; 2: Draw Φ ∼ Wd+p (δ + m + d + p − 1, (αI(d+p) + (F, F◦ ) (F, F◦ ))−1 ); 3: for each (i, k) ∈ U0 × {1, · · · , d} do 4: Draw f ∼ N1 (φ−1 (s−2 ∆i,Ii GIi ,k − Fi,· Φ·,k ), φ−1 ), where φ = s−2 (GIi ,k ) GIi ,k + Φk,k ; 5: Update Fi,k ← Fi,k + f , and ∆i,j ← ∆i,j − f Gj,k , for j ∈ Ii ; 6: end for 7: Draw Ψ ∼ Wd+q (δ + n + d + q − 1, (βI(d+q) + (G, G◦ ) (G, G◦ ))−1 ); 8: for each (j, k) ∈ V0 × {1, · · · , d} do 9: Draw g ∼ N1 (ψ −1 (s−2 ∆Ij ,j FIj ,k −Gj,· Ψ·,k ), ψ −1 ), where ψ = s−2 (FIj ,k ) FIj ,k +Ψk,k ; 10: Update Gj,k ← Gj,k + g and ∆i,j ← ∆i,j − gFi,k , for i ∈ Ij ; 11: end for 12: Draw s2 ∼ χ−2 (ν + r, σ 2 + |∆|2 ). I 8
2 0.73034894 44 nips-2008-Characteristic Kernels on Groups and Semigroups
Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1
3 0.59054798 66 nips-2008-Dynamic visual attention: searching for coding length increments
Author: Xiaodi Hou, Liqing Zhang
Abstract: A visual attention system should respond placidly when common stimuli are presented, while at the same time keep alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limit amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return. 1
4 0.58742261 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
Author: Neil D. Lawrence, Magnus Rattray, Michalis K. Titsias
Abstract: Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation. 1
5 0.5859459 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
6 0.5841288 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
7 0.58382863 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
8 0.58352214 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
9 0.58272451 200 nips-2008-Robust Kernel Principal Component Analysis
10 0.58095998 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
11 0.57974595 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
12 0.57973558 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
13 0.57960171 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
14 0.5775972 194 nips-2008-Regularized Learning with Networks of Features
15 0.57641953 118 nips-2008-Learning Transformational Invariants from Natural Movies
16 0.57605445 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
17 0.57597196 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
18 0.57586449 248 nips-2008-Using matrices to model symbolic relationship
19 0.57565111 4 nips-2008-A Scalable Hierarchical Distributed Language Model
20 0.57545143 138 nips-2008-Modeling human function learning with Gaussian processes