nips nips2008 nips2008-216 knowledge-graph by maker-knowledge-mining

216 nips-2008-Sparse probabilistic projections


Source: pdf

Author: Cédric Archambeau, Francis R. Bach

Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Sparse probabilistic projections Cédric Archambeau, Department of Computer Science, University College London, United Kingdom, c. [sent-1, score-0.141]

2 org Abstract We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. [sent-8, score-0.719]

3 Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. [sent-9, score-0.593]

4 We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. [sent-10, score-0.254]

5 In recent years, several methods for sparse PCA have been designed to find projections which retain maximal variance, while forcing many entries of the projection matrix to be zero [20, 6]. [sent-15, score-0.288]

6 While most of these methods are based on convex or partially convex relaxations of the sparse PCA problem, [16] has looked at using the probabilistic PCA framework of [18] along with ℓ1-regularisation. [sent-16, score-0.246]

7 Recent attempts for constructing sparse CCA include [10, 19]. [sent-19, score-0.148]

8 In this paper, we build on the probabilistic interpretation of CCA outlined by [2] and further extended by [13]. [sent-20, score-0.128]

9 We introduce a general probabilistic model, which allows us to infer, from an arbitrary number of views of the data, both a shared latent representation and an individual low-dimensional representation of each view. [sent-21, score-0.284]

10 Hence, the probabilistic reformulations of PCA and CCA fit this probabilistic framework. [sent-22, score-0.196]

11 Moreover, we are interested in sparse solutions, as these are important for interpretation purposes, denoising or feature extraction. [sent-23, score-0.251]

12 A proper probabilistic approach allows us to treat the trade-off between the modelling accuracy (of the high-dimensional observations by low-dimensional latent variables) and the degree of sparsity of the projection directions in a principled way. [sent-25, score-0.37]

13 For example, we do not need to estimate the sparse components successively, using, e. [sent-26, score-0.148]

14 , deflation, but we can estimate all sparse directions jointly as we are taking the uncertainty of the latent variable into account. [sent-28, score-0.303]

15 In order to ensure sparse solutions we propose two strategies. [sent-29, score-0.148]

16 The first one, discussed in Appendix A, is based on automatic relevance determination (ARD) [14]. [sent-30, score-0.117]

17 The entries in the projection matrix which are not well determined by the data are automatically driven to zero. [sent-32, score-0.097]

18 The second approach uses priors from the generalised hyperbolic family [3], and more specifically the inverse Gamma. [sent-33, score-0.473]

19 In this case, the degree of sparsity can be adjusted, eventually leading to very sparse solutions if desired. [sent-34, score-0.216]

20 For both approaches we derive a variational EM algorithm [15]. [sent-35, score-0.127]

21 (b) Marginal prior on the individual matrix entries (b = 1). [sent-52, score-0.108]

22 The measurement x_p ∈ R^{D_p} is modelled as a mix of a common (or view-independent) continuous latent vector y_0 ∈ R^{Q_0} and a view-dependent continuous latent vector y_p ∈ R^{Q_p}, such that x_p = W_p y_0 + V_p y_p + µ_p + ε_p, where ε_p is the noise in view p. [sent-58, score-0.68]

23 Here W_p ∈ R^{D_p×Q_0} and V_p ∈ R^{D_p×Q_p} are the projection matrices, {µ_p}_{p=1}^P are the view-dependent offsets, and ε_p ∼ N(0, τ_p^{−1} I_{D_p}) (1) is the residual error in view p. We are interested in the case where y_0 and y_p are low-dimensional vectors, i. [sent-59, score-0.187]

24 We impose Gaussian priors on the latent vectors: y_0 ∼ N(0, Φ_0^{−1}) and y_p ∼ N(0, Φ_p^{−1}) for p ∈ {1, ..., P}. (2) [sent-62, score-0.276]
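
A minimal sketch of sampling from this generative model (equations (1)-(2)) with two views; the dimensions, noise precisions and the identity prior precisions Φ_0 = Φ_p = I used below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sizes: two views sharing a common latent vector y0, plus one
# view-specific latent vector yp per view, as in x_p = W_p y0 + V_p yp + mu_p + eps_p.
import numpy as np

rng = np.random.default_rng(0)
P, Q0 = 2, 3                       # number of views, shared latent dimension Q_0
D = [10, 8]                        # observed dimensions D_p
Q = [2, 2]                         # view-specific latent dimensions Q_p
tau = [10.0, 10.0]                 # residual noise precisions tau_p

W = [rng.normal(size=(D[p], Q0)) for p in range(P)]    # shared projections W_p
V = [rng.normal(size=(D[p], Q[p])) for p in range(P)]  # view-specific projections V_p
mu = [np.zeros(D[p]) for p in range(P)]                # offsets mu_p

y0 = rng.normal(size=Q0)           # y0 ~ N(0, I), taking Phi_0 = I for simplicity
x = []
for p in range(P):
    yp = rng.normal(size=Q[p])     # yp ~ N(0, I), taking Phi_p = I
    eps = rng.normal(scale=1.0 / np.sqrt(tau[p]), size=D[p])   # eps_p ~ N(0, tau_p^-1 I)
    x.append(W[p] @ y0 + V[p] @ yp + mu[p] + eps)              # equation (1)
```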

25 The resulting generative model comprises a number of popular probabilistic projection techniques as special cases. [sent-66, score-0.177]

26 If there is a single view (and a single latent cause) and the prior covariance is diagonal, we recover probabilistic factor analysis [9]. [sent-67, score-0.366]

27 If the prior is also isotropic, then we get probabilistic PCA [18]. [sent-68, score-0.158]

28 If there are two views, we recover probabilistic CCA [2]. [sent-69, score-0.098]

29 One way to achieve sparsity is by means of ARD-type priors [14]. [sent-73, score-0.108]

30 In this framework, a zero-mean Gaussian prior is imposed on the entries of the weight matrices: w_{i_p j} ∼ N(0, 1/α_{i_p j}), i_p ∈ {1, . [sent-74, score-0.242]

31 , Q_0}, (3) v_{i_p k_p} ∼ N(0, 1/β_{i_p k_p}), i_p ∈ {1, . [sent-80, score-0.219]

32 (4) Type II maximum likelihood then leads to a sparse solution when considering independent hyperparameters. [sent-87, score-0.148]
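
A toy sketch of the pruning mechanism behind ARD with type II maximum likelihood: each precision hyperparameter is re-estimated from the (posterior) second moment of its weight, so weights that the data do not support acquire huge precisions and are driven to zero. This is the generic ARD re-estimation rule, shown only to illustrate the mechanism; the exact updates used in the paper are those of its Appendix A.

```python
# Minimal illustration of ARD-style pruning via type-II ML precision updates.
import numpy as np

def ard_precision_update(w_mean, w_var, cap=1e8):
    """alpha <- 1 / E[w^2], clipped to avoid numerical overflow."""
    second_moment = w_mean ** 2 + w_var
    return np.minimum(1.0 / np.maximum(second_moment, 1e-12), cap)

# Toy example: two well-determined weights and one negligible weight.
w_mean = np.array([1.2, -0.8, 1e-5])
w_var = np.array([0.01, 0.01, 1e-6])
print(ard_precision_update(w_mean, w_var))  # last precision explodes -> weight pruned
```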

33 The updates arising in the context of probabilistic projections are given in Appendix A. [sent-88, score-0.169]

34 Since marginalisation with respect to both the latent vectors and the weights is intractable, we apply variational EM [15]. [sent-89, score-0.359]

35 In the remainder of this paper, we will assume that the marginal prior on each weight λ_ij, which is an entry of either {W_p}_{p=1}^P or {V_p}_{p=1}^P and will be defined shortly, has the form of an (infinite) weighted sum of scaled Gaussians: p(λ_ij) = ∫ N(0, γ_ij^{−1}) p(γ_ij) dγ_ij. [sent-94, score-0.119]

36 (5) We will choose the prior over γ_ij in such a way that the resulting marginal prior over the corresponding λ_ij induces sparsity. [sent-95, score-0.15]
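
A Monte Carlo sketch of the scale-mixture construction in equation (5): drawing the precision γ_ij from an inverse Gamma and then λ_ij from N(0, γ_ij^{-1}) gives a marginal that is much more peaked and heavy-tailed than a Gaussian of the same variance, which is what induces sparsity. The shape parameters a and b below are arbitrary illustrative values.

```python
# Empirical check of the sparsity-inducing marginal prior on a weight lambda_ij.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a, b, n = 0.5, 1.0, 200_000
gamma = stats.invgamma(a, scale=b).rvs(size=n, random_state=rng)  # precision gamma ~ IG(a, b)
lam = rng.normal(scale=np.sqrt(1.0 / gamma))                      # lambda | gamma ~ N(0, 1/gamma)

# The marginal has much heavier tails (and a sharper peak) than a Gaussian
# with the same variance.
print("excess kurtosis of marginal:", stats.kurtosis(lam))
print("excess kurtosis of Gaussian:", stats.kurtosis(rng.normal(scale=lam.std(), size=n)))
```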

37 A similar approach was followed in the context of sparse nonparametric Bayesian regression in [4, 5]. [sent-96, score-0.148]

38 Let us denote the nth observation, the corresponding latent vector and the means respectively by xn = xn1 , . [sent-99, score-0.286]

39 The generative model can be reformulated as follows: z_n ∼ N(0, Φ^{−1}) with Φ ∈ R^{Q×Q}, Q = Q_0 + Σ_p Q_p, and λ_ij | γ_ij ∼ N(0, γ_ij^{−1}), (6) i ∈ {1, . [sent-109, score-0.628]

40 , Q}, and x_n | z_n, Λ ∼ N(Λ z_n + µ, Ψ^{−1}), (7)-(8), with Λ ∈ R^{D×Q}, Ψ ∈ R^{D×D} and D = Σ_p D_p, where Λ stacks the per-view blocks (row block p contains W_p in the shared columns and V_p in the view-p columns, zeros elsewhere) and Ψ = diag{τ_1 I_{D_1}, ..., τ_P I_{D_P}}. [sent-115, score-0.131]

41 Note that we do not assume that the latent spaces are correlated, as Φ = diag{Φ_0, Φ_1, . [sent-152, score-0.155]
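
A short sketch of the stacking behind equations (6)-(8): the per-view matrices are gathered into one block matrix Λ, with a block-diagonal noise precision Ψ and block-diagonal latent precision Φ. All sizes and values are illustrative assumptions.

```python
# Build the stacked weight matrix Lambda and block-diagonal precisions for two views.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
D, Q0, Q = [10, 8], 3, [2, 2]          # illustrative per-view sizes
tau = [10.0, 10.0]                     # illustrative noise precisions
W = [rng.normal(size=(d, Q0)) for d in D]
V = [rng.normal(size=(d, q)) for d, q in zip(D, Q)]

Lambda = np.block([
    [W[0], V[0],                   np.zeros((D[0], Q[1]))],   # view 1 block row
    [W[1], np.zeros((D[1], Q[0])), V[1]],                     # view 2 block row
])                                                            # shape D x (Q0 + sum_p Qp)
Psi = block_diag(*(t * np.eye(d) for t, d in zip(tau, D)))    # diag{tau_p I_{D_p}}
Phi = np.eye(Q0 + sum(Q))                                     # diag{Phi_0, Phi_1, ...}, identity here
```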

42 This is consistent with the fact that the common latent space is modelled independently through y_0. [sent-156, score-0.181]

43 2 Sparsity-inducing prior over the individual scale variables. We impose an inverse Gamma prior on the scale variable γ_ij: γ_ij ∼ IG(a/DQ, b), (9) for all i and j. [sent-159, score-0.345]

44 The effective prior on the individual weights is shown in Figure 1(b). [sent-163, score-0.096]

45 Intuitively, the joint distribution over the weights is sparsity inducing as it is sharply peaked around zero (and in fact infinite for sufficiently small a). [sent-164, score-0.207]

46 However, it corresponds to imposing a flat prior on the scale variables over the log-scale, which is a limiting case of the Gamma distribution. [sent-169, score-0.177]

47 When imposing independent Gamma priors on the scale variables, the effective joint marginal is a product of Student-t distributions, which again is sharply peaked around zero and sparsity inducing. [sent-170, score-0.305]

48 3 Variational approximation. We view {z_n}_{n=1}^N and the matrix Γ as latent variables, and optimise the parameters θ = {µ, Φ, Λ, Ψ} by EM. [sent-171, score-0.246]

49 In other words, we view the weight matrix Λ as a matrix of parameters and estimate the entries by maximum a posteriori (MAP) learning. [sent-172, score-0.13]

50 The variational free energy is given by F_q(x_1, . [sent-174, score-0.216]

51 , x_N, θ) = − Σ_{n=1}^N ⟨ln p(x_n, z_n, Γ|θ)⟩_q − H[q(z_1, . [sent-177, score-0.689]

52 , z_N, Γ)], (12) where ⟨·⟩_q denotes the expectation with respect to the variational distribution q and H[·] is the differential entropy. [sent-180, score-0.127]

53 Since the Kullback-Leibler divergence (KL) is non-negative, the negative free energy is a lower bound on the log-marginal likelihood: Σ_{n=1}^N ln p(x_n|θ) = −F_q({x_n}, θ) + KL[q({z_n}, Γ) ‖ p({z_n}, Γ|{x_n}, θ)] ≥ −F_q({x_n}, θ). [sent-181, score-0.18]

54 (13) Interestingly, it is not required to make a factorised approximation of the joint posterior q to find a tractable solution. [sent-182, score-0.16]

55 Indeed, the posterior q factorises naturally given the data and the weights, such that the posteriors we will obtain in the E-step are exact. [sent-183, score-0.142]

56 The variational EM finds maximum likelihood estimates for the parameters by cycling through the following two steps until convergence: 1. [sent-184, score-0.127]

57 The posterior over the latent variables is computed for fixed parameters by minimising the KL in (13). [sent-185, score-0.242]

58 It can be shown that the variational posteriors are given by q(z_1, . [sent-186, score-0.166]

59 , z_N) ∝ exp⟨Σ_{n=1}^N ln p(x_n, z_n, Γ|θ)⟩_{q(Γ)} and q(Γ) ∝ exp⟨Σ_{n=1}^N ln p(x_n, z_n, Γ|θ)⟩_{q(z_1,. [sent-189, score-0.182]

60 The variational free energy (12) is minimised wrt the parameters for fixed q. [sent-194, score-0.216]

61 This leads in effect to type II ML estimates for the parameters and is equivalent to maximising the expected complete log-likelihood: θ ← argmax_θ Σ_{n=1}^N ⟨ln p(x_n, z_n, Γ|θ)⟩_q. [sent-195, score-0.727]

62 (16) Depending on the initialisation, the variational EM algorithm converges to a local maximum of the log-marginal likelihood. [sent-196, score-0.127]

63 The convergence can be checked by monitoring the variational lower bound, which monotonically increases during the optimisation. [sent-197, score-0.127]

64 The explicit expression of the variational bound is omitted here due to a lack of space. [sent-198, score-0.127]

65 3.1 Posterior of the latent vectors. The joint posterior of the latent vectors factorises into N posteriors due to the fact that the observations are independent. [sent-199, score-0.534]

66 Hence, the posterior of each low-dimensional latent vector is given by q(z_n) = N(z̄_n, S̄_n), (17) where z̄_n = S̄_n Λ'Ψ(x_n − µ) is the mean and S̄_n = (Λ'ΨΛ + Φ)^{−1} is the covariance. [sent-200, score-0.812]
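
A minimal sketch of this E-step, assuming Λ, Ψ, Φ and µ are given; since Λ is treated as a parameter here, the posterior covariance S̄ is the same for every observation.

```python
# Posterior of the latent vectors, eq. (17): q(z_n) = N(z_bar_n, S_bar).
import numpy as np

def latent_posterior(X, Lambda, Psi, Phi, mu):
    """X: (N, D) observations; returns posterior means (N, Q) and the shared (Q, Q) covariance."""
    S_bar = np.linalg.inv(Lambda.T @ Psi @ Lambda + Phi)   # S_bar = (Lambda' Psi Lambda + Phi)^-1
    Z_bar = (X - mu) @ Psi @ Lambda @ S_bar                # row n is z_bar_n (S_bar is symmetric)
    return Z_bar, S_bar
```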

67 3.2 Posterior of the scale variables. The inverse Gamma distribution is not conjugate to the exponential family. [sent-202, score-0.185]

68 It has the form of a product of generalised inverse Gaussian distributions (see Appendix B for a formal definition): q(Γ) = ∏_{i=1}^D ∏_{j=1}^Q p(γ_ij|λ_ij) = ∏_{i=1}^D ∏_{j=1}^Q N^−(ω̄_ij, φ̄_ij, χ̄_ij), (18) where ω̄_ij = −a/(DQ) + 1/2 is the index, and φ̄_ij = λ²_ij and χ̄_ij = 2b are the shape parameters. [sent-204, score-0.583]

69 The factorised form arises from the scale variables being independent conditioned on the weights. [sent-205, score-0.419]
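
A small sketch of the posterior parameters for one scale variable and the expected precision ⟨γ_ij⟩ they imply, using the Bessel-function ratio R_ω(x) = K_{ω+1}(x)/K_ω(x); the index ω̄_ij follows the reconstruction above, and the numerical values in the example call are illustrative.

```python
# Posterior GIG parameters (eq. (18)) and expected precision for one scale variable.
import numpy as np
from scipy.special import kv   # modified Bessel function of the second kind, K_omega

def gig_posterior_mean(lam, a, b, D, Q):
    omega = 0.5 - a / (D * Q)       # index of the posterior GIG (as reconstructed above)
    phi, chi = lam ** 2, 2.0 * b    # shape parameters phi_bar and chi_bar
    x = np.sqrt(chi * phi)
    # E[gamma] = sqrt(chi/phi) * K_{omega+1}(x) / K_omega(x); blows up as lam -> 0,
    # which is what drives small weights towards zero.
    return np.sqrt(chi / phi) * kv(omega + 1.0, x) / kv(omega, x)

print(gig_posterior_mean(lam=0.3, a=1.0, b=1.0, D=10, Q=6))
```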

70 3.3 Update for the parameters. Based on the properties of the Gaussian and the generalised inverse Gaussian, we can compute the variational lower bound, which can then be maximised. [sent-207, score-0.406]

71 , γ̄_iQ}; these updates use identities such as ⟨(Λ_p z_n)'(Λ_p z_n)⟩ = tr{⟨z_n z_n'⟩ Λ_p'Λ_p} and expectations of the scale variables of the form √(χ̄_ij/φ̄_ij) K_{ω̄+1}(√(χ̄_ij φ̄_ij))/K_{ω̄}(√(χ̄_ij φ̄_ij)). [sent-211, score-2.392]

72 More importantly, since we are seeking a sparse projection matrix, we do not suffer from the rotational ambiguity problem as is, for example, the case for standard probabilistic PCA. [sent-213, score-0.295]

73 4 Experiments. 4.1 Synthetic denoising experiments. Because of identifiability issues, which are the subject of ongoing work, we prefer to compare the various methods for sparse PCA in a denoising experiment. [sent-215, score-0.294]

74 That is, we assume that the data were generated from sparse components plus some noise, and we compare the various sparse PCA methods on the denoising task, i. [sent-216, score-0.369]

75 We generated the data as follows: select uniformly at random M = 4 unit-norm sparse vectors in P = 10 dimensions with a known number S = 4 of non-zero entries, then generate i. [sent-219, score-0.189]
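
A sketch of this synthetic setup; the sample size and noise level are illustrative assumptions, since the extracted text does not state them.

```python
# Generate M sparse unit-norm directions, Gaussian latents and additive noise.
import numpy as np

rng = np.random.default_rng(0)
M, P, S, N, sigma = 4, 10, 4, 500, 0.1   # N and sigma are assumptions

components = np.zeros((M, P))
for m in range(M):
    idx = rng.choice(P, size=S, replace=False)       # positions of the S non-zeros
    components[m, idx] = rng.normal(size=S)
    components[m] /= np.linalg.norm(components[m])   # unit norm

latents = rng.normal(size=(N, M))                    # i.i.d. Gaussian latent variables
X_clean = latents @ components
X_noisy = X_clean + rng.normal(scale=sigma, size=X_clean.shape)
```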

76 When the latent variables are Gaussian, our model exactly matches the data and our method should provide a better fit; however, we also consider situations where the model is misspecified in order to study the robustness of our probabilistic model. [sent-223, score-0.281]

77 We consider our two models: SCA-1 (which uses sparsity priors of the automatic relevance determination type) and SCA-2 (which uses a generalised hyperbolic distribution), where we used 6 latent dimensions (larger than 4) and fixed hyperparameters that lead to vague priors. [sent-224, score-0.71]

78 It can be seen that the two proposed probabilistic approaches perform similarly and significantly outperform the other sparse PCA methods, even when the model is misspecified. [sent-227, score-0.246]

79 Table 1: Denoising experiment with sparse PCA (we report mean squared errors): (top) Gaussian-distributed latent vectors, (middle) latent vectors generated from the uniform distribution, (bottom) latent vectors generated from the Laplace distribution. [sent-278, score-0.695]

80 Figure 2: Power traces and first three principal directions. [sent-291, score-0.102]

81 Clearly, sparse probabilistic PCA can be viewed as being more robust to spurious noise and provides a more reliable and amenable solution. [sent-299, score-0.148]

82 5 Conclusion. In this paper, we introduced a general probabilistic model for inferring sparse probabilistic projection matrices. [sent-300, score-0.393]

83 Sparsity was enforced either by imposing an ARD-type prior or by means of a Normal-Inverse-Gamma prior. [sent-301, score-0.135]

84 Although the inverse Gamma is not conjugate to the exponential family, the posterior is tractable as it is a special case of the generalised inverse Gaussian [12], which in turn is a conjugate prior to this family. [sent-302, score-0.566]

85 A Automatic thresholding of the weights by ARD. In this section, we provide the updates for achieving automatic thresholding of the projection matrix entries in a probabilistic setting. [sent-305, score-0.383]

86 We apply Tipping’s sparse Bayesian theory [8], which is closely related to ARD [14]. [sent-306, score-0.148]

87 More specifically, we assume the prior over the scale variables is uniform over a log-scale, which is a limiting case of the Gamma distribution. [sent-307, score-0.128]

88 Let us view {z_n}_{n=1}^N and Λ as latent variables and optimise the parameters θ = {µ, Φ, Ψ, Γ} by variational EM. [sent-308, score-0.401]

89 The variational free energy is given by F_q(x_1, . [sent-309, score-0.216]

90 , x_N, θ) = − Σ_{n=1}^N ⟨ln p(x_n, z_n, Λ|θ)⟩_q − H[q(z_1, . [sent-312, score-0.689]

91 (24) In order to find a tractable solution, we further have to assume that the approximate posterior q has a factorised form. [sent-316, score-0.16]

92 We can then compute the posterior of the low-dimensional latent vectors: q(z_n) = N(z̄_n, S̄_n), where z̄_n = S̄_n Λ̄'Ψ(x_n − µ) and S̄_n = (Λ̄'ΨΛ̄ + Σ_i Ψ(i,i)Σ̄_i + Φ)^{−1}, and the posterior of the weights is given by [sent-317, score-0.848]

93 q(Λ) = ∏_{i=1}^D q(λ_i) = ∏_{i=1}^D N(λ̄_i, Σ̄_i), (25)-(26), where λ̄_i = Σ̄_i Ψ(i,i) Σ_n (x_n(i) − µ(i)) z̄_n and Σ̄_i = (Γ_i + Ψ(i,i) Σ_n {S̄_n + z̄_n z̄_n'})^{−1}. [sent-318, score-1.255]
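
A minimal sketch of this update for a single row λ_i of the weight matrix, assuming the E-step quantities z̄_n and the shared covariance S̄ are available and that Γ_i is the diagonal matrix of ARD precisions for row i.

```python
# ARD posterior for one row of the weight matrix, eqs. (25)-(26).
import numpy as np

def weight_row_posterior(x_i, mu_i, psi_ii, Z_bar, S_bar, gamma_i):
    """x_i: (N,) observed values of dimension i; Z_bar: (N, Q) posterior means z_bar_n;
    S_bar: (Q, Q) shared posterior covariance; gamma_i: (Q,) ARD precisions of row i."""
    N = Z_bar.shape[0]
    second_moment = N * S_bar + Z_bar.T @ Z_bar            # sum_n (S_bar_n + z_bar_n z_bar_n')
    Sigma_i = np.linalg.inv(np.diag(gamma_i) + psi_ii * second_moment)
    lambda_i = Sigma_i @ (psi_ii * (Z_bar.T @ (x_i - mu_i)))
    return lambda_i, Sigma_i
```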

94 Finally, the updates for the parameters are found by maximising the negative free energy, which corresponds to performing type II ML for the scaling variables. [sent-321, score-0.111]

95 B Generalised inverse Gaussian distribution. The generalised inverse Gaussian distribution is in the class of generalised hyperbolic distributions. [sent-323, score-0.525]

96 The following expectations are useful [12]: ⟨y⟩ = √(χ/φ) R_ω(√(χφ)), ⟨y^{−1}⟩ = √(φ/χ) R_{−ω}(√(χφ)), and ⟨ln y⟩ = ln √(χ/φ) + d ln K_ω(√(χφ))/dω, (30) where R_ω(·) ≡ K_{ω+1}(·)/K_ω(·). [sent-325, score-0.273]
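
A small sketch of the first two expectations in (30), using SciPy's modified Bessel function of the second kind; the parameter values in the example call are arbitrary.

```python
# Moments of y ~ GIG(omega, chi, phi) via the Bessel-function ratio R_omega.
import numpy as np
from scipy.special import kv

def gig_moments(omega, chi, phi):
    x = np.sqrt(chi * phi)
    r = lambda w: kv(w + 1.0, x) / kv(w, x)      # R_omega(x) = K_{omega+1}(x)/K_omega(x)
    e_y = np.sqrt(chi / phi) * r(omega)          # E[y]
    e_inv_y = np.sqrt(phi / chi) * r(-omega)     # E[1/y]
    return e_y, e_inv_y

print(gig_moments(omega=-0.5, chi=2.0, phi=0.09))
```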

97 Inverse Gamma distribution. When φ = 0 and ω < 0, the generalised inverse Gaussian distribution reduces to the inverse Gamma distribution: IG(a, b) = (b^a/Γ(a)) x^{−a−1} e^{−b/x}, a, b > 0. (31) [sent-335, score-0.371]
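
A quick sketch of this inverse Gamma density, checked against SciPy's parameterisation (scipy.stats.invgamma with scale = b).

```python
# Inverse Gamma density of eq. (31), computed in log-space for numerical stability.
import numpy as np
from scipy import stats
from scipy.special import gammaln

def invgamma_pdf(x, a, b):
    return np.exp(a * np.log(b) - gammaln(a) - (a + 1.0) * np.log(x) - b / x)

x, a, b = 1.7, 2.5, 1.3
print(invgamma_pdf(x, a, b), stats.invgamma(a, scale=b).pdf(x))   # the two values should agree
```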

98 Absolute moments of generalized hyperbolic distributions and approximate scaling of normal inverse Gaussian Lévy processes. [sent-358, score-0.246]

99 A direct formulation for sparse PCA using semidefinite programming. [sent-382, score-0.148]

100 Calibration of multivariate generalized hyperbolic distributions using the EM algorithm, with applications in risk management, portfolio optimization and portfolio credit risk. [sent-413, score-0.234]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zn', 0.598), ('ij', 0.304), ('generalised', 0.187), ('latent', 0.155), ('xnp', 0.154), ('hyperbolic', 0.154), ('pca', 0.152), ('sparse', 0.148), ('sn', 0.144), ('xn', 0.131), ('variational', 0.127), ('bessel', 0.126), ('dq', 0.125), ('gamma', 0.113), ('ip', 0.105), ('zou', 0.101), ('cca', 0.101), ('probabilistic', 0.098), ('inverse', 0.092), ('ln', 0.091), ('dspca', 0.088), ('wp', 0.085), ('vp', 0.085), ('yp', 0.081), ('ard', 0.075), ('rdp', 0.075), ('factorised', 0.075), ('fq', 0.075), ('denoising', 0.073), ('principal', 0.073), ('sparsity', 0.068), ('qp', 0.068), ('dp', 0.065), ('prior', 0.06), ('posterior', 0.059), ('em', 0.058), ('kp', 0.057), ('view', 0.053), ('diag', 0.053), ('cryptographic', 0.05), ('projection', 0.049), ('imposing', 0.049), ('entries', 0.048), ('automatic', 0.048), ('canonical', 0.048), ('free', 0.045), ('factorises', 0.044), ('archambeau', 0.044), ('idp', 0.044), ('energy', 0.044), ('projections', 0.043), ('appendix', 0.042), ('laplace', 0.041), ('ml', 0.041), ('vectors', 0.041), ('determination', 0.04), ('portfolio', 0.04), ('attacks', 0.04), ('misspeci', 0.04), ('peaked', 0.04), ('priors', 0.04), ('scale', 0.04), ('posteriors', 0.039), ('thresholding', 0.038), ('optimise', 0.038), ('sharply', 0.038), ('maximising', 0.038), ('xp', 0.038), ('template', 0.037), ('weights', 0.036), ('united', 0.036), ('regularisation', 0.036), ('ig', 0.036), ('tipping', 0.034), ('isotropic', 0.034), ('gaussian', 0.032), ('views', 0.031), ('interpretation', 0.03), ('generative', 0.03), ('marginal', 0.03), ('kl', 0.03), ('relevance', 0.029), ('traces', 0.029), ('hyperparameters', 0.029), ('weight', 0.029), ('ghahramani', 0.028), ('variables', 0.028), ('updates', 0.028), ('pascal', 0.028), ('device', 0.027), ('tractable', 0.026), ('correlation', 0.026), ('modelled', 0.026), ('enforced', 0.026), ('bayesian', 0.026), ('editor', 0.025), ('inducing', 0.025), ('conjugate', 0.025), ('kind', 0.025), ('modi', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 216 nips-2008-Sparse probabilistic projections

Author: Cédric Archambeau, Francis R. Bach

Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks. 1

2 0.15730946 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

Author: David Sontag, Amir Globerson, Tommi S. Jaakkola

Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1

3 0.1269992 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

Author: Xuming He, Richard S. Zemel

Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1

4 0.1138758 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

5 0.11335996 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the ℓ1-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s, and the maximum node degree d are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1-regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d^2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

6 0.10940837 31 nips-2008-Bayesian Exponential Family PCA

7 0.10231685 105 nips-2008-Improving on Expectation Propagation

8 0.093921788 238 nips-2008-Theory of matching pursuit

9 0.091150559 202 nips-2008-Robust Regression and Lasso

10 0.088914536 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

11 0.085955814 32 nips-2008-Bayesian Kernel Shaping for Learning Control

12 0.08558242 249 nips-2008-Variational Mixture of Gaussian Process Experts

13 0.083653226 138 nips-2008-Modeling human function learning with Gaussian processes

14 0.081070602 106 nips-2008-Inferring rankings under constrained sensing

15 0.075548716 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.074827418 200 nips-2008-Robust Kernel Principal Component Analysis

17 0.071442343 233 nips-2008-The Gaussian Process Density Sampler

18 0.070478305 57 nips-2008-Deflation Methods for Sparse PCA

19 0.069805071 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

20 0.069519967 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, -0.055), (2, 0.01), (3, 0.062), (4, 0.071), (5, -0.052), (6, -0.049), (7, 0.144), (8, -0.124), (9, 0.027), (10, -0.06), (11, -0.07), (12, 0.134), (13, 0.048), (14, 0.012), (15, -0.075), (16, 0.008), (17, 0.015), (18, 0.061), (19, -0.088), (20, 0.045), (21, -0.188), (22, 0.01), (23, 0.045), (24, -0.036), (25, -0.166), (26, 0.018), (27, -0.016), (28, 0.017), (29, 0.068), (30, 0.05), (31, -0.109), (32, -0.039), (33, 0.073), (34, -0.024), (35, 0.063), (36, -0.025), (37, -0.03), (38, -0.033), (39, 0.029), (40, 0.036), (41, 0.092), (42, -0.104), (43, -0.047), (44, 0.089), (45, -0.023), (46, 0.004), (47, 0.07), (48, -0.116), (49, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96384746 216 nips-2008-Sparse probabilistic projections

Author: Cédric Archambeau, Francis R. Bach

Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks. 1

2 0.70580065 31 nips-2008-Bayesian Exponential Family PCA

Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1

3 0.60069275 105 nips-2008-Improving on Expectation Propagation

Author: Manfred Opper, Ulrich Paquet, Ole Winther

Abstract: A series of corrections is developed for the fixed points of Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unreliable results.

4 0.52933383 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

Author: Zhihua Zhang, Michael I. Jordan, Dit-Yan Yeung

Abstract: Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as “Silverman’s g-prior.” We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion. 1

5 0.50225282 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1

6 0.50089401 249 nips-2008-Variational Mixture of Gaussian Process Experts

7 0.49404088 233 nips-2008-The Gaussian Process Density Sampler

8 0.49380934 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

9 0.4833324 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression

10 0.47781619 238 nips-2008-Theory of matching pursuit

11 0.47294959 200 nips-2008-Robust Kernel Principal Component Analysis

12 0.4666203 32 nips-2008-Bayesian Kernel Shaping for Learning Control

13 0.46423692 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

14 0.44706756 134 nips-2008-Mixed Membership Stochastic Blockmodels

15 0.42393854 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

16 0.41168621 138 nips-2008-Modeling human function learning with Gaussian processes

17 0.40427846 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction

18 0.40300992 202 nips-2008-Robust Regression and Lasso

19 0.39284131 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

20 0.38505527 234 nips-2008-The Infinite Factorial Hidden Markov Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.067), (7, 0.081), (12, 0.033), (15, 0.01), (28, 0.133), (57, 0.101), (59, 0.016), (63, 0.029), (71, 0.01), (77, 0.287), (78, 0.093), (81, 0.011), (83, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93118948 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

Author: Christoph Kolodziejski, Bernd Porr, Minija Tamosiunaite, Florentin Wörgötter

Abstract: In this theoretical contribution we provide mathematical proof that two of the most important classes of network learning - correlation-based differential Hebbian learning and reward-based temporal difference learning - are asymptotically equivalent when timing the learning with a local modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation based perspective that is more closely related to the biophysics of neurons. 1

2 0.92194289 61 nips-2008-Diffeomorphic Dimensionality Reduction

Author: Christian Walder, Bernhard Schölkopf

Abstract: This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets. 1

3 0.91365963 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

Author: Haixuan Yang, Irwin King, Michael Lyu

Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1

4 0.90860057 144 nips-2008-Multi-resolution Exploration in Continuous Spaces

Author: Ali Nouri, Michael L. Littman

Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1

same-paper 5 0.87361228 216 nips-2008-Sparse probabilistic projections

Author: Cédric Archambeau, Francis R. Bach

Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks. 1

6 0.72376478 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

7 0.70963407 200 nips-2008-Robust Kernel Principal Component Analysis

8 0.70627308 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

9 0.69493675 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

10 0.68309522 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

11 0.67580426 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

12 0.6723963 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

13 0.67166191 124 nips-2008-Load and Attentional Bayes

14 0.66780555 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

15 0.66478056 44 nips-2008-Characteristic Kernels on Groups and Semigroups

16 0.66173375 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

17 0.65906143 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

18 0.65681309 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data

19 0.65558141 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes

20 0.65505642 195 nips-2008-Regularized Policy Iteration