nips nips2008 nips2008-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller
Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. [sent-5, score-0.466]
2 This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. [sent-6, score-0.184]
3 We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. [sent-7, score-0.242]
4 1 Introduction In Principal Components Analysis (PCA) we seek to reduce the dimensionality of a D-dimensional data vector to a smaller K-dimensional vector, which represents an embedding of the data in a lower dimensional space. [sent-8, score-0.226]
5 The traditional PCA algorithm is non-probabilistic and defines the eigenvectors corresponding to the K-largest eigenvalues as this low dimensional embedding. [sent-9, score-0.073]
6 In probabilistic approaches to PCA, such as probabilistic PCA (PPCA) and Bayesian PCA [1], the data is modelled by unobserved latent variables, and these latent variables define the low dimensional embedding. [sent-10, score-0.4]
7 In these models both the data and the latent variables are assumed to be Gaussian distributed. [sent-11, score-0.137]
8 This Gaussian assumption may not be suitable for all data types, especially in the case where data is binary or integer valued. [sent-12, score-0.054]
9 These general approaches to PCA involve the representation of the data matrix X as a product of smaller matrices: the factor score matrix V, representing the reduced vectors; and a data independent part Θ, known as the factor loading matrix. [sent-14, score-0.289]
10 In the original data matrix, there are N × D entries, and in the matrix factorisation there are (N + D) × K entries, which is a reduction in the number of parameters if K ≪ N, D [3]. [sent-15, score-0.264]
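As an illustrative worked example (numbers chosen here for concreteness, not taken from the paper): with N = 1000 data points of dimension D = 64 and K = 10 latent factors, the original matrix has 1000 × 64 = 64,000 entries, whereas the factorised representation has (1000 + 64) × 10 = 10,640 entries.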
11 Models such as PCA, NMF and EPCA are from the class of deterministic latent variable models [6], since their latent variables are set to their maximum a posteriori (MAP) values. [sent-16, score-0.251]
12 This problem stems from the use of an inappropriate objective function, and can be remedied by using an alternate approximate inference scheme. [sent-19, score-0.053]
13 In this paper, we propose a fully Bayesian approach to PCA generalised to the exponential family. [sent-20, score-0.161]
14 Our approach follows the method of factorising the data matrix into two lower rank matrices using an exponential family distribution for the data with conjugate priors. [sent-21, score-0.462]
15 The exponential family of distributions is reviewed in section 2, and the complete specification for the model is given in section 3. [sent-22, score-0.254]
16 Learning and inference in the model are performed using the Hybrid Monte Carlo approach, which is appropriate due to the continuous nature of the variables in the model. [sent-23, score-0.069]
17 We present results on the performance of our Bayesian exponential family PCA model in section 5. [sent-25, score-0.254]
18 We report performance using both a synthetic data set to highlight particular model properties and also on two real datasets: the Cedar Buffalo digits dataset and data on cardiac SPECT images. [sent-26, score-0.185]
19 The Bayesian approach gives us many samples of the final low dimensional embedding of the data, and techniques for determining a single low dimensional embedding are discussed in section 6. [sent-27, score-0.352]
20 In this paper, the natural representation of the exponential family likelihood is used, such that s(xn ) = xn . [sent-30, score-0.329]
21 It is convenient to represent a variable xn that is drawn from an exponential family distribution using the notation: xn ∼ Expon(θ) with natural parameters θ. [sent-31, score-0.467]
22 Probability distributions that belong to the exponential family also have natural conjugate prior distributions p(θ). [sent-32, score-0.289]
23 The conjugate prior distribution for the exponential family distribution of equation (1) is: p(θ) ∝ exp{λᵀθ + νg(θ) + f(λ)} (2) where λ and ν are hyperparameters of the prior distribution. [sent-33, score-0.357]
24 As an example, for binary data an appropriate data distribution is the Bernoulli distribution. [sent-35, score-0.079]
25 The exponential family form of this distribution, using the terms in equation (1), is: h(x) = 0, θ = ln(µ/(1 − µ)) and g(θ) = − ln(1 + eθ ). [sent-37, score-0.234]
26 The natural parameters can be mapped to the parameter values of the distribution using the link function, which is the logistic sigmoid in the case of the Bernoulli distribution. [sent-38, score-0.063]
27 The terms of the conjugate distribution can also be derived easily. [sent-39, score-0.062]
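As a concrete illustration of the Bernoulli case above, the natural-parameter representation and the logistic link could be coded as follows (a minimal sketch assuming NumPy; the function names are illustrative and not from the paper):

    import numpy as np

    def bernoulli_natural_param(mu):
        # theta = ln(mu / (1 - mu)), the logit of the mean parameter.
        return np.log(mu / (1.0 - mu))

    def bernoulli_log_lik(x, theta):
        # Exponential family form with s(x) = x, h(x) = 0 and g(theta) = -ln(1 + e^theta):
        # ln p(x | theta) = x * theta + g(theta).
        return x * theta - np.log1p(np.exp(theta))

    def link(theta):
        # The logistic sigmoid maps natural parameters back to mean parameters.
        return 1.0 / (1.0 + np.exp(-theta))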
28 3 Bayesian Exponential Family PCA We can consider Bayesian Exponential Family PCA (BXPCA) as a method of searching for two matrices V and Θ, and we define the product matrix P = VΘ. [sent-40, score-0.094]
29 In traditional PCA, the elements of the matrix P, which are the means of Gaussians, lie in the same space as that of the data X. [sent-41, score-0.077]
30 In the case of BXPCA and other methods for non-Gaussian PCA such as EPCA [4], this matrix represents the natural parameters of the exponential family distribution of the data. [sent-42, score-0.347]
31 We represent the observed data as an N × D matrix X = {x1 , . [sent-43, score-0.077]
32 , xN }, with an individual data point xn = [xn1 , . [sent-46, score-0.104]
33 K is the number of latent factors representing the dimensionality of the reduced space. [sent-59, score-0.142]
34 Let m and S be hyperparameters representing a K-dimensional vector of initial mean values and an initial covariance matrix respectively. [sent-62, score-0.086]
35 Let α and β be the hyperparameters corresponding to the shape and scale parameters of an inverse Gamma distribution. [sent-63, score-0.056]
36 We start by drawing µ from a Gaussian distribution and the elements σk² of the diagonal matrix Σ from an inverse gamma distribution: µ ∼ N (µ|m, S), σk² ∼ iG(α, β) (3). (Figure 1: Graphical Model for Bayesian Exponential Family PCA.) [sent-64, score-0.092]
37 For each data point n, we draw the K-dimensional entry vn of the factor score matrix: vn ∼ N (vn |µ, Σ) (4) The data is described by an exponential family distribution with natural parameters θ k . [sent-65, score-0.56]
38 The exponential family distribution modelling the data, and the corresponding prior over the model parameters, is: xn |vn , Θ ∼ Expon(∑k vnk θk ), θk ∼ Conj(λ, ν) (5). We denote Ω = {V, Θ, µ, Σ} as the set of unknown parameters with hyperparameters Ψ = {m, S, α, β, λ, ν}. [sent-66, score-0.487]
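A minimal sketch of ancestral sampling from this generative model for binary data is given below (assuming NumPy; the stand-in Gaussian prior on Θ and all hyperparameter choices are illustrative assumptions rather than the paper's exact specification):

    import numpy as np

    def sample_bxpca_prior(N, D, K, m, S, alpha, beta, seed=0):
        rng = np.random.default_rng(seed)
        # Equation (3): mu ~ N(m, S) and sigma_k^2 ~ inverse-Gamma(alpha, beta).
        mu = rng.multivariate_normal(m, S)
        sigma2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=K)  # inverse gamma via 1/Gamma
        # Equation (4): rows v_n of the factor score matrix V are Gaussian.
        V = rng.multivariate_normal(mu, np.diag(sigma2), size=N)         # N x K
        # Stand-in for the conjugate prior Conj(lambda, nu) on the loadings Theta.
        Theta = rng.normal(0.0, 1.0, size=(K, D))                        # K x D
        # Equation (5): P = V Theta holds natural parameters; Bernoulli data via the logistic link.
        P = V @ Theta
        X = rng.binomial(1, 1.0 / (1.0 + np.exp(-P)))
        return X, V, Theta, mu, sigma2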
39 2 Learning The model parameters Ω = {V, Θ, µ, Σ} are learned from the data using Hybrid Monte Carlo (HMC) sampling [7]. [sent-69, score-0.085]
40 Hybrid Monte Carlo is a suitable sampler for use with this model since all the variables are continuous and it is possible to compute the derivative of the log-joint probability. [sent-71, score-0.062]
41 HMC is also an attractive scheme for sampling since it avoids the random walk behaviour of the Metropolis or the Gibbs sampling algorithms [7]. [sent-72, score-0.054]
42 Hybrid Monte Carlo (HMC) is an auxiliary variable sampler where we sample from an augmented distribution p(x, u), rather than the target distribution p(x), since it is easier to sample from this augmented distribution [8]. [sent-73, score-0.164]
43 In BXPCA, the target distribution is the posterior over Ω, and E(Ω|Ψ) = − ln p(X, Ω|Ψ) represents the potential energy function. [sent-75, score-0.153]
44 The auxiliary variable u is Gaussian and is used to define the kinetic energy K = ½ uᵀu; the gradient ∂E(Ω)/∂Ω is used in simulating the dynamics. [sent-76, score-0.088]
45 The sum of the kinetic and the potential energy defines the Hamiltonian. [sent-78, score-0.072]
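The flavour of the HMC update described here can be sketched as follows (a generic leapfrog step, assuming NumPy; `energy` and `grad_energy` stand for user-supplied evaluations of E(Ω) = − ln p(X, Ω|Ψ) and its gradient, and the step size and number of leapfrog steps are illustrative):

    import numpy as np

    def hmc_step(omega, energy, grad_energy, eps=0.01, n_leapfrog=20, seed=None):
        # omega: flat vector of unconstrained parameters.
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(omega.shape)              # auxiliary Gaussian momentum
        current_H = energy(omega) + 0.5 * np.sum(u * u)   # Hamiltonian = potential + kinetic

        # Leapfrog simulation of the Hamiltonian dynamics.
        omega_new = omega.copy()
        u_new = u - 0.5 * eps * grad_energy(omega_new)
        for _ in range(n_leapfrog):
            omega_new = omega_new + eps * u_new
            u_new = u_new - eps * grad_energy(omega_new)
        u_new = u_new + 0.5 * eps * grad_energy(omega_new)  # convert the last full step to a half step

        # Metropolis acceptance on the change in the Hamiltonian.
        proposed_H = energy(omega_new) + 0.5 * np.sum(u_new * u_new)
        if np.log(rng.uniform()) < current_H - proposed_H:
            return omega_new
        return omega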
46 Therefore to correctly apply HMC to this model, we must ensure that all constrained variables are transformed to an unconstrained space, perform dynamics in this unconstrained space, and then transform the variables back to the original constrained space. [sent-82, score-0.166]
47 Each σk² can be transformed to a corresponding unconstrained variable ξk using the transformation σk² = eξk . [sent-84, score-0.08]
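For instance, the variance transformation might be handled as below (a sketch; the added ξk term is the standard log-Jacobian of the change of variables, and the inverse-gamma parameterisation with shape α and scale β is an assumption):

    import numpy as np
    from math import lgamma

    def to_unconstrained(sigma2):
        # sigma_k^2 = exp(xi_k), so HMC dynamics can run on xi_k over the whole real line.
        return np.log(sigma2)

    def log_invgamma_in_xi(xi, alpha, beta):
        # Inverse-gamma log-density evaluated at sigma^2 = exp(xi),
        # plus the log-Jacobian ln|d sigma^2 / d xi| = xi.
        sigma2 = np.exp(xi)
        log_ig = (alpha * np.log(beta) - lgamma(alpha)
                  - (alpha + 1.0) * np.log(sigma2) - beta / sigma2)
        return np.sum(log_ig + xi)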
48 We also extended the HMC procedure to handle missing inputs in a principled manner, by analytically integrating them out. [sent-86, score-0.066]
49 In practice, this implies working with missing data under the Missing at Random (MAR) assumption. [sent-87, score-0.093]
50 Here, we divide the data into the set of observed and missing data, X = {Xobs , Xmissing }, and use the set Xobs in the inference. [sent-88, score-0.093]
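For a fully factorised Bernoulli likelihood this amounts to masking the missing entries out of the log-likelihood, e.g. (a sketch assuming NumPy; `mask` is 1 for observed entries and 0 for missing ones):

    import numpy as np

    def observed_log_lik(X, mask, V, Theta):
        # Natural parameters P = V Theta; only entries of X_obs contribute, which is
        # equivalent to summing the missing binary values out under the MAR assumption.
        P = V @ Theta
        log_lik = X * P - np.log1p(np.exp(P))   # elementwise Bernoulli log-likelihood
        return np.sum(mask * log_lik)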
51 4 Related Work Exponential Family PCA: Exponential family PCA (EPCA) [4] is a general class of PCA algorithms that allows the ideas of PCA to be applied to any data that can be modelled from a distribution in the exponential family. [sent-89, score-0.345]
52 Like BXPCA, it is based on a factorisation of the data into a factor score matrix V and a factor loading matrix Θ. [sent-90, score-0.373]
53 The algorithm optimises a loss function based on the Bregman divergence between the data and the learned reconstruction of the data. [sent-91, score-0.069]
54 The learning is based on an alternating minimisation procedure where the two matrices V and Θ are optimised in turn, and each optimisation is a convex function. [sent-92, score-0.114]
55 The EPCA objective function can be seen as the likelihood function of a probabilistic model, and hence this optimisation corresponds to maximum a posteriori (MAP) learning. [sent-93, score-0.083]
56 The use of MAP learning makes EPCA a deterministic latent variable model [6], since the latent variables are set to their MAP values. [sent-94, score-0.253]
57 In both our model and EPCA, the product P = VΘ represents the natural parameters of the distribution over the data, and must be transformed using the link function to get to the parameter space of the associated data distribution. [sent-95, score-0.159]
58 Our model is different from EPCA in that it is a fully probabilistic model in which all parameters can be integrated out by MCMC. [sent-96, score-0.083]
59 Non-negative Matrix Factorisation: Non-negative Matrix Factorisation (NMF) [2] is a technique of factorising a matrix into the product of two positive lower rank matrices. [sent-99, score-0.108]
60 In NMF, the matrix product P approximates the mean parameters of the data distribution, and is thus in the same space as the data. [sent-100, score-0.117]
61 A mean parameter, for example, is the rate λ if the data is modelled as a Poisson distribution, or the probability of the data being a 1 if the data is modelled as a Bernoulli. [sent-101, score-0.199]
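The distinction between the two parameter spaces can be made concrete with a small example (illustrative numbers only, assuming NumPy; neither matrix is taken from either paper):

    import numpy as np

    V = np.array([[0.8, 0.2]])            # 1 x K factor scores (non-negative here so both readings apply)
    Theta = np.array([[2.0, 1.0],         # K x D factor loadings
                      [0.5, 3.0]])
    P = V @ Theta                         # [[1.7, 1.4]]

    # BXPCA / EPCA: P holds natural parameters, so a link function is applied,
    # e.g. the logistic sigmoid for Bernoulli data.
    bernoulli_means = 1.0 / (1.0 + np.exp(-P))

    # NMF: V and Theta are constrained to be non-negative and P itself is read as the
    # mean parameter, e.g. a Poisson rate lambda.
    poisson_rates = P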
62 Similarly to EPCA, this learning method places NMF in the class of deterministic latent variable methods. [sent-103, score-0.123]
63 Discrete Components Analysis: The Discrete Components Analysis (DCA) [3] is a family of probabilistic algorithms that deals with the application of PCA to discrete data and is a unification of the existing theory relating to dimensionality reduction with discrete distributions. [sent-104, score-0.264]
64 The various algorithms of the DCA family are simulated using either Gibbs sampling or variational approximations. [sent-106, score-0.142]
65 Bayesian Partial Membership: The Bayesian Partial Membership (BPM) model is a clustering technique that allows data points to have fractional membership in multiple clusters. [sent-107, score-0.086]
66 The model is derived from a finite mixture model which allows the usual indicator variables to take on any value in the range [0,1]. [sent-108, score-0.065]
67 The resulting model has the same form as the model shown in figure 1, but instead of the model variable V being modelled as a Gaussian with unknown mean and covariance, it is modelled with a Dirichlet distribution. [sent-109, score-0.194]
68 In the BXPCA, we interpret the matrix V as a lower dimensional embedding of the data which can be used for dimensionality reduction. [sent-111, score-0.249]
69 In contrast, the corresponding matrix for the BPM model, whose values are restricted to [0,1], is the partial membership of each data point and represents the extent to which each data point belongs to each of the K clusters. [sent-112, score-0.143]
70 In the initial phase of the sampling, the energy decreases slowly and the model is unable to learn any useful structure from the data. [sent-120, score-0.065]
71 Around sample 750, the energy function decreases and some useful structure has been learnt. [sent-121, score-0.073]
72 By sample 4000 the model has learnt the original data well, as can be seen by comparing sample 4000 and the original data. [sent-122, score-0.155]
73 (Figure 3: Boxplots comparing the NLP and RMSE of BXPCA and EPCA for various latent factors; panels (c) and (d) plot these metrics against the number of latent factors K.) [sent-143, score-0.103]
74 The test data was created by randomly selecting 10% of the data points. [sent-145, score-0.054]
75 These test data points were set as missing values in the training data. [sent-146, score-0.093]
76 Inference is then run using BXPCA, which has been extended to consider missing data. [sent-147, score-0.066]
77 This method of using missing data is a natural way of testing these algorithms, since both are generative models. [sent-148, score-0.13]
78 We evaluate the same metrics for EPCA, which is also trained considering missing data. [sent-150, score-0.066]
79 This missing data testing methodology is also used in the experiments on real data that are described later. [sent-151, score-0.139]
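A sketch of this hold-out protocol and of the two metrics (assuming NumPy; the 10% rate follows the text, while the exact metric definitions here are standard assumptions rather than the paper's formulas):

    import numpy as np

    def make_missing_split(X, frac=0.1, seed=0):
        # Randomly mark a fraction of entries as held-out test values; the rest are X_obs.
        rng = np.random.default_rng(seed)
        test_mask = rng.random(X.shape) < frac
        return ~test_mask, test_mask            # train_mask, test_mask

    def rmse(X, prob, mask):
        return np.sqrt(np.mean((X[mask] - prob[mask]) ** 2))

    def negative_log_prob(X, prob, mask, eps=1e-12):
        # Average negative log-probability of the held-out binary entries.
        p = np.clip(prob[mask], eps, 1.0 - eps)
        x = X[mask]
        return -np.mean(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))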
80 In figures 3a and 3b, the RMSE and NLP of the two algorithms are compared, respectively, for various choices of the number of latent factors K. [sent-152, score-0.106]
81 The digit 2 was used; this subset consists of 700 greyscale images with 64 attributes. [sent-162, score-0.068]
82 The digits were binarised by thresholding at a greyscale value of 128 from the 0 to 255 greyscale range. [sent-163, score-0.135]
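The binarisation step is a one-liner (sketch; whether the threshold value itself maps to 0 or 1 is an assumption):

    import numpy as np

    def binarise(images, threshold=128):
        # images: N x 64 array of greyscale values in [0, 255].
        return (images >= threshold).astype(np.int8)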
83 Table 1 compares the performance of BXPCA and EPCA, using the same method of creating training and testing data sets as for the synthetic data. [sent-164, score-0.108]
84 SPECT Data: The data set describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images [11]. [sent-166, score-0.057]
85 6 Choice of Final Embedding: For the purposes of dimensionality reduction, PCA is used to search for a low dimensional embedding V of the data points. [sent-220, score-0.222]
86 In EPCA, the alternating minimisation returns a single V that is the low dimensional representation. [sent-221, score-0.121]
87 In BXPCA though, we do not get a single V, but rather many samples which represent the variation in the embedding. [sent-222, score-0.056]
88 There are several approaches to obtaining a single low dimensional representation from the set of samples. [sent-225, score-0.073]
89 We then sample V from the conditional distribution: V ∼ p(V|X, Θ∗ , µ∗ , Σ∗ ) ∝ p(X|V, Θ∗ )p(V|µ∗ , Σ∗ ) (8) where equation (8) is obtained using Bayes theorem and the joint probability distribution given in equation (6). [sent-230, score-0.053]
90 We can now average these samples to obtain a single embedding since the problems of rotation and permutation have been removed by constraining the variables {Θ∗ , µ∗ , Σ∗ }. [sent-231, score-0.14]
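A sketch of this procedure (assuming NumPy and a user-supplied sampler for the conditional in equation (8); all names are illustrative):

    import numpy as np

    def final_embedding(X, Theta_star, mu_star, Sigma_star, sample_V_given, n_samples=20):
        # Fix (Theta*, mu*, Sigma*) to a single chosen posterior sample, which pins down the
        # rotation and permutation, then repeatedly draw V ~ p(V | X, Theta*, mu*, Sigma*).
        samples = [sample_V_given(X, Theta_star, mu_star, Sigma_star) for _ in range(n_samples)]
        # Averaging is now meaningful because all draws share the same alignment.
        return np.mean(np.stack(samples), axis=0)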
91 We demonstrate this procedure using the synthetic data described in the previous section for K = 2. [sent-232, score-0.071]
92 Figure 4 shows the embedding in the 2D space for 10 data points and 20 independent samples drawn according to equation (8). [sent-233, score-0.142]
93 7 Conclusions and Future Work We have described a Bayesian approach to PCA which is generalised to the exponential family. [sent-242, score-0.161]
94 We have employed a Hybrid Monte Carlo sampling scheme with an energy based on the log-joint probability of the model. [sent-243, score-0.063]
95 In particular, we have demonstrated the ability of BXPCA to learn the structure of the data while avoiding overfitting problems, which are experienced by other maximum likelihood approaches to exponential family PCA. [sent-244, score-0.261]
96 Instead of considering a Gaussian distribution, a Laplacian or other heavy tailed distribution could be used, which would allow us to determine the lower dimensional embedding of the data, and also give the model a sparseness property. [sent-247, score-0.186]
97 We could also specifically include restrictions on the form of the score and the loading matrices, V and Θ respectively, to ensure identifiability. [sent-248, score-0.094]
98 Schapire, “A generalization of principal components to the exponential family,” in Advances in Neural Information Processing Systems, vol. [sent-274, score-0.153]
99 Orlitsky, “Semi-parametric exponential family PCA,” in Advances in Neural Information Processing Systems, vol. [sent-278, score-0.234]
100 Sutter, “Deterministic latent variable models and their pitfalls,” in SIAM Conference on Data Mining (SDM), pp. [sent-284, score-0.101]
wordName wordTfidf (topN-words)
[('epca', 0.6), ('bxpca', 0.525), ('pca', 0.227), ('hmc', 0.169), ('nlp', 0.169), ('rmse', 0.146), ('factorisation', 0.131), ('family', 0.124), ('nmf', 0.112), ('exponential', 0.11), ('spect', 0.094), ('embedding', 0.091), ('latent', 0.085), ('ln', 0.083), ('vn', 0.08), ('xn', 0.077), ('dca', 0.075), ('vnk', 0.075), ('missing', 0.066), ('modelled', 0.059), ('hybrid', 0.057), ('carlo', 0.055), ('monte', 0.053), ('generalised', 0.051), ('dimensional', 0.05), ('matrix', 0.05), ('greyscale', 0.049), ('notch', 0.049), ('bits', 0.049), ('energy', 0.045), ('loading', 0.045), ('tting', 0.044), ('synthetic', 0.044), ('optimisation', 0.042), ('membership', 0.039), ('bayesian', 0.039), ('bpm', 0.038), ('buffalo', 0.038), ('cedar', 0.038), ('conj', 0.038), ('expon', 0.038), ('factorising', 0.038), ('xobs', 0.038), ('digits', 0.037), ('conjugate', 0.037), ('hyperparameters', 0.036), ('unconstrained', 0.035), ('variation', 0.032), ('dimensionality', 0.031), ('cardiac', 0.03), ('minimisation', 0.03), ('transformed', 0.029), ('alternate', 0.029), ('sample', 0.028), ('final', 0.028), ('score', 0.028), ('poisson', 0.027), ('data', 0.027), ('kinetic', 0.027), ('mackay', 0.027), ('factors', 0.026), ('tipping', 0.025), ('variables', 0.025), ('distribution', 0.025), ('samples', 0.024), ('inference', 0.024), ('matrices', 0.024), ('probabilistic', 0.023), ('components', 0.023), ('bernoulli', 0.023), ('box', 0.023), ('gure', 0.023), ('low', 0.023), ('deterministic', 0.022), ('cambridge', 0.022), ('demonstrates', 0.021), ('welling', 0.021), ('restrictions', 0.021), ('factor', 0.021), ('principal', 0.02), ('parameters', 0.02), ('discrete', 0.02), ('product', 0.02), ('model', 0.02), ('reduction', 0.019), ('testing', 0.019), ('digit', 0.019), ('natural', 0.018), ('creating', 0.018), ('alternating', 0.018), ('ma', 0.018), ('sampling', 0.018), ('comparing', 0.018), ('avoids', 0.018), ('posteriori', 0.018), ('sampler', 0.017), ('gamma', 0.017), ('original', 0.017), ('variable', 0.016), ('andrieu', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 31 nips-2008-Bayesian Exponential Family PCA
Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller
Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1
2 0.14043881 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
Author: Yuhong Guo
Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1
3 0.10940837 216 nips-2008-Sparse probabilistic projections
Author: Cédric Archambeau, Francis R. Bach
Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a preprocessing tool for the construction of template attacks. 1
4 0.096723802 200 nips-2008-Robust Kernel Principal Component Analysis
Author: Minh H. Nguyen, Fernando Torre
Abstract: Kernel Principal Component Analysis (KPCA) is a popular generalization of linear PCA that allows non-linear feature extraction. In KPCA, data in the input space is mapped to higher (usually) dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA such as robust PCA cannot be directly generalized to KPCA. This paper presents a technique to overcome this problem, and extends it to a unified framework for treating noise, missing data, and outliers in KPCA. Our method is based on a novel cost function to perform inference in KPCA. Extensive experiments, in both synthetic and real data, show that our algorithm outperforms existing methods. 1
5 0.068340339 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1
6 0.061605494 138 nips-2008-Modeling human function learning with Gaussian processes
7 0.057659503 61 nips-2008-Diffeomorphic Dimensionality Reduction
8 0.056957543 62 nips-2008-Differentiable Sparse Coding
9 0.055169985 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
10 0.054961473 233 nips-2008-The Gaussian Process Density Sampler
11 0.054125782 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
12 0.051686317 238 nips-2008-Theory of matching pursuit
13 0.051656324 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
14 0.050262131 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
15 0.049499612 235 nips-2008-The Infinite Hierarchical Factor Regression Model
16 0.049399637 57 nips-2008-Deflation Methods for Sparse PCA
17 0.044065841 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
18 0.042076275 219 nips-2008-Spectral Hashing
19 0.041992601 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
20 0.041470114 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
topicId topicWeight
[(0, -0.136), (1, -0.032), (2, 0.024), (3, 0.026), (4, 0.014), (5, -0.021), (6, -0.002), (7, 0.124), (8, -0.043), (9, -0.003), (10, 0.045), (11, -0.057), (12, 0.078), (13, 0.062), (14, 0.021), (15, -0.087), (16, 0.04), (17, 0.022), (18, 0.07), (19, -0.065), (20, -0.052), (21, -0.12), (22, 0.076), (23, -0.014), (24, -0.141), (25, -0.127), (26, -0.065), (27, -0.003), (28, -0.124), (29, 0.144), (30, 0.012), (31, -0.107), (32, 0.031), (33, -0.003), (34, -0.033), (35, 0.071), (36, 0.046), (37, -0.032), (38, -0.031), (39, -0.011), (40, 0.007), (41, 0.008), (42, 0.077), (43, -0.087), (44, 0.08), (45, -0.048), (46, 0.102), (47, 0.006), (48, -0.124), (49, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.91680866 31 nips-2008-Bayesian Exponential Family PCA
Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller
Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1
2 0.70865417 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
Author: Yuhong Guo
Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1
3 0.67871672 200 nips-2008-Robust Kernel Principal Component Analysis
Author: Minh H. Nguyen, Fernando Torre
Abstract: Kernel Principal Component Analysis (KPCA) is a popular generalization of linear PCA that allows non-linear feature extraction. In KPCA, data in the input space is mapped to higher (usually) dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA such as robust PCA cannot be directly generalized to KPCA. This paper presents a technique to overcome this problem, and extends it to a unified framework for treating noise, missing data, and outliers in KPCA. Our method is based on a novel cost function to perform inference in KPCA. Extensive experiments, in both synthetic and real data, show that our algorithm outperforms existing methods. 1
4 0.66398329 216 nips-2008-Sparse probabilistic projections
Author: Cédric Archambeau, Francis R. Bach
Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a preprocessing tool for the construction of template attacks. 1
5 0.52572203 238 nips-2008-Theory of matching pursuit
Author: Zakria Hussain, John S. Shawe-taylor
Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1
6 0.50584573 61 nips-2008-Diffeomorphic Dimensionality Reduction
7 0.49388415 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
8 0.47792709 233 nips-2008-The Gaussian Process Density Sampler
9 0.44740123 105 nips-2008-Improving on Expectation Propagation
10 0.41260916 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity
11 0.39521456 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
12 0.38381144 57 nips-2008-Deflation Methods for Sparse PCA
13 0.36828309 249 nips-2008-Variational Mixture of Gaussian Process Experts
14 0.35621265 236 nips-2008-The Mondrian Process
15 0.35404694 134 nips-2008-Mixed Membership Stochastic Blockmodels
16 0.3489247 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
17 0.33921593 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
18 0.33024755 248 nips-2008-Using matrices to model symbolic relationship
19 0.32937527 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
20 0.30679065 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
simIndex simValue paperId paperTitle
same-paper 1 0.7711702 31 nips-2008-Bayesian Exponential Family PCA
2 0.71578133 211 nips-2008-Simple Local Models for Complex Dynamical Systems
Author: Erik Talvitie, Satinder P. Singh
Abstract: We present a novel mathematical formalism for the idea of a “local model” of an uncontrolled dynamical system, a model that makes only certain predictions in only certain situations. As a result of its restricted responsibilities, a local model may be far simpler than a complete model of the system. We then show how one might combine several local models to produce a more detailed model. We demonstrate our ability to learn a collection of local models on a large-scale example and do a preliminary empirical comparison of learning a collection of local models and some other model learning methods. 1
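The core idea, models that only make predictions in the situations they are responsible for and whose partial predictions are then combined, can be illustrated with a toy sketch. This is not the paper's formalism for uncontrolled dynamical systems; the class names and example situations are invented for illustration.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class LocalModel:
        # A model that only answers in the situations it is responsible for.
        applies: Callable[[Dict], bool]      # "is this my situation?"
        predict: Callable[[Dict], Dict]      # partial prediction about what happens next

    def combined_prediction(models: List[LocalModel], situation: Dict) -> Dict:
        """Merge the partial predictions of every local model that applies."""
        prediction: Dict = {}
        for m in models:
            if m.applies(situation):
                prediction.update(m.predict(situation))
        return prediction

    # Example: one model predicts only the light, another only the door.
    light = LocalModel(applies=lambda s: "switch" in s,
                       predict=lambda s: {"light_on": s["switch"]})
    door = LocalModel(applies=lambda s: "pushed" in s,
                      predict=lambda s: {"door_open": s["pushed"]})
    print(combined_prediction([light, door], {"switch": True, "pushed": False}))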
3 0.66085273 200 nips-2008-Robust Kernel Principal Component Analysis
4 0.65812665 216 nips-2008-Sparse probabilistic projections
5 0.65366799 118 nips-2008-Learning Transformational Invariants from Natural Movies
Author: Charles Cadieu, Bruno A. Olshausen
Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1
6 0.65295845 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
7 0.65277553 138 nips-2008-Modeling human function learning with Gaussian processes
8 0.64537203 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
9 0.64486331 62 nips-2008-Differentiable Sparse Coding
10 0.64472675 66 nips-2008-Dynamic visual attention: searching for coding length increments
11 0.64419967 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
12 0.64322132 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
13 0.64312667 234 nips-2008-The Infinite Factorial Hidden Markov Model
14 0.64172465 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
15 0.64079398 235 nips-2008-The Infinite Hierarchical Factor Regression Model
16 0.6404447 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
17 0.63882047 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
18 0.6377449 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
19 0.6375621 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
20 0.637254 219 nips-2008-Spectral Hashing