nips nips2008 nips2008-233 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. [sent-12, score-0.363]
2 Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. [sent-13, score-0.387]
3 Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. [sent-14, score-0.567]
4 We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. [sent-16, score-0.23]
5 1 Introduction We present the Gaussian Process Density Sampler (GPDS), a generative model for probability density functions, based on a Gaussian process. [sent-17, score-0.226]
6 We are able to draw exact and exchangeable data from a fixed density drawn from the prior. [sent-18, score-0.44]
7 Given data, this generative prior allows us to perform inference of the unnormalized density. [sent-19, score-0.205]
8 We perform this inference by expressing the generative process in terms of a latent history, then constructing a Markov chain Monte Carlo algorithm on that latent history. [sent-20, score-0.651]
9 The central idea of the GPDS is to allow nonparametric Bayesian density estimation where the prior is specified via a Gaussian process covariance function that encodes the intuition that “similar data should have similar probabilities. [sent-21, score-0.361]
10 ” One way to perform Bayesian nonparametric density estimation is to use a Dirichlet process to define a distribution over the weights of the components in an infinite mixture model, using a simple parametric form for each component. [sent-22, score-0.289]
11 Alternatively, Neal [1] generalizes the Dirichlet process itself, introducing a spatial component to achieve an exchangeable prior on discrete or continuous density functions with hierarchical characteristics. [sent-23, score-0.412]
12 Another way to define a nonparametric density is to transform a simple latent distribution through a nonlinear map, as in the Density Network [2] and the Gaussian Process Latent Variable Model [3]. [sent-24, score-0.444]
13 Here we use the Gaussian process to define a prior on the density function itself. [sent-25, score-0.318]
14 We first construct a Gaussian process prior with the data space X as its input and the one-dimensional real space R as its output. [sent-28, score-0.135]
15 Figure 1: Four samples from the GPDS prior are shown, with 200 data samples (panel hyperparameter settings such as ℓy = 2, α = 5). [sent-39, score-0.129]
16 In each case the base measure is the zero-mean spherical Gaussian with unit variance. [sent-41, score-0.113]
17 Probability density functions must be everywhere nonnegative and must integrate to unity. [sent-47, score-0.218]
18 We define a map from a function g(x) : X → R, x ∈ X , to a proper density f(x) via f(x) = Φ(g(x)) π(x) / Zπ[g] (1), where π(x) is an arbitrary base probability measure on X , and Φ(·) : R → (0, 1) is a nonnegative function with upper bound 1. [sent-48, score-0.292]
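The map of Equation 1 can be sketched numerically. The logistic choice of Φ and the standard-normal base measure below are assumptions for illustration only (the text requires just Φ : R → (0, 1)), and the sine function stands in for a Gaussian process draw:

```python
import numpy as np

def phi(g):
    """Squashing function Phi: maps R into (0, 1); logistic chosen for illustration."""
    return 1.0 / (1.0 + np.exp(-g))

def base_measure(x):
    """Base probability measure pi(x): standard normal density (an assumption)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def unnormalized_density(g_fun, x):
    """Phi(g(x)) * pi(x), the unnormalized density of Equation 1."""
    return phi(g_fun(x)) * base_measure(x)

# Any function g : R -> R induces a valid unnormalized density this way.
g = lambda x: np.sin(3.0 * x)          # hypothetical stand-in for a GP sample
xs = np.linspace(-4.0, 4.0, 2001)
fx = unnormalized_density(g, xs)

# fx is nonnegative and bounded above by pi(x), so Z_pi[g] <= 1.
Z = np.sum(fx) * (xs[1] - xs[0])       # Riemann estimate of Z_pi[g]
```

Because Φ is bounded by 1, the base measure dominates Φ(g(x))π(x) everywhere, which is exactly the envelope property the rejection sampler of Section 3 relies on.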
19 (2) Through the map defined by Equation 1, a Gaussian process prior becomes a prior distribution over normalized probability density functions on X . [sent-54, score-0.39]
20 3 Generating exact samples from the prior We can use rejection sampling to generate samples from a common density drawn from the prior described in Section 2. [sent-56, score-0.764]
21 A rejection sampler requires a proposal density that provides an upper bound for the unnormalized density of interest. [sent-57, score-0.756]
22 In this case, the proposal density is π(x) and the unnormalized density of interest is Φ(g(x))π(x). [sent-58, score-0.55]
23 If g(x) were known, rejection sampling would proceed as follows: First generate proposals {x̃q} from the base measure π(x). [sent-59, score-0.496]
24 The proposal x̃q would be accepted if a variate rq drawn uniformly from (0, 1) was less than Φ(g(x̃q)). [sent-60, score-0.412]
25 We can nevertheless use rejection sampling by “discovering” g(x) as we proceed at just the places we need to know it, by sampling from the prior distribution of the latent function. [sent-63, score-0.538]
26 As it is necessary only to know g(x) at the {xq } to accept or reject these proposals, the samples are still exact. [sent-64, score-0.145]
27 In practice, we generate the samples sequentially, as in Algorithm 1, so that we may be assured of having as many accepted samples as we require. [sent-67, score-0.216]
28 In each loop, a proposal is drawn from the base measure π(x) and the function g(x) is sampled from the Gaussian process at this proposed coordinate, conditional on all the function values already sampled. [sent-68, score-0.317]
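A minimal sketch of the loop just described is below. The squared-exponential covariance, logistic Φ, and standard-normal base measure are illustrative assumptions, and the incremental Gaussian conditioning is done naively with a full linear solve per proposal:

```python
import numpy as np

def kernel(a, b, ell=1.0, amp=3.0):
    """Squared-exponential covariance (an illustrative kernel choice)."""
    d = a[:, None] - b[None, :]
    return amp * np.exp(-0.5 * (d / ell) ** 2)

def gpds_prior_samples(n_accept, rng, jitter=1e-6):
    """Draw exact samples from one density drawn from the GPDS prior.

    g(x) is 'discovered' lazily: it is only sampled, conditioned on all
    previously sampled values, at the proposal locations actually visited.
    """
    X = np.empty(0)          # locations where g has been instantiated
    G = np.empty(0)          # sampled values g(X)
    accepted = []
    while len(accepted) < n_accept:
        x = rng.standard_normal()              # proposal from base measure pi
        xv = np.array([x])
        if X.size == 0:
            mu, var = 0.0, kernel(xv, xv)[0, 0]
        else:
            K = kernel(X, X) + jitter * np.eye(X.size)
            k = kernel(X, xv)[:, 0]
            w = np.linalg.solve(K, k)
            mu = w @ G                                    # conditional mean
            var = max(kernel(xv, xv)[0, 0] - w @ k, jitter)  # conditional variance
        g = mu + np.sqrt(var) * rng.standard_normal()
        X = np.append(X, x)
        G = np.append(G, g)
        if rng.uniform() < 1.0 / (1.0 + np.exp(-g)):      # accept w.p. Phi(g(x))
            accepted.append(x)
    return np.array(accepted)

rng = np.random.default_rng(0)
samples = gpds_prior_samples(50, rng)
```

Note that rejected proposal locations stay in (X, G): conditioning on them is what makes later function values consistent with a single fixed draw of g.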
29 (a): Draw Q samples {x̃q}, q = 1, . . . , Q, from the base measure π(x), which in this case is uniform on [0, 1]. [sent-70, score-0.131]
30 (d): Accept only the points whose uniform draws are beneath the squashed function value, i. [sent-74, score-0.155]
31 (e): The accepted points (x̃q, rq) are uniformly drawn from the shaded area beneath the curve Φ(g(x)), and the marginal distribution of the accepted x̃q is proportional to Φ(g(x))π(x). [sent-77, score-0.367]
32 After the function is sampled, a uniform variate is drawn from beneath the bound and compared to the Φ-squashed function at the proposal location. [sent-79, score-0.31]
33 Second, conditioned on the proposals from the base measure, the Gaussian process is a simple multivariate Gaussian distribution, which is exchangeable in its components. [sent-85, score-0.418]
34 Finally, conditioned on the draw from the Gaussian process, the acceptance/rejection steps are independent Bernoulli samples, and the overall procedure is exchangeable. [sent-86, score-0.116]
35 More broadly, exchangeable priors are useful in Bayesian modeling because we may consider the data conditionally independent, given the latent density. [sent-88, score-0.34]
36 We use the GPDS prior from Section 2 to specify our beliefs about f (x), and we wish to generate samples from the posterior distribution over the latent function g(x) corresponding to the unknown density. [sent-90, score-0.419]
37 We may also wish to generate samples from the predictive distribution or perform hierarchical inference of the prior hyperparameters. [sent-91, score-0.246]
38 By using the GPDS prior to model the data, we are asserting that the data can be explained as the result of the procedure described in Section 3. [sent-92, score-0.113]
39 We do not, however, know what rejections were made en route to accepting the observed data. [sent-93, score-0.463]
40 These rejections are critical to defining the latent function g(x). [sent-94, score-0.681]
41 One might think of defining a density as analogous to putting up a tent: pinning the canvas down with pegs is just as important as putting up poles. [sent-95, score-0.183]
42 In density modeling, defining regions with little probability mass is just as important as defining the areas with significant mass. [sent-96, score-0.183]
43 Although the rejections are not known, the generative procedure provides a probabilistic model that allows us to traverse the posterior distribution over possible latent histories that resulted in the data. [sent-97, score-0.864]
44 If we define a Markov chain whose equilibrium distribution is the posterior distribution over latent histories, then we may simulate plausible explanations of every step taken to arrive at the data. [sent-98, score-0.322]
45 Such samples capture all the information available about the unknown density, and with them we may ask additional questions about g(x) or run the generative procedure further to draw predictive samples. [sent-99, score-0.256]
46 This approach is related to that described by Murray [7], who performed inference on an exactly-coalesced Markov chain [8], and by Beskos et al. [sent-100, score-0.109]
47 We perform Gibbs-like sampling of the latent history by alternating between modification of the number of rejections M and block updating of the rejection locations M and latent function values GM and GN . [sent-108, score-1.243]
48 We will maintain an explicit ordering of the latent rejections for reasons of clarity, although this is not necessary due to exchangeability. [sent-109, score-0.681]
49 1 Modifying the number of latent rejections We propose a new number of latent rejections M̂ by drawing it from a proposal distribution q(M̂ ← M). [sent-114, score-1.5]
50 If M̂ is greater than M , we must also propose new rejections to add to the latent state. [sent-115, score-0.502]
51 We take advantage of the exchangeability of the process to generate the new rejections: we imagine these proposals were made after the last observed datum was accepted, and our proposal is to call them rejections and move them before the last datum. [sent-116, score-0.781]
52 If M̂ is less than M , we do the opposite by proposing to move some rejections to after the last acceptance. [sent-117, score-0.504]
53 When proposing additional rejections, we must also propose times for them among the current latent history. [sent-118, score-0.259]
54 There are (M̂+N−1 choose M̂−M) such ways to insert these additional rejections into the existing latent history, such that the sampler terminates after the N th acceptance. [sent-119, score-0.731]
55 Upon simplification, the proposal ratios for both addition and removal of rejections are identical, combining the proposal ratio q(M ← M̂)/q(M̂ ← M) with the counting factor (M̂+N−1 choose M̂−M) and the factorial terms M̂ !/M !. [sent-121, score-0.601]
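The counting factor can be checked directly: inserting M̂−M new rejections anywhere among the events preceding the final (N th) acceptance gives (M̂+N−1 choose M̂−M) arrangements. A small brute-force check, where the stars-and-bars encoding of insertion gaps is an illustrative device:

```python
from itertools import combinations_with_replacement
from math import comb

def insertion_count(M, N, M_hat):
    """Ways to insert M_hat - M new rejections into a history with M
    rejections and N acceptances, keeping the N-th acceptance last."""
    return comb(M_hat + N - 1, M_hat - M)

def brute_force(M, N, M_hat):
    # Each of the k identical new rejections goes into one of the M+N gaps
    # that precede an existing event (nothing may follow the final acceptance);
    # multisets of gap choices enumerate the distinct resulting histories.
    k = M_hat - M
    gaps = M + N
    return sum(1 for _ in combinations_with_replacement(range(gaps), k))

for (M, N, M_hat) in [(2, 3, 4), (0, 5, 3), (1, 2, 4)]:
    assert insertion_count(M, N, M_hat) == brute_force(M, N, M_hat)
```

For instance, adding two rejections to a history with M = 2 and N = 3 gives comb(6, 2) = 15 distinct interleavings.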
56 2 Modifying rejection locations and function values Given the number of latent rejections M , we propose modifying their locations M, their latent function values GM , and the values of the latent function at the data GN . [sent-129, score-1.441]
57 We will denote these proposals as M̂ = {x̂m}, m = 1, . . . , M̂ ; ĜM = {ĝm = ĝ(x̂m)}, m = 1, . . . , M̂ ; and ĜN = {ĝn = ĝ(xn)}, n = 1, . . . , N , respectively. [sent-130, score-0.187]
58 We make simple perturbative proposals of M̂ via a proposal density q(M̂ ← M). [sent-131, score-0.547]
59 For the latent function values, however, perturbative proposals will be poor, as the Gaussian process typically defines a narrow mass. [sent-132, score-0.507]
60 To avoid this, we propose modifications to the latent function that leave the prior invariant. [sent-133, score-0.29]
61 We make joint proposals of M̂, ĜM and ĜN in three steps. [sent-134, score-0.187]
62 First, we draw the new rejection locations M̂ from q(M̂ ← M). Second, we draw a set of M̂ intermediate function values from the Gaussian process at M̂, conditioned on the current rejection locations and their function values, as well as the function values at the data. [sent-136, score-0.362]
63 Third, we propose new function values at M̂ and the data D via an underrelaxation proposal of the form ĝ(x) = α g(x) + √(1 − α²) h(x), where h(x) is a sample from the Gaussian process prior and α is in [0, 1). [sent-137, score-0.273]
64 This procedure leaves the Gaussian process prior invariant, but makes conservative proposals if α is near one. [sent-139, score-0.363]
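The invariance holds because, for independent zero-mean draws g and h with the same covariance Σ, Cov(ĝ) = α²Σ + (1 − α²)Σ = Σ. A quick numerical check, with a two-point covariance chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])        # covariance of g at two illustrative inputs
L = np.linalg.cholesky(Sigma)
alpha = 0.9

n = 200_000
g = L @ rng.standard_normal((2, n))              # g ~ N(0, Sigma)
h = L @ rng.standard_normal((2, n))              # independent prior draw
g_hat = alpha * g + np.sqrt(1.0 - alpha**2) * h  # underrelaxed proposal

emp_cov = g_hat @ g_hat.T / n                    # empirical Cov(g_hat)
```

With α near one, ĝ stays close to g while its marginal law is exactly the prior, which is what makes the proposal conservative yet valid.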
65 Hyperparameter inference Given a sample from the posterior on the latent history, we can also perform a Metropolis–Hasting step in the space of hyperparameters. [sent-142, score-0.301]
66 Parameters θ, governing the covariance function and mean function of the Gaussian process provide common examples of hyperparameters, but we might also introduce parameters φ that control the behavior of the base measure π(x). [sent-143, score-0.137]
67 We denote the proposal distributions for these parameters as q(θ̂ ← θ) and q(φ̂ ← φ), respectively. [sent-144, score-0.138]
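One such Metropolis–Hastings step can be sketched on a single covariance hyperparameter. Here ℓ is the lengthscale of a squared-exponential kernel, with a flat prior and a log-random-walk proposal; all of these choices, and the fixed latent values, are illustrative assumptions, not the paper's specific θ or φ:

```python
import numpy as np

def log_gp_prior(G, X, ell, jitter=1e-8):
    """Log density of latent values G under a zero-mean GP with a
    squared-exponential kernel of lengthscale ell."""
    d = X[:, None] - X[None, :]
    K = np.exp(-0.5 * (d / ell) ** 2) + jitter * np.eye(X.size)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (G @ np.linalg.solve(K, G) + logdet + X.size * np.log(2 * np.pi))

rng = np.random.default_rng(2)
X = np.linspace(-2, 2, 8)                 # rejection + data locations (illustrative)
G = np.sin(X)                             # current latent function values
ell = 1.0
for _ in range(200):
    ell_prop = ell * np.exp(0.2 * rng.standard_normal())   # log-random-walk proposal
    log_ratio = (log_gp_prior(G, X, ell_prop) - log_gp_prior(G, X, ell)
                 + np.log(ell_prop) - np.log(ell))         # Hastings term for log-walk
    if np.log(rng.uniform()) < log_ratio:
        ell = ell_prop
```

Each step requires the GP marginal likelihood at both the current and proposed hyperparameters, which is the cubic-cost operation discussed in the complexity section below.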
68 (The base-measure term ∏ over n = 1, . . . , N of π(xn | φ) enters this acceptance ratio.) Prediction The predictive distribution is the one that arises on the space X when the posterior on the latent function g(x) (and perhaps hyperparameters) is integrated out. [sent-147, score-0.297]
69 In the GPDS we sample from the predictive distribution by running the generative process of Section 3, initialized to the current latent history sample from the Metropolis–Hastings procedure described above. [sent-149, score-0.479]
70 (a): The current state, with rejections labeled M = {xm } on the left, along with the values of the latent function GM = {gm }. [sent-153, score-0.681]
71 On the right side are the data D = {xn} and the corresponding values of the latent function GN = {gn}. [sent-154, score-0.218]
72 (b): New rejections M̂ = {x̂m} are proposed via q(M̂ ← M), and the latent function is sampled at these points. [sent-155, score-0.681]
73 (c): The latent function is perturbed at the new rejection locations and at the data via an underrelaxed proposal. [sent-156, score-0.442]
74 We find the expectation of each side under the posterior of g and the hyperparameters θ and φ: ∫ dθ dφ p(θ, φ | D) ∫ dg p(g | θ, D) ∫ dx′ p(x | g, θ, φ) π(x′) min{1, Φ(g(x′))/Φ(g(x))}. [sent-157, score-0.158]
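One reading of the quantity being balanced here is a data-space Metropolis kernel that proposes from the base measure π and accepts with probability min{1, Φ(g(x′))/Φ(g(x))}. A sketch of that kernel for a fixed g, where the linear g and logistic Φ are assumptions of this illustration:

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

g = lambda x: 2.0 * x                    # fixed illustrative latent function
rng = np.random.default_rng(3)

x = 0.0
chain = []
for _ in range(20000):
    x_prop = rng.standard_normal()                       # proposal from pi = N(0, 1)
    if rng.uniform() < min(1.0, sigmoid(g(x_prop)) / sigmoid(g(x))):
        x = x_prop                                       # independence Metropolis step
    chain.append(x)

chain = np.asarray(chain)
# Stationary law is proportional to sigmoid(2x) * N(x; 0, 1), skewed to x > 0.
```

Because the proposal does not depend on the current state, this is an independence sampler, and its acceptance ratio needs only the unnormalized density Φ(g(x))π(x).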
75 5 Results We examined the GPDS prior and the latent history inference procedure on a toy data set and on a skull reconstruction task. [sent-160, score-0.592]
76 We compared the approach described in this paper to a kernel density estimate (Parzen windows), an infinite mixture of Gaussians (iMoG), and Dirichlet diffusion trees (DFT). [sent-161, score-0.257]
77 The kernel density estimator used a spherical Gaussian with the bandwidth set via ten-fold cross validation. [sent-162, score-0.222]
78 The toy data problem consisted of 100 uniform draws from a two-dimensional ring with radius 1. [sent-164, score-0.118]
79 We modeled the joint density of ten measurements of linear distances between anatomical landmarks on 228 rhesus macaque (Macaca mulatta) skulls. [sent-171, score-0.454]
80 These linear distances were generated from three-dimensional coordinate data of anatomical landmarks taken by a single observer from dried skulls using a digitizer [11]. [sent-172, score-0.19]
81 Figure 5 shows a computed tomography (CT) scan reconstruction of a macaque skull, along with the ten linear distances used. [sent-174, score-0.158]
82 Figure 4: The macaque skull data are linear distances calculated between three-dimensional coordinates of anatomical landmarks. [sent-188, score-0.243]
83 These are superior and inferior views of a computed tomography (CT) scan of a male macaque skull, with the ten linear distances superimposed. [sent-189, score-0.158]
84 Figure 5: This bar plot (bars: GPDS, iMoG, DFT; groups: Ring, Mac T1, Mac T2, Mac T3) shows the improvement of the GPDS, infinite mixture of Gaussians (iMoG), and Dirichlet diffusion trees (DFT) in mean log probability (base e) of the test set over crossvalidated Parzen windows on the toy ring data and the macaque data. [sent-191, score-0.277]
85 This work introduces the first such prior that enables tractable density estimation, complementing alternatives such as Dirichlet Diffusion Trees [1] and infinite mixture models. [sent-198, score-0.255]
86 1 Computational complexity The inference method for the GPDS prior is “practical” in the sense that it can be implemented without approximations, but it has potentially-steep computational costs. [sent-205, score-0.116]
87 To compare two latent histories in a Metropolis–Hastings step we must evaluate the marginal likelihood of the Gaussian process. [sent-206, score-0.278]
88 The expected cost of an M–H step is determined by the expected number of rejections M . [sent-209, score-0.463]
89 This expression is derived from the observation that π(x) provides an upper bound on the function Φ(g(x))π(x) and the ratio of acceptances to rejections is determined by the proportion of the mass of π(x) contained by Φ(g(x))π(x). [sent-211, score-0.463]
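Since π(x) is the envelope, each proposal is accepted with probability p = ∫ Φ(g(x))π(x) dx, so the number of proposals per accepted sample is geometric with mean 1/p. A simulation check with a constant latent function (an illustrative assumption that makes p known in closed form):

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

g = lambda x: -1.0 + 0.0 * x             # constant latent function (illustrative)
rng = np.random.default_rng(4)

# Acceptance probability p = E_pi[ sigmoid(g(X)) ] = sigmoid(-1) here.
p = sigmoid(-1.0)

proposals, accepts = 0, 0
while accepts < 5000:
    x = rng.standard_normal()            # proposal from the base measure
    proposals += 1
    if rng.uniform() < sigmoid(g(x)):
        accepts += 1

ratio = proposals / accepts              # empirical proposals per acceptance
```

The further Φ(g(x))π(x) falls below π(x), the more latent rejections each acceptance drags into the history, which is what drives the cost of the M–H step.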
90 Sparse approaches to Gaussian process regression that improve the asymptotically cubic behavior may also be relevant to the GPDS, but it is unclear that these will be an improvement over other approximate GP-based schemes for density modeling. [sent-213, score-0.246]
91 2 Alternative inference methods In developing inference methods for the GPDS prior, we have also explored the use of exchange sampling [17, 7]. [sent-215, score-0.196]
92 Exchange sampling is an MCMC technique explicitly developed for the situation where there is an intractable normalization constant that prevents exact likelihood evaluation, but exact samples may be generated for any particular parameter setting. [sent-216, score-0.195]
93 Undirected graphical models such as the Ising and Potts models provide common examples of cases where exchange sampling is applicable via coupling from the past [8]. [sent-217, score-0.108]
94 Using the exact sampling procedure of Section 3, it is applicable to the GPDS as well. [sent-218, score-0.133]
95 Exchange sampling for the GPDS, however, requires more evaluations of the function g(x) than the latent history approach. [sent-219, score-0.338]
96 In practice the latent history approach of Section 4 does perform better. [sent-220, score-0.292]
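Exchange sampling itself can be illustrated on a deliberately simple stand-in model, not the GPDS: a zero-mean Gaussian with precision θ whose normalizer we pretend is intractable, so the exact sampler the method requires is just a Gaussian draw. The Exp(1) prior and log-random-walk proposal are further assumptions of this sketch:

```python
import numpy as np

def log_f(y, theta):
    """Unnormalized log-likelihood exp(-theta * y^2 / 2); normalizer 'unknown'."""
    return -0.5 * theta * np.sum(y ** 2)

rng = np.random.default_rng(5)
y = rng.normal(0.0, 0.5, size=50)        # data drawn with true precision 4

theta = 1.0
trace = []
for _ in range(5000):
    theta_prop = theta * np.exp(0.3 * rng.standard_normal())
    # Auxiliary exact sample at the proposed parameter:
    w = rng.normal(0.0, 1.0 / np.sqrt(theta_prop), size=y.size)
    # Exchange ratio: the auxiliary draw cancels the unknown normalizers.
    log_ratio = (log_f(y, theta_prop) - log_f(y, theta)
                 + log_f(w, theta) - log_f(w, theta_prop)
                 - theta_prop + theta                    # Exp(1) prior on theta
                 + np.log(theta_prop) - np.log(theta))   # Hastings term, log-walk
    if np.log(rng.uniform()) < log_ratio:
        theta = theta_prop
    trace.append(theta)

post_mean = np.mean(trace[1000:])
```

In the GPDS the auxiliary exact sample would come from the Section 3 procedure, and each such draw instantiates g at fresh locations; that is the extra function evaluation cost noted above.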
97 Probabilistic non-linear principal component analysis with Gaussian process latent variable models. [sent-238, score-0.281]
98 Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. [sent-260, score-0.128]
99 The relationship between fluctuating asymmetry and environmental variance in rhesus macaque skulls. [sent-288, score-0.125]
100 Posterior consistency of logistic Gaussian process priors in density estimation. [sent-315, score-0.316]
wordName wordTfidf (topN-words)
[('gpds', 0.463), ('rejections', 0.463), ('latent', 0.218), ('gm', 0.192), ('proposals', 0.187), ('gn', 0.186), ('density', 0.183), ('rejection', 0.156), ('proposal', 0.138), ('metropolis', 0.114), ('parzen', 0.106), ('skull', 0.096), ('exchangeable', 0.094), ('macaque', 0.09), ('hastings', 0.088), ('dft', 0.088), ('imog', 0.088), ('xm', 0.078), ('beneath', 0.077), ('draw', 0.075), ('diffusion', 0.074), ('murray', 0.074), ('base', 0.074), ('history', 0.074), ('prior', 0.072), ('accepted', 0.069), ('rq', 0.069), ('locations', 0.068), ('gaussian', 0.068), ('hasting', 0.066), ('chain', 0.065), ('process', 0.063), ('exchange', 0.062), ('dg', 0.06), ('histories', 0.06), ('hyperparameters', 0.059), ('dirichlet', 0.058), ('samples', 0.057), ('anatomical', 0.057), ('neal', 0.057), ('mac', 0.053), ('variate', 0.053), ('sampler', 0.05), ('landmarks', 0.05), ('carlo', 0.049), ('xn', 0.048), ('markov', 0.047), ('accept', 0.047), ('monte', 0.047), ('toy', 0.047), ('unnormalized', 0.046), ('sampling', 0.046), ('exact', 0.046), ('dx', 0.045), ('conditioning', 0.045), ('inference', 0.044), ('beskos', 0.044), ('cavendish', 0.044), ('papaspiliopoulos', 0.044), ('skulls', 0.044), ('squashed', 0.044), ('mcmc', 0.044), ('nonparametric', 0.043), ('generative', 0.043), ('logistic', 0.042), ('drawn', 0.042), ('proposing', 0.041), ('xq', 0.041), ('reject', 0.041), ('procedure', 0.041), ('predictive', 0.04), ('bayesian', 0.039), ('posterior', 0.039), ('distances', 0.039), ('chib', 0.039), ('nats', 0.039), ('tent', 0.039), ('adams', 0.039), ('retrospective', 0.039), ('perturbative', 0.039), ('spherical', 0.039), ('ring', 0.037), ('rhesus', 0.035), ('datum', 0.035), ('nonnegative', 0.035), ('biometrika', 0.034), ('draws', 0.034), ('generate', 0.033), ('ryan', 0.033), ('proxy', 0.033), ('modifying', 0.032), ('gp', 0.032), ('mackay', 0.031), ('toronto', 0.03), ('primate', 0.03), ('iain', 0.03), ('ning', 0.029), ('tomography', 0.029), ('windows', 0.029), ('priors', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. 1
2 0.14651932 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
3 0.098187335 136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns
Author: Shuang Wu, Hongjing Lu, Alan L. Yuille
Abstract: Psychophysical experiments show that humans are better at perceiving rotation and expansion than translation. These findings are inconsistent with standard models of motion integration which predict best performance for translation [6]. To explain this discrepancy, our theory formulates motion perception at two levels of inference: we first perform model selection between the competing models (e.g. translation, rotation, and expansion) and then estimate the velocity using the selected model. We define novel prior models for smooth rotation and expansion using techniques similar to those in the slow-and-smooth model [17] (e.g. Green functions of differential operators). The theory gives good agreement with the trends observed in human experiments. 1
4 0.09650255 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence
Abstract: Identification and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many fields, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1
5 0.08803577 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
Author: Neil D. Lawrence, Magnus Rattray, Michalis K. Titsias
Abstract: Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation. 1
6 0.078983463 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
7 0.078300513 228 nips-2008-Support Vector Machines with a Reject Option
8 0.077544734 138 nips-2008-Modeling human function learning with Gaussian processes
9 0.076128125 234 nips-2008-The Infinite Factorial Hidden Markov Model
10 0.072748959 152 nips-2008-Non-stationary dynamic Bayesian networks
11 0.071442343 216 nips-2008-Sparse probabilistic projections
12 0.070351347 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
13 0.069981053 105 nips-2008-Improving on Expectation Propagation
14 0.066855296 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
15 0.065079965 235 nips-2008-The Infinite Hierarchical Factor Regression Model
16 0.059729721 249 nips-2008-Variational Mixture of Gaussian Process Experts
17 0.054961473 31 nips-2008-Bayesian Exponential Family PCA
18 0.054786284 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
19 0.050721131 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
20 0.049945749 32 nips-2008-Bayesian Kernel Shaping for Learning Control
topicId topicWeight
[(0, -0.163), (1, -0.016), (2, 0.073), (3, 0.006), (4, 0.056), (5, -0.056), (6, 0.0), (7, 0.214), (8, -0.006), (9, 0.044), (10, -0.002), (11, 0.041), (12, 0.153), (13, -0.03), (14, -0.006), (15, -0.024), (16, -0.035), (17, 0.035), (18, 0.045), (19, 0.053), (20, -0.092), (21, -0.003), (22, 0.03), (23, -0.022), (24, -0.026), (25, 0.041), (26, -0.006), (27, -0.05), (28, 0.034), (29, 0.074), (30, -0.041), (31, -0.032), (32, -0.025), (33, -0.065), (34, -0.004), (35, 0.073), (36, -0.055), (37, -0.01), (38, 0.099), (39, -0.052), (40, -0.023), (41, -0.062), (42, 0.066), (43, -0.078), (44, 0.087), (45, 0.12), (46, 0.017), (47, 0.171), (48, -0.1), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.94051188 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. 1
2 0.68046653 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. 
For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. 
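The finite-d generative process of Model 1 can be sketched directly. Identity covariances stand in for Σ and Ω (the Dirac-delta special case mentioned in the text), and the sizes and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, d = 30, 40, 5          # |U0|, |V0|, and the latent dimension (illustrative)
Sigma = np.eye(m)            # Dirac-delta kernels: sub-matrices are identities
Omega = np.eye(n)

# Step 2: draw d latent functions per side; columns of F ~ N(0, Sigma), G ~ N(0, Omega)
F = np.linalg.cholesky(Sigma) @ rng.standard_normal((m, d))
G = np.linalg.cholesky(Omega) @ rng.standard_normal((n, d))

# Step 3: z(u, v) = (1/sqrt(d)) * sum_k f_k(u) g_k(v), with b(u, v) = 0
Z = (F @ G.T) / np.sqrt(d)
Y = Z + 0.1 * rng.standard_normal((m, n))   # Gaussian noise model y ~ N(z, gamma)
```

With identity kernels this reduces to probabilistic matrix factorization, as the text notes; swapping in non-trivial Σ and Ω couples entries of Z across rows and columns.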
The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . 
Previously, a variational Bayesian method was applied to SRMs [15]: it computes maximum a posteriori estimates of Σ and Ω given Y_I, and then predicts Z_O based on the estimated Σ and Ω. This empirical Bayesian approach has two limitations: (1) the variational method is not a fully Bayesian treatment; ideally we wish to integrate out Σ and Ω; (2) more critically, the algorithm has complexity O(m³ + n³), with m = |U_0| and n = |V_0|, and is therefore not scalable to large relational domains where m or n exceeds several thousand. In this paper we introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show that the algorithms can be applied to a dataset 10³ times larger than in the previous work [15], with excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Related work is discussed in Section 3, followed by experimental results in Section 4. Section 5 concludes.

2 Bayesian Models and MCMC Inference

In this paper, we tackle the scalability issue within a fully Bayesian paradigm. We estimate the expectation of Z_O directly from Y_I using a Markov chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating it from estimated Σ or Ω. Our contribution lies in making the MCMC inference efficient for large-scale data. We first introduce some necessary notation. Bold capital letters, e.g. X, denote matrices. I_(m) is the identity matrix of size m × m. N_d, N_{m,d}, IW_m, and χ^{−2} denote the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively.

2.1 Models with Non-informative Priors

Let r = |I|, m = |U_0| and n = |V_0|. It is assumed that d ≪ min(m, n) and that the observed set I is sparse, i.e. r ≪ mn. First, we consider the case Σ◦ = αI_(m) and Ω◦ = βI_(n).
Let the values of {f_k} on U_0 be collected in a matrix variate F of size m × d, and the values of {g_k} on V_0 in a matrix variate G of size n × d. The generative model can then be written as Model 2 (depicted in Figure 1).

Model 2. The generative model of a matrix-variate SRM:
1. Draw Σ ∼ IW_m(δ, αI_(m)) and Ω ∼ IW_n(δ, βI_(n));
2. Draw F|Σ ∼ N_{m,d}(0, Σ ⊗ I_(d)) and G|Ω ∼ N_{n,d}(0, Ω ⊗ I_(d));
3. Draw s² ∼ χ^{−2}(ν, σ²);
4. Draw Y|F, G, s² ∼ N_{m,n}(Z, s²I_(m) ⊗ I_(n)), where Z = FG'.

Figure 1: Model 2.

Here N_{m,d} is the matrix-variate normal distribution of size m × d, and α, β, δ, ν and σ² are scalar parameters of the model. A slight difference between this finite model and Model 1 is that the coefficient 1/√d is dropped for simplicity, since it can be absorbed into α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|Y_I, G, Σ, s²), Pr(G|Y_I, F, Ω, s²), and Pr(s²|Y_I, F, G), we can apply a Gibbs sampling algorithm to compute Z_O. However, the time complexity is at least O(m³ + n³), which is impractical for large-scale data.

2.2 Gibbs Sampling Method

To overcome the inefficiency of sampling large covariance matrices, we rewrite the sampling process using the property in Theorem 2 to take advantage of d ≪ min(m, n).

Theorem 2. If
1. Σ ∼ IW_m(δ, αI_(m)) and F|Σ ∼ N_{m,d}(0, Σ ⊗ I_(d)),
2. K ∼ IW_d(δ, αI_(d)) and H|K ∼ N_{m,d}(0, I_(m) ⊗ K),
then the matrix variates F and H have the same distribution.

Figure 2: Theorem 2.

Proof sketch. The matrix variate F follows a matrix-variate t distribution, t(δ, 0, αI_(m), I_(d)), which can be written as

p(F) ∝ |I_(m) + (αI_(m))^{−1} F (I_(d))^{−1} F'|^{−(δ+m+d−1)/2} = |I_(m) + α^{−1} FF'|^{−(δ+m+d−1)/2}.

The matrix variate H follows a matrix-variate t distribution, t(δ, 0, I_(m), αI_(d)), which can be written as

p(H) ∝ |I_(m) + (I_(m))^{−1} H (αI_(d))^{−1} H'|^{−(δ+m+d−1)/2} = |I_(m) + α^{−1} HH'|^{−(δ+m+d−1)/2}.

Thus the matrix variates F and H have the same distribution.
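Theorem 2's row-side/column-side equivalence can be probed with a simple Monte Carlo moment check. This sketch is an illustration under stated assumptions: it translates Dawid's IW notation to the classical inverse-Wishart with degrees of freedom δ + p − 1, and compares only a second moment, E||F||²_F = E||H||²_F = m·d·α/(δ − 2), rather than the full matrix-t laws.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, delta, alpha = 30, 3, 10, 1.0
S = 5000

def inv_wishart_diag(p, delta, alpha, size, rng):
    # Dawid IW_p(delta, alpha*I_p) via inverting Wishart(delta + p - 1, I_p/alpha)
    df = delta + p - 1
    X = rng.standard_normal((size, df, p)) / np.sqrt(alpha)
    return np.linalg.inv(X.transpose(0, 2, 1) @ X)

# Construction 1 (row side): Sigma ~ IW_m, F | Sigma ~ N_{m,d}(0, Sigma ⊗ I_d)
Sig = inv_wishart_diag(m, delta, alpha, S, rng)
F = np.linalg.cholesky(Sig) @ rng.standard_normal((S, m, d))

# Construction 2 (column side): K ~ IW_d, H | K ~ N_{m,d}(0, I_m ⊗ K)
K = inv_wishart_diag(d, delta, alpha, S, rng)
H = rng.standard_normal((S, m, d)) @ np.linalg.cholesky(K).transpose(0, 2, 1)

# Both should match the analytic value m*d*alpha/(delta - 2)
mf = (F ** 2).sum(axis=(1, 2)).mean()
mh = (H ** 2).sum(axis=(1, 2)).mean()
```

Note the computational asymmetry the theorem exploits: construction 1 factorizes S covariance matrices of size m × m, construction 2 only d × d.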
This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. The theorem applies to G as well, so we rewrite the model as Model 3 (Figure 3). A similar idea was used in our previous work [16].

Model 3. The alternative generative model of a matrix-variate SRM:
1. Draw K ∼ IW_d(δ, αI_(d)) and R ∼ IW_d(δ, βI_(d));
2. Draw F|K ∼ N_{m,d}(0, I_(m) ⊗ K) and G|R ∼ N_{n,d}(0, I_(n) ⊗ R);
3. Draw s² ∼ χ^{−2}(ν, σ²);
4. Draw Y|F, G, s² ∼ N_{m,n}(Z, s²I_(m) ⊗ I_(n)), where Z = FG'.

Figure 3: Model 3.

Let the column vector f_i be the i-th row of F, and the column vector g_j the j-th row of G. In Model 3, the {f_i} are independent given K, G and s²; similar independence applies to the {g_j}. The conditional posterior distributions of K, R, {f_i}, {g_j} and s² can be easily computed, so the resulting Gibbs sampler for SRM is named BSRM (Bayesian SRM). We use Gibbs sampling to compute the mean of Z_O, derived from the samples of FG'. Because of the sparsity of I, each iteration of this sampling algorithm can be computed in O(d²r + d³(m + n)) time (see footnote 2), a dramatic reduction from the previous time complexity O(m³ + n³).

2.3 Models with Informative Priors

An important characteristic of SRMs is that they allow certain prior knowledge about the entities to be included in the model. Specifically, the prior information is encoded in the prior covariance parameters Σ◦ and Ω◦. In the general case, it is difficult to run the sampling process due to the size of Σ◦ and Ω◦. We assume that Σ◦ and Ω◦ have a special form: Σ◦ = F◦(F◦)' + αI_(m), where F◦ is an m × p matrix, and Ω◦ = G◦(G◦)' + βI_(n), where G◦ is an n × q matrix, with p and q of about the same magnitude as, or less than, d.
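The low-rank-plus-diagonal prior kernels above can be built directly from feature matrices. A minimal sketch, assuming toy sizes and random stand-in features:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, p, q = 8, 6, 3, 2
alpha, beta = 1.0, 1.0

F0 = rng.standard_normal((m, p))   # entity features on the U side (F°)
G0 = rng.standard_normal((n, q))   # entity features on the V side (G°)

# Informative prior kernels: a rank-p (resp. rank-q) feature part
# plus a diagonal that keeps the kernel positive definite.
Sigma0 = F0 @ F0.T + alpha * np.eye(m)
Omega0 = G0 @ G0.T + beta * np.eye(n)
```

The αI and βI terms guarantee that every eigenvalue of Σ◦ and Ω◦ is at least α (resp. β), so the kernels remain strictly positive definite even when the feature ranks p, q are far below m, n.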
This prior knowledge can be obtained from additional features of the entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand the matrix F of size m × d to (F, F◦) of size m × (d + p) and derive an equivalent model in which the rows of F are conditionally independent given F◦. Figure 4 illustrates this transformation.

Theorem 3. Let δ > p and Σ◦ = F◦(F◦)' + αI_(m), where F◦ is an m × p matrix. If
1. Σ ∼ IW_m(δ, Σ◦) and F|Σ ∼ N_{m,d}(0, Σ ⊗ I_(d)),
2. K = [K11, K12; K21, K22] ∼ IW_{d+p}(δ − p, αI_(d+p)) and H|K ∼ N_{m,d}(F◦ K22^{−1} K21, I_(m) ⊗ K11·2),
where K11·2 = K11 − K12 K22^{−1} K21, then F and H have the same distribution.

Figure 4: Theorem 3.

Proof sketch. Consider the distribution

(H1, H2)|K ∼ N_{m,d+p}(0, I_(m) ⊗ K).   (1)

Because H1|H2 ∼ N_{m,d}(H2 K22^{−1} K21, I_(m) ⊗ K11·2), we have p(H) = p(H1|H2 = F◦). On the other hand, we have a matrix-variate t distribution, (H1, H2) ∼ t_{m,d+p}(δ − p, 0, αI_(m), I_(d+p)). By Theorem 4.3.9 in [4], H1|H2 ∼ t_{m,d}(δ, 0, αI_(m) + H2 H2', I_(d)) = t_{m,d}(δ, 0, Σ◦, I_(d)), which implies p(F) = p(H1|H2 = F◦) = p(H).

Footnote 2: |Y − FG'|²_I can be efficiently computed in O(dr) time.

The following corollary allows us to compute the posterior distribution of K efficiently.

Corollary 4. K|H ∼ IW_{d+p}(δ + m, αI_(d+p) + (H, F◦)'(H, F◦)).

Proof sketch. Because the normal and inverse Wishart distributions are conjugate, the posterior distribution of K follows from Eq. (1).

Thus, we can sample explicitly from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F, for BSRM with features) in the Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration of this sampling algorithm can be computed in O(d²r + d³(m + n) + dpm + dqn) time.
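One draw from the posterior of Corollary 4 can be sketched as follows. The random stand-ins for H and F°, and the translation of Dawid's IW_{d+p}(δ + m, ·) into a classical inverse-Wishart with degrees of freedom (δ + m) + (d + p) − 1, are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, p = 50, 4, 2          # entities, latent rank, feature rank (toy sizes)
delta, alpha = 10, 1.0

H = rng.standard_normal((m, d))    # stand-in for the current factor sample
F0 = rng.standard_normal((m, p))   # stand-in feature matrix F°

# Corollary 4: K | H ~ IW_{d+p}(delta + m, alpha*I + (H, F°)'(H, F°))
HF = np.hstack([H, F0])
post_scale = alpha * np.eye(d + p) + HF.T @ HF

# One posterior draw via A = W^{-1}, W ~ Wishart(df, post_scale^{-1})
df = (delta + m) + (d + p) - 1
Lc = np.linalg.cholesky(np.linalg.inv(post_scale))
X = rng.standard_normal((df, d + p)) @ Lc.T
K = np.linalg.inv(X.T @ X)
K = (K + K.T) / 2                  # symmetrize against round-off
```

The cost of this step is governed by d + p rather than m, which is the point of the equivalent-model transformation.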
2.4 Unblocking for Sampling Implementation

The blocking Gibbs sampling technique is commonly used to improve sampling efficiency by reducing the sample variance, according to the Rao-Blackwell theorem (cf. [9]). However, blocking Gibbs sampling is not necessarily computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocked sampling to reduce the major computational cost, which lies in Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which we call BSRM/F with conditional Gibbs sampling. This reduces the computational cost of each iteration to O(dr + d²(m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. Although such a conditional sampling process increases the sample variance compared to Algorithm 1, its faster speed lets us afford more samples within a given amount of time. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 at the same accuracy. Additionally, since the {f_i} are independent, the for loops in Step 4 and Step 9 of Algorithm 2 can be parallelized.

3 Related Work

SRMs fall into the class of statistical latent-variable relational models, which explain relations by latent factors of entities. A number of such models have recently been proposed; they can be roughly divided into two groups, depending on whether the latent factors are continuous or discrete. (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of the participating entities, e.g. [6, 14, 7, 1]; another work [10] models binary latent factors. (2) Continuous latent-variable relational models: many such models assume relational data governed by multiplicative effects between latent variables of entities, e.g. [5].
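The scalar update at the heart of the unblocked sampler (Steps 4-5 of Algorithm 2 in the Appendix) can be sketched as below. The dense toy observation mask, the fixed stand-in for the Wishart draw Φ, and the helper name `update_Fik` are assumptions of this illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, d = 5, 4, 2
s2 = 0.5

F = rng.standard_normal((m, d))
G = rng.standard_normal((n, d))
Y = rng.standard_normal((m, n))
obs = np.ones((m, n), dtype=bool)   # fully observed toy mask; real I is sparse
Phi = np.eye(d)                     # stand-in for the Wishart draw of Step 2

Delta = Y - F @ G.T                 # residuals, as in Step 1

def update_Fik(i, k):
    # One conditional Gibbs step for F_{i,k} (Steps 4-5 of Algorithm 2)
    Ii = np.flatnonzero(obs[i])
    g = G[Ii, k]
    phi = g @ g / s2 + Phi[k, k]
    mean = (Delta[i, Ii] @ g / s2 - F[i] @ Phi[:, k]) / phi
    f = mean + rng.standard_normal() / np.sqrt(phi)
    F[i, k] += f
    Delta[i, Ii] -= f * G[Ii, k]    # keep residuals consistent in place
    return f

f = update_Fik(0, 0)
```

After the update, the residual matrix still satisfies Δ = Y − FG' on the observed entries; maintaining this invariant in place is what makes each scalar step O(|I_i|) instead of requiring a fresh matrix product.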
A simple example is matrix factorization, which has recently become very popular in collaborative filtering applications, e.g. [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported state-of-the-art accuracy for matrix factorization on the Netflix data. Interestingly, that model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework in which informative priors derived from entity features can play a role. The framework also generalizes Gaussian processes [11] to the relational domain, describing a nonparametric prior for stochastic relational processes.

4 Experiments

Synthetic data: We compare BSRM under non-informative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) of [12] with a squared loss, and SRM using the variational Bayesian approach (SRM-VB) of [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ² = 0.1 (Figure 5(b)); the root mean squared noise is 0.32. We select 70% of the elements as observed data and use the rest for testing. The reconstructed matrices and the root mean squared errors (RMSEs) of the predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms both the variational approach to SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces a smaller rank, so its result looks smoother than that of BSRM.

Figure 5: Experiments on synthetic data; RMSEs are shown in parentheses. (a) Original Matrix; (b) With Noise (0.32); (c) fMMMF (0.27); (d) SRM-VB (0.22); (e) BSRM (0.19).
Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less.

             RMSE   MAE
User Mean    1.425  1.141
Movie Mean   1.387  1.103
fMMMF [12]   1.186  0.943
VB [8]       1.165  0.915

EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74,424 users' 2,811,718 ratings of 1,648 movies, i.e. about 2.29% of the entries are rated, with zero-to-five stars. We put all the ratings into a matrix and randomly select 80% as observed data to predict the remaining ratings; the random selection is carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user's ratings; 2) Movie Mean, predicting ratings by the sample mean of ratings of the same movie; 3) fMMMF [12]; 4) VB, introduced in [8], a probabilistic low-rank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie; the top 10 factors on both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with d = 16, 32, 100, 200, 300. The hyperparameters are set to trivial values: δ = p + 1 = 11, α = β = 1, σ² = 1, and ν = 1. The results are shown in Tables 1 and 2. We find that the accuracy improves as d increases; once d is greater than 100, further improvement is not very significant within a reasonable amount of running time.
Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less.

rank (d)   BSRM RMSE  BSRM MAE  BSRM/F RMSE  BSRM/F MAE
16         1.0983     0.8411    1.0952       0.8311
32         1.0924     0.8321    1.0872       0.8280
100        1.0905     0.8335    1.0848       0.8289
200        1.0903     0.8340    1.0846       0.8293
300        1.0902     0.8393    1.0852       0.8292

To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms and record the running time and accuracy in RMSE. The dimensionality d is set to 100. We compute the average Z_O and evaluate it after a certain number of iterations. The evaluation results are shown in Figure 6. We run both algorithms for 100 iterations as the burn-in period, so that we can start from an independent sample. After the burn-in period, we restart the computation of the averaged Z_O and evaluate it, which produces the abrupt points at 100 iterations in both curves. The results show that the overall accuracy of Algorithm 2 is better at any given time.

Figure 6: Time-accuracy comparison of Algorithm 1 and Algorithm 2 (RMSE vs. running time in seconds).

Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100,480,507 ratings from 480,189 users of 17,770 movies. In addition, Netflix provides a set of validation data with 1,408,395 ratings. In order to evaluate prediction accuracy, there is a test set containing 2,817,131 ratings whose values are withheld and unknown to all participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, and no movie-side factors are used, i.e. p = 30, q = 0.
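The feature construction described above (PCA of the rated/not-rated indicator matrix) can be sketched via a truncated SVD. The toy sizes and the 20% rating density below are assumptions; the paper applies this to the real Netflix indicator matrix with p = 30.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, p = 40, 30, 5        # users, movies, number of feature factors (toy sizes)

R = rng.random((m, n)) < 0.2        # binary rated/not-rated indicator matrix
X = R - R.mean(axis=0)              # center columns before PCA
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Top-p user-side factors, usable as the feature matrix F° in BSRM/F
F0 = U[:, :p] * s[:p]
```

Because the indicator matrix is observed everywhere (rated or not), this PCA sidesteps the missing-data problem of the ratings themselves while still capturing who-rates-what structure.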
The hyperparameters are set to trivial values: δ = p + 1 = 31, α = β = 1, σ² = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC.

Table 3: RMSE (root mean squared error) of experiments on Netflix data.

        VB [8]  BPMF [13]  BSRM 100  BSRM 200  BSRM 400  BSRM/F 100  BSRM/F 200  BSRM/F 400
RMSE    0.9141  0.8920     0.8930    0.8910    0.8895    0.8926      0.8880      0.8874

5 Conclusions

In this paper, we study fully Bayesian inference for stochastic relational models (SRMs), which learn real-valued relations between the entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models that can be efficiently sampled. The experiments show that fully Bayesian inference outperforms the previously used variational Bayesian inference for SRMs. In addition, some of the techniques for efficient computation in this paper can be applied to other large-scale Bayesian inference problems, especially models involving inverse-Wishart distributions.

Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments.

References
[1] E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008.
[2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265-274, 1981.
[3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995.
[4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000.
[5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007.
[6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89-115, 2004.
[7] C. Kemp, J. B.
Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.
[8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.
[9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.
[10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007.
[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
[13] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008.
[14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
[15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006.
[16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS '07: Advances in Neural Information Processing Systems 20, pages 1721-1728. MIT Press, Cambridge, MA, 2008.

Appendix

Before presenting the algorithms, we introduce the necessary notation. Let I_i = {j|(i, j) ∈ I} and I_j = {i|(i, j) ∈ I}. A matrix with subscripts denotes the submatrix consisting of its entries at the given indices; for example, X_{I_j, j} is the subvector of the j-th column of X whose row indices are in the set I_j, and X_{·,j} is the j-th column of X (· indicates the full index set).
X_{i,j} denotes the (i, j)-th entry of X. |X|²_I is the squared sum of the elements of X in the set I, i.e. Σ_{(i,j)∈I} X_{i,j}². We fill the unobserved elements of Y with 0 for notational simplicity.

Algorithm 1 BSRM/F: Gibbs sampling for SRM with features
1: Draw K ∼ IW_{d+p}(δ + m, αI_(d+p) + (F, F◦)'(F, F◦));
2: For each i ∈ U_0, draw f_i ∼ N_d(K_(i)(s^{−2} G' Y_{i,·}' + K11·2^{−1} K12 K22^{−1} f_i◦), K_(i)), where K_(i) = (s^{−2} (G_{I_i,·})' G_{I_i,·} + K11·2^{−1})^{−1};
3: Draw R ∼ IW_{d+q}(δ + n, βI_(d+q) + (G, G◦)'(G, G◦));
4: For each j ∈ V_0, draw g_j ∼ N_d(R_(j)(s^{−2} F' Y_{·,j} + R11·2^{−1} R12 R22^{−1} g_j◦), R_(j)), where R_(j) = (s^{−2} (F_{I_j,·})' F_{I_j,·} + R11·2^{−1})^{−1};
5: Draw s² ∼ χ^{−2}(ν + r, σ² + |Y − FG'|²_I).

Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features
1: Δ_{i,j} ← Y_{i,j} − Σ_k F_{i,k} G_{j,k}, for (i, j) ∈ I;
2: Draw Φ ∼ W_{d+p}(δ + m + d + p − 1, (αI_(d+p) + (F, F◦)'(F, F◦))^{−1});
3: for each (i, k) ∈ U_0 × {1, · · · , d} do
4:   Draw f ∼ N_1(φ^{−1}(s^{−2} Δ_{i,I_i} G_{I_i,k} − F_{i,·} Φ_{·,k}), φ^{−1}), where φ = s^{−2} (G_{I_i,k})' G_{I_i,k} + Φ_{k,k};
5:   Update F_{i,k} ← F_{i,k} + f, and Δ_{i,j} ← Δ_{i,j} − f G_{j,k}, for j ∈ I_i;
6: end for
7: Draw Ψ ∼ W_{d+q}(δ + n + d + q − 1, (βI_(d+q) + (G, G◦)'(G, G◦))^{−1});
8: for each (j, k) ∈ V_0 × {1, · · · , d} do
9:   Draw g ∼ N_1(ψ^{−1}(s^{−2} Δ_{I_j,j} F_{I_j,k} − G_{j,·} Ψ_{·,k}), ψ^{−1}), where ψ = s^{−2} (F_{I_j,k})' F_{I_j,k} + Ψ_{k,k};
10:  Update G_{j,k} ← G_{j,k} + g, and Δ_{i,j} ← Δ_{i,j} − g F_{i,k}, for i ∈ I_j;
11: end for
12: Draw s² ∼ χ^{−2}(ν + r, σ² + |Δ|²_I).
3 0.68012899 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
4 0.63570529 105 nips-2008-Improving on Expectation Propagation
Author: Manfred Opper, Ulrich Paquet, Ole Winther
Abstract: A series of corrections is developed for the fixed points of Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unrealiable results.
5 0.61495256 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
Author: Norm Aleks, Stuart Russell, Michael G. Madden, Diane Morabito, Kristan Staudenmayer, Mitchell Cohen, Geoffrey T. Manley
Abstract: We describe an application of probabilistic modeling and inference technology to the problem of analyzing sensor data in the setting of an intensive care unit (ICU). In particular, we consider the arterial-line blood pressure sensor, which is subject to frequent data artifacts that cause false alarms in the ICU and make the raw data almost useless for automated decision making. The problem is complicated by the fact that the sensor data are averaged over fixed intervals whereas the events causing data artifacts may occur at any time and often have durations significantly shorter than the data collection interval. We show that careful modeling of the sensor, combined with a general technique for detecting sub-interval events and estimating their duration, enables detection of artifacts and accurate estimation of the underlying blood pressure values. Our model’s performance identifying artifacts is superior to two other classifiers’ and about as good as a physician’s. 1
6 0.58899289 236 nips-2008-The Mondrian Process
7 0.57754946 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
8 0.55605263 31 nips-2008-Bayesian Exponential Family PCA
9 0.54141551 234 nips-2008-The Infinite Factorial Hidden Markov Model
10 0.52651405 134 nips-2008-Mixed Membership Stochastic Blockmodels
11 0.52549374 249 nips-2008-Variational Mixture of Gaussian Process Experts
12 0.52475691 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
13 0.50838971 152 nips-2008-Non-stationary dynamic Bayesian networks
14 0.49456647 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
15 0.49241799 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
16 0.48910481 138 nips-2008-Modeling human function learning with Gaussian processes
17 0.47236574 216 nips-2008-Sparse probabilistic projections
18 0.44647241 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
19 0.4258213 213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression
20 0.41783732 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
topicId topicWeight
[(6, 0.04), (7, 0.053), (12, 0.017), (28, 0.12), (57, 0.593), (63, 0.011), (71, 0.013), (77, 0.029), (83, 0.038)]
simIndex simValue paperId paperTitle
1 0.93760079 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomialtime parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
same-paper 2 0.92812979 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skullreconstruction task. 1
3 0.91993791 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. 
For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used handdesigned settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRFbased models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. 
As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
4 0.90565377 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes, and this connection allows us to construct multidimensional generalizations of the stick-breaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
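The generative description in the abstract can be sketched as a short recursive sampler: each box waits an exponential time with rate equal to its total side length, and if the budget allows, is split by an axis-aligned cut at a uniform location, recursing on both halves. This follows the usual generative presentation of the process; the budget and box below are illustrative, and this is a sketch, not the authors' code.

```python
import random

def mondrian(budget, box, cuts=None):
    """Sample a Mondrian process on an axis-aligned box.

    box: list of (low, high) intervals, one per dimension.
    Returns the list of cuts as (dimension, location) pairs; boxes
    whose budget runs out become leaves of the resulting kd-tree.
    """
    if cuts is None:
        cuts = []
    lengths = [hi - lo for lo, hi in box]
    total = sum(lengths)
    # Time to the next cut is exponential with rate = total side length,
    # the Poisson-process connection mentioned in the abstract.
    cost = random.expovariate(total) if total > 0 else float("inf")
    if cost > budget:
        return cuts  # budget exhausted: this box is a leaf
    d = random.choices(range(len(box)), weights=lengths)[0]
    lo, hi = box[d]
    x = random.uniform(lo, hi)
    cuts.append((d, x))
    left = list(box)
    left[d] = (lo, x)
    right = list(box)
    right[d] = (x, hi)
    mondrian(budget - cost, left, cuts)
    mondrian(budget - cost, right, cuts)
    return cuts

random.seed(0)
cuts = mondrian(budget=2.0, box=[(0.0, 1.0), (0.0, 1.0)])
```

In two dimensions the resulting partitions resemble Mondrian paintings, hence the name; restricting attention to one dimension recovers a Poisson-process-style sequence of cuts.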
5 0.89146459 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
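The Projection kernel the abstract builds on has a one-line form: for subspaces represented by orthonormal basis matrices Y1, Y2, it is k(Y1, Y2) = ||Y1^T Y2||_F^2, the sum of squared cosines of the principal angles. The sketch below shows this standard Grassmann kernel only, with illustrative random subspaces; it is not the authors' extended affine/scaled variants.

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """Projection kernel between two linear subspaces, each given by a
    matrix whose columns form an orthonormal basis:
        k(Y1, Y2) = ||Y1.T @ Y2||_F^2,
    i.e. the sum of squared cosines of the principal angles."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def orthonormal_basis(A):
    """Orthonormal basis for the column span of A via (reduced) QR."""
    Q, _ = np.linalg.qr(A)
    return Q

rng = np.random.default_rng(1)
Y = orthonormal_basis(rng.normal(size=(10, 3)))   # a 3-dim subspace of R^10
# A subspace compared with itself attains the maximum value: its dimension.
print(projection_kernel(Y, Y))  # -> 3.0, up to floating-point error
```

Because the kernel depends on Y only through the projector Y @ Y.T, it is invariant to the choice of orthonormal basis, which is what makes it a well-defined function on the Grassmann manifold and usable in SVMs and kernel discriminant analysis.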
6 0.79171294 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
7 0.76558572 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
8 0.71117574 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
9 0.66894734 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
10 0.63030422 234 nips-2008-The Infinite Factorial Hidden Markov Model
11 0.61239761 35 nips-2008-Bayesian Synchronous Grammar Induction
12 0.60753798 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.59567636 200 nips-2008-Robust Kernel Principal Component Analysis
14 0.58437163 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction
15 0.57978106 66 nips-2008-Dynamic visual attention: searching for coding length increments
16 0.57587665 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
17 0.57548755 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
18 0.57017416 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
19 0.56803006 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
20 0.5592643 229 nips-2008-Syntactic Topic Models