nips nips2005 nips2005-133 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Murray, David MacKay, Zoubin Ghahramani, John Skilling
Abstract: Nested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Nested sampling for Potts models. Iain Murray, Gatsby Computational Neuroscience Unit, University College London. [sent-1, score-0.227]
2 Abstract Nested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. [sent-15, score-0.227]
3 Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. [sent-16, score-0.273]
4 The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. [sent-18, score-0.268]
5 1 Introduction The computation of normalizing constants plays an important role in statistical inference. [sent-20, score-0.046]
6 For example, Bayesian model comparison needs the evidence, or marginal likelihood, of a model M: Z = p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ ≡ ∫ L(θ)π(θ) dθ, (1) where the model has prior π and likelihood L over parameters θ after observing data D. [sent-21, score-0.273]
7 However, given its importance in Bayesian model comparison, many approaches—both sampling-based and deterministic—have been proposed for estimating it. [sent-23, score-0.039]
8 Often the evidence cannot be obtained using samples drawn from either the prior π, or the posterior p(θ|D, M) ∝ L(θ)π(θ). [sent-24, score-0.115]
9 Nested sampling provides an alternative, which makes no use of temperature and does not require tuning of intermediate distributions or other large sets of parameters. [sent-30, score-0.265]
10 Figure 1: (a) Elements of parameter space (top) are sorted by likelihood and arranged on the x-axis. [sent-31, score-0.09]
11 An eighth of the prior mass is inside the innermost likelihood contour in this figure. [sent-32, score-0.327]
12 (b) Point xi is drawn from the prior inside the likelihood contour defined by xi−1 . [sent-33, score-0.379]
13 Li is identified and p({xi }) is known, but exact values of xi are not known. [sent-34, score-0.091]
14 (c) With N particles, the least likely one sets the likelihood contour and is replaced by a new point inside the contour ({Li } and p({xi }) are still known). [sent-35, score-0.308]
15 Nested sampling uses a natural definition of Z, a sum over prior mass. [sent-36, score-0.295]
16 The weighted sum over likelihood elements is expressed as the area under a monotonic one-dimensional curve “L vs x” (figure 1(a)), where: Z = ∫ L(θ)π(θ) dθ = ∫_0^1 L(θ(x)) dx. (2) [sent-37, score-0.122]
17 This is a change of variables dx(θ) = π(θ)dθ, where each volume element of the prior in the original θ-vector space is mapped onto a scalar element on the one-dimensional x-axis. [sent-38, score-0.068]
18 The ordering of the elements on the x-axis is chosen to sort the prior mass in decreasing order of likelihood values (x1 < x2 ⇒ L(θ(x1 )) > L(θ(x2 ))). [sent-39, score-0.253]
19 See appendix A for dealing with elements with identical likelihoods. [sent-40, score-0.058]
20 Given some points {(xi, Li)}, i = 1 . . . I, ordered such that xi > xi+1, the area under the curve (2) is easily approximated. [sent-41, score-0.168]
21 We denote by Ẑ estimates obtained using a trapezoidal rule. [sent-42, score-0.051]
22 A simple algorithm to draw I points is algorithm 1, see also figure 1(b). [sent-46, score-0.152]
23 for i = 2 to I: draw θi ∼ π̆(θ|L(θi−1)), where π̆(θ|L(θi−1)) ∝ π(θ) if L(θ) > L(θi−1), and 0 otherwise. (3) [sent-48, score-0.126]
24 Algorithm 2 Initialize: draw N points θ(n) ∼ π(θ); for i = 2 to I: • m = argmin_n L(θ(n)) • θi−1 = θ(m) • draw θ(m) ∼ π̆(θ|L(θi−1)), given by equation (3). We know p(x1) = Uniform(0, 1), because x is a cumulative sum of prior mass. [sent-49, score-0.38]
25 Similarly p(xi |xi−1 ) = Uniform(0, xi−1 ), as every point is drawn from the prior subject to L(θi ) > L(θi−1 ) ⇒ xi < xi−1 . [sent-50, score-0.159]
26 A simple generalization, algorithm 2, uses multiple θ particles; at each step the least likely is replaced with a draw from a constrained prior (figure 1(c)). [sent-52, score-0.194]
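The following sketch (not from the paper) illustrates Algorithm 2 and the trapezoidal estimate Ẑ on a toy problem with a Gaussian likelihood under a uniform box prior, where the constrained-prior draw can be done by simple rejection sampling; all names, sizes and the deterministic x_i ≈ exp(−i/N) shortcut are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: uniform prior on a box, Gaussian likelihood.
D, side = 2, 10.0                       # dimension and box side length
N, I = 50, 400                          # particles and iterations

def log_likelihood(theta):
    return -0.5 * np.sum(theta ** 2)

def prior_draw():
    return rng.uniform(-side / 2, side / 2, size=D)

def constrained_prior_draw(logL_min):
    # Rejection sampling is fine for this toy; in general this draw is the
    # hard part and is approximated by MCMC.
    while True:
        theta = prior_draw()
        if log_likelihood(theta) > logL_min:
            return theta

# Algorithm 2: keep N particles, replace the least likely one each iteration.
particles = [prior_draw() for _ in range(N)]
logLs = np.array([log_likelihood(t) for t in particles])

logL_chain, log_weights = [], []
for i in range(1, I + 1):
    m = int(np.argmin(logLs))                        # least likely particle
    logL_chain.append(logLs[m])
    # Deterministic shortcut x_i ~ exp(-i/N); trapezoidal width in prior mass.
    x_lo, x_hi = np.exp(-(i + 1) / N), np.exp(-(i - 1) / N)
    log_weights.append(logLs[m] + np.log(0.5 * (x_hi - x_lo)))
    particles[m] = constrained_prior_draw(logLs[m])  # draw inside the contour
    logLs[m] = log_likelihood(particles[m])

lw = np.array(log_weights)
logZ_hat = lw.max() + np.log(np.sum(np.exp(lw - lw.max())))
print("nested sampling log Z:", logZ_hat)
print("analytic log Z       :", 0.5 * D * np.log(2 * np.pi) - D * np.log(side))
```

With these settings the estimate typically lands within a few tenths of the analytic value; a probabilistic treatment of the unknown x values is sketched after equation (5) below.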
27 Error bars on the geometric mean show exp(−i/N ± √i/N); samples of p(x|N) are superimposed (i = 1600 . . .). [sent-55, score-0.09]
28 The key idea is that samples from the prior, subject to a nested sequence of constraints (3), give a probabilistic realization of the curve, figure 1(a). [sent-63, score-0.617]
29 In this paper we present some new discussion of important issues regarding the practical implementation of nested sampling and provide the first application to a challenging problem. [sent-66, score-0.797]
30 1 MCMC approximations The nested sampling algorithm assumes obtaining samples from π̆(θ|L(θi−1)), equation (3), is possible. [sent-69, score-0.844]
31 Rejection sampling using π would slow down exponentially with iteration number i. [sent-70, score-0.227]
32 We explore approximate sampling from π̆ using Markov chain Monte Carlo (MCMC) methods. [sent-71, score-0.273]
33 In high-dimensional problems the majority of π̆’s mass is typically in a thin shell at the contour surface [5, p37]. [sent-72, score-0.127]
34 In order to complete an ergodic MCMC method, we also need transition operators that can alter the likelihood (within the constraint). [sent-74, score-0.155]
35 We must initialize the Markov chain for each new sample somewhere. [sent-76, score-0.046]
36 One possibility is to start at the position of the deleted point, θi−1 , on the contour constraint, which is independent of the other points and not far from the bulk of the required uniform distribution. [sent-77, score-0.114]
37 However, if the Markov chain mixes slowly amongst modes, the new point starting at θi−1 may be trapped in an insignificant mode. [sent-78, score-0.087]
38 In this case it would be better to start at one of the other N −1 existing points inside the contour constraint. [sent-79, score-0.156]
39 They are all draws from the correct distribution, π (θ|L(θi−1 )), ˘ so represent modes fairly. [sent-80, score-0.077]
40 However, this method may also require many Markov chain steps, this time to make the new point effectively independent of the point it cloned. [sent-81, score-0.046]
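As a concrete illustration (a sketch under assumptions, not the paper's implementation): for a uniform box prior, a Metropolis random walk that simply rejects any proposal leaving the prior box or violating the likelihood constraint has the constrained prior π̆ as its stationary distribution, and the chain can be seeded by cloning one of the N−1 surviving particles. The helper names and box bounds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def constrained_prior_mcmc(theta_start, logL_min, log_likelihood,
                           lo=-5.0, hi=5.0, step=0.5, n_steps=100):
    # Metropolis random walk targeting the prior (assumed uniform on
    # [lo, hi]^D) restricted to log L(theta) > logL_min: a symmetric
    # proposal is accepted iff it stays in the box and obeys the constraint.
    theta = np.array(theta_start, dtype=float)
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        if np.all((prop > lo) & (prop < hi)) and log_likelihood(prop) > logL_min:
            theta = prop                  # otherwise stay where we are
    return theta

# Seeding the chain: cloning one of the N-1 surviving particles (all fair
# draws from the constrained prior) represents the modes fairly, at the cost
# of extra steps to decorrelate the new point from its clone. With the
# nested-sampling state from the sketch above:
#   survivors = [p for n, p in enumerate(particles) if n != m]
#   start = survivors[rng.integers(len(survivors))]
#   particles[m] = constrained_prior_mcmc(start, logLs[m], log_likelihood)
```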
41 The test system was a 40-dimensional hypercube of length 100 with uniform prior centered on the origin. [sent-89, score-0.068]
42 The mean of a distribution over log(Ẑ) can be found by simple Monte Carlo estimation: log(Ẑ) ≈ ∫ log(Ẑ(x)) p(x|N) dx ≈ (1/S) Σ_{s=1}^{S} log(Ẑ(x^(s))), x^(s) ∼ p(x|N). (5) [sent-97, score-0.054]
43 This scheme is easily implemented for any expectation under p(x|N), including error bars from the variance of log(Ẑ). [sent-98, score-0.05]
44 To reduce noise in comparisons between runs it is advisable to reuse the same samples from p(x|N). [sent-99, score-0.047]
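A small sketch of this Monte Carlo treatment of equation (5), using the fact that with N particles each shrinkage factor ti is the largest of N uniforms (so log ti = log(u)/N with u ∼ Uniform(0, 1)); `logL_chain` is assumed to be the sequence of discarded log-likelihoods recorded by a nested-sampling run, as in the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_trapezoid_Z(logL_chain, log_x):
    # Trapezoidal rule over one sampled trajectory of prior masses x_i,
    # closing the sum with x_0 = 1 and x_{I+1} = 0.
    x = np.concatenate(([1.0], np.exp(log_x), [0.0]))
    widths = 0.5 * (x[:-2] - x[2:])                   # (x_{i-1} - x_{i+1}) / 2
    lw = np.asarray(logL_chain) + np.log(widths)
    return lw.max() + np.log(np.sum(np.exp(lw - lw.max())))

def logZ_mean_and_error(logL_chain, N, S=200):
    # Equation (5): average log Z-hat(x) over S trajectories x ~ p(x|N);
    # the spread of the samples gives error bars on log Z.
    I = len(logL_chain)
    logZs = np.empty(S)
    for s in range(S):
        log_x = np.cumsum(np.log(rng.random(I)) / N)  # log x_i = sum of log t_i
        logZs[s] = log_trapezoid_Z(logL_chain, log_x)
    return logZs.mean(), logZs.std()

# Fixing the rng seed reuses the same trajectories across runs, which reduces
# noise when comparing samplers, as suggested in the text.
# mean, err = logZ_mean_and_error(logL_chain, N)
```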
45 Figure 2 shows sampled trajectories of xi as the algorithm progresses. [sent-103, score-0.116]
46 The geometric mean path, xi ≈ exp(∫ p(xi|N) log xi dxi) = e^(−i/N), follows the path of typical settings of x. [sent-104, score-0.338]
47 Typically the trapezoidal estimate of the integral, Ẑ, is dominated by a small number of trapezoids, around iteration i∗ say. [sent-106, score-0.051]
48 Considering uncertainty on just log xi∗ = −i∗/N ± √i∗/N provides reasonable and convenient error bars. [sent-107, score-0.054]
49 The Potts model is defined over discrete variables s = (s1, . . . , sn), each taking on one of q distinct “colors”: P(s|J, q) = (1/ZP(J, q)) exp( Σ_(ij)∈E J(δsi sj − 1) ). (6) [sent-111, score-0.092]
50 The Kronecker delta, δsi sj, is one when si and sj are the same color and zero otherwise. [sent-113, score-0.158]
51 1 Swendsen–Wang sampling We will take advantage of the "Fortuin-Kasteleyn-Swendsen-Wang" (FKSW) joint distribution identified explicitly in Edwards and Sokal [6] over color variables s and a bond variable for each edge in E, dij ∈ {0, 1}: P(s, d) = (1/ZP(J, q)) Π_(ij)∈E [ (1 − p)δdij,0 + p δdij,1 δsi,sj ], where p ≡ (1 − e^−J). (7) [sent-121, score-0.505]
52 The marginal distribution over s in the FKSW model is the Potts distribution, equation (6). [sent-122, score-0.059]
53 The algorithm of Swendsen and Wang [8] performs block Gibbs sampling on the joint model by alternately sampling from P (dij |s) and P (s|dij ). [sent-125, score-0.454]
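A minimal sketch of one Swendsen–Wang sweep for a q-state Potts model on an L×L grid with free boundaries (the grid graph and all names here are assumptions for illustration), alternating the two block-Gibbs conditionals d|s and s|d:

```python
import numpy as np

rng = np.random.default_rng(3)

def swendsen_wang_sweep(s, J, q):
    # One Swendsen-Wang update: sample bonds d given colors s, then colors
    # s given bonds d (block Gibbs on the FKSW joint).
    L = s.shape[0]
    p = 1.0 - np.exp(-J)
    parent = np.arange(L * L)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # d | s: a bond may only sit on an edge whose endpoints agree, and is
    # then present with probability p = 1 - exp(-J).
    for x in range(L):
        for y in range(L):
            i = x * L + y
            if x + 1 < L and s[x, y] == s[x + 1, y] and rng.random() < p:
                parent[find(i)] = find((x + 1) * L + y)
            if y + 1 < L and s[x, y] == s[x, y + 1] and rng.random() < p:
                parent[find(i)] = find(x * L + y + 1)

    # s | d: each connected component of the bond graph gets a fresh color,
    # uniform over the q possibilities.
    roots = np.array([find(i) for i in range(L * L)])
    new_color = {r: rng.integers(q) for r in np.unique(roots)}
    return np.array([new_color[r] for r in roots]).reshape(L, L)

# Usage: a few sweeps from a disordered start.
s = rng.integers(10, size=(32, 32))           # q = 10 Potts configuration
for _ in range(20):
    s = swendsen_wang_sweep(s, J=1.0, q=10)
```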
54 2 Nested Sampling A simple approximate nested sampler uses a fixed number of Gibbs sampling updates of π̆. [sent-128, score-0.864]
55 Focusing on the random cluster model, we rewrite equation (8): P(d) = [Zπ / (ZN ZP(J, q))] L(d) π(d), (9) where L(d) = exp(D log(e^J − 1)), π(d) = q^C(d) / Zπ, and ZN = exp(J|E|). [sent-130, score-0.069]
56 Likelihood thresholds are thresholds on the total number of bonds D. [sent-131, score-0.178]
57 Many states have identical D, which requires careful treatment, see appendix A. [sent-132, score-0.071]
58 Nested sampling on this system will give the ratio of ZP /Zπ . [sent-133, score-0.253]
59 The prior normalization, Zπ , can be found from the partition function of a Potts system at J = log(2). [sent-134, score-0.094]
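One way to make this concrete (a short derivation consistent with equations (6), (7) and (9), not a quotation from the paper): summing the FKSW joint over colorings gives a random-cluster form of ZP, and setting p = 1/2, i.e. J = log 2, isolates Zπ = Σ_d q^C(d).

```latex
Z_P(J,q) = \sum_{s,d}\ \prod_{(ij)\in E}\Big[(1-p)\,\delta_{d_{ij},0} + p\,\delta_{d_{ij},1}\,\delta_{s_i,s_j}\Big]
         = \sum_{d} q^{C(d)}\, p^{D}\,(1-p)^{|E|-D},
\qquad p = 1 - e^{-J}.
```

At J = log 2 every edge factor contributes (1/2)^{|E|}, so ZP(log 2, q) = 2^(−|E|) Σ_d q^C(d), and hence Zπ = 2^|E| ZP(J = log 2, q): a single Potts partition-function evaluation at J = log 2 gives the prior normalization.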
60 The following steps give two MCMC operators to change the bonds d → d′: [sent-135, score-0.207]
61 1. Create a random coloring, s, uniformly from the q^C(d) colorings satisfying the bond constraints d, as in the Swendsen–Wang algorithm. [sent-136, score-0.119]
62 2. Either, operator 1: record the number of bonds D′ = Σ_(ij)∈E dij; or, operator 2: draw D′ from Q(D′|E(s)) ∝ (E(s) choose D′). [sent-140, score-0.466]
63 3. Throw away the old bonds, d, and pick uniformly from one of the (E(s) choose D′) ways of setting D′ bonds in the E(s) available sites. [sent-142, score-0.205]
64 The probability of proposing a particular coloring and new setting of the bonds is Q(s, d′|d) = Q(d′|s, D′) Q(D′|E(s)) Q(s|d) = [1 / (E(s) choose D′)] Q(D′|E(s)) [1 / q^C(d)]. [sent-143, score-0.219]
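The sketch below shows one way these two operators can be realized as Gibbs-style updates of the constrained prior π̆(d) ∝ q^C(d), subject to a bond-count threshold (assuming J > 0, so the likelihood constraint is a lower bound D ≥ D_min). The paper gives the proposal Q(s, d′|d) above and may combine it with an explicit accept/reject step, so treat this as an illustration rather than the authors' exact procedure; the graph representation, function names and the small grid are all assumptions.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(4)

def grid_edges(L):
    # Edge list of an L x L grid graph (free boundaries).
    edges = []
    for x in range(L):
        for y in range(L):
            i = x * L + y
            if x + 1 < L:
                edges.append((i, (x + 1) * L + y))
            if y + 1 < L:
                edges.append((i, x * L + y + 1))
    return edges

def clusters(n_nodes, edges, d):
    # Connected components of the bond graph (union-find).
    parent = list(range(n_nodes))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (a, b), on in zip(edges, d):
        if on:
            parent[find(a)] = find(b)
    return np.array([find(i) for i in range(n_nodes)])

def bond_operator(d, edges, n_nodes, q, D_min, resample_D=True):
    # One update of the bonds d that leaves the constrained prior
    # pi(d) proportional to q^C(d), restricted to D >= D_min, invariant.
    # 1. color each cluster uniformly: a draw from the q^C(d) colorings
    roots = clusters(n_nodes, edges, d)
    color_of = {r: rng.integers(q) for r in np.unique(roots)}
    s = np.array([color_of[r] for r in roots])
    # 2. E(s): edges whose endpoints now share a color
    allowed = np.array([s[a] == s[b] for a, b in edges])
    E_s = int(allowed.sum())
    # 3. choose the new number of bonds D'
    if resample_D:                                    # operator 2
        Ds = np.arange(D_min, E_s + 1)
        logw = np.array([lgamma(E_s + 1) - lgamma(k + 1) - lgamma(E_s - k + 1)
                         for k in Ds])                # log binomial(E(s), D')
        w = np.exp(logw - logw.max())
        D_new = int(rng.choice(Ds, p=w / w.sum()))
    else:                                             # operator 1: keep D
        D_new = int(np.sum(d))
    # 4. place D' bonds uniformly on the E(s) allowed edges
    d_new = np.zeros(len(edges), dtype=bool)
    idx = np.flatnonzero(allowed)
    d_new[rng.choice(idx, size=D_new, replace=False)] = True
    return d_new

# Usage on a small grid, with an inactive constraint for simplicity.
L = 8
edges = grid_edges(L)
d = np.zeros(len(edges), dtype=bool)                  # start with no bonds
for _ in range(100):
    d = bond_operator(d, edges, L * L, q=10, D_min=0)
```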
65 Table of results (methods × systems): Gibbs AIS, Swendsen–Wang AIS, Gibbs nested sampling, Random-cluster nested sampling, and the acceptance ratio method, evaluated on the q = 2 (Ising), J = 1 and q = 10, J = 1. … systems.
66 We tested nested samplers using Gibbs sampling and the cluster-based algorithm, and annealed importance sampling (AIS) [9] using both Gibbs sampling and Swendsen–Wang cluster updates. [sent-166, score-1.378]
67 We also developed an acceptance ratio method [10] based on our representation in equation (9), which we ran extensively and should give nearly correct results. [sent-167, score-0.157]
68 Annealed importance sampling (AIS) was run 100 times, with a geometric spacing of 10^4 settings of J as the annealing schedule. [sent-168, score-0.391]
69 Nested sampling used N = 100 particles and 100 full-system MCMC updates to approximate each draw from π̆. [sent-169, score-0.479]
70 Each Markov chain was initialized at one of the N−1 particles satisfying the current constraint. [sent-170, score-0.135]
71 The Gibbs nested sampler could get stuck permanently in a local maximum of the likelihood, while the cluster method gave erroneous answers for the Ising system. [sent-172, score-0.669]
72 We took advantage of its performance in easy parameter regimes to compute Zπ for use in the cluster-based nested sampler. [sent-174, score-0.57]
73 However, with a “temperature-based” annealing schedule, AIS was unable to give useful answers for the q = 10 system. [sent-175, score-0.085]
74 Nested sampling, in contrast, appears to be correct within its error bars. [sent-176, score-0.827]
75 It is known that even the efficient Swendsen–Wang algorithm mixes slowly for Potts models with q > 4 near critical values of J [11], see figure 4. [sent-177, score-0.041]
76 Typical Potts model states are either entirely disordered or ordered; disordered states contain a jumble of small regions with different colors (e.g. figure 4(b)). [sent-178, score-0.281]
77 In ordered states the system is predominantly one color. [sent-180, score-0.136]
78 Moving between these two phases is difficult; defining a valid MCMC method that moves between distinct phases requires knowledge of the relative probability of whole collections of states in those phases. [sent-183, score-0.16]
79 Temperature-based annealing algorithms explore the model for a range of settings of J and fail to capture the correct behavior near the transition. [sent-184, score-0.115]
80 Despite using closely related Markov chains to those used in AIS, nested sampling can work in all parameter regimes. [sent-185, score-0.827]
81 Figure 4(e) shows how nested sampling can explore a mixture of ordered and disordered phases. [sent-186, score-0.929]
82 By moving steadily through these states, nested sampling is able to estimate the prior mass associated with each likelihood value. [sent-187, score-0.994]
83 2 Proof: with finite probability all si are given the same color, then any allowable D is possible, in turn all allowable d have finite probability. [sent-188, score-0.136]
84 Figure 4: Two 256 × 256, q = 10 Potts models with starting states (a) and (c) were simulated with 5 × 10^6 full-system Swendsen–Wang updates with J = 1. [sent-189, score-0.082]
85 The corresponding results, (b) and (d) are typical of all the intermediate samples: Swendsen– Wang is unable to take (a) into an ordered phase, or (c) into a disordered phase, although both phases are typical at this J. [sent-191, score-0.215]
86 (e) in contrast shows an intermediate state of nested sampling, which succeeds in bridging the phases. [sent-192, score-0.608]
87 The challenge for nested sampling remains the invention of effective sampling schemes that keep a system at or near constant energy. [sent-200, score-1.024]
88 Berg and Neuhaus [12] study the Potts model using the multicanonical ensemble. [sent-205, score-0.051]
89 Nested sampling has some unique properties compared to the established method. [sent-206, score-0.227]
90 Unless problems with multiple modes demand otherwise, N = 1 often reveals useful information, and if the error bars on Z are too large further runs with larger N may be performed. [sent-208, score-0.097]
91 5 Conclusions We have applied nested sampling to compute the normalizing constant of a system that is challenging for many Monte Carlo methods. [sent-209, score-0.843]
92 • Nested sampling’s key technical requirement, an ability to draw samples uniformly from a constrained prior, is largely solved by efficient MCMC methods. [sent-210, score-0.2]
93 • No complex schedules are required; steady progress towards compact regions of large likelihood is controlled by a single free parameter, N , the number of particles. [sent-211, score-0.09]
94 • Nested sampling has no special difficulties on systems with first order phasetransitions, whereas all temperature-based methods fail. [sent-213, score-0.227]
95 We believe that nested sampling’s unique properties will be found useful in a variety of statistical applications. [sent-214, score-0.57]
96 A Degenerate likelihoods The description in section 1 assumed that the likelihood function provides a total ordering of elements of the parameter space. [sent-215, score-0.146]
97 However, distinct elements dx and dx could have the same likelihood, either because the parameters are discrete, or because the likelihood is degenerate. [sent-216, score-0.255]
98 Assuming we have a likelihood constraint Li, we need to be able to draw from P(θ′, m′ | θ, m, Li) ∝ π(θ′)π(m′) if L(θ′)L(m′) > Li, and 0 otherwise. [sent-221, score-0.216]
99 Therefore, the probability of states with likelihood L(θi ) are weighted by (1 − m ). [sent-223, score-0.135]
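One concrete way to implement such a total ordering (an illustration consistent with the idea of attaching an auxiliary label m to each state, not necessarily the appendix's exact construction) is to compare (likelihood, m) pairs lexicographically, so ties in the likelihood are broken by m:

```python
import numpy as np

rng = np.random.default_rng(5)

def augmented_key(logL, m=None):
    # Attach an auxiliary label m ~ Uniform(0, 1) to a (possibly degenerate)
    # log-likelihood; comparing keys lexicographically gives a total ordering
    # even when many states share the same likelihood value.
    return (float(logL), float(rng.random()) if m is None else float(m))

def above_threshold(key, key_threshold):
    # Strictly "above" the discarded particle's key: compare the likelihood
    # first, then break ties with the auxiliary label.
    return key > key_threshold            # tuple comparison is lexicographic

# In the bond representation many configurations share the same bond count D,
# so each particle would carry a key (D, m) and the contour of the discarded
# particle is the pair (D_i, m_i) rather than D_i alone.
```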
100 Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. [sent-231, score-0.567]
simIndex simValue paperId paperTitle
same-paper 1 0.93985838 133 nips-2005-Nested sampling for Potts models
Author: Iain Murray, David MacKay, Zoubin Ghahramani, John Skilling
Abstract: Nested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model. 1
2 0.7511338 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
Author: Firas Hamze, Nando de Freitas
Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.
3 0.55045754 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
Author: Yunsong Huang, B. Keith Jenkins
Abstract: We develop an approach for estimation with Gaussian Markov processes that imposes a smoothness prior while allowing for discontinuities. Instead of propagating information laterally between neighboring nodes in a graph, we study the posterior distribution of the hidden nodes as a whole—how it is perturbed by invoking discontinuities, or weakening the edges, in the graph. We show that the resulting computation amounts to feed-forward fan-in operations reminiscent of V1 neurons. Moreover, using suitable matrix preconditioners, the incurred matrix inverse and determinant can be approximated, without iteration, in the same computational style. Simulation results illustrate the merits of this approach.
4 0.55043548 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using samples from the sampling distribution. We provide guarantees for the first two approaches and evaluate the performance of all three approaches in synthetic experiments and on real data from species habitat modeling, where maxent has been successfully applied and where sample selection bias is a significant problem. 1
5 0.48303688 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi, xi) | i = 1, . . . , m}. Let fi = f(xi) and f = [f1, . . . , fm]ᵀ be shorthand for the values of the latent function, and let y = [y1, . . . , ym]ᵀ and X = [x1, . . . , xm] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factorises as

$p(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{m} p(y_i|f_i) = \prod_{i=1}^{m} \Phi(y_i f_i),$

and depends on f only through its values at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x′|θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encode assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes

$p(\mathbf{f}|\mathcal{D},\theta) = \frac{\mathcal{N}(\mathbf{f}|\mathbf{0},\mathbf{K})}{p(\mathcal{D}|\theta)} \prod_{i=1}^{m} \Phi(y_i f_i),$

where K is the m × m matrix of prior covariances Kij = k(xi, xj|θ) and $p(\mathcal{D}|\theta) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X,\theta)\, \mathrm{d}\mathbf{f}$ denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, nor predictions can be computed analytically, so approximations are needed.

2 Approximate Bayesian Inference

For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f|D, θ) ≈ q(f|D, θ) = N(f|m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP), which are two alternative approaches to finding the parameters m and A of the Gaussian q(f|D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second-order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi|D, θ) by the marginals of the approximation N(fi|mi, Aii). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions.

Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N(f|0, 5²) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high-dimensional zero-mean Gaussian prior as an ellipse.
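As an illustration of the construction above, the following sketch evaluates the unnormalised GPC log posterior, log N(f|0, K) + Σᵢ log Φ(yᵢ fᵢ); the missing normaliser p(D|θ) is exactly the quantity that the approximations discussed below estimate. The function name and the use of SciPy are our own choices, not the authors’ code.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_unnormalised_posterior(f, y, K):
    """log[ N(f | 0, K) * prod_i Phi(y_i f_i) ], i.e. the GPC log posterior
    up to the unknown log marginal likelihood log p(D | theta)."""
    log_prior = multivariate_normal.logpdf(f, mean=np.zeros(len(f)), cov=K)
    log_lik = np.sum(norm.logcdf(y * f))   # factorised probit likelihood
    return log_prior + log_lik
```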
The gray shadow indicates that for a high-dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin.

For a test input x∗ the approximate predictive latent and class probabilities are

$q(f_*|\mathcal{D},\theta,x_*) = \mathcal{N}(f_*|\mu_*, \sigma_*^2)$ and $q(y_*=1|\mathcal{D},x_*) = \Phi\!\left(\mu_*/\sqrt{1+\sigma_*^2}\right),$

where $\mu_* = \mathbf{k}_*^{\top}\mathbf{K}^{-1}\mathbf{m}$ and $\sigma_*^2 = k(x_*,x_*) - \mathbf{k}_*^{\top}(\mathbf{K}^{-1} - \mathbf{K}^{-1}\mathbf{A}\mathbf{K}^{-1})\mathbf{k}_*$, and the vector $\mathbf{k}_* = [k(x_1,x_*), \ldots, k(x_m,x_*)]^{\top}$ collects the covariances between x∗ and the training inputs X.

MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ; thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic, O(m³), as is common among non-sparse GP models due to inversions of m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data points. Making AIS work efficiently requires some fine-tuning, and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon.

3 Structural Properties of the Posterior and its Approximations

Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N(f|0, K) centred at the origin. Each likelihood term p(yi|fi) softly truncates the half-space of the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high-dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) of the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative of the posterior distribution, while the mean moves towards the mass of the posterior, making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often-studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware that we are making a Gaussian approximation to a non-Gaussian posterior.
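The sketch below illustrates one way Laplace’s approximation for the probit GPC model can be implemented (Newton iterations to the posterior mode, with A the negative inverse Hessian there), together with the predictive formulas just given. It is a naive, dense-matrix illustration under our own naming conventions, not the authors’ implementation, and the EP counterpart is omitted.

```python
import numpy as np
from scipy.stats import norm

def laplace_approximation(K, y, n_iter=50):
    """Newton iteration for the Laplace approximation to the probit GPC
    posterior.  Returns the mode m and covariance A = (K^-1 + W)^-1, where
    W = -grad^2 log p(y|f) at the mode.  Numerically naive (explicit
    inverses); a sketch, not production code."""
    m = np.zeros(len(y))
    K_inv = np.linalg.inv(K)
    for _ in range(n_iter):
        z = y * m
        ratio = norm.pdf(z) / norm.cdf(z)            # phi(z) / Phi(z)
        grad = y * ratio                             # d log p(y|f) / df
        W = np.diag(ratio * (ratio + z))             # -d^2 log p(y|f) / df^2
        m = np.linalg.solve(K_inv + W, W @ m + grad) # Newton update
    A = np.linalg.inv(K_inv + W)
    return m, A

def predictive_class_probability(k_star, k_star_star, K, m, A):
    """q(y*=1 | D, x*) = Phi(mu* / sqrt(1 + sigma*^2)), with
    mu* = k*^T K^-1 m and sigma*^2 = k(x*,x*) - k*^T (K^-1 - K^-1 A K^-1) k*."""
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ m
    var = k_star_star - k_star @ (K_inv - K_inv @ A @ K_inv) @ k_star
    return norm.cdf(mu / np.sqrt(1.0 + var))
```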
From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be systematically underestimated, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below.

4 Experiments

In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form

$k(x, x'|\theta) = \sigma^2 \exp\!\left(-\tfrac{1}{2}\,\|x - x'\|^2/\ell^2\right), \qquad (1)$

such that θ = [σ, ℓ]. We refer to σ² as the signal variance and to ℓ as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length-scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function, and we believe the conclusions about the accuracy of approximations to be independent of this choice, since they rely on arguments which are independent of the form of the covariance function. As a measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let $p_{*i} = p(y_*=1|\mathcal{D},\theta,x_{*i})$ be the model’s prediction; then we average

$I(p_{*i}, y_i) = \frac{y_i + 1}{2}\log_2(p_{*i}) + \frac{1 - y_i}{2}\log_2(1 - p_{*i}) + H \qquad (2)$

over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2.
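For concreteness, a small sketch of the covariance function (1) and the information measure (2); the names and vectorisation choices are ours, not the authors’.

```python
import numpy as np

def covariance(X1, X2, sigma, ell):
    """Squared-exponential covariance (eq. 1):
    k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 ell^2)),
    evaluated between all rows of X1 (n1 x d) and X2 (n2 x d)."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma ** 2 * np.exp(-0.5 * sq_dists / ell ** 2)

def average_information(p_star, y_test, H_train):
    """Average information in bits about the test targets (eq. 2), in excess
    of random guessing.  H_train is the entropy (in bits) of the training
    label distribution; labels are in {-1, +1}."""
    y = np.asarray(y_test)
    p = np.asarray(p_star)
    terms = (y + 1) / 2 * np.log2(p) + (1 - y) / 2 * np.log2(1 - p) + H_train
    return terms.mean()
```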
[Figure 2 contour plots omitted; each panel plots log magnitude, log(σf), against log lengthscale, log(ℓ), with panels showing either the log marginal likelihood or the information about the test targets in bits.]

Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively.

[Figure 3 histogram panels omitted.]

Figure 3: Panels (a) and (b) show two marginal distributions p(fi|D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples.

For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well¹, given that the marginal likelihood comes as a 767- and a 200-dimensional integral respectively.

¹ Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale.
The EP predictions contain as much information about the test cases as the MCMC predictions, and significantly more than those of LA. Note that for small signal variances (roughly ln(σ²) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results.

For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi|D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln(ℓ) = 2.85, which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f|D, θ). For LA and EP the approximate marginals are q(fi|D, θ) = N(fi|mi, Aii), where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%.

In the experiment we saw that the marginal distributions of the posterior often agree very well with a Gaussian approximation. This seems to contradict the description given in the previous section, where we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767-dimensional Gaussian N(x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues equal to 1. We then truncated this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi. The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property that, for a truncated high-dimensional Gaussian resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate.

In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well-known benchmark problems. Each data set is randomly split into 10 folds, of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds.
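The synthetic truncated-Gaussian experiment can be reproduced approximately with a simple coordinate-wise Gibbs sampler, since the full conditionals of a truncated multivariate Gaussian are one-dimensional truncated Gaussians. The sketch below assumes the covariance C = I + (99/d)·11ᵀ (one eigenvalue of 100 along the all-ones direction, all others 1); the number of sweeps and the burn-in are guesses, and the authors’ actual sampling scheme is not stated in the text.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_marginal(d=767, n_sweeps=1500, burn_in=500, seed=0):
    """Gibbs sampler for x ~ N(0, C) truncated to x_i >= 0, where
    C = I + (99/d) * ones * ones^T, as in the synthetic experiment.
    Returns post-burn-in samples of the first coordinate x_1.
    Slow but simple; reduce d or n_sweeps for a quick qualitative check."""
    rng = np.random.default_rng(seed)
    c = 99.0 / (100.0 * d)          # magnitude of off-diagonal precision entries
    lam_ii = 1.0 - c                # diagonal entry of the precision C^{-1}
    x = np.abs(rng.standard_normal(d))   # feasible start (all coordinates >= 0)
    total = x.sum()
    samples = []
    for sweep in range(n_sweeps):
        for i in range(d):
            total -= x[i]
            mu = c * total / lam_ii          # conditional mean given x_{-i}
            sd = 1.0 / np.sqrt(lam_ii)       # conditional standard deviation
            a = (0.0 - mu) / sd              # truncate to [0, inf)
            x[i] = truncnorm.rvs(a, np.inf, loc=mu, scale=sd, random_state=rng)
            total += x[i]
        if sweep >= burn_in:
            samples.append(x[0])
    return np.array(samples)

# A histogram of the returned samples should look close to Gaussian, even
# though the joint distribution is truncated at zero in every coordinate.
```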
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8] (a sketch of this mapping is given after Table 1). Results are summarised in Table 1.

Comparing Laplace’s method to EP, the latter proves to be more accurate both in terms of error rate and information. While the error rates are relatively similar, the predictive distribution obtained by EP is more informative about the test targets. Note that for GPC the error rate depends only on the sign of the mean µ∗ of the approximated posterior over latent functions, and not on the entire posterior predictive distribution. As is to be expected, the length ‖m‖ of the mean vector shows much larger values for the EP approximations. Comparing EP and SVMs the results are mixed. For the Crabs data set all methods show the same error rate but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace’s method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small an amplitude, although the sign of the mean function will be mostly correct. As an effect LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we’ve seen that the sign of the latent function tends to be correct (at least at the training locations).

Data Set       m     n    | Laplace: E%   I      ‖m‖     | EP: E%   I      ‖m‖       | SVM: E%   I
Ionosphere     351   34   | 8.84         0.591   49.96   | 7.99    0.661   124.94    | 5.69     0.681
Wisconsin      683   9    | 3.21         0.804   62.62   | 3.21    0.805   84.95     | 3.21     0.795
Pima Indians   768   8    | 22.77        0.252   29.05   | 22.63   0.253   47.49     | 23.01    0.232
Crabs          200   7    | 2.0          0.682   112.34  | 2.0     0.908   2552.97   | 2.0      0.047
Sonar          208   60   | 15.36        0.439   26.86   | 13.85   0.537   15678.55  | 11.14    0.567
USPS 3 vs 5    1540  256  | 2.27         0.849   163.05  | 2.21    0.902   22011.70  | 2.01     0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of inputs n. For Laplace’s method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.
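The SVM predictive probabilities referred to above come from Platt’s sigmoid mapping [8]; a minimal sketch follows. The parameters A and B would normally be fitted by maximum likelihood on held-out decision values; the numbers used here are purely illustrative.

```python
import numpy as np

def platt_sigmoid(svm_scores, A, B):
    """Map unthresholded SVM outputs s to probabilities via
    p = 1 / (1 + exp(A*s + B)), as proposed by Platt [8]."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(svm_scores) + B))

# Illustrative parameters only (A < 0 so larger scores give higher probability).
print(platt_sigmoid([-2.0, 0.0, 2.0], A=-1.0, B=0.0))  # ~[0.12, 0.50, 0.88]
```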
The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].

Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.
6 0.45878884 156 nips-2005-Prediction and Change Detection
7 0.45497957 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
8 0.45325372 122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours
9 0.39508876 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
10 0.39059144 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
11 0.36262172 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence
12 0.33239791 160 nips-2005-Query by Committee Made Real
13 0.33172986 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
14 0.33058095 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
15 0.31007138 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
16 0.30572408 74 nips-2005-Faster Rates in Regression via Active Learning
17 0.29749462 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
18 0.29352558 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
19 0.28936136 2 nips-2005-A Bayes Rule for Density Matrices
20 0.28924912 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
topicId topicWeight
[(3, 0.047), (10, 0.04), (18, 0.014), (27, 0.037), (31, 0.046), (34, 0.095), (35, 0.029), (41, 0.018), (55, 0.026), (69, 0.061), (73, 0.057), (88, 0.092), (91, 0.062), (92, 0.285)]
simIndex simValue paperId paperTitle
1 0.8209604 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing
Author: Guido Dornhege, Benjamin Blankertz, Matthias Krauledat, Florian Losch, Gabriel Curio, Klaus-Robert Müller
Abstract: Brain-Computer Interface (BCI) systems create a novel communication channel from the brain to an output device by bypassing conventional motor output pathways of nerves and muscles. Therefore they could provide a new communication and control option for paralyzed patients. Modern BCI technology is essentially based on techniques for the classification of single-trial brain signals. Here we present a novel technique that allows the simultaneous optimization of a spatial and a spectral filter enhancing discriminability of multi-channel EEG single-trials. The evaluation of 60 experiments involving 22 different subjects demonstrates the superiority of the proposed algorithm. Apart from the enhanced classification, the spatial and/or the spectral filter that are determined by the algorithm can also be used for further analysis of the data, e.g., for source localization of the respective brain rhythms.
2 0.80148458 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
same-paper 3 0.79077727 133 nips-2005-Nested sampling for Potts models
Author: Iain Murray, David MacKay, Zoubin Ghahramani, John Skilling
Abstract: Nested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model. 1
4 0.70577586 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
Author: Rory Sayres, David Ress, Kalanit Grill-spector
Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1
5 0.60920393 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface
Author: Le Song, Evian Gordon, Elly Gysels
Abstract: Motor imagery attenuates EEG µ and β rhythms over sensorimotor cortices. These amplitude changes are most successfully captured by the method of Common Spatial Patterns (CSP) and widely used in brain-computer interfaces (BCI). BCI methods based on amplitude information, however, have not incorporated the rich phase dynamics in the EEG rhythm. This study reports on a BCI method based on phase synchrony rate (SR). SR, computed from the binarized phase locking value, describes the number of discrete synchronization events within a window. Statistical nonparametric tests show that SRs contain significant differences between 2 types of motor imageries. Classifiers trained on SRs consistently demonstrate satisfactory results for all 5 subjects. It is further observed that, for 3 subjects, phase is more discriminative than amplitude in the first 1.5-2.0 s, which suggests that phase has the potential to boost the information transfer rate in BCIs. 1
6 0.5429669 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
7 0.52281153 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
8 0.52246618 102 nips-2005-Kernelized Infomax Clustering
9 0.51995903 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
10 0.5174154 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
11 0.51716709 144 nips-2005-Off-policy Learning with Options and Recognizers
12 0.51699561 30 nips-2005-Assessing Approximations for Gaussian Process Classification
13 0.51594132 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
14 0.51553136 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
15 0.51534194 105 nips-2005-Large-Scale Multiclass Transduction
16 0.51512057 184 nips-2005-Structured Prediction via the Extragradient Method
17 0.51461506 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
18 0.51427805 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
19 0.51426476 121 nips-2005-Location-based activity recognition
20 0.5137558 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs