jmlr jmlr2011 jmlr2011-82 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
Reference: text
sentIndex sentText sentNum sentScore
1 In the nonlinear Gaussian process (GP) regression context, outlier rejection is more complicated and one may consider the posterior distribution of the unknown function values fi = f(xi) locally near some input locations xi. [sent-41, score-0.516]
2 The robust Student-t observation model is p(yi | fi, σ², ν) = Γ((ν+1)/2) / (Γ(ν/2) √(νπ) σ) [1 + (yi − fi)²/(νσ²)]^(−(ν+1)/2), where fi = f(xi), ν is the degrees of freedom and σ the scale parameter (Gelman et al. [sent-96, score-0.974]
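As a side note, a minimal numerical sketch of this density (not from the paper; function and variable names are illustrative) can be written on the log scale for stability:

```python
import numpy as np
from scipy.special import gammaln

def student_t_loglik(y, f, sigma2, nu):
    """log p(y_i | f_i, sigma^2, nu) of the Student-t observation model."""
    r2 = (y - f) ** 2
    return (gammaln((nu + 1.0) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * np.log(nu * np.pi * sigma2)
            - 0.5 * (nu + 1.0) * np.log1p(r2 / (nu * sigma2)))
```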
3 Σ_LA⁻¹ is the Hessian of the negative log conditional posterior at the mode, that is, Σ_LA⁻¹ = −∇∇ log p(f|D, θ, σ², ν)|f=f̂ = K⁻¹ + W, (4) where W is a diagonal matrix with entries Wii = −∇fi ∇fi log p(yi | fi, σ², ν)|fi=f̂i. [sent-120, score-1.391]
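For the Student-t model the diagonal of W is available in closed form; the following sketch is only an illustration, assuming the negative-Hessian convention Wii = −∇fi ∇fi log p(yi | fi) used later in the text:

```python
import numpy as np

def student_t_neg_hessian(y, f, sigma2, nu):
    """W_ii = -d^2/df_i^2 log p(y_i | f_i, sigma^2, nu) for the Student-t model."""
    r = y - f
    a = nu * sigma2
    # positive for |y_i - f_i| < sigma*sqrt(nu), negative outside that interval
    return (nu + 1.0) * (a - r ** 2) / (a + r ** 2) ** 2
```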
4 Assuming the factorizing posterior (5) and minimizing the KL-divergence from q(f, V) to the true posterior p(f, V | D, θ, σ², ν) results in a Gaussian approximation for the latent values, and inverse-χ² (or equivalently inverse gamma) approximations for the residual variances Vi. [sent-141, score-0.47]
5 The method is based on forming a Gaussian lower bound for each likelihood term independently: p(yi | fi) ≥ exp(−fi²/(2γi) + bi fi − h(γi)/2), which can be used to construct a lower bound on the marginal likelihood: p(y|X, θ, ν, σ) ≥ ZVB. [sent-152, score-0.763]
6 In Equation (6) the likelihood terms p(yi | fi, σ², ν) are approximated by un-normalized Gaussian site functions t̃i(fi | Z̃i, µ̃i, σ̃i²) = Z̃i N(fi | µ̃i, σ̃i²). [sent-166, score-1.239]
7 The EP algorithm updates the site parameters Z̃i, µ̃i and σ̃i², and the posterior approximation (6), sequentially. [sent-167, score-0.488]
8 At each iteration (i), first the i'th site is removed from the i'th marginal posterior to obtain a cavity distribution q−i(fi) ∝ q(fi | D, θ, σ², ν) t̃i(fi)⁻¹. [sent-168, score-1.629]
9 Then the i'th site is replaced with the exact likelihood term to form a tilted distribution p̂i(fi) = Ẑi⁻¹ q−i(fi) p(yi | fi), which is a more refined non-Gaussian approximation to the true i'th marginal distribution. [sent-169, score-1.552]
10 Then the parameters of the local approximation t̃i are updated so that the moments of q(fi) match with p̂i(fi): q(fi | D, θ, σ², ν) ∝ q−i(fi) t̃i(fi) ≡ Ẑi N(fi | µ̂i, σ̂i²). (7) [sent-171, score-1.957]
11 Finally, the parameters µ and Σ of the approximate posterior (6) are updated according to the changes in site t̃i. [sent-172, score-0.407]
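A compact sketch of one such sequential update in natural parameters is given below; it is only an illustration (the tilted_moments helper, the damping argument and the variable names are assumptions, not the paper's code), but it follows the cavity / tilted / site-update steps just described:

```python
def ep_site_update(mu_i, sigma2_i, tau_site, nu_site, tilted_moments, delta=1.0):
    """One sequential EP update of site i in natural parameters.

    mu_i, sigma2_i   : current marginal mean and variance of q(f_i)
    tau_site, nu_site: current site precision and location parameters
    tilted_moments   : callable returning (mu_hat, sigma2_hat) of the tilted
                       distribution q_{-i}(f_i) p(y_i | f_i) given the cavity
    delta            : damping factor
    """
    # cavity: remove site i from the marginal posterior
    tau_cav = 1.0 / sigma2_i - tau_site
    nu_cav = mu_i / sigma2_i - nu_site
    if tau_cav <= 0.0:
        # non-positive cavity precision: skip the update (or damp / go fractional)
        return tau_site, nu_site
    mu_cav, sigma2_cav = nu_cav / tau_cav, 1.0 / tau_cav

    # moments of the tilted distribution
    mu_hat, sigma2_hat = tilted_moments(mu_cav, sigma2_cav)

    # damped moment-matching update of the site parameters
    tau_site += delta * (1.0 / sigma2_hat - 1.0 / sigma2_i)
    nu_site += delta * (mu_hat / sigma2_hat - mu_i / sigma2_i)
    # the joint approximation (6) is then refreshed to reflect the new site
    return tau_site, nu_site
```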
12 The normalization terms Z̃i are required for the marginal likelihood approximation ZEP ≈ p(y|X, θ, σ², ν), which is computed after convergence of the algorithm, and they can be determined by integrating over fi in Equation (7), which gives Z̃i = Ẑi (∫ q−i(fi) N(fi | µ̃i, σ̃i²) dfi)⁻¹. [sent-175, score-1.435]
13 In parallel EP the site updates are calculated with fixed posterior marginals µ and diag(Σ) for all t̃i, i = 1, . . . , n. [sent-181, score-0.533]
14 The marginal likelihood approximation is given by log ZEP = −(1/2) log |K + Σ̃| − (1/2) µ̃ᵀ (K + Σ̃)⁻¹ µ̃ + ∑_{i=1}^n log Ẑi(σ², ν) + CEP, (8) where CEP = −(n/2) log(2π) − ∑i log ∫ q−i(fi) N(fi | µ̃i, σ̃i²) dfi collects terms that are not explicit functions of θ, σ² or ν. [sent-187, score-1.105]
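The hyperparameter-dependent part of (8) is straightforward to evaluate once the site parameters are available; the sketch below is illustrative (it drops CEP, which per the text is not an explicit function of the hyperparameters) and assumes Σ̃ is passed as the vector of site variances with K + Σ̃ positive definite:

```python
import numpy as np

def log_zep_explicit_terms(K, mu_tilde, sigma2_tilde, log_zhat):
    """-0.5*log|K + S| - 0.5*mu~^T (K + S)^{-1} mu~ + sum_i log Zhat_i,
    with S = diag(sigma2_tilde); the constant C_EP of Equation (8) is omitted."""
    B = K + np.diag(sigma2_tilde)
    L = np.linalg.cholesky(B)            # assumes K + S is positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, mu_tilde))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * logdet - 0.5 * mu_tilde @ alpha + np.sum(log_zhat)
```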
15 If the algorithm has converged, that is, p̂i(fi) is consistent (has the same means and variances) with q(fi) for all sites, then CEP, Σ̃ and µ̃ can be considered constants when differentiating (8) with respect to the hyperparameters (Seeger, 2005; Opper and Winther, 2005). [sent-188, score-0.754]
16 The inner maximization affects only the first two terms and ensures that the marginal moments of the current posterior approximation q(f) are equal to the moments of the tilted distributions p̂i(fi) for fixed λs. [sent-207, score-0.918]
17 The outer minimization ensures that the moments of qsi(fi) are equal to the marginal moments of q(f). [sent-208, score-0.627]
18 At convergence, q(fi), p̂i(fi), and qsi(fi) share the same moments up to the second order. [sent-209, score-1.122]
19 The Hessian of the first term with respect to λ− is well defined (and negative semi-definite) only if the tilted distributions p̂i(fi) ∝ p(yi | fi) q−i(fi) are proper probability distributions with finite moments up to the fourth order. [sent-213, score-1.216]
20 Therefore, to ensure that the product of q−i ( fi ) and the Student-t site p(yi | fi ) has finite moments and that the inner-loop moment matching remains meaningful, the cavity precisions τ−i have to be kept positive. [sent-214, score-1.465]
21 Furthermore, since the cavity distributions can be regarded as estimates for the leave-one-out (LOO) distributions of the latent values, τ−i = 0 would correspond to a situation where q(fi | y−i, X) has infinite variance, which does not make sense given the Gaussian prior assumption (1). [sent-215, score-0.579]
22 In fractional EP the cavity distributions are defined as q−i(fi) ∝ q(fi | D, θ, ν, σ²) / t̃i(fi)^η for a fraction parameter η ∈ (0, 1]. [sent-219, score-1.273]
23 The tilted distribution is defined as p̂i(fi) ∝ q−i(fi) p(yi | fi)^η, and the site parameters are updated so that the moments of q−i(fi) t̃i(fi)^η ∝ q(fi) match with q−i(fi) p(yi | fi)^η. [sent-220, score-2.953]
24 In fractional EP the natural parameters of the cavity distribution are given by τ−i = σi⁻² − η τ̃i and ν−i = σi⁻² µi − η ν̃i, (11) and the site updates (with damping factor δ) by ∆τ̃i = δ η⁻¹ (σ̂i⁻² − σi⁻²) and ∆ν̃i = δ η⁻¹ (σ̂i⁻² µ̂i − σi⁻² µi). (12) [sent-222, score-0.76]
25 The fractional update step min_q KL(p̂i(fi) || q(fi)) can be viewed as minimization of the α-divergence with α = η (Minka, 2005). [sent-223, score-0.772]
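Read literally, Equations (11)-(12) amount to the following small helpers; this is only a sketch with illustrative argument names, where the marginal natural parameters τi = σi⁻² and νi = σi⁻² µi are assumed to be precomputed:

```python
def fractional_cavity(tau_i, nu_i, tau_site, nu_site, eta):
    """Equation (11): cavity natural parameters with a fraction eta of site i removed."""
    return tau_i - eta * tau_site, nu_i - eta * nu_site

def fractional_site_delta(tau_i, nu_i, tau_hat, nu_hat, eta, delta):
    """Equation (12): damped increments of the site parameters, where
    tau_hat = sigma_hat_i^{-2} and nu_hat = sigma_hat_i^{-2} mu_hat_i come
    from the moments of the tilted distribution."""
    return delta / eta * (tau_hat - tau_i), delta / eta * (nu_hat - nu_i)
```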
26 Compared to the KL-divergence, minimizing the α-divergence with 0 < α < 1 does not force q(fi) to cover as much of the probability mass of p̂i(fi) whenever p̂i(fi) > 0. [sent-224, score-1.001]
27 As a consequence, fractional EP tends to underestimate the variance and normalization constant of q−i(fi) p(yi | fi)^η, and also the approximate marginal likelihood ZEP. [sent-225, score-0.912]
28 In case of multiple modes, the approximation tries to represent the overall uncertainty in p̂i(fi) more exactly the closer α is to 1. [sent-227, score-0.414]
29 First, when evaluating the moments of q−i(fi) p(yi | fi)^η, setting η < 1 flattens the likelihood term, which alleviates the possible convergence problems related to multimodality. [sent-231, score-0.737]
30 Second, the fractional updates help to avoid the cavity precisions becoming too small, or even negative. [sent-233, score-0.659]
31 This decreases the cavity variances which in turn makes the tilted moment integrations and the subsequent EP updates (12) more robust. [sent-235, score-0.574]
32 For example, Seeger (2008) reports that with an underdetermined linear model combined with a log-concave Laplace prior the cavity precisions remain positive but they may become very small which induces numerical inaccuracies in the analytical moment evaluations. [sent-237, score-0.526]
33 Robust Implementation of the Parallel EP Algorithm. The sequential EP updates are shown to be stable for models in which the exact site terms (in our case the likelihood functions p(yi | fi)) are log-concave (Seeger, 2008). [sent-241, score-0.717]
34 It follows that the variances of the cavity distributions q−i ( fi ) are positive and thus also the subsequent moment evaluations of q−i ( fi )p(yi | fi ) are numerically robust. [sent-243, score-1.256]
35 The non-log-concave Student-t likelihood is problematic because both the conditional posterior p(f|D, θ, ν, σ) and the tilted distributions p̂i(fi) may become multimodal. [sent-244, score-0.749]
36 The double-loop algorithm is a rigorous approach that is guaranteed to converge to a stationary point of the objective function (10) when the site terms p(yi | fi ) are bounded from below. [sent-246, score-0.544]
37 The downside is that the double-loop algorithm can be much slower than, for example, parallel EP because it spends much computational effort during the inner-loop iterations, especially in the early stages when qsi(fi) are poor approximations of the true marginals. [sent-247, score-0.523]
38 Parallel EP can also be interpreted as a variant of the double-loop algorithm where only one inner-loop optimization step is done by moment matching (7) and each such update is followed by an outer-loop refinement of the marginal approximations qsi ( fi ). [sent-251, score-0.607]
39 , n} with qsi(fi) = q(fi) = N(µi, Σii), updating the sites (9), and updating the posterior (6). [sent-255, score-0.972]
40 The outer-loop step consists of setting qsi ( fi ) equal to the new marginal distributions q( fi ). [sent-256, score-0.8]
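The resulting inner/outer structure could be sketched roughly as follows; this is an illustration only (the tilted_moments helper, the damping value and the dense matrix algebra are assumptions made for brevity, not the paper's implementation):

```python
import numpy as np

def parallel_ep_sweep(K, tau_site, nu_site, tilted_moments, eta=1.0, delta=0.8):
    """One inner-loop step with fixed marginals, followed by the outer-loop
    refresh q_si(f_i) = q(f_i), i.e., recomputing the posterior approximation."""
    n = K.shape[0]
    Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_site))  # naive; fine for a sketch
    mu = Sigma @ nu_site
    for i in range(n):
        tau_i, nu_i = 1.0 / Sigma[i, i], mu[i] / Sigma[i, i]
        tau_cav = tau_i - eta * tau_site[i]
        if tau_cav <= 0.0:
            continue                  # skip sites with non-positive cavity precision
        mu_cav, s2_cav = (nu_i - eta * nu_site[i]) / tau_cav, 1.0 / tau_cav
        mu_hat, s2_hat = tilted_moments(i, mu_cav, s2_cav)
        tau_site[i] += delta / eta * (1.0 / s2_hat - tau_i)
        nu_site[i] += delta / eta * (mu_hat / s2_hat - nu_i)
    Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_site))  # outer-loop refresh
    return tau_site, nu_site, Sigma, Sigma @ nu_site
```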
41 The downside is that this requires one additional evaluation of the tilted moments for every site per iteration, but if these one-dimensional integrals are implemented efficiently this is a reasonable price for stability. [sent-262, score-0.47]
42 The oscillations can be reduced by updating qsi(fi) only after the moments of p̂i(fi) and q(fi) are consistent for all i = 1, . . . , n. [sent-269, score-1.166]
43 At each update, check also that the new cavity precisions are positive, and if not, continue the inner-loop iterations with the previous qsi(fi) until better moment consistency is achieved, or switch to fractional updates. [sent-273, score-1.081]
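As a hypothetical illustration of this check (not from the paper), a helper verifying the positivity of all cavity precisions after a posterior update might look like:

```python
import numpy as np

def cavity_precisions_positive(Sigma, tau_site, eta=1.0):
    """True if all cavity precisions tau_{-i} = sigma_i^{-2} - eta*tau_site_i > 0."""
    return np.all(1.0 / np.diag(Sigma) - eta * tau_site > 0.0)
```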
44 If no parallel initialization is done, the consistency between p̂i(fi) and q(fi) often cannot be achieved during the first 5-10 iterations, when the step size δ is limited according to modification 2. [sent-275, score-0.744]
45 This is an indication of q(f) being too inflexible an approximation for the tilted distributions with the current qsi(fi). [sent-276, score-0.605]
46 An outer-loop update qsi ( fi ) = q( fi ) usually helps in these cases. [sent-277, score-0.708]
47 This may not be enough to suppress all oscillations of the site parameters, but in practice more frequent outer-loop refinements of qsi(fi) were found to require fewer computationally expensive objective evaluations for convergence. [sent-297, score-0.714]
48 In some rare cases, for example when the noise level σ is very small, the outer-loop update of qsi(fi) may result in negative values for some of the cavity variances even though the inner-loop optimality is satisfied. [sent-298, score-0.717]
49 Other Implementation Details. The EP updates require evaluation of the moments mk = ∫ fi^k gi(fi) dfi for k = 0, 1, 2, where we have defined gi(fi) = q−i(fi) p(yi | fi)^η. [sent-303, score-1.659]
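These one-dimensional integrals can be approximated, for instance, with a simple grid-based quadrature; the sketch below (illustrative grid width and size, reusing the student_t_loglik helper sketched earlier) spans the cavity mean and the observation, where the modes lie:

```python
import numpy as np

def tilted_moments_grid(mu_cav, s2_cav, y, sigma2, nu, eta=1.0, n_grid=2000, width=10.0):
    """Approximate the normalization, mean and variance of
    g_i(f) = q_{-i}(f) p(y_i | f)^eta on a regular grid."""
    s_cav = np.sqrt(s2_cav)
    f = np.linspace(min(mu_cav, y) - width * s_cav, max(mu_cav, y) + width * s_cav, n_grid)
    log_g = (-0.5 * (f - mu_cav) ** 2 / s2_cav - 0.5 * np.log(2.0 * np.pi * s2_cav)
             + eta * student_t_loglik(y, f, sigma2, nu))
    g = np.exp(log_g - log_g.max())          # rescale for numerical stability
    df = f[1] - f[0]
    m0 = np.sum(g) * df
    mu_hat = np.sum(f * g) * df / m0
    s2_hat = np.sum((f - mu_hat) ** 2 * g) * df / m0
    return m0 * np.exp(log_g.max()), mu_hat, s2_hat   # (Zhat, mu_hat, sigma2_hat)
```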
50 The integrand gi ( fi ) may have one or two modes between the cavity mean µ−i and the observation yi . [sent-307, score-0.609]
51 The first sum term can be evaluated safely if the cavity precisions τ−i and the tilted variances σ̂i² remain positive during the EP updates, because at convergence τsi = σ̂i⁻². [sent-312, score-0.753]
52 The amount of increase depends on how far the posterior mean estimate of the unknown function value, E( fi |D ), is from the observation yi . [sent-340, score-0.514]
53 Some insight into this behavior is obtained by considering the negative Hessian of log p(yi | fi, ν, σ²), that is, Wi = −∇²fi log p(yi | fi), as a function of fi (compare to the Laplace approximation in Section 3. [sent-341, score-0.979]
54 Wi is positive when yi − σ√ν < fi < yi + σ√ν, attains its negative minimum when fi = yi ± σ√(3ν), and approaches zero as |fi| → ∞. [sent-343, score-1.018]
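These sign and minimum properties can be verified numerically with the student_t_neg_hessian sketch from above (the parameter values here are arbitrary illustrations):

```python
import numpy as np

y, sigma2, nu = 0.0, 1.0, 4.0
f = np.linspace(y - 8.0, y + 8.0, 4001)
W = student_t_neg_hessian(y, f, sigma2, nu)
inside = np.abs(y - f) < np.sqrt(nu * sigma2)
assert np.all(W[inside] > 0.0) and np.all(W[~inside] <= 0.0)
# the most negative value sits near |y - f| = sigma * sqrt(3*nu)
print(abs(y - f[np.argmin(W)]), np.sqrt(3.0 * nu * sigma2))
```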
55 If there is a disagreement between the cavity distribution q−i(fi) = N(µ−i, σ−i²) and the likelihood p(yi | fi) but the observation is not a clear outlier, the uncertainty in the tilted distribution increases towards the observation and the tilted distribution can even become bimodal. [sent-351, score-1.323]
56 The site precisions corresponding to the outlying observations may become negative but their absolute values remain small compared to the site precisions of the regular observations. [sent-354, score-0.999]
57 However, if some of the negative sites become very small they may notably decrease the approximate marginal precisions τi = σi⁻² of the a priori dependent sites because of the prior correlations defined by K. [sent-355, score-0.592]
58 It follows that the uncertainty in the cavity distributions may increase considerably, that is, the cavity precisions τ−i = τi − τ̃i may become very small or negative. [sent-356, score-0.507]
59 In this example the posterior mass of the length-scale is concentrated at a sufficiently large value so that the GP prior is stiff and keeps the marginal posterior p(f|D) (shown in the lower left panel) and the conditional posterior p(f|D, θ̂, ν̂, σ̂²) at the MAP estimate unimodal. [sent-361, score-0.524]
60 , yn , the cavity distributions q−1 ( f1 ) and q−2 ( f2 ) would differ strongly from the approximative marginal posterior distributions q( f1 ) and q( f2 ). [sent-375, score-0.55]
61 This difference would lead to very small (or even negative) cavity precisions τ−1 and τ−2 during the EP iterations, which causes stability problems as will be illustrated in Section 5. [sent-376, score-0.505]
62 , yn) obtained by one round of undamped sequential EP updates on sites t̃i(fi), for i = 3, . . . , n. [sent-477, score-0.561]
63 5 of the likelihood terms included in the tilted distribution, which corresponds to fractional EP updates on these sites. [sent-486, score-0.413]
64 Therefore, during the second sweep the cavity precision τ−1 = σ1⁻² − τ̃1 becomes negative, and site 1 can no longer be updated. [sent-499, score-0.478]
65 If the EP updates were done in parallel, both the cavity and the site precisions would be positive after the first posterior update, but q( f1 , f2 ) would be tightly centered between the modes. [sent-500, score-0.923]
66 After a couple of parallel loops over all the sites, one of the problematic sites gets a too small (negative) precision, because the approximation needs to be expanded to cover all the marginal uncertainty in the tilted distributions, which leads to a negative cavity precision for the other site. [sent-501, score-0.825]
67 Skipping updates on the sites with negative cavity variances can keep the algorithm numerically stable (see, for example, Minka and Lafferty, 2002). [sent-502, score-0.502]
68 Also increasing damping reduces ∆τ̃i so that negative cavity precisions are less likely to emerge. [sent-503, score-0.594]
69 Combining such a cavity distribution with the likelihood term p(y1 | f1) gives a tilted distribution with significant mass around both observations. [sent-507, score-0.465]
70 The site precisions are decreased for a few iterations after which the posterior marginals are so wide that the variances of the tilted distributions are smaller than the posterior marginal variances. [sent-510, score-1.098]
71 At this point the site precisions start again to increase gradually. [sent-511, score-0.472]
72 This leads to oscillation between small and large site precisions as illustrated in Figure 3. [sent-512, score-0.472]
73 For example, a sequential EP update can be considered as one inner-loop step where only one site is updated, followed by an outer-loop step which updates all the marginal posteriors as qsi(fi) = q(fi). [sent-515, score-1.151]
74 It follows that also the one-dimensional tilted distributions have smaller variances and the consecutive fractional updates (12) of the sites 1 and 2 do not widen the marginal variances σ1² and σ2² as much. [sent-528, score-0.66]
75 This helps to keep the cavity precisions positive by increasing the approximate marginal posterior precisions and reducing the possible negative increments on the site precisions τ̃1 and τ̃2. [sent-529, score-1.44]
76 In addition, the property that a fraction (1 − η) of the site precisions is left in the cavity distributions helps to keep the cavity precisions positive during the algorithm. [sent-531, score-1.183]
77 For each method both the objective − log ZEP and the site precisions τ̃i related to data points y1, . [sent-544, score-0.472]
78 the moments of p̂i(fi) and q(fi) are consistent for fixed λs before updating qsi(fi). [sent-550, score-1.122]
79 For example, a poor choice of δ may require many iterations for achieving inner-loop consistency in examples 1 or 2, and a too large δ can easily lead to a decrease of the inner-loop objective function or even negative cavity precisions for the sites 1 or 2. [sent-551, score-0.652]
80 The panel below shows the site precisions corresponding to the observations y1, . [sent-560, score-0.531]
81 Whenever τ1 and τ2 become very small they also inflict a large decrease in the site precisions of the nearby sites 3 and 4. [sent-565, score-0.592]
82 Also the marginal likelihood is not defined at many iterations because of negative cavity precisions. [sent-572, score-0.452]
83 However, compared to sequential or parallel EP, the convergence is very slow and it takes over 100 iterations to get the site parameters to the level that sequential EP attains with only a couple of iterations. [sent-575, score-0.443]
84 There is still some slow drift visible in the site parameters after 20 iterations but changes in the marginal likelihood estimate are very small. [sent-577, score-0.425]
85 Because of the different divergence measure the posterior approximation is more localized (see Figure 1) and also the cavity distributions are closer to the respective marginal distributions. [sent-581, score-0.512]
86 It follows that the site precisions related to y1 and y2 are larger and no damping is required to keep the updates stable. [sent-582, score-0.635]
87 In the second example there are more visible differences: standard EP tends to overestimate the marginal likelihood due to the larger posterior uncertainties (see Figure 1) whereas fractional EP underestimates it slightly. [sent-625, score-0.416]
88 With a Gaussian prior on f and a log-concave likelihood, each site approximation increases the posterior precision and all the site precisions remain positive throughout the EP iterations as was shown by Seeger (2008). [sent-829, score-0.925]
89 With a non-log-concave likelihood, however, negative site precisions may occur. [sent-830, score-0.499]
90 The negative site precisions are natural and well justified because a non-log-concave likelihood can generate local increases of the posterior uncertainty which cannot otherwise be modeled with the Gaussian approximation. [sent-831, score-0.733]
91 (2009), with the Student-t model the negative site precisions correspond to the outlying observations. [sent-833, score-0.527]
92 Through the prior covariances of f, the negative site precisions also decrease the approximate marginal posterior precisions of the other site approximations with positive site precisions. [sent-834, score-1.48]
93 This may become a problem during the sequential or the parallel EP iterations if some of the approximate marginal posterior precisions decrease close to the level of the corresponding site precisions. [sent-835, score-0.843]
94 In such cases the respective cavity precisions become very small which can both induce numerical instabilities in the tilted moment integrations (Seeger, 2008) and make the respective sites very sensitive to the subsequent EP updates. [sent-836, score-0.811]
95 If the EP updates are not constrained some of the cavity precisions may also become negative in which case the tilted moments and the following updates are no longer well defined. [sent-837, score-0.866]
96 Damping takes more conservative update steps so that the negative site precision increments are less likely to decrease the other cavity precisions too much. [sent-839, score-0.738]
97 Fractional EP keeps the cavity precisions larger by leaving a fraction of the site precisions in the cavity, but leads to a different approximation which may underestimate the posterior uncertainties. [sent-840, score-1.394]
98 3), for instance a scheme, where the inner-loop optimization of the more difficult sites (whose cavity distributions differ notably from the respective marginals) was done sequentially and the remaining sites were optimized with parallel updates, could lead to better overall performance. [sent-855, score-0.537]
99 In our examples these situations were related to two modes in the conditional posterior (caused by two outliers) quite far away from each other which requires a very large local increase of the marginal variances from the unimodal posterior approximation. [sent-860, score-0.455]
100 If for certain sites most of the LOO information comes from the corresponding site approximations, there is reason to suspect that the approximation is not suitable. [sent-867, score-0.43]
wordName wordTfidf (topN-words)
[('ep', 0.556), ('fi', 0.305), ('cavity', 0.239), ('site', 0.239), ('precisions', 0.233), ('fvb', 0.183), ('vanhatalo', 0.176), ('tilted', 0.165), ('posterior', 0.144), ('sites', 0.12), ('fractional', 0.119), ('nickisch', 0.111), ('vb', 0.101), ('hyperparameters', 0.101), ('anki', 0.098), ('ehtari', 0.098), ('qsi', 0.098), ('damping', 0.095), ('laplace', 0.094), ('marginal', 0.092), ('outliers', 0.076), ('mcmc', 0.076), ('approximative', 0.075), ('yl', 0.075), ('hyperparameter', 0.073), ('updates', 0.068), ('moments', 0.066), ('la', 0.062), ('likelihood', 0.061), ('housing', 0.061), ('panel', 0.059), ('parallel', 0.058), ('ikelihood', 0.057), ('rocess', 0.057), ('gp', 0.057), ('moment', 0.054), ('aussian', 0.054), ('egression', 0.052), ('rasmussen', 0.05), ('predictive', 0.05), ('kuss', 0.05), ('zep', 0.05), ('compressive', 0.049), ('variances', 0.048), ('mae', 0.046), ('oscillations', 0.044), ('opper', 0.044), ('sequential', 0.044), ('minka', 0.043), ('seeger', 0.043), ('pi', 0.043), ('heskes', 0.041), ('winther', 0.04), ('ga', 0.04), ('vehtari', 0.039), ('yi', 0.038), ('approximation', 0.037), ('sm', 0.037), ('outlier', 0.037), ('latent', 0.035), ('approximations', 0.034), ('strength', 0.034), ('variational', 0.033), ('aki', 0.033), ('aalto', 0.033), ('fep', 0.033), ('jarno', 0.033), ('iterations', 0.033), ('si', 0.033), ('robust', 0.032), ('mode', 0.032), ('propagation', 0.032), ('problematic', 0.031), ('rejection', 0.03), ('ais', 0.03), ('std', 0.03), ('underestimate', 0.03), ('gelman', 0.03), ('uncertainty', 0.029), ('inference', 0.029), ('loop', 0.028), ('factorizing', 0.028), ('zoeter', 0.028), ('outlying', 0.028), ('observation', 0.027), ('negative', 0.027), ('unimodal', 0.027), ('gaussian', 0.026), ('hannes', 0.026), ('mlpd', 0.026), ('pasi', 0.026), ('credible', 0.025), ('damped', 0.025), ('convergence', 0.025), ('matching', 0.024), ('se', 0.024), ('ti', 0.024), ('concrete', 0.023), ('lengthscale', 0.022), ('gpml', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 82 jmlr-2011-Robust Gaussian Process Regression with a Student-t Likelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
2 0.47757989 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
3 0.13590088 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
4 0.11200319 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.08921285 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
6 0.076851748 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
7 0.06917505 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
8 0.059832621 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
9 0.057155233 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
10 0.051435776 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
11 0.046884716 61 jmlr-2011-Logistic Stick-Breaking Process
12 0.045888551 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
13 0.045606814 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
14 0.045600954 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
15 0.045461643 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.040955883 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
17 0.03380933 12 jmlr-2011-Bayesian Co-Training
18 0.033691306 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
19 0.032346055 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
20 0.032056771 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
topicId topicWeight
[(0, 0.229), (1, -0.2), (2, -0.361), (3, 0.149), (4, 0.209), (5, -0.359), (6, -0.103), (7, -0.148), (8, 0.332), (9, -0.276), (10, 0.122), (11, -0.052), (12, -0.014), (13, 0.076), (14, 0.018), (15, -0.047), (16, 0.058), (17, 0.021), (18, 0.031), (19, -0.016), (20, -0.014), (21, -0.06), (22, -0.061), (23, 0.052), (24, 0.018), (25, -0.015), (26, 0.013), (27, 0.007), (28, 0.01), (29, -0.012), (30, -0.006), (31, 0.021), (32, 0.001), (33, 0.01), (34, 0.009), (35, 0.019), (36, 0.014), (37, -0.049), (38, 0.084), (39, 0.021), (40, 0.03), (41, 0.008), (42, -0.009), (43, 0.005), (44, -0.001), (45, 0.03), (46, 0.04), (47, -0.02), (48, -0.01), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.97350073 82 jmlr-2011-Robust Gaussian Process Regression with a Student-t Likelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
2 0.94376332 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
3 0.44157481 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
4 0.27919421 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Lo` ve expansion of a Gaussian process (GP). Thus, our sparse e modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
5 0.26275966 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
6 0.25643131 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
7 0.23500825 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.22047788 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
9 0.20153721 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
10 0.19983929 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
11 0.17746249 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
12 0.17041823 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
13 0.16663714 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.15532923 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
15 0.14688201 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
16 0.14317796 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
17 0.14268051 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
18 0.14151569 61 jmlr-2011-Logistic Stick-Breaking Process
19 0.13378552 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
20 0.12929274 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
topicId topicWeight
[(4, 0.025), (9, 0.018), (10, 0.039), (24, 0.039), (31, 0.084), (32, 0.028), (41, 0.022), (60, 0.024), (65, 0.011), (70, 0.394), (73, 0.131), (78, 0.069), (90, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.8140645 82 jmlr-2011-Robust Gaussian Process Regression with a Student-t Likelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
2 0.74921161 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
3 0.68363398 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
4 0.45498878 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
5 0.41014564 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
6 0.40526816 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
7 0.38711476 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.37176654 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
9 0.37083653 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
10 0.36533418 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
11 0.36374763 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
12 0.36288437 12 jmlr-2011-Bayesian Co-Training
13 0.36079261 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.35951862 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
15 0.3568705 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
16 0.35678414 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
17 0.35636094 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
18 0.35099012 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
19 0.34596038 48 jmlr-2011-Kernel Analysis of Deep Networks