jmlr jmlr2008 jmlr2008-16 knowledge-graph by maker-knowledge-mining

16 jmlr-2008-Approximations for Binary Gaussian Process Classification


Source: pdf

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplace's approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. [sent-7, score-0.312]

2 Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. [sent-8, score-0.312]

3 The marginal likelihood for any Gaussian approximate posterior can be lower bounded using Jensen’s inequality, but the specific approximation schemes also come with their own marginal likelihood approximations. [sent-39, score-0.681]

4 This is achieved by using a latent function f whose value is mapped into the unit interval by means of a sigmoid function sig : R → [0, 1] such that the class membership probability P (y = +1|x) can be written as sig ( f (x)). [sent-47, score-0.85]

5 If the sigmoid function satisfies the point symmetry condition sig(t) = 1 − sig(−t), the likelihood can be compactly written as P (y|x) = sig (y · f (x)) . [sent-49, score-0.516]
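
A small sketch (ours, not from the paper's accompanying code) of the two sigmoids used for GP classification, the logistic and the cumulative Gaussian, with a numerical check of the point symmetry sig(t) = 1 − sig(−t) that allows writing the likelihood as sig(y · f(x)):

    import numpy as np
    from scipy.stats import norm

    def sig_logit(t):
        # logistic sigmoid, maps R to (0, 1)
        return 1.0 / (1.0 + np.exp(-t))

    def sig_probit(t):
        # cumulative Gaussian (probit) sigmoid, maps R to (0, 1)
        return norm.cdf(t)

    t = np.linspace(-5.0, 5.0, 11)
    for sig in (sig_logit, sig_probit):
        # point symmetry sig(t) = 1 - sig(-t) lets one write P(y|x) = sig(y * f(x))
        assert np.allclose(sig(t), 1.0 - sig(-t))
        f, y = 1.3, -1  # latent value and a label in {-1, +1}
        print(sig.__name__, sig(y * f))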

6 Given the latent function f, the class labels are assumed to be Bernoulli distributed and independent random variables, which gives rise to a factorial likelihood, factorizing over data points (see Figure 1): P(y| f ) = P(y|f) = ∏_{i=1}^n P(y_i|f_i) = ∏_{i=1}^n sig(y_i f_i). [sent-75, score-0.992]

7 Finally, the predictive class membership probability p* := P(y* = 1|x*, y, X, θ) is obtained by averaging out the test set latent variables: P(y*|x*, y, X, θ) = ∫ P(y*|f*) P(f*|x*, y, X, θ) df* = ∫ sig(y* f*) P(f*|x*, y, X, θ) df*. [sent-84, score-0.509]
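
Assuming a Gaussian approximation N(f*|µ*, σ*²) to P(f*|x*, y, X, θ) is available, the one-dimensional average defining p* can be computed by Gauss-Hermite quadrature; the following sketch uses the logistic sigmoid and illustrative moments:

    import numpy as np

    def predictive_prob(mu_star, var_star, n_quad=32):
        # p* ~= integral of sig(f) N(f|mu, var) df, via Gauss-Hermite quadrature
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)   # weight function exp(-x^2/2)
        f = mu_star + np.sqrt(var_star) * x                 # change of variables to N(mu, var)
        sig = 1.0 / (1.0 + np.exp(-f))
        return np.dot(w, sig) / np.sqrt(2.0 * np.pi)

    print(predictive_prob(0.8, 2.0))   # probability of class +1
    print(predictive_prob(0.0, 5.0))   # symmetric posterior gives 0.5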

8 Labels y_i and latent function values f_i are connected through the sigmoid likelihood; all latent function values f_i are fully connected, since they are drawn from the same GP. [sent-92, score-0.479]

9 In the case of very small latent scales (σ_f → 0), the likelihood is flat, causing the posterior to equal the prior. [sent-117, score-0.408]

10 b+c) the prior, d+e) a posterior with n = 7 observations and f+g) a posterior with n = 20 observations along with the n observations with binary labels. [sent-140, score-0.378]

11 Gaussian Approximations: Unfortunately, the posterior over the latent values (Equation 2) is not Gaussian due to the non-Gaussian likelihood (Equation 1). [sent-145, score-0.408]

12 Therefore, the latent distribution (Equation 3), the predictive distribution (Equation 4) and the marginal likelihood Z cannot be written as analytical expressions. [sent-146, score-0.372]

13 However, if sig is concave in the logarithmic domain, the posterior can be shown to be unimodal motivating Gaussian approximations to the posterior. [sent-148, score-0.589]

14 A quadratic approximation to the log likelihood φ(f_i) := ln P(y_i|f_i) at f̃_i, φ(f_i) ≈ φ(f̃_i) + φ′(f̃_i)(f_i − f̃_i) + ½ φ″(f̃_i)(f_i − f̃_i)² = −½ w_i f_i² + b_i f_i + const, motivates the following approximate posterior Q(f|y, X, θ), obtained by replacing ln P(f|y, X, θ) (Equation 2) with its quadratic approximation. [sent-150, score-2.931]
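
To make the expansion concrete, the following sketch (our own illustration) expands the logistic log likelihood φ(f) = ln sig(y f) to second order around an arbitrary point f̃ and reads off w_i and b_i:

    import numpy as np

    def sig(t):
        return 1.0 / (1.0 + np.exp(-t))

    def quad_approx_loglik(y, f_tilde):
        # expand phi(f) = ln sig(y f) around f_tilde:
        # phi(f) ~= -0.5 * w * f**2 + b * f + const
        s = sig(y * f_tilde)
        d1 = y * (1.0 - s)        # phi'(f_tilde)
        d2 = -s * (1.0 - s)       # phi''(f_tilde), negative for the logistic likelihood
        w = -d2                   # curvature, w >= 0
        b = d1 + w * f_tilde      # linear coefficient of the expansion
        return w, b

    w, b = quad_approx_loglik(y=+1, f_tilde=0.5)
    const = np.log(sig(0.5)) + 0.5 * w * 0.5**2 - b * 0.5
    f = 0.5 + 1e-3
    print(w, b, np.log(sig(f)) - (-0.5 * w * f**2 + b * f + const))  # tiny error near f_tilde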

15 A simple toy example employing the cumulative Gaussian likelihood and a squared exponential covariance k(x, x′) = σ_f² exp(−‖x − x′‖²/(2ℓ²)) with length scales ln ℓ = {0, 1, 2. [sent-160, score-0.734]

16 However, all algorithms maintaining a Gaussian posterior approximation work with a diagonal W to enforce the effective likelihood to factorize over examples (as the true likelihood does, see Figure 1) in order to reduce the number of parameters. [sent-186, score-0.473]

17 Another approach to model selection is maximum likelihood type II also known as the evidence framework (MacKay, 1992), where the hyperparameters θ are chosen to maximize the marginal likelihood or evidence P (y|X, θ). [sent-193, score-0.499]

18 The likelihood implements a mechanism for smoothly restricting the posterior along the axis of f_i to the side corresponding to the observed label y_i. [sent-199, score-0.315]

19 Some posterior approximations (Sections 3 and 4) provide an approximation to the marginal likelihood, while other methods provide a lower bound (Sections 5 and 6). [sent-217, score-0.385]

20 Any Gaussian approximation Q (f|θ) = N (f|m, V) to the posterior P (f|y, X, θ) gives rise to a lower bound Z B to the marginal likelihood Z by application of Jensen’s inequality. [sent-218, score-0.478]

21 By Jensen's inequality, ln Z = ln P(y|X, θ) = ln ∫ P(y|f) P(f|X, θ) df = ln ∫ Q(f|θ) [P(y|f) P(f|X, θ)/Q(f|θ)] df ≥ ∫ Q(f|θ) ln [P(y|f) P(f|X, θ)/Q(f|θ)] df =: ln Z_B. [sent-220, score-3.735]

22 This leads to the following expression for ln Z_B (Equation 9): ln Z_B = ∑_{i=1}^n ∫ N(f|0, 1) ln sig(y_i (√V_ii f + m_i)) df + ½ [n − mᵀK⁻¹m + ln |VK⁻¹| − tr(VK⁻¹)]. [sent-222, score-2.325]
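
A minimal sketch of evaluating this bound for a given Gaussian approximation N(m, V); the probit likelihood and the quadrature scheme are our choices, and the kernel matrix in the usage example is illustrative:

    import numpy as np
    from scipy.stats import norm

    def log_ZB(y, K, m, V, n_quad=40):
        # Jensen lower bound ln Z_B of Equation (9) for a Gaussian approximation N(f|m, V)
        n = len(y)
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)
        # data fit: sum_i E_{N(0,1)}[ ln sig(y_i (sqrt(V_ii) x + m_i)) ]
        data_fit = sum(np.dot(w, norm.logcdf(y[i] * (np.sqrt(V[i, i]) * x + m[i])))
                       for i in range(n))
        # regularizer: 0.5 * (n - m' K^{-1} m + ln|V K^{-1}| - tr(V K^{-1}))
        VKinv = V @ np.linalg.inv(K)
        _, logdet = np.linalg.slogdet(VKinv)
        reg = 0.5 * (n - m @ np.linalg.solve(K, m) + logdet - np.trace(VKinv))
        return data_fit + reg

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 1))
    K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-6 * np.eye(5)
    y = np.array([1, -1, 1, 1, -1])
    print(log_ZB(y, K, m=np.zeros(5), V=K.copy()))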

23 The terms in Equation 9 separate into data fit and regularizer contributions; model selection means maximization of ln Z_B. [sent-223, score-0.542]

24 The third term can be rewritten as −ln |I + KW| − tr((I + KW)⁻¹) and yields −∑_{i=1}^n [ln(1 + λ_i) + 1/(1 + λ_i)] with λ_i ≥ 0 being the eigenvalues of KW. [sent-235, score-0.567]

25 Furthermore, the bound ln Z_B = ∫ Q(f|θ) ln [P(f|y, X, θ) P(y|X)/Q(f|θ)] df = ln Z − KL(Q(f|θ) ‖ P(f|y, X, θ)) (Equation 10) can be decomposed into the exact marginal likelihood minus the Kullback-Leibler (KL) divergence between the approximate posterior and the exact posterior. [sent-237, score-1.177]

26 Thus by maximizing the lower bound ln ZB on ln Z, we effectively minimize the KL-divergence between P (f|y, X, θ) and Q (f|θ) = N (f|m, V). [sent-238, score-1.111]

27 Laplace Approximation (LA) A second order Taylor expansion around the posterior mode m leads to a natural way of constructing a Gaussian approximation to the log-posterior Ψ(f) = ln P (f|y, X, θ) (Williams and Barber, 1998; Rasmussen and Williams, 2006, Ch. [sent-241, score-0.812]

28 Posterior: P(f|y, X, θ) ≈ N(f|m, V) = N(f|m, (K⁻¹ + W)⁻¹), with m = argmax_{f∈Rⁿ} P(y|f) P(f|X, θ) and W = −∂² ln P(y|f)/∂f∂fᵀ |_{f=m}, a diagonal matrix with entries −∂² ln P(y_i|f_i)/∂f_i² evaluated at f_i = m_i. [sent-248, score-1.295]

29 ln Z = ln P(y|X, θ) = ln ∫ P(y|f) P(f|X, θ) df = ln ∫ exp(Ψ(f)) df ≈ Ψ(m) + ln ∫ exp(−½ (f − m)ᵀ(K⁻¹ + W)(f − m)) df = ln P(y|m) − ½ mᵀK⁻¹m − ½ ln |I + KW|. [sent-253, score-3.793]
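
A bare-bones sketch of the Laplace approximation for the probit likelihood, using plain Newton iterations to find the mode m and the formula reconstructed above for ln Z_LA (no numerically stabilised formulation; kernel and data are illustrative):

    import numpy as np
    from scipy.stats import norm

    def laplace_approx(K, y, n_iter=20):
        # returns the mode m, the diagonal of W and the Laplace log evidence ln Z_LA
        n = len(y)
        f = np.zeros(n)
        for _ in range(n_iter):
            z = y * f
            r = np.exp(norm.logpdf(z) - norm.logcdf(z))   # N(z)/Phi(z)
            grad = y * r                                  # d ln P(y|f) / df
            W = r * (z + r)                               # -d^2 ln P(y|f) / df^2 (diagonal)
            # Newton step: (K^{-1} + W) f_new = W f + grad, i.e. (I + K W) f_new = K (W f + grad)
            f = np.linalg.solve(np.eye(n) + K * W, K @ (W * f + grad))
        z = y * f
        r = np.exp(norm.logpdf(z) - norm.logcdf(z))
        W = r * (z + r)
        _, logdet = np.linalg.slogdet(np.eye(n) + K * W)  # ln|I + K W|
        lnZ = norm.logcdf(z).sum() - 0.5 * f @ np.linalg.solve(K, f) - 0.5 * logdet
        return f, W, lnZ

    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(-3, 3, size=8))[:, None]
    K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-8 * np.eye(8)
    y = np.sign(X[:, 0])
    m, W, lnZ = laplace_approx(K, y)
    print(lnZ, -8 * np.log(2))   # should typically beat the -n ln 2 baseline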

31 α_i ≈ ∫ (∂/∂f_i) P(y_i|f_i) N(f_i|µ_¬i, σ²_¬i) df_i / ∫ P(y_i|f_i) N(f_i|µ_¬i, σ²_¬i) df_i, with a corresponding relation expressing W_ii⁻¹ through the cavity variance σ²_¬i. [sent-279, score-0.656]

32 KL-Divergence Minimization (KL) In principle, we simply want to minimize a dissimilarity measure between the approximate posterior Q (f|θ) = N (f|m, V) and the exact posterior P (f|y, X, θ). [sent-292, score-0.378]

33 One quantity to minimize is the KL-divergence KL(P(f|y, X, θ) ‖ Q(f|θ)) = ∫ P(f|y, X, θ) ln [P(f|y, X, θ)/Q(f|θ)] df. [sent-293, score-0.542]

34 If, instead, we measure the reverse KL-divergence, we regain tractability: KL(Q(f|θ) ‖ P(f|y, X, θ)) = ∫ N(f|m, V) ln [N(f|m, V)/P(f|y, X, θ)] df =: KL(m, V). [sent-295, score-0.703]

35 Constant terms have been dropped from the expression: KL(m, V) = −∫ N(f) ∑_{i=1}^n ln sig(√v_ii y_i f + m_i y_i) df − ½ ln |V| + ½ mᵀK⁻¹m + ½ tr(K⁻¹V). [sent-299, score-1.817]
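
A sketch of evaluating this objective (our own, probit likelihood); up to the dropped constants it is the negative of the Jensen bound ln Z_B from Equation 9, so minimising KL(m, V) and maximising ln Z_B coincide:

    import numpy as np
    from scipy.stats import norm

    def kl_objective(y, K, m, V, n_quad=40):
        # KL(m, V) up to an additive constant; the quantity minimised by the KL method
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)
        avg = sum(np.dot(w, norm.logcdf(np.sqrt(V[i, i]) * y[i] * x + m[i] * y[i]))
                  for i in range(len(y)))
        _, logdetV = np.linalg.slogdet(V)
        return (-avg - 0.5 * logdetV
                + 0.5 * m @ np.linalg.solve(K, m)
                + 0.5 * np.trace(np.linalg.solve(K, V)))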

36 Individual likelihood bounds P(y_i|f_i) ≥ exp(a_i f_i² + b_i y_i f_i + c_i) for all f_i ∈ R and all i imply P(y|f) ≥ exp(fᵀAf + (b ∘ y)ᵀf + c) =: Q(y|f, A, b, c) for all f ∈ Rⁿ; the bounds are defined in terms of coefficients a_i, b_i and c_i, where ∘ denotes the element-wise product of two vectors. [sent-325, score-0.876]

37 Z = ∫ P(f|X) P(y|f) df ≥ ∫ P(f|X) Q(y|f, A, b, c) df = Z_B. [sent-327, score-0.322]

38 Evaluating the Gaussian integral leads to ln Z_B = ∑_i c_i + ½ (b ∘ y)ᵀ(K⁻¹ − 2A)⁻¹(b ∘ y) − ½ ln |I − 2AK| (Equation 13), which can now be maximized with respect to the coefficients a_i, b_i and c_i. [sent-329, score-0.542]

39 In order to get an efficient algorithm, one has to calculate the first and second derivatives ∂ ln Z_B/∂ς and ∂² ln Z_B/∂ς∂ςᵀ (as done in Appendix A. [sent-330, score-1.118]

40 Hyperparameters can be optimized using the gradient ∂ ln Z_B/∂θ. [sent-332, score-0.542]

41 Logit Bound: optimizing the logistic likelihood function (Gibbs and MacKay, 2000), we obtain the necessary conditions A_ς := −Λ_ς, b_ς := ½ 1, c_{ς,i} := ς_i² λ(ς_i) − ½ ς_i + ln sig_logit(ς_i), where we define λ(ς_i) = (2 sig_logit(ς_i) − 1)/(4 ς_i) and Λ_ς = [λ(ς_i)]_ii. [sent-334, score-0.721]
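
The λ(ς) above defines a quadratic lower bound on the logistic sigmoid (often attributed to Jaakkola and Jordan); the following check (ours) verifies numerically that exp(a f² + b y f + c) lower-bounds sig_logit(y f) everywhere and touches it where y f = ς:

    import numpy as np

    def sig_logit(t):
        return 1.0 / (1.0 + np.exp(-t))

    def logit_bound_coeffs(zeta):
        # per-site coefficients (a, b, c) with P(y|f) >= exp(a f^2 + b y f + c)
        lam = (2.0 * sig_logit(zeta) - 1.0) / (4.0 * zeta)
        return -lam, 0.5, zeta**2 * lam - 0.5 * zeta + np.log(sig_logit(zeta))

    zeta = 1.7
    a, b, c = logit_bound_coeffs(zeta)
    f = np.linspace(-6.0, 6.0, 2001)
    for y in (+1, -1):
        bound = np.exp(a * f**2 + b * y * f + c)
        assert np.all(bound <= sig_logit(y * f) + 1e-12)   # valid lower bound everywhere
    assert np.isclose(np.exp(a * zeta**2 + b * zeta + c), sig_logit(zeta))  # touches at y f = zeta
    print("logit bound verified for zeta =", zeta)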

42 For the cumulative Gaussian likelihood sig_probit(f_i) the necessary conditions are a_ς := −½ 1, b_{ς,i} := ς_i + N(ς_i)/sig_probit(ς_i), c_{ς,i} := ½ ς_i² − b_{ς,i} ς_i + ln sig_probit(ς_i) (Equation 14), which again depend only on a single vector of parameters ς that we optimize using Newton's method. [sent-340, score-1.207]

43 Posterior: based on these local approximations, the approximate posterior can be written as P(f|y, X, θ) ≈ N(f|m, V) = N(f|m, (K⁻¹ + W)⁻¹) with W = −2A_ς and m = V (y ∘ b_ς) = (K⁻¹ − 2A_ς)⁻¹ (y ∘ b_ς), where we have expressed the posterior parameters directly as a function of the coefficients. [sent-343, score-0.378]

44 Therefore, the variational posterior is more constrained than the general Gaussian posterior and thus easier to optimize. [sent-346, score-0.429]

45 Factorial Variational Method (FV) Instead of approximating the posterior P (f|y, X, θ) by the closest Gaussian distribution, one can use the closest factorial distribution Q (f|y, X, θ) = ∏i Q ( fi ), also called ensemble learning (Csató et al. [sent-351, score-0.51]

46 Another kind of factorial approximation Q (f) = Q (f + ) Q (f− )—a posterior factorizing over classes—is used in multi-class classification (Girolami and Rogers, 2006). [sent-353, score-0.331]

47 One finds the best approximation to be of the following form: Q(f_i) ∝ N(f_i|µ_i, σ_i²) P(y_i|f_i), with µ_i = m_i − σ_i² [K⁻¹m]_i = [Kα]_i − σ_i² α_i, σ_i² = 1/[K⁻¹]_ii, and m_i = ∫ f_i Q(f_i) df_i. [sent-356, score-1.495]

48 Since the posterior is factorial, the effective likelihood of the factorial approximation has an odd shape. [sent-359, score-0.457]

49 (Csató et al., 2000): ln Z ≥ ∑_{i=1}^n ln sig(y_i m_i/σ_i) − ½ αᵀ(K − Dg(σ_1², . [sent-371, score-1.597]

50 One can simply ignore the binary nature and use the regression marginal likelihood ln Z_reg as a proxy for ln Z (an approach we only mention but do not use in the experiments): ln Z_reg = −½ αᵀ(K + σ_n²I)α − ½ ln |K + σ_n²I| − (n/2) ln 2π. [sent-389, score-2.94]
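
For completeness, a sketch of this label-regression proxy: the ±1 labels are treated as regression targets with noise variance σ_n², giving the standard GP regression evidence (kernel and noise level below are illustrative):

    import numpy as np

    def log_Z_reg(K, y, sigma_n2):
        # GP regression log marginal likelihood with the +/-1 labels used as targets
        n = len(y)
        Ky = K + sigma_n2 * np.eye(n)
        alpha = np.linalg.solve(Ky, y)
        _, logdet = np.linalg.slogdet(Ky)
        return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(6, 1))
    K = np.exp(-0.5 * (X - X.T) ** 2)
    y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])
    print(log_Z_reg(K, y, sigma_n2=0.1))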

51 Alternatively, the Jensen bound (Equation 8) yields a lower bound ln Z ≥ ln Z_B, which seems more in line with the classification scenario than ln Z_reg. [sent-390, score-1.68]

52 Global methods minimize the KL-divergence KL(Q‖P) = ∫ Q(f) ln [Q(f)/P(f)] df between the posterior P(f) and a tractable family of distributions Q(f). [sent-394, score-0.892]

53 Z_t/Z_{t−1} = ∫ P(y|f)^{τ(t)} P(f|X) df / Z_{t−1} = ∫ [P(y|f)^{τ(t)}/P(y|f)^{τ(t−1)}] · [P(y|f)^{τ(t−1)} P(f|X)/Z_{t−1}] df = ∫ P(y|f)^{Δτ(t)} P(f|y, X, t−1) df ≈ (1/S) ∑_{s=1}^S P(y|f_s)^{Δτ(t)}, where f_s ∼ P(f|y, X, t−1). [sent-424, score-0.5]

54 For the integration we use Z_t = ∫ P(y|f)^{τ(t)} Q(y|f)^{1−τ(t)} P(f|X) df, where Z_0 = ∫ Q(y|f) P(f|X) df can be computed analytically. [sent-435, score-0.322]

55 The finite temperature change bias can be removed by combining the results Z_r from R different runs by their arithmetic mean (1/R) ∑_r Z_r (Neal, 2001): ln Z = ln ∫ P(y|f) P(f|X) df ≈ ln [(1/R) ∑_{r=1}^R Z_r]. [sent-440, score-1.787]
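
To make the annealing scheme tangible, here is a toy annealed importance sampling run on a one-dimensional problem where the answer is known exactly: with a single observation, Z = ∫ sig(y f) N(f|0, σ_f²) df = 1/2 by symmetry. The temperature schedule, the Metropolis kernel and all settings are our own illustrative choices, not those of the paper:

    import numpy as np

    def sig(t):
        return 1.0 / (1.0 + np.exp(-t))

    def ais_toy(y=1.0, sigma_f=2.0, n_temps=50, n_runs=200, seed=0):
        # AIS estimate of Z = int sig(y f) N(f|0, sigma_f^2) df (exact value 1/2)
        rng = np.random.default_rng(seed)
        taus = np.linspace(0.0, 1.0, n_temps + 1)
        log_w = np.zeros(n_runs)
        f = rng.normal(0.0, sigma_f, size=n_runs)        # start from the prior (tau = 0)
        for t in range(1, n_temps + 1):
            log_w += (taus[t] - taus[t - 1]) * np.log(sig(y * f))   # importance weight increment
            # one Metropolis step targeting sig(y f)^tau_t * N(f|0, sigma_f^2)
            prop = f + rng.normal(0.0, 0.5, size=n_runs)
            log_acc = (taus[t] * (np.log(sig(y * prop)) - np.log(sig(y * f)))
                       + 0.5 * (f**2 - prop**2) / sigma_f**2)
            f = np.where(np.log(rng.uniform(size=n_runs)) < log_acc, prop, f)
        return np.mean(np.exp(log_w))                    # arithmetic mean over runs, as above

    print(ais_toy())   # should be close to 0.5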

56 At a second level, building on the “low-level” features, we compare predictive performance in terms of the predictive probability p* given by Equations 4 and 6: p* := P(y* = 1|x*, y, X, θ) ≈ ∫ sig(f*) N(f*|µ*, σ*²) df*. [sent-507, score-0.465]

57 Based on the logistic likelihood function and the squared exponential covariance function with parameters ln ℓ = 2. [sent-578, score-0.713]

58 Mean m and (co)variance V: the posterior process, or equivalently the posterior distribution over the latent values f, is determined by its location parameter m and its width parameter V. [sent-598, score-0.471]

59 We chose the hyperparameters for the non-Gaussian case of Figure 6 to maximize the EP marginal likelihood (see Figure 9), whereas the hyperparameters of Figure 8 were selected to yield a posterior that is almost Gaussian but still has reasonable predictive performance. [sent-620, score-0.554]

60 For large latent function scales σ_f², in the limit σ_f² → ∞, the likelihood becomes a step function, the mode approaches the origin and the curvature at the mode becomes larger. [sent-626, score-0.317]

61 This means that iterative matching of approximate marginal moments leads to accurate marginal moments of the posterior. [sent-629, score-0.334]

62 The KL method minimizes the KL-divergence KL(Q(f) ‖ P(f)) = ∫ Q(f) ln [Q(f)/P(f)] df, with the average taken with respect to the approximate distribution Q(f). [sent-630, score-0.703]

63 For a close-to-Gaussian posterior: using the squared exponential covariance and the logistic likelihood function with parameters ln ℓ = 3 and ln σ_f = 0. [sent-676, score-1.255]

64 Due to the required lower bounding property of each individual likelihood term, the approximate posterior has to obey severe restrictions. [sent-686, score-0.315]

65 The FV method has a special rôle because it does not lead to a Gaussian approximation to the posterior but to the closest (in terms of KL-divergence) factorial distribution. [sent-688, score-0.331]

66 Consider the predictive probability from Equation 16 using a cumulative Gaussian likelihood: p* = ∫ sig_probit(f*) N(f*|µ*, σ*²) df* = sig_probit(µ*/√(1 + σ*²)). [sent-700, score-0.407]
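
This closed-form average of the cumulative Gaussian against a Gaussian is easy to confirm against quadrature (the moments here are illustrative):

    import numpy as np
    from scipy.stats import norm

    mu, var = 0.7, 1.8
    analytic = norm.cdf(mu / np.sqrt(1.0 + var))               # sig_probit(mu / sqrt(1 + sigma^2))
    x, w = np.polynomial.hermite_e.hermegauss(60)
    numeric = np.dot(w, norm.cdf(mu + np.sqrt(var) * x)) / np.sqrt(2.0 * np.pi)
    print(analytic, numeric)                                   # agree to high precision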

67 , ln ℓ ≈ 2) and choose a latent function scale σ_f above some threshold (e. [sent-711, score-0.635]

68 , ln ℓ ≈ 2, ln σ_f ≈ 2) and compensate a harder cutting likelihood (σ_f² ↑) by making the data points more similar to each other (ℓ² ↑). [sent-716, score-1.21]

69 One can read off the divergence between posterior and approximation by recalling KL(Q‖P) = ln Z − ln Z_B from Equation 10 and assuming ln Z_EP ≈ ln Z. [sent-782, score-2.417]

70 The logistic likelihood function (Figure 9(c)) yields much better results than the cumulative Gaussian likelihood function (Figure 11(c)). [sent-806, score-0.311]

71 In Figure 9(d) at ln ℓ = 2 and ln σ_f = 4 only 16 errors are made by the LA method, while the information score (Figure 9(c)) is only 0. [sent-866, score-1.104]

72 Hyperparameters can conveniently be optimized using ln Z, not least because the gradient ∂ ln Z/∂θ can be analytically and efficiently computed for all methods. [sent-872, score-0.542]

73 In principle, evidences are bounded by ln Z ≤ 0 where ln Z = 0 corresponds to a perfect model. [sent-875, score-1.084]

74 The marginal likelihood for a model ignoring the data and having equiprobable targets has the value ln Z = −n ln 2, which serves as a baseline. [sent-878, score-1.314]

75 If the posterior is very skew, the bound inherently underestimates the marginal likelihood. [sent-892, score-0.337]

76 Finally, the FV method yields only a poor approximation to the marginal likelihood due to the factorial approximation (Figure 10). [sent-899, score-0.372]

77 For strongly correlated priors (large ℓ) the evidence drops even below the baseline ln Z = −n ln 2. [sent-901, score-1.134]

78 Practically, the different slopes result in a shift of the latent function length scale on the order of ln 4 − ln √(2π) ≈ 0. [sent-909, score-1.177]

79 Variational parameters ς_i yielding the gradient: the gradient ∂ ln Z_B/∂ς_i is expressed in terms of the quantities l_ς, r_ς, c_ς and the matrix K̃_ς. [sent-967, score-1.084]

81 The Hessian ∂² ln Z_B/∂ς∂ςᵀ combines the second derivatives ∂²c_i/∂ς_i² and ∂r_ς/∂ς with products of K̃_ς and the derivative Ȧ_ς of A_ς. [sent-974, score-1.132]

82 Hyperparameters θ_i: the mixed second derivatives ∂² ln Z_B/∂θ_i∂ς involve ȧ_ς, l_ς and dg(K̃_ς). [sent-979, score-0.666]

83 Hyperparameters θ_i: for a gradient optimization with respect to θ, we need the gradient of the objective, ∂ ln Z_B/∂θ. [sent-986, score-0.542]

84 Naïvely, the gradient is given by: ∂ ln ZB ∂θi = lς = 1 ˜ ˜ −1 ∂K −1 ˜ ˜ ∂K bς K ς K K Kς bς + tr (I − 2Aς K)− Aς 2 ∂θi ∂θi ∂K −1 ∂K 1 . [sent-987, score-0.567]

85 lς K−1 K lς + tr (I − 2Aς K)− Aς 2 ∂θi ∂θi However, the optimal variational parameter ς ∗ depends implicitly on the actual choice of θ and one has to account for that in the derivative by adding an extra “implicit” term ∂ ln ZB (θ, ς) ∂θi = ς=ς ∗ n ∂ ln ZB (θ, ς ∗ ) ∂ ln ZB (θ, ς ∗ ) ∂ς∗ j +∑ . [sent-988, score-1.702]

86 Differentiating the condition defining ς* leads to ∂ς*_θ/∂θ = −[∂² ln Z_B(θ, ς*_θ)/∂ς∂ςᵀ]⁻¹ ∂² ln Z_B(θ, ς*_θ)/∂θ∂ς, which in turn combines to ∂ ln Z_B/∂θ_i |_{ς=ς*} = ∂ ln Z_B/∂θ_i − (∂ ln Z_B/∂ς)ᵀ [∂² ln Z_B/∂ς∂ςᵀ]⁻¹ ∂² ln Z_B/∂θ_i∂ς, where all terms are known. [sent-990, score-2.71]

88 Derivatives for KL: the lower bound ln Z_B to the log marginal likelihood ln Z is given by Equation 9 as ln Z ≥ ln Z_B(m, V) = a(y, m, V) + ½ ln |VK⁻¹| + n/2 − ½ mᵀK⁻¹m − ½ tr(VK⁻¹), where we used the shortcut a(y, m, V) = ∑_{i=1}^n ∫ N(f_i|m_i, v_ii) ln sig(y_i f_i) df_i. [sent-992, score-3.693]

89 As a first step, we calculate the first derivatives of ln Z_B with respect to the posterior moments m and V to derive necessary conditions for the optimum by equating them with zero: ∂ ln Z_B/∂V = ∂a(y, m, V)/∂V + ½ V⁻¹ − ½ K⁻¹ = 0. [sent-993, score-1.359]

90 This yields V = (K⁻¹ − 2 Dg(dg(∂a/∂V)))⁻¹; similarly, ∂ ln Z_B/∂m = ∂a(y, m, V)/∂m − K⁻¹m = 0. [sent-994, score-0.542]

91 These two expressions are plugged into the original expression for ln Z_B using A = (I − 2KΛ)⁻¹ and Λ = Dg(dg(∂a/∂V)) to yield: ln Z_B(α, Λ) = a(y, Kα, (K⁻¹ − 2Λ)⁻¹) + ½ ln |A| − ½ tr A + n/2 − ½ αᵀKα. [sent-996, score-1.626]

92 Parameters α, Λ yielding the gradient: ∂ ln Z_B/∂λ = ∂a/∂λ + dg(V) − dg(VAᵀ) and ∂ ln Z_B/∂α = ∂a/∂α − Kα. [sent-1003, score-1.084]

93 Hyperparameters θ_i: the direct gradient is given by the following equation, where we have marked the dependency of the covariance K on θ_i by subscripts: ∂ ln Z_B(α, Λ)/∂θ_i = (∂a(y, m, V)/∂m)ᵀ (∂K_θ/∂θ_i) α + (∂a(y, m, V)/∂dgV)ᵀ dg(A (∂K_θ/∂θ_i) Aᵀ) − tr(A (∂K_θ/∂θ_i) Λ A) − ½ αᵀ (∂K_θ/∂θ_i) α. [sent-1029, score-0.734]

94 The sigmoids are normalized, sig(−f_i) + sig(f_i) = 1, and the Gaussian is symmetric, N(f_i) = N(−f_i). [sent-1039, score-1.156]

95 Using the symmetry ∫_{−∞}^0 N(f_i|0, σ_f²) df_i = ∫_0^∞ N(f_i|0, σ_f²) df_i, the marginal likelihood is given by Z = ∫ P(y|f) P(f|X, θ) df = ∫ ∏_{i=1}^n sig(y_i f_i) |2πK|^{−1/2} exp(−½ fᵀK⁻¹f) df. [sent-1041, score-1.813]

96 Lengthscale to zero: for K = σ_f² I the prior factorizes and we get Z_{ℓ→0} = ∏_{i=1}^n ∫ sig(y_i f_i) N(f_i|0, σ_f²) df_i = ∏_{i=1}^n ½ = 2^{−n} (Equation 17). [sent-1044, score-0.604]
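
The 2^{−n} limit is easy to confirm numerically for an independent prior; the labels and σ_f in this check are arbitrary:

    import numpy as np
    from scipy.stats import norm

    sigma_f = 3.0
    y = np.array([1, -1, -1, 1, 1])
    x, w = np.polynomial.hermite_e.hermegauss(60)
    w = w / np.sqrt(2.0 * np.pi)
    # with K = sigma_f^2 I the prior factorizes over data points
    per_point = np.array([np.dot(w, norm.cdf(yi * sigma_f * x)) for yi in y])
    print(per_point)                        # each factor is 1/2 by the symmetry of sig and N
    print(np.prod(per_point), 2.0 ** (-len(y)))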

97 In the opposite limit of a fully correlated prior, the n-dimensional integral collapses to a one-dimensional Gaussian average of the sigmoid at a rescaled argument involving σ_f, scaled by 2^{−n+1}. [sent-1049, score-0.793]

98 There are three Gaussian integrals to evaluate, the entropy of the approximate posterior and two other expectations: KL(Q(f|θ) ‖ P(f|y, X, θ)) = ln Z − ½ ln |V| − n/2 − (n/2) ln 2π − ∫ N(f) ∑_{i=1}^n ln sig(√v_ii y_i f + m_i y_i) df + (n/2) ln 2π + ½ ln |K| + ½ mᵀK⁻¹m + ½ tr(K⁻¹V). [sent-1056, score-4.174]

99 Dropping the terms that do not depend on m and V, we arrive at KL(m, V) = −∫ N(f) ∑_{i=1}^n ln sig(√v_ii y_i f + m_i y_i) df − ½ ln |V| + ½ mᵀK⁻¹m + ½ tr(K⁻¹V). [sent-1060, score-1.817]

100 Free-form optimization proceeds by equating the functional derivative with zero: δKL/δQ(f_i) = ln Q(f_i) + 1 − ln P(y_i|f_i) + ½ δ/δQ(f_i) [∫ ∏_{j=1}^n Q(f_j) fᵀK⁻¹f df]. [sent-1068, score-2.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ln', 0.542), ('sig', 0.367), ('kl', 0.222), ('fi', 0.211), ('vb', 0.193), ('posterior', 0.189), ('zb', 0.189), ('fv', 0.186), ('df', 0.161), ('ep', 0.161), ('la', 0.138), ('vii', 0.136), ('likelihood', 0.126), ('dg', 0.124), ('ickisch', 0.119), ('factorial', 0.11), ('marginal', 0.104), ('asmussen', 0.101), ('dgv', 0.096), ('sigprobit', 0.096), ('rocess', 0.096), ('latent', 0.093), ('mi', 0.087), ('pproximate', 0.079), ('aussian', 0.069), ('lassification', 0.069), ('regimes', 0.064), ('gaussian', 0.061), ('yi', 0.059), ('zt', 0.058), ('mcmc', 0.055), ('moments', 0.052), ('variational', 0.051), ('evidence', 0.05), ('mode', 0.049), ('predictive', 0.049), ('kw', 0.048), ('rasmussen', 0.043), ('bits', 0.043), ('hyperparameters', 0.043), ('regime', 0.04), ('cumulative', 0.04), ('jensen', 0.038), ('derivatives', 0.034), ('siglogit', 0.034), ('gp', 0.034), ('erivatives', 0.034), ('opper', 0.033), ('approximations', 0.033), ('marginals', 0.032), ('laplace', 0.032), ('approximation', 0.032), ('figures', 0.031), ('newton', 0.029), ('fl', 0.029), ('exp', 0.029), ('williams', 0.028), ('lim', 0.028), ('divergence', 0.028), ('bound', 0.027), ('usps', 0.026), ('covariance', 0.026), ('likelihoods', 0.026), ('prior', 0.026), ('tr', 0.025), ('vk', 0.025), ('hyperparameter', 0.024), ('kuss', 0.024), ('tap', 0.024), ('log', 0.023), ('ii', 0.023), ('gray', 0.023), ('sigmoid', 0.023), ('constf', 0.023), ('irst', 0.023), ('lengthscales', 0.023), ('underestimation', 0.023), ('matching', 0.022), ('overcon', 0.021), ('propagation', 0.02), ('score', 0.02), ('logistic', 0.019), ('mackay', 0.019), ('csat', 0.019), ('lr', 0.018), ('ft', 0.018), ('sonar', 0.018), ('qi', 0.018), ('minka', 0.018), ('equation', 0.017), ('fs', 0.017), ('five', 0.017), ('underestimates', 0.017), ('winther', 0.017), ('tight', 0.017), ('ais', 0.017), ('archambeau', 0.017), ('fid', 0.017), ('ielding', 0.017), ('ntrain', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

2 0.27378064 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

Author: David C. Hoyle

Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference

3 0.20990983 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.17014936 52 jmlr-2008-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning

5 0.082137108 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

6 0.081907377 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

7 0.077285208 83 jmlr-2008-Robust Submodular Observation Selection

8 0.07196033 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

9 0.070695385 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

10 0.062202375 54 jmlr-2008-Learning to Select Features using their Properties

11 0.04895895 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

12 0.046988565 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

13 0.046689197 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

14 0.046161838 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

15 0.044276003 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

16 0.042392511 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix

17 0.041787732 47 jmlr-2008-Learning Balls of Strings from Edit Corrections

18 0.039813861 58 jmlr-2008-Max-margin Classification of Data with Absent Features

19 0.03957675 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

20 0.038015205 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.246), (1, -0.108), (2, -0.311), (3, -0.06), (4, 0.079), (5, 0.308), (6, -0.11), (7, -0.293), (8, -0.008), (9, 0.183), (10, 0.006), (11, 0.139), (12, 0.218), (13, -0.083), (14, -0.139), (15, -0.115), (16, -0.061), (17, 0.1), (18, 0.014), (19, 0.011), (20, -0.125), (21, -0.101), (22, 0.03), (23, -0.06), (24, 0.079), (25, -0.05), (26, -0.03), (27, 0.134), (28, 0.019), (29, -0.067), (30, 0.064), (31, -0.027), (32, 0.052), (33, -0.033), (34, -0.078), (35, -0.029), (36, 0.029), (37, -0.017), (38, -0.061), (39, -0.026), (40, -0.064), (41, -0.02), (42, -0.018), (43, 0.006), (44, 0.072), (45, -0.014), (46, 0.051), (47, -0.007), (48, 0.064), (49, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98276514 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

2 0.73106211 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

Author: David C. Hoyle

Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference

3 0.57592744 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.46622992 52 jmlr-2008-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning

5 0.33055314 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

Author: Sivan Sabato, Shai Shalev-Shwartz

Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees

6 0.29581895 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

7 0.24188724 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

8 0.24178685 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

9 0.23675472 83 jmlr-2008-Robust Submodular Observation Selection

10 0.21229082 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix

11 0.20261355 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

12 0.18540694 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

13 0.17958915 96 jmlr-2008-Visualizing Data using t-SNE

14 0.17179812 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

15 0.16833815 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

16 0.16453701 54 jmlr-2008-Learning to Select Features using their Properties

17 0.15443829 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

18 0.15212545 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective

19 0.14333777 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

20 0.13980094 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (5, 0.024), (9, 0.021), (31, 0.016), (35, 0.345), (40, 0.031), (53, 0.037), (54, 0.034), (58, 0.042), (66, 0.047), (72, 0.011), (76, 0.023), (88, 0.062), (92, 0.039), (94, 0.077), (99, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77224714 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

2 0.57610214 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

Author: Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin

Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON. Keywords: linear support vector machines, document classification, coordinate descent

3 0.40231794 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.35208622 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

5 0.35201722 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett

Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col

6 0.34951782 86 jmlr-2008-SimpleMKL

7 0.34464115 83 jmlr-2008-Robust Submodular Observation Selection

8 0.33804521 56 jmlr-2008-Magic Moments for Structured Output Prediction

9 0.33793336 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

10 0.33703941 57 jmlr-2008-Manifold Learning: The Price of Normalization

11 0.33698726 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

12 0.3358036 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

13 0.33534238 58 jmlr-2008-Max-margin Classification of Data with Absent Features

14 0.33398595 7 jmlr-2008-A Tutorial on Conformal Prediction

15 0.33391878 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

16 0.33253878 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

17 0.33222559 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

18 0.32940364 9 jmlr-2008-Active Learning by Spherical Subdivision

19 0.32733706 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

20 0.32601574 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies