jmlr jmlr2012 jmlr2012-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kian Ming A. Chai
Abstract: A Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide data-independent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. [sent-8, score-0.627]
2 This paper addresses the variational approximation of the multinomial logit Gaussian process model, where the likelihood function is the multinomial logistic. [sent-26, score-0.748]
3 Although the variational approximation provides a lower bound on the marginal likelihood, approximate prediction in the usual straightforward manner does not necessarily give a lower bound on the predictive likelihood. [sent-38, score-0.591]
4 In a natural manner, this sparse approximation combines our proposed variational approximation with the variational sparse approximation that has been introduced for regression (Titsias, 2009a). [sent-42, score-0.765]
5 In Section 2, we describe the latent Gaussian process model with the multinomial logistic likelihood, and we give the variational lower bound on the marginal likelihood for approximate inference. [sent-47, score-0.796]
6 Among others, we compare the tightness of our variational approximation to the variational mean-field approximation (Girolami and Rogers, 2006), and the errors of our classification results with those given by four single-machine multi-class SVMs. [sent-58, score-0.666]
7 With the latent function values f_i = (f_i^1, …, f_i^C)^T at x_i, the likelihood for the class label is the multinomial logistic p(y_i^c = 1 | f_i) ≝ exp(f_i^c) / ∑_{c'=1}^C exp(f_i^{c'}). [sent-99, score-0.804]
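A minimal sketch of the multinomial logistic (softmax) likelihood defined above, written in plain NumPy; the function name and example values are illustrative and not taken from the paper's code.

```python
import numpy as np

def multinomial_logistic(f_i):
    """Softmax likelihood p(y_i^c = 1 | f_i) for all C classes.

    f_i : array of shape (C,) with the latent function values at x_i.
    Returns an array of shape (C,) of class probabilities summing to one.
    """
    z = f_i - np.max(f_i)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example with C = 3 classes: the class with the largest latent value gets the largest probability.
print(multinomial_logistic(np.array([0.1, -0.3, 1.2])))
```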
8 Then we obtain a lower bound on the approximate predictive probability for class c: log q(y_*^c = 1 | y) ≝ log ∫ p(y_*^c = 1 | f_*) q(f_*|y) df_* ≥ ∫ q(f_*|y) log p(y_*^c = 1 | f_*) df_* = ℓ_*(y_*^c = 1; q),  (6) where the inequality is due to Jensen's inequality. [sent-138, score-0.866]
9 Equations 3 to 7 require the computation of the expected log-likelihood under q(f|y): ℓ(y; q) ≝ ∫ q(f|y) log p(y|f) df,  (8) where we have suppressed the datum indices i and * here and henceforth for this section. [sent-151, score-0.641]
10 Two trivial lower bounds can be obtained by expanding p(y|f) and using Jensen's inequality: ℓ(y; q) ≥ m^T y − log ∑_{c=1}^C exp(m^T e^c + ½ (y − e^c)^T V (y − e^c)), and ℓ(y; q) ≥ m^T y − log ∑_{c=1}^C exp(m^T e^c + ½ (e^c)^T V e^c). [sent-154, score-1.33]
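As a sketch of how these two trivial bounds could be evaluated for a single datum, assuming a C-dimensional marginal q = N(m, V) and a one-hot label vector y; this is a direct transcription of the reconstructed formulas, not the author's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def trivial_bounds(m, V, y):
    """Two trivial lower bounds on the expected log-likelihood ell(y; q), with q = N(m, V)."""
    C = len(m)
    E = np.eye(C)                                                    # rows are the unit vectors e^c
    quad1 = np.array([(y - E[c]) @ V @ (y - E[c]) for c in range(C)])
    quad2 = np.array([E[c] @ V @ E[c] for c in range(C)])
    bound1 = m @ y - logsumexp(m + 0.5 * quad1)                      # term (y - e^c)^T V (y - e^c)
    bound2 = m @ y - logsumexp(m + 0.5 * quad2)                      # term (e^c)^T V e^c
    return bound1, bound2
```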
11 In this section, we give a variational lower bound, and we have found this bound to be quite tight when the variational parameters are optimized. [sent-156, score-0.658]
12 Denote by r(f) the corresponding prior distribution that gives this posterior when combined with the exact data likelihood p(y|f); that is, r(f|y) = p(y|f) r(f) / r(y), where r(y) ≝ ∫ p(y|f) r(f) df. [sent-163, score-0.714]
13 Define S ≝ V^{-1} + W, b ≝ W(m − a) + y, and g^c(y; q, a, W) ≝ exp(m^T e^c + ½ (b − e^c)^T S^{-1} (b − e^c)). [sent-176, score-1.798]
14 Let g^c(q, b, S) ≝ exp(m^T e^c + ½ (b − e^c)^T S^{-1} (b − e^c))  (16) be a function of the mean m of distribution q and of b and S ≻ 0. [sent-189, score-1.3]
15 Let g ≝ (g^1, …, g^C)^T and ḡ^c ≝ g^c / ∑_{c'=1}^C g^{c'} with ḡ ≝ (ḡ^1, …, ḡ^C)^T,  (17) and let Ḡ be the diagonal matrix with ḡ along its diagonal. [sent-194, score-0.922]
16 We further define A ≝ ∑_{c=1}^C ḡ^c (b − e^c)(b − e^c)^T = b b^T − b ḡ^T − ḡ b^T + Ḡ ⪰ 0. [sent-195, score-1.057]
17 Let h(y; q, b, S) ≝ C/2 + ½ log|SV| − ½ tr(SV) + m^T y − log ∑_{c=1}^C g^c(q, b, S)  (19) be a function of b and S, where g^c(q, b, S) is given by (16). [sent-202, score-1.013]
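The sketch below evaluates g^c from (16) and the bound h from (19) for one datum, directly following the reconstructed formulas; it assumes q = N(m, V) with V the C-by-C marginal covariance, and the function names are hypothetical rather than the author's.

```python
import numpy as np
from scipy.special import logsumexp

def log_g(m, b, S):
    """log g^c(q, b, S) = m^T e^c + 0.5 (b - e^c)^T S^{-1} (b - e^c), for c = 1..C."""
    C = len(m)
    E = np.eye(C)
    S_inv = np.linalg.inv(S)
    quad = np.array([(b - E[c]) @ S_inv @ (b - E[c]) for c in range(C)])
    return m + 0.5 * quad

def h_bound(y, m, V, b, S):
    """h(y; q, b, S) = C/2 + 0.5 log|SV| - 0.5 tr(SV) + m^T y - log sum_c g^c, Equation (19)."""
    C = len(m)
    _, logdet_SV = np.linalg.slogdet(S @ V)
    return 0.5 * C + 0.5 * logdet_SV - 0.5 * np.trace(S @ V) + m @ y - logsumexp(log_g(m, b, S))
```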
18 In place of definitions (16) and (19) for functions g^c and h, suppose we had used g^c(b', S) ≝ exp(½ (b' − e^c)^T S^{-1} (b' − e^c)) and h(y; q, b', S) ≝ C/2 + ½ log|SV| − ½ tr(S(V + m m^T)) + m^T (y − b') − log ∑_{c=1}^C g^c(q, b', S) as functions of b' and S ≻ 0. [sent-241, score-2.596]
19 This gives the relaxed bound h(y; q, b, V^{-1}) = m^T y − log ∑_{c=1}^C exp(m^T e^c + ½ (b − e^c)^T V (b − e^c)). [sent-250, score-0.802]
20 We can also choose to fix b to y, giving h(y; q, y, V^{-1}) = m^T y − log ∑_{c=1}^C exp(m^T e^c + ½ (y − e^c)^T V (y − e^c)). [sent-252, score-0.759]
21 Lemma 7: ℓ(y; q) ≤ log p(y|m) = m^T y − log ∑_{c=1}^C exp(m^T e^c). [sent-259, score-0.913]
22 To consolidate, the log marginal likelihood is lower bounded via the sequence log p(y) ≥ log Z_B ≥ log Z_h ≝ −KL(q(f|y) ‖ p(f)) + ∑_{i=1}^n h(y_i; q_i, b_i, S_i), where the datum subscript i is reintroduced. [sent-262, score-1.147]
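Since the bound above needs KL(q(f|y) ‖ p(f)) between two Gaussians, here is the standard closed form for that term (a textbook formula, not code from the paper; m and V parameterize q, and K is the zero-mean prior covariance).

```python
import numpy as np

def kl_gaussian(m, V, K):
    """KL( N(m, V) || N(0, K) ) = 0.5 * (tr(K^{-1}V) + m^T K^{-1} m - D - log|K^{-1}V|)."""
    D = len(m)
    K_inv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K_inv @ V)
    return 0.5 * (np.trace(K_inv @ V) + m @ K_inv @ m - D - logdet)
```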
23 Let log Z_h = nC + ½ log|K^{-1}V| − ½ tr(K^{-1}V) − ½ m^T K^{-1} m + m^T y + ∑_{i=1}^n [½ log|S_i V_i| − ½ tr(S_i V_i)] − ∑_{i=1}^n log ∑_{c=1}^C exp(m_i^T e^c + ½ (b_i − e^c)^T S_i^{-1} (b_i − e^c)),  (21) where V_i is the ith C-by-C diagonal block of V, and m_i is the ith C-vector of m. [sent-277, score-1.252]
24 The C latent function values f_* of a test input x_* and the latent function values f of the n observed data have the prior (f; f_*) ∼ N(0, [K, K_*; K_*^T, K_**]), where K_* ≝ k^x_* ⊗ K^c, K_** ≝ k^x(x_*, x_*) K^c, and k^x_* is the vector of covariances between the observed inputs X and the test input x_*. [sent-293, score-1.428]
25 ∗ ∗ Expanding h using definitions (16) and (19) gives log p(yc = 1|y) ∗ C + mT ec + max ∗ b∗ ,S∗ 2 C 1 1 ′ log |S∗V∗ | − tr S∗V∗ − log ∑ gc (q∗ , b∗ , S∗ ) , (24) 2 2 c′ =1 1755 C HAI where the probed class ec is used only outside the max operator. [sent-296, score-0.902]
26 For the gradient with respect to V we have ∂h_i/∂V_i = ½ V_i^{-T} − ½ S_i^T = −½ W_i^T and ∂log Z_h/∂V = ½ V^{-1} − ½ K^{-1} − ½ W, where W_i ≝ S_i − V_i^{-1} and W is the block diagonal matrix of the W_i s. [sent-386, score-0.649]
27 To explain, we recall that log Z_h = −KL(q(f|y) ‖ p(f)) + h, where h ≝ ∑_{i=1}^n h_i is the sum of functions each concave in V. [sent-392, score-0.615]
28 To remedy, we follow the strategy used for updating S: we use V^cc ≝ (1 − η)V + ηV^fx and optimize with respect to η ∈ [0, 1]. [sent-399, score-0.673]
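A schematic of this damped update, assuming the fixed point has the form V^fx = (K^{-1} + W)^{-1} with W the block-diagonal matrix of the W_i (the form used in the appendix derivations later on); in practice eta would come from the line search described in the surrounding text.

```python
import numpy as np

def damped_V_update(K, W, V, eta):
    """Return V^cc = (1 - eta) V + eta V^fx, with V^fx = (K^{-1} + W)^{-1}."""
    V_fx = np.linalg.inv(np.linalg.inv(K) + W)
    return (1.0 - eta) * V + eta * V_fx
```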
29 In the second step, we use the factorization q(f, z|y) ≝ p(f|z) q(z|y),  (29) where p(f|z) is the conditional of f given z under the prior p(f, z). [sent-422, score-0.619]
30 Under this approximate posterior, we have the bound log p(y) ≥ log Z̃_B = −KL(q(z|y) ‖ p(z)) + ∑_{i=1}^n ℓ_i(y_i; q), where ℓ_i(y_i; q) ≝ ∫ q(f_i|y) log p(y_i|f_i) df_i, and q(f_i|y) is the marginal distribution of f_i from the joint distribution q(f, z|y); see Appendix B. [sent-425, score-0.923]
31 Applying Theorem 6 to the ℓ_i s gives log Z̃_B ≥ log Z̃_h ≝ −KL(q(z|y) ‖ p(z)) + ∑_{i=1}^n h(y_i; q_i, b_i, S_i),  (30) where h is defined by Equation 19, and the q_i within h is the marginal distribution q(f_i|y). [sent-434, score-0.81]
32 The sparse approximation requires optimizing log Z̃_h with respect to the variational parameters: the mean m and covariance V of the inducing variables, and {b_i} and {S_i} for the lower bound on the expected log-likelihood. [sent-467, score-0.764]
33 The fixed point update for V is V^fx = (K^{-1} + W)^{-1} with W ≝ K^{-1} K_f W_f K_f^T K^{-1},  (33) which is obtained by setting ∂Z̃_h/∂V at V^fx to zero. [sent-481, score-0.783]
34 In the case where V^fx does not yield an improvement to the objective log Z̃_h, we search for a V^cc ≝ (1 − η)V + ηV^fx, η ∈ [0, 1], using the false position method along η. [sent-483, score-0.887]
35 The stationary point of the lower bound Z_h in the sparse approximation satisfies the self-consistent equation m = K_f (y − ḡ); see Section 4. [sent-524, score-0.637]
36 For the non-sparse variational approximation to the multinomial logit Gaussian process, this is log Z*_h(θ) ≝ max_{m, V, {b_i}, {S_i}} log Z_h(m, V, {b_i}, {S_i}; X, y, θ), which is the maximal lower bound on the log marginal likelihood of the observed data (X, y). [sent-537, score-1.5]
37 The maximization is achieved by ascending the gradient dlog Z*_h/dθ_j = −½ tr(K^{-1} ∂K/∂θ_j) + ½ tr(K^{-1} V K^{-1} ∂K/∂θ_j) + ½ m^T K^{-1} (∂K/∂θ_j) K^{-1} m = ½ tr((α α^T − K^{-1} + K^{-1} V K^{-1}) ∂K/∂θ_j), where α ≝ K^{-1} m. [sent-538, score-0.753]
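A sketch of this gradient for a single hyperparameter, where dK stands for the matrix of partial derivatives ∂K/∂θ_j; the function name is illustrative only.

```python
import numpy as np

def dlogZh_dtheta(K, V, m, dK):
    """0.5 * tr((alpha alpha^T - K^{-1} + K^{-1} V K^{-1}) dK), with alpha = K^{-1} m."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ m
    A = np.outer(alpha, alpha) - K_inv + K_inv @ V @ K_inv
    return 0.5 * np.trace(A @ dK)
```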
38 4 gives ˜∗ dlog Zh 1 = − tr dθj 2 ααT − K −1 + K −1V K −1 +W + tr ∂K ∂θj ¯ (y − g)αT +Wf KfT K −1 − K −1V K −1 ∂Kf ∂θj 1 ∂Kff − tr Wf 2 ∂θj , def where α = K −1 m, and matrices Wf and W are defined in Section 4. [sent-543, score-0.753]
39 Given the current set of inducing sites X̃, the inclusion of x̃_* gives the new set X̃_*. [sent-560, score-0.717]
40 In words, Z̃*_h(X̃) is the optimized lower bound on the marginal likelihood with the current inducing set X̃, while Z̃*_h(X̃_*) is the optimized lower bound with the proposed new inducing set X̃_*. [sent-569, score-0.709]
41 Further setting {b_{*i}} ≡ {b_i} and {S_{*i}} ≡ {S_i}, the additional variational parameters in log Z̃ are those in the posterior covariance of the inducing points for the additional site x̃_*. [sent-576, score-0.745]
42 Since we are optimizing over only a subset of the possible parameters, we obtain a lower bound on d(x̃_*|X̃): d(x̃_*|X̃) ≥ d_1(x̃_*|X̃) ≝ max_{m_*, v_**, v_*} log Z̃_h(m̃_*, Ṽ_*, {b_i}, {S_i}; X̃_*) − log Z̃*_h(X̃),  (39) where m̃_* and Ṽ_* are as defined in Equation 38. [sent-577, score-0.74]
43 Proposition 17: For the sparse variational approximation to the multinomial logit Gaussian process, any site added to the inducing set can never decrease the lower bound Z̃*_h on the marginal likelihood. [sent-594, score-0.891]
44 …for c = 1, …, C, and determine the class by arg max_c u^c. The multinomial logistic is in this class, and it is obtained when each auxiliary variable u^c is Gumbel distributed with p(u^c | f^c) = t e^{−t}, where t = e^{−(u^c − f^c)} (McFadden, 1974). [sent-670, score-0.656]
45 From this perspective, a model with the threshold likelihood function and prior covariance function kx (x, x′ ) + δ(x, x′ ), where δ is the Kronecker delta function, is essentially the same as the model with the multinomial probit likelihood and prior covariance function kx (x, x′ ). [sent-673, score-0.732]
46 …squared exponential covariance function k^x(x, x') = usqexpard(x, x'; θ) ≝ exp(−½ ∑_{j=1}^d ((x_j − x'_j)/θ_j)²),  (44) where x_j (resp. x'_j) is the jth component of x (resp. x'). [sent-757, score-0.601]
47 KL-MNL is our variational approach with multinomial logistic likelihood, and MF-MNP is the variational mean-field with multinomial probit likelihood (Girolami and Rogers, 2006). [sent-767, score-1.019]
48 This suggests that the theoretic bounds may be useful as sanity checks for variational approaches, although we must qualify that the theoretic bounds are for the multinomial logistic likelihood rather than for the probit one. [sent-778, score-0.6]
49 For a sample f^(s) from the variational posterior, let w^(s) ≝ ∑_{i=1}^n log p(y_i | f_i^(s)). [sent-987, score-0.878]
50 The topmost curve log Z is for the marginal likelihood obtained using importance sampling; the middle curve log ZB approximates the posterior of the latent process with a Gaussian; while the bottom curve log Zh further approximates the expected log-likelihood. [sent-1008, score-0.624]
51 As outlined in Section 8, the Gaussian process latent model with covariance function kx (x, x′ ) on the inputs and with the multinomial probit likelihood is equivalent to the model with covariance function kx (x, x′ ) + δ(x, x′ ) and the threshold likelihood function. [sent-1018, score-0.761]
52 For this evaluation, the covariance function on the inputs in R^d is the squared-exponential covariance function with equal length-scales along all the dimensions: k^x(x, x') = sqexpiso(x, x'; σ_x, θ) ≝ σ_x² exp(−½ ∑_{j=1}^d ((x_j − x'_j)/θ)²). [sent-1039, score-0.747]
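A transcription of the two squared-exponential covariance functions that appear in the experiments, the isotropic one above and the ARD form in (44); the helper names mirror the paper's sqexpiso/usqexpard labels but the code itself is only a sketch.

```python
import numpy as np

def sqexpiso(X1, X2, sigma_x, theta):
    """k^x(x, x') = sigma_x^2 exp(-0.5 * sum_j ((x_j - x'_j) / theta)^2), one shared length-scale."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_x ** 2 * np.exp(-0.5 * d2 / theta ** 2)

def usqexpard(X1, X2, theta):
    """Unit-variance form of Equation (44) with one length-scale theta[j] per input dimension."""
    diff = (X1[:, None, :] - X2[None, :, :]) / theta
    return np.exp(-0.5 * (diff ** 2).sum(-1))
```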
53 To obtain Monte Carlo estimates that are better than the variational ones for this data set, we instead sample from the multivariate-t distribution that has covariance 2V instead of 2K, where V is the covariance of the variational approximation. [sent-1071, score-0.738]
54 In Section 4, the proposed variational approximation has been combined with the sparse variational approximation approach previously advocated for regression (Titsias, 2009a). [sent-1290, score-0.696]
55 From the variational approximation perspective, further constraints can be placed on any of the variational parameters: m, V, the b_i s and the S_i s. [sent-1315, score-0.689]
56 Lemma 18: Let x_1 ∈ R^{n_1} and x_2 ∈ R^{n_2} be random vectors with two jointly normal distributions p(x_1, x_2) ≝ N(n, U) and q(x_1, x_2) ≝ N(m, V), where the parameters are partitioned as n = (n_1; n_2), U = [U_11, U_12; U_21, U_22], m = (m_1; m_2), V = [V_11, V_12; V_21, V_22]. [sent-1328, score-2.988]
57 Proof: The conditional distributions p(x_2|x_1) and q(x_2|x_1) have means n_{2|1} ≝ n_2 + U_21 U_11^{-1} (x_1 − n_1) and m_{2|1} ≝ m_2 + V_21 V_11^{-1} (x_1 − m_1), and covariances U_{2|1} and V_{2|1}. [sent-1330, score-0.996]
58 Proof: Let X ≝ A + a u u^T and b ≝ 1 + a u^T A^{-1} u. [sent-1347, score-0.996]
59 …where log Z̃_B ≝ ∫∫ q(f, z|y) log [p(y, f, z) / q(f, z|y)] dz df.  (47) Given the model, the joint distribution p(y, f, z) factorizes into p(y|f) p(f|z) p(z) exactly; there is no approximation involved. [sent-1360, score-0.858]
60 Using these two factorizations, we have log Z̃_B = ∫∫ p(f|z) q(z|y) log [p(y|f) p(z) / q(z|y)] dz df = ∫ q(z|y) [log(p(z)/q(z|y)) + ∫ p(f|z) log p(y|f) df] dz = −KL(q(z|y) ‖ p(z)) + ∫ q(f|y) log p(y|f) df, where q(f|y) ≝ ∫ q(f, z|y) dz. [sent-1362, score-0.958]
61 Since the joint likelihood factorizes across the n data points, this is also log Z̃_B = −KL(q(z|y) ‖ p(z)) + ∑_{i=1}^n ℓ_i(y_i; q), where ℓ_i(y_i; q) ≝ ∫ q(f_i|y) log p(y_i|f_i) df_i. [sent-1363, score-0.776]
62 The bound Z_B in the non-sparse case can be obtained in a similar manner with log Z_B ≝ ∫ q(f|y) log [p(y, f) / q(f|y)] df. [sent-1364, score-0.627]
63 We begin by constraining the approximate posterior in log Z : ˜ log ZB B ∗ def log ZB = max log ZB ≥ max log ZB q(f|y) q(z|y) (where q(f|y) = p(f|z)q(z|y)dz) . [sent-1368, score-1.02]
64 Derivation of r(f) and r(y) for Lemma 2: Let r(f) be a prior distribution such that the posterior r(f|y) = p(y|f) r(f) / r(y), where r(y) ≝ ∫ p(y|f) r(f) df, is a C-variate Gaussian density on f with mean a and precision W. [sent-1375, score-0.633]
65 Normalization gives r(y) = exp ∑C exp c=1 1 T 2a Wa 1 c T c 2 (a ) W a def γc = , 1 c T c 2 (a ) W a ′ T ′ 1 ∑c′ exp 2 (ac ) W ac exp . [sent-1380, score-0.61]
66 Derivation of the lower bound h on the expected log-likelihood for Lemma 5: Recall from (12) that h(y; q, r) ≝ ∫ q(f|y) log r(f|y) df + log r(y) − log ∑_{c=1}^C γ^c ∫ q(f|y) r^c(f) df. [sent-1382, score-0.756]
67 By pulling out the terms independent of the dummy variable c in the last term of (57), we can rewrite C log ∑ γ c c=1 C q(f|y)rc (f)df = − log ∑ exp c=1 1 1 c T c C (a ) W a + log |W | − log 2π 2 2 2 C 1 1 1 − log |SV | − (m − a)TW (m − a) + aTW a − mT y + log ∑ gc (y; q, r). [sent-1388, score-0.673]
68 Proof: Let g̃^c(b, S) ≝ exp((b − e^c)^T S^{-1} (b − e^c)). [sent-1443, score-1.3]
69 Using the Cauchy–Schwarz inequality on ∑_{c=1}^C g^c gives log ∑_{c=1}^C g^c ≤ ½ log ∑_{c=1}^C exp(2 m^T e^c) + ½ log ∑_{c=1}^C g̃^c. [sent-1444, score-0.888]
70 2 2 c=1 c=1 2 We use this inequality together with the choice of distribution q, which has variance σv in all direc˜ q, b, S), where tions, to obtain h(y; q, b, S) ≥ h(y; C C C 1 1 σ2 1 exp 2mT y 2 def ˜ h(y; q, b, S) = + log σv + log |S| − v tr S − log ∑ gc (b, S) + log C ˜ . [sent-1445, score-1.084]
71 2 2 2 2 2 2 ∑c=1 exp 2mT ec c=1 ′ ¯ def ¯ ¯ ¯ def ˜ ˜ ˜ ˜ ˜ Let gc = gc / ∑C′ =1 gc and g = (g1 , . [sent-1446, score-1.626]
72 Therefore ˜ max h(y; q, b, S) b,S = = C C 1 σ 2 −2 −2 2 −2 −2 + log σv + log σv (σv λ)C−1 − v σv + (C − 1)σv λ 2 2 2 2 1 exp 2mT y 1 1 σ2 − logC exp v 1 − + log C 2 λ C 2 ∑c=1 exp 2mT ec 2 1 1 σv C C−1 + log λ − (1 + (C − 1)λ) − 2 2 2 2 λ 1791 1− 1 C 1 1 exp 2mT y − logC + log C . [sent-1471, score-0.757]
73 Proof (of Theorem 9): log p(y) ≥ max_{q, {b_i}, {S_i}} log Z_h ≥ max_{{b_i}, {S_i}} log Z_h |_{q(f|y)=N(0, σ_v² I)} = nC/2 + (nC/2) log σ_v² − ½ log|K| − (σ_v²/2) tr K^{-1} + ∑_{i=1}^n max_{b_i, S_i} h(y_i, N(0, σ_v² I), b_i, S_i) = nC/2 + (nC/2) log σ_v² − ½ log|K| − (σ_v²/2) tr K^{-1} + n max_{b, S} h(y, N(0, σ_v² I), b, S). [sent-1482, score-0.896]
74 Proof (of Theorem 10): log p(y) ≥ max_{q, {b_i}, {S_i}} log Z_h ≥ max_{{b_i}, {S_i}} log Z_h |_{q(f)=p(f)} = max_{{b_i}, {S_i}} { nC/2 + ∑_{i=1}^n [½ log|S_i K_i| − ½ tr(S_i K_i)] − ∑_{i=1}^n log ∑_{c=1}^C exp(½ (b_i − e^c)^T S_i^{-1} (b_i − e^c)) }. [sent-1484, score-0.973]
75 With K_1 = ⋯ = K_n = K^c, we have (1/n) log p(y) ≥ max_{b, S} { C/2 + ½ log|S K^c| − ½ tr(S K^c) − log ∑_{c=1}^C exp(½ (b − e^c)^T S^{-1} (b − e^c)) }. For the choice of K^c = σ² I, we apply Lemma 29. [sent-1489, score-1.299]
76 Matrices W and W^fx are positive semi-definite; see Lemma 28. [sent-1536, score-0.626]
77 Otherwise, we use V^cc ≝ (1 − η)V + ηV^fx, and we search for an η ∈ [0, 1] that optimizes the bound using the false position method. [sent-1574, score-0.716]
78 The search gradient is ∂log Z_h(V^cc)/∂η = ½ tr((V^cc)^{-1} ∆) − ½ tr(K^{-1} ∆) + ½ ∑_{i=1}^n tr((V_i^cc)^{-1} ∆_i) − ½ ∑_{i=1}^n tr(S_i ∆_i), where V_i^cc and ∆_i are the ith blocks along the diagonal of V^cc and ∆ respectively. [sent-1578, score-0.604]
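A sketch of evaluating this search gradient, assuming Delta = V^fx - V and lists of the per-datum diagonal blocks; the names are illustrative only.

```python
import numpy as np

def dlogZh_deta(Vcc, Delta, K, Vcc_blocks, Delta_blocks, S_blocks):
    """0.5 tr((V^cc)^{-1} D) - 0.5 tr(K^{-1} D) + 0.5 sum_i tr((V_i^cc)^{-1} D_i) - 0.5 sum_i tr(S_i D_i)."""
    g = 0.5 * np.trace(np.linalg.solve(Vcc, Delta)) - 0.5 * np.trace(np.linalg.solve(K, Delta))
    for Vi, Di, Si in zip(Vcc_blocks, Delta_blocks, S_blocks):
        g += 0.5 * np.trace(np.linalg.solve(Vi, Di)) - 0.5 * np.trace(Si @ Di)
    return g
```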
79 We can guarantee that there is a maximum between V and V^fx by showing that ∂Z_h/∂η is non-negative at η = 0: [sent-1581, score-0.771]
80 ∂log Z_h(V^cc)/∂η |_{η=0} = ½ tr(V^{-1}∆) − ½ tr(K^{-1}∆) − ½ ∑_{i=1}^n tr(W_i ∆_i) = ½ tr(V^{-1}∆) − ½ tr(K^{-1}∆) − ½ tr(W ∆) (since W is block diagonal) = ½ tr(V^{-1}∆) − ½ tr(K^{-1}∆) − ½ tr(((V^fx)^{-1} − K^{-1}) ∆) (since V^fx = (K^{-1} + W)^{-1}) = ½ tr((V^{-1} − (V^fx)^{-1}) ∆) = ½ tr((V^fx)^{-1} (V^fx − V) V^{-1} ∆) = ½ tr((V^fx)^{-1} ∆ V^{-1} ∆) ≥ 0. [sent-1582, score-1.306]
81 Let the covariance of the inducing variables be V^cc ≝ (1 − η)V + ηV^fx. [sent-1585, score-0.918]
82 Let ∆ ≝ dV^cc/dη = V^fx − V, and let ∆_f ≝ dV_f^cc/dη = V_f^fx − V_f = K_f^T K^{-1} ∆ K^{-1} K_f. [sent-1587, score-1.171]
83 First, we introduce T ≝ K^{-1} − K^{-1} V K^{-1} and Γ_j ≝ K ∂(K^{-1} K_f)/∂θ_j = ∂K_f/∂θ_j − (∂K/∂θ_j) K^{-1} K_f. [sent-1592, score-0.996]
84 Using (68), the definition of Γj in (66) and the invariance of trace under cyclic permutations, we obtain tr Wf ∂Vf ∂θj = tr Wf ∂Kff ∂θj − tr W ∂K ∂θj − 2 tr Wf KfT T ∂Kf ∂θj + 2 tr W KT ∂K ∂θj , def where we have used W = K −1 KfWf KfT K −1 . [sent-1599, score-0.923]
85 The prior on z and f is ˜∗ = ˜ x∗ z∗ = (z ∗ ∗ p z∗ f def =N 0, K∗ Kf∗ T Kf∗ Kff , where def K∗ = K k∗ kT k∗∗ ∗ def Kf∗ = Kf . [sent-1608, score-1.537]
86 Then a lower bound on the increase is d_1(x̃_*|X̃) ≝ max_{m_*, v_**, v_*} log Z̃_h(m̃_*, Ṽ_*, {b_i}, {S_i}; X̃_*) − log Z̃*_h(X̃),  (69) where the mean m̃_* and covariance Ṽ_* of the approximate posterior on z̃_* are constrained: m̃_* ≝ (m; m_*), Ṽ_* ≝ [V, v_*; v_*^T, v_**]. [sent-1611, score-1.903]
87 This is the approximate posterior using the inducing sites X̃_*, while q(f|y) ≝ q(f|y, X̃) is the posterior using X̃. [sent-1613, score-0.901]
88 Instead of using m_*, v_** and v_* as the variational parameters, we can treat µ, ν and ν as the variational parameters, and then define m_*, v_** and v_* as functions of them: m_* ≝ µ + k_*^T K^{-1} m, v_** ≝ ν + v_*^T V^{-1} v_*, v_* ≝ ν + V K^{-1} k_*. [sent-1632, score-2.082]
89 1 N EWTON -R APHSON U PDATES FOR µ T ′ def def ¯ ¯ ¯ ¯∗i ¯∗i Let gc = gc exp µκT ec , gc = gc / ∑C′ =1 gc , g∗i = g1 , . [sent-1641, score-1.884]
90 , g∗n , ¯i ¯∗i def ∗i c i ∗i ∗i ¯ ¯ ˜ ¯ G∗ be the diagonal matrix with g∗ down its diagonal, and G∗ be a nC-by-nC block diagonal matrix ¯ ¯ ∗i where the ith block is g∗i gT . [sent-1647, score-0.628]
91 To avoid O(C³n³) computation, we apply the Woodbury inversion lemma thrice. [sent-1678, score-0.996]
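For reference, the generic Woodbury identity that this step relies on, written so that an already-known A^{-1} is reused; this is the textbook identity rather than the paper's specific triple application.

```python
import numpy as np

def woodbury_inverse(A_inv, U, C, V):
    """(A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}."""
    inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
    return A_inv - A_inv @ U @ inner @ V @ A_inv
```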
92 def B = LW W = W W The optimal update is given by the best convex combination of V and V fx . [sent-1689, score-0.655]
93 (75) The best convex combination is the one optimized over η ∈ [0, 1] in V cc = K − KT cc K, where def T cc = (1 − η)T old + ηT . [sent-1691, score-0.674]
94 The update for V^cc implies that T^cc is the T^old for the next iteration, so (75) is always possible. [sent-1692, score-0.621]
95 The computation of V_f at this fixed point requires T ≝ K^{-1} − K^{-1} V^fx K^{-1}. [sent-1702, score-0.626]
96 3 Initialization Our variational lower bound (21) on the marginal likelihood is concave with respect to all the variational parameters, so the initialization of parameters does not affect the converged answer in theory. [sent-1706, score-0.848]
97 The easier case is during model learning when we can use the optimized variational parameters from the previous model to initialize the variational parameters of the current model. [sent-1711, score-0.623]
98 Its unnormalized weight is w^(s) ≝ p(y|f^(s)) p(f^(s)) / p_s(f^(s)) = p(y) p(f^(s)|y) / p_s(f^(s)), which can be computed exactly for the multinomial logistic likelihood. [sent-1758, score-0.726]
99 A Monte Carlo estimate of p(y) is p̂(y) ≝ ∑_s w^(s) / n_s, which is the sample mean of the w^(s) s, because p(y) = ∫ p(y|f) p(f) df = ∫ [p(y|f) p(f) / p_s(f)] p_s(f) df ≈ (1/n_s) ∑_s w^(s). [sent-1759, score-1.031]
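A schematic of this importance-sampling estimate, working with log-weights for numerical stability; log_joint, log_proposal and sample_proposal are placeholders for log p(y|f)p(f), log p_s(f) and a sampler from p_s.

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_is(log_joint, log_proposal, sample_proposal, num_samples=1000):
    """Estimate log p(y) as the log of the sample mean of w^(s) = p(y|f^(s)) p(f^(s)) / ps(f^(s))."""
    log_w = np.empty(num_samples)
    for s in range(num_samples):
        f = sample_proposal()                       # draw f^(s) ~ ps
        log_w[s] = log_joint(f) - log_proposal(f)   # log w^(s)
    return logsumexp(log_w) - np.log(num_samples)
```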
100 ˜ = (s) ns ps (f(s) )p(f∗ |f(s) ) def The predictive probability is p(y∗ |y) = p(y∗ |f∗ ) p(f, f∗ |y) df df∗ = p(y∗ |f∗ ) p(f, f∗ |y) (s) (s) ps (f, f∗ ) df df∗ ≈ ∑ w∗ p(y∗ |f∗ ). [sent-1777, score-0.83]
wordName wordTfidf (topN-words)
[('def', 0.498), ('zh', 0.37), ('variational', 0.294), ('ec', 0.215), ('inducing', 0.17), ('kf', 0.168), ('sfx', 0.157), ('hai', 0.13), ('gc', 0.129), ('fx', 0.128), ('multinomial', 0.125), ('ogit', 0.122), ('ultinomial', 0.122), ('zb', 0.116), ('vf', 0.102), ('yc', 0.101), ('kft', 0.098), ('rocess', 0.094), ('df', 0.094), ('posterior', 0.092), ('rogers', 0.091), ('thyroid', 0.091), ('latent', 0.088), ('log', 0.086), ('tr', 0.085), ('girolami', 0.085), ('likelihood', 0.081), ('marginal', 0.078), ('covariance', 0.075), ('kx', 0.071), ('lk', 0.071), ('scc', 0.071), ('si', 0.07), ('aussian', 0.069), ('probit', 0.067), ('mt', 0.067), ('wf', 0.064), ('bi', 0.062), ('kff', 0.059), ('datum', 0.057), ('logit', 0.057), ('barber', 0.055), ('vfi', 0.054), ('sites', 0.049), ('gaussian', 0.049), ('likelihoods', 0.048), ('titsias', 0.047), ('cc', 0.047), ('fi', 0.046), ('bound', 0.043), ('prior', 0.043), ('seeger', 0.042), ('iris', 0.042), ('lt', 0.04), ('predictive', 0.04), ('fic', 0.039), ('approximation', 0.039), ('diagonal', 0.037), ('fty', 0.036), ('optimized', 0.035), ('cth', 0.035), ('fxvw', 0.035), ('hyperthyroidism', 0.035), ('ps', 0.035), ('williams', 0.035), ('ns', 0.034), ('glass', 0.034), ('dh', 0.034), ('logistic', 0.033), ('dkl', 0.032), ('euthyroidism', 0.031), ('kfwf', 0.031), ('tvfi', 0.031), ('violin', 0.031), ('null', 0.031), ('concave', 0.031), ('kl', 0.031), ('ww', 0.03), ('sparse', 0.03), ('dz', 0.03), ('update', 0.029), ('kt', 0.029), ('exp', 0.028), ('block', 0.028), ('tw', 0.028), ('site', 0.028), ('mtv', 0.028), ('uvt', 0.028), ('rasmussen', 0.027), ('rc', 0.027), ('inference', 0.027), ('logc', 0.027), ('lower', 0.027), ('jj', 0.027), ('process', 0.027), ('ghahramani', 0.026), ('factorizes', 0.025), ('nc', 0.025), ('ut', 0.025), ('gibbs', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
2 0.21494661 59 jmlr-2012-Linear Regression With Random Projections
Author: Odalric-Ambrym Maillard, Rémi Munos
Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
3 0.094708525 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
4 0.086379297 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
Author: Hannes Nickisch
Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class
5 0.062215295 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
6 0.061556138 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
7 0.05848781 107 jmlr-2012-Smoothing Multivariate Performance Measures
8 0.054732706 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
9 0.053800628 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
10 0.053193934 73 jmlr-2012-Multi-task Regression using Minimal Penalties
11 0.052327048 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
12 0.051912796 82 jmlr-2012-On the Necessity of Irrelevant Variables
13 0.051476404 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
14 0.042746626 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
15 0.04252832 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
16 0.041354857 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
17 0.04114088 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
18 0.040679067 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
19 0.039761957 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
20 0.039422195 97 jmlr-2012-Regularization Techniques for Learning with Matrices
topicId topicWeight
[(0, -0.2), (1, 0.073), (2, 0.054), (3, -0.062), (4, 0.103), (5, -0.004), (6, 0.028), (7, 0.11), (8, -0.345), (9, -0.046), (10, -0.067), (11, 0.042), (12, -0.196), (13, 0.221), (14, -0.053), (15, -0.108), (16, -0.07), (17, 0.059), (18, -0.16), (19, 0.033), (20, 0.046), (21, 0.0), (22, -0.148), (23, -0.128), (24, 0.008), (25, -0.037), (26, -0.115), (27, -0.038), (28, 0.153), (29, 0.036), (30, -0.071), (31, -0.04), (32, -0.122), (33, -0.198), (34, 0.013), (35, -0.014), (36, 0.105), (37, 0.013), (38, -0.063), (39, -0.073), (40, -0.108), (41, -0.012), (42, -0.033), (43, -0.014), (44, 0.059), (45, 0.062), (46, 0.021), (47, -0.069), (48, 0.015), (49, -0.107)]
simIndex simValue paperId paperTitle
same-paper 1 0.9264968 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
2 0.61494571 59 jmlr-2012-Linear Regression With Random Projections
3 0.50585383 107 jmlr-2012-Smoothing Multivariate Performance Measures
Author: Xinhua Zhang, Ankan Saha, S.V.N. Vishwanathan
Abstract: Optimizing multivariate performance measure is an important task in Machine Learning. Joachims (2005) introduced a Support Vector Method whose underlying optimization problem is commonly solved by cutting plane methods (CPMs) such as SVM-Perf and BMRM. It can be shown that CPMs 1 converge to an ε accurate solution in O λε iterations, where λ is the trade-off parameter between the regularizer and the loss function. Motivated by the impressive convergence rate of CPM on a number of practical problems, it was conjectured that these rates can be further improved. We disprove this conjecture in this paper by constructing counter examples. However, surprisingly, we further discover that these problems are not inherently hard, and we develop a novel smoothing strategy, which in conjunction with Nesterov’s accelerated gradient method, can find an ε accuiterations. Computationally, our smoothing technique is also rate solution in O∗ min 1 , √1 ε λε particularly advantageous for optimizing multivariate performance scores such as precision/recall break-even point and ROCArea; the cost per iteration remains the same as that of CPMs. Empirical evaluation on some of the largest publicly available data sets shows that our method converges significantly faster than CPMs without sacrificing generalization ability. Keywords: non-smooth optimization, max-margin methods, multivariate performance measures, Support Vector Machines, smoothing
4 0.39202949 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
5 0.38959208 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
6 0.32929736 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
7 0.31595713 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
8 0.28585672 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
9 0.27985653 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
10 0.27489856 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
11 0.27020225 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
12 0.26142427 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
13 0.24204789 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
15 0.23408055 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
16 0.22952433 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
17 0.22683249 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
18 0.22607602 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
19 0.221672 73 jmlr-2012-Multi-task Regression using Minimal Penalties
20 0.22123115 82 jmlr-2012-On the Necessity of Irrelevant Variables
topicId topicWeight
[(9, 0.392), (21, 0.018), (26, 0.041), (29, 0.048), (35, 0.013), (49, 0.042), (56, 0.015), (69, 0.019), (75, 0.052), (77, 0.018), (79, 0.019), (81, 0.017), (92, 0.085), (96, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.7047863 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
2 0.36543798 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
3 0.36366236 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
4 0.36160666 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.36103851 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
6 0.35948962 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
7 0.35742053 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
8 0.35619777 13 jmlr-2012-Active Learning via Perfect Selective Classification
9 0.35566252 80 jmlr-2012-On Ranking and Generalization Bounds
10 0.35534251 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
11 0.35490948 103 jmlr-2012-Sampling Methods for the Nyström Method
12 0.35464895 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
13 0.35445791 73 jmlr-2012-Multi-task Regression using Minimal Penalties
14 0.35300249 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
15 0.35276777 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
16 0.35261965 82 jmlr-2012-On the Necessity of Irrelevant Variables
17 0.35247293 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
18 0.35245898 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.35156107 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
20 0.34850079 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices