jmlr jmlr2011 jmlr2011-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
Reference: text
sentIndex sentText sentNum sentScore
1 In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. [sent-9, score-0.287]
2 The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. [sent-11, score-0.277]
3 We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. [sent-12, score-0.376]
4 Rue et al. (2009) propose an integrated nested Laplace approximation to approximate these posterior marginal distributions. [sent-24, score-0.327]
5 The crucial contribution is the improved marginal posterior approximation in step 2), based on the approach of Tierney and Kadane (1986), that goes beyond the Gaussian approximation and takes into account higher order characteristics of (all) likelihood terms. [sent-30, score-0.408]
6 However, we will see that, using a parallel instead of a sequential updating scheme, expectation propagation is at most a small constant factor slower than the Laplace method in applications on sparse Gaussian models with many latent variables. [sent-40, score-0.256]
7 Moreover, along the way we will arrive at further approximations (both for expectation propagation and the Laplace method) that yield an order of magnitude speed-up, with hardly any degradation in performance. [sent-41, score-0.295]
8 We take y fixed and consider the problem of computing accurate approximations of the posterior marginal densities of the latent variables p(x_i | y, θ), given a fixed hyper-parameter value. [sent-70, score-0.387]
9 Then we integrate these marginals over the approximation of the hyper-parameters' posterior density p(θ|y). [sent-71, score-0.447]
10 2 An Outline of the Main Methods Presented in the Paper In this paper, we will discuss a variety of methods for approximating marginals in latent Gaussian models. [sent-78, score-0.282]
11 The posterior probability density p(x) is proportional to a (sparse) multivariate Gaussian distribution over all latent variables and a product of non-Gaussian terms t j (x j ), each of which depends on just a single latent variable. [sent-81, score-0.244]
12 Hence, the approximation resulting from the Laplace method can always be written as the original prior p_0(x) times a product of so-called term approximations t̃_j(x_j), each of which has a Gaussian form (not necessarily normalizable) depending on just a single latent variable. [sent-87, score-0.316]
13 Expectation propagation (EP) aims to iteratively refine these term approximations t̃_j(x_j). [sent-88, score-0.244]
14 In its original setting, expectation propagation refines the term approximations t̃_j(x_j) sequentially, replacing each one in turn by its new term approximation. [sent-96, score-0.407]
15 We can then write the exact non-Gaussian posterior as this Gaussian approximation q(x) times a product of correction terms, where each correction term is nothing but the original term t_j(x_j) divided by its term approximation t̃_j(x_j). [sent-100, score-0.393]
16 We are interested in accurate approximations of marginals p(xi ) on a single variable, say xi . [sent-105, score-0.452]
17 Decomposing the global Gaussian approximation q(x) into the product of q(xi ) and the conditional q(x\i |xi ), we can take both q(xi ) and the correction term depending on xi outside of the integral over x\i . [sent-107, score-0.427]
18 The remaining integrand is then the conditional Gaussian q(x\i |xi ) times the product of all correction terms, except the one for xi . [sent-108, score-0.277]
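In formulas, writing ε_j(x_j) = t_j(x_j)/t̃_j(x_j) for the correction terms, the decomposition just described reads (a sketch of the identity, with Z = Z_p/Z_q the overall normalization):

```latex
p(x_i) \;=\; \frac{1}{Z}\, q(x_i)\,\varepsilon_i(x_i)
\int \mathrm{d}x_{\setminus i}\; q(x_{\setminus i}\mid x_i)
\prod_{j\neq i}\varepsilon_j(x_j),
\qquad
\varepsilon_j(x_j) \;\equiv\; \frac{t_j(x_j)}{\tilde t_j(x_j)}.
```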
19 Doing the same in conjunction with expectation propagation leads to the approximation in Section 3. [sent-113, score-0.287]
20 However, both easily become very expensive, since we have to apply the Laplace method or run a full expectation propagation for each setting of xi . [sent-115, score-0.31]
21 The first, obvious approximation is to replace these correction terms within the integral by 1, leaving only the product of q(xi ) and the correction term depending on xi . [sent-117, score-0.454]
22 In the case of expectation propagation it is exactly the corresponding marginal of the tilted distribution, and we refer to it as EP-L (Section 3). [sent-119, score-0.295]
23 The term approximations inside the integral are optimized for the global Gaussian approximation, that is, when averaging over xi . [sent-133, score-0.322]
24 A full run of expectation propagation would give the term approximations that are optimal conditioned upon xi , instead of marginalized over xi . [sent-134, score-0.565]
25 In their original work, they apply this Taylor expansion not only for the correction terms inside the integral, but also for the correction term depending on xi outside of the integral, which is unnecessary in the current context. [sent-146, score-0.345]
26 In this section, we review two approximation schemes that approximate such integrals: the Laplace method and expectation propagation (Minka, 2001). [sent-159, score-0.318]
27 The marginal approximation methods we propose for expectation propagation in Section 3 can be, under mild conditions, translated to the variational approximation in Opper and Archambeau (2009). [sent-163, score-0.48]
28 (The Laplace Method) The Laplace method approximates the evidence Z_p and, as a side product, provides a Gaussian approximation that is characterized by the local properties of the distribution at its mode x* = argmax_x log p(x). [sent-166, score-0.231]
29 A closer look at (3) and (4) suggests that choosing x̃ = x* and using the approximation R_2[log p](x; x̃) ≈ 0 yields an approximation of the log evidence: log Z_p ≈ log p(x*) + (n/2) log 2π − (1/2) log | −∇²_xx log p(x*) |. [sent-175, score-0.418]
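As an illustration of the steps above, here is a minimal dense sketch of the Laplace method (not the paper's implementation) for a model with a Gaussian part in canonical form (precision Q, linear term h) and univariate non-Gaussian terms; the interface (log_t, dlog_t, d2log_t returning per-term values and derivatives) is an assumption for this example:

```python
import numpy as np

def laplace_approximation(Q, h, log_t, dlog_t, d2log_t, n_iter=50, tol=1e-10):
    """Laplace approximation for log p(x) = -0.5*x'Qx + h'x + sum_j log t_j(x_j).

    Returns the mode x*, the precision of the Gaussian approximation at the mode,
    and the approximate log evidence log Z_p."""
    n = len(h)
    x = np.zeros(n)
    for _ in range(n_iter):
        grad = -Q @ x + h + dlog_t(x)              # gradient of log p at x
        prec = Q - np.diag(d2log_t(x))             # minus the Hessian of log p at x
        x_new = x + np.linalg.solve(prec, grad)    # plain Newton step (no damping in this sketch)
        if np.max(np.abs(x_new - x)) < tol:
            x = x_new
            break
        x = x_new
    prec = Q - np.diag(d2log_t(x))
    log_p_mode = -0.5 * x @ Q @ x + h @ x + np.sum(log_t(x))
    _, logdet = np.linalg.slogdet(prec)
    log_Z = log_p_mode + 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
    return x, prec, log_Z
```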
30 This behavior and its influence on the approximation of the marginals are illustrated with a two-dimensional toy model. [sent-186, score-0.309]
31 For j = 1, …, n, the moment-matching conditions read ∫ dx_j (1, x_j, x_j^2) q^{\j}(x_j) t_j(x_j) = ∫ dx_j (1, x_j, x_j^2) q^{\j}(x_j) t̃_j(x_j), and so the approximation for Z_p has the form Z_p ≈ ∫ dx p_0(x) ∏_j t̃_j(x_j). [sent-209, score-0.272]
32 Approximation of the Posterior Marginals The global approximations provide Gaussian approximations q of p and approximations of the evidence Z p . [sent-239, score-0.434]
33 The Gaussian approximation q can be used to compute Gaussian approximations of posterior marginals. [sent-240, score-0.335]
34 In case of the Laplace method this only requires linear algebraic methods (computing the diagonal elements of the Hessian’s inverse), while in the case of EP, the approximate marginals are a side product of the method itself. [sent-241, score-0.228]
35 In the case of the Laplace method, one can easily check that the residual term in (3) decomposes as R_2[log p](x; x̃) = ∑_j R_2[log t_j](x_j; x̃_j); thus, when approximating the marginal of x_i it is sufficient to assume R_2[log t_j](x_j; x̃_j) ≈ 0 only for j ≠ i. [sent-244, score-0.244]
36 As discussed earlier, EP is built on exploiting the low dimensionality of t_i(x_i) and on approximating the tilted marginals t_i(x_i) q^{\i}(x_i). [sent-247, score-0.43]
37 These are known to be better approximations of the marginals p(x_i) than q(x_i). [sent-248, score-0.317]
38 These observations show that there are ways to improve the marginals of the global approximation q by exploiting the properties of the methods. [sent-253, score-0.347]
39 The exact marginals can be computed as p(x_i) = (1/Z_p) t_i(x_i) ∫ dx_{\i} p_0(x_{\i}, x_i) ∏_{j≠i} t_j(x_j), (9) thus, as mentioned earlier, computing the marginal for a fixed x_i amounts to computing the normalization constant of the distribution p_0(x_{\i} | x_i) ∏_{j≠i} t_j(x_j). [sent-256, score-0.659]
40 This means that in principle we have exact estimates of the error and that any reasonable approximation of the integral can improve the quality of the approximation in (10). [sent-274, score-0.253]
41 The procedure is as follows: (1) fix x_i and compute the canonical parameters of p_0(x_{\i} | x_i), given by h_{\i} − Q_{\i,i} x_i and Q_{\i,\i}, and (2) use EP to approximate the integral in (9). [sent-279, score-0.33]
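A small sketch of step (1), with Q and h denoting the precision and linear term of the Gaussian part (names chosen here for illustration, not taken from the paper's code):

```python
import numpy as np

def conditional_canonical(Q, h, i, x_i):
    r"""Canonical parameters of p0(x_{\i} | x_i): precision Q_{\i,\i}
    and linear term h_{\i} - Q_{\i,i} * x_i."""
    mask = np.ones(len(h), dtype=bool)
    mask[i] = False
    Q_rest = Q[np.ix_(mask, mask)]        # precision of the conditional
    h_rest = h[mask] - Q[mask, i] * x_i   # linear term of the conditional
    return Q_rest, h_rest
```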
42 Approximation of the Posterior Marginals by Correcting the Global Approximations As we have seen in the previous section, computing the marginal for a given fixed xi value can be as expensive as the global procedure itself. [sent-282, score-0.254]
43 On the other hand, however, there are ways to improve the marginals of the global approximation. [sent-283, score-0.235]
44 In this section, we start from the “direct” approach and try to re-use the results of the global approximation to improve on the locally improved marginals LM - L and EP - L . [sent-284, score-0.347]
45 In the case of EP, the term approximations t̃_i(x_i) are chosen to be close to the terms t_i(x_i) on average w.r.t. … [sent-292, score-0.286]
46 Equation (12) is still exact and it shows that there are two corrections to the Gaussian approximation q(x_i): one direct, local correction through ε_i(x_i) and one more indirect correction through the (weighted integral over the) ε_j(x_j)'s for j ≠ i. [sent-299, score-0.38]
47 We use the notations p̃_i^{EP-L}(x_i) and p̃_i^{LM-L}(x_i) for the approximations following the global Gaussian approximations by EP and the Laplace method, respectively. [sent-301, score-0.278]
48 Again, for each x_i, we are in fact back to the form (9): we have to estimate the normalization constant of a latent Gaussian model, where q(x_{\i} | x_i) now plays the role of an (n − 1)-dimensional Gaussian prior and the ε_j(x_j)'s are terms depending on a single variable. [sent-303, score-0.247]
49 (Improving the Marginals Resulting from EP) Let us write ε̃_j(x_j; x_i) for the term approximation of ε_j(x_j) in the context of approximating c_i(x_i). [sent-309, score-0.311]
50 Since the term approximations of the global EP approximation are tuned to make t̃_j(x_j) close to t_j(x_j) w.r.t. … [sent-311, score-0.27]
51 Replacing the terms ε_j(x_j) in (13) by their term approximations ε̃_j(x_j; x_i) yields an estimate for c_i(x_i). [sent-318, score-0.291]
52 The corresponding approximation p(x_i) ≈ (1/Z) ε_i(x_i) q(x_i) ∫ dx_{\i} q(x_{\i} | x_i) ∏_{j≠i} ε̃_j(x_j; x_i) is referred to as p̃_i^{EP-1STEP}(x_i). [sent-319, score-0.247]
53 Rue et al. (2009) propose to re-use the global approximation by approximating the conditional mode with the conditional mean, that is, x*_{\i}(x_i) ≈ m_{\i} + V_{\i,i} V_{i,i}^{-1} (x_i − m_i), where m = x* (= argmax_x log p(x)). [sent-329, score-0.309]
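A sketch of this conditional-mean shortcut, with m and V the mean and covariance of the global Gaussian approximation (illustrative interface, not the authors' code):

```python
import numpy as np

def conditional_mean(m, V, i, x_i):
    r"""Approximate the conditional mode x*(x_i) by the conditional mean
    m_{\i} + V_{\i,i} V_{i,i}^{-1} (x_i - m_i) of the global Gaussian q."""
    mask = np.ones(len(m), dtype=bool)
    mask[i] = False
    return m[mask] + V[mask, i] * (x_i - m[i]) / V[i, i]
```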
54 Using the conditional mean as an approximation of the conditional mode leads to ignoring the terms ε j (x j ) and using the mode of q(x\i |xi ). [sent-337, score-0.274]
55 Here ε̃_j(x_j; x_i) follows from a second-order Taylor expansion of log ε_j(x_j) around the mode or mean of q(x_j | x_i) instead of the mode of f(x_{\i}; x_i). [sent-340, score-0.442]
56 We therefore propose the general family of approximations c_i^{(α)}(x_i) = ∏_{j≠i} [ ∫ dx_j q(x_j | x_i) ε_j(x_j)^α ]^{1/α}. [sent-382, score-0.236]
57 Note that when EP converges, this approximation always exists, because q(x_j | x_i) ε_j(x_j) is proportional to the conditional marginal of the so-called tilted distribution t_j(x_j) t̃_j(x_j)^{-1} q(x). [sent-389, score-0.256]
58 In the factorized variants, we would replace q(x_{\i} | x_i) by the factorization ∏_{j≠i} q(x_j | x_i), that is, as if the variables x_j in the global Gaussian approximation were conditionally independent given x_i. [sent-392, score-0.32]
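A minimal sketch of such a factorized correction for α = 1, with the univariate integrals computed by Gauss-Hermite quadrature; the correction functions eps[j] (vectorized callables for ε_j) and the conditional means and variances of q(x_j | x_i) are assumed to be supplied by the caller:

```python
import numpy as np

def factorized_correction(eps, cond_mean, cond_var, i, n_quad=32):
    """Approximate c_i(x_i) = prod_{j != i} int q(x_j | x_i) eps_j(x_j) dx_j,
    where q(x_j | x_i) = N(cond_mean[j], cond_var[j])."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)  # weight exp(-t^2/2)
    weights = weights / np.sqrt(2.0 * np.pi)                     # normalize to N(0, 1)
    log_c = 0.0
    for j in range(len(eps)):
        if j == i:
            continue
        x_j = cond_mean[j] + np.sqrt(cond_var[j]) * nodes        # nodes of N(mean_j, var_j)
        log_c += np.log(np.dot(weights, eps[j](x_j)))            # log of the j-th univariate integral
    return np.exp(log_c)
```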
59 Here, we compute the univariate integrals with the Laplace method, using the approximation x*_j(x_i) ≈ E_q[x_j | x_i], with q(x) being the global approximation resulting from the Laplace method. [sent-394, score-0.262]
60 In this way, we can obtain higher order corrections of the approximate marginals and the evidence approximation. [sent-397, score-0.325]
61 Expanding to first order in all of the ε_j(x_j) − 1, they obtain a first-order approximation of the exact p in terms of the global approximation q and the tilted distributions t_j(x_j) q^{\j}(x). [sent-417, score-0.301]
62 The marginalization of this expansion yields the marginal approximation p̃_i^{EP-OPW}(x_i) ≡ (Z_q / Z_p) q(x_i) ( 1 + ∑_j ∫ dx_j q(x_j | x_i) [ε_j(x_j) − 1] ). [sent-418, score-0.305]
63 Figure 3: The posterior marginals of the first components of a 3-dimensional model with Heaviside terms with (v, c) = (4, 0. [sent-446, score-0.3]
64 The local correction EP - L yields sufficiently accurate approximations when the correlations are weak (top), but is clearly insufficient when they are strong (center). [sent-461, score-0.237]
65 These corrections suffer from essentially the same problems as the global Gaussian approximation based on Laplace’s method: the mode and the inverse Hessian represent the mean and the covariance badly and fail to sufficiently improve it. [sent-470, score-0.268]
66 The panels of Figure 4 show a few posterior marginals of the regression coefficients x j given the maximum a posteriori (MAP) hyper-parameters v and λ. [sent-512, score-0.354]
67 The approximations are accurate but in this case, the local approximations EP - L fail dramatically when the mass of the distribution is not close to zero. [sent-514, score-0.24]
68 Figure 5: The posterior marginal approximation EP-FACT of the coefficients in a toy logistic regression model with a Gaussian prior on the coefficients and moderate posterior correlations. [sent-522, score-0.426]
69 The panels show that even when the non-Gaussian terms depend on more than one variable, the approximation EP-FACT might still be accurate. [sent-524, score-0.269]
70 The panels of Figure 5 show a few marginals of a model where we have chosen u_i ∼ N(0, 10) and an independent Gaussian prior p_0(x) = ∏_j N(x_j | 0, v^{-1}) with v = 0. [sent-527, score-0.419]
71 Although one would expect that the factorization might lead to poor approximations, EP - FACT seems to approximate the marginals significantly better than the global approximation EP - G. [sent-530, score-0.413]
72 (1) Computing the Cholesky factor L of a matrix Q, for example corresponding to the posterior approximation p̃^{EP-G} or p̃^{LM-G}, with the same sparsity structure as the prior precision matrix Q. [sent-540, score-0.27]
73 Thus, computing the lowest order (Gaussian) marginals p̃_i^{LM-G} for all variables x_i, i = 1, …, n. [sent-562, score-0.332]
74 (Expectation Propagation) In order to update a term approximation t̃_j(x_j), we compute q^{\j}(x_j) using the marginals q(x_j) from the current global approximation q(x) and re-estimate the normalization constant and the first two moments of t_j(x_j) q^{\j}(x_j). [sent-568, score-0.513]
75 After convergence, EP yields the lowest order marginals p̃_i^{EP-G} for all variables x_i, i = 1, …, n. [sent-579, score-0.332]
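To make the EP update above concrete, the following is a minimal dense sketch of a parallel EP sweep for univariate non-Gaussian terms, with the prior in canonical form (Q, h) and tilted moments computed by Gauss-Hermite quadrature; in the sparse setting of the paper the dense inverse would be replaced by computations with a sparse Cholesky factor, and the interface is an assumption for this example:

```python
import numpy as np

def parallel_ep(Q, h, log_t, n_sweeps=30, n_quad=61, damping=0.5):
    r"""Parallel EP for p(x) proportional to N_c(x; h, Q) * prod_j t_j(x_j).

    Each site t_j is approximated by a Gaussian term exp(nu_j*x_j - 0.5*tau_j*x_j**2).
    A sweep recomputes all marginals of q, forms all cavities, and updates all sites
    from quadrature-based moments of the tilted distributions t_j(x_j) q^{\j}(x_j)."""
    n = len(h)
    tau = np.zeros(n)                      # site precisions
    nu = np.zeros(n)                       # site linear terms
    nodes, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)

    for _ in range(n_sweeps):
        cov = np.linalg.inv(Q + np.diag(tau))      # dense for the sketch
        mean = cov @ (h + nu)
        v = np.diag(cov).copy()
        m = mean.copy()

        tau_cav = 1.0 / v - tau                    # cavity precisions (assumed positive here)
        nu_cav = m / v - nu
        v_cav, m_cav = 1.0 / tau_cav, nu_cav / tau_cav

        for j in range(n):                         # "parallel": all sites use the same q
            x = m_cav[j] + np.sqrt(v_cav[j]) * nodes
            tw = w * np.exp(log_t[j](x))
            z = np.sum(tw)
            m_t = np.sum(tw * x) / z               # tilted mean
            v_t = np.sum(tw * x * x) / z - m_t**2  # tilted variance
            tau[j] = (1 - damping) * tau[j] + damping * (1.0 / v_t - tau_cav[j])
            nu[j] = (1 - damping) * nu[j] + damping * (m_t / v_t - nu_cav[j])

    cov = np.linalg.inv(Q + np.diag(tau))
    return cov @ (h + nu), cov, tau, nu
```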
76 Table 1: Computational complexities of the steps for computing an improved marginal approximation for a particular node i, for the methods LA-CM, LA-FACT, EP-1STEP and EP-FACT. Computing q(x_j | x_i) costs ctria + n × ngrid for each of the four methods.
77 The remaining costs, for computing the term approximations ε̃(x_j; x_i) and the normalization constants, are cchol × ngrid and n × ngrid for LA-CM, n × ngrid and n × ngrid for LA-FACT, n × ngrid × nquad and cchol × ngrid for EP-1STEP, and n × ngrid × nquad and n × ngrid for EP-FACT. [sent-590, score-1.249]
78 ngrid refers to the number of grid points and nquad to the number of Gauss-Hermite quadrature nodes for xi . [sent-593, score-0.32]
79 7 Computational Complexities of Marginal Approximations After running the global approximation to obtain the lowest order approximation, we are left with some Gaussian q (x) with known precision matrix, a corresponding Cholesky factor and single-node marginals q(xi ). [sent-598, score-0.375]
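A minimal dense sketch of extracting the single-node Gaussian marginals from the canonical form; for the sparse models considered here one would instead re-use the sparse Cholesky factor (for example via Takahashi-style recursions) rather than forming a dense inverse:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_node_marginals(Q, h):
    """Means and variances of the single-node marginals of N_c(x; h, Q)."""
    c, low = cho_factor(Q)                          # Cholesky factor of the precision
    mean = cho_solve((c, low), h)                   # mean = Q^{-1} h
    cov = cho_solve((c, low), np.eye(len(h)))       # dense inverse; avoided in the sparse case
    return mean, np.diag(cov).copy()
```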
80 The conditional variance is independent of x_i, while the conditional mean is a linear function of x_i. [sent-602, score-0.318]
81 To arrive at the term approximations ε̃(x_j; x_i), we need to compute second-order derivatives for the Laplace approximation and numerical quadratures for EP, which is about nquad times more expensive. [sent-604, score-0.411]
82 In this section, we consider the estimation of the posterior marginals that follow by integrating over the hyper-parameters. [sent-610, score-0.3]
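A sketch of this integration step over a grid of hyper-parameter values; the evidence approximations, prior values and conditional marginal approximations are assumed to have been computed beforehand (by the Laplace method or EP), and all names are illustrative:

```python
import numpy as np

def integrate_over_hyperparameters(log_evidence, log_prior, cond_marginals, x_grid):
    """p(x_i | y) ~= sum_k p(x_i | y, theta_k) * p(theta_k | y) over a grid of theta_k.

    log_evidence[k]   : approximate log p(y | theta_k)
    log_prior[k]      : log p(theta_k)
    cond_marginals[k] : approximate density of x_i given (y, theta_k), evaluated on x_grid
    """
    log_w = np.asarray(log_evidence) + np.asarray(log_prior)
    w = np.exp(log_w - np.max(log_w))
    w /= np.sum(w)                                   # approximate p(theta_k | y) on a uniform grid
    marginal = np.sum(w[:, None] * np.asarray(cond_marginals), axis=0)
    return marginal / np.trapz(marginal, x_grid)     # renormalize on the x_i grid
```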
83 The rest of the panels show the posterior density approximations of f50 and µ. [sent-673, score-0.304]
84 The Laplace and EP approximation of the evidence are nearly indistinguishable (top-row), as are the posterior marginals of the hyper-parameters (second row). [sent-679, score-0.448]
85 The posterior marginals of f50 and µ obtained using the more involved methods (bottom rows) are practically indistinguishable from each other and the gold (sampling) standard. [sent-681, score-0.3]
86 2 A log-Gaussian Cox Process Model As a large sized example, we implemented the Laplace approximation and expectation propagation for the log-Gaussian Cox process model applied to the tropical rainforest bio-diversity data as presented in Rue et al. [sent-684, score-0.287]
87 Figure 10: The posterior approximations of the evidence (top) and β_a and β_g (bottom). [sent-743, score-0.259]
88 The marginal approximations show marginals for the approximate MAP hyper-parameters. [sent-745, score-0.429]
89 Expectation propagation was initialized using the term approximations corresponding to the Laplace method. [sent-749, score-0.244]
90 The top panels of Figure 10 show the evidence approximations while the bottom panels show the marginal approximations for the corresponding MAP hyper-parameters. [sent-751, score-0.465]
91 Figure 11: The approximate posterior marginals of a Gaussian process binary classification model, with the hyper-parameters set to the approximate MAP values yielded by EP. [sent-798, score-0.366]
92 The behavior of the marginals is similar to that in Figure 2; however, in this case, the correction provided by LA-CM2 is not as significant as in Figure 2. [sent-799, score-0.286]
93 It turned out that many posterior marginal densities are skewed; however, most of the skewed marginals are well approximated by EP-L (the marginal of EP's tilted distribution). [sent-806, score-0.527]
94 The panels of Figure 11 show the approximate posterior marginal densities of the latent variable with j = 41. [sent-807, score-0.352]
95 These approximate posterior marginals exhibit a behavior similar to the ones in Figure 2. [sent-808, score-0.331]
96 The right panel shows that the factorized approximations EP-FACT can indeed improve on the Gaussian marginal approximations computed by EP, even in models where non-Gaussian terms depend on more than one variable. [sent-848, score-0.394]
97 Discussion We introduced several methods to improve on the marginal approximations obtained by marginalizing the global approximations. [sent-862, score-0.239]
98 The approximation p̃^{EP-1STEP}(x_i) is computed by defining ε̃_j(x_j; x_i) ≡ Collapse( q(x_j | x_i) ε_j(x_j) ) / q(x_j | x_i) and using the approximation c_i(x_i) ≈ ∫ dx_{\i} q(x_{\i} | x_i) ∏_{j≠i} ε̃_j(x_j; x_i) (see Section 4.1). [sent-954, score-0.53]
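A sketch of the Collapse step used here: match the zeroth, first and second moments of q(x_j | x_i) ε_j(x_j) by quadrature and return the Gaussian term ε̃_j in natural-parameter form (the interface is an assumption for illustration):

```python
import numpy as np

def collapse_to_term(eps_j, m_cond, v_cond, n_quad=32):
    """Return (log_s, nu, tau) such that exp(log_s + nu*x - 0.5*tau*x**2) * N(x; m_cond, v_cond)
    has the same zeroth, first and second moments as N(x; m_cond, v_cond) * eps_j(x)."""
    nodes, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)
    x = m_cond + np.sqrt(v_cond) * nodes
    tw = w * eps_j(x)
    z = np.sum(tw)                            # zeroth moment
    m = np.sum(tw * x) / z                    # matched mean
    v = np.sum(tw * x * x) / z - m**2         # matched variance
    tau = 1.0 / v - 1.0 / v_cond              # natural parameters of the Gaussian term
    nu = m / v - m_cond / v_cond
    log_s = np.log(z) + 0.5 * (np.log(v_cond / v) - m**2 / v + m_cond**2 / v_cond)
    return log_s, nu, tau
```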
99 The approximation p̃^{EP-FACT}(x_i) is computed using the approximation c_i(x_i) ≈ ∏_{j≠i} ∫ dx_j q(x_j | x_i) ε_j(x_j), where the univariate integrals are computed numerically, or analytically when possible. [sent-967, score-0.34]
100 A similar approximation to EP-FACT, but here the univariate integrals are computed with the Laplace method, using the approximation x*_j(x_i) ≈ E_q[x_j | x_i], with q(x) being the global approximation resulting from the Laplace method. [sent-971, score-0.374]
wordName wordTfidf (topN-words)
[('ep', 0.602), ('laplace', 0.356), ('marginals', 0.197), ('uit', 0.19), ('eskes', 0.146), ('seke', 0.146), ('ui', 0.141), ('rue', 0.139), ('xi', 0.135), ('arginals', 0.124), ('propagation', 0.124), ('approximations', 0.12), ('atent', 0.112), ('pproximate', 0.112), ('approximation', 0.112), ('ngrid', 0.11), ('takahashi', 0.11), ('posterior', 0.103), ('piep', 0.095), ('correction', 0.089), ('la', 0.083), ('ti', 0.083), ('opper', 0.081), ('cm', 0.081), ('marginal', 0.081), ('opw', 0.08), ('dx', 0.08), ('aussian', 0.073), ('lm', 0.072), ('gaussian', 0.068), ('players', 0.067), ('ctria', 0.066), ('cholesky', 0.064), ('corrections', 0.061), ('odels', 0.059), ('mode', 0.057), ('latent', 0.057), ('pep', 0.056), ('zq', 0.056), ('panels', 0.054), ('expectation', 0.051), ('xt', 0.045), ('cchol', 0.044), ('nquad', 0.044), ('kuss', 0.044), ('heskes', 0.041), ('tierney', 0.039), ('tilted', 0.039), ('global', 0.038), ('collapse', 0.037), ('volatility', 0.037), ('factorized', 0.037), ('logt', 0.037), ('ci', 0.036), ('evidence', 0.036), ('panel', 0.036), ('factorization', 0.035), ('expansion', 0.032), ('quadrature', 0.031), ('zp', 0.031), ('approximate', 0.031), ('kadane', 0.031), ('eq', 0.03), ('integral', 0.029), ('agassi', 0.029), ('altitude', 0.029), ('ctaka', 0.029), ('lii', 0.029), ('nnzeros', 0.029), ('pila', 0.029), ('pilm', 0.029), ('integrand', 0.029), ('approximating', 0.028), ('qi', 0.028), ('minka', 0.028), ('xx', 0.028), ('normalization', 0.028), ('taylor', 0.028), ('precision', 0.028), ('ft', 0.028), ('correlations', 0.028), ('av', 0.027), ('cox', 0.027), ('density', 0.027), ('prior', 0.027), ('log', 0.026), ('moments', 0.026), ('densities', 0.026), ('cseke', 0.025), ('modal', 0.025), ('zoeter', 0.025), ('tk', 0.024), ('conditional', 0.024), ('sparse', 0.024), ('zi', 0.023), ('variances', 0.022), ('birlutiu', 0.022), ('erisman', 0.022), ('logti', 0.022), ('mnew', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000019 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
2 0.47757989 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
3 0.13229126 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
4 0.12148633 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.098069333 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
6 0.060250901 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.05464429 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
8 0.052717019 35 jmlr-2011-Forest Density Estimation
9 0.052596617 14 jmlr-2011-Better Algorithms for Benign Bandits
10 0.052014008 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
11 0.049872056 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
12 0.047997911 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
13 0.04683847 54 jmlr-2011-Learning Latent Tree Graphical Models
14 0.045672823 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
15 0.04127143 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
16 0.040824283 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
17 0.039543178 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
18 0.039084032 91 jmlr-2011-The Sample Complexity of Dictionary Learning
19 0.037252918 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
20 0.03630643 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
topicId topicWeight
[(0, 0.243), (1, -0.164), (2, -0.346), (3, 0.161), (4, 0.254), (5, -0.351), (6, -0.074), (7, -0.157), (8, 0.337), (9, -0.265), (10, 0.097), (11, -0.042), (12, -0.049), (13, 0.052), (14, -0.013), (15, -0.045), (16, 0.068), (17, 0.031), (18, 0.044), (19, -0.05), (20, -0.011), (21, -0.032), (22, -0.037), (23, 0.037), (24, -0.012), (25, -0.056), (26, 0.021), (27, -0.024), (28, 0.035), (29, -0.03), (30, -0.036), (31, -0.001), (32, -0.014), (33, 0.057), (34, 0.014), (35, -0.011), (36, -0.008), (37, -0.069), (38, 0.072), (39, 0.012), (40, 0.02), (41, -0.01), (42, -0.012), (43, 0.027), (44, 0.008), (45, 0.028), (46, 0.047), (47, -0.017), (48, -0.014), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96186483 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
2 0.95242882 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
3 0.43665156 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
4 0.28844085 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.27847049 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
6 0.24348411 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
7 0.2322378 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
8 0.21389517 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
9 0.19134395 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
10 0.19061564 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.18408084 35 jmlr-2011-Forest Density Estimation
12 0.1765821 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
13 0.15514447 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
14 0.15471427 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
15 0.14467619 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
16 0.14406346 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
17 0.13507102 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
18 0.13022755 12 jmlr-2011-Bayesian Co-Training
19 0.12874334 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
20 0.12644862 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
topicId topicWeight
[(4, 0.035), (9, 0.015), (10, 0.026), (24, 0.034), (31, 0.059), (32, 0.024), (41, 0.018), (60, 0.011), (70, 0.04), (71, 0.014), (73, 0.568), (78, 0.053), (90, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.93938464 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
2 0.93270159 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
3 0.50717878 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
4 0.47325084 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
5 0.46600255 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
6 0.45571396 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
7 0.45004347 48 jmlr-2011-Kernel Analysis of Deep Networks
8 0.44891903 66 jmlr-2011-Multiple Kernel Learning Algorithms
9 0.44488204 35 jmlr-2011-Forest Density Estimation
10 0.42090815 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
11 0.39036265 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.38778988 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
13 0.38581198 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
14 0.37177548 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
15 0.35884908 12 jmlr-2011-Bayesian Co-Training
16 0.35099944 105 jmlr-2011-lp-Norm Multiple Kernel Learning
17 0.35009643 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
19 0.34859025 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization
20 0.3472639 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets