jmlr jmlr2011 jmlr2011-13 knowledge-graph by maker-knowledge-mining

13 jmlr-2011-Bayesian Generalized Kernel Mixed Models


Source: pdf

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). [sent-9, score-0.117]

2 In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. [sent-12, score-0.264]

3 When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). [sent-13, score-0.18]

4 Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes. [sent-15, score-0.164]

5 Supervised learning based on reproducing kernel Hilbert spaces (RKHSs) has become increasingly popular since the support vector machine (SVM) (Vapnik, 1998) and its variants such as penalized kernel logistic regression models (Zhu and Hastie, 2005) have been proposed. [sent-16, score-0.199]

6 Penalized kernel logistic regression models, however, are not naturally sparse. [sent-19, score-0.117]

7 Thus, Zhu and Hastie (2005) proposed a methodology that they refer to as the import vector machine (IVM), where a fraction of the training data—called import vectors by analogy to the support vectors of the SVM—are used to index kernel basis functions. [sent-20, score-0.244]

8 (2005), for example, since conjugate priors for the regression vector do not exist, a sampling methodology based on data augmentation was employed to update the regression vector. [sent-35, score-0.14]

9 In this paper we propose generalized kernel models (GKMs) as a framework in which sparsity can be given an explicit treatment and in which a fully Bayesian methodology can be carried out. [sent-39, score-0.115]

10 We define active vectors to be those input vectors that are indexed by the nonzero components of the regression vector in GKMs. [sent-41, score-0.207]

11 Third, the mixture of the point-mass prior and the Silverman g-prior allows a fraction of regression coefficients in question to be zero and thus provides an explicit Bayesian approach to the selection of active vectors. [sent-48, score-0.149]

12 The algorithm uses a reversible jump procedure (Green, 1995) for the automatic selection of active vectors and a Bayesian model averaging method (Raftery et al. [sent-54, score-0.303]

13 In particular, we show that our reversible jump method can be used to implement a “subset of regressors” approximation method for GP-based classification. [sent-73, score-0.134]

14 By the representer theorem (Wahba, 1990), the solution for (1) is of the form f(x) = u + ∑_{j=1}^{n} β_j K(x, x_j), (2) where u is called an offset term, K(·, ·) is the kernel function and the β_j are referred to as regression coefficients. [sent-84, score-0.117]
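
To make the form of (2) concrete, here is a minimal sketch of such a kernel expansion, assuming a Gaussian RBF kernel like the one used later in the experiments; the names (rbf_kernel, predict, X_train, beta) are illustrative and not taken from the authors' code.

```python
import numpy as np

def rbf_kernel(x, z, theta=1.0):
    """Gaussian RBF kernel K(x, z) = exp(-||x - z||^2 / theta^2)."""
    return np.exp(-np.sum((x - z) ** 2) / theta ** 2)

def predict(x, X_train, beta, u=0.0, theta=1.0):
    """Evaluate f(x) = u + sum_j beta_j K(x, x_j) over the training inputs."""
    return u + sum(b * rbf_kernel(x, xj, theta) for b, xj in zip(beta, X_train))

# Toy usage: five training points in R^2 and a sparse regression vector;
# the zero coefficients correspond to inactive input vectors.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 2))
beta = np.array([0.7, 0.0, -1.2, 0.0, 0.3])
print(predict(np.zeros(2), X_train, beta, u=0.1, theta=1.0))
```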

15 The predictive function f(x) in (2) is based on a basis expansion of kernel functions. [sent-102, score-0.114]

16 3 Sparse Models Recall that the number of active vectors is equal to the number of nonzero components of β. [sent-165, score-0.143]

17 That is, if β j = 0, the jth input vector is excluded from the basis expansion in (2), otherwise the jth input vector is an active vector. [sent-166, score-0.146]

18 We introduce a binary vector γ = (γ_1, . . . , γ_n)′ such that γ_j = 1 if x_j is an active vector and γ_j = 0 if it is not. [sent-172, score-0.114]

19 Let n_γ = ∑_{j=1}^{n} γ_j be the number of active vectors, and let K_γ be the n×n_γ submatrix of K consisting of those columns of K for which γ_j = 1. [sent-173, score-0.114]
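
As a small illustration of how γ selects the active columns of the kernel matrix, consider the sketch below; the variable names (K, gamma, K_gamma) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
K = rng.random((n, n))                 # full n x n kernel matrix
gamma = np.array([1, 0, 0, 1, 1, 0])   # gamma_j = 1 marks the j-th input as active
n_gamma = int(gamma.sum())             # number of active vectors
K_gamma = K[:, gamma == 1]             # n x n_gamma submatrix: columns with gamma_j = 1
assert K_gamma.shape == (n, n_gamma)
```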

20 Since τ is defined as the probit link in our FBGKM, we have σ² = 1, and y_i = 1 if s_i > 0 and y_i = 0 otherwise. [sent-195, score-0.107]

21 For example, the Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖²/θ²) is a function of the width parameter θ. [sent-210, score-0.107]

22 Figure 1: A graphical representation for the hierarchical model. [sent-224, score-0.139]

23 , 2005), in which computing the inverses and determinants of two consecutive full kernel matrices K and K∗ is required at each sweep of MCMC sampling. [sent-259, score-0.112]

24 Step (c) is used for the automatic choice of active vectors. [sent-260, score-0.114]

25 This method was derived from the reversible jump methodology of Green (1995). [sent-262, score-0.167]

26 Letting k = n_γ, we denote the probabilities of birth, death and swap by b_k, d_k and 1−b_k−d_k, respectively. [sent-265, score-0.141]

27 We set b_k = d_k = 0.3 for 1 ≤ k ≤ kmax−1, and d_k = 1 and b_k = 0 for kmax ≤ k ≤ n. [sent-268, score-0.829]

28 Here, kmax is a specified maximum number of active vectors such that kmax ≤ n. [sent-269, score-0.889]
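
A minimal sketch of how such a move type might be drawn is given below, using the birth/death probabilities as read above (0.3 below kmax, forced death at kmax); the function name propose_move and the exact probabilities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def propose_move(k, k_max, b=0.3, d=0.3, rng=None):
    """Draw a birth/death/swap move for a model with k active vectors.

    Birth adds an active vector, death removes one, swap exchanges one;
    death is forced once k reaches k_max (illustrative choice of probabilities).
    """
    rng = rng or np.random.default_rng()
    if k >= k_max:
        b_k, d_k = 0.0, 1.0
    else:
        b_k, d_k = b, d
    return rng.choice(["birth", "death", "swap"], p=[b_k, d_k, 1.0 - b_k - d_k])

print(propose_move(k=50, k_max=200))    # birth, death or swap
print(propose_move(k=200, k_max=200))   # always death
```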

29 This method also employs birth, death and swap moves; it differs from the reversible jump procedure because it does not incorporate the probabilities of birth, death and swap into its acceptance probabilities. [sent-271, score-0.28]

30 It is worth noting that when n is relatively large, we can reduce the computational burden by giving kmax a value far less than n, that is, kmax ≪ n, and then computing Q_γ^{-1} = I_n − K_γ ϒ_γ^{-1} K_γ′ and |Q_γ| = |ϒ_γ||Σ_γ|^{-1} = η^{-1} g^{-n_γ} |K_γγ|^{-1} |ϒ_γ|. (11) [sent-273, score-0.746]

31 For example, for both the USPS and NewsGroups data sets used in our experiments, we set kmax = 200 ≪ n. [sent-274, score-0.373]

32 Finally, in the reversible jump method, the matrices obtained before and after each move only change a column and a row. [sent-278, score-0.134]

33 This implies that the Bayesian model averaging method uses 1,000 (T = (10,000−5,000)/5) active sets for prediction. [sent-295, score-0.14]
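
The thinning arithmetic behind T can be sketched as follows; the variable names are mine and purely illustrative.

```python
# Keep every 5th sweep after the burn-in: T = (10,000 - 5,000) / 5 = 1,000 active sets.
n_sweeps, burn_in, thin = 10_000, 5_000, 5
kept = range(burn_in, n_sweeps, thin)
T = len(kept)
assert T == (n_sweeps - burn_in) // thin == 1_000
# The Bayesian-model-averaged prediction is then the mean of the per-sweep
# predictive probabilities over these T retained active sets.
```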

34 In fact, our reversible jump MCMC algorithm deals with parameter estimation, model selection and posterior prediction jointly in a single paradigm. [sent-297, score-0.168]

35 Moreover, the reversible jump method is a sequential approach for model selection and posterior prediction. [sent-298, score-0.168]

36 This implies that after the burn-in the selection of active vectors and the prediction of responses are simultaneously implemented. [sent-299, score-0.143]

37 The following proposition summarizes the connection between the Gaussian process and the feature basis expansion ∑_{k=1}^{r} b_k ψ_k(x) given in (4). [sent-309, score-0.115]

38 There is a map ψ(x) = (ψ_1(x), . . . , ψ_r(x))′ from X to R^r (r is possibly infinite) such that K(x_i, x_j) = ψ(x_i)′ψ(x_j) for x_i, x_j ∈ X and ζ(x) = ∑_{k=1}^{r} b_k ψ_k(x). [sent-313, score-0.108]

39 As discussed in Section 2, the Karhunen-Loève expansion can also be approximated by a finite-dimensional expansion over the training data set; that is, ζ(x) = ∑_{i=1}^{n} β_i K(x, x_i) with β = (β_1, . . . , β_n)′. [sent-326, score-0.146]

40 If the kernel function is stationary, then v is near zero and the posterior predictive mean of s∗ reverts to u when x∗ is far from points in the set I . [sent-367, score-0.116]

41 That is, the prediction is based on the average over T active sets. [sent-370, score-0.114]

42 In the following experiments the average is taken over 1,000 active sets. [sent-371, score-0.114]

43 It is again worth noting that the reversible jump MCMC algorithm devised in this paper deals with parameter estimation and posterior prediction jointly in a single paradigm. [sent-372, score-0.168]

44 Moreover, the reversible jump methodology is a sequential approach to model selection and posterior prediction. [sent-373, score-0.201]

45 The main computational burden of the MCMC algorithm comes from the sampling procedure for parameter estimation, and the MCMC algorithm does not require extra computational complexity for the selection of active vectors and the prediction of responses. [sent-374, score-0.143]

46 All of the methods that we implement are based on a Gaussian RBF kernel with a single width parameter; that is, K(x_i, x_j) = exp(−‖x_i − x_j‖²/θ²). [sent-449, score-0.107]
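
A vectorized sketch of this kernel matrix computation is shown below; the function and variable names are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(X, theta):
    """Pairwise Gaussian RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / theta^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / theta ** 2)

X = np.random.default_rng(0).standard_normal((100, 10))
K = rbf_kernel_matrix(X, theta=2.0)
assert K.shape == (100, 100) and np.allclose(np.diag(K), 1.0)
```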

47 This implies that the Bayesian model averaging component of our Bayesian methods uses 1,000 (T = (10,000−5,000)/5) active sets for testing. [sent-468, score-0.14]

48 [Table fragment: BSVM err (±std) column header and BCI row; values truncated.] [sent-477, score-0.205]

49 In the following experiments, we attempt to analyze the performance of the methods with respect to different values of the training size n and the maximum number kmax of active vectors. [sent-622, score-0.51]

50 Tables 4 and 5 show the experimental results when changing the training size n and fixing the maximum number of active vectors to kmax = 200. [sent-624, score-0.539]

51 Table 6 shows the experimental results for our FBGKM and SGPC methods with respect to different values of the maximum number kmax of active vectors and for a fixed training size of n = 800. [sent-628, score-0.539]

52 The performance of these two methods is roughly similar for each setting of kmax; that is, they are insensitive to kmax. [sent-629, score-0.746]

53 However, their computational costs tend to slightly increase as the maximum number kmax of active vectors increases. [sent-630, score-0.516]

54 Additionally, in order to study the MCMC mixing performance of our FBGKM and SGPC methods we report the numbers of active vectors over different data sets. [sent-631, score-0.143]

55 In particular, Figure 2 [Table 4 header fragment: training sizes n = 300 to 800; BSVM err (±std).] [sent-632, score-0.205]

56 Table 4: Experimental results for the five methods corresponding to different training sizes n on the NewsGroups data set with kmax = 200: err− the test error rates (%); std− the corresponding standard deviation. [sent-692, score-0.396]

57 Table 5: The computational times (s) for the five methods corresponding to different training sizes n on the NewsGroups data set with kmax = 200. [sent-723, score-0.396]

58 [Table 6 header fragment: kmax = 300, 400, 500, 600, 700; FBGKM err (±std) and time columns.] [sent-724, score-2.586]

59 Table 6: Experimental results for FBGKM and SGPC corresponding to different maximum numbers kmax of active vectors on the NewsGroups data set with n = 800: err− the test error rates (%); std− the corresponding standard deviation; time− the corresponding computational time (s). [sent-754, score-0.516]

60 depicts the output of the numbers nγ of active vectors corresponding to the first 6000 sweeps in the MCMC inference procedure on BCI, Digit1, Letters {(A vs. [sent-755, score-0.171]

61-67 [Figure 2 trace-plot axis residue omitted; panels (a)-(h) show FBGKM and SGPC with kmax = 100 and kmax = 200.] Figure 2: MCMC Output for the numbers nγ of active vectors of our FBGKM and SGPC methods on the four data sets: (a/b) BCI; (c/d) Digit1; (e/f) Letters (A vs. [sent-774, score-0.889]

68 In probit-type models, since the posterior distribution of each si is truncated normal, we are able to update the si using the Gibbs sampler. [sent-776, score-0.112]
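
A minimal sketch of such a truncated-normal Gibbs draw for the probit latent variable is given below, using scipy; the conditional mean (u + k_i′β in the notation above) is passed in as `mean`, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_latent(mean, y, rng=None):
    """Draw s ~ N(mean, 1) truncated to s > 0 if y == 1 and to s <= 0 if y == 0."""
    rng = rng or np.random.default_rng()
    if y == 1:
        a, b = -mean, np.inf      # standardized bounds for the region s > 0
    else:
        a, b = -np.inf, -mean     # standardized bounds for the region s <= 0
    return truncnorm.rvs(a, b, loc=mean, scale=1.0, random_state=rng)

print(sample_latent(mean=0.4, y=1))   # positive draw
print(sample_latent(mean=0.4, y=0))   # non-positive draw
```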

69 Table 7 describes the distributions of active vectors appearing after the burn-in (in the last 5,000 sweeps). [sent-782, score-0.143]

70 As we can see, the number nγ of active vectors jumps within a small range for the different data sets, due to the rapid mixing. [sent-783, score-0.143]

71 The number nγ of active vectors occurring with the maximum frequency is also given in Table 7. [sent-784, score-0.286]

72 Max—the maximum number of active vectors to appear; Min—the minimum number of active vectors to appear; Most—the number of active vectors with the maximum frequency and the corresponding frequency shown in brackets. [sent-790, score-0.429]

73 We further evaluate the performance of our sparse Bayesian kernel methods under kernel parameter learning, and compare the FBGKM and SGPC with SGP+FIC and full GP (FGP) (Rasmussen and Williams, 2006). [sent-792, score-0.199]

74 Since, for FGP+KL, learning the kernel parameters results in a huge computational cost, we set the sizes of the training and test data to 1000 (i. [sent-799, score-0.105]

75 However, to provide an apples-to-apples comparison with our sparse Bayesian kernel methods, we employ MCMC inference for SGP+FIC+KL. [sent-804, score-0.117]

76 For the sparse methods compared here, we fix the size of the active set to 100, that is, kmax = 100. [sent-805, score-0.522]

77 [Table 8 header fragment: data sets Splice, Astrop, Mushrooms, Adult; SGP+FIC+KL err (±std).] [sent-817, score-0.205]

78 Table 8: Experimental results for the four Bayesian kernel methods on the four data sets with learned kernel parameters θ, kmax = 100, n = 1000, and m = 1000: err− the test error rates (%); std− the corresponding standard deviation. [sent-849, score-0.537]

79 Table 9: The computational times (s) of the four Bayesian kernel methods on the four data sets with learned kernel parameters θ, kmax = 100, n = 1000, and m = 1000. [sent-866, score-0.537]

80 In order to further evaluate the performance of our sparse Bayesian kernel methods on some larger data sets, we also conduct comparative experiments of FBGKM, SGPC, and SGP+FIC (Snelson and Ghahramani, 2006) on the Adult1 , Adult2 , Mushrooms, Splice, and Astroparticle data sets. [sent-867, score-0.14]

81 For these sparse methods, we fix the size of the active set, kmax, to 200. [sent-869, score-0.522]

82 Furthermore, it is still difficult for SGP+FIC+MCMC and SGP+FIC+EP to calculate the optimal solution for sparse approximation of the full Gaussian process, due to the sensitivity of the performance to the initial active set. [sent-874, score-0.149]

83 Table 10: Experimental results for the five methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200: err− the test error rates (%); std− the corresponding standard deviation. [sent-923, score-0.373]

84 Table 11 and Figure 4 report the average computational times of the compared sparse Bayesian kernel methods on the five data sets, with Figure 4 depicting the computational times on a logarithmic scale. [sent-924, score-0.117]

85 Table 11: The computational times (s) of the four sparse Bayesian kernel methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200. [sent-946, score-0.49]

86 In addition, we conduct a quantitative assessment of convergence of the MCMC algorithms for the three sparse Bayesian kernel methods. [sent-947, score-0.14]

87 From Table 12, we can see that all MCMC algorithms in the three sparse Bayesian kernel methods can achieve convergence on the five data sets after the first 5000 sweeps. [sent-952, score-0.117]

88 Table 12: Monitoring convergence of MCMC algorithms for the three sparse Bayesian kernel methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200: time− the computational time (s) for convergence; burn-in− the length of chain for convergence. [sent-968, score-0.49]

89 In addition, the maximum number kmax of active vectors is set according to Table 1. [sent-976, score-0.516]

90 [Table header fragment: data sets Ringnorm, Thyroid, Twonorm, Waveform; err column.] [sent-981, score-0.205]

91 In particular, given a kernel function, we can estimate parameters of the kernel function. [sent-1023, score-0.164]

92 Given β_l = (β_{l1}, . . . , β_{ln})′ and g_l ≥ 0, we have f(x) = u + ∑_{l=1}^{q} g_l ∑_{i=1}^{n} K_l(x, x_i) β_{li}. [sent-1038, score-0.115]

93 Now we assign β_l ∼ N_n(0, σ²(K^{(l)})^{-1}) and g_l ∼ ρ δ_0(g_l) + (1 − ρ) Ga(g_l | a_g/2, b_g/2), where K^{(l)} = Ψ_l Ψ_l′ (n×n) is the lth kernel matrix, δ_0(·) is a point-mass at zero and the user-specified parameter ρ ∈ (0, 1) controls the levels of the nonzero g_l. [sent-1039, score-0.215]
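
A sketch of the resulting multiple-kernel expansion f(x) = u + ∑_l g_l ∑_i K_l(x, x_i) β_{li} is given below, where a g_l drawn from the point-mass at zero removes the whole l-th kernel; all names here are illustrative assumptions.

```python
import numpy as np

def multi_kernel_predict(x, X_train, kernels, g, betas, u=0.0):
    """f(x) = u + sum_l g_l * sum_i K_l(x, x_i) * beta_{l,i}."""
    f = u
    for K_l, g_l, beta_l in zip(kernels, g, betas):
        if g_l == 0.0:            # point-mass at zero drops the l-th kernel entirely
            continue
        f += g_l * sum(b * K_l(x, xi) for b, xi in zip(beta_l, X_train))
    return f

# Toy usage with a linear and an RBF kernel.
rng = np.random.default_rng(0)
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))
lin = lambda x, z: float(x @ z)
X_train = rng.standard_normal((4, 3))
betas = [rng.standard_normal(4), rng.standard_normal(4)]
print(multi_kernel_predict(np.zeros(3), X_train, [lin, rbf], g=[0.0, 1.5], betas=betas))
```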

94 Note that kernel parameter learning and multiple kernel learning can be incorporated together. [sent-1041, score-0.164]

95 As in the binary problem, we also introduce a binary n-vector γ with γ_i = 1 if x_i is an active vector and γ_i = 0 otherwise. [sent-1074, score-0.278]

96 Thus, the posterior distribution of each s_ij is truncated normal, [s_ij | γ, u_j, β_j, y_ij] ∼ N(u_j + k_i′ β_j, 1), subject to s_ij > s_il for all l ≠ j if y_ij = 1. [sent-1096, score-0.112]

97 In this paper we have discussed Bayesian generalized kernel mixed models, including Bayesian generalized kernel models and Gaussian processes for classification. [sent-1102, score-0.164]

98 In particular, we have proposed fully Bayesian kernel methods based on the Silverman g-prior and a Bayesian model averaging method. [sent-1103, score-0.108]

99 Because of the connection between kernel methods and Gaussian processes, the MCMC algorithm can be immediately applied to sparse Gaussian processes. [sent-1105, score-0.117]

100 The Karhunen-Loève expansion of ζ(x) is then given by ζ(x) = ∑_{k=1}^{r} b_k ψ_k(x), where the b_k are random variables given by b_k = ∫ ζ(x) ψ_k(x) dx. [sent-1125, score-0.315]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fbgkm', 0.399), ('kmax', 0.373), ('sgpc', 0.364), ('sgp', 0.243), ('mcmc', 0.21), ('err', 0.205), ('fic', 0.186), ('std', 0.176), ('bayesian', 0.148), ('silverman', 0.146), ('mushrooms', 0.139), ('ordan', 0.121), ('splice', 0.116), ('active', 0.114), ('gkms', 0.095), ('eneralized', 0.093), ('ixed', 0.093), ('astrop', 0.087), ('fgp', 0.087), ('dai', 0.085), ('hang', 0.085), ('bk', 0.083), ('kernel', 0.082), ('reversible', 0.081), ('nn', 0.081), ('astroparticle', 0.078), ('bgkm', 0.078), ('mallick', 0.078), ('csvm', 0.074), ('probit', 0.068), ('kl', 0.068), ('bsvm', 0.067), ('bci', 0.066), ('newsgroups', 0.064), ('ernel', 0.061), ('thyroid', 0.061), ('zellner', 0.061), ('usps', 0.055), ('odels', 0.054), ('jump', 0.053), ('ringnorm', 0.052), ('twonorm', 0.052), ('ivm', 0.052), ('adult', 0.048), ('gl', 0.045), ('diggle', 0.043), ('gkm', 0.043), ('kohn', 0.043), ('ep', 0.043), ('bg', 0.043), ('si', 0.039), ('augmentation', 0.037), ('glms', 0.037), ('sparse', 0.035), ('regression', 0.035), ('nott', 0.035), ('ve', 0.034), ('posterior', 0.034), ('birth', 0.033), ('death', 0.033), ('methodology', 0.033), ('ag', 0.032), ('expansion', 0.032), ('letters', 0.031), ('snelson', 0.03), ('acceptance', 0.03), ('inverses', 0.03), ('vectors', 0.029), ('bernardo', 0.028), ('holmes', 0.028), ('sweeps', 0.028), ('gaussian', 0.027), ('vec', 0.027), ('waveform', 0.027), ('smith', 0.026), ('classi', 0.026), ('gpc', 0.026), ('gpcs', 0.026), ('guang', 0.026), ('maclehose', 0.026), ('zhihua', 0.026), ('averaging', 0.026), ('xi', 0.025), ('swap', 0.025), ('gp', 0.025), ('normal', 0.025), ('rasmussen', 0.025), ('girolami', 0.024), ('import', 0.024), ('ga', 0.024), ('williams', 0.024), ('liang', 0.024), ('svm', 0.024), ('training', 0.023), ('conduct', 0.023), ('zhang', 0.023), ('pillai', 0.023), ('mcculloch', 0.022), ('regressors', 0.022), ('singular', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

2 0.07995382 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

Author: Lauren A. Hannah, David M. Blei, Warren B. Powell

Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency

3 0.076851748 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

4 0.05464429 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

5 0.051236976 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

6 0.045041721 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

7 0.04451834 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

8 0.040797915 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

9 0.037959851 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

10 0.036898836 61 jmlr-2011-Logistic Stick-Breaking Process

11 0.036008593 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

12 0.035652656 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

13 0.035649478 66 jmlr-2011-Multiple Kernel Learning Algorithms

14 0.035047598 105 jmlr-2011-lp-Norm Multiple Kernel Learning

15 0.032863867 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

16 0.032557704 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

17 0.031179598 58 jmlr-2011-Learning from Partial Labels

18 0.030707506 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

19 0.029319255 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

20 0.029097324 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.171), (1, -0.091), (2, -0.06), (3, -0.019), (4, 0.019), (5, -0.075), (6, -0.044), (7, 0.042), (8, -0.011), (9, -0.089), (10, 0.103), (11, -0.012), (12, 0.122), (13, 0.062), (14, 0.055), (15, 0.023), (16, -0.005), (17, -0.053), (18, -0.021), (19, 0.093), (20, -0.059), (21, 0.037), (22, 0.073), (23, 0.038), (24, -0.034), (25, 0.114), (26, -0.11), (27, -0.121), (28, -0.03), (29, 0.066), (30, 0.152), (31, 0.139), (32, -0.053), (33, -0.098), (34, -0.243), (35, 0.036), (36, -0.088), (37, -0.171), (38, -0.014), (39, -0.066), (40, -0.203), (41, -0.002), (42, 0.196), (43, -0.039), (44, -0.137), (45, -0.018), (46, 0.076), (47, 0.166), (48, -0.266), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91658795 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

2 0.46591246 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

Author: Lauren A. Hannah, David M. Blei, Warren B. Powell

Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency

3 0.41180441 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

4 0.3373192 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

5 0.27474892 61 jmlr-2011-Logistic Stick-Breaking Process

Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson

Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation

6 0.27232438 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

7 0.26726928 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

8 0.24511941 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

9 0.2357059 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

10 0.21997096 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

11 0.21428171 65 jmlr-2011-Models of Cooperative Teaching and Learning

12 0.2139706 66 jmlr-2011-Multiple Kernel Learning Algorithms

13 0.20861365 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

14 0.20786667 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

15 0.20545825 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

16 0.20334896 48 jmlr-2011-Kernel Analysis of Deep Networks

17 0.1935675 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

18 0.19298303 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

19 0.18960851 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

20 0.1849245 29 jmlr-2011-Efficient Learning with Partially Observed Attributes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.426), (4, 0.043), (6, 0.012), (9, 0.021), (10, 0.033), (11, 0.015), (24, 0.047), (31, 0.083), (32, 0.028), (41, 0.028), (60, 0.013), (65, 0.012), (67, 0.023), (70, 0.018), (73, 0.046), (78, 0.05), (90, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67050308 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

2 0.28709704 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

Author: Lauren A. Hannah, David M. Blei, Warren B. Powell

Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency

3 0.28049505 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

4 0.2796478 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

5 0.2778334 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

6 0.27625299 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.27606642 12 jmlr-2011-Bayesian Co-Training

8 0.27591872 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

9 0.27403942 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

10 0.27402341 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

11 0.27367592 91 jmlr-2011-The Sample Complexity of Dictionary Learning

12 0.27347314 48 jmlr-2011-Kernel Analysis of Deep Networks

13 0.27200329 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

14 0.27185273 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

15 0.27006522 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

16 0.26996389 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

17 0.26992893 16 jmlr-2011-Clustering Algorithms for Chains

18 0.26740709 35 jmlr-2011-Forest Density Estimation

19 0.26649335 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

20 0.2659032 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments