jmlr jmlr2011 jmlr2011-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
Reference: text
sentIndex sentText sentNum sentScore
1 We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). [sent-9, score-0.117]
2 In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. [sent-12, score-0.264]
3 When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). [sent-13, score-0.18]
4 Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes 1. [sent-15, score-0.164]
5 Introduction. Supervised learning based on reproducing kernel Hilbert spaces (RKHSs) has become increasingly popular since the support vector machine (SVM) (Vapnik, 1998) and its variants such as penalized kernel logistic regression models (Zhu and Hastie, 2005) have been proposed. [sent-16, score-0.199]
6 Penalized kernel logistic regression models, however, are not naturally sparse. [sent-19, score-0.117]
7 Thus, Zhu and Hastie (2005) proposed a methodology that they refer to as the import vector machine (IVM), where a fraction of the training data—called import vectors by analogy to the support vectors of the SVM—are used to index kernel basis functions. [sent-20, score-0.244]
8 (2005), for example, since conjugate priors for the regression vector do not exist, a sampling methodology based on data augmentation was employed to update the regression vector. [sent-35, score-0.14]
9 In this paper we propose generalized kernel models (GKMs) as a framework in which sparsity can be given an explicit treatment and in which a fully Bayesian methodology can be carried out. [sent-39, score-0.115]
10 We define active vectors to be those input vectors that are indexed by the nonzero components of the regression vector in GKMs. [sent-41, score-0.207]
11 Third, the mixture of the point-mass prior and the Silverman g-prior allows a fraction of regression coefficients in question to be zero and thus provides an explicit Bayesian approach to the selection of active vectors. [sent-48, score-0.149]
12 The algorithm uses a reversible jump procedure (Green, 1995) for the automatic selection of active vectors and a Bayesian model averaging method (Raftery et al. [sent-54, score-0.303]
13 In particular, we show that our reversible jump method can be used to implement a “subset of regressors” approximation method for GP-based classification. [sent-73, score-0.134]
14 By the representer theorem (Wahba, 1990), the solution for (1) is of the form f(x) = u + ∑_{j=1}^{n} β_j K(x, x_j), (2) where u is called an offset term, K(·, ·) is the kernel function, and the β_j are referred to as regression coefficients. [sent-84, score-0.117]
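As a rough illustration of the basis expansion in (2), the following Python sketch (not taken from the paper; all names are placeholders) evaluates f(x) = u + ∑_j β_j K(x, x_j) for any kernel function passed in.

```python
import numpy as np

def predict_f(x, X_train, beta, u, kernel):
    """Evaluate the representer-theorem form (2): f(x) = u + sum_j beta_j K(x, x_j)."""
    return u + sum(b_j * kernel(x, x_j) for b_j, x_j in zip(beta, X_train))
```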
15 The predictive function f (x) in (2) is based on a basis expansion of kernel functions. [sent-102, score-0.114]
16 Sparse Models. Recall that the number of active vectors is equal to the number of nonzero components of β. [sent-165, score-0.143]
17 That is, if β j = 0, the jth input vector is excluded from the basis expansion in (2), otherwise the jth input vector is an active vector. [sent-166, score-0.146]
18 We introduce an indicator vector γ = (γ1, . . . , γn)′ such that γj = 1 if xj is an active vector and γj = 0 if it is not. [sent-172, score-0.114]
19 Let nγ = ∑_{j=1}^{n} γj be the number of active vectors, and let Kγ be the n×nγ submatrix of K consisting of those columns of K for which γj = 1. [sent-173, score-0.114]
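A minimal sketch (placeholder names, not the paper's code) of how the indicator vector γ selects the n×nγ submatrix Kγ from the full kernel matrix:

```python
import numpy as np

def active_submatrix(K, gamma):
    """Given the full n x n kernel matrix K and a 0/1 indicator vector gamma,
    return K_gamma, the n x n_gamma submatrix whose columns have gamma_j = 1."""
    gamma = np.asarray(gamma, dtype=bool)
    return K[:, gamma]          # shape (n, n_gamma), with n_gamma = gamma.sum()
```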
20 Since τ is defined as the probit link in our FBGKM, we have σ² = 1, and yi = 1 if si > 0 and yi = 0 otherwise. [sent-195, score-0.107]
21 For example, the Gaussian kernel K(xi, xj) = exp(−‖xi − xj‖²/θ²) is a function of the width parameter θ. [sent-210, score-0.107]
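For concreteness, a sketch of the Gaussian kernel matrix with width parameter θ (a vectorized form I am assuming; it is not code from the paper):

```python
import numpy as np

def gaussian_kernel_matrix(X, theta):
    """K[i, j] = exp(-||x_i - x_j||^2 / theta^2) for the rows of X (shape n x d)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / theta ** 2)  # clip tiny negatives from round-off
```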
22 Figure 1: A graphical representation for the hierarchical model (variables γ, β, g, θ, s, σ, y, η, u, α and hyperparameters aα, bα, ag, bg, aη, bη). [sent-224, score-0.139]
23 , 2005), in which computing the inverses and determinants of two consecutive full kernel matrices K and K∗ is required at each sweep of MCMC sampling. [sent-259, score-0.112]
24 Step (c) is used for the automatic choice of active vectors. [sent-260, score-0.114]
25 This method was derived from the reversible jump methodology of Green (1995). [sent-262, score-0.167]
26 Letting k = nγ , we denote the probabilities of birth, death and swap by bk , dk and 1−bk −dk , respectively. [sent-265, score-0.141]
27 We take bk = dk = 0.3 for 1 ≤ k ≤ kmax − 1, and dk = 1 and bk = 0 for kmax ≤ k ≤ n. [sent-268, score-0.829]
28 Here, kmax is a specified maximum number of active vectors such that kmax ≤ n. [sent-269, score-0.889]
29 This method also employs birth, death and swap moves; it differs from the reversible jump procedure because it does not incorporate the probabilities of birth, death and swap into its acceptance probabilities. [sent-271, score-0.28]
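A hedged sketch of how a move type might be drawn in such a reversible jump scheme, using the birth/death/swap probabilities bk, dk and 1 − bk − dk described above (the specific 0.3/0.4 values and the omission of boundary handling at k = 1 are assumptions of this sketch):

```python
import numpy as np

def choose_move(k, k_max, rng=np.random.default_rng()):
    """Pick 'birth', 'death' or 'swap' given the current number of active vectors k.
    b_k = d_k = 0.3 for 1 <= k <= k_max - 1; once k reaches k_max, only deaths are allowed."""
    if k >= k_max:
        b_k, d_k = 0.0, 1.0
    else:
        b_k, d_k = 0.3, 0.3
    return rng.choice(["birth", "death", "swap"], p=[b_k, d_k, 1.0 - b_k - d_k])
```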
30 It is worth noting that when n is relatively large, we can reduce the computational burden by giving kmax a value far less than n, that is, kmax ≪ n, and then computing Qγ^{-1} = In − Kγ ϒγ^{-1} Kγ′ and |Qγ| = |ϒγ| |Σγ|^{-1} = η^{-1} g^{-nγ} |Kγγ|^{-1} |ϒγ|. (11) [sent-273, score-0.746]
31 For example, for both the USPS and NewsGroups data sets used in our experiments, we set kmax = 200 ≪ n. [sent-274, score-0.373]
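The point of (11) is that only nγ × nγ matrices need to be factorized when kmax ≪ n. A sketch of that pattern, treating ϒγ as a given nγ × nγ positive-definite matrix (the exact definitions of ϒγ and Σγ are not reproduced in this excerpt, so this is only an illustration of the low-rank structure):

```python
import numpy as np

def q_inverse_lowrank(K_gamma, Upsilon_gamma):
    """Compute Q_gamma^{-1} = I_n - K_gamma Upsilon_gamma^{-1} K_gamma'.
    Only an n_gamma x n_gamma linear system is solved; no n x n inverse is formed."""
    n = K_gamma.shape[0]
    inner = np.linalg.solve(Upsilon_gamma, K_gamma.T)   # Upsilon_gamma^{-1} K_gamma'
    return np.eye(n) - K_gamma @ inner
```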
32 Finally, in the reversible jump method, the matrices obtained before and after each move differ only by a single column and row. [sent-278, score-0.134]
33 This implies that the Bayesian model averaging method uses 1,000 (T = (10,000 − 5,000)/5) active sets for prediction. [sent-295, score-0.14]
34 In fact, our reversible jump MCMC algorithm deals with parameter estimation, model selection and posterior prediction jointly in a single paradigm. [sent-297, score-0.168]
35 Moreover, the reversible jump method is a sequential approach for model selection and posterior prediction. [sent-298, score-0.168]
36 This implies that after the burn-in the selection of active vectors and the prediction of responses are simultaneously implemented. [sent-299, score-0.143]
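A hedged sketch of the Bayesian model averaging step described here, assuming the probit link of the FBGKM and placeholder variable names: predictions are averaged over the T stored active sets from the post-burn-in sweeps.

```python
import numpy as np
from scipy.stats import norm

def bma_predict(x_star, samples, kernel):
    """Average probit predictive probabilities over stored posterior samples.
    Each sample is a dict with keys 'u' (offset), 'active_X' (active input vectors)
    and 'beta' (the corresponding regression coefficients)."""
    probs = []
    for s in samples:
        f = s["u"] + sum(b * kernel(x_star, xj) for b, xj in zip(s["beta"], s["active_X"]))
        probs.append(norm.cdf(f))      # P(y* = 1) under the probit link
    return np.mean(probs)
```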
37 Gaussian Process Priors. The following proposition summarizes the connection between the Gaussian process and the feature basis expansion ∑_{k=1}^{r} bk ψk(x) given in (4). [sent-309, score-0.115]
38 There exists a map ψ(x) = (ψ1(x), . . . , ψr(x))′ from X to R^r (r is possibly infinite) such that K(xi, xj) = ψ(xi)′ψ(xj) for xi, xj ∈ X and ζ(x) = ∑_{k=1}^{r} bk ψk(x). [sent-313, score-0.108]
39 As discussed in Section 2, the Karhunen-Loève expansion can also be approximated by a finite-dimensional expansion over the training data set; that is, ζ(x) = ∑_{i=1}^{n} βi K(x, xi) with β = (β1, . . . , βn)′. [sent-326, score-0.146]
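To illustrate the GP connection numerically, a small toy sketch under my own assumptions (not the paper's construction): if β ∼ N(0, K^{-1}) up to scaling constants, then the vector of expansion values Kβ has covariance K K^{-1} K = K, i.e., it is a draw from a zero-mean GP evaluated at the training inputs.

```python
import numpy as np

def sample_gp_via_expansion(K, rng=np.random.default_rng(), jitter=1e-8):
    """Draw beta ~ N(0, K^{-1}) and return zeta = K beta, whose covariance equals K."""
    n = K.shape[0]
    K_reg = K + jitter * np.eye(n)                 # jitter for numerical stability
    L = np.linalg.cholesky(np.linalg.inv(K_reg))   # L L' = K^{-1}, so beta = L z has cov K^{-1}
    beta = L @ rng.standard_normal(n)
    return K_reg @ beta                            # function values at the training inputs
```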
40 If the kernel function is stationary, then v is near zero and the posterior predictive mean of s∗ reverts to u when x∗ is far from points in the set I . [sent-367, score-0.116]
41 That is, the prediction is based on the average over T active sets. [sent-370, score-0.114]
42 In the following experiments the average is taken over 1,000 active sets. [sent-371, score-0.114]
43 It is again worth noting that the reversible jump MCMC algorithm devised in this paper deals with parameter estimation and posterior prediction jointly in a single paradigm. [sent-372, score-0.168]
44 Moreover, the reversible jump methodology is a sequential approach to model selection and posterior prediction. [sent-373, score-0.201]
45 The main computational burden of the MCMC algorithm comes from the sampling procedure for parameter estimation, and the MCMC algorithm does not require extra computational complexity for the selection of active vectors and the prediction of responses. [sent-374, score-0.143]
46 All of the methods that we implement are based on a Gaussian RBF kernel with a single width parameter; that is, K(xi, xj) = exp(−‖xi − xj‖²/θ²). [sent-449, score-0.107]
47 This implies that the Bayesian model averaging component of our Bayesian methods uses 1,000 (T = (10,000 − 5,000)/5) active sets for testing. [sent-468, score-0.14]
48 [Table fragment: test error rates err (±std) for BSVM on the BCI data set.] [sent-477, score-0.205]
49 In the following experiments, we attempt to analyze the performance of the methods with respect to different values of the training size n and the maximum number kmax of active vectors. [sent-622, score-0.51]
50 Tables 4 and 5 show the experimental results when changing the training size n and fixing the maximum number of active vectors to kmax = 200. [sent-624, score-0.539]
51 Table 6 shows the experimental results for our FBGKM and SGPC methods with respect to different values of the maximum number kmax of active vectors and for a fixed training size of n = 800. [sent-628, score-0.539]
52 The performance of these two methods is roughly similar for each setting of kmax; that is, they are insensitive to kmax. [sent-629, score-0.746]
53 However, their computational costs tend to slightly increase as the maximum number kmax of active vectors increases. [sent-630, score-0.516]
54 Additionally, in order to study the MCMC mixing performance of our FBGKM and SGPC methods we report the numbers of active vectors over different data sets. [sent-631, score-0.143]
55 In particular, Figure 2 [sent-632, score-0.205]
56 Table 4: Experimental results for the five methods corresponding to different training sizes n on the NewsGroups data set with kmax = 200: err, the test error rates (%); std, the corresponding standard deviation. [sent-692, score-0.396]
57 Table 5: The computational times (s) for the five methods corresponding to different training sizes n on the NewsGroups data set with kmax = 200. [sent-723, score-0.396]
58 [Table 6 header fragment: kmax = 300, 400, 500, 600, 700; FBGKM err (±std) and time.] [sent-724, score-2.586]
59 Table 6: Experimental results for FBGKM and SGPC corresponding to different maximum numbers kmax of active vectors on the NewsGroups data set with n = 800: err, the test error rates (%); std, the corresponding standard deviation; time, the corresponding computational time (s). [sent-754, score-0.516]
60 depicts the output of the numbers nγ of active vectors corresponding to the first 6000 sweeps in the MCMC inference procedure on BCI, Digit1, Letters {(A vs. [sent-755, score-0.171]
[Figure 2 residue: panels (a)-(h) plot the number of active vectors nγ against the MCMC iteration for FBGKM and SGPC with kmax = 100 and kmax = 200.]
67 Figure 2: MCMC output for the numbers nγ of active vectors of our FBGKM and SGPC methods on the four data sets: (a/b) BCI; (c/d) Digit1; (e/f) Letters (A vs. [sent-774, score-0.889]
68 In probit-type models, since the posterior distribution of each si is truncated normal, we are able to update the si using the Gibbs sampler. [sent-776, score-0.168]
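A minimal sketch of the truncated-normal Gibbs update mentioned here for binary probit-type models (the conditional means m_i = u + k_i'β are assumed to be given; this is not the paper's code):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_update_s(m, y, rng=np.random.default_rng()):
    """Sample latent s_i ~ N(m_i, 1) truncated to (0, inf) if y_i = 1 and (-inf, 0] if y_i = 0."""
    m = np.asarray(m, dtype=float)
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    a, b = lower - m, upper - m        # truncnorm expects bounds standardized by loc/scale
    return truncnorm.rvs(a, b, loc=m, scale=1.0, random_state=rng)
```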
69 Table 7 describes the distributions of active vectors appearing after the burn-in (in the last 5,000 sweeps). [sent-782, score-0.143]
70 As we can see, the number nγ of active vectors jumps within a small range for the different data sets, owing to the rapid mixing. [sent-783, score-0.143]
71 The maximum frequency of active vectors corresponding to the number nγ of active vectors is also given in Table 7. [sent-784, score-0.286]
72 Max—the maximum number of active vectors to appear; Min—the minimum number of active vectors to appear; Most—the number of active vectors with the maximum frequency and the corresponding frequency shown in brackets. [sent-790, score-0.429]
73 Evaluation 2. We further evaluate the performance of our sparse Bayesian kernel methods under kernel parameter learning, and compare FBGKM and SGPC with SGP+FIC and the full GP (FGP) (Rasmussen and Williams, 2006). [sent-792, score-0.199]
74 Since learning the kernel parameters for FGP+KL incurs a huge computational cost, we set the sizes of the training and test data to 1000 (i. [sent-799, score-0.105]
75 However, to provide an apples-to-apples comparison with our sparse Bayesian kernel methods, we employ MCMC inference for SGP+FIC+KL. [sent-804, score-0.117]
76 For the sparse methods compared here, we fix the size of the active set to 100, that is, kmax = 100. [sent-805, score-0.522]
77 [Table 8 header fragment: data sets Splice, Astrop, Mushrooms, Adult; SGP+FIC+KL err (±std).] [sent-817, score-0.205]
78 Table 8: Experimental results for the four Bayesian kernel methods on the four data sets with learned kernel parameters θ, kmax = 100, n = 1000, and m = 1000: err, the test error rates (%); std, the corresponding standard deviation. [sent-849, score-0.43]
79 Table 9: The computational times (s) of the four Bayesian kernel methods on the four data sets with learned kernel parameters θ, kmax = 100, n = 1000, and m = 1000. [sent-866, score-0.537]
80 In order to further evaluate the performance of our sparse Bayesian kernel methods on some larger data sets, we also conduct comparative experiments of FBGKM, SGPC, and SGP+FIC (Snelson and Ghahramani, 2006) on the Adult1 , Adult2 , Mushrooms, Splice, and Astroparticle data sets. [sent-867, score-0.14]
81 For these sparse methods, we fix the size of the active set to kmax = 200. [sent-869, score-0.522]
82 Furthermore, it is still difficult for SGP+FIC+MCMC and SGP+FIC+EP to find the optimal solution for sparse approximation of the full Gaussian process, due to the sensitivity of their performance to the initial active set. [sent-874, score-0.149]
83 Table 10: Experimental results for the five methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200: err, the test error rates (%); std, the corresponding standard deviation. [sent-923, score-0.373]
84 Table 11 and Figure 4 report the average computational times of the compared sparse Bayesian kernel methods on the five data sets, with Figure 4 depicting the computational times on a logarithmic scale. [sent-924, score-0.117]
85 Table 11: The computational times (s) of the four sparse Bayesian kernel methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200. [sent-946, score-0.49]
86 In addition, we conduct a quantitative assessment of convergence of the MCMC algorithms for the three sparse Bayesian kernel methods. [sent-947, score-0.14]
87 From Table 12, we can see that all MCMC algorithms in the three sparse Bayesian kernel methods can achieve convergence on the five data sets after the first 5000 sweeps. [sent-952, score-0.117]
88 Table 12: Monitoring convergence of the MCMC algorithms for the three sparse Bayesian kernel methods on the Splice, Astroparticle, Mushrooms, Adult1, and Adult2 data sets with kmax = 200: time, the computational time (s) for convergence; burn-in, the length of chain required for convergence. [sent-968, score-0.49]
89 In addition, the maximum number kmax of active vectors is set according to Table 1. [sent-976, score-0.516]
90 [Table fragment: test error rates (err) on the Ringnorm, Thyroid, Twonorm, and Waveform data sets.] [sent-981, score-0.205]
91 In particular, given a kernel function, we can estimate its parameters. [sent-1023, score-0.164]
92 With βl = (βl1, . . . , βln)′ and gl ≥ 0, we have f(x) = u + ∑_{l=1}^{q} gl ∑_{i=1}^{n} Kl(x, xi) βli. [sent-1038, score-0.115]
93 Now we assign βl ∼ Nn(0, σ²(K^(l))^{-1}) and gl ∼ ρδ0(gl) + (1 − ρ)Ga(gl | ag/2, bg/2), where K^(l) = Ψl Ψl′ (n×n) is the lth kernel matrix, δ0(·) is a point-mass at zero, and the user-specified parameter ρ ∈ (0, 1) controls how many of the gl are nonzero. [sent-1039, score-0.215]
94 Note that kernel parameter learning and multiple kernel learning can be incorporated together. [sent-1041, score-0.164]
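A sketch of the combined multiple kernel predictor f(x) = u + ∑_l gl ∑_i Kl(x, xi) βli, with placeholder names and only the nonzero gl contributing (my own illustration, not the paper's code):

```python
import numpy as np

def mkl_predict(x, X_train, betas, gs, kernels, u):
    """betas[l][i] and gs[l] are the per-kernel coefficients and weights; gs[l] = 0 drops kernel l."""
    f = u
    for g_l, beta_l, K_l in zip(gs, betas, kernels):
        if g_l == 0.0:
            continue                    # the point-mass prior sets some g_l exactly to zero
        f += g_l * sum(b_li * K_l(x, x_i) for b_li, x_i in zip(beta_l, X_train))
    return f
```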
95 As in the binary problem, we also introduce a binary n-vector γ with γi = 1 if xi is an active vector and γi = 0 otherwise. [sent-1074, score-0.278]
96 Thus, the posterior distribution of each sij is truncated normal, [sij | γ, uj, βj, yij] ∼ N(uj + ki′βj, 1), subject to sij > sil for all l ≠ j if yij = 1. [sent-1096, score-0.112]
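A hedged sketch of this multi-class truncated-normal update: for the observed class c of example i, s_ic must exceed the other latent utilities, and each s_ij with j ≠ c must stay below s_ic. This is one common Gibbs-style scheme for multinomial probit; the paper's exact update may differ.

```python
import numpy as np
from scipy.stats import truncnorm

def update_s_row(m, c, rng=np.random.default_rng()):
    """One sweep over the latent utilities s_i1..s_iJ of a single example,
    given their conditional means m (length J) and the observed class index c."""
    J = len(m)
    s = np.array(m, dtype=float)
    for j in range(J):
        if j == c:
            lo, hi = np.max(np.delete(s, c)), np.inf    # s_ic > s_il for all l != c
        else:
            lo, hi = -np.inf, s[c]                      # s_ij < s_ic
        a, b = lo - m[j], hi - m[j]                     # standardized bounds (scale = 1)
        s[j] = truncnorm.rvs(a, b, loc=m[j], scale=1.0, random_state=rng)
    return s
```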
97 Conclusion. In this paper we have discussed Bayesian generalized kernel mixed models, including Bayesian generalized kernel models and Gaussian processes for classification. [sent-1102, score-0.164]
98 In particular, we have proposed fully Bayesian kernel methods based on the Silverman g-prior and a Bayesian model averaging method. [sent-1103, score-0.108]
99 Because of the connection between kernel methods and Gaussian processes, the MCMC algorithm can be immediately applied to sparse Gaussian processes. [sent-1105, score-0.117]
100 The Karhunen-Loève expansion of ζ(x) is then given by ζ(x) = ∑_{k=1}^{r} bk ψk(x), where the bk are random variables given by bk = ∫ ζ(x) ψk(x) dx. [sent-1125, score-0.315]
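As a worked illustration of this expansion (an empirical, Nyström-style approximation under my own assumptions, not the appendix's derivation), the eigenfunctions ψk and coefficients bk can be approximated from the kernel matrix evaluated at the training inputs, with integrals replaced by sample averages:

```python
import numpy as np

def empirical_kl_expansion(K, zeta_values):
    """Approximate the Karhunen-Loeve coefficients b_k from the eigendecomposition
    of the kernel matrix K at n training inputs; zeta_values holds zeta(x_i)."""
    n = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K / n)        # empirical eigenvalues/eigenvectors
    psi = np.sqrt(n) * eigvecs                      # psi_k(x_i) ~ sqrt(n) * eigvec[i, k]
    b = psi.T @ zeta_values / n                     # b_k ~ (1/n) sum_i zeta(x_i) psi_k(x_i)
    recon = psi @ b                                 # reconstruction of zeta at the inputs
    return b, recon
```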
wordName wordTfidf (topN-words)
[('fbgkm', 0.399), ('kmax', 0.373), ('sgpc', 0.364), ('sgp', 0.243), ('mcmc', 0.21), ('err', 0.205), ('fic', 0.186), ('std', 0.176), ('bayesian', 0.148), ('silverman', 0.146), ('mushrooms', 0.139), ('ordan', 0.121), ('splice', 0.116), ('active', 0.114), ('gkms', 0.095), ('eneralized', 0.093), ('ixed', 0.093), ('astrop', 0.087), ('fgp', 0.087), ('dai', 0.085), ('hang', 0.085), ('bk', 0.083), ('kernel', 0.082), ('reversible', 0.081), ('nn', 0.081), ('astroparticle', 0.078), ('bgkm', 0.078), ('mallick', 0.078), ('csvm', 0.074), ('probit', 0.068), ('kl', 0.068), ('bsvm', 0.067), ('bci', 0.066), ('newsgroups', 0.064), ('ernel', 0.061), ('thyroid', 0.061), ('zellner', 0.061), ('usps', 0.055), ('odels', 0.054), ('jump', 0.053), ('ringnorm', 0.052), ('twonorm', 0.052), ('ivm', 0.052), ('adult', 0.048), ('gl', 0.045), ('diggle', 0.043), ('gkm', 0.043), ('kohn', 0.043), ('ep', 0.043), ('bg', 0.043), ('si', 0.039), ('augmentation', 0.037), ('glms', 0.037), ('sparse', 0.035), ('regression', 0.035), ('nott', 0.035), ('ve', 0.034), ('posterior', 0.034), ('birth', 0.033), ('death', 0.033), ('methodology', 0.033), ('ag', 0.032), ('expansion', 0.032), ('letters', 0.031), ('snelson', 0.03), ('acceptance', 0.03), ('inverses', 0.03), ('vectors', 0.029), ('bernardo', 0.028), ('holmes', 0.028), ('sweeps', 0.028), ('gaussian', 0.027), ('vec', 0.027), ('waveform', 0.027), ('smith', 0.026), ('classi', 0.026), ('gpc', 0.026), ('gpcs', 0.026), ('guang', 0.026), ('maclehose', 0.026), ('zhihua', 0.026), ('averaging', 0.026), ('xi', 0.025), ('swap', 0.025), ('gp', 0.025), ('normal', 0.025), ('rasmussen', 0.025), ('girolami', 0.024), ('import', 0.024), ('ga', 0.024), ('williams', 0.024), ('liang', 0.024), ('svm', 0.024), ('training', 0.023), ('conduct', 0.023), ('zhang', 0.023), ('pillai', 0.023), ('mcculloch', 0.022), ('regressors', 0.022), ('singular', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
2 0.07995382 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
3 0.076851748 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
4 0.05464429 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
5 0.051236976 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
6 0.045041721 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
7 0.04451834 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
8 0.040797915 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
9 0.037959851 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
10 0.036898836 61 jmlr-2011-Logistic Stick-Breaking Process
11 0.036008593 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
12 0.035652656 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
13 0.035649478 66 jmlr-2011-Multiple Kernel Learning Algorithms
14 0.035047598 105 jmlr-2011-lp-Norm Multiple Kernel Learning
15 0.032863867 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.032557704 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
17 0.031179598 58 jmlr-2011-Learning from Partial Labels
18 0.030707506 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
19 0.029319255 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
20 0.029097324 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
topicId topicWeight
[(0, 0.171), (1, -0.091), (2, -0.06), (3, -0.019), (4, 0.019), (5, -0.075), (6, -0.044), (7, 0.042), (8, -0.011), (9, -0.089), (10, 0.103), (11, -0.012), (12, 0.122), (13, 0.062), (14, 0.055), (15, 0.023), (16, -0.005), (17, -0.053), (18, -0.021), (19, 0.093), (20, -0.059), (21, 0.037), (22, 0.073), (23, 0.038), (24, -0.034), (25, 0.114), (26, -0.11), (27, -0.121), (28, -0.03), (29, 0.066), (30, 0.152), (31, 0.139), (32, -0.053), (33, -0.098), (34, -0.243), (35, 0.036), (36, -0.088), (37, -0.171), (38, -0.014), (39, -0.066), (40, -0.203), (41, -0.002), (42, 0.196), (43, -0.039), (44, -0.137), (45, -0.018), (46, 0.076), (47, 0.166), (48, -0.266), (49, 0.08)]
simIndex simValue paperId paperTitle
same-paper 1 0.91658795 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
2 0.46591246 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
3 0.41180441 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
4 0.3373192 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
5 0.27474892 61 jmlr-2011-Logistic Stick-Breaking Process
Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson
Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation
6 0.27232438 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
7 0.26726928 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
8 0.24511941 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
9 0.2357059 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
10 0.21997096 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.21428171 65 jmlr-2011-Models of Cooperative Teaching and Learning
12 0.2139706 66 jmlr-2011-Multiple Kernel Learning Algorithms
13 0.20861365 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
14 0.20786667 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
15 0.20545825 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
16 0.20334896 48 jmlr-2011-Kernel Analysis of Deep Networks
17 0.1935675 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
18 0.19298303 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
19 0.18960851 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.1849245 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
topicId topicWeight
[(2, 0.426), (4, 0.043), (6, 0.012), (9, 0.021), (10, 0.033), (11, 0.015), (24, 0.047), (31, 0.083), (32, 0.028), (41, 0.028), (60, 0.013), (65, 0.012), (67, 0.023), (70, 0.018), (73, 0.046), (78, 0.05), (90, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.67050308 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
2 0.28709704 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
3 0.28049505 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
4 0.2796478 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
Author: Mauricio A. Álvarez, Neil D. Lawrence
Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes
5 0.2778334 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
6 0.27625299 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
7 0.27606642 12 jmlr-2011-Bayesian Co-Training
8 0.27591872 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.27403942 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
10 0.27402341 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
11 0.27367592 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.27347314 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.27200329 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.27185273 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
15 0.27006522 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
16 0.26996389 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
17 0.26992893 16 jmlr-2011-Clustering Algorithms for Chains
18 0.26740709 35 jmlr-2011-Forest Density Estimation
19 0.26649335 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.2659032 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments