jmlr jmlr2009 jmlr2009-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. [sent-8, score-0.459]
2 Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. [sent-9, score-0.36]
3 Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration 1. [sent-11, score-0.455]
4 Intractability for the mixture model arises not because integration is intractable, but because the number of terms in the marginal likelihood is K^N. [sent-34, score-0.364]
5 We derive corrections both for the marginal likelihood and the predictive distribution in the form of an expansion in terms of “clusters” of likelihood terms of the posterior. [sent-41, score-0.719]
6 In contrast, the lowest non-trivial correction to the marginal likelihood is of second order, with the number of terms growing as O(N^2). [sent-49, score-0.34]
7 Corrections to the marginal likelihood can be tractably computed, for example, for models where the likelihood is a mixture distribution. [sent-50, score-0.444]
8 When the true distribution is multi-modal, EP will in most cases provide a local (single) mode approximation, with lower-order corrections also being local. [sent-53, score-0.31]
9 While the predictive distribution is invariant to this symmetry, the log marginal likelihood usually has to be further corrected by a factor of O(log K!). [sent-57, score-0.607]
10 As a tool for improving inference accuracy, we show in a practical example that the first nontrivial correction term to the marginal likelihood approximation can make a clear difference in predicting which K maximizes the marginal likelihood, compared to when the correction was not used. [sent-62, score-0.647]
11 In this paper EC or EP and its resulting corrections are compared with variational Bayes (VB), Minka’s α-divergence message passing scheme, and a gold standard benchmark of parallel tempering (PT) and thermodynamic integration (TI). [sent-63, score-0.59]
12 Contrary to more standard Monte Carlo methods (for example Metropolis-Hastings or Gibbs sampling) it can also provide estimates of the marginal likelihood by TI, which interpolates the expected value of the log likelihood between the prior and the posterior. [sent-65, score-0.582]
13 To increase the stability of estimates obtained by TI, we give a novel generalization of PT, which allows interpolation of the value of the log likelihood between any choice of distribution and the posterior. [sent-66, score-0.357]
14 Section 3 shows an example of corrections for a marginal distribution in a Gaussian Process classification model. [sent-71, score-0.378]
15 2 The message passing scheme of expectation propagation gives rise to an identical marginal likelihood approximation, and the following interpretation sheds light on both methods by looking at them as a set of self-consistent approximations to marginal or predictive distributions. [sent-82, score-0.659]
16 The outline presented here allows for further perturbation corrections to be derived. [sent-83, score-0.334]
17 By introducing a similar approximation to the “cavity” predictive distributions p(x_n|D_{\n}) = ∫ dθ p(x_n|θ) p(θ|D_{\n}) ≈ ∫ dθ p(x_n|θ) q_{\n}(θ) for each x_n in the training set, a computationally efficient approximation can be derived. [sent-88, score-0.385]
18 We shall now rather require q(θ) to share key properties, namely lower order statistics, with each of the distributions qn (θ) ∝ p(xn |θ)q\n (θ); this is explored in the next section. [sent-89, score-0.309]
19 1 EC and EP with Exponential Families EC defines a tractable approximation q(θ) through expectation consistency with each qn (θ). [sent-91, score-0.421]
20 By adding the condition Λ = ∑_n Λ_n, we allow each likelihood factor p(x_n|θ) of the posterior in (1) to correspond to a simpler factor given by Λ_{\n} = Λ − Λ_n in (3): the Λ_n's therefore parameterize the likelihood term contributions to the approximation. [sent-98, score-0.346]
21 We therefore require consistency of the generalized moments, that is, ⟨φ(θ)⟩_q = ⟨φ(θ)⟩_{q_n}, n = 1, . . . , N. [sent-104, score-0.309]
22 VB finds its approximation q(θ) by lower-bounding the log marginal likelihood with Jensen’s inequality (Jordan et al. [sent-116, score-0.474]
23 By writing log Z_VB = −KL(q(θ) ∥ p(θ|D)) + log p(D), the bound can be made as tight as possible by adjusting q(θ); this is achieved by minimizing the KL-divergence between q(θ) and p(θ|D). [sent-118, score-0.422]
24 7: until expectation consistency ⟨φ(θ)⟩_{q_n(θ;Λ_n)} = ⟨φ(θ)⟩_{q(θ;Λ)} = µ holds for n = 1, . . . , N. [sent-131, score-0.35]
25 EP defines a specific message passing algorithm which iteratively refines each Λ_n by minimising local Kullback-Leibler divergences KL(q_n(θ) ∥ q(θ)); in other words it iteratively performs the required moment matching ⟨φ(θ)⟩_q = ⟨φ(θ)⟩_{q_n}. [sent-136, score-0.368]
26 If EP converges we will have expectation consistency ⟨φ(θ)⟩_{q_n(θ)} = ⟨φ(θ)⟩_{q(θ)} = µ because of the moment matching in lines 4 and 5 of Algorithm 1. [sent-138, score-0.35]
27 In order to expand the EC approximation we use (4) to express each likelihood term by the approximating densities as p(x_n|θ) = [Z(Λ − Λ_n, 1_n)/Z(Λ, 0)] [q_n(θ)/q(θ)] exp(Λ_n^T φ(θ)), to find that p(θ) ∏_n p(x_n|θ) = Z_EC q(θ) ∏_n q_n(θ)/q(θ). (6) [sent-148, score-0.894]
28 If we define ε_n(θ) = [q_n(θ) − q(θ)]/q(θ), such that q_n(θ)/q(θ) = 1 + ε_n(θ), we should expect ε_n(θ) to be on average small over a suitable measure q(θ) when the EC approximation works well. [sent-149, score-0.664]
29 Bearing this definition in mind, the exact posterior and the exact marginal likelihood can be written as p(θ|D) = (1/R) q(θ) ∏_n (1 + ε_n(θ)), with R = ∫ dθ q(θ) ∏_n (1 + ε_n(θ)) and Z = Z_EC R. (7) [sent-150, score-0.339]
30 We expect that an expansion of the posterior and Z in terms of ε_n(θ) truncated at low orders might give the dominant corrections to EC. [sent-151, score-0.361]
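As a rough illustration of how the correction factor R could be estimated when samples from q(θ) and pointwise evaluations of each q_n(θ) are available, here is a minimal Monte Carlo sketch; the Gaussian densities in the toy check are stand-ins, not the paper's mixture model.

```python
import numpy as np

def correction_R(theta_samples, q_pdf, qn_pdfs):
    """Monte Carlo estimate of R = < prod_n (1 + eps_n(theta)) >_q,
    where eps_n = q_n/q - 1, using samples theta ~ q(theta)."""
    q_vals = q_pdf(theta_samples)                                       # q(theta), shape (S,)
    ratios = np.stack([qn(theta_samples) / q_vals for qn in qn_pdfs])   # q_n/q, shape (N, S)
    return np.mean(np.prod(ratios, axis=0))                             # < prod_n q_n/q >_q

# Toy check: if every q_n equals q, the exact answer is R = 1.
rng = np.random.default_rng(0)
samples = rng.normal(size=5000)
gauss = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
print(correction_R(samples, gauss, [gauss, gauss]))   # close to 1.0
```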
31 The acceptance probability of the move is p(accept) = min(1, p({θ_i}_new)/p({θ_i})), and the acceptance ratio simplifies to p({θ_i}_new)/p({θ_i}) = exp[(β_i − β_j)(log p(D|θ_j) − log p(D|θ_i))]. [sent-156, score-0.525]
32 From (18), the acceptance probability depends on the difference between log p(D |θi ) and log p(D |θi+1 ), and for some swaps to be accepted this difference should not be “too big”; there should be an overlap of some of the log likelihood evaluations of adjacent chains, as illustrated in Figure 3. [sent-160, score-0.778]
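A minimal sketch of the replica-swap move described above, assuming the log likelihood of each chain's current state has already been computed; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def try_swap(betas, states, logliks, rng):
    """Propose swapping the states of two adjacent tempered chains i and i+1 and
    accept with probability min(1, exp((beta_i - beta_j)(loglik_j - loglik_i)))."""
    i = int(rng.integers(len(betas) - 1))
    j = i + 1
    log_accept = (betas[i] - betas[j]) * (logliks[j] - logliks[i])
    if np.log(rng.uniform()) < log_accept:
        states[i], states[j] = states[j], states[i]
        logliks[i], logliks[j] = logliks[j], logliks[i]
        return True
    return False

# Toy usage with placeholder states and log-likelihood values.
rng = np.random.default_rng(1)
betas = np.linspace(0.0, 1.0, 6)
states = [None] * len(betas)                 # would hold theta for each chain
logliks = list(-100.0 + 5.0 * betas)         # log p(D | theta_i) placeholders
print(try_swap(betas, states, logliks, rng))
```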
33 For a simulation at inverse temperature β, define the mean evaluation of the log likelihood as ⟨log p(D|θ)⟩_β = ∫ dθ log p(D|θ) p(θ|D, β). [sent-161, score-0.775]
34 2 Thermodynamic Integration The samples from parallel tempering can be used for model comparison (Gregory, 2005, Skilling, 1998), as the marginal likelihood can be obtained from tempering. [sent-166, score-0.33]
35 Firstly, notice that the integral ∫_0^1 d log Z(β) = ∫_0^1 dβ (d/dβ) log Z(β) = log Z(1) − log Z(0) = log Z(1) = log p(D) [sent-167, score-1.477]
36 For interest, the log marginal likelihood log p(D ) is indicated with a ×. [sent-177, score-0.639]
37 is equal to the log marginal likelihood, as β = 0 gives the prior, which integrates to one. [sent-180, score-0.316]
38 By taking the derivative of the log normalizer (log partition function), we see that it evaluates as an average over the posterior: (d/dβ) log Z(β) = (1/Z(β)) ∫ dθ log p(D|θ) p(D|θ)^β p(θ) = ⟨log p(D|θ)⟩_β. [sent-182, score-1.039]
39 The log marginal likelihood equals log p(D) = ∫_0^1 dβ ⟨log p(D|θ)⟩_β (19) and can be numerically estimated from the Markov chain samples. [sent-183, score-0.85]
40 If {θ_i^(t)} represents the samples for tempering parameter β_i, then the expectation is approximated with ⟨log p(D|θ)⟩_{β_i} ≈ (1/T) ∑_t log p(D|θ_i^(t)). [sent-184, score-0.545]
41 Figure 4: The log likelihood averages ⟨log p(D|θ)⟩_β are estimated from each of the MCMC simulations at temperatures {β_i}, and interpolated, so that Equation (19)'s integral can be evaluated numerically. [sent-192, score-0.598]
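The numerical evaluation of Equation (19) can be sketched as follows: average the stored log likelihood values within each tempered chain and integrate the resulting curve over β, for example with the trapezoidal rule. This is illustrative code under those assumptions, not the authors' implementation.

```python
import numpy as np

def log_marginal_ti(betas, loglik_per_chain):
    """Thermodynamic integration estimate of log p(D), Equation (19).
    betas: increasing inverse temperatures covering [0, 1];
    loglik_per_chain[i]: array of log p(D|theta) evaluations from the chain at betas[i]."""
    means = np.array([np.mean(vals) for vals in loglik_per_chain])   # <log p(D|theta)>_beta
    # Trapezoidal rule over the interpolated averages.
    return np.sum(np.diff(betas) * (means[1:] + means[:-1]) / 2.0)

# Toy usage with made-up per-temperature log likelihood draws.
rng = np.random.default_rng(2)
betas = np.linspace(0.0, 1.0, 11)
chains = [rng.normal(-120.0 + 20.0 * b, 1.0, size=500) for b in betas]
print(log_marginal_ti(betas, chains))
```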
42 3 A Practical Generalization of Parallel Tempering The success of the interpolation obtaining ⟨log p(D|θ)⟩_β, illustrated in Figure 4, is dependent on the slope d⟨log p(D|θ)⟩_β/dβ = d²log Z(β)/dβ² = σ²_β at β ≈ 0. [sent-197, score-0.633]
43 (20) The log marginal likelihood can, as before, be determined with log p(D) = ∫_0^1 dβ ⟨log p(D|θ) + log[p(θ)/q(θ)]⟩_β. [sent-203, score-1.061]
44 It is evident that setting q(θ) = p(θ|D) gives an integral over a constant function, log p(D) = ∫_0^1 dβ ⟨log p(D)⟩_β. [sent-204, score-0.422]
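One way to realize this generalization in code is to sample from a target whose unnormalized log density interpolates between an arbitrary (normalized) distribution q(θ) at β = 0 and the posterior at β = 1; a small sketch, with the three log-density callables left as placeholders:

```python
def tempered_logpdf(theta, beta, log_lik, log_prior, log_q):
    """Unnormalized log density proportional to [p(D|theta) p(theta) / q(theta)]^beta q(theta).
    beta = 0 recovers q(theta); beta = 1 recovers the posterior p(theta|D)."""
    return beta * (log_lik(theta) + log_prior(theta) - log_q(theta)) + log_q(theta)

# Under this family, thermodynamic integration averages
# log p(D|theta) + log p(theta)/q(theta) at each beta, as in the expression above.
```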
45 Our aim in this section is to use simple toy posteriors to facilitate discussion on the behavior of q under various scenarios and discuss how that might affect the estimation of the marginal likelihood and predictive distribution. [sent-218, score-0.334]
46 The predictive density p(x|D) is the average of the likelihood over the bi-modal posterior (black contours), while the approximate predictive density is the average of the likelihood function over the uni-modal EC approximation (overlaid in blue). [sent-250, score-0.692]
47 When the EC approximation puts all its mass on one mode, and the modes are well separated, the two predictive densities are therefore similar. [sent-252, score-0.321]
48 We also overlay the first order correction to q(θ), given from (12) by p(θ|D ) ≈ ∑n qn (θ) − (N − 1)q(θ). [sent-259, score-0.432]
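Evaluating this first-order corrected posterior only requires pointwise evaluations of q and of the q_n's; a minimal sketch with hypothetical density callables:

```python
import numpy as np

def corrected_posterior(theta, q_pdf, qn_pdfs):
    """First-order correction p(theta|D) ~ sum_n q_n(theta) - (N - 1) q(theta).
    The result integrates to one, but is not guaranteed to be non-negative everywhere."""
    qn_sum = np.sum([qn(theta) for qn in qn_pdfs], axis=0)
    return qn_sum - (len(qn_pdfs) - 1) * q_pdf(theta)
```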
49 We will see in the following sections that even though q is a rather crude approximation to the posterior, the resulting predictive distribution is fairly precise. [sent-266, score-0.319]
50 The corrections are on average (blue line) large for broadly overlapping posterior modes, as EP does not necessarily lock onto one of them. [sent-276, score-0.39]
51 2 Corrections The illustrations in Table 1 suggest that the lowest nontrivial corrections can provide insight into the quality of approximation, as we expect corrections to be small for good approximations. [sent-287, score-0.478]
52 5 provides a model with overlapping mixture components. Figure 7: The growth of log Z_EC for the random data sets D_N used to obtain Figure 6(b). [sent-296, score-0.355]
53 5, and the second order corrections to the log marginal likelihood approximations were computed where possible (i. [sent-300, score-0.701]
54 equivalent modes); small corrections immediately tell us that the solution is close to exact, apart from a log K! correction here. [sent-307, score-0.45]
55 We also observe that the corrections do not scale with N, whereas the free energy log ZEC does, as shown in Figure 7. [sent-309, score-0.45]
56 When and whether the expected correction ⟨log R⟩_{D_N} = c_R(N) → 0 as N → ∞ (and hence EP becomes exact) is an open question. [sent-316, score-0.334]
57 Figure 8(d) shows a second order correction to the log marginal likelihoods. [sent-326, score-0.439]
58 (d) The log marginal likelihood estimates. [sent-390, score-0.428]
59 (c) K = 2, N = 7. Figure 8: Predictive densities p(x|D, M_K) given by VB, EC, and a perturbation correction (EC+R), with accompanying log marginal likelihood estimates and MCMC “truth” baselines. [sent-394, score-0.69]
60 A lower bound to the log marginal likelihood is provided by VB. [sent-400, score-0.428]
61 4 A Practical Comparison In this section we draw a comparison between the approximate log marginal likelihoods and predictive distributions given by VB, EP, and various corrections, and use estimations given by PT and TI as a benchmark. [sent-403, score-0.433]
62 From the results that follow, we observe that the growth of log Z, as a function of model size, gives a characteristic “Ockham hill” (defined in more detail later in this section), where the “peak” of the hill indicates the model with highest approximate log p(D ). [sent-406, score-0.447]
63 1’s discussion, the discrepancy between a log Z estimate and the true log p(D ) grows as the model size is increased. [sent-409, score-0.422]
64 We will show that a principled algorithm initialization can circumvent many spurious local minima in the log marginal likelihood estimate. [sent-411, score-0.465]
65 1 The Approximate Log Marginal Likelihood Ockham hills are useful for visualizing log marginal likelihood estimates for a set of plausible models with increasing explanatory power, for example, mixture models with increasing K. [sent-416, score-0.575]
66 For mixture models this downward trend is even slower when the true log marginal is considered; this is mainly due to the number of modes in the true posterior increasing with the number of components, with an approximation possibly only capturing one of them. [sent-420, score-0.713]
67 In the case of VB, the log Z approximation provides a lower bound to the marginal likelihood p(D|M), and this quantity is often used for model selection (Beal and Ghahramani, 2003; Bishop and Svensén, 2003; Corduneanu and Bishop, 2001). [sent-421, score-0.474]
68 ) correction on any of the approximate solutions can be seen by comparing them against the dashed-line plot of log p(D |MK ) − log K! [sent-429, score-0.545]
69 Figure 9 shows such hills for the marginal likelihood approximations for different data sets for VB, α = 1 message passing, EP, and a second-order perturbation correction. [sent-433, score-0.437]
70 For each of the models MK , with K mixture components, the figures show twenty approximations for each method, with the colour intensity of each plot corresponding to the frequency of reaching different approximations for log Z. [sent-441, score-0.394]
71 We base our factor initializations around a scaled version of the solution obtained by the VB expectation maximisation algorithm, exp(Λ_n^T φ(θ)) ∝ exp(∫ dz_n q(z_n) log p(x_n, z_n|θ)), which was seeded with a data clustering based on the k-means algorithm. [sent-450, score-0.39]
72 To illustrate how much the approximate predictive distribution differs from the true predictive distribution, the figures show p(x|D ) obtained from an average over ten thousand β = 1 samples from a parallel tempered Markov chain. [sent-461, score-0.336]
73 These corrections are useful not only in improving estimates like log marginal likelihood and predictive density approximations, but can also provide insight into the quality of an approximation in polynomial time. [sent-465, score-0.863]
74 For K = 3 under a narrower prior in Figure 10, we see three local maxima of the EC objective function in (5): the predictive distribution shown on the left coincided with log Z_EC = −243. [sent-475, score-0.468]
75 5, suggesting multicanonical sampling as a viable alternative, since it aims at sampling from a distribution that is flat in the log likelihood and will therefore not have this “bottleneck. [sent-495, score-0.357]
76 Bounds on the Marginal Likelihood It is interesting to compare the marginal likelihood approximation (5) with the one given by a q(θ) variational approximation. [sent-509, score-0.32]
77 Here one would use the fact that the relative entropy ⟨log[q(θ)/p(θ|D)]⟩_q ≥ 0 to approximate the free energy − log Z by the upper bound − log Z ≤ ⟨log[q(θ)/(p(θ) ∏_n p(x_n|θ))]⟩_q. [sent-510, score-0.844]
78 After inserting this expression into (5), taking logs and applying Jensen's inequality we arrive at − log Z_EC ≤ ⟨log[exp(∑_n Λ_n^T φ(θ))/∏_n p(x_n|θ)]⟩_q − log Z(Λ, 0) = ⟨log[q(θ)/(p(θ) ∏_n p(x_n|θ))]⟩_q, where in the last step we have used ∑_n Λ_n = Λ. [sent-512, score-0.881]
79 (b) The distributions q(f∗) and each of the q_n(f∗)'s (which are not Gaussian) used in (13) are respectively shown in dashed blue and (thick solid) orange. [sent-529, score-0.485]
80 This “tilted” predictive marginal in (22) has exactly the same form as q(f∗) in (21), except for its use of µ_{\n} and S_{\n}, and the nonlinear “weight” that is still a function of f∗, so that q_n(f∗) is ultimately non-Gaussian. [sent-537, score-0.531]
81 In Figure 13 the marginals q( f∗ ) and qn ( f∗ ) are shown, leading to the first-order correction originally shown in Figure 1. [sent-546, score-0.432]
82 For these distributions we have to 1) compute their sufficient statistics, 2) for message passing, solve the inverse problem: given the sufficient statistics we must solve for the parameters of the distribution, and 3) for the predictions with the mixture models, compute their predictive distribution. [sent-549, score-0.406]
83 1 Sufficient Statistics The sufficient statistic of the Dirichlet is ⟨log π_k⟩ = ∂ log Z_D(λ)/∂λ_k = ψ(λ_k) − ψ(∑_k λ_k), (25) where ψ is the digamma-function d log Γ(x)/dx. [sent-555, score-0.633]
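For concreteness, the expectation in (25) is easy to evaluate numerically with the digamma function; a minimal sketch:

```python
import numpy as np
from scipy.special import digamma

def dirichlet_mean_log(lam):
    """<log pi_k> under a Dirichlet with parameters lam, Equation (25):
    psi(lam_k) - psi(sum_k lam_k)."""
    lam = np.asarray(lam, dtype=float)
    return digamma(lam) - digamma(lam.sum())

print(dirichlet_mean_log([2.0, 3.0, 5.0]))
```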
84 2 Inverse In line 5 of Algorithms 1 and 2 we have to solve the inverse problem: given the statistics m_k = ⟨log π_k⟩, find the parameters λ. [sent-558, score-0.345]
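One way to solve this inverse problem is a fixed-point iteration in the spirit of Minka's Dirichlet estimation: repeatedly set λ_k = ψ⁻¹(m_k + ψ(∑_j λ_j)), with the inverse digamma computed by Newton's method. This is a sketch of one possible solver, not necessarily the one used by the authors.

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=20):
    """Invert psi(x) = y for x > 0 by Newton's method."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(iters):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def dirichlet_from_mean_log(m, iters=200):
    """Recover lam from m_k = <log pi_k> = psi(lam_k) - psi(sum_j lam_j)."""
    m = np.asarray(m, dtype=float)
    lam = np.ones_like(m)
    for _ in range(iters):
        lam = inv_digamma(m + digamma(lam.sum()))
    return lam

lam_true = np.array([2.0, 3.0, 5.0])
m = digamma(lam_true) - digamma(lam_true.sum())
print(dirichlet_from_mean_log(m))   # approximately [2, 3, 5]
```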
85 The predictive distribution for multinomial data and Dirichlet distributed posterior is straightforward to calculate using the result for the normalizer of the Dirichlet: p(x|λ) = ∫ dπ p(x|π) D(π; λ) = (∑_l x_l)! [sent-576, score-0.346]
86 By taking derivatives of log Z_NG with respect to the parameters {m, ν, a, b}: ν⟨µγ⟩ − νm⟨γ⟩ = 0, m⟨µγ⟩ − (1/2)⟨µ²γ⟩ − (1/2)m²⟨γ⟩ = −1/(2ν), ⟨log γ⟩ = ψ(a) − log b, −⟨γ⟩ = −a/b, we can solve for ⟨µγ⟩, ⟨µ²γ⟩, ⟨γ⟩ and ⟨log γ⟩. [sent-588, score-0.844]
87 We get closed form expressions for three of the parameters, m = ⟨µγ⟩/⟨γ⟩, ν = 1/(⟨µ²γ⟩ − m²⟨γ⟩), b = a/⟨γ⟩, and a should be found from ψ(a) − log a = ⟨log γ⟩ − log⟨γ⟩ by, for example, Newton's method. [sent-592, score-0.633]
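A sketch of that one-dimensional Newton solve, done in log a to keep the iterate positive (illustrative code, assuming c = ⟨log γ⟩ − log⟨γ⟩ has already been computed):

```python
import numpy as np
from scipy.special import digamma, polygamma

def solve_shape(c, a0=1.0, iters=50):
    """Solve psi(a) - log(a) = c for a > 0 by Newton's method on x = log(a).
    By Jensen's inequality c = <log gamma> - log<gamma> is always <= 0."""
    x = np.log(a0)
    for _ in range(iters):
        a = np.exp(x)
        f = digamma(a) - np.log(a) - c
        fprime = (polygamma(1, a) - 1.0 / a) * a   # d/dx of psi(e^x) - x
        x = x - f / fprime
    return np.exp(x)

# Round trip: pick a, form c, and recover a.
a_true = 3.7
c = digamma(a_true) - np.log(a_true)
print(solve_shape(c))   # approximately 3.7
```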
88 We will write the Wishart distribution over positive definite symmetric matrices in the same form as the Gamma distribution: W(Γ; a, B) ∝ exp[(a − (d+1)/2) log det Γ − tr(BΓ)], where the degrees of freedom 2a should be greater than d − 1 for the distribution to be normalizable. [sent-598, score-0.404]
89 The Normal-Wishart is given by NW(µ, Γ; m, ν, a, B) = [1/Z_NW(m, ν, a, B)] exp[(a − d/2) log det Γ + νm^T Γµ − (1/2)ν µ^T Γµ − tr((B + (ν/2)mm^T)Γ)], with Z_NW(m, ν, a, B) = π^{d(d−1)/4} (2π/ν)^{d/2} ∏_{l=1}^{d} Γ(a + (1−l)/2) e^{−a log det B}. [sent-599, score-0.635]
90 1 Sufficient Statistics The sufficient statistics follow from a straightforward generalization of the results from the Normal-Gamma model: ⟨Γ⟩ = aB^{−1}, ⟨Γµ⟩ = ⟨Γ⟩m, ⟨µ^T Γµ⟩ = d/ν + m^T⟨Γ⟩m, ⟨log det Γ⟩ = ∑_{l=1}^{d} ψ(a + (1−l)/2) − log det B. [sent-602, score-0.598]
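These expectations are straightforward to compute; a small sketch using the formulas above (hypothetical function name, requiring a > (d − 1)/2):

```python
import numpy as np
from scipy.special import digamma

def nw_sufficient_stats(m, nu, a, B):
    """Expected sufficient statistics of a Normal-Wishart with parameters (m, nu, a, B):
    <Gamma>, <Gamma mu>, <mu' Gamma mu> and <log det Gamma>."""
    d = len(m)
    E_Gamma = a * np.linalg.inv(B)
    E_Gamma_mu = E_Gamma @ m
    E_muGamma_mu = d / nu + m @ E_Gamma @ m
    E_logdet = np.sum(digamma(a + (1.0 - np.arange(1, d + 1)) / 2.0)) - np.linalg.slogdet(B)[1]
    return E_Gamma, E_Gamma_mu, E_muGamma_mu, E_logdet

print(nw_sufficient_stats(np.zeros(2), 2.0, 3.0, np.eye(2)))
```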
91 2 Inverse From the sufficient statistics we can get closed form expressions for the parameters m = ⟨Γ⟩^{−1}⟨Γµ⟩, ν = d/(⟨µ^T Γµ⟩ − m^T⟨Γ⟩m), B = a⟨Γ⟩^{−1}, whereas a should be found from ∑_{l=1}^{d} ψ(a + (1−l)/2) − d log a = ⟨log det Γ⟩ − log det⟨Γ⟩. [sent-605, score-0.809]
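The corresponding inverse mapping can be sketched as below: the closed forms for m, ν and B, plus a one-dimensional root find for a over the range where the distribution is normalizable. The bracketing upper bound is an arbitrary large value chosen for illustration.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def nw_from_stats(E_Gamma, E_Gamma_mu, E_muGamma_mu, E_logdet):
    """Recover (m, nu, a, B) from the Normal-Wishart sufficient statistics above."""
    d = E_Gamma.shape[0]
    m = np.linalg.solve(E_Gamma, E_Gamma_mu)
    nu = d / (E_muGamma_mu - m @ E_Gamma @ m)
    rhs = E_logdet - np.linalg.slogdet(E_Gamma)[1]
    l = np.arange(1, d + 1)
    f = lambda a: np.sum(digamma(a + (1.0 - l) / 2.0)) - d * np.log(a) - rhs
    a = brentq(f, (d - 1) / 2.0 + 1e-8, 1e8)
    B = a * np.linalg.inv(E_Gamma)
    return m, nu, a, B

# Round trip for d = 2 and (m, nu, a, B) = (0, 2, 3, I).
stats = (3.0 * np.eye(2), np.zeros(2), 1.0, digamma(3.0) + digamma(2.5))
print(nw_from_stats(*stats))
```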
92 The predictive distribution p(x|D) is again approximated by the form given in (14), with p(x|k) replaced by p(x|k) ≡ p(x|m_k, ν_k, a_k, B_k) = T(x; m_k, [2B_k(ν_k + 1)]/[(2a_k − d + 1)ν_k], 2a_k − d + 1). [sent-616, score-0.315]
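A sketch of evaluating this component predictive density, using the standard multivariate Student-t form with the location, scale matrix and degrees of freedom read off from the expression above (function names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_student_t(x, mu, scale, df):
    """Log density of a multivariate Student-t with location mu, scale matrix and df degrees of freedom."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(scale, diff)
    logdet = np.linalg.slogdet(scale)[1]
    return (gammaln((df + d) / 2.0) - gammaln(df / 2.0)
            - 0.5 * (d * np.log(df * np.pi) + logdet)
            - 0.5 * (df + d) * np.log1p(maha / df))

def log_predictive_component(x, m_k, nu_k, a_k, B_k):
    """p(x|k) above: Student-t with location m_k, scale 2 B_k (nu_k + 1) / ((2 a_k - d + 1) nu_k)
    and 2 a_k - d + 1 degrees of freedom."""
    d = len(m_k)
    df = 2.0 * a_k - d + 1.0
    scale = 2.0 * B_k * (nu_k + 1.0) / (df * nu_k)
    return log_student_t(x, m_k, scale, df)

print(np.exp(log_predictive_component(np.zeros(2), np.zeros(2), 2.0, 3.0, np.eye(2))))
```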
93 We can again use the cavity parameters to write q_n in terms of q_{\n}: q_n(θ) = [Z(Λ_{\n}, 0)/Z_n(Λ_{\n})] q_{\n}(θ) ∑_k π_k p(x_n|µ_k, Γ_k). [sent-621, score-0.704]
94 2 Expectations The statistics of qn (θ) for the mixture of Gaussians are computed by using log Zn (Λ\n ) from (27) as a generating function. [sent-624, score-0.635]
95 The other moments ⟨Γ_k⟩, ⟨µ_k^T Γ_k µ_k⟩ and ⟨log det Γ_k⟩ can be expressed as weighted sums similar to (28): ⟨Γ_k⟩_{q_n} = (1 − r_nk)⟨Γ_k⟩_{q_{\n}} + r_nk⟨Γ_k⟩_{q_{k,n}}, ⟨µ_k^T Γ_k µ_k⟩_{q_n} = (1 − r_nk)⟨µ_k^T Γ_k µ_k⟩_{q_{\n}} + r_nk⟨µ_k^T Γ_k µ_k⟩_{q_{k,n}}, and ⟨log det Γ_k⟩_{q_n} = (1 − r_nk)⟨log det Γ_k⟩_{q_{\n}} + r_nk⟨log det Γ_k⟩_{q_{k,n}}. [sent-628, score-3.023]
96 The corresponding updated parameters involve (ν_{k\n} m_{k\n} + x_n)/(ν_{k\n} + 1) and (1/2)[ν_{k\n}/(ν_{k\n} + 1)](x_n − m_{k\n})(x_n − m_{k\n})^T. [sent-630, score-0.485]
97 We extend the parameter space to include a binary latent allocation vector z_n for each data point n to indicate which mixture component was responsible for generating it (Diebolt and Robert, 1994); consequently z_nk ∈ {0, 1}, and ∑_k z_nk = 1. [sent-636, score-0.447]
98 Given the allocation variables, we define r_nk = βz_nk, N_k = ∑_n r_nk, x̄_k = (1/N_k) ∑_n r_nk x_n, Σ̄_k = (1/N_k) ∑_n r_nk (x_n − x̄_k)(x_n − x̄_k)^T, to give the conditional distributions needed for sampling the mixture parameters as π|z ∼ D(π; λ_{1,0} + βN_1, . . . [sent-642, score-0.93]
99 If {{π_{k,i}^(t), µ_{k,i}^(t), Γ_{k,i}^(t)}_{k=1}^{K}, {z_{n,i}^(t)}_{n=1}^{N}}_{t=1}^{T} indicates the samples of chain i (after a burn-in period), then ⟨log p(D|θ, z)⟩_{β_i} ≈ (1/T) ∑_t ∑_n ∑_k z_{nk,i}^(t) log N(x_n; µ_{k,i}^(t), (Γ_{k,i}^(t))^{−1}). (32) [sent-647, score-0.598]
100 Notice that the samples of the mixing weights π_{k,i}^(t) are not used in estimating the log likelihood average over the posterior, but occur in the prior. [sent-648, score-0.323]
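The average in (32) only needs the stored allocations, means and precisions of one chain; a minimal sketch of the estimator (the array shapes are assumptions made for illustration):

```python
import numpy as np

def avg_loglik_beta(x, z_samples, mu_samples, gamma_samples):
    """Estimate <log p(D|theta, z)>_beta_i as in (32) from the T retained samples of chain i.
    x: (N, d) data; z_samples: (T, N, K) one-hot allocations;
    mu_samples: (T, K, d) component means; gamma_samples: (T, K, d, d) precision matrices."""
    T, d = len(z_samples), x.shape[1]
    total = 0.0
    for z, mus, gammas in zip(z_samples, mu_samples, gamma_samples):
        for k in range(mus.shape[0]):
            idx = z[:, k] > 0                              # points allocated to component k
            if not np.any(idx):
                continue
            diff = x[idx] - mus[k]
            logdet = np.linalg.slogdet(gammas[k])[1]       # log det of the precision matrix
            maha = np.einsum('ni,ij,nj->n', diff, gammas[k], diff)
            total += np.sum(0.5 * (logdet - d * np.log(2.0 * np.pi) - maha))
    return total / T
```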
wordName wordTfidf (topN-words)
[('qn', 0.309), ('ec', 0.293), ('vb', 0.239), ('corrections', 0.239), ('ep', 0.235), ('log', 0.211), ('paquet', 0.182), ('xn', 0.176), ('erturbation', 0.16), ('inther', 0.16), ('orrections', 0.16), ('pper', 0.16), ('rnk', 0.15), ('pproximate', 0.145), ('mk', 0.134), ('correction', 0.123), ('nference', 0.122), ('posterior', 0.122), ('predictive', 0.117), ('df', 0.117), ('mixture', 0.115), ('modes', 0.114), ('likelihood', 0.112), ('zec', 0.107), ('marginal', 0.105), ('zn', 0.101), ('znk', 0.096), ('perturbation', 0.095), ('det', 0.088), ('cavity', 0.086), ('tempering', 0.082), ('minka', 0.081), ('mcmc', 0.08), ('logz', 0.075), ('normalizer', 0.073), ('pt', 0.066), ('narrower', 0.064), ('temperatures', 0.064), ('message', 0.059), ('galaxy', 0.057), ('variational', 0.057), ('dirichlet', 0.056), ('nk', 0.055), ('mog', 0.053), ('ti', 0.052), ('symmetries', 0.049), ('passing', 0.047), ('approximation', 0.046), ('winther', 0.045), ('densities', 0.044), ('bk', 0.043), ('istribution', 0.043), ('ockham', 0.043), ('redictive', 0.043), ('thermodynamic', 0.043), ('prior', 0.042), ('zd', 0.042), ('expectation', 0.041), ('allocation', 0.039), ('propagation', 0.039), ('gibbs', 0.038), ('exp', 0.037), ('mode', 0.037), ('approximating', 0.037), ('tempered', 0.037), ('ds', 0.037), ('minima', 0.037), ('lbp', 0.036), ('bishop', 0.035), ('approximations', 0.034), ('distribution', 0.034), ('density', 0.033), ('inference', 0.033), ('acceptance', 0.033), ('labelling', 0.033), ('integration', 0.032), ('hills', 0.032), ('nverse', 0.032), ('replicas', 0.032), ('svens', 0.032), ('tatistics', 0.032), ('ufficient', 0.032), ('carlo', 0.032), ('monte', 0.032), ('parallel', 0.031), ('temperature', 0.03), ('ak', 0.03), ('overlapping', 0.029), ('corrected', 0.028), ('broader', 0.028), ('deterministic', 0.027), ('ulrich', 0.027), ('enzyme', 0.027), ('dn', 0.027), ('opper', 0.026), ('hill', 0.025), ('gaussians', 0.025), ('tractable', 0.025), ('appendix', 0.024), ('wishart', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999857 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
3 0.090312243 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
Author: Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu
Abstract: Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of SegreVeronese varieties. Keywords: marginal likelihood, exact integration, mixture of independence model, computational algebra
4 0.076459818 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
Author: Dirk Gorissen, Tom Dhaene, Filip De Turck
Abstract: Due to the scale and computational complexity of currently used simulation codes, global surrogate (metamodels) models have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist (Support Vector Machines, Kriging, Neural Networks, etc.) but no type is optimal in all circumstances. Nor is there any hard theory available that can help make this choice. In this paper we present an automatic approach to the model type selection problem. We describe an adaptive global surrogate modeling environment with adaptive sampling, driven by speciated evolution. Different model types are evolved cooperatively using a Genetic Algorithm (heterogeneous evolution) and compete to approximate the iteratively selected data. In this way the optimal model type and complexity for a given data set or simulation code can be dynamically determined. Its utility and performance is demonstrated on a number of problems where it outperforms traditional sequential execution of each model type. Keywords: model type selection, genetic algorithms, global surrogate modeling, function approximation, active learning, adaptive sampling
5 0.07064005 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
Author: Ricardo Silva, Zoubin Ghahramani
Abstract: Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly. Keywords: graphical models, structural equation models, Bayesian inference, Markov chain Monte Carlo, latent variable models
We refer to Y1 , Y2 as the dosage for the blood pressure drugs, Y3 as a measure of chemotherapy dosage, Y4 as blood pressure, and Y5 as an indicator of liver status. Moreover, let Y6 be an hidden physiological factor that affects both blood pressure and liver status. It is assumed that the DAG corresponding to this setup is given by Figure 1(a). In this problem, predictions concerning Y6 are irrelevant: what we care is the marginal for {Y1 , . . . ,Y5 }. Ideally, we want to take such irrelevant hidden variables out of the loop. Yet the set of dependencies within the marginal for {Y1 , . . . ,Y5 } cannot be efficiently represented as a DAG model. If we remove the edge Y3 → Y4 from Figure 1(b), one can verify this will imply a model where Y3 and Y4 are independent given Y5 , which is not true in our original model. To avoid introducing unwanted independence constraints, a DAG such as the one in Figure 1(b) will be necessary. Notice that in general this will call for extra dependencies that did not exist originally (such as Y3 and Y4 now being marginally dependent). Not only learning from data will be more difficult due to the extra dependencies, but specifying prior knowledge on the parameters becomes less intuitive and therefore more error prone. In general, it will be the case that variables of interest have hidden common causes. This puts the researcher using DAGs in a difficult position: if she models only the marginal comprising the variables of interest, the DAG representation might not be suitable anymore. If she includes all hidden variables for the sake of having the desirable set of independencies, extra assumptions about hidden variables will have to be taken into account. In this sense, the DAG representation is flawed. There is a need for a richer family of graphical models, for which mixed graphs are an answer. Directed mixed graphs (DMGs) are graphs with directed and bi-directed edges. In particular, acyclic directed mixed graphs (ADMGs) have no directed cycle, that is, no sequence of directed edges X → · · · → X that starts and ends on the same node. Such a representation encodes a set of conditional independencies among random variables, which can be read off a graph by using a 1188 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y1 Y2 Y3 Y1 (a) Y2 Y3 Y1 (b) Y2 Y3 Y1 Y2 (c) Y3 (d) Figure 2: Different examples of directed mixed graphs. The graph in (b) is cyclic, while all others are acyclic. A subgraph of two variables where both edges Y1 → Y2 and Y1 ↔ Y2 are present is sometimes known as a “bow pattern” (Pearl, 2000) due to its shape. Y1 Y1 Y2 H1 Y4 Y3 H2 Y1 Y2 Y4 Y3 Y2 Y3 Y4 Figure 3: After marginalizing variables H1 and H2 from the DAG on the left, one possible DMG representation of the same dependencies is shown by the graph in the middle. Notice that there are multiple DMGs within a same Markov equivalence class, that is, encoding the same set of conditional independencies (Richardson and Spirtes, 2002). The two last graphs above are on the same class. criterion known as m-separation, a natural extension of the d-separation criterion used for directed acyclic graphs (Richardson, 2003). In a ADMG, two adjacent nodes might be connected by up to two edges, where in this case one has to be bi-directed and the other directed. A cyclic model can in principle allow for two directed edges of opposite directions. Figure 2 provides a few examples of DMGs. 
The appeal of this graphical family lies on the representation of the marginal independence structure among a set of observed variables, assuming they are part of a larger DAG structure that includes hidden variables. This is illustrated in Figure 3.1 More details on DMGs are given in Sections 2 and 8. In our blood pressure\liver status multiple regression problem, the suitable directed mixed graph is depicted in Figure 1(c). The contribution of this paper is how to perform Bayesian inference on two different families of mixed graph models: Gaussian and probit. Markov chain Monte Carlo (MCMC) and variational approximations will be discussed. Current Bayesian inference approaches for DMG models have limitations, as discussed in Section 2, despite the fact that such models are widely used in several sciences. The rest of the paper is organized as follows. Section 3 describes a special case of Gaussian mixed graph models, where only bi-directed edges are allowed. Priors and a Monte Carlo algorithm are described. This case will be a building block for subsequent sections, such as Section 4, where 1. Notice that it is not necessarily the case that the probability model itself is closed under marginalization. This will happen to some models, including the Gaussian model treated in this paper. But the basic claim of closure concerns the graph, that is, the representation of independence constraints. 1189 S ILVA AND G HAHRAMANI Gaussian DMG models are treated. Section 5 covers a type of discrete distribution for binary and ordinal data that is Markov with respect to an acyclic DMG. In Section 6 we discuss more sophisticated algorithms that are useful for scaling up Bayesian learning to higher-dimensional problems. Section 7 presents several empirical studies. Since the use of mixed graph models in machine learning applications is still in its early stages, we briefly describe in Section 8 a variety of possible uses of such graphs in machine learning applications. 2. Basics of DMGs, Gaussian Models and Related Work In this section, we describe the Gaussian DMG model and how it complements latent variable models. At the end of the section, we also discuss a few alternative approaches for the Bayesian inference problem introduced in this paper. 2.1 Notation and Terminology In what follows, we will use standard notions from the graphical modeling literature, such as vertex (node), edge, parent, child, ancestor, descendant, DAG, undirected graph, induced subgraph, Markov condition and d-separation. Refer to Pearl (1988) and Lauritzen (1996) for the standard definitions if needed. Less standard definitions will be given explicitly when appropriate. A useful notion is that of m-separation (Richardson, 2003) for reading off which independencies are entailed by a DMG representation. This can be reduced to d-separation (Pearl, 1988) by the following trick: for each bi-directed edge Yi ↔ Y j , introduce a new hidden variable Xi j and the edges Xi j → Yi and Xi j → Y j . Remove then all bi-directed edges and apply d-separation to the resulting directed graph. As usual, we will refer to vertices (nodes) in a graph and the corresponding random variables in a distribution interchangeably. Data points are represented by vectors with an upper index, such as ( j) Y(1) , Y(2) , . . . , Y(n) . The variable corresponding to node Yi in data point Y( j) is represented by Yi . 
2.2 Gaussian Parameterization The origins of mixed graph models can be traced back to Sewall Wright (Wright, 1921), who used special cases of mixed graph representations in genetic studies. Generalizing Wright’s approach, many scientific fields such as psychology, social sciences and econometrics use linear mixed graph models under the name of structural equation models (Bollen, 1989). Only recently the graphical and parametrical aspects of mixed graph models have been given a thorough theoretical treatment (Richardson and Spirtes, 2002; Richardson, 2003; Kang and Tian, 2005; Drton and Richardson, 2008a). In practice, many structural equation models today are Gaussian models. We will work under this assumption unless stated otherwise. For a DMG G with a set of vertices Y, a standard parameterization of the Gaussian model is given as follows. For each variable Yi with a (possibly empty) parent set {Yi1 , ...,Yik }, we define a “structural equation” Yi = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi where εi is a Gaussian random variable with zero mean and variance vii . Notice that this parameterization allows for cyclic models. Unlike in standard Gaussian DAG models, the error terms {εi } are not necessarily mutually independent. Independence is asserted by the graphical structure: given two vertices Yi and Y j , 1190 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS the respective error terms εi and ε j are marginally independent if Yi and Y j are not connected by a bi-directed edge. By this parameterization, each directed edge Yi ← Y j in the graph corresponds to a parameter bi j . Each bi-directed edge Yi ↔ Y j in the graph is associated with a covariance parameter vi j , the covariance of εi and ε j . Each vertex Y j in the graph is associated with variance parameter v j j , the variance of ε j . Algebraically, let B be a m × m matrix, m being the number of observed variables. This matrix is such that (B)i j = bi j if Yi ← Y j exists in the graph, and 0 otherwise. Let V be a m × m matrix, where (V)i j = vi j if i = j or if Yi ↔ Y j is in the graph, and 0 otherwise. Let Y be the column vector of observed variables, α the column vector of intercept parameters, and ε be the corresponding vector of error terms. The set of structural equations can be given in matrix form as Y = BY + α + ε ⇒ Y = (I − B)−1 (ε + α) ⇒ Σ(Θ) = (I − B)−1 V(I − B)−T (1) where A−T is the transpose of A−1 and Σ(Θ) is the implied covariance matrix of the model, Θ ≡ {B, V, α}. 2.2.1 C OMPLETENESS OF PARAMETERIZATION AND A NCESTRAL G RAPHS An important class of ADMGs is the directed ancestral graph. Richardson and Spirtes (2002) provide the definition and a thorough account of the Markov properties of ancestral graphs. One of the reasons for the name “ancestral graph” is due to one of its main properties: if there is a directed path Yi → · · · → Y j , that is, if Yi is an ancestor of Y j , then there is no bi-directed edge Yi ↔ Y j . Thus directed ancestral graphs are ADMGs with this constraint.2 In particular, they show that any Gaussian distribution that is Markov with respect to a given ADMG can be represented by some Gaussian ancestral graph model that is parameterized as above. For the ancestral graph family, the given parameterization is complete: that is, for each Markov equivalence class, it is always possible to choose an ancestral graph where the resulting parameterization imposes no further constraints on the distribution besides the independence constraints of the class. 
Since the methods described in this paper apply to general DMG models, they also apply to directed ancestral graphs. In principle, it is possible to define and parameterize a Gaussian DAG model that entails exactly the same independence constraints encoded in an directed ancestral graph. One possibility, as hinted in the previous Section, is to replace each bi-directed edge Yi ↔ Y j by a new path Yi ← Xi j → Y j . Variables {Xi j } are “ancillary” hidden variables, in the sense that they are introduced for the sake of obtaining the same independence constraints of an ancestral graph. Standard Bayesian methodology can then be applied to perform inference in this Gaussian DAG model. However, this parameterization might have undesirable consequences, as discussed in Section 8.6 of Richardson and Spirtes (2002). Moreover, when Markov chain Monte Carlo algorithms are applied to compute posteriors, the “ancillary” hidden variables will have to be integrated out numerically. The resulting Markov chain can suffer from substantial autocorrelation when compared to a model with no ancillary variables. We illustrate this behavior in Section 7. Further constraints beyond independence constraints are certainly desirable depending on the context. For instance, general ADMGs that are not ancestral graphs may impose other constraints (Richardson and Spirtes, 2002), and such graphs can still be sensible models of, for example, the 2. Notice this rules out the possibility of having both edges Yi → Y j and Yi ↔ Y j in the same ancestral graph. 1191 S ILVA AND G HAHRAMANI causal processes for the problem at hand. When many observed variables are confounded by a same hidden common cause, models based on factor analysis are appropriate (Silva et al., 2006). However, it is useful to be able to build upon independence models that are known to have a complete parameterization. In any case, even the latent variables in any model might have dependencies that arise from other latent variables that were marginalized, and a latent variable ADMG model will be necessary. When it comes to solving a problem, it is up to the modeler (or learning algorithm) to decide if some set of latent variables should be included, or if they should be implicit, living their hidden life through the marginals. Richardson and Spirtes (2002) provide further details on the advantages of a complete parameterization. Drton and Richardson (2004) provide an algorithm for fitting Gaussian ancestral graph models by maximum likelihood. 2.3 Bayesian Inference The literature on Bayesian structural equation models is extensive. Scheines et al. (1999) describe one of the first approaches, including ways of testings such models. Lee (2007) provides details on many recent advances. Standard Bayesian approaches for Gaussian DMG models rely on either attempting to reduce the problem to inference with DAG models, or on using rejection sampling. In an application described by Dunson et al. (2005), the “ancillary latent” trick is employed, and Gibbs sampling for Gaussian DAG models is used. This parameterization has the disadvantages mentioned in the previous section. Scheines et al. (1999) use the complete parameterization, with a single parameter corresponding to each bi-directed edge. However, the global constraint of positive-definiteness in the covariance matrix is enforced only by rejection sampling, which might be inefficient in models with moderate covariance values. The prior is setup in an indirect way. 
A Gaussian density function is independently defined for each error covariance vi j . The actual prior, however, is the result of multiplying all of such functions and the indicator function that discards non-positive definite matrices, which is then renormalized. In contrast, the Bayesian approach delineated in the next sections uses the complete parameterization, does not appeal to rejection sampling, makes use of a family of priors which we believe is the natural choice for the problem, and leads to convenient ways of computing marginal likelihoods for model selection. We will also see that empirically they lead to much better behaved Markov chain Monte Carlo samplers when compared to DAGs with ancillary latent variables. 3. Gaussian Models of Marginal Independence This section concerns priors and sampling algorithms for zero-mean Gaussian models that are Markov with respect to a bi-directed graph, that is, a DMG with no directed edges. Focusing on bi-directed graphs simplifies the presentation, while providing a convenient starting point to solve the full DMG case in the sequel. Concerning the notation: the distribution we introduce in this section is a distribution over covariance matrices. In the interest of generality, we will refer to the random matrix as Σ. In the context of the previous section, Σ ≡ Σ(Θ) = V, since we are assuming B = 0, α = 0. 1192 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 3.1 Priors Gaussian bi-directed graph models are sometimes called covariance graph models. Covariance graphs are models of marginal independence: each edge corresponds to a single parameter in the covariance matrix (the corresponding covariance); the absence of an edge Yi ↔ Y j is a statement that σYiY j = 0, σXY being the covariance of random variables X and Y . More precisely, if Σ is a random covariance matrix generated by a covariance model, a distribution of Σ is the distribution over the (non-repeated) entries corresponding to variances and covariances of adjacent nodes.3 In a model with a fully connected bi-directed graph, this reduces to a space of unrestricted covariance matrices. A common distribution for covariance matrices is the inverse Wishart IW (δ, U). In this paper, we adopt the following inverse Wishart parameterization: 1 p(Σ) ∝ |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ positive definite, 2 p(·) being the density function, tr(·) the trace function, and m the number of variables (nodes) in our model.4 We will overload the symbol p(·) wherever it is clear from the context which density function we are referring to. It is assumed that δ > 0 and U is positive definite. Following Atay-Kayis and Massam (2005), let M + (G ) be the cone of positive definite matrices such that, for a given bi-directed graph G and Σ ∈ M + (G ), σi j = 0 if nodes Yi and Y j are not adjacent in G . It is convenient to choose a distribution that is conjugate to the Gaussian likelihood function, since one can use the same algorithms for performing inference both in the prior and posterior. In a zero-mean Gaussian model, the likelihood function for a fixed data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined by the sufficient statistic S = ∑n (Y(d) )(Y(d) )T as follows: d=1 1 2 L (Σ; D ) = (2π)−nm/2 |Σ|−n/2 exp − tr(Σ−1 S) . (2) We extend the inverse Wishart distribution to the case of constrained covariance matrices in order to preserve conjugacy. 
This define the following distribution: p(Σ) = 1 1 |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ ∈ M + (G ) IG (δ, U) 2 (3) which is basically a re-scaled inverse Wishart prior with a different support and, consequently, different normalizing constant IG (δ, U). An analogous concept exists for undirected graphs, where Σ−1 ∈ M + (G ) is given a Wishart-like prior: the “G -Wishart” distribution (Atay-Kayis and Massam, 2005). We call the distribution with density function defined as in Equation (3) the G -Inverse Wishart distribution (G -IW ). It will be the basis of our framework. There are no analytical formulas for the normalizing constant. 3. As such, the density function for Σ is defined with respect to the Lebesgue measure of the non-zero, independent elements of this matrix. 4. We adopt this non-standard parameterization of the inverse Wishart because it provides a more convenient reparameterization used in the sequel. Notice this is the parameterization used by Brown et al. (1993) and Atay-Kayis and Massam (2005), which developed other distributions for covariance matrices. 1193 S ILVA AND G HAHRAMANI 3.2 The Normalizing Constant We now derive a Monte Carlo procedure to compute IG (δ, U). In the sequel, this will be adapted into an importance sampler to compute functionals of a G -IW distribution. The core ideas are also used in a Gibbs sampler to obtain samples from its posterior. The normalizing constant is essential for model selection of covariance graphs. By combining the likelihood equation (2) with the prior (3), we obtain the joint nm p(D , Σ | G ) = (2π)− 2 IG (δ, U)−1 × |Σ|− δ+2m+n 2 1 exp − tr[Σ−1 (S + U)] 2 where we make the dependency on the graphical structure G explicit. By the definition of IG , integrating Σ out of the above equation implies the following marginal likelihood: p(D | G ) = 1 IG (δ + n, S + U) nm IG (δ, U) (2π) 2 from which a posterior P (G | D ) can be easily derived as a function of quantities of the type IG (·, ·). The normalizing constant IG (δ, U) is given by the following integral:5 IG (δ, U) = Z M + (G ) |Σ|− δ+2m 2 1 exp − tr(Σ−1 U) dΣ. 2 (4) The space M + (G ) can be described as the space of positive definite matrices conditioned on the event that each matrix has zero entries corresponding to non-adjacent nodes in graph G . We will reduce the integral (4) to an integral over random variables we know how to sample from. The given approach follows the framework of Atay-Kayis and Massam (2005) using the techniques of Drton and Richardson (2003). Atay-Kayis and Massam (2005) show how to compute the marginal likelihood of nondecomposable undirected models by reparameterizing the precision matrix through the Cholesky decomposition. The zero entries in the inverse covariance matrix of this model correspond to constraints in this parameterization, where part of the parameters can be sampled independently and the remaining parameters calculated from the independent ones. We will follow a similar framework but with a different decomposition. It turns out that the Cholesky decomposition does not provide an easy reduction of (4) to an integral over canonical, easy to sample from, distributions. We can, however, use Bartlett’s decomposition to achieve this reduction. 3.2.1 BARTLETT ’ S D ECOMPOSITION Before proceeding, we will need a special notation for describing sets of indices and submatrices. Let {i} represent the set of indices {1, 2, . . . , i}. Let Σi,{i−1} be the row vector containing the covariance between Yi and all elements of {Y1 ,Y2 , . . . ,Yi−1 }. 
Let Σ{i−1},{i−1} be the marginal covariance matrix of {Y1 ,Y2 , . . . ,Yi−1 }. Let σii be the variance of Yi . Define the mapping Σ → Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }, 5. Notice this integral is always finite for any choice of δ > 0 and positive definite U, since it is no greater than the normalizing constant of the inverse Wishart. 1194 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS such that Bi is a row vector with i − 1 entries, γi is a scalar, and γ1 = σ11 , Bi = Σi,{i−1} Σ−1 i > 1, {i−1},{i−1} , −1 γi = σii.{i−1},{i−1} ≡ σii − Σi,{i−1} Σ{i−1},{i−1} Σ{i−1},i , i > 1. (5) The set Φ provides a parameterization of Σ, in the sense that the mapping (5) is bijective. Given that σ11 = γ1 , the inverse mapping is defined recursively by Σi,{i−1} = Bi Σ{i−1},{i−1} , i > 1, σii = γi + Bi Σ{i−1},i , i > 1. (6) We call the set Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } the Bartlett parameters of Σ, since the decomposition (6) is sometimes known as Bartlett’s decomposition (Brown et al., 1993). For a random inverse Wishart matrix, Bartlett’s decomposition allows the definition of its density function by the joint density of {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }. Define U{i−1},{i−1} , U{i−1},i and uii.{i−1},{i−1} in a way analogous to the Σ definitions. The next lemma follows directly from Lemma 1 of Brown et al. (1993): Lemma 1 Suppose Σ is distributed as IW (δ, U). Then the distribution of the corresponding Bartlett parameters Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } is given by: 1. γi is independent of Φ\{γi , Bi } 2. γi ∼ IG((δ + i − 1)/2, uii.{i−1,i−1} /2), where IG(α, β) is the inverse gamma distribution −1 3. Bi | γi ∼ N(U−1 {i−1},{i−1} U{i−1},i , γi U{i−1},{i−1} ), where N(M, C) is a multivariate Gaussian −1 distribution and U−1 {i−1},{i−1} ≡ (U{i−1},{i−1} ) . 3.2.2 BARTLETT ’ S D ECOMPOSITION OF M ARGINAL I NDEPENDENCE M ODELS What is interesting about Bartlett’s decomposition is that it provides a simple parameterization of the inverse Wishart distribution with variation independent parameters. This decomposition allows the derivation of new distributions. For instance, Brown et al. (1993) derive a “Generalized Inverted Wishart” distribution that allows one to define different degrees of freedom for different submatrices of an inverse Wishart random matrix. For our purposes, Bartlett’s decomposition can be used to reparameterize the G -IW distribution. For that, one needs to express the independent elements of Σ in the space of Bartlett parameters. The original reparameterization maps Σ to Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bd , γd }. To impose the constraint that Yi and Y j are uncorrelated, for i > j, is to set Bi Σ{i−1},{i−1} j = σYiY j (Φ) = 0. For a fixed Σ{i−1},{i−1} , this implies a constraint on (Bi ) j ≡ βi j . Following the terminology used by Richardson and Spirtes (2002), let a spouse of node Y in a mixed graph be any node adjacent to Y by a bi-directed edge. The set of spouses of Yi is denoted by sp(i). The set of spouses of Yi according to order Y1 ,Y2 , . . . ,Ym is defined by sp≺ (i) ≡ sp(i) ∩ {Y1 , . . . ,Yi−1 }. The set of non-spouses of Yi is denoted by nsp(i). Analogously, nsp≺ (i) ≡ {Y1 , . . . ,Yi−1 }\sp≺ (i). Let Bi,sp≺ (i) be the subvector of Bi corresponding to the the respective spouses of Yi . Define Bi,nsp≺ (i) analogously. 
1195 S ILVA AND G HAHRAMANI Given the constraint Bi Σ{i−1},nsp≺ (i) = 0, it follows that Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) + Bi,nsp≺ (i) Σnsp≺ (i),nsp≺ (i) = 0 ⇒ Bi,nsp≺ (i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp (7) Identity (7) was originally derived by Drton and Richardson (2003). A property inherited from the original decomposition for unconstrained matrices is that Bi,sp≺ (i) is functionally independent of Σ{i−1},{i−1} . From (7), we obtain that the free Bartlett parameters of Σ are ΦG ≡ {γ1 , B2,sp≺ (2) , γ2 , B3,sp≺ (3) , γ3 , . . . , Bm,sp≺ (m) , γm }. Notice that, according to (5), Φ corresponds to the set of parameters of a fully connected, zeromean, Gaussian DAG model. In such a DAG, Yi is a child of {Y1 , . . . ,Yi−1 }, and Yi = Bi Yi−1 + ζ j , ζ j ∼ N(0, γ j ) where Yi−1 is the (i − 1) × 1 vector corresponding to {Y1 , . . . ,Yi−1 }. As discussed by Drton and Richardson (2003), this interpretation along with Equation (7) implies Yi = Bi,sp≺ (i) Zi + ζ j (8) where the entries in Zi are the corresponding residuals of the regression of sp≺ (i) on nsp≺ (i). The next step in solving integral (4) is to find the Jacobian J(ΦG ) of the transformation Σ → ΦG . This is given by the following Lemma: Lemma 2 The determinant of the Jacobian for the change of variable Σ → ΦG is m |J(ΦG )| = ∏ |Ri | = i=2 m−1 1 γm−i m ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 i ∏ where Ri ≡ Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) , that is, the covariance matrix of the nsp / respective residual Zi (as parameterized by ΦG ). If nsp≺(i) = 0, Ri is defined as Σsp≺(i) ,sp≺(i) and |Σnsp≺(i) ,nsp≺(i) | is defined as 1. The proof of this Lemma is in Appendix C. A special case is the Jacobian of the unconstrained covariance matrix (i.e., when the graph has no missing edges): m−1 |J(Φ)| = ∏ γm−i . i (9) i=1 Now that we have the Jacobian, the distribution over Bartlett’s parameters given by Lemma 1, and the identities of Drton and Richardson (2003) given in Equation (7), we have all we need to provide a Monte Carlo algorithm to compute the normalizing constant of a G -IW with parameters (δ, U). Let Σ(ΦG ) be the implied covariance matrix given by our set of parameters ΦG . We start from the integral in (4), and rewrite it as a function of ΦG . This can be expressed by substituting Σ for Σ(ΦG ) and multiplying the integrand by the determinant of the Jacobian. Notice that the parameters in Σ(ΦG ) are variation independent: that is, their joint range is given by the product of 1196 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS their individual ranges (positive reals for the γ variables and the real line for the β coefficients). This range will replace the original M + (G ) space, which we omit below for simplicity of notation: IG (δ, U) = Z |J(ΦG )||Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . 2 We now multiply and divide the above expression by the normalizing constant of an inverse Wishart (δ, U), which we denote by IIW (δ, U): IG (δ, U) = IIW (δ, U) Z −1 |J(ΦG )| × IIW (δ, U)|Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . (10) 2 The expression −1 IIW (δ, U)|Σ|− δ+2m 2 1 exp − tr(Σ−1 U) 2 corresponds to the density function of an inverse Wishart Σ. Lemma 1 allows us to rewrite the inverse Wishart density function as the density of Bartlett parameters, but this is assuming no independence constraints. We can easily reuse the result of Lemma 1 as follows: 1. write the density of the inverse Wishart as the product of gamma-normal densities given in Lemma 1; 2. 
this expression contains the original Jacobian determinant |J(Φ)|. We have to remove it, since we are plugging in our own Jacobian determinant. Hence, we divide the reparameterized density by the expression in Equation (9). This ratio |J(ΦG )|/|J(Φ)| can be rewritten as m |J(ΦG )| |Ri | 1 = ∏ m−i = m |J(Φ)| ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 γi / where |Σnsp≺(i) ,nsp≺(i) | ≡ 1 if nsp≺ (i) = 0; 3. substitute each vector Bi,nsp≺ (i) , which is not a free parameter, by the corresponding expression −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp This substitution takes place into the original factors given by Bartlett’s decomposition, as introduced in Lemma 1: −(i−1)/2 p(Bi , γi ) = (2π)−(i−1)/2 γi |U{i−1},{i−1} |1/2 1 × exp − (BiT − Mi )T U{i−1},{i−1} (BiT − Mi ) 2γi (uii.{i−1},{i−1} /2)(δ+i−1)/2 −( δ+i−1 +1) 1 exp − uii.{i−1},{i−1} × γi 2 Γ((δ + i − 1)/2) 2γi where Mi ≡ U−1 {i−1},{i−1} U{i−1},i . Plugging in this in (10) results in 1197 (11) S ILVA AND G HAHRAMANI IG (δ, U) = IIW (δ, U) Z 1 ∏m |Σnsp≺(i) ,nsp≺(i) | i=2 m × p(γ1 ) ∏ p(Bi , γi ) dΦG . i=2 However, after substitution, each factor p(Bi , γi ) is not in general a density function for {Bi,sp≺ (i) , γi } and will include also parameters {B j,sp≺ ( j) , γ j }, j < i. Because of the non-linear relationships that link Bartlett parameters in a marginal independence model, we cannot expect to reduce this expression to a tractable distribution we can easily sample from. Instead, we rewrite each original density factor p(Bi , γi ) such that it includes all information about Bi,sp≺ (i) and γi within a canonical density function. That is, factorize p(Bi , γi ) as p(Bi , γi |Φi−1 ) = pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) × fi (Φi−1 ) (12) where we absorb any occurrence of Bi,sp≺ (i) within the sampling distribution and factorize the remaining dependence on previous parameters Φi−1 ≡ {γ1 , γ2 , B2,sp≺ (2) , . . . , γi−1 , Bi−1,sp≺ (i−1) } into a separate function.6 We derive the functions pb (·), pg (·) and fi (·) in Appendix A. The result is as follows. The density pb (Bi,sp≺ (i) |γi , Φi−1 ) is the density of a Gaussian N(Ki mi , γi Ki ) such that mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i (13) Ai = Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp where Uss Usn Uns Unn ≡ Usp≺ (i),sp≺ (i) Usp≺ (i),nsp≺ (i) Unsp≺ (i),sp≺ (i) Unsp≺ (i),nsp≺ (i) . (14) The density pg (γi |Φi−1 ) is the density of an inverse gamma IG(g1 , g2 ) such that δ + i − 1 + #nsp≺ (i) , 2 uii.{i−1},{i−1} + Ui = , 2 TU = Mi {i−1},{i−1} Mi − mT Ki mi . i g1 = g2 Ui where uii.{i−1},{i−1} was originally defined in Section 3.2.1. Finally, (i−1)−#sp≺ (i) 2 |Ki |1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 6. A simpler decomposition was employed by Silva and Ghahramani (2006) (notice however that paper used an incorrect expression for the Jacobian). The following derivation, however, can be adapted with almost no modification to define a Gibbs sampling algorithm, as we show in the sequel. 1198 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS / Density function pb (Bi,sp≺ (i) |·, ·) and determinant |Ki |1/2 are defined to be 1 if sp≺ (i) = 0. Ui TU / / is defined to be zero if nsp≺ (i) = 0, and Ui = Mi {i−1},{i−1} Mi if sp≺ (i) = 0. The original normalizing constant integral is the expected value of a function of ΦG over a factorized inverse gamma-normal distribution. 
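The substitution in step 3 is mechanical. A minimal numpy sketch of how the constrained coefficients would be filled in from the free ones via identity (7) is given below (our own illustration; the function and argument names are hypothetical):

```python
import numpy as np

def complete_row_coefficients(B_sp, Sigma_prev, sp_idx, nsp_idx):
    """Given the free Bartlett coefficients B_{i,sp<(i)} and the covariance matrix
    Sigma_prev over Y_1, ..., Y_{i-1}, fill in the constrained coefficients
    B_{i,nsp<(i)} via identity (7), so that the implied covariances between Y_i
    and its preceding non-spouses are zero."""
    B_i = np.zeros(Sigma_prev.shape[0])
    B_i[sp_idx] = B_sp
    if len(nsp_idx) > 0:
        S_sp_nsp = Sigma_prev[np.ix_(sp_idx, nsp_idx)]
        S_nsp_nsp = Sigma_prev[np.ix_(nsp_idx, nsp_idx)]
        # B_{i,nsp} = -B_{i,sp} Sigma_{sp,nsp} Sigma_{nsp,nsp}^{-1}
        B_i[nsp_idx] = -B_sp @ S_sp_nsp @ np.linalg.inv(S_nsp_nsp)
    return B_i
```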
The density function of this distribution is given below: m pI(δ,U) (ΦG ) = m ∏ pg (γi |Φi−1 ) ∏ pb (Bi,sp i=1 ≺ (i) |γi , Φi−1 ) . i=2 We summarize the main result of this section through the following theorem: Theorem 3 Let f (X) p(X) be the expected value of f (X) where X is a random vector with density p(X). The normalizing constant of a G -Inverse Wishart with parameters (δ, U) is given by m IG (δ, U) = IIW (δ, U) × fi (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | . pI(δ,U) (ΦG ) This can be further simplified to m IG (δ, U) = fi′ (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | (15) pI(δ,U) (ΦG ) where fi′ (Φi−1 ) ≡ (2π) #sp≺ (i) 2 |Ki (Φi−1 )|1/2 Γ((δ + i − 1 + #nsp≺ (i))/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 which, as expected, reduces IG (δ, U) to IIW (δ, U) when the graph is complete. (1) (2) A Monte Carlo estimate of IG (δ, U) is then given from (15) by obtaining samples {ΦG , ΦG , (M) . . . , ΦG } according to pI(δ,U) (·) and computing: (s) fi′ (Φi−1 ) 1 M m IG (δ, U) ≈ ∑∏ M s=1 i=1 |Σnsp ,nsp (Φ(s) )| i−1 ≺(i) ≺(i) where here we emphasize that Σnsp≺(i) ,nsp≺(i) is a function of ΦG as given by (6). 3.3 General Monte Carlo Computation If Y follows a Gaussian N(0, Σ) where Σ is given a G -IW (δ, U) prior, then from a sample D = {Y(1) , . . . , Y(n) } with sufficient statistic S = ∑n (Y(d) )(Y(d) )T , the posterior distribution for Σ d=1 given S will be a G -IW (δ + n, U + S). In order to obtain samples from the posterior or to compute its functionals, one can adapt the algorithm for computing normalizing constants. We describe an importance sampler for computing functionals, followed by a Gibbs sampling algorithm that also provides samples from the posterior. 1199 S ILVA AND G HAHRAMANI Algorithm S AMPLE GIW-1 Input: a m × m matrix U, scalar δ, bi-directed graph G , an ordering ≺ 1. Let Σ be a m × m matrix 2. Define functions sp≺ (·), nsp≺ (·) according to G and ordering ≺ 3. Sample σ11 from IG(δ/2, u11 /2) 4. For i = 2, 3, . . . , m 5. Sample γi ∼ IG((δ + i − 1 + #nsp≺ (i))/2, (uii.{i−1},{i−1} + Ui )/2) 6. Sample Bi,sp≺ (i) ∼ N(Ki mi , γi Ki ) 7. Set Bi,nsp≺(i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp 8. Set ΣT {i−1},i = Σi,{i−1} = Bi Σ{i−1},{i−1} 9. Set σii = γi + Bi Σi,{i−1} 10. Set w = ∏m fi′ (Φi−1 )/|Σnsp≺ (i),nsp≺ (i) | i=1 11. Return (w, Σ). Figure 4: A procedure for generating an importance sample Σ and importance weight w for computing functionals of a G -Inverse Wishart distribution. Variables {Mi , mi , Ki , Ui } and function fi′ (Φi−1 ) are defined in Section 3.2.2. 3.3.1 T HE I MPORTANCE S AMPLER One way of computing functionals of the G -IW distribution, that is, functions of the type g(δ, U; G ) ≡ Z M + (G ) g(Σ)p(Σ | δ, U, G ) dΣ is through the numerical average g(δ, U; G ) ≈ ∑M ws g(Σ(s) ) s=1 , ∑M ws s=1 where weights {w1 , w2 , . . . , wM } and samples {Σ(1) , Σ(2) , . . . , Σ(M) } are generated by an importance sampler. The procedure for computing normalizing constants can be readily adapted for this task using pI(δ,U) (·) as the importance distribution and the corresponding weights from the remainder factors. The sampling algorithm is shown in Figure 4. 3.3.2 T HE G IBBS S AMPLER While the importance sampler can be useful to compute functionals of the distribution, we will need a Markov chain Monte Carlo procedure to sample from the posterior. In the Gibbs sampling 1200 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Algorithm S AMPLE GIW-2 Input: a m × m matrix U, scalar δ, bi-directed graph G , a m × m matrix Σstart 1. Let Σ be a copy of Σstart 2. 
Define functions sp(·), nsp(·) according to G 3. For i = 1, 2, 3, . . . , m 4. Sample γi ∼ IG((δ + (m − 1) + #nsp(i))/2, (uii.{\i},{\i} + U\i )/2) 5. Sample Bi,sp(i) from a N(K\i m\i , γi K\i ) 6. Set Bi,nsp(i) = −Bi,sp(i) Σsp(i),nsp(i) Σ−1 nsp(i),nsp(i) 7. Set ΣT = Σi,{\i} = Bi Σ{\i},{\i} {\i},i 8. Set σii = γi + Bi Σi,{\i} 9. Return Σ. Figure 5: A procedure for generating a sampled Σ within a Gibbs sampling procedure. procedure, we sample the whole i-th row of Σ, for each 1 ≤ i ≤ m, by conditioning on the remaining independent entries of the covariance matrix as obtained on the previous Markov chain iteration. The conditional densities required by the Gibbs sampler can be derived from (12), which for a particular ordering ≺ implies m p(Σ; δ, U, G ) ∝ pg (γ1 ) ∏ pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) fi (Φi−1 ). i=2 By an abuse of notation, we used Σ in the left-hand side and the Bartlett parameters in the righ-hand side. The conditional density of {Bm,sp≺ (m) , γm } given all other parameters is therefore p(Bm,sp≺ (m) , γm |ΦG \{Bm,sp≺ (m) , γm }) = pb (Bm,sp≺ (m) |γm , Φm−1 )pg (γm |Φm−1 ) from which we can reconstruct a new sample of the m-th row/column of Σ after sampling {Bm,sp≺ (m) , γm }. Sampling other rows can be done by redefining a new order where the corresponding target variable is the last one. More precisely: let {\i} denote the set {1, 2, . . . , i − 1, i + 1, . . . , m}. The Gibbs algorithm is analogous to the previous algorithms. Instead of sp≺ (i) and nsp≺ (i), we refer to the original sp(i) and nsp(i). Matrices Σ{\i},{\i} and U{\i},{\i} are defined by deleting the respective i-th row and i-th columns. Row vector Σi,{\i} and scalar uii.{\i} are defined accordingly, as well as any other vector and matrix originally required in the marginal likelihood/importance sampling procedure. The algorithm is described in Figure 5. The procedure can be interpreted as calling a modification of the importance sampler with a dynamic ordering ≺i which, at every step, moves Yi to the end of the global ordering ≺. 1201 S ILVA AND G HAHRAMANI 3.4 Remarks The importance sampler suffers from the usual shortcomings in high-dimensional problems, where a few very large weights dominate the procedure (MacKay, 1998). This can result in unstable estimates of functionals of the posterior and the normalizing constant. The stability of the importance sampler is not a simple function of the number of variables in the domain. For large but sparse graphs, the number of parameters might be small. For large but fairly dense graphs, the importance distribution might be a good match to the actual distribution since there are few constraints. In Section 7, we performe some experiments to evaluate the sampler. When used to compute functionals, the Gibbs sampler is more computationally demanding considering the cost per step, but we expect it to be more robust in high-dimensional problems. In problems that require repeated calculations of functionals (such as the variational optimization procedure of Section 4.3), it might be interesting to run a few preliminary comparisons between the estimates of the two samplers, and choose the (cheaper) importance sampler if the estimates are reasonably close. Na¨vely, the Gibbs sampler costs O(m4 ) per iteration, since for each step we have to invert the ı matrix Σnsp{\i},nsp{\i} , which is of size O(m) for sparse graphs. However, this inversion can cost much less than O(m3 ) if sparse matrix inversion methods are used. 
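As a usage illustration, a self-normalized importance-sampling estimate of a posterior functional based on the procedure of Figure 4 could be assembled as follows. This is a sketch under the assumption that `sample_giw_1` is a user-supplied implementation of Figure 4 returning a (weight, Sigma) pair; the names are ours, not the authors'.

```python
import numpy as np

def importance_estimate(sample_giw_1, n_samples, g):
    """Estimate <g(Sigma)> under a G-inverse Wishart by self-normalized
    importance sampling, reusing the weights produced by Figure 4."""
    weights, values = [], []
    for _ in range(n_samples):
        w, Sigma = sample_giw_1()        # one draw from the importance distribution
        weights.append(w)
        values.append(g(Sigma))
    weights = np.asarray(weights)
    values = np.asarray(values, dtype=float)
    return np.sum(weights * values) / np.sum(weights)

# Hypothetical usage: posterior mean of Sigma[0, 1] under G-IW(delta + n, U + S)
# est = importance_estimate(lambda: sample_giw_1(U + S, delta + n, G, order),
#                           n_samples=5000, g=lambda Sig: Sig[0, 1])
```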
Still, the importance sampler can be even more optimized by using the methods of Section 6. 4. Gaussian Directed Mixed Graph Models As discussed in Section 2, Gaussian directed mixed graph models are parameterized by the set with parameters Θ = {V, B, α}. Our prior takes the form p(Θ) = p(B)p(α)p(V). We assign priors for the parameters of directed edges (non-zero entries of matrix B) in a standard way: each parameter bi j is given a Gaussian N(cBj , sBj ) prior, where all parameters are marginally independent in the i i prior, that is, p(B) = ∏i j p(bi j ). The prior for intercept parameters α is analogous, with αi being a Gaussian N(cα , sα ). i i Recall from Equation (1) that the implied covariance of the model is given by the matrix Σ(Θ) = (I − B)−1 V(I − B)−T . Similarly, we have the implied mean vector µ(Θ) ≡ (I − B)−1 α. The likelihood function for data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined as L (Θ; D ) = |Σ(Θ)|−n/2 ∏n exp − 1 (Y(d) − µ(Θ))T Σ(Θ)−1 (Y(d) − µ(Θ)) d=1 2 = |(I − B)−1 ||V||(I − B)−T | −n/2 exp − 1 tr(V−1 (I − B)S(I − B)T ) 2 , where now S ≡ ∑n (Y(d) − µ(Θ))(Y(d) − µ(Θ))T . d=1 Given a prior G -IW (δ, U) for V, it immediately follows that the posterior distribution of V given the data and other parameters is V | {B, α, D } ∼ G -IW (δ + n, U + (I − B)S(I − B)T ). Therefore it can be sampled using the results from the previous section. Notice this holds even if the directed mixed graph G is cyclic. Sampling αi given {D , Θ\{αi }} can also be done easily for both cyclic and acyclic models: the ′ ′ ′ posterior is given by a normal N(cα /sα , 1/sα ) where i i i 1202 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS sα i cα i ′ ≡ ′ ≡ 1 + n(V−1 )ii , sα i n m m cα (d) (d) −1 −1 i , − ∑ bt pt Ypt α − n ∑ (V )it αt + ∑ ∑ (V )it Yt si pt d=1 t=1 t=1,t=i with pt being an index running over the parents of Yt in G . However, sampling the non-zero entries of B results in two different cases depending whether G is cyclic or not. We deal with them separately. 4.1 Sampling from the Posterior: Acyclic Case The acyclic case is simplified by the fact that I − B can be rearranged in a way it becomes lower triangular, with each diagonal element being 1. This implies the identity |(I−B)−1 ||V||(I−B)−T | = |V|, with the resulting log-likelihood being a quadratic function of the non-zero elements of B. Since the prior for coefficient bi j is Gaussian, its posterior given the data and all other parameters will be ′ ′ ′ the Gaussian N(cbj /sbj , 1/sbj ) where i i i ′ sbj i ′ cbj i ≡ ≡ n 1 (d) + (V−1 )ii ∑ (Y j )2 , b si j d=1 m n cbj i (d) (d) (d) + ∑ Y j ∑ (V−1 )it Yt − ∑ bt pt Ypt − αt . b si j d=1 t=1 pt ,(t,pt )=(i, j) (16) As before, pt runs over the indices of the parents of Yt in G . Notice that in the innermost (d) summation we exclude bi jY j . We can then sample bi j accordingly. It is important to notice that, in practice, better mixing behavior can be obtained by sampling the coefficients (and intercepts) jointly. The joint distribution is Gaussian and can be obtained in a way similar to the above derivation. The derivation of the componentwise conditionals is nevertheless useful in the algorithm for cyclic networks. 4.2 Sampling from the Posterior: Cyclic Case Cyclic directed graph models have an interpretation in terms of causal systems in equilibrium. The simultaneous presence of directed paths Yi → · · · → Y j and Y j → · · · → Yi can be used to parameterize instantaneous causal effects in a feedback loop (Spirtes, 1995). 
This model appears also in the structural equation modeling literature (Bollen, 1989). In terms of cyclic graphs as families of conditional independence constraints, methods for reading off constraints in linear systems also exist (Spirtes et al., 2000). The computational difficulty in the cyclic case is that the determinant |I − B| is no longer a constant, but a multilinear function of coefficients {bi j }. Because bi j will appear outside the exponential term, its posterior will no longer be Gaussian. From the definition of the implied covariance matrix Σ(Θ), it follows that |Σ(Θ)|−n/2 = (|I − B||V|−1 |I − B|)n/2 . As a function of coefficient bi j , |I − B| = (−1)i+ j+1Ci j bi j + k=m ∑ k=1,k= j 1203 (−1)i+k+1Cik bik , S ILVA AND G HAHRAMANI where Ci j is the determinant of respective co-factor of I − B, bik ≡ 0 if there is no edge Yi ← Yk , and bii ≡ −1. The resulting density function of bi j given D and Θ\{bi j } is ′ p(bi j |Θ\{bi j }, D ) ∝ |bi j − κi j | exp − n ′ (bi j − cbj /sbj )2 i i ′ 2sbj i , where κi j ≡ Ci−1 j k=m ∑ (−1)k− j+1Cik bik k=1,k= j ′ ′ and {cbj , sbj } are defined as in Equation (16). Standard algorithms such as Metropolis-Hastings can i i be applied to sample from this posterior within a Gibbs procedure. 4.3 Marginal Likelihood: A Variational Monte Carlo Approach While model selection of bi-directed graphs can be performed using a simple Monte Carlo procedure as seen in the previous Section, the same is not true in the full Gaussian DMG case. Approaches such as nested sampling (Skilling, 2006) can in principle be adapted to deal with the full case. For problems where there are many possible candidates to be evaluated, such a computationally demanding sampling procedure might be undesirable (at least for an initial ranking of graphical structures). As an alternative, we describe an approximation procedure for the marginal likelihood p(D |G ) by combining variational bounds (Jordan et al., 1998) with the G -Inverse Wishart samplers, and therefore avoiding a Markov chain over the joint model of coefficients and error covariances. This is described for acyclic DMGs only. We adopt the following approximation in our variational approach, accounting also for possible latent variables X: n p(V, B, α, X|D ) ≈ q(V)q(B, α) ∏ q(X(d) ) ≡ q(V)q(B, α)q(X) d=1 with q(B, α) being a multivariate Gaussian density of the non-zero elements of B and α. Function q(X(d) ) is also a Gaussian density, and function q(V) is a G -Inverse Wishart density. From Jensen’s inequality, we obtain the following lower-bound (Beal, 2003, p. 47): ln p(D |G ) = ln p(Y, X|V, B, α)p(V, B, α) dX dB dV dα ≥ ln p(Y, X|V, B, α) q(V)q(B,α)q(X) + ln p(V)/q(V) q(V) + ln p(B, α)/q(B, α) q(B,α) − ln q(X) q(X) R (17) where this lower bound can be optimized with respect to functions q(V), q(B), q(X). This can be done by iterative coordinate ascent, maximizing the bound with respect to a single q(·) function at a time. The update of q(V) is given by qnew (V) = pG -IW (δ + d, U + (I − B)S(I − B)T 1204 q(X)q(B,α) ) BAYESIAN L EARNING WITH M IXED G RAPH M ODELS where pG -IW (·) is the density function for a G -Inverse Wishart, and S is the empirical second moment matrix summed over the completed data set (X, Y) (hence the expectation over q(X)) centered at µ(Θ). The updates for q(B, α) and q(X) are tedious but straightforward derivations, and described in Appendix B. The relevant fact about these updates is that they are functions of V−1 q(V) . 
Fortunately, we pay a relatively small cost to obtain these inverses using the Monte Carlo sampler of Figure 4: from the Bartlett parameters, define a lower triangular m × m matrix B (by placing on the ith line the row vector Bi , followed by zeroes) and a diagonal matrix Γ from the respective vector of γi ’s. The matrix V−1 can be computed from (I − B )T Γ−1 (I − B ), and the relevant expectation computed according to the importance sampling procedure. For problems of moderate dimensionality,7 the importance sampler might not be recommended, but the Gibbs sampler can be used. At the last iteration of the variational maximization, the (importance or posterior) samples from q(V) can then be used to compute the required averages in (17), obtaining a bound on the marginal log-likelihood of the model. Notice that the expectation ln p(V)/q(V) q(V) contains the entropy of q(V), which will require the computation of G -inverse Wishart normalizing constants. For large problems, the cost of this approximation might still be prohibitive. An option is to partially parameterize V in terms of ancillary latents and another submatrix distributed as a G -inverse Wishart, but details on how to best do this partition are left as future work (this approximation will be worse but less computationally expensive if ancillary latents are independent of the coefficient parameters in the variational density function q(·)). Laplace approximations might be an alternative, which have been successfully applied to undirected non-decomposable models (Roverato, 2002). We emphasize that the results present in this section are alternatives that did not exist before in previous approaches for learning mixed graph structures through variational methods (e.g., Silva and Scheines, 2006). It is true that the variational approximation for marginal likelihoods will tend to underfit the data, that is, generate models simpler than the true model in simulations. Despite the bias introduced by the method, this is less of a problem for large data sets (Beal and Ghahramani, 2006) and the method has been shown to be useful in model selection applications (Silva and Scheines, 2006), being consistently better than standard scores such as BIC when hidden variables are present (Beal and Ghahramani, 2006). An application in prediction using the variational posterior instead of MCMC samples is discussed by Silva and Ghahramani (2006). It is relevant to explore other approaches for marginal likelihood evaluation of DMG models using alternative methods such as annealed importance sampling (Neal, 2001) and nested sampling (Skilling, 2006), but it is unrealistic to expect that such methods can be used to evaluate a large number of candidate models. A pre-selection by approximations such as variational methods might be essential. 5. Discrete Models: The Probit Case Constructing a discrete mixed graph parameterization is not as easy as in the Gaussian case. Advances in this area are described by Drton and Richardson (2008a), where a complete parameterization of binary bi-directed graph models is given. In our Bayesian context, inference with the mixed graph discrete models of Drton and Richardson would not to be any computationally easier than the case for Markov random fields, which has been labeled as doubly-intractable (Murray et al., 2006). 7. We observed a high ratio of the highest importance weight divided by the median weight in problems with dimensionality as low as 15 nodes. 
However, notice that in practice the error covariance matrix V has a block diagonal structure, and only the size of the largest block is relevant. This is explained in more detail in Section 6. 1205 S ILVA AND G HAHRAMANI Instead, in this paper we will focus on a class of discrete models that has been widely used in practice: the probit model (Bartholomew and Knott, 1999). This model is essentially a projection of a Gaussian distribution into a discrete space. It also allows us to build on the machinery developed in the previous sections. We will describe the parameterization of the model for acyclic DMGs, and then proceed to describe algorithms for sampling from the posterior distribution. 5.1 Parameterizing Models of Observable Independencies A probit model for the conditional probability of discrete variable Yi given a set of variables {Yi1 , ..., Yik } can be described by the two following relationships: P (Yi = vi l Yi⋆ = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi | Yi⋆ ) = 1(τi ≤ Yi⋆ < τi ) l−1 l (18) where P (·) is the probability mass function of a given random variable, as given by the context, and 1(·) is the indicator function. Yi assumes values in {vi , vi , . . . , vi }. Thresholds {τi = −∞ < τi < 1 1 0 1 κ(i) τi < · · · < τi = ∞} are used to define the mapping from continuous Yi⋆ to discrete Yi . This model 2 κ(i) has a sensible interpretation for ordinal and binary values as the discretization of some underlying latent variable (UV) Yi⋆ . Such a UV is a conditionally Gaussian random variable, which follows by assuming normality of the error term εi . This formulation, however, is not appropriate for general discrete variables, which are out of the scope of this paper. Albert and Chib (1993) describe alternative Bayesian treatments of discrete distributions not discussed here. Given this binary/ordinal regression formulation, the natural step is how to define a graphical model accordingly. As a matter of fact, the common practice does not strictly follow the probit regression model. Consider the following example: for a given graph G , a respective graphical representation of a probit model can be built by first replicating G as a graph G ⋆ , where each vertex Yi is relabeled as Yi⋆ . Those vertices represent continuous underlying latent variables (UVs). To each vertex Yi⋆ in G ⋆ , we then add a single child Yi . We call this the Type-I UV model. Although there are arguments for this approach (see, for instance, the arguments by Webb and Forster (2006) concerning stability to ordinal encoding), this is a violation of the original modeling assumption as embodied by G : if the given graph is a statement of conditional independence constraints, it is expected that such independencies will be present in the actual model. The Type-I formulation does not fulfill this basic premise: by construction there are no conditional independence constraints among the set of variables Y (the marginal independencies are preserved, though). This is illustrated by Figure 6(b), where the conditional independence of Y1 and Y3 given Y2 disappears. An alternative is illustrated in Figure 6(c). Starting from the original graph G (as in Figure 6(a)), the probit graph model G ⋆ shown in the Figure is built from G by the following algorithm: 1. add to empty graph G ⋆ the vertices Y of G , and for each Yi ∈ Y, add a respective UV Yi⋆ and the edge Yi⋆ → Yi ; 2. for each Yi → Y j in G , add edge Yi → Y j⋆ to G ⋆ ; 3. 
for each Yi ↔ Y j in G , add edge Yi⋆ ↔ Y j⋆ to G ⋆ ; We call this the Type-II UV model, which has the following property (the proof is in Appendix C): 1206 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y* 1 Y1 Y2 (a) Y2 Y* 2 Y3 Y* 4 Y4 Y1 Y3 Y* 1 Y* 3 (b) Y2 Y* 2 Y* 4 Y4 Y1 Y3 Y* 3 Y4 (c) Figure 6: The model in (a) has at least two main representations as a probit network. In (b), the original structure is given to the underlying variables, with observed variables being children of their respective latents. In (c), the underlying variable inherits the parents of the original variable and the underlying latents of the spouses. Theorem 4 Suppose G is acyclic with vertex set Y. Yi and Y j are m-separated given Z ⊆ Y\{Yi ,Y j } in G if and only if Yi and Y j are m-separated given Z in G ⋆ . The parameterization of the Type-II UV model follows from the definition of probit regression: the conditional distribution Yi given its parents in {Yi1 , ...,Yik } in G is given as in Equation (18), while the error terms {ε1 , ε2 , . . . , εm } follow the multivariate Gaussian N(0, V). The entry corresponding to the covariance of εi and ε j is assumed to be zero if there is no bi-directed edge Yi ↔ Y j in G . In what follows, we discuss algorithms for Type-II models. The approach here described can be easily adapted to cover Type-I models. We say that Type-II models are models of observable independencies, since independencies hold even after marginalizing all UVs. 5.2 Algorithm As before, we provide a Gibbs sampling scheme to sample parameters Θ = {α, B, V, T } from the posterior distribution given data set D = {Y(1) , Y(2) , . . . , Y(n) }. The set T = {Ti } is the set of threshold parameters, Ti = {τi = −∞ < τi < τi < · · · < τi = ∞} for each random variable Yi with 2 1 0 κ(i) κ(i) different values. We will not discuss priors and algorithms for sampling T given the other parameters: this can be done by standard approaches (e.g., Albert and Chib, 1993).8 For the purposes of the Gibbs procedure, we augment the data set with the underlying variables ⋆ = {Y⋆(1) , Y⋆(2) , . . . , Y⋆(n) } at each sampling step. D From the set of structural equations Y⋆(d) = α + BY(d) + ε it follows that the conditional distribution of Y⋆(d) given the D ∪ Θ is a truncated Gaussian with mean α + BY(d) and covariance matrix V. The truncation levels are given by the thresholds and (d) ⋆(d) observed data Y(d) : for each Yi = vi , the range for Yi becomes [τi , τi ). Sampling from a l l−1 l truncated Gaussian is a standard procedure. We used the algorithm of Kotecha and Djuric (1999) in our implementation. To sample V from its conditional, we will rely on the following result. 8. In Section 7, we perform experiments with binary data only. In this case, the thresholds are set to fixed values: {τi = −∞, τi = 0, τi = ∞} for all 0 ≤ i ≤ m. 0 1 2 1207 S ILVA AND G HAHRAMANI Proposition 5 Let G be an acyclic DMG, and (α, B, V, T ) be the respective set of parameters that defines the probit model. For a fixed (α, B, T ), there is a bijective function fBαT (·) mapping Y⋆ to ε. This is not true in general if G is cyclic. Proof: If the graph is acyclic, this follows directly by recursively solving the model equations, starting from those corresponding to Y j⋆ vertices with no parents. This results in ε = Y⋆ − α − BY, as expected. For cyclic graphs, the following model provides a counter-example. Let the graph be Y1⋆ → Y1 → Y2⋆ → Y2 → Y1⋆ . Let the model be Y1⋆ = Y2 + ε1 ,Y2⋆ = Y1 + ε2 , that is, b12 = b21 = 1 and α = 0. 
Let the variables be binary, with a threshold at zero (Yi = 1 if and only if Yi⋆ ≥ 0). Then the two instantiations (Y1⋆ = −0.8,Y2⋆ = 0), (Y1⋆ = 0.2,Y2⋆ = 1) imply the same pair (ε1 = −0.8, ε2 = 0). The negative result for discrete models with cycles is the reason why such models are out of the scope of the paper. ⋆ Let Dε = {ε(1) , . . . , ε(n) }, where ε(d) = fBαT (y(d)⋆ ). Due to this bijection (and the determinism mapping Y⋆ to Y), the density p(V | Θ\V, D , D ⋆ ) = p(V | Θ\V, D ⋆ ) = p(V | Θ\V, y(1)⋆ , . . . , y(d)⋆ ) is equivalent to p(V | Θ\V, D ⋆ ) = = ∝ ∝ ⋆ p(V | α, B, T , D ⋆ , Dε ) ⋆) p(V | α, B, T , Dε ⋆ p(V | α, B, T )p(Dε | α, B, T , V) n (d) | V). p(V) ∏d=1 p(ε For the given data set D ∪ D ⋆ , define S⋆ as the sum of (Y⋆(d) − α − BY(d) )(Y⋆(d) − α − BY(d) )T over all d ∈ {1, 2, . . . , n}. Since p(ε | V) is normal with zero mean and covariance matrix V, the posterior for V given all other parameters and variables is V | {Θ\V, D , D ⋆ } ∼ G -IW (δ + n, U + S⋆ ). Sampling B and α is analogous to the Gaussian case, except that we have to consider that the left-hand side of the structural equations now refer to Y⋆ . We give the explicit conditional for αi , with the conditional for bi j being similarly adapted from Section 4. The posterior for αi is given by a normal N((s′ )−1 m′ , s′ ) where i i i sα i ′ cα i ′ = = 1 + n(V−1 )ii , sα i m n m cα (d) ⋆(d) i − n ∑ (V−1 )it αt + ∑ ∑ (V−1 )it Yt . − ∑ bt pt Ypt sα pt i d=1 t=1 t=1,t=i 5.3 A Note on Identifiability The scale of the underlying latent variables in the probit model is arbitrary. As such, it has been often suggested that such latents should have constant (e.g., unity) variance (Pitt et al., 2006). There are two usual arguments for fixing the variance: improving the interpretability of the model, and improving the mixing of the Markov chain. The interpretability argument is not particularly appealing within the Bayesian setting with proper priors, such as the one proposed in this paper: the posterior distribution of the parameters is well-defined by the prior uncertainty and the data. 1208 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The goal of improving the mixing of the chain might be important: if some parameters can assume arbitrary values and still allow for the same model over the observables, then fixing such parameters may help sampling by eliminating largely flat regions from the posterior (which will happen for large data sets and broad priors). In practice, however, scaling UVs might not be advantageous. In some cases it might increase the computational cost of each sampling step, while sampling from the non-scaled model might work just fine. Many MCMC algorithms work well on highly unidentifiable models such as multilayer perceptrons (Neal, 1996). In our experiments, we do not use any scaling. 5.4 Remarks It is clear that the given approach can be generalized to other generalized linear models by changing the link function that maps underlying latent variables (UVs) to observables. For instance, a model containing discrete and continuous variables can be constructed by using the identity link function instead of probit for the continuous variables. Notice that the continuous variables will not necessarily be marginally Gaussian if some of its parents are discrete. Other link functions will have different parameters besides thresholds, such as in multivalued (“polychotomous”) discrete distributions. A Bayesian account of Gaussian copula models is given by Pitt et al. 
(2006), to which a DMG-based family could in principle be defined. For continuous, marginally non-Gaussian, variables joined by a Gaussian copula, it is possible that all link functions are invertible. In this case, it is easier in principle to define cyclic models through Type-I UV models (e.g., Figure 6(b)) while preserving the observable independencies. It is important to point out that Type-II probit models with Markov equivalent graphs will not, in general, be likelihood equivalent. A simple example is given by the two-node graphs Y1 → Y2 and Y1 ↔ Y2 : if Y1 is binary, then the marginal for Y2 in the first case is equivalent to having an underlying latent variable that follows a mixture of two Gaussians. While some of these issues can be solved by adopting a mixture of Gaussians marginal independence model to account for bi-directed edges (Silva and Ghahramani, 2009), details need to be worked out. When the goal of model selection is to find causal structures (Spirtes et al., 2000), the usual guarantees of search methods based on Markov equivalence classes do not hold. However, it remains to be seen whether the parametric constraints implied by the Type-II formulation will allow for other consistent approaches for causal discovery, as shown in the case of non-linearities with additive noise (Hoyer et al., 2008). 6. Scaling Up: Factorizations and Perfect Sequences Each Monte Carlo sampling step for the given mixed graph models is theoretically tractable, but not necessarily practical when the dimensionality m of the data is high. By using clever factorizations of the graph and ordering of the variables, it is possible to sometimes scale to high-dimensional problems. In this section, we describe approaches to minimize the run-time of the marginal likelihood computation for bi-directed graphs, which is also important for computing variational bounds for DMG models. We start, however, with a discussion on factorizations of the posterior density for coefficient parameters B. The context is the Gibbs sampler for acyclic models. 1209 S ILVA AND G HAHRAMANI Y2 Y1 Y3 b31 Y3 Y1 Y2 Y3 Y4 Y5 Y1 b32 Y4 b54 Y5 (a) b21 Y4 b32 b43 Y6 Y2 b53 (b) (c) Figure 7: The coefficients b31 and b32 , represented as nodes in (a), become dependent after conditioning on Y. However, they are still independent of b54 . This a general property of DAG models. In DMG models, a sequence of bi-directed edges will connect extra coefficients. In graph (b), coefficients b21 , b32 and b43 will all be dependent given Y. Coefficients into nodes in different districts will still be independent. The graph in (c) has districts {Y1 ,Y2 ,Y3 ,Y4 } and {Y5 ,Y6 }. 6.1 Factorizations Our prior for coefficients {bi j } is fully factorized. In directed acyclic graphs, this is particularly advantageous: coefficients corresponding to edges into different nodes are independent in the posterior.9 One can then jointly sample a whole set of {bi j } coefficients with same i index, with no concern for the other coefficients. Figure 7(a) illustrates this factorization. This means that, in Equation (16), the summation over t does not go over all variables, but only for t = i. This also follows from the fact that (V)−1 = 0 unless i = t, since V is diagonal. it In ADMGs, however, this is not true anymore. For any pair of vertices linked by a path of bi-directed edges, for example, Yi ↔ Yi+1 ↔ · · · ↔ Yt , one will have in general that (V)−1 = 0. 
This it can be shown by using the graphical properties of the model when conditioning on some arbitrary datapoint Y: Proposition 6 Let G be an acyclic DMG with vertex set Y, and G ′ the DMG obtained by augmenting G with a vertex for each parameter bi j and a respective edge bi j → Yi . Then if there is a bi-directed path Yi ↔ · · · ↔ Yt in G , {bi j , btv } are not m-separated given Y in G ′ . Proof: The joint model for {Y, B} with independent priors on the non-zero entries of B is Markov with respect to G ′ . The sequence of bi-directed edges between Yi and Yt implies a path between bi j and btv where every vertex but the endpoints is a collider in this path. Since every collider is in Y, this path is active. This Proposition is illustrated by Figure 7(b). The practical implication is as follows: mconnection means that there is no further graphical property that would entail (V)−1 = 0 (i.e., only it particular cancellations on the expression of the inverse, unlikely to happen in practice, would happen to generate such zeroes). 9. Sampling in Gaussian DAG models is still necessary if the model includes latent variables (Dunson et al., 2005). 1210 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Consider the maximal sets of vertices in an ADMG such that each pair of elements in this set is connected by a path of bi-directed edges. Following Richardson (2003), we call this a district.10 It follows that is not possible in general to factorize the posterior of B beyond the set of districts of G . Figure 7(c) illustrates a factorization. Fortunately, for many DMG models with both directed and bi-directed edges found in practical applications (e.g., Bollen, 1989), the maximum district size tends to be considerably smaller than the dimensionality of the problem. 6.2 Perfect Sequences It is still important to speed up marginal likelihood (or variational bound) computations for models with districts of moderate size, particularly if many models are to be evaluated. Without loss of generality, assume our graph G is a bi-directed graph with a single district, since the problem can be trivially separated into the disjoint bi-directed components. We will consider the case where the bi-directed graph is sparse: otherwise there is little to be gained by exploring the graphical structure. In that case, we will assume that the largest number of spouses of any node in G is bounded by a constant κ that is independent of the total number of nodes, m. The goal is to derive algorithms that are of lower complexity in m than the original algorithms. The bottleneck of our procedure is the computation of the Σ−1 ≺ (i),nsp≺ (i) matrices, required nsp in the mapping between independent and dependent Bartlett parameters (Equation 7), as well as computing the determinants |Σnsp≺ (i),nsp≺ (i) |. Since in sparse districts nsp≺ (i) grows linearly with m, the cost of a na¨ve algorithm for a single sampling step is O(m3 ) per node. Iterating over all nodes ı implies a cost of O(m4 ) for a Monte Carlo sweep. Therefore, our goal is to find a procedure by which such mappings can be computed in less than O(m3 ) time. The general framework is reusing previous inverses and determinants instead of performing full matrix inversion and determinant calculation for each Yi . The difficulty on applying low-rank updates when we traverse the covariance matrix according to ≺ is that the sets of non-spouses nsp≺ (i) and nsp≺ (i + 1) might differ arbitrarily. 
We want sensible orderings where such sets vary slowly and allow for efficient low-rank updates, if any. The foundation of many scaling-up procedures for graphical models is the graph decomposition by clique separators (Tarjan, 1985), usually defined for undirected graphs. The definition for bi-directed graphs is analogous. Such a decomposition identifies overlapping prime subgraphs {GP(1) , GP(2) , . . . , GP(k) } of the original graph G . A prime graph is a graph that cannot be partitioned into a triple (Y′ , S, Y′′ ) of non-empty sets such that S is a complete separator (i.e., S is a clique and removing S disconnects the graph). Notice that a clique is also a prime subgraph. The prime components of a graph can be ordered in a perfect sequence {YP(1) , . . . , YP(k) } of subsets of Y (Roverato, 2002; Lauritzen, 1996). Define H j ≡ YP(1) ∪ · · · ∪ YP( j) as the history of the perfect sequence up to the j-th subgraph. Let R j ≡ YP( j) \H j−1 be the residual of this history (with R1 ≡ YP(1) ), and S j ≡ H j−1 ∩ YP( j) the separator. In a perfect sequence, the triple (H j−1 \S j , S j , R j ) forms a decomposition of the subgraph of G induced by the vertex set H j . Surprisingly, although bi-directed and undirected graph models have very different Markov properties (in undirected models, conditioning removes dependencies; in bi-directed models, it adds dependencies), perfect prime graph sequences prove to be also useful, but in an entirely different 10. Kang and Tian (2005) call such structures c-components and reserve the word “district” to refer to the function mapping a vertex to its respective c-component, as originally introduced by Richardson (2003). We choose to overload the word and call “district” both the structure and the mapping. 1211 S ILVA AND G HAHRAMANI Y7 Y3 Y1 Y5 Y2 Y6 V1 = {Y1 ,Y2 ,Y3 } V2 = {Y4 ,Y5 } V3 = {Y6 } Y4 Figure 8: On the left, we have a bi-directed graph of 7 vertices arranged and ordered such that nodes are numbered by a depth-first numbering starting from “root” Y7 , with {Y1 ,Y2 ,Y4 ,Y6 } 3 being leaves. Vertices {Y1 ,Y2 , . . . ,Y6 } can be partitioned as the union ∪t=1 Vt , as illustrated on the right. way. The next subsection describes the use of prime graph decompositions in a particularly interesting class of bi-directed graphs: the decomposable case. The general case is treated in the sequel. 6.2.1 D ECOMPOSABLE M ODELS In a recursively decomposable graph, all prime subgraphs are cliques. We will assume that any perfect sequence in this case contains all and only the (maximal) cliques of the graph. The resulting decomposition can be interpreted as a hypergraph where nodes are the maximal cliques of the original graph, and edges correspond to the separators. In the statistics literature, a decomposable model is defined as a model that is Markov with respect to a recursively decomposable undirected graph (Lauritzen, 1996). Its widespread presence on applications of Markov random fields is due to nice computational properties, with tree-structured distributions being a particular case. Our definition of bi-directed decomposable models is analogous: a model Markov with respect to a recursively decomposable bi-directed graph. Given the residual sequence {R1 , R2 , . . . , Rk } obtained through a perfect sequence of maximal cliques of G , we define a perfect ordering ≺ by numbering nodes in Rt before nodes in R1 , . . . , Rt−1 , 1 < t ≤ k and ordering nodes according to this numbering.11 Any ordering that satisfies this restriction is a perfect ordering. 
Such an ordering has the following property. Theorem 7 Let G be a recursively decomposable bi-directed graph such that the indexing of its vertices Y = {Y1 ,Y2 , . . . ,Ym } follows a perfect ordering ≺. Then for each 1 < i ≤ m, the set K(i) {Y1 ,Y2 , . . . ,Yi−1 } can be partitioned as ∪t=1 Vt such that: 1. each Vt induces a connected subgraph of G , and for each Yt ∈ Vt and Yt ′ ∈ Vt ′ , t = t ′ , Yt is not adjacent to Yt ′ in G ; 11. Lauritzen (1996) describes other uses of perfect sequences in undirected graphs. Notice that the notion of perfect numbering described by Lauritzen (1996) is not equivalent to our notion of perfect ordering, which is always derived from a perfect sequence of prime graphs. 1212 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 2. for each {Yp ,Yq } ⊆ Vt , if Yp is a spouse of Yi , and Yq is a non-spouse of Yi , then p > q; The proof is in Appendix C. This result is easier to visualize in trees. One can take as a perfect ordering some depth-first ordering for a given choice of root. Then for each vertex Yi , the set {Y1 ,Y2 , . . . ,Yi−1 } is partitioned according to the different branches “rooted” at Yi . The starting point of each branch is a spouse of Yi , and all other vertices are non-spouses of Yi . The ordering result then follows directly from the definition of depth-first traversal, as illustrated in Figure 8. Let Σ be the covariance matrix of a bi-directed decomposable model with graph G , where Σ follows a G -inverse Wishart distribution. Let ≺ be a perfect ordering for G . By the construction of Bartlett’s decomposition, mapping between parameters is given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) , nsp the computational bottleneck being the inversion. Notice this corresponds to the multiple regression coefficients of sp≺ (i) on nsp≺ (i). But according to Theorem 7, using a perfect ordering implies that within each Vs for a fixed Yi , all preceding non-spouses of Yi are ordered before the preceding spouses. Elements {Yp ,Yq } in different Vs are marginally independent given {Y1 , . . . ,Yi−1 }\{Yp ,Yq }. This implies that the regression coefficient of spouse Yp on non-spouse Yq will be zero if Yp and Yq are on different components Vs , and will be identical to the previously computed B p,q if they are in the same component. Splitting the set {Y1 ,Y2 , . . .Yi−1 } into preceding spouses Ysp≺ (i) and nonspouses Ynsp≺ (i) , we have Ysp≺ (i) = Bsp≺ (i),sp≺ (i) Ysp≺ (i) + Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ⇒ Ysp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 (Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ) where each ε j is an independent Gaussian with variance γ j , and each element (p, q) in Bsp≺ (i),nsp≺ (i) corresponds to the known (i.e., previously computed) regression coefficient of the spouse Yp on the non-spouse Yq . Matrix Bsp≺ (i),sp≺ (i) is defined analogously. Hence, the regression coefficients of Ysp≺ (i) on Ynsp≺ (i) are given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 Bsp≺ (i),nsp≺ (i) . nsp (19) No inversion of Σnsp≺ (i),nsp≺ (i) is ever necessary. Moreover, the determinant |Σnsp≺ (i),nsp≺ (i) | is given by ∏{q s.t. Yq ∈nsp≺ (i)} γq , since all non-spouses precede the spouses (which means their marginal covariance matrix is given by the previously computed Bartlett parameters). 
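A minimal numpy/scipy sketch of Equation (19) follows (our own illustration; `B_prev`, `sp_idx` and `nsp_idx` are hypothetical names for the previously computed coefficient matrix and the sorted index sets of preceding spouses and non-spouses of Yi):

```python
import numpy as np
from scipy.linalg import solve_triangular

def spouse_on_nonspouse_coefficients(B_prev, sp_idx, nsp_idx):
    """Recover Sigma_{sp,nsp} Sigma_{nsp,nsp}^{-1} for node i from the
    already-computed Bartlett coefficients (rows of B_prev), without inverting
    Sigma_{nsp,nsp}. Under a perfect ordering, B_{sp,sp} is strictly lower
    triangular (each row only has coefficients on earlier columns), so
    I - B_{sp,sp} can be inverted by a triangular solve."""
    B_sp_sp = B_prev[np.ix_(sp_idx, sp_idx)]
    B_sp_nsp = B_prev[np.ix_(sp_idx, nsp_idx)]
    I = np.eye(len(sp_idx))
    # Equation (19): (I - B_{sp,sp})^{-1} B_{sp,nsp}
    return solve_triangular(I - B_sp_sp, B_sp_nsp, lower=True)
```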
Hence, calculating Bi,nsp≺ (i) for all 1 ≤ i ≤ m according to a perfect ordering has as a bottleneck the inversion (of a triangular matrix) and multiplication in Equation (19), with a cost of O(κ2 +mκ2 ), κ being the maximum number of spouses for any given node. The cost of the remaining operations for the i-th stage in the importance sampler is O(κ3 ). As a function of m, the cost of the parameter sampling step falls from O(m3 ) to O(m). The cost of computing the weights is dominated by the computation of Ki from Equation (13), which is O(κ3 + κm2 ) = O(m2 ). Figure 9 illustrates the derivation of the new ordering in a tree-structured model. 6.2.2 N ON - DECOMPOSABLE M ODELS In a non-decomposable model, some prime graphs YP(t) will no longer be cliques. In what follows, we once again assume that ≺ is a perfect ordering. Unlike in the decomposable case, the product 1213 S ILVA AND G HAHRAMANI YA YC YD YA YC YC YB YC YC YC YD YB (a) (b) Figure 9: The tree-structured (i.e., cycle-free) bi-directed graph in (a) has as maximal cliques the adjacent pairs. Such cliques can be ordered in a perfect sequence as shown in (b), where rectangles indicate the separators. Notice that R1 = {YA ,YC }, R2 = {YB }, R3 = {YD }. One possible perfect ordering is {YD ,YB ,YC ,YA }. Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) does not simplify in general. Instead we will focus only on fast methods nsp to compute Σ−1 ≺ (i),nsp≺ (i) . nsp As we shall see, the function of the perfect sequence is now to provide a sensible choice of which inverse submatrices {Σ−1 }, W ⊆ Y, to cache and reuse when computing Σ−1 ≺ (i),nsp≺ (i) . W,W nsp The same can be done to compute determinants |Σnsp≺ (i),nsp≺ (i) |. A simple way of reusing the results from the previous section is by triangulating the nondecomposable graph G , transforming it into a decomposable one, G ′ , which is then used to generate the perfect sequence. We need to distinguish between the “true” spouses of a node Yi in G and the artificial spouses in G ′ that result from the extra edges added. Let nsp≺G ′ (i) be the non-spouses of Yi in G ′ that precede it according to ≺: by construction, these are also non-spouses of Yi in G . Let sp∆≺G ′ (i) be the spouses of Yi in G ′ that are not spouses of Yi in G . That is, the set of preceding non-spouses of Yi in G is given by nsp≺ (i) = nsp≺G ′ (i) ∪ sp∆≺G ′ (i). Recall that the inverse of a partitioned matrix can be given by the following identity: A B C D −1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 −A−1 B(D − CA−1 B)−1 −(D − CA−1 B)−1 CA−1 (D − CA−1 B)−1 . (20) In order to compute Σ−1 ≺ (i),nsp≺ (i) , we consider its partitioned version nsp Σ−1 ≺ (i),nsp≺ (i) nsp = Σnsp≺G ′ (i),nsp≺G ′ (i) Σnsp≺G ′ (i),sp∆≺G ′ (i) Σsp∆≺G ′ (i),nsp≺G ′ (i) Σsp∆≺G ′ (i),sp∆≺G ′ (i) −1 . (21) Let κnsp be the maximum number of non-spouses among all Yi within any prime subgraph induced by YP(t) . By using relation (20), where we assume for now that we know A−1 ≡ Σ−1 ′ (i),nsp ′ (i) , the cost of computing (21) is O(m2 κnsp ) + O(κ3 ) = O(m2 κnsp ) (the cost of comnsp nsp ≺G ≺G puting D − CA−1 B is O(m2 κnsp ) + O(κ2 ) = O(m2 κnsp ), while the cost of inverting it is O(κ3 )). nsp nsp Treating κnsp as a constant, this reduces the complexity of sampling the i-th row of Σ from O(m3 ) to O(m2 ). A similar procedure applies to the computation of the determinant |Σnsp≺ (i),nsp≺ (i) |, using in this case the relationship (26). 
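For concreteness, identity (20) with a cached A−1 amounts to the following update (a sketch of the standard block-inverse computation, not the authors' implementation; the cost is dominated by the small Schur complement when D is small):

```python
import numpy as np

def partitioned_inverse(A_inv, B, C, D):
    """Identity (20): given A^{-1} cached from a previous step, compute the
    inverse of the block matrix [[A, B], [C, D]] via the Schur complement
    S = D - C A^{-1} B, avoiding a full inversion of the large block A."""
    S = D - C @ A_inv @ B
    S_inv = np.linalg.inv(S)
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bottom_left = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bottom_left, S_inv]])
```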
1214 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The advantage of using the perfect sequence is to allow for the computation of all A−1 ≡ at a total cost, across all nodes, of O(m3 ): each set nsp≺G ′ (i) is guaranteed to Σ−1 ′ (i),nsp ′ (i) nsp≺G ≺G be equal to {Y1 ,Y2 , . . . ,Ylns } where Ylns is the last non-spouse of Yi in G ′ that antecedes Yi . This follows from the result in the previous section, since all non-spouses of a node in a decomposable graph precede its spouses. Therefore, if we store the inverse covariance matrices for {Y1 ,Y2 , . . . ,Yi }, 1 ≤ i ≤ m, we can obtain the required matrices A−1 . This requires the storage of O(m) matrices, and each matrix can be obtained by the previous one by a low-rank update (20) with a O(m2 ) cost. Arbitrary orderings do not guarantee such an incremental pattern and, hence, no efficient lowrank updates. Notice that depending on the problem, many of such inverse matrices can be dynamically removed from memory if they are not used by any node placed after a particular position. 6.3 Remarks In Gaussian undirected models, the problem of covariance matrix sampling can also be reduced to sampling within each prime graph at the cost of O(|P |4 ), |P | being the size of the largest prime component (Atay-Kayis and Massam, 2005). Since both κ and κnsp are O(|P |), our procedure costs O(m2 |P |2 + |P |4 ) per prime graph, plus a cost of O(m2 ) per node to compute the importance weights. Considering a number of m/|P | prime graphs and |P | < m, the total cost is O(m3 |P |), down from O(m4 ). For undirected models, the corresponding cost by sampling step using the perfect ordering decomposition is O(m|P |3 ). The higher-order dependency on m in bi-directed models is to be expected, since the Markov blanket of any node Yt in a connected bi-directed graph is V\{Yt }. It is clear that inference with a given bi-directed graph model will never scale at the same rate of a undirected model with the same adjacencies, but this does not justify adopting an undirected representation if it is ill-suited to the problem at hand. One has also to consider that in problems with directed and bi-directed edges, the actual maximum district size might be much smaller than the number of variables. For large problems, however, further approximation schemes will be necessary. Drton and Richardson (2008b) describe some reduction techniques for transforming bi-directed edges into directed edges such that the resulting Gaussian model remains the same. As future work, such methods could be adapted to the G -inverse Wishart sampling procedures and combined with the ordering techniques developed here into a single framework. It will also be interesting to develop similar schemes for the Gibbs sampler. 7. Experiments We now evaluate the advantages of the Gaussian and probit models in Bayesian inference on real problems. 7.1 Industrialization and Democratization Study Bollen (1989) describes a structural equation model of political and democratization factors within nations. “Democratization” and “industrialization” levels are abstract notions, but nevertheless of clearly observable impact. They are tied to empirical observations through different sets of indicators. For instance, an indicator of industrialization level is the gross national product. Hence, democratization and industrialization levels are here defined as scalar latent variables never observed directly, while the observed data is composed of indicators. 
In this model, there is a total of three indicators of industrialization, and four indicators of democratization. Democratization is measured in a longitudinal study, where data was collected in two years (1960 and 1965). The indicators of democratization are pooled expert opinions summarized in an ordinal number scaled from 1 to 10. Following Bollen, we will treat the model as multivariate Gaussian, which provides an excellent fit (a p-value greater than 0.3 using a chi-square test) for a sample of 75 countries. The corresponding mixed graph is depicted in Figure 10, along with a description of all indicators. The graph is taken from Bollen (1989). Other hidden common causes affect the democratization indicators over time, but the nature of such hidden variables is irrelevant to the problem at hand: that is, the bi-directed edges are motivated by unmeasured causes of variability in the observed indicators that exist over time. For instance, the records of freedom of press in 1960 (Y4 ) and 1965 (Y8 ) co-vary due to other unmeasured factors not accounted for by democratization factors.

Figure 10: A directed mixed graph representing dependencies between 11 observed political and economic indicators and three latent concepts (shaded nodes): Industrialization 1960, Democratization 1960 and Democratization 1965 (Dunson et al., 2005; Bollen, 1989). The observed indicators are: 1. Gross national product (GNP) 1960 (Y1); 2. Energy consumption per capita 1960 (Y2); 3. Percentage of labor force in industry 1960 (Y3); 4. Freedom of press 1960 (Y4); 5. Freedom of opposition 1960 (Y5); 6. Fairness of elections 1960 (Y6); 7. Elective nature of legislative body 1960 (Y7); 8. Freedom of press 1965 (Y8); 9. Freedom of opposition 1965 (Y9); 10. Fairness of elections 1965 (Y10); 11. Elective nature of legislative body 1965 (Y11).

Figure 11: An embedding of 75 countries in a two-dimensional latent space: democratization level in 1960 and 1965. Boxplots of the Bayesian posterior distribution of the projection in the two dimensions are depicted in the vertical axis. Countries are arranged in the horizontal axis by increasing order of their posterior expected industrialization level. Figure adapted from Dunson et al. (2005).

An example of a Bayesian inference application is shown in Figure 11. Boxplots of the posterior values of Democratization Level 1960 and Democratization Level 1965 are generated. Dunson et al. (2005) use this information to, for instance, find clusters of countries in the latent space. An example of a cluster is the one formed by the bottom 16 countries in the industrialization level ranking: the growing trend of democratization levels after the first 16 countries is interrupted. This type of analysis might provide new insights to a political scientist, for example, by revealing particular characteristics of such a group of nations.

7.1.1 Evaluating the MCMC Algorithm for Different Models

In our analysis, we fix to unity the coefficients corresponding to the edges Industrialization 1960 → Y1 , Democratization 1960 → Y4 and Democratization 1965 → Y8 , since the scale and sign of the latent variables are arbitrary. The intercept terms of the equations for Y1 , Y4 and Y8 are set to zero, since the mean of the latents is also arbitrary. The resulting model is identifiable.
We apply the Gibbs sampling procedure to three different models: the Gaussian DMG model as described in this paper, and two modified DAG models. The first DAG model is the one described by Dunson et al. (2005), where each bi-directed edge is substituted by an “ancillary” latent (as mentioned in Section 2.3). For instance, the pathway corresponding to Y4 ↔ Y8 is substituted by the chain Y4 ← D48 → Y8, where D48 is unobserved. Dunson et al. further assume that all covariances due to such ancillary latents are positive. As such, the coefficients from Dij into {Yi, Yj} are set to unity, with the variance of Dij corresponding to the residual covariance of {Yi, Yj} given their parents. Means of ancillary latents are fixed at zero. However, even for covariance matrices with positive covariances, this parameterization is not complete. This is evident from the fact that the variances of Yi and Yj will both be larger than their covariance, which is not true of covariance matrices in general. For this particular problem, this extra restriction provides no measurable difference in terms of fit. It does serve as a reminder, however, that “intuitive” parameterizations might hide undesirable constraints.

Figure 12: Posterior distribution of parameters associated with the respective edges in the industrialization/democratization domain (panels: Industrialization1960 → Democratization1960, Industrialization1960 → Democratization1965, Democratization1960 → Democratization1965). Smoothed posterior obtained using the output of our Gibbs sampler and the DENSITY function of R 2.6.0.

Figure 13: The first three plots show the initial 5,000 iterations of a run of the Gibbs sampling algorithm for the DMG model for three different parameters associated with edges in the graph (Industrialization1960 → Democratization1960, Democratization1960 → Y5, Y7 ↔ Y11). The last plot depicts the posterior distribution of the error covariance associated with the edge Y7 ↔ Y11 (smoothed with the kernel density estimator from the statistical software R).

The second DAG model is an extension of the DAG model suggested by Dunson et al., the only difference being that the coefficients corresponding to edges Dij → Yi, i < j, are free to vary (instead of being fixed to 1). In general, there are Gaussian DMG models that cannot be parameterized this way (Richardson and Spirtes, 2002). Notice also that because of chains such as Democratization 1960 → Y4 ↔ Y8 ← Democratization 1965, the set of independence constraints in this graph can only be represented by a DAG if we include the ancillary latents Dij. That is, there is no DAG with exactly the same set of independence constraints as the given DMG, unless ancillary latent variables are added.

Figure 14: Comparison of the effective sample size of the MCMC algorithm applied to the three models, DMG, DAG with positive covariances (posDAG) and general DAG, as explained in the main text. The horizontal axis ranges over the independent entries of the observed covariance matrix, 66 in total, with one boxplot per entry. The boxplots are obtained from 80 independent chains initialized randomly, where each chain runs for 50,000 iterations.

Figure 15: Comparison of the effective sample size of the MCMC algorithm applied to the three models (panels: DMG vs. positive covariance DAG, DMG vs. unconstrained DAG). Here we plot the average effective sample sizes over 80 trials of 50,000 samples for each of the 66 entries of the covariance matrix. Points over the line indicate parameters where the DMG approach performed better.

12. A few technical notes: we used the priors suggested in Dunson et al. (2005), except that we changed the confidence in the prior of the covariance of the error terms V to be smaller (in order to minimize the influence of the priors in the models, since in this particular problem the DMG and DAG models are nearly likelihood equivalent but not posterior distribution equivalent: the priors belong to different families). We used 1 degree of freedom in our G-Inverse Wishart, with the matrix parameter being the expected value of Dunson et al.'s prior. For the DAG models, we also used the G-inverse Wishart prior for the error terms, but where all error terms are independent. For the DAG model with a free coefficient per ancillary latent, we assigned a standard Gaussian prior to such coefficients. The chains were initialized randomly by sampling standard Gaussians for the coefficients and latent variables. Error covariance matrices were initialized to diagonal matrices with diagonal entries sampled uniformly in [1, 2]. Coefficient parameters were sampled jointly given the error covariance matrix and latent variables. Latent variables were also sampled jointly, given the parameters.

We study the behavior of the MCMC algorithm for these three models (see footnote 12). It turns out that the mixing properties of the chain are considerably affected by the choice of model. Recall that, in the Gibbs sampling algorithm for the DMG model, a whole row of the error covariance matrix is sampled jointly, conditioning on the other parameters. For the DAG models all entries of the error covariance matrix are independent and can be sampled jointly, but this requires conditioning on the ancillary latents, which do not exist in the DMG model and have to be sampled only in the DAG case. For the majority of the covariance entries, the MCMC procedure mixed quite well, as illustrated in Figure 13. Notice how about 12% of the sampled DMG error covariances for Y7 ↔ Y11 were under zero, which could raise suspicion over the assumption of positive covariances. Autocorrelation is essentially zero for most parameters at a lag of 50. The degree of autocorrelation, however, varied significantly between the DMG model and each DAG model. The chains for the DMG model mixed considerably better.
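As a concrete example of the kind of mixing diagnostic referred to here, the short sketch below computes the lag-k sample autocorrelation of a single MCMC trace (for instance, the sampled error covariance for Y7 ↔ Y11). It is a minimal stand-in for the diagnostics produced with R, not the authors' code.

```python
import numpy as np

def autocorr(trace, lag):
    """Lag-k sample autocorrelation of a 1-D MCMC trace."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    n = len(x)
    return float(np.dot(x[:n - lag], x[lag:]) / np.dot(x, x))

# A chain whose autocorrelation is essentially zero at lag 50 mixes well:
# rho_50 = autocorr(sampled_covariance_Y7_Y11, lag=50)
```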
To summarize such behavior, we calculated the effective sample size of the samples obtained from several chains. The parameters of interest in this comparison are the independent entries in the 11 × 11 dimensional observed covariance matrix. This is a total of 66 parameters. The effective sample size statistics were obtained by 80 independent chains of 50,000 samples each, for the three models. For each chain and each parameter, we compute the desired statistic using the EFFECTIVE S IZE function implemented in the R package CODA, freely available in the Internet. Results are summarized by boxplots in Figure 14. Parameters are ordered in the x-axis following the upper triangular covariance matrix, scanning it in the order {σY1Y1 , σY1Y2 , . . . , σY1Y11 , σY2Y2 , . . . , σY11Y11 }. White boxplots correspond to the distribution of effective sample size statistics with the DMG model across the 80 independent chains. Gray boxplots correspond to the two DAG variants. There is no significant difference between the behaviour of the Gibbs sampling procedure for the two DAG models. The procedure with the DMG model is clearly better behaved. As a summary statistic, the average effective sample size over 80 trials was steadly larger in the DMG outcome than in the positive DAG outcome (61 out of 66 parameters) and unconstrained DAG (59 out of 66). The comparison of averages is illustrated by Figure 15. By caching the sufficient statistics of the data and factorizing the sampling procedure according to the districts of the graph, the running time for generating 50,000 samples out of the DMG model was of 34 seconds in a dual core Pentium IV 2.0 GHz. Depending on the connectivity of the bidirected components of the graph and on the implementation of matrix inversion, sampling from the 1220 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS DAG model might be faster than sampling from the DMG. In this particular study, sampling from the DAG models was substantially slower, an approximate average of 60 seconds for both variants. This can be explained by the fact that sampling latent variables is very expensive, especially considering that in the given DAG models all ancillary latents become dependent when conditioning on the data. To summarize, the DMG approach allowed for a complete parameterization with significantly better mixing properties, while still resulting in a faster MCMC procedure. 7.2 Structure Learning Applications When trying to find a point estimate of graphical structures (i.e., returning a single graph that explains the data well), simple approaches such as testing for marginal independencies are reasonable learning algorithms under the Gaussian assumption. The Bayesian approach, however, allows one to compute odds and distributions over graphs and graph statistics, for example, the joint probability of small substructures (Friedman and Koller, 2003). Moreover, it is not clear how the independence test procedure controls for the predictive ability of the model, which is not a straightforward function of the edges that are selected due to the quantitative aspects of the dependencies. We evaluate our Bayesian model selection contribution, focusing on the Monte Carlo sampler for bi-directed models. Jones et al. (2005) propose the following priors for graphs: P (G |β) = β|E| (1 − β)0.5m(m−1)−|E| where β is a hyperparameter, |E| is the number of edges in G , and m is the number of nodes. 
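For concreteness, the sketch below evaluates the logarithm of this graph prior, log P(G | β) = |E| log β + (m(m − 1)/2 − |E|) log(1 − β). The function name and the choice to pass an edge count directly are illustrative assumptions.

```python
import math

def log_graph_prior(num_edges, num_nodes, beta):
    """log P(G | beta) = |E| log(beta) + (0.5 m (m - 1) - |E|) log(1 - beta)."""
    max_edges = 0.5 * num_nodes * (num_nodes - 1)
    return num_edges * math.log(beta) + (max_edges - num_edges) * math.log(1.0 - beta)

# Example with the choice used in the paper, beta = 0.5 / (m - 1):
# log_graph_prior(num_edges=18, num_nodes=10, beta=0.5 / 9)
```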
As suggested by Jones et al., we choose β = 0.5/(m − 1), which puts more mass on graphs with O(m) edges than the uniform prior. We start with a brief synthetic study to compare the approach against a simple but effective approach based on the BIC approximation (footnote 13). An experiment with gene expression data closes this subsection.

7.2.1 SYNTHETIC STUDIES

As a sanity check for the procedure, we generate synthetic 10-dimensional Gaussian data from models that are Markov with respect to a bi-directed graph. One hundred data sets of 50 datapoints each are generated, each coming from a different model (footnote 14). We initially find a structure by marginal independence tests using Fisher's Z statistic at a 0.05 level. From this starting point, we perform two searches: one using the BIC score, and the other using the marginal likelihood with a G-IW prior (footnote 15). Given the best model for each procedure, we evaluate the predictive log-likelihood on a test set of 2,000 points which are independently sampled for each of the 100 models.

13. The BIC approach is an asymptotically consistent score for selecting the maximum a posteriori Gaussian bi-directed graph model (Richardson and Spirtes, 2002).

14. The details of the simulated data are as follows: we start with a DAG with no edges, with observed nodes {Y1, Y2, . . . , Y10} and hidden nodes {X1, X2, X3, X4}. Each individual edge Xi → Yj is added with probability 0.35, and no other edges are allowed. We reject graphs with fewer than 10 edges. All coefficient parameters are sampled from a standard Gaussian, and variances from a uniform distribution in [0, 1]. The model over Y corresponds to a bi-directed graph, where the edge Yi ↔ Yj exists if and only if Yi and Yj have a common latent parent Xk in the DAG. We then store 50 samples for the Y variables in a data set. The procedure is repeated 100 times with different parameters and graphical structures each time. The average number of edges in the resulting simulation was 18.4 edges per graph.

15. In both cases, we center the data at the empirical mean of the training set and assume the data to have been generated from a zero-mean Gaussian. The G-Inverse Wishart is an empirical prior: a diagonal matrix with the training variance of each variable used as the diagonal. The number of degrees of freedom is set to 1. The search is a standard greedy procedure: we evaluate the marginal log-likelihood or BIC score for each graph that differs from the current candidate by one edge (i.e., graphs with one more or one fewer edge) and pick the one with the highest score. We stop when no improvement is possible.

Figure 16: The difference in predictive log-likelihood between models learned with the G-IW prior and the best BIC models found by greedy search (both panels plot the density of the difference "GIW Greedy − BIC Greedy" over the 100 data sets). Although the difference per point is small, it reflects a persistent advantage of the full Bayesian approach. Figure (a) shows the estimated density of the distribution of differences when predicting points using the Bayesian predictive log-likelihood. Since the BIC search method does not attempt to maximize the finite sample posterior distribution, we provide Figure (b) for completeness: in this case, the predictive log-likelihood for the BIC model was calculated using the maximum likelihood estimator. The difference hardly changes, and the fully Bayesian model still wins (density estimates produced by the DENSITY(·) function of R 2.6.0).

The average difference in log-likelihood prediction (see footnote 16) between the structure learned with the Bayesian prior and the BIC-induced model is depicted in Figure 16(a).
This is computed by conditioning on the learned structures (fully Bayesian vs. BIC maximum a posteriori graphs) and marginalizing over the posterior of the parameters. The parameter priors are those used for the structure learning step. This might be unfair to the BIC procedure, since it is not designed to maximize the finite sample posterior: hence we also show in Figure 16(b) the results obtained when the predictions given the BIC model are obtained by using the maximum likelihood estimators of the parameters. The average difference in the first case is 400.07, and only slightly less for the second case (389.63). Most of the mass of the difference distribution is positive (85 out of 100 for the first case, 89 out of 100 in the second case), which passes a sign test at a 0.05 level. The difference is still relatively small, suggesting that informative prior knowledge might be necessary to produce substantially better predictions.

16. In terms of incorrect edge additions and deletions, the procedures behave about the same: an average of one third of the edges is missed, and 7% of edges are incorrectly added (individual percentages are with respect to the total number of possible mistakes in each graph). Unlike BIC, however, our procedure allows for different trade-offs by using different priors. It should also be pointed out that counting edge errors is just one possible measure. A more global quantitative score such as predictive log-likelihood takes into account, indirectly, the magnitude of the errors, although it is not a direct measure of model fitness.

7.2.2 GENE EXPRESSION ANALYSIS

To illustrate the use of Bayesian model selection approaches, we analyse the gene expression data previously studied by Drton and Perlman (2007), also as Gaussian bi-directed models. As before, our goal will be to compare the predictive power of models learned by greedy search with BIC and greedy search with the Bayesian posterior. The data consists of 13 gene expression measurements from a metabolic network. A total of 118 points is available. Using all data, the BIC-induced graph has 39 edges, while the finite sample posterior graph had 44. The same procedure used in the synthetic studies, for initializing graphs and choosing priors and centering the data, was applied in this case with two choices of degrees of freedom δ for the G-IW prior: δ = 1 and δ = 5. Preliminary experiments where 90% of the samples are assigned to the training set showed a negligible difference between methods. We then generate 10 random splits of the data, with 50% of the points assigned to the training set. Predictive results using the MCMC method for evaluating the Bayesian predictions (with half a million samples) are shown in Table 1. The BIC graphs are by definition the same in the three sets of evaluation, but parameters are learned in three different ways (maximum likelihood point estimation and Bayesian averaging with two different priors). There is a steady advantage for the Bayesian approach, although a small one.
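The Bayesian predictions referred to here average the test-set likelihood over posterior samples of the parameters. A minimal sketch of that computation for a zero-mean Gaussian model is given below, assuming posterior draws of the covariance matrix are available (for instance, from the Gibbs sampler); it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def bayesian_predictive_loglik(test_data, sigma_samples):
    """log p(test | train) = sum_d log( (1/S) sum_s N(y_d; 0, Sigma_s) ),
    where Sigma_s are posterior draws of the covariance matrix."""
    d = test_data.shape[1]
    logp = np.stack([multivariate_normal(np.zeros(d), sigma).logpdf(test_data)
                     for sigma in sigma_samples])            # S x N matrix
    return float(np.sum(logsumexp(logp, axis=0) - np.log(len(sigma_samples))))
```

The maximum likelihood alternative reported in the first column of Table 1 replaces the average over posterior draws with a single plug-in estimate of the covariance matrix.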
Notice that using Bayesian averaging over parameters given the BIC graph improves prediction when compared to using the maximum likelihood point estimate, despite the simplistic choice of prior in this study. Notice also that the cases where the Monte Carlo method has small or no advantage over the BIC method were the ones where the maximum likelihood estimators produced their best results.

Table 1: Results for the 10 random splits of the gene expression data, with 50% of the points assigned to the training set. The first column shows the predictive log-likelihood for the graph learned with the BIC criterion and parameters fit by maximum likelihood. The next two columns show predictive log-likelihood results for the graphs learned with BIC and the Monte Carlo (MC) marginal likelihood method using a G-IW prior with degrees of freedom δ = 1. The last two columns are the results of a prior where δ = 5. Best results in bold.

Folder    MLE BIC     δ=1 BIC     δ=1 MC      δ=5 BIC     δ=5 MC
1         -6578.44    -6382.45    -6308.19    -6342.82    -6296.14
2         -6392.67    -6284.64    -6277.94    -6279.54    -6285.26
3         -8194.54    -6567.89    -6433.51    -6553.88    -6452.15
4         -6284.00    -6265.16    -6285.77    -6252.54    -6258.42
5         -9428.00    -6473.93    -6400.51    -6483.43    -6469.45
6         -7111.45    -6573.85    -6572.74    -6528.76    -6513.02
7         -6411.43    -6329.53    -6317.18    -6313.05    -6309.18
8         -6350.44    -6319.87    -6295.19    -6299.53    -6297.80
9         -6374.31    -6307.13    -6308.21    -6297.47    -6304.25
10        -7247.82    -6584.96    -6468.51    -6528.61    -6444.55

7.2.3 REMARKS

The procedure based on the sampler is feasible for reasonably sized problems, on the order of a few dozen variables, on desktop machines. Further improvements are necessary for larger problems. One aspect that was not explored here was re-using previous computations when calculating the probability of a new candidate, in a way similar to the local updates in DAG models (Chickering, 2002). How to combine local updates with the ordering-based improved sampler of Section 6 is left as future research. Several practical variations can also be implemented, such as vetoing the inclusion of edges associated with high p-values in the respective independence tests. Such tabu lists can significantly shrink the search space.

It is important to evaluate how the Monte Carlo procedure for computing normalizing constants behaves in practice. For all practical purposes, the procedure is an importance sampler and as such is not guaranteed to work within a reasonable amount of time for problems of high dimensionality (MacKay, 1998). We can, however, exploit the nature of the problem for our benefit. Notice that the procedure depends upon a choice of ordering ≺ for the variables. Different orderings correspond in general to different importance distributions. We can play with this feature to choose a suitable ordering. Consider the following algorithm for choosing an ordering given a bi-directed graph G:

1. Let ≺ be an empty queue.
2. Let G′ be the graph complement of G, that is, the graph where {Yi, Yj} are neighbors if and only if they are not adjacent in G.
3. Let C be an arbitrary maximum clique of G′. Add all elements of C to the end of ≺ in any arbitrary order.
4. For each pair {Yi, Yj}, not intersecting C, such that the path Yi ↔ Yk ↔ Yj exists in G and Yk ∈ C, add the edge Yi ↔ Yj to G.
5. Remove all elements Yk ∈ C from G, including any edge into Yk.
6. Iterate Steps 2-5 until G is an empty graph.
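A minimal sketch of these six steps is given below, representing the bi-directed graph as an undirected networkx graph. The exact maximum-clique call is only a convenient stand-in; footnote 17 below describes the greedier clique construction actually used in the experiments.

```python
import networkx as nx

def ordering_for_bidirected_graph(G):
    """Sketch of the ordering heuristic above. G is an undirected networkx
    graph whose edges stand for the bi-directed edges Yi <-> Yj."""
    G = G.copy()
    order = []                                          # the queue "<"
    while G.number_of_nodes() > 0:
        Gc = nx.complement(G)                           # step 2: graph complement
        C, _ = nx.max_weight_clique(Gc, weight=None)    # step 3: a maximum clique of G'
        Cset = set(C)
        order.extend(C)                                 # append C in arbitrary order
        # step 4: connect pairs outside C linked through some Yk in C
        for k in C:
            nbrs = [v for v in G.neighbors(k) if v not in Cset]
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    G.add_edge(nbrs[i], nbrs[j])
        G.remove_nodes_from(C)                          # step 5
    return order
```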
The resulting queue ≺ is an ordering that attempts to maximize the number of variables that are marginally independent given their common predecessors. This is just one possibility for simplifying the importance sampling distribution: perfect orderings and the approaches for simplifying maximum likelihood estimation described by Drton and Richardson (2008b) could be adapted to provide even better orderings, but we leave this as future work (footnote 17).

17. In our actual implementation used in the experiments in this section, we implemented an even simpler approach: instead of finding maximum cliques, we start to build a clique from a particular node, “greedily” adding other nodes to the clique according to the column order of the data set. Each node generates a candidate clique, and we pick an arbitrary clique of maximal size to be our new set C.

Figure 17: An evaluation of the stability of the Monte Carlo normalizing function procedure (panels: “Gene data, default ordering”, “Gene data, optimized ordering”, “Synthetic data (123 parameters)”, “Synthetic data (658 parameters)”; each panel plots the marginal likelihood estimate against the number of samples, and the synthetic panels compare an arbitrary ordering with the optimized ordering). The top row depicts the marginal likelihood estimates for the gene problem using two different distributions implied by two different orderings, as explained in the text. Experiments with synthetic data are shown in the bottom, and the bottom-right figure illustrates major differences.

Figure 17 illustrates the difference that a smart choice of ordering can make. The top left graph in Figure 17 depicts the progress of the marginal likelihood Monte Carlo estimator for the gene expression problem using the graph given by the hypothesis testing procedure. The model has 55 parameters. We obtain three estimates, each using a sample of 100,000 points, which allows us to observe how the estimates change at the initial stages. The variable ordering in this case is the ordering present in the original database (namely, DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, PPDS1 and PPDS2). The top right graph shows three runs using the optimized ordering criterion. Convergence is much faster in this case, and both samplers agree on the normalizing constant estimate. As an illustration of the power of the procedure and its limitations, we generated a synthetic sample of 1,000 training points from a graph with 25 nodes, using the same procedure as in Section 7.2.1. A run of two different samplers is shown at the bottom left of Figure 17. They are seemingly well-behaved, the ratio between the largest and median weight being on the order of one hundred in the “optimally” ordered case. In contrast, the bottom right corner of Figure 17 illustrates the method with a covariance matrix of 50 random variables and 1,000 training points. Notice this is a particularly dense graph. Much bigger jumps are observed in this case and there is no clear sign of convergence at 100,000 iterations.
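As a simple way of monitoring runs like these, the sketch below tracks the running log-normalizing-constant estimate and the largest-to-median weight ratio mentioned above, given the log importance weights of one run (one choice of ordering). The weight-based effective sample size is an additional standard diagnostic, not taken from the paper.

```python
import numpy as np

def importance_diagnostics(log_weights):
    """Stability diagnostics for an importance sampler from its log weights."""
    lw = np.asarray(log_weights, dtype=float)
    # running estimate of the log normalizing constant (log of the mean weight)
    running = np.logaddexp.accumulate(lw) - np.log(np.arange(1, len(lw) + 1))
    w = np.exp(lw - lw.max())                  # rescaled weights
    ratio = w.max() / np.median(w)             # largest / median weight
    ess = w.sum() ** 2 / np.sum(w ** 2)        # effective sample size of the weights
    return running, ratio, ess
```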
While there is no foolproof criterion to evaluate the behavior of an importance sampler, the relationship between orderings provides a complementary technique: if the normalizing constant estimators vary substantially for a given set of random permutations of the variables, then the outcomes are arguably not to be trusted even if the respective estimators appear to have converged. Concerning the choice of priors, in this Section we exploited empirical priors. The G -Inverse Wishart matrix hyperparameter is a diagonal matrix where variance entries are the sample variances. While this adds an extra bias towards diagonal matrices, at least in our experiments we performed close to or better than other approaches—it is however not clear whether we could have done much better. It is still an open question which practical “default” hyperparameters will prove useful for the G -IW . Elicitation of subjective priors in the context of structural equation models can benefit from pre-existing work on Bayesian regression, although again practical matters might be different for the G -IW . Dunson et al. (2005) describe some limitations of default priors for structural equation models. A thorough evaluation of methods for eliciting subjective priors is out of the context of this work, but existing work on inverse Wishart elicitation provides a starting point (Al-Awadhi and Garthwaite, 1998). As in the case of the inverse Wishart, the G -Inverse Wishart has a single hyperparameter for specifying degrees of freedom, a limitation which might motivate new types of priors (Brown et al., 1993). 7.3 Discrete Data Applications We now show results on learning a discrete distribution that factorizes according to a mixed graph. Drton and Richardson (2008a) describe applications on real-world binary data modeled according to bi-directed graphs. The empirical contingency tables for two studies can be found in the corresponding technical report (Drton and Richardson, 2005). Drton and Richardson used a complete parameterization for bi-directed binary models and a maximum likelihood estimation procedure. In this section, we analyze these two data sets to illustrate the behavior of our Bayesian procedure using the probit model. Our model imposes probit constraints that are not enforced by Drton and Richardson, but it allows us to obtain Bayesian credible intervals and predictions. The graphs used in the two studies are depicted in Figure 18. The first problem is a study on the dependence between alcoholism and depression, as shown in Figure 18(a). A data point is collected for a given pair of mono-zygotic twins. For each sibling Si , it is recorded whether Si is/is not alcoholic (Ai ), and whether Si suffers/does not suffer from depression (Di ). The hypothesis encoded by the graph is that alcoholism and depression do not share a common genetic cause, despite A and 1226 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS A1 Success A2 Type of offense Prior sentence D1 D2 Drugs (a) Age (b) Figure 18: Two learning problems with discrete data. In (a), the graph shows dependencies concerning alcoholism (Ai ) and depression (Di ) symptoms for paired twins {1, 2}. In (b), a model for dependencies among features of a study on parole appeals, including the success of the parole, if the type of offense was a person offense or not, and if the offender had a dependency on drugs and was over 25 years old. All variables in these studies are binary and further details and references are provided by Drton and Richardson (2008a). 
D having some hidden (but different) genetic causes. If A and D did have genetic common causes, one would expect that the edges A1 ↔ D2 and A2 ↔ D1 would also be required. The compounded hypothesis of marginal independencies for Ai and Dj, i ≠ j, can be tested jointly by testing a bi-directed model. Notice that no reference to particular genetic hidden causes of alcoholism and depression is necessary, which again illustrates the power of modeling by marginalizing out latent variables.

The second study, as shown in Figure 18(b), concerns the dependencies among several variables in an application for parole. The model implies, for instance, that the success of a parole application (Success node, in the Figure) is independent of the age of the offender being under 25 (Age node). However, if it is known that the offender had a prior sentence, these two variables become dependent (through the path Success ↔ Prior sentence ↔ Age). As reported by Drton and Richardson, their binary bi-directed model passes a significance test. Drton and Richardson also attempted to learn an undirected (Markov) network structure with this data, but the outcome was a fully connected graph. This is expected, since Markov networks cannot represent marginal independencies unless the graph is disconnected, which would introduce all sorts of other independencies and possibly not fit the data well. If many marginal independencies exist in the data generating process, Markov networks might be a bad choice of representation. For problems with symmetries such as the twin study, DAGs are not a natural choice either.

7.3.1 RESULTS

For the twin data problem, we used a simple prior for the covariance matrix of the underlying latent variables: a G-inverse Wishart with 1 degree of freedom and a complete covariance with a value of 2 for each element in the diagonal and 1 outside the diagonal. Thresholds are fixed at zero, since we have binary data. We present the expected posterior values of the contingency table entries in Figure 20. The outcome is essentially identical to the maximum likelihood estimates of Drton and Richardson despite the probit parameterization.

Figure 20: The posterior expected value of the 16 entries in the twin study table (E[Θ|D]). Results generated with a chain of 5,000 points. We also show the maximum likelihood estimates of Drton and Richardson (MLE) and the maximum likelihood values obtained using an unconstrained model (uMLE). Despite the probit parameterization, in this particular study there is a reasonable agreement between the Bayesian estimator and the estimator of Drton and Richardson.

Entry A1A2D1D2    E[Θ|D]    MLE      uMLE
0000              0.461     0.461    0.482
0001              0.136     0.138    0.134
0010              0.157     0.159    0.154
0011              0.097     0.096    0.085
0100              0.032     0.032    0.025
0101              0.022     0.021    0.015
0110              0.007     0.008    0.012
0111              0.012     0.012    0.017
1000              0.018     0.018    0.013
1001              0.003     0.004    0.007
1010              0.021     0.020    0.013
1011              0.009     0.009    0.015
1100              0.008     0.010    0.005
1101              0.003     0.002    0.003
1110              0.003     0.005    0.007
1111              0.006     0.005    0.012
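To make the link between the probit parameterization and the reported table entries explicit, the sketch below estimates one entry of the 2×2×2×2 table from a single posterior draw of the underlying covariance matrix by Monte Carlo. Function and argument names are illustrative, and the paper's own computation of these entries may differ in detail.

```python
import numpy as np

def table_entry_probability(Sigma, pattern, n_mc=200_000, seed=0):
    """Monte Carlo estimate of one contingency-table entry under the probit
    model: each binary variable indicates that its underlying Gaussian
    variable exceeds the threshold 0. `Sigma` is one posterior draw of the
    underlying covariance matrix; `pattern` is e.g. (0, 0, 0, 1) for A1A2D1D2."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(pattern)), Sigma, size=n_mc)
    bits = (z > 0.0).astype(int)
    return float(np.mean(np.all(bits == np.array(pattern), axis=1)))

# Averaging this quantity over posterior draws of Sigma gives E[Theta | D],
# the quantity reported in Figure 20.
```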
Moreover, with our procedure we are able to generate Bayesian confidence intervals, as illustrated in Figure 19. The results are very stable for a chain of 1,000 points.

Figure 19: Posterior distribution of some of the marginal contingency table entries for the twin model (panels for A1A2D1D2 = 0000, 0001 and 1000; smoothed densities based on 5,000 samples).

For the parole data, we used a G-inverse Wishart prior for the covariance matrix of the underlying variables Y⋆ with 1 degree of freedom and the identity matrix as hyperparameters. We compare the effective sample size of the Gibbs sampler for our DMG model and the DAG model obtained by using the ancillary latent parameterization of Section 7.1 for the underlying latent variable covariance matrix (footnote 18). Boxplots for the 16 contingency table entries of the twin network and the 32 entries of the parole study are shown in Figure 21. The setup is the same as in the democratization and industrialization experiment, where we run 80 independent chains and plot the distribution of the effective sample sizes to measure the mixing time. We ran a shorter chain of 2,000 points, since computing the contingency table entries is expensive. There is a substantial difference in effective sample size for the parole study. Notice that we are comparing MCMC samples for the entries in the contingency table, which in the DAG case requires integrating out not only the underlying latent variables implicit in the probit parameterization, but also the ancillary latents that account for the bi-directed edges. This hierarchy of latent variables, which does not exist in the DMG case, causes a considerable increase in the autocorrelation of the chain compared to the DMG model. The standard DMG parameterization can be seen as a way of obtaining a collapsed Gibbs sampler, where the parameterization by construction reflects latent variables that were analytically marginalized.

18. The priors used are as follows: the ancillary representation was given a prior with mean 1 and variance 1 for the coefficients Xij → Yj⋆, for j > i, and set constant to 1, if i < j. The means of the ancillary latents were fixed at 0. Variance parameters were given (0.5, 0.5) inverse gamma priors, which approximately matches the priors in the DMG model.

Figure 21: Comparison of effective sample sizes for the twin data (a) and parole data (b); the boxplots compare the DMG and DAG models over the 16 twin-study parameters and the 32 parole-study parameters. 80 independent chains of 2,000 points were obtained using the Gibbs sampling algorithm, and the respective box-plots are shown above. The Markov chain with the DMG approach easily dominates the DAG one. For the parole data, the average effective sample size for the DAG was as low as 60 points.

8. Conclusion

Directed mixed graph models are a generalization of directed graph models. Whenever a machine learning application requires directed graphs, one should first consider whether directed mixed graphs are a better choice of representation instead. DMGs represent conditional independencies of DAGs where hidden variables have been marginalized out. Given that in most applications it is

Figure 22: In (a), a simple bi-directed chain with four random variables. In (b), the respective factor graph that is obtained from a Bartlett parameterization using the ordering ≺ ≡ {Y1, Y2, Y3, Y4}.
In this case, the factors are p(Y1 ) × p(Y2 |Y1 ) × p(Y3 |Y1 ,Y2 ) × p(Y4 |Y1 ,Y2 ,Y3 ). A different choice of ordering (e.g., the perfect ordering) could provide simpler factors on average, but the presence of a factor linked to all variables is unavoidable. unlikely that all relevant variables are known, DMGs are a natural representation to use. In this paper, we introduced priors and inference algorithms for Bayesian learning with two popular families of mixed graph models: Gaussian and probit. We discussed some implementations and approximations to scale up algorithms. We showed examples of applications with real data, and demonstrated that Bayesian inference in Gaussian and probit DMG models using MCMC can have substantially faster mixing than in comparable DAGs. It is part of the machine learning folklore that factor graphs can subsume directed networks. In an important sense, this is known not to be true: undirected and factor graphs only allow for monotonic independence models, where explaining away is ruled out. This excludes a vast number of realistic, non-monotonic, models. While factor graphs are perhaps the data structures of choice for general message-passing algorithms (e.g., Yedidia et al., 2005), they are far from being universal modeling languages for independencies. What is true is that for any distribution that is Markov with respect to a DAG or DMG there is at least one corresponding factor graph model, but this is a vacuous claim of little interest: any distribution can be represented by a single-factor model involving all variables. Some will require a factor with all variables, even under the presence of a large number of independence constraints. For instance, a factor graph corresponding to any given bi-directed chain will necessarily include a factor node adjacent to all variable nodes, as illustrated in Figure 22. When parameterizing a distribution with many marginal independencies (e.g., a bi-directed tree), the respective factor graph would be no more than an unhelpful drawing. A better strategy for solving real-world problems is to define a family of models according to the (directed/undirected/factor) graphs of choice, and let the inference algorithm decide which re-expression of the model suits the problem. This has been traditional in graphical modeling literature (Lauritzen, 1996). The strategy adopted in this paper followed this spirit. An alternative has been recently introduced by Huang and Frey (2008). This paper discusses graphical families of marginal independence constraints (essentially identical to bi-directed graphs, although other types of constraints might implicitly follow from the parameterization). Models are parameterized using a very different strategy. The idea is to parameterize cumulative distribution functions (CDFs) instead of densities or probability mass functions. A simple factorization criterion can be defined in the space of CDFs, but densities have to be computed by a novel message-passing 1230 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS scheme. The particular application discussed by Huang and Frey (2008) could in principle be approached using the Gaussian bi-directed probit model of Section 5, but the parameterization in Huang and Frey (2008) does not need to rely on Gaussian distributions. It is not clear, however, how to efficiently perform Bayesian inference in this case and which constraints are implicitly implied by the different choices of parameterization. 
The different perspective given by products of CDFs is novel and promising. It should point out to new directions in mixed graph modeling. The structural equation modeling literature also describes several pragmatic ways of specifying non-linearities in the structural equations (Lee, 2007). Less common is the specification of nonGaussian models for the joint density of the error terms. Silva and Ghahramani (2009) introduce a flexible mixture of Gaussians approach for models of marginal independence. There is a need on how to combine this approach with flexible families of structural equations in a computationally efficient way. Also, models with non-additive error terms remain to be explored. Current interest in estimating sparse statistical models has lead to approaches that estimate structured covariance matrices (e.g., Bickel and Levina, 2008). This development could also lead to new families of priors. In particular, different matrix decompositions have motivated different ways of specifying priors on covariance matrices. For instance, Chen and Dunson (2003) propose a modified Cholesky decomposition for the covariance matrix of random effect parameters: standard deviations are parameterized separately with a prior that puts positive mass on zero variances (effectively allowing the random effect to be neutralized). Wong et al. (2003) describe a prior for inverse correlation matrices that is uniform conditioned on the number of structural zeros. Metropolis-Hastings schemes are necessary in this case. Shrinkage methods have also been applied to the estimation of covariance matrices. A common approach, shrinkage towards a diagonal matrix (e.g., Daniels and Kass, 1999), could be generalized towards some sparse matrix corresponding to a bi-directed graph. Although shrinkage will not generate structural zeros in the resulting matrix, allowing for sparse shrinkage matrices other than the identity matrix could be interesting in prediction problems. Some approaches can exploit an ordering for the variables, which is natural in some domains such as time-series analysis. While the G -Inverse Wishart is invariant to a permutation of the variables, new types of priors that exploit a natural variable ordering should be of interest, as in the original work of Brown et al. (1993) that motivated our approach. Other directions and applications are suggested by recent papers: • learning measurement models: the industrialization and democratization problem of Section 7.1 provides an example of a measurement model. In such a family of problems, observed variables are children of latent variables, and connections from latents to observables define the measurement model. Sparsity in the measurement can be exploited to allow for more general dependencies connecting latent variables. One role of the bi-directed component is to allow for extra dependencies connecting observed variables that are not accounted by the explicit latent variables in the model. Silva et al. (2006) describes a learning algorithm for mixed graph measurement models using the “ancillary” parameterization. The natural question is which alternative optimization strategies could be used and how to scale them up; • structural and relational learning: in prediction problems where given an input vector X we have to predict an output vector Y, the dependence structure of Y given X can also lie within the directed mixed graph family. Silva et al. 
(2007) introduces mixed graph models within the context of relational classification, where Y are labels of different data points 1231 S ILVA AND G HAHRAMANI not independently distributed. In such a class of problems, novel kinds of parameterization are necessary since the dimensionality of the covariance matrix increases with the sample size. Structural features of the graph are used to propose different parameterizations of the dependencies, and many other alternatives are possible; • causal inference: mixed graphs have been consistently used as a language for representing causal dependencies under unmeasured confounding. Zhang (2008) describes recent advances in identifying causal effects with ancestral graphs. Algorithms for learning mixed graph structures are described by Spirtes et al. (2000) and the recent advances in parameterizing such models should result in new algorithms; Many challenges remain. For instance, more flexible models for DMG discrete models are being developed (Drton and Richardson, 2008a), but for large graphs they pose a formidable computational problem. An important question is which other less flexible, but more tractable, parameterizations could be used, and which approximation algorithms to develop. The probit family discussed here was a choice among many. The parameterization by Drton and Richardson (2008a) could be a starting point for trading-off flexibility and computational effort. And while it is true that Gaussian copula models (Pitt et al., 2006) can be adapted to generalize the approach introduced here, it remains to be seen if other copula parameterizations easily lead to DMG models. Acknowledgments We thank the anonymous reviewers for their suggestions, Kenneth Bollen for providing us with the industrialization and democratization data set, and Robert Gramacy for helpful discussions. An earlier version of this paper (Silva and Ghahramani, 2006) appeared in the proceedings of the Uncertainty in Artificial Intelligence conference. This work was funded by the Gatsby Charitable Foundation and a EPSRC grant #EP/D065704/1. Appendix A. Deriving the Sampling Distribution for the Monte Carlo Computation of Normalizing Constants We give here the details on how to derive the sampling distribution used for computing normalizing constants IG (δ, U), as described in Section 3.2.2. Let Ai ≡ Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . Recall from Equation (7) that Bi,nsp≺ (i) = −Bi,sp≺ (i) Ai . nsp The original density p(Bi | γi ), as given by Lemma 1, is a multivariate Gaussian with the following kernel: 1 exp − 2γi T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) (22) where U{i−1},{i−1} in Lemma 1 was rearranged above as the partitioned matrix in (14). The pair {Msp≺ (i) , Mnsp≺ (i) } corresponds to the respective partition of the mean vector Mi . 
Plugging in the expression for Bi,nsp≺ (i) in (22), we obtain the modified kernel 1232 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS exp − 1 2γi T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) (23) which can be rewritten as pb (Bi,sp≺ (i) ; Ki mi , γi Ki ) × (2π)#sp≺ (i)/2 |γi |#sp≺ (i)/2 |Ki (Φi−1 )|1/2 1 × exp − γ−1 Ui 2 i (24) where #sp≺ (i) is the size of set sp≺ (i), pb (·; α, Σ) is the density function of a multivariate Gaussian distribution with mean α and covariance Σ, Ki (Φi−1 ) ≡ Ki to emphasize the contribution of previous parameters, and mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i Ui = MT U{i−1},{i−1} Mi − mT Ki mi . i i / If sp≺ (i) = 0, it follows that Bi = Bi,nsp≺ (i) = 0. The kernel (23) reduces to exp(−0.5Ui /γi ), / and Ui ≡ MT U{i−1},{i−1} Mi . If nsp≺ (i) = 0, then the expression for the kernel does not change i (Ui ≡ 0), and Equation (24) corresponds to the original kernel in Equation (11). Inserting the re-expressed kernel into the original function (11), we obtain pb (Bi,sp≺ (i) ; Ki mi , γi Ki )pg γi ; δ + i − 1 + #nsp≺ (i) uii.{i−1},{i−1} + Ui , 2 2 fi (Φi−1 ) where pg (·; α, β) is an inverse gamma density function and (i−1)−#sp≺ (i) 2 |Ki (Φi−1 )|1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 Appendix B. Variational Updates for Gaussian Mixed Graph Models The variational updates for the coefficient and intercept parameters are essentially identical to their joint conditional distribution given V and X, where occurrences of V and X are substituted by expectations V−1 q(V) and X q(X) , respectively. Let Vi j be the i j-th entry of V−1 q(V) . The covariance matrix of (B, α) is the covariance matrix of the vector vec(B, α). Such vector is constructed using all (non-zero) coefficients and intercepts. We denote this covariance matrix by ΣB,α . For simplicity of notation, we will treat αi as the coefficient bi(m+1) , m being the number of variables. We will (d) also adopt the notation Ym+1 ≡ 1 in the following derivations. As an abuse of notation, let Y also 1233 S ILVA AND G HAHRAMANI refer to latent variables. In this case, if Yi and Y j refer to latent variables Xhi and Xh j , then define Yi ≡ Xhi q(X) , and YiY j ≡ Xhi Xh j q(X) . Let bi j and btv be the r-th and s-th entries of vec(B, α), respectively. The rs-th entry of the inverse matrix Σ−1 is given by Bα (Σ−1 )rs = Vit Bα n ∑ Yj (d) (d) Yv + 1(i = t)1( j = v) d=1 cbj i sbj i where bxpx ≡ 0 if no edge Yx ← Ypx exists in the graph, 1(·) is the indicator function, and cbj , sbj are i i the given prior parameters defined in Section 4. Similarly to the factorization criterion explained in Section 6, the matrix q(V) will in general be block-diagonal, and this summation can be highly simplified. Define now a vector cb analogous to the Gibbs sampling case, where m cb = r ∑ Vit t=1 n ∑ Y j Yt (d) (d) d=1 + cbj i sbj i . The variational distribution q(B, α) is then a N(ΣB,α c, ΣB,α ). The variational distribution for the latent variables will exactly the same as the Gibbs distribution, except that references to B, α, V−1 are substituted by B q(B,α) , α q(B,α) and V−1 q(V) . Appendix C. Proofs Proof of Lemma 2: Arrange the columns of the Jacobian such that their order corresponds to the sequence σ11 , σ21 , σ22 , σ31 , σ32 , σ33 , . . . 
, σmm , excluding the entries σi j that are identically zero by construction. Arrange the rows of the Jacobian such that their order corresponds to the sequence γ1 , β21 , γ2 , β31 , β32 , . . . , γm , excluding the entries βi j that are not in ΦG (i.e., exclude any βi j corresponding to a pair {Yi ,Y j } that is not adjacent in the graph). By the definition of Bartlett’s decomposition, Σ{i},{i} and βst are functionally independent for s > i. The same holds for Σ{i},{i} and γs . As such, ∂σi j /∂βst = 0 and ∂σi j /∂γs = 0 for s > i. This implies that J(ΦG ) is a (lower) block triangular matrix of 2m − 1 blocks: for k odd, the k-th block is the singleton ∂σii /∂γi = 1, where i = (k + 1)/2. For k even, the k-th block is the Jacobian ∂Σi,sp≺(i) /∂Bi,sp≺(i) , where i = 1 + k/2 and Σi,sp≺(i) is the vector of covariances of Yi and its preceding spouses. From the interpretation given by Equation (8), it follows that Bi,sp≺ (i) can also be defined by the regression of Yi on Zi . That is Bi,sp≺ (i) = ΣYi ,Zi Σ−1 i ≡ ΣYi ,Zi R−1 . i Zi ,Z (25) However, ΣYi ,Zi = Σi,sp≺ (i) , since Yi is independent of its non-spouses. From (25) we get Σi,sp≺ (i) = Bi,sp≺ (i) Ri , and as such the submatrix ∂Σi,sp≺(i) /∂Bi,sp≺(i) turns out to be Ri . Since the determinant of the block triangular Jacobian J(ΦG ) is given by the determinant of the blocks, this implies m |J(ΦG )| = ∏ |Ri |. i=2 By the matrix identity 1234 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS A B C D |Σ{i−1},{i−1} | = = |A||D − CA−1 B|, |Σnsp≺(i) ,nsp≺(i) ||Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) | nsp (26) ≡ i−1 |Σnsp≺(i) ,nsp≺(i) ||Ri |. Since |Σ{i−1},{i−1} | = ∏t=1 γt , the second equality holds. Proof of Theorem 4: We first describe a mapping from each path in G to a path in G ⋆ , and viceversa (such mappings are not inverse functions of each other, since the number of paths in G ⋆ is larger than in G ). By construction, all bi-directed edges in G ∗ have two UVs as endpoints, with an one-to-one mapping between each Ys⋆ ↔ Yt⋆ in G ⋆ and each Ys ↔ Yt in G . All directed edges in G ⋆ are of two types: Ys → Yt⋆ , with s = t, or Ys⋆ → Ys . Therefore, one can define an unique path P in G as a function of a path P⋆ in G ⋆ , obtained by relabeling each Y ⋆ as Y , and by collapsing any Y → Y edges that might result from this relabeling into a single vertex Y . A mapping in the opposite direction is analogous as given by the construction rule of Type-II models. A collider in a path is any vertex within a head-to-head collision in the path, that is, any vertex Yt where the preceding and the next vertex in the path are connected to Yt with an edge (directed or bi-directed) into Yt . Yi and Y j are m-separated by Z in an acyclic DMG if and only if there is no active path connecting Yi and Y j . Like in d-separation, a path is active if all of its colliders have some descendant in Z, and none of its non-colliders is in Z (Richardson, 2003). The mappings between paths P and P⋆ are such that Yt is a collider in P if and only if Yt is in P⋆ and is a collider, or Yt⋆ is in P⋆ and is a collider. Since by construction any Yt⋆ will have the same Y-descendants in G ⋆ as Yt has in G , and Z ⊂ Y, the result follows. Proof of Theorem 7: The first of the two claims of the theorem trivially holds, since connectivity is a transitive property and as such this partition will always exist (where K(i) = 1 is a possibility). We will prove the validity of the second claim by induction. Let {R1 , . . . 
6 0.068556935 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
7 0.067464091 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
8 0.053134531 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
9 0.046677068 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
10 0.045689389 23 jmlr-2009-Discriminative Learning Under Covariate Shift
11 0.045560513 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
12 0.043086 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
13 0.04207487 29 jmlr-2009-Estimating Labels from Label Proportions
14 0.04155729 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
15 0.040218644 25 jmlr-2009-Distributed Algorithms for Topic Models
16 0.039082408 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
17 0.037867859 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
18 0.037697248 16 jmlr-2009-Classification with Gaussians and Convex Loss
19 0.034513425 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
20 0.033313509 46 jmlr-2009-Learning Halfspaces with Malicious Noise
topicId topicWeight
[(0, 0.181), (1, -0.089), (2, 0.077), (3, 0.038), (4, -0.089), (5, -0.088), (6, -0.117), (7, -0.072), (8, 0.033), (9, 0.085), (10, 0.172), (11, 0.058), (12, -0.289), (13, -0.105), (14, 0.043), (15, 0.212), (16, 0.134), (17, 0.188), (18, -0.204), (19, 0.096), (20, -0.07), (21, -0.086), (22, 0.123), (23, -0.046), (24, -0.084), (25, -0.149), (26, -0.077), (27, -0.13), (28, 0.013), (29, 0.01), (30, 0.012), (31, 0.002), (32, 0.03), (33, 0.054), (34, 0.091), (35, 0.004), (36, 0.04), (37, 0.027), (38, -0.047), (39, -0.049), (40, -0.049), (41, -0.106), (42, -0.005), (43, 0.03), (44, 0.06), (45, -0.067), (46, 0.077), (47, -0.033), (48, -0.041), (49, 0.273)]
simIndex simValue paperId paperTitle
same-paper 1 0.96149826 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
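The abstract above mentions estimating marginal likelihoods by combining parallel tempering with thermodynamic integration. Purely as an illustration of the thermodynamic-integration step, here is a minimal Python sketch that assumes log-likelihood draws at a ladder of inverse temperatures are already available (for example, from a parallel tempering run); the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def thermodynamic_integration_logZ(betas, loglik_draws):
    """Estimate the log marginal likelihood via thermodynamic integration.

    betas: increasing inverse temperatures in [0, 1], with betas[0] = 0
        (prior) and betas[-1] = 1 (posterior).
    loglik_draws: loglik_draws[i] is an array of log p(D | theta) values for
        draws of theta from the tempered distribution at betas[i].
    """
    # log p(D) = integral over beta in [0, 1] of E_beta[log p(D | theta)];
    # approximate the integral with the trapezoidal rule over the ladder.
    expected_loglik = np.array([np.mean(draws) for draws in loglik_draws])
    return np.trapz(expected_loglik, betas)
```

A denser ladder near beta = 0, where the integrand typically changes fastest, usually reduces the discretization bias of the trapezoidal rule.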
2 0.49083111 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
Author: Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu
Abstract: Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Veronese varieties. Keywords: marginal likelihood, exact integration, mixture of independence model, computational algebra
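The entry above concerns exact marginal likelihood integrals for mixtures of independence models with Dirichlet priors. As a point of reference, the non-mixture case (a single independence model over discrete variables) has a closed form: the marginal likelihood factorizes into one Dirichlet-multinomial term per variable. The sketch below computes only that simpler quantity; it does not implement the paper's algebraic algorithms for the mixture case, and the names and numbers are illustrative.

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of categorical counts under a Dirichlet(alpha) prior."""
    a0, n = sum(alpha), sum(counts)
    return (lgamma(a0) - lgamma(a0 + n)
            + sum(lgamma(a + c) - lgamma(a) for a, c in zip(alpha, counts)))

def log_marginal_independence_model(count_table, alphas):
    """Independence model: the marginal likelihood is a product over variables."""
    return sum(log_dirichlet_multinomial(c, a) for c, a in zip(count_table, alphas))

# Two binary variables, 10 observations, uniform Dirichlet(1, 1) priors.
counts_per_variable = [[6, 4], [3, 7]]
priors = [[1.0, 1.0], [1.0, 1.0]]
print(log_marginal_independence_model(counts_per_variable, priors))
```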
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
4 0.46398038 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
5 0.40118393 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
Author: Dirk Gorissen, Tom Dhaene, Filip De Turck
Abstract: Due to the scale and computational complexity of currently used simulation codes, global surrogate (metamodels) models have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist (Support Vector Machines, Kriging, Neural Networks, etc.) but no type is optimal in all circumstances. Nor is there any hard theory available that can help make this choice. In this paper we present an automatic approach to the model type selection problem. We describe an adaptive global surrogate modeling environment with adaptive sampling, driven by speciated evolution. Different model types are evolved cooperatively using a Genetic Algorithm (heterogeneous evolution) and compete to approximate the iteratively selected data. In this way the optimal model type and complexity for a given data set or simulation code can be dynamically determined. Its utility and performance is demonstrated on a number of problems where it outperforms traditional sequential execution of each model type. Keywords: model type selection, genetic algorithms, global surrogate modeling, function approximation, active learning, adaptive sampling
6 0.34629205 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
7 0.33269113 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
8 0.30105647 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.26899534 16 jmlr-2009-Classification with Gaussians and Convex Loss
10 0.24975394 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
11 0.23884325 25 jmlr-2009-Distributed Algorithms for Topic Models
12 0.20895103 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
14 0.20249251 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
15 0.19785705 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
16 0.19730754 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
17 0.18689112 46 jmlr-2009-Learning Halfspaces with Malicious Noise
18 0.18024494 23 jmlr-2009-Discriminative Learning Under Covariate Shift
19 0.17737687 29 jmlr-2009-Estimating Labels from Label Proportions
20 0.17429122 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
topicId topicWeight
[(8, 0.015), (11, 0.045), (19, 0.018), (21, 0.015), (25, 0.39), (26, 0.012), (38, 0.046), (47, 0.017), (52, 0.04), (55, 0.051), (58, 0.041), (66, 0.103), (68, 0.026), (88, 0.011), (90, 0.055), (96, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.72826338 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
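The abstract above is about keeping the false discovery rate of the learned skeleton (the set of claimed edges) under a user-specified level. Purely as an illustration of what edge-wise FDR control means, the sketch below applies the standard Benjamini-Hochberg step-up procedure to hypothetical p-values from conditional-independence tests; this is a generic stand-in, not the modified PC algorithm proposed in that paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return set(order[:k_max])

# Hypothetical p-values for candidate edges of a graph skeleton.
edge_pvalues = {("Y1", "Y2"): 0.001, ("Y1", "Y3"): 0.04,
                ("Y2", "Y3"): 0.30, ("Y2", "Y4"): 0.012}
edges = list(edge_pvalues)
kept = benjamini_hochberg([edge_pvalues[e] for e in edges], q=0.05)
print([edges[i] for i in sorted(kept)])   # edges retained in the skeleton
```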
3 0.34982643 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
Author: Ricardo Silva, Zoubin Ghahramani
Abstract: Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly. Keywords: graphical models, structural equation models, Bayesian inference, Markov chain Monte Carlo, latent variable models
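The abstract above concerns Bayesian inference for Gaussian directed mixed graph (DMG) models. Such models are standardly parameterized as linear structural equations Y = BY + eps with a correlated-error covariance V, so the implied covariance is Sigma = (I - B)^{-1} V (I - B)^{-T}. The sketch below only evaluates that implied covariance for a small hypothetical graph; the graph, coefficients, and names are illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical 4-node acyclic directed mixed graph:
#   Y1 -> Y3, Y2 -> Y3, Y3 -> Y4, plus a bi-directed edge Y1 <-> Y4.
m = 4
B = np.zeros((m, m))               # B[i, j] = coefficient of the edge Yj -> Yi
B[2, 0], B[2, 1], B[3, 2] = 0.8, -0.5, 1.2

V = np.diag([1.0, 1.0, 0.5, 0.7])  # error variances
V[0, 3] = V[3, 0] = 0.3            # error covariance for the edge Y1 <-> Y4

# Implied covariance of the structural equation model Y = BY + eps:
#   Sigma = (I - B)^{-1} V (I - B)^{-T}
A = np.linalg.inv(np.eye(m) - B)
Sigma = A @ V @ A.T
print(np.round(Sigma, 3))

# Missing edges show up as independence constraints; here Y1 and Y2 share
# no edge and no common cause, so they are marginally independent.
assert abs(Sigma[0, 1]) < 1e-9
```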
We refer to Y1 , Y2 as the dosage for the blood pressure drugs, Y3 as a measure of chemotherapy dosage, Y4 as blood pressure, and Y5 as an indicator of liver status. Moreover, let Y6 be an hidden physiological factor that affects both blood pressure and liver status. It is assumed that the DAG corresponding to this setup is given by Figure 1(a). In this problem, predictions concerning Y6 are irrelevant: what we care is the marginal for {Y1 , . . . ,Y5 }. Ideally, we want to take such irrelevant hidden variables out of the loop. Yet the set of dependencies within the marginal for {Y1 , . . . ,Y5 } cannot be efficiently represented as a DAG model. If we remove the edge Y3 → Y4 from Figure 1(b), one can verify this will imply a model where Y3 and Y4 are independent given Y5 , which is not true in our original model. To avoid introducing unwanted independence constraints, a DAG such as the one in Figure 1(b) will be necessary. Notice that in general this will call for extra dependencies that did not exist originally (such as Y3 and Y4 now being marginally dependent). Not only learning from data will be more difficult due to the extra dependencies, but specifying prior knowledge on the parameters becomes less intuitive and therefore more error prone. In general, it will be the case that variables of interest have hidden common causes. This puts the researcher using DAGs in a difficult position: if she models only the marginal comprising the variables of interest, the DAG representation might not be suitable anymore. If she includes all hidden variables for the sake of having the desirable set of independencies, extra assumptions about hidden variables will have to be taken into account. In this sense, the DAG representation is flawed. There is a need for a richer family of graphical models, for which mixed graphs are an answer. Directed mixed graphs (DMGs) are graphs with directed and bi-directed edges. In particular, acyclic directed mixed graphs (ADMGs) have no directed cycle, that is, no sequence of directed edges X → · · · → X that starts and ends on the same node. Such a representation encodes a set of conditional independencies among random variables, which can be read off a graph by using a 1188 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y1 Y2 Y3 Y1 (a) Y2 Y3 Y1 (b) Y2 Y3 Y1 Y2 (c) Y3 (d) Figure 2: Different examples of directed mixed graphs. The graph in (b) is cyclic, while all others are acyclic. A subgraph of two variables where both edges Y1 → Y2 and Y1 ↔ Y2 are present is sometimes known as a “bow pattern” (Pearl, 2000) due to its shape. Y1 Y1 Y2 H1 Y4 Y3 H2 Y1 Y2 Y4 Y3 Y2 Y3 Y4 Figure 3: After marginalizing variables H1 and H2 from the DAG on the left, one possible DMG representation of the same dependencies is shown by the graph in the middle. Notice that there are multiple DMGs within a same Markov equivalence class, that is, encoding the same set of conditional independencies (Richardson and Spirtes, 2002). The two last graphs above are on the same class. criterion known as m-separation, a natural extension of the d-separation criterion used for directed acyclic graphs (Richardson, 2003). In a ADMG, two adjacent nodes might be connected by up to two edges, where in this case one has to be bi-directed and the other directed. A cyclic model can in principle allow for two directed edges of opposite directions. Figure 2 provides a few examples of DMGs. 
The appeal of this graphical family lies on the representation of the marginal independence structure among a set of observed variables, assuming they are part of a larger DAG structure that includes hidden variables. This is illustrated in Figure 3.1 More details on DMGs are given in Sections 2 and 8. In our blood pressure\liver status multiple regression problem, the suitable directed mixed graph is depicted in Figure 1(c). The contribution of this paper is how to perform Bayesian inference on two different families of mixed graph models: Gaussian and probit. Markov chain Monte Carlo (MCMC) and variational approximations will be discussed. Current Bayesian inference approaches for DMG models have limitations, as discussed in Section 2, despite the fact that such models are widely used in several sciences. The rest of the paper is organized as follows. Section 3 describes a special case of Gaussian mixed graph models, where only bi-directed edges are allowed. Priors and a Monte Carlo algorithm are described. This case will be a building block for subsequent sections, such as Section 4, where 1. Notice that it is not necessarily the case that the probability model itself is closed under marginalization. This will happen to some models, including the Gaussian model treated in this paper. But the basic claim of closure concerns the graph, that is, the representation of independence constraints. 1189 S ILVA AND G HAHRAMANI Gaussian DMG models are treated. Section 5 covers a type of discrete distribution for binary and ordinal data that is Markov with respect to an acyclic DMG. In Section 6 we discuss more sophisticated algorithms that are useful for scaling up Bayesian learning to higher-dimensional problems. Section 7 presents several empirical studies. Since the use of mixed graph models in machine learning applications is still in its early stages, we briefly describe in Section 8 a variety of possible uses of such graphs in machine learning applications. 2. Basics of DMGs, Gaussian Models and Related Work In this section, we describe the Gaussian DMG model and how it complements latent variable models. At the end of the section, we also discuss a few alternative approaches for the Bayesian inference problem introduced in this paper. 2.1 Notation and Terminology In what follows, we will use standard notions from the graphical modeling literature, such as vertex (node), edge, parent, child, ancestor, descendant, DAG, undirected graph, induced subgraph, Markov condition and d-separation. Refer to Pearl (1988) and Lauritzen (1996) for the standard definitions if needed. Less standard definitions will be given explicitly when appropriate. A useful notion is that of m-separation (Richardson, 2003) for reading off which independencies are entailed by a DMG representation. This can be reduced to d-separation (Pearl, 1988) by the following trick: for each bi-directed edge Yi ↔ Y j , introduce a new hidden variable Xi j and the edges Xi j → Yi and Xi j → Y j . Remove then all bi-directed edges and apply d-separation to the resulting directed graph. As usual, we will refer to vertices (nodes) in a graph and the corresponding random variables in a distribution interchangeably. Data points are represented by vectors with an upper index, such as ( j) Y(1) , Y(2) , . . . , Y(n) . The variable corresponding to node Yi in data point Y( j) is represented by Yi . 
2.2 Gaussian Parameterization The origins of mixed graph models can be traced back to Sewall Wright (Wright, 1921), who used special cases of mixed graph representations in genetic studies. Generalizing Wright’s approach, many scientific fields such as psychology, social sciences and econometrics use linear mixed graph models under the name of structural equation models (Bollen, 1989). Only recently the graphical and parametrical aspects of mixed graph models have been given a thorough theoretical treatment (Richardson and Spirtes, 2002; Richardson, 2003; Kang and Tian, 2005; Drton and Richardson, 2008a). In practice, many structural equation models today are Gaussian models. We will work under this assumption unless stated otherwise. For a DMG G with a set of vertices Y, a standard parameterization of the Gaussian model is given as follows. For each variable Yi with a (possibly empty) parent set {Yi1 , ...,Yik }, we define a “structural equation” Yi = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi where εi is a Gaussian random variable with zero mean and variance vii . Notice that this parameterization allows for cyclic models. Unlike in standard Gaussian DAG models, the error terms {εi } are not necessarily mutually independent. Independence is asserted by the graphical structure: given two vertices Yi and Y j , 1190 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS the respective error terms εi and ε j are marginally independent if Yi and Y j are not connected by a bi-directed edge. By this parameterization, each directed edge Yi ← Y j in the graph corresponds to a parameter bi j . Each bi-directed edge Yi ↔ Y j in the graph is associated with a covariance parameter vi j , the covariance of εi and ε j . Each vertex Y j in the graph is associated with variance parameter v j j , the variance of ε j . Algebraically, let B be a m × m matrix, m being the number of observed variables. This matrix is such that (B)i j = bi j if Yi ← Y j exists in the graph, and 0 otherwise. Let V be a m × m matrix, where (V)i j = vi j if i = j or if Yi ↔ Y j is in the graph, and 0 otherwise. Let Y be the column vector of observed variables, α the column vector of intercept parameters, and ε be the corresponding vector of error terms. The set of structural equations can be given in matrix form as Y = BY + α + ε ⇒ Y = (I − B)−1 (ε + α) ⇒ Σ(Θ) = (I − B)−1 V(I − B)−T (1) where A−T is the transpose of A−1 and Σ(Θ) is the implied covariance matrix of the model, Θ ≡ {B, V, α}. 2.2.1 C OMPLETENESS OF PARAMETERIZATION AND A NCESTRAL G RAPHS An important class of ADMGs is the directed ancestral graph. Richardson and Spirtes (2002) provide the definition and a thorough account of the Markov properties of ancestral graphs. One of the reasons for the name “ancestral graph” is due to one of its main properties: if there is a directed path Yi → · · · → Y j , that is, if Yi is an ancestor of Y j , then there is no bi-directed edge Yi ↔ Y j . Thus directed ancestral graphs are ADMGs with this constraint.2 In particular, they show that any Gaussian distribution that is Markov with respect to a given ADMG can be represented by some Gaussian ancestral graph model that is parameterized as above. For the ancestral graph family, the given parameterization is complete: that is, for each Markov equivalence class, it is always possible to choose an ancestral graph where the resulting parameterization imposes no further constraints on the distribution besides the independence constraints of the class. 
Since the methods described in this paper apply to general DMG models, they also apply to directed ancestral graphs. In principle, it is possible to define and parameterize a Gaussian DAG model that entails exactly the same independence constraints encoded in an directed ancestral graph. One possibility, as hinted in the previous Section, is to replace each bi-directed edge Yi ↔ Y j by a new path Yi ← Xi j → Y j . Variables {Xi j } are “ancillary” hidden variables, in the sense that they are introduced for the sake of obtaining the same independence constraints of an ancestral graph. Standard Bayesian methodology can then be applied to perform inference in this Gaussian DAG model. However, this parameterization might have undesirable consequences, as discussed in Section 8.6 of Richardson and Spirtes (2002). Moreover, when Markov chain Monte Carlo algorithms are applied to compute posteriors, the “ancillary” hidden variables will have to be integrated out numerically. The resulting Markov chain can suffer from substantial autocorrelation when compared to a model with no ancillary variables. We illustrate this behavior in Section 7. Further constraints beyond independence constraints are certainly desirable depending on the context. For instance, general ADMGs that are not ancestral graphs may impose other constraints (Richardson and Spirtes, 2002), and such graphs can still be sensible models of, for example, the 2. Notice this rules out the possibility of having both edges Yi → Y j and Yi ↔ Y j in the same ancestral graph. 1191 S ILVA AND G HAHRAMANI causal processes for the problem at hand. When many observed variables are confounded by a same hidden common cause, models based on factor analysis are appropriate (Silva et al., 2006). However, it is useful to be able to build upon independence models that are known to have a complete parameterization. In any case, even the latent variables in any model might have dependencies that arise from other latent variables that were marginalized, and a latent variable ADMG model will be necessary. When it comes to solving a problem, it is up to the modeler (or learning algorithm) to decide if some set of latent variables should be included, or if they should be implicit, living their hidden life through the marginals. Richardson and Spirtes (2002) provide further details on the advantages of a complete parameterization. Drton and Richardson (2004) provide an algorithm for fitting Gaussian ancestral graph models by maximum likelihood. 2.3 Bayesian Inference The literature on Bayesian structural equation models is extensive. Scheines et al. (1999) describe one of the first approaches, including ways of testings such models. Lee (2007) provides details on many recent advances. Standard Bayesian approaches for Gaussian DMG models rely on either attempting to reduce the problem to inference with DAG models, or on using rejection sampling. In an application described by Dunson et al. (2005), the “ancillary latent” trick is employed, and Gibbs sampling for Gaussian DAG models is used. This parameterization has the disadvantages mentioned in the previous section. Scheines et al. (1999) use the complete parameterization, with a single parameter corresponding to each bi-directed edge. However, the global constraint of positive-definiteness in the covariance matrix is enforced only by rejection sampling, which might be inefficient in models with moderate covariance values. The prior is setup in an indirect way. 
A Gaussian density function is independently defined for each error covariance vi j . The actual prior, however, is the result of multiplying all of such functions and the indicator function that discards non-positive definite matrices, which is then renormalized. In contrast, the Bayesian approach delineated in the next sections uses the complete parameterization, does not appeal to rejection sampling, makes use of a family of priors which we believe is the natural choice for the problem, and leads to convenient ways of computing marginal likelihoods for model selection. We will also see that empirically they lead to much better behaved Markov chain Monte Carlo samplers when compared to DAGs with ancillary latent variables. 3. Gaussian Models of Marginal Independence This section concerns priors and sampling algorithms for zero-mean Gaussian models that are Markov with respect to a bi-directed graph, that is, a DMG with no directed edges. Focusing on bi-directed graphs simplifies the presentation, while providing a convenient starting point to solve the full DMG case in the sequel. Concerning the notation: the distribution we introduce in this section is a distribution over covariance matrices. In the interest of generality, we will refer to the random matrix as Σ. In the context of the previous section, Σ ≡ Σ(Θ) = V, since we are assuming B = 0, α = 0. 1192 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 3.1 Priors Gaussian bi-directed graph models are sometimes called covariance graph models. Covariance graphs are models of marginal independence: each edge corresponds to a single parameter in the covariance matrix (the corresponding covariance); the absence of an edge Yi ↔ Y j is a statement that σYiY j = 0, σXY being the covariance of random variables X and Y . More precisely, if Σ is a random covariance matrix generated by a covariance model, a distribution of Σ is the distribution over the (non-repeated) entries corresponding to variances and covariances of adjacent nodes.3 In a model with a fully connected bi-directed graph, this reduces to a space of unrestricted covariance matrices. A common distribution for covariance matrices is the inverse Wishart IW (δ, U). In this paper, we adopt the following inverse Wishart parameterization: 1 p(Σ) ∝ |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ positive definite, 2 p(·) being the density function, tr(·) the trace function, and m the number of variables (nodes) in our model.4 We will overload the symbol p(·) wherever it is clear from the context which density function we are referring to. It is assumed that δ > 0 and U is positive definite. Following Atay-Kayis and Massam (2005), let M + (G ) be the cone of positive definite matrices such that, for a given bi-directed graph G and Σ ∈ M + (G ), σi j = 0 if nodes Yi and Y j are not adjacent in G . It is convenient to choose a distribution that is conjugate to the Gaussian likelihood function, since one can use the same algorithms for performing inference both in the prior and posterior. In a zero-mean Gaussian model, the likelihood function for a fixed data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined by the sufficient statistic S = ∑n (Y(d) )(Y(d) )T as follows: d=1 1 2 L (Σ; D ) = (2π)−nm/2 |Σ|−n/2 exp − tr(Σ−1 S) . (2) We extend the inverse Wishart distribution to the case of constrained covariance matrices in order to preserve conjugacy. 
This define the following distribution: p(Σ) = 1 1 |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ ∈ M + (G ) IG (δ, U) 2 (3) which is basically a re-scaled inverse Wishart prior with a different support and, consequently, different normalizing constant IG (δ, U). An analogous concept exists for undirected graphs, where Σ−1 ∈ M + (G ) is given a Wishart-like prior: the “G -Wishart” distribution (Atay-Kayis and Massam, 2005). We call the distribution with density function defined as in Equation (3) the G -Inverse Wishart distribution (G -IW ). It will be the basis of our framework. There are no analytical formulas for the normalizing constant. 3. As such, the density function for Σ is defined with respect to the Lebesgue measure of the non-zero, independent elements of this matrix. 4. We adopt this non-standard parameterization of the inverse Wishart because it provides a more convenient reparameterization used in the sequel. Notice this is the parameterization used by Brown et al. (1993) and Atay-Kayis and Massam (2005), which developed other distributions for covariance matrices. 1193 S ILVA AND G HAHRAMANI 3.2 The Normalizing Constant We now derive a Monte Carlo procedure to compute IG (δ, U). In the sequel, this will be adapted into an importance sampler to compute functionals of a G -IW distribution. The core ideas are also used in a Gibbs sampler to obtain samples from its posterior. The normalizing constant is essential for model selection of covariance graphs. By combining the likelihood equation (2) with the prior (3), we obtain the joint nm p(D , Σ | G ) = (2π)− 2 IG (δ, U)−1 × |Σ|− δ+2m+n 2 1 exp − tr[Σ−1 (S + U)] 2 where we make the dependency on the graphical structure G explicit. By the definition of IG , integrating Σ out of the above equation implies the following marginal likelihood: p(D | G ) = 1 IG (δ + n, S + U) nm IG (δ, U) (2π) 2 from which a posterior P (G | D ) can be easily derived as a function of quantities of the type IG (·, ·). The normalizing constant IG (δ, U) is given by the following integral:5 IG (δ, U) = Z M + (G ) |Σ|− δ+2m 2 1 exp − tr(Σ−1 U) dΣ. 2 (4) The space M + (G ) can be described as the space of positive definite matrices conditioned on the event that each matrix has zero entries corresponding to non-adjacent nodes in graph G . We will reduce the integral (4) to an integral over random variables we know how to sample from. The given approach follows the framework of Atay-Kayis and Massam (2005) using the techniques of Drton and Richardson (2003). Atay-Kayis and Massam (2005) show how to compute the marginal likelihood of nondecomposable undirected models by reparameterizing the precision matrix through the Cholesky decomposition. The zero entries in the inverse covariance matrix of this model correspond to constraints in this parameterization, where part of the parameters can be sampled independently and the remaining parameters calculated from the independent ones. We will follow a similar framework but with a different decomposition. It turns out that the Cholesky decomposition does not provide an easy reduction of (4) to an integral over canonical, easy to sample from, distributions. We can, however, use Bartlett’s decomposition to achieve this reduction. 3.2.1 BARTLETT ’ S D ECOMPOSITION Before proceeding, we will need a special notation for describing sets of indices and submatrices. Let {i} represent the set of indices {1, 2, . . . , i}. Let Σi,{i−1} be the row vector containing the covariance between Yi and all elements of {Y1 ,Y2 , . . . ,Yi−1 }. 
Let Σ{i−1},{i−1} be the marginal covariance matrix of {Y1 ,Y2 , . . . ,Yi−1 }. Let σii be the variance of Yi . Define the mapping Σ → Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }, 5. Notice this integral is always finite for any choice of δ > 0 and positive definite U, since it is no greater than the normalizing constant of the inverse Wishart. 1194 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS such that Bi is a row vector with i − 1 entries, γi is a scalar, and γ1 = σ11 , Bi = Σi,{i−1} Σ−1 i > 1, {i−1},{i−1} , −1 γi = σii.{i−1},{i−1} ≡ σii − Σi,{i−1} Σ{i−1},{i−1} Σ{i−1},i , i > 1. (5) The set Φ provides a parameterization of Σ, in the sense that the mapping (5) is bijective. Given that σ11 = γ1 , the inverse mapping is defined recursively by Σi,{i−1} = Bi Σ{i−1},{i−1} , i > 1, σii = γi + Bi Σ{i−1},i , i > 1. (6) We call the set Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } the Bartlett parameters of Σ, since the decomposition (6) is sometimes known as Bartlett’s decomposition (Brown et al., 1993). For a random inverse Wishart matrix, Bartlett’s decomposition allows the definition of its density function by the joint density of {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }. Define U{i−1},{i−1} , U{i−1},i and uii.{i−1},{i−1} in a way analogous to the Σ definitions. The next lemma follows directly from Lemma 1 of Brown et al. (1993): Lemma 1 Suppose Σ is distributed as IW (δ, U). Then the distribution of the corresponding Bartlett parameters Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } is given by: 1. γi is independent of Φ\{γi , Bi } 2. γi ∼ IG((δ + i − 1)/2, uii.{i−1,i−1} /2), where IG(α, β) is the inverse gamma distribution −1 3. Bi | γi ∼ N(U−1 {i−1},{i−1} U{i−1},i , γi U{i−1},{i−1} ), where N(M, C) is a multivariate Gaussian −1 distribution and U−1 {i−1},{i−1} ≡ (U{i−1},{i−1} ) . 3.2.2 BARTLETT ’ S D ECOMPOSITION OF M ARGINAL I NDEPENDENCE M ODELS What is interesting about Bartlett’s decomposition is that it provides a simple parameterization of the inverse Wishart distribution with variation independent parameters. This decomposition allows the derivation of new distributions. For instance, Brown et al. (1993) derive a “Generalized Inverted Wishart” distribution that allows one to define different degrees of freedom for different submatrices of an inverse Wishart random matrix. For our purposes, Bartlett’s decomposition can be used to reparameterize the G -IW distribution. For that, one needs to express the independent elements of Σ in the space of Bartlett parameters. The original reparameterization maps Σ to Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bd , γd }. To impose the constraint that Yi and Y j are uncorrelated, for i > j, is to set Bi Σ{i−1},{i−1} j = σYiY j (Φ) = 0. For a fixed Σ{i−1},{i−1} , this implies a constraint on (Bi ) j ≡ βi j . Following the terminology used by Richardson and Spirtes (2002), let a spouse of node Y in a mixed graph be any node adjacent to Y by a bi-directed edge. The set of spouses of Yi is denoted by sp(i). The set of spouses of Yi according to order Y1 ,Y2 , . . . ,Ym is defined by sp≺ (i) ≡ sp(i) ∩ {Y1 , . . . ,Yi−1 }. The set of non-spouses of Yi is denoted by nsp(i). Analogously, nsp≺ (i) ≡ {Y1 , . . . ,Yi−1 }\sp≺ (i). Let Bi,sp≺ (i) be the subvector of Bi corresponding to the the respective spouses of Yi . Define Bi,nsp≺ (i) analogously. 
1195 S ILVA AND G HAHRAMANI Given the constraint Bi Σ{i−1},nsp≺ (i) = 0, it follows that Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) + Bi,nsp≺ (i) Σnsp≺ (i),nsp≺ (i) = 0 ⇒ Bi,nsp≺ (i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp (7) Identity (7) was originally derived by Drton and Richardson (2003). A property inherited from the original decomposition for unconstrained matrices is that Bi,sp≺ (i) is functionally independent of Σ{i−1},{i−1} . From (7), we obtain that the free Bartlett parameters of Σ are ΦG ≡ {γ1 , B2,sp≺ (2) , γ2 , B3,sp≺ (3) , γ3 , . . . , Bm,sp≺ (m) , γm }. Notice that, according to (5), Φ corresponds to the set of parameters of a fully connected, zeromean, Gaussian DAG model. In such a DAG, Yi is a child of {Y1 , . . . ,Yi−1 }, and Yi = Bi Yi−1 + ζ j , ζ j ∼ N(0, γ j ) where Yi−1 is the (i − 1) × 1 vector corresponding to {Y1 , . . . ,Yi−1 }. As discussed by Drton and Richardson (2003), this interpretation along with Equation (7) implies Yi = Bi,sp≺ (i) Zi + ζ j (8) where the entries in Zi are the corresponding residuals of the regression of sp≺ (i) on nsp≺ (i). The next step in solving integral (4) is to find the Jacobian J(ΦG ) of the transformation Σ → ΦG . This is given by the following Lemma: Lemma 2 The determinant of the Jacobian for the change of variable Σ → ΦG is m |J(ΦG )| = ∏ |Ri | = i=2 m−1 1 γm−i m ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 i ∏ where Ri ≡ Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) , that is, the covariance matrix of the nsp / respective residual Zi (as parameterized by ΦG ). If nsp≺(i) = 0, Ri is defined as Σsp≺(i) ,sp≺(i) and |Σnsp≺(i) ,nsp≺(i) | is defined as 1. The proof of this Lemma is in Appendix C. A special case is the Jacobian of the unconstrained covariance matrix (i.e., when the graph has no missing edges): m−1 |J(Φ)| = ∏ γm−i . i (9) i=1 Now that we have the Jacobian, the distribution over Bartlett’s parameters given by Lemma 1, and the identities of Drton and Richardson (2003) given in Equation (7), we have all we need to provide a Monte Carlo algorithm to compute the normalizing constant of a G -IW with parameters (δ, U). Let Σ(ΦG ) be the implied covariance matrix given by our set of parameters ΦG . We start from the integral in (4), and rewrite it as a function of ΦG . This can be expressed by substituting Σ for Σ(ΦG ) and multiplying the integrand by the determinant of the Jacobian. Notice that the parameters in Σ(ΦG ) are variation independent: that is, their joint range is given by the product of 1196 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS their individual ranges (positive reals for the γ variables and the real line for the β coefficients). This range will replace the original M + (G ) space, which we omit below for simplicity of notation: IG (δ, U) = Z |J(ΦG )||Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . 2 We now multiply and divide the above expression by the normalizing constant of an inverse Wishart (δ, U), which we denote by IIW (δ, U): IG (δ, U) = IIW (δ, U) Z −1 |J(ΦG )| × IIW (δ, U)|Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . (10) 2 The expression −1 IIW (δ, U)|Σ|− δ+2m 2 1 exp − tr(Σ−1 U) 2 corresponds to the density function of an inverse Wishart Σ. Lemma 1 allows us to rewrite the inverse Wishart density function as the density of Bartlett parameters, but this is assuming no independence constraints. We can easily reuse the result of Lemma 1 as follows: 1. write the density of the inverse Wishart as the product of gamma-normal densities given in Lemma 1; 2. 
this expression contains the original Jacobian determinant |J(Φ)|. We have to remove it, since we are plugging in our own Jacobian determinant. Hence, we divide the reparameterized density by the expression in Equation (9). This ratio |J(ΦG )|/|J(Φ)| can be rewritten as m |J(ΦG )| |Ri | 1 = ∏ m−i = m |J(Φ)| ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 γi / where |Σnsp≺(i) ,nsp≺(i) | ≡ 1 if nsp≺ (i) = 0; 3. substitute each vector Bi,nsp≺ (i) , which is not a free parameter, by the corresponding expression −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp This substitution takes place into the original factors given by Bartlett’s decomposition, as introduced in Lemma 1: −(i−1)/2 p(Bi , γi ) = (2π)−(i−1)/2 γi |U{i−1},{i−1} |1/2 1 × exp − (BiT − Mi )T U{i−1},{i−1} (BiT − Mi ) 2γi (uii.{i−1},{i−1} /2)(δ+i−1)/2 −( δ+i−1 +1) 1 exp − uii.{i−1},{i−1} × γi 2 Γ((δ + i − 1)/2) 2γi where Mi ≡ U−1 {i−1},{i−1} U{i−1},i . Plugging in this in (10) results in 1197 (11) S ILVA AND G HAHRAMANI IG (δ, U) = IIW (δ, U) Z 1 ∏m |Σnsp≺(i) ,nsp≺(i) | i=2 m × p(γ1 ) ∏ p(Bi , γi ) dΦG . i=2 However, after substitution, each factor p(Bi , γi ) is not in general a density function for {Bi,sp≺ (i) , γi } and will include also parameters {B j,sp≺ ( j) , γ j }, j < i. Because of the non-linear relationships that link Bartlett parameters in a marginal independence model, we cannot expect to reduce this expression to a tractable distribution we can easily sample from. Instead, we rewrite each original density factor p(Bi , γi ) such that it includes all information about Bi,sp≺ (i) and γi within a canonical density function. That is, factorize p(Bi , γi ) as p(Bi , γi |Φi−1 ) = pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) × fi (Φi−1 ) (12) where we absorb any occurrence of Bi,sp≺ (i) within the sampling distribution and factorize the remaining dependence on previous parameters Φi−1 ≡ {γ1 , γ2 , B2,sp≺ (2) , . . . , γi−1 , Bi−1,sp≺ (i−1) } into a separate function.6 We derive the functions pb (·), pg (·) and fi (·) in Appendix A. The result is as follows. The density pb (Bi,sp≺ (i) |γi , Φi−1 ) is the density of a Gaussian N(Ki mi , γi Ki ) such that mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i (13) Ai = Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp where Uss Usn Uns Unn ≡ Usp≺ (i),sp≺ (i) Usp≺ (i),nsp≺ (i) Unsp≺ (i),sp≺ (i) Unsp≺ (i),nsp≺ (i) . (14) The density pg (γi |Φi−1 ) is the density of an inverse gamma IG(g1 , g2 ) such that δ + i − 1 + #nsp≺ (i) , 2 uii.{i−1},{i−1} + Ui = , 2 TU = Mi {i−1},{i−1} Mi − mT Ki mi . i g1 = g2 Ui where uii.{i−1},{i−1} was originally defined in Section 3.2.1. Finally, (i−1)−#sp≺ (i) 2 |Ki |1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 6. A simpler decomposition was employed by Silva and Ghahramani (2006) (notice however that paper used an incorrect expression for the Jacobian). The following derivation, however, can be adapted with almost no modification to define a Gibbs sampling algorithm, as we show in the sequel. 1198 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS / Density function pb (Bi,sp≺ (i) |·, ·) and determinant |Ki |1/2 are defined to be 1 if sp≺ (i) = 0. Ui TU / / is defined to be zero if nsp≺ (i) = 0, and Ui = Mi {i−1},{i−1} Mi if sp≺ (i) = 0. The original normalizing constant integral is the expected value of a function of ΦG over a factorized inverse gamma-normal distribution. 
The density function of this distribution is given below: m pI(δ,U) (ΦG ) = m ∏ pg (γi |Φi−1 ) ∏ pb (Bi,sp i=1 ≺ (i) |γi , Φi−1 ) . i=2 We summarize the main result of this section through the following theorem: Theorem 3 Let f (X) p(X) be the expected value of f (X) where X is a random vector with density p(X). The normalizing constant of a G -Inverse Wishart with parameters (δ, U) is given by m IG (δ, U) = IIW (δ, U) × fi (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | . pI(δ,U) (ΦG ) This can be further simplified to m IG (δ, U) = fi′ (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | (15) pI(δ,U) (ΦG ) where fi′ (Φi−1 ) ≡ (2π) #sp≺ (i) 2 |Ki (Φi−1 )|1/2 Γ((δ + i − 1 + #nsp≺ (i))/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 which, as expected, reduces IG (δ, U) to IIW (δ, U) when the graph is complete. (1) (2) A Monte Carlo estimate of IG (δ, U) is then given from (15) by obtaining samples {ΦG , ΦG , (M) . . . , ΦG } according to pI(δ,U) (·) and computing: (s) fi′ (Φi−1 ) 1 M m IG (δ, U) ≈ ∑∏ M s=1 i=1 |Σnsp ,nsp (Φ(s) )| i−1 ≺(i) ≺(i) where here we emphasize that Σnsp≺(i) ,nsp≺(i) is a function of ΦG as given by (6). 3.3 General Monte Carlo Computation If Y follows a Gaussian N(0, Σ) where Σ is given a G -IW (δ, U) prior, then from a sample D = {Y(1) , . . . , Y(n) } with sufficient statistic S = ∑n (Y(d) )(Y(d) )T , the posterior distribution for Σ d=1 given S will be a G -IW (δ + n, U + S). In order to obtain samples from the posterior or to compute its functionals, one can adapt the algorithm for computing normalizing constants. We describe an importance sampler for computing functionals, followed by a Gibbs sampling algorithm that also provides samples from the posterior. 1199 S ILVA AND G HAHRAMANI Algorithm S AMPLE GIW-1 Input: a m × m matrix U, scalar δ, bi-directed graph G , an ordering ≺ 1. Let Σ be a m × m matrix 2. Define functions sp≺ (·), nsp≺ (·) according to G and ordering ≺ 3. Sample σ11 from IG(δ/2, u11 /2) 4. For i = 2, 3, . . . , m 5. Sample γi ∼ IG((δ + i − 1 + #nsp≺ (i))/2, (uii.{i−1},{i−1} + Ui )/2) 6. Sample Bi,sp≺ (i) ∼ N(Ki mi , γi Ki ) 7. Set Bi,nsp≺(i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp 8. Set ΣT {i−1},i = Σi,{i−1} = Bi Σ{i−1},{i−1} 9. Set σii = γi + Bi Σi,{i−1} 10. Set w = ∏m fi′ (Φi−1 )/|Σnsp≺ (i),nsp≺ (i) | i=1 11. Return (w, Σ). Figure 4: A procedure for generating an importance sample Σ and importance weight w for computing functionals of a G -Inverse Wishart distribution. Variables {Mi , mi , Ki , Ui } and function fi′ (Φi−1 ) are defined in Section 3.2.2. 3.3.1 T HE I MPORTANCE S AMPLER One way of computing functionals of the G -IW distribution, that is, functions of the type g(δ, U; G ) ≡ Z M + (G ) g(Σ)p(Σ | δ, U, G ) dΣ is through the numerical average g(δ, U; G ) ≈ ∑M ws g(Σ(s) ) s=1 , ∑M ws s=1 where weights {w1 , w2 , . . . , wM } and samples {Σ(1) , Σ(2) , . . . , Σ(M) } are generated by an importance sampler. The procedure for computing normalizing constants can be readily adapted for this task using pI(δ,U) (·) as the importance distribution and the corresponding weights from the remainder factors. The sampling algorithm is shown in Figure 4. 3.3.2 T HE G IBBS S AMPLER While the importance sampler can be useful to compute functionals of the distribution, we will need a Markov chain Monte Carlo procedure to sample from the posterior. In the Gibbs sampling 1200 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Algorithm S AMPLE GIW-2 Input: a m × m matrix U, scalar δ, bi-directed graph G , a m × m matrix Σstart 1. Let Σ be a copy of Σstart 2. 
Define functions sp(·), nsp(·) according to G 3. For i = 1, 2, 3, . . . , m 4. Sample γi ∼ IG((δ + (m − 1) + #nsp(i))/2, (uii.{\i},{\i} + U\i )/2) 5. Sample Bi,sp(i) from a N(K\i m\i , γi K\i ) 6. Set Bi,nsp(i) = −Bi,sp(i) Σsp(i),nsp(i) Σ−1 nsp(i),nsp(i) 7. Set ΣT = Σi,{\i} = Bi Σ{\i},{\i} {\i},i 8. Set σii = γi + Bi Σi,{\i} 9. Return Σ. Figure 5: A procedure for generating a sampled Σ within a Gibbs sampling procedure. procedure, we sample the whole i-th row of Σ, for each 1 ≤ i ≤ m, by conditioning on the remaining independent entries of the covariance matrix as obtained on the previous Markov chain iteration. The conditional densities required by the Gibbs sampler can be derived from (12), which for a particular ordering ≺ implies m p(Σ; δ, U, G ) ∝ pg (γ1 ) ∏ pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) fi (Φi−1 ). i=2 By an abuse of notation, we used Σ in the left-hand side and the Bartlett parameters in the righ-hand side. The conditional density of {Bm,sp≺ (m) , γm } given all other parameters is therefore p(Bm,sp≺ (m) , γm |ΦG \{Bm,sp≺ (m) , γm }) = pb (Bm,sp≺ (m) |γm , Φm−1 )pg (γm |Φm−1 ) from which we can reconstruct a new sample of the m-th row/column of Σ after sampling {Bm,sp≺ (m) , γm }. Sampling other rows can be done by redefining a new order where the corresponding target variable is the last one. More precisely: let {\i} denote the set {1, 2, . . . , i − 1, i + 1, . . . , m}. The Gibbs algorithm is analogous to the previous algorithms. Instead of sp≺ (i) and nsp≺ (i), we refer to the original sp(i) and nsp(i). Matrices Σ{\i},{\i} and U{\i},{\i} are defined by deleting the respective i-th row and i-th columns. Row vector Σi,{\i} and scalar uii.{\i} are defined accordingly, as well as any other vector and matrix originally required in the marginal likelihood/importance sampling procedure. The algorithm is described in Figure 5. The procedure can be interpreted as calling a modification of the importance sampler with a dynamic ordering ≺i which, at every step, moves Yi to the end of the global ordering ≺. 1201 S ILVA AND G HAHRAMANI 3.4 Remarks The importance sampler suffers from the usual shortcomings in high-dimensional problems, where a few very large weights dominate the procedure (MacKay, 1998). This can result in unstable estimates of functionals of the posterior and the normalizing constant. The stability of the importance sampler is not a simple function of the number of variables in the domain. For large but sparse graphs, the number of parameters might be small. For large but fairly dense graphs, the importance distribution might be a good match to the actual distribution since there are few constraints. In Section 7, we performe some experiments to evaluate the sampler. When used to compute functionals, the Gibbs sampler is more computationally demanding considering the cost per step, but we expect it to be more robust in high-dimensional problems. In problems that require repeated calculations of functionals (such as the variational optimization procedure of Section 4.3), it might be interesting to run a few preliminary comparisons between the estimates of the two samplers, and choose the (cheaper) importance sampler if the estimates are reasonably close. Na¨vely, the Gibbs sampler costs O(m4 ) per iteration, since for each step we have to invert the ı matrix Σnsp{\i},nsp{\i} , which is of size O(m) for sparse graphs. However, this inversion can cost much less than O(m3 ) if sparse matrix inversion methods are used. 
Still, the importance sampler can be even more optimized by using the methods of Section 6. 4. Gaussian Directed Mixed Graph Models As discussed in Section 2, Gaussian directed mixed graph models are parameterized by the set with parameters Θ = {V, B, α}. Our prior takes the form p(Θ) = p(B)p(α)p(V). We assign priors for the parameters of directed edges (non-zero entries of matrix B) in a standard way: each parameter bi j is given a Gaussian N(cBj , sBj ) prior, where all parameters are marginally independent in the i i prior, that is, p(B) = ∏i j p(bi j ). The prior for intercept parameters α is analogous, with αi being a Gaussian N(cα , sα ). i i Recall from Equation (1) that the implied covariance of the model is given by the matrix Σ(Θ) = (I − B)−1 V(I − B)−T . Similarly, we have the implied mean vector µ(Θ) ≡ (I − B)−1 α. The likelihood function for data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined as L (Θ; D ) = |Σ(Θ)|−n/2 ∏n exp − 1 (Y(d) − µ(Θ))T Σ(Θ)−1 (Y(d) − µ(Θ)) d=1 2 = |(I − B)−1 ||V||(I − B)−T | −n/2 exp − 1 tr(V−1 (I − B)S(I − B)T ) 2 , where now S ≡ ∑n (Y(d) − µ(Θ))(Y(d) − µ(Θ))T . d=1 Given a prior G -IW (δ, U) for V, it immediately follows that the posterior distribution of V given the data and other parameters is V | {B, α, D } ∼ G -IW (δ + n, U + (I − B)S(I − B)T ). Therefore it can be sampled using the results from the previous section. Notice this holds even if the directed mixed graph G is cyclic. Sampling αi given {D , Θ\{αi }} can also be done easily for both cyclic and acyclic models: the ′ ′ ′ posterior is given by a normal N(cα /sα , 1/sα ) where i i i 1202 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS sα i cα i ′ ≡ ′ ≡ 1 + n(V−1 )ii , sα i n m m cα (d) (d) −1 −1 i , − ∑ bt pt Ypt α − n ∑ (V )it αt + ∑ ∑ (V )it Yt si pt d=1 t=1 t=1,t=i with pt being an index running over the parents of Yt in G . However, sampling the non-zero entries of B results in two different cases depending whether G is cyclic or not. We deal with them separately. 4.1 Sampling from the Posterior: Acyclic Case The acyclic case is simplified by the fact that I − B can be rearranged in a way it becomes lower triangular, with each diagonal element being 1. This implies the identity |(I−B)−1 ||V||(I−B)−T | = |V|, with the resulting log-likelihood being a quadratic function of the non-zero elements of B. Since the prior for coefficient bi j is Gaussian, its posterior given the data and all other parameters will be ′ ′ ′ the Gaussian N(cbj /sbj , 1/sbj ) where i i i ′ sbj i ′ cbj i ≡ ≡ n 1 (d) + (V−1 )ii ∑ (Y j )2 , b si j d=1 m n cbj i (d) (d) (d) + ∑ Y j ∑ (V−1 )it Yt − ∑ bt pt Ypt − αt . b si j d=1 t=1 pt ,(t,pt )=(i, j) (16) As before, pt runs over the indices of the parents of Yt in G . Notice that in the innermost (d) summation we exclude bi jY j . We can then sample bi j accordingly. It is important to notice that, in practice, better mixing behavior can be obtained by sampling the coefficients (and intercepts) jointly. The joint distribution is Gaussian and can be obtained in a way similar to the above derivation. The derivation of the componentwise conditionals is nevertheless useful in the algorithm for cyclic networks. 4.2 Sampling from the Posterior: Cyclic Case Cyclic directed graph models have an interpretation in terms of causal systems in equilibrium. The simultaneous presence of directed paths Yi → · · · → Y j and Y j → · · · → Yi can be used to parameterize instantaneous causal effects in a feedback loop (Spirtes, 1995). 
This model appears also in the structural equation modeling literature (Bollen, 1989). In terms of cyclic graphs as families of conditional independence constraints, methods for reading off constraints in linear systems also exist (Spirtes et al., 2000). The computational difficulty in the cyclic case is that the determinant |I − B| is no longer a constant, but a multilinear function of coefficients {bi j }. Because bi j will appear outside the exponential term, its posterior will no longer be Gaussian. From the definition of the implied covariance matrix Σ(Θ), it follows that |Σ(Θ)|−n/2 = (|I − B||V|−1 |I − B|)n/2 . As a function of coefficient bi j , |I − B| = (−1)i+ j+1Ci j bi j + k=m ∑ k=1,k= j 1203 (−1)i+k+1Cik bik , S ILVA AND G HAHRAMANI where Ci j is the determinant of respective co-factor of I − B, bik ≡ 0 if there is no edge Yi ← Yk , and bii ≡ −1. The resulting density function of bi j given D and Θ\{bi j } is ′ p(bi j |Θ\{bi j }, D ) ∝ |bi j − κi j | exp − n ′ (bi j − cbj /sbj )2 i i ′ 2sbj i , where κi j ≡ Ci−1 j k=m ∑ (−1)k− j+1Cik bik k=1,k= j ′ ′ and {cbj , sbj } are defined as in Equation (16). Standard algorithms such as Metropolis-Hastings can i i be applied to sample from this posterior within a Gibbs procedure. 4.3 Marginal Likelihood: A Variational Monte Carlo Approach While model selection of bi-directed graphs can be performed using a simple Monte Carlo procedure as seen in the previous Section, the same is not true in the full Gaussian DMG case. Approaches such as nested sampling (Skilling, 2006) can in principle be adapted to deal with the full case. For problems where there are many possible candidates to be evaluated, such a computationally demanding sampling procedure might be undesirable (at least for an initial ranking of graphical structures). As an alternative, we describe an approximation procedure for the marginal likelihood p(D |G ) by combining variational bounds (Jordan et al., 1998) with the G -Inverse Wishart samplers, and therefore avoiding a Markov chain over the joint model of coefficients and error covariances. This is described for acyclic DMGs only. We adopt the following approximation in our variational approach, accounting also for possible latent variables X: n p(V, B, α, X|D ) ≈ q(V)q(B, α) ∏ q(X(d) ) ≡ q(V)q(B, α)q(X) d=1 with q(B, α) being a multivariate Gaussian density of the non-zero elements of B and α. Function q(X(d) ) is also a Gaussian density, and function q(V) is a G -Inverse Wishart density. From Jensen’s inequality, we obtain the following lower-bound (Beal, 2003, p. 47): ln p(D |G ) = ln p(Y, X|V, B, α)p(V, B, α) dX dB dV dα ≥ ln p(Y, X|V, B, α) q(V)q(B,α)q(X) + ln p(V)/q(V) q(V) + ln p(B, α)/q(B, α) q(B,α) − ln q(X) q(X) R (17) where this lower bound can be optimized with respect to functions q(V), q(B), q(X). This can be done by iterative coordinate ascent, maximizing the bound with respect to a single q(·) function at a time. The update of q(V) is given by qnew (V) = pG -IW (δ + d, U + (I − B)S(I − B)T 1204 q(X)q(B,α) ) BAYESIAN L EARNING WITH M IXED G RAPH M ODELS where pG -IW (·) is the density function for a G -Inverse Wishart, and S is the empirical second moment matrix summed over the completed data set (X, Y) (hence the expectation over q(X)) centered at µ(Θ). The updates for q(B, α) and q(X) are tedious but straightforward derivations, and described in Appendix B. The relevant fact about these updates is that they are functions of V−1 q(V) . 
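These expectations can be estimated with the samplers already described; anticipating the next paragraph, the reconstruction of V^{-1} from a single draw of the Bartlett parameters takes only a few lines. A minimal sketch (function name is ours):

```python
import numpy as np

def v_inverse_from_bartlett(rows, gammas):
    """Reconstruct V^{-1} from one draw of the Bartlett parameters:
    V^{-1} = (I - B')^T Gamma^{-1} (I - B'), where B' is lower triangular with
    row i holding the coefficients of variable i on its predecessors (padded
    with zeros) and Gamma = diag(gammas)."""
    m = len(gammas)
    Bp = np.zeros((m, m))
    for i, B_i in enumerate(rows):   # rows[i] has length i (0-based indexing)
        Bp[i, :i] = B_i
    IB = np.eye(m) - Bp
    return IB.T @ np.diag(1.0 / np.asarray(gammas)) @ IB

# Averaging such reconstructions over (weighted) samples of the Bartlett
# parameters gives a Monte Carlo estimate of the expectation of V^{-1} under q(V).
```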
Fortunately, we pay a relatively small cost to obtain these inverses using the Monte Carlo sampler of Figure 4: from the Bartlett parameters, define a lower triangular m × m matrix B (by placing on the ith line the row vector Bi , followed by zeroes) and a diagonal matrix Γ from the respective vector of γi ’s. The matrix V−1 can be computed from (I − B )T Γ−1 (I − B ), and the relevant expectation computed according to the importance sampling procedure. For problems of moderate dimensionality,7 the importance sampler might not be recommended, but the Gibbs sampler can be used. At the last iteration of the variational maximization, the (importance or posterior) samples from q(V) can then be used to compute the required averages in (17), obtaining a bound on the marginal log-likelihood of the model. Notice that the expectation ln p(V)/q(V) q(V) contains the entropy of q(V), which will require the computation of G -inverse Wishart normalizing constants. For large problems, the cost of this approximation might still be prohibitive. An option is to partially parameterize V in terms of ancillary latents and another submatrix distributed as a G -inverse Wishart, but details on how to best do this partition are left as future work (this approximation will be worse but less computationally expensive if ancillary latents are independent of the coefficient parameters in the variational density function q(·)). Laplace approximations might be an alternative, which have been successfully applied to undirected non-decomposable models (Roverato, 2002). We emphasize that the results present in this section are alternatives that did not exist before in previous approaches for learning mixed graph structures through variational methods (e.g., Silva and Scheines, 2006). It is true that the variational approximation for marginal likelihoods will tend to underfit the data, that is, generate models simpler than the true model in simulations. Despite the bias introduced by the method, this is less of a problem for large data sets (Beal and Ghahramani, 2006) and the method has been shown to be useful in model selection applications (Silva and Scheines, 2006), being consistently better than standard scores such as BIC when hidden variables are present (Beal and Ghahramani, 2006). An application in prediction using the variational posterior instead of MCMC samples is discussed by Silva and Ghahramani (2006). It is relevant to explore other approaches for marginal likelihood evaluation of DMG models using alternative methods such as annealed importance sampling (Neal, 2001) and nested sampling (Skilling, 2006), but it is unrealistic to expect that such methods can be used to evaluate a large number of candidate models. A pre-selection by approximations such as variational methods might be essential. 5. Discrete Models: The Probit Case Constructing a discrete mixed graph parameterization is not as easy as in the Gaussian case. Advances in this area are described by Drton and Richardson (2008a), where a complete parameterization of binary bi-directed graph models is given. In our Bayesian context, inference with the mixed graph discrete models of Drton and Richardson would not to be any computationally easier than the case for Markov random fields, which has been labeled as doubly-intractable (Murray et al., 2006). 7. We observed a high ratio of the highest importance weight divided by the median weight in problems with dimensionality as low as 15 nodes. 
However, notice that in practice the error covariance matrix V has a block diagonal structure, and only the size of the largest block is relevant. This is explained in more detail in Section 6. 1205 S ILVA AND G HAHRAMANI Instead, in this paper we will focus on a class of discrete models that has been widely used in practice: the probit model (Bartholomew and Knott, 1999). This model is essentially a projection of a Gaussian distribution into a discrete space. It also allows us to build on the machinery developed in the previous sections. We will describe the parameterization of the model for acyclic DMGs, and then proceed to describe algorithms for sampling from the posterior distribution. 5.1 Parameterizing Models of Observable Independencies A probit model for the conditional probability of discrete variable Yi given a set of variables {Yi1 , ..., Yik } can be described by the two following relationships: P (Yi = vi l Yi⋆ = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi | Yi⋆ ) = 1(τi ≤ Yi⋆ < τi ) l−1 l (18) where P (·) is the probability mass function of a given random variable, as given by the context, and 1(·) is the indicator function. Yi assumes values in {vi , vi , . . . , vi }. Thresholds {τi = −∞ < τi < 1 1 0 1 κ(i) τi < · · · < τi = ∞} are used to define the mapping from continuous Yi⋆ to discrete Yi . This model 2 κ(i) has a sensible interpretation for ordinal and binary values as the discretization of some underlying latent variable (UV) Yi⋆ . Such a UV is a conditionally Gaussian random variable, which follows by assuming normality of the error term εi . This formulation, however, is not appropriate for general discrete variables, which are out of the scope of this paper. Albert and Chib (1993) describe alternative Bayesian treatments of discrete distributions not discussed here. Given this binary/ordinal regression formulation, the natural step is how to define a graphical model accordingly. As a matter of fact, the common practice does not strictly follow the probit regression model. Consider the following example: for a given graph G , a respective graphical representation of a probit model can be built by first replicating G as a graph G ⋆ , where each vertex Yi is relabeled as Yi⋆ . Those vertices represent continuous underlying latent variables (UVs). To each vertex Yi⋆ in G ⋆ , we then add a single child Yi . We call this the Type-I UV model. Although there are arguments for this approach (see, for instance, the arguments by Webb and Forster (2006) concerning stability to ordinal encoding), this is a violation of the original modeling assumption as embodied by G : if the given graph is a statement of conditional independence constraints, it is expected that such independencies will be present in the actual model. The Type-I formulation does not fulfill this basic premise: by construction there are no conditional independence constraints among the set of variables Y (the marginal independencies are preserved, though). This is illustrated by Figure 6(b), where the conditional independence of Y1 and Y3 given Y2 disappears. An alternative is illustrated in Figure 6(c). Starting from the original graph G (as in Figure 6(a)), the probit graph model G ⋆ shown in the Figure is built from G by the following algorithm: 1. add to empty graph G ⋆ the vertices Y of G , and for each Yi ∈ Y, add a respective UV Yi⋆ and the edge Yi⋆ → Yi ; 2. for each Yi → Y j in G , add edge Yi → Y j⋆ to G ⋆ ; 3. 
for each Yi ↔ Yj in G, add edge Yi⋆ ↔ Yj⋆ to G⋆.

We call this the Type-II UV model, which has the following property (the proof is in Appendix C):

Figure 6: The model in (a) has at least two main representations as a probit network. In (b), the original structure is given to the underlying variables, with observed variables being children of their respective latents. In (c), the underlying variable inherits the parents of the original variable and the underlying latents of the spouses.

Theorem 4 Suppose G is acyclic with vertex set Y. Yi and Yj are m-separated given Z ⊆ Y\{Yi, Yj} in G if and only if Yi and Yj are m-separated given Z in G⋆.

The parameterization of the Type-II UV model follows from the definition of probit regression: the conditional distribution of Yi given its parents {Yi1, . . . , Yik} in G is given as in Equation (18), while the error terms {ε1, ε2, . . . , εm} follow the multivariate Gaussian N(0, V). The entry corresponding to the covariance of εi and εj is assumed to be zero if there is no bi-directed edge Yi ↔ Yj in G. In what follows, we discuss algorithms for Type-II models. The approach described here can easily be adapted to cover Type-I models. We say that Type-II models are models of observable independencies, since the independencies hold even after marginalizing out all UVs.

5.2 Algorithm

As before, we provide a Gibbs sampling scheme to sample parameters Θ = {α, B, V, T} from the posterior distribution given data set D = {Y(1), Y(2), . . . , Y(n)}. The set T = {T_i} is the set of threshold parameters, T_i = {τ^i_0 = −∞ < τ^i_1 < τ^i_2 < · · · < τ^i_{κ(i)} = ∞} for each random variable Yi with κ(i) different values. We will not discuss priors and algorithms for sampling T given the other parameters: this can be done by standard approaches (e.g., Albert and Chib, 1993).8 For the purposes of the Gibbs procedure, we augment the data set with the underlying variables D⋆ = {Y⋆(1), Y⋆(2), . . . , Y⋆(n)} at each sampling step.

From the set of structural equations Y⋆(d) = α + BY(d) + ε, it follows that the conditional distribution of Y⋆(d) given D ∪ Θ is a truncated Gaussian with mean α + BY(d) and covariance matrix V. The truncation levels are given by the thresholds and the observed data Y(d): for each Y_i^(d) = v^i_l, the range for Y_i^⋆(d) becomes [τ^i_{l−1}, τ^i_l). Sampling from a truncated Gaussian is a standard procedure. We used the algorithm of Kotecha and Djuric (1999) in our implementation. To sample V from its conditional, we will rely on the following result.

8. In Section 7, we perform experiments with binary data only. In this case, the thresholds are set to fixed values: {τ^i_0 = −∞, τ^i_1 = 0, τ^i_2 = ∞} for all 0 ≤ i ≤ m.

Proposition 5 Let G be an acyclic DMG, and (α, B, V, T) be the respective set of parameters that defines the probit model. For a fixed (α, B, T), there is a bijective function f_{BαT}(·) mapping Y⋆ to ε. This is not true in general if G is cyclic.

Proof: If the graph is acyclic, this follows directly by recursively solving the model equations, starting from those corresponding to Y_j⋆ vertices with no parents. This results in ε = Y⋆ − α − BY, as expected. For cyclic graphs, the following model provides a counter-example. Let the graph be Y1⋆ → Y1 → Y2⋆ → Y2 → Y1⋆. Let the model be Y1⋆ = Y2 + ε1, Y2⋆ = Y1 + ε2, that is, b12 = b21 = 1 and α = 0.
Let the variables be binary, with a threshold at zero (Yi = 1 if and only if Yi⋆ ≥ 0). Then the two instantiations (Y1⋆ = −0.8,Y2⋆ = 0), (Y1⋆ = 0.2,Y2⋆ = 1) imply the same pair (ε1 = −0.8, ε2 = 0). The negative result for discrete models with cycles is the reason why such models are out of the scope of the paper. ⋆ Let Dε = {ε(1) , . . . , ε(n) }, where ε(d) = fBαT (y(d)⋆ ). Due to this bijection (and the determinism mapping Y⋆ to Y), the density p(V | Θ\V, D , D ⋆ ) = p(V | Θ\V, D ⋆ ) = p(V | Θ\V, y(1)⋆ , . . . , y(d)⋆ ) is equivalent to p(V | Θ\V, D ⋆ ) = = ∝ ∝ ⋆ p(V | α, B, T , D ⋆ , Dε ) ⋆) p(V | α, B, T , Dε ⋆ p(V | α, B, T )p(Dε | α, B, T , V) n (d) | V). p(V) ∏d=1 p(ε For the given data set D ∪ D ⋆ , define S⋆ as the sum of (Y⋆(d) − α − BY(d) )(Y⋆(d) − α − BY(d) )T over all d ∈ {1, 2, . . . , n}. Since p(ε | V) is normal with zero mean and covariance matrix V, the posterior for V given all other parameters and variables is V | {Θ\V, D , D ⋆ } ∼ G -IW (δ + n, U + S⋆ ). Sampling B and α is analogous to the Gaussian case, except that we have to consider that the left-hand side of the structural equations now refer to Y⋆ . We give the explicit conditional for αi , with the conditional for bi j being similarly adapted from Section 4. The posterior for αi is given by a normal N((s′ )−1 m′ , s′ ) where i i i sα i ′ cα i ′ = = 1 + n(V−1 )ii , sα i m n m cα (d) ⋆(d) i − n ∑ (V−1 )it αt + ∑ ∑ (V−1 )it Yt . − ∑ bt pt Ypt sα pt i d=1 t=1 t=1,t=i 5.3 A Note on Identifiability The scale of the underlying latent variables in the probit model is arbitrary. As such, it has been often suggested that such latents should have constant (e.g., unity) variance (Pitt et al., 2006). There are two usual arguments for fixing the variance: improving the interpretability of the model, and improving the mixing of the Markov chain. The interpretability argument is not particularly appealing within the Bayesian setting with proper priors, such as the one proposed in this paper: the posterior distribution of the parameters is well-defined by the prior uncertainty and the data. 1208 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The goal of improving the mixing of the chain might be important: if some parameters can assume arbitrary values and still allow for the same model over the observables, then fixing such parameters may help sampling by eliminating largely flat regions from the posterior (which will happen for large data sets and broad priors). In practice, however, scaling UVs might not be advantageous. In some cases it might increase the computational cost of each sampling step, while sampling from the non-scaled model might work just fine. Many MCMC algorithms work well on highly unidentifiable models such as multilayer perceptrons (Neal, 1996). In our experiments, we do not use any scaling. 5.4 Remarks It is clear that the given approach can be generalized to other generalized linear models by changing the link function that maps underlying latent variables (UVs) to observables. For instance, a model containing discrete and continuous variables can be constructed by using the identity link function instead of probit for the continuous variables. Notice that the continuous variables will not necessarily be marginally Gaussian if some of its parents are discrete. Other link functions will have different parameters besides thresholds, such as in multivalued (“polychotomous”) discrete distributions. A Bayesian account of Gaussian copula models is given by Pitt et al. 
(2006), to which a DMG-based family could in principle be defined. For continuous, marginally non-Gaussian, variables joined by a Gaussian copula, it is possible that all link functions are invertible. In this case, it is easier in principle to define cyclic models through Type-I UV models (e.g., Figure 6(b)) while preserving the observable independencies.

It is important to point out that Type-II probit models with Markov equivalent graphs will not, in general, be likelihood equivalent. A simple example is given by the two-node graphs Y1 → Y2 and Y1 ↔ Y2: if Y1 is binary, then the marginal for Y2 in the first case is equivalent to having an underlying latent variable that follows a mixture of two Gaussians. While some of these issues can be solved by adopting a mixture of Gaussians marginal independence model to account for bi-directed edges (Silva and Ghahramani, 2009), details need to be worked out. When the goal of model selection is to find causal structures (Spirtes et al., 2000), the usual guarantees of search methods based on Markov equivalence classes do not hold. However, it remains to be seen whether the parametric constraints implied by the Type-II formulation will allow for other consistent approaches to causal discovery, as shown in the case of non-linearities with additive noise (Hoyer et al., 2008).

6. Scaling Up: Factorizations and Perfect Sequences

Each Monte Carlo sampling step for the given mixed graph models is theoretically tractable, but not necessarily practical when the dimensionality m of the data is high. By using clever factorizations of the graph and orderings of the variables, it is sometimes possible to scale to high-dimensional problems. In this section, we describe approaches to minimize the run-time of the marginal likelihood computation for bi-directed graphs, which is also important for computing variational bounds for DMG models. We start, however, with a discussion of factorizations of the posterior density for the coefficient parameters B. The context is the Gibbs sampler for acyclic models.

Figure 7: The coefficients b31 and b32, represented as nodes in (a), become dependent after conditioning on Y. However, they are still independent of b54. This is a general property of DAG models. In DMG models, a sequence of bi-directed edges will connect extra coefficients. In graph (b), coefficients b21, b32 and b43 will all be dependent given Y. Coefficients into nodes in different districts will still be independent. The graph in (c) has districts {Y1, Y2, Y3, Y4} and {Y5, Y6}.

6.1 Factorizations

Our prior for coefficients {bij} is fully factorized. In directed acyclic graphs, this is particularly advantageous: coefficients corresponding to edges into different nodes are independent in the posterior.9 One can then jointly sample a whole set of {bij} coefficients with the same i index, with no concern for the other coefficients. Figure 7(a) illustrates this factorization. This means that, in Equation (16), the summation over t does not go over all variables, but only for t = i. This also follows from the fact that (V^{-1})_{it} = 0 unless i = t, since V is diagonal.

In ADMGs, however, this is not true anymore. For any pair of vertices linked by a path of bi-directed edges, for example, Yi ↔ Yi+1 ↔ · · · ↔ Yt, one will have in general that (V^{-1})_{it} ≠ 0.
This can be shown by using the graphical properties of the model when conditioning on some arbitrary datapoint Y:

Proposition 6 Let G be an acyclic DMG with vertex set Y, and G′ the DMG obtained by augmenting G with a vertex for each parameter bij and a respective edge bij → Yi. Then if there is a bi-directed path Yi ↔ · · · ↔ Yt in G, {bij, btv} are not m-separated given Y in G′.

Proof: The joint model for {Y, B} with independent priors on the non-zero entries of B is Markov with respect to G′. The sequence of bi-directed edges between Yi and Yt implies a path between bij and btv where every vertex but the endpoints is a collider in this path. Since every collider is in Y, this path is active.

This Proposition is illustrated by Figure 7(b). The practical implication is as follows: m-connection means that there is no further graphical property that would entail (V^{-1})_{it} = 0 (i.e., only particular cancellations in the expression of the inverse, unlikely to happen in practice, would generate such zeroes).

9. Sampling in Gaussian DAG models is still necessary if the model includes latent variables (Dunson et al., 2005).

Consider the maximal sets of vertices in an ADMG such that each pair of elements in this set is connected by a path of bi-directed edges. Following Richardson (2003), we call this a district.10 It follows that it is not possible in general to factorize the posterior of B beyond the set of districts of G. Figure 7(c) illustrates a factorization. Fortunately, for many DMG models with both directed and bi-directed edges found in practical applications (e.g., Bollen, 1989), the maximum district size tends to be considerably smaller than the dimensionality of the problem.

6.2 Perfect Sequences

It is still important to speed up marginal likelihood (or variational bound) computations for models with districts of moderate size, particularly if many models are to be evaluated. Without loss of generality, assume our graph G is a bi-directed graph with a single district, since the problem can be trivially separated into the disjoint bi-directed components. We will consider the case where the bi-directed graph is sparse: otherwise there is little to be gained by exploring the graphical structure. In that case, we will assume that the largest number of spouses of any node in G is bounded by a constant κ that is independent of the total number of nodes, m. The goal is to derive algorithms that are of lower complexity in m than the original algorithms.

The bottleneck of our procedure is the computation of the Σ^{-1}_{nsp≺(i),nsp≺(i)} matrices required in the mapping between independent and dependent Bartlett parameters (Equation 7), as well as computing the determinants |Σ_{nsp≺(i),nsp≺(i)}|. Since in sparse districts nsp≺(i) grows linearly with m, the cost of a naïve algorithm for a single sampling step is O(m^3) per node. Iterating over all nodes implies a cost of O(m^4) for a Monte Carlo sweep. Therefore, our goal is to find a procedure by which such mappings can be computed in less than O(m^3) time. The general framework is to reuse previous inverses and determinants instead of performing a full matrix inversion and determinant calculation for each Yi. The difficulty in applying low-rank updates when we traverse the covariance matrix according to ≺ is that the sets of non-spouses nsp≺(i) and nsp≺(i + 1) might differ arbitrarily.
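As a brief aside to the factorization discussion of Section 6.1, the districts themselves are cheap to compute: they are simply the connected components of the bi-directed part of the graph. A minimal sketch (hypothetical helper, 0-based indices):

```python
def districts(m, bidirected_edges):
    """Connected components of the bi-directed part of an ADMG: each component is
    a district, and coefficient rows for nodes in different districts can be
    sampled independently. `bidirected_edges` is an iterable of (i, j) pairs."""
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for i, j in bidirected_edges:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(m):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# Example: with bi-directed edges linking Y1-Y4 and Y5-Y6 as in Figure 7(c),
# the function returns the districts {Y1, Y2, Y3, Y4} and {Y5, Y6}.
```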
We want sensible orderings where such sets vary slowly and allow for efficient low-rank updates, if any.

The foundation of many scaling-up procedures for graphical models is the graph decomposition by clique separators (Tarjan, 1985), usually defined for undirected graphs. The definition for bi-directed graphs is analogous. Such a decomposition identifies overlapping prime subgraphs {G_P(1), G_P(2), . . . , G_P(k)} of the original graph G. A prime graph is a graph that cannot be partitioned into a triple (Y′, S, Y′′) of non-empty sets such that S is a complete separator (i.e., S is a clique and removing S disconnects the graph). Notice that a clique is also a prime subgraph. The prime components of a graph can be ordered in a perfect sequence {Y_P(1), . . . , Y_P(k)} of subsets of Y (Roverato, 2002; Lauritzen, 1996). Define H_j ≡ Y_P(1) ∪ · · · ∪ Y_P(j) as the history of the perfect sequence up to the j-th subgraph. Let R_j ≡ Y_P(j)\H_{j−1} be the residual of this history (with R_1 ≡ Y_P(1)), and S_j ≡ H_{j−1} ∩ Y_P(j) the separator. In a perfect sequence, the triple (H_{j−1}\S_j, S_j, R_j) forms a decomposition of the subgraph of G induced by the vertex set H_j. Surprisingly, although bi-directed and undirected graph models have very different Markov properties (in undirected models, conditioning removes dependencies; in bi-directed models, it adds dependencies), perfect prime graph sequences prove to be useful here as well, but in an entirely different way. The next subsection describes the use of prime graph decompositions in a particularly interesting class of bi-directed graphs: the decomposable case. The general case is treated in the sequel.

10. Kang and Tian (2005) call such structures c-components and reserve the word “district” to refer to the function mapping a vertex to its respective c-component, as originally introduced by Richardson (2003). We choose to overload the word and call “district” both the structure and the mapping.

Figure 8: [Left panel: a bi-directed graph over Y1–Y7; right panel: the sets V1 = {Y1, Y2, Y3}, V2 = {Y4, Y5}, V3 = {Y6}.] On the left, we have a bi-directed graph of 7 vertices arranged and ordered such that nodes are numbered by a depth-first numbering starting from “root” Y7, with {Y1, Y2, Y4, Y6} being leaves. Vertices {Y1, Y2, . . . , Y6} can be partitioned as the union ∪_{t=1}^{3} V_t, as illustrated on the right.

6.2.1 Decomposable Models

In a recursively decomposable graph, all prime subgraphs are cliques. We will assume that any perfect sequence in this case contains all and only the (maximal) cliques of the graph. The resulting decomposition can be interpreted as a hypergraph where nodes are the maximal cliques of the original graph, and edges correspond to the separators. In the statistics literature, a decomposable model is defined as a model that is Markov with respect to a recursively decomposable undirected graph (Lauritzen, 1996). Its widespread presence in applications of Markov random fields is due to its nice computational properties, with tree-structured distributions being a particular case. Our definition of bi-directed decomposable models is analogous: a model Markov with respect to a recursively decomposable bi-directed graph.

Given the residual sequence {R_1, R_2, . . . , R_k} obtained through a perfect sequence of maximal cliques of G, we define a perfect ordering ≺ by numbering nodes in R_t before nodes in R_1, . . . , R_{t−1}, 1 < t ≤ k, and ordering nodes according to this numbering.11 Any ordering that satisfies this restriction is a perfect ordering.
Such an ordering has the following property. Theorem 7 Let G be a recursively decomposable bi-directed graph such that the indexing of its vertices Y = {Y1 ,Y2 , . . . ,Ym } follows a perfect ordering ≺. Then for each 1 < i ≤ m, the set K(i) {Y1 ,Y2 , . . . ,Yi−1 } can be partitioned as ∪t=1 Vt such that: 1. each Vt induces a connected subgraph of G , and for each Yt ∈ Vt and Yt ′ ∈ Vt ′ , t = t ′ , Yt is not adjacent to Yt ′ in G ; 11. Lauritzen (1996) describes other uses of perfect sequences in undirected graphs. Notice that the notion of perfect numbering described by Lauritzen (1996) is not equivalent to our notion of perfect ordering, which is always derived from a perfect sequence of prime graphs. 1212 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 2. for each {Yp ,Yq } ⊆ Vt , if Yp is a spouse of Yi , and Yq is a non-spouse of Yi , then p > q; The proof is in Appendix C. This result is easier to visualize in trees. One can take as a perfect ordering some depth-first ordering for a given choice of root. Then for each vertex Yi , the set {Y1 ,Y2 , . . . ,Yi−1 } is partitioned according to the different branches “rooted” at Yi . The starting point of each branch is a spouse of Yi , and all other vertices are non-spouses of Yi . The ordering result then follows directly from the definition of depth-first traversal, as illustrated in Figure 8. Let Σ be the covariance matrix of a bi-directed decomposable model with graph G , where Σ follows a G -inverse Wishart distribution. Let ≺ be a perfect ordering for G . By the construction of Bartlett’s decomposition, mapping between parameters is given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) , nsp the computational bottleneck being the inversion. Notice this corresponds to the multiple regression coefficients of sp≺ (i) on nsp≺ (i). But according to Theorem 7, using a perfect ordering implies that within each Vs for a fixed Yi , all preceding non-spouses of Yi are ordered before the preceding spouses. Elements {Yp ,Yq } in different Vs are marginally independent given {Y1 , . . . ,Yi−1 }\{Yp ,Yq }. This implies that the regression coefficient of spouse Yp on non-spouse Yq will be zero if Yp and Yq are on different components Vs , and will be identical to the previously computed B p,q if they are in the same component. Splitting the set {Y1 ,Y2 , . . .Yi−1 } into preceding spouses Ysp≺ (i) and nonspouses Ynsp≺ (i) , we have Ysp≺ (i) = Bsp≺ (i),sp≺ (i) Ysp≺ (i) + Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ⇒ Ysp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 (Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ) where each ε j is an independent Gaussian with variance γ j , and each element (p, q) in Bsp≺ (i),nsp≺ (i) corresponds to the known (i.e., previously computed) regression coefficient of the spouse Yp on the non-spouse Yq . Matrix Bsp≺ (i),sp≺ (i) is defined analogously. Hence, the regression coefficients of Ysp≺ (i) on Ynsp≺ (i) are given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 Bsp≺ (i),nsp≺ (i) . nsp (19) No inversion of Σnsp≺ (i),nsp≺ (i) is ever necessary. Moreover, the determinant |Σnsp≺ (i),nsp≺ (i) | is given by ∏{q s.t. Yq ∈nsp≺ (i)} γq , since all non-spouses precede the spouses (which means their marginal covariance matrix is given by the previously computed Bartlett parameters). 
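A minimal sketch of Equation (19), assuming the relevant blocks of previously computed Bartlett coefficients have already been gathered into matrices (names are ours): with rows and columns arranged consistently with the perfect ordering, I − B_{sp≺(i),sp≺(i)} is lower triangular with a unit diagonal, so the coefficients are obtained by a cheap triangular solve and no inversion of Σ_{nsp≺(i),nsp≺(i)} is needed.

```python
import numpy as np
from scipy.linalg import solve_triangular

def regression_coefficients_eq19(B_sp_sp, B_sp_nsp):
    """Equation (19):
    Sigma_{sp,nsp} Sigma_{nsp,nsp}^{-1} = (I - B_{sp,sp})^{-1} B_{sp,nsp},
    computed from previously stored Bartlett coefficients. B_sp_sp is the
    spouse-by-spouse block, B_sp_nsp the spouse-by-non-spouse block."""
    I = np.eye(B_sp_sp.shape[0])
    # (I - B_{sp,sp}) is lower triangular under a perfect ordering.
    return solve_triangular(I - B_sp_sp, B_sp_nsp, lower=True)
```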
Hence, calculating B_{i,nsp≺(i)} for all 1 ≤ i ≤ m according to a perfect ordering has as its bottleneck the inversion (of a triangular matrix) and multiplication in Equation (19), with a cost of O(κ^2 + mκ^2), κ being the maximum number of spouses for any given node. The cost of the remaining operations for the i-th stage in the importance sampler is O(κ^3). As a function of m, the cost of the parameter sampling step falls from O(m^3) to O(m). The cost of computing the weights is dominated by the computation of K_i from Equation (13), which is O(κ^3 + κm^2) = O(m^2). Figure 9 illustrates the derivation of the new ordering in a tree-structured model.

Figure 9: The tree-structured (i.e., cycle-free) bi-directed graph in (a) has as maximal cliques the adjacent pairs. Such cliques can be ordered in a perfect sequence as shown in (b), where rectangles indicate the separators. Notice that R1 = {YA, YC}, R2 = {YB}, R3 = {YD}. One possible perfect ordering is {YD, YB, YC, YA}.

6.2.2 Non-decomposable Models

In a non-decomposable model, some prime graphs Y_P(t) will no longer be cliques. In what follows, we once again assume that ≺ is a perfect ordering. Unlike in the decomposable case, the product Σ_{sp≺(i),nsp≺(i)} Σ^{-1}_{nsp≺(i),nsp≺(i)} does not simplify in general. Instead we will focus only on fast methods to compute Σ^{-1}_{nsp≺(i),nsp≺(i)}.

As we shall see, the function of the perfect sequence is now to provide a sensible choice of which inverse submatrices {Σ^{-1}_{W,W}}, W ⊆ Y, to cache and reuse when computing Σ^{-1}_{nsp≺(i),nsp≺(i)}. The same can be done to compute the determinants |Σ_{nsp≺(i),nsp≺(i)}|.

A simple way of reusing the results from the previous section is by triangulating the non-decomposable graph G, transforming it into a decomposable one, G′, which is then used to generate the perfect sequence. We need to distinguish between the “true” spouses of a node Yi in G and the artificial spouses in G′ that result from the extra edges added. Let nsp≺G′(i) be the non-spouses of Yi in G′ that precede it according to ≺: by construction, these are also non-spouses of Yi in G. Let sp∆≺G′(i) be the spouses of Yi in G′ that are not spouses of Yi in G. That is, the set of preceding non-spouses of Yi in G is given by nsp≺(i) = nsp≺G′(i) ∪ sp∆≺G′(i).

Recall that the inverse of a partitioned matrix is given by the following identity:

[A B; C D]^{-1} = [A^{-1} + A^{-1}B(D − CA^{-1}B)^{-1}CA^{-1},   −A^{-1}B(D − CA^{-1}B)^{-1};   −(D − CA^{-1}B)^{-1}CA^{-1},   (D − CA^{-1}B)^{-1}].   (20)

In order to compute Σ^{-1}_{nsp≺(i),nsp≺(i)}, we consider its partitioned version

Σ^{-1}_{nsp≺(i),nsp≺(i)} = [Σ_{nsp≺G′(i),nsp≺G′(i)}   Σ_{nsp≺G′(i),sp∆≺G′(i)};   Σ_{sp∆≺G′(i),nsp≺G′(i)}   Σ_{sp∆≺G′(i),sp∆≺G′(i)}]^{-1}.   (21)

Let κ_nsp be the maximum number of non-spouses among all Yi within any prime subgraph induced by Y_P(t). By using relation (20), where we assume for now that we know A^{-1} ≡ Σ^{-1}_{nsp≺G′(i),nsp≺G′(i)}, the cost of computing (21) is O(m^2 κ_nsp) + O(κ_nsp^3) = O(m^2 κ_nsp) (the cost of computing D − CA^{-1}B is O(m^2 κ_nsp) + O(κ_nsp^2) = O(m^2 κ_nsp), while the cost of inverting it is O(κ_nsp^3)). Treating κ_nsp as a constant, this reduces the complexity of sampling the i-th row of Σ from O(m^3) to O(m^2). A similar procedure applies to the computation of the determinant |Σ_{nsp≺(i),nsp≺(i)}|, using in this case the relationship (26).
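The identity (20) transcribes directly into code. The sketch below (function name is ours) builds the full inverse from a cached A^{-1} and the remaining blocks; the Schur complement D − CA^{-1}B is the only matrix that needs to be inverted, and it is small when few fill-in spouses are introduced by the triangulation.

```python
import numpy as np

def partitioned_inverse(A_inv, B, C, D):
    """Inverse of the block matrix [[A, B], [C, D]] given A^{-1}, via Equation (20)."""
    S = D - C @ A_inv @ B                 # Schur complement
    S_inv = np.linalg.inv(S)
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bottom_left = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bottom_left, S_inv]])
```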
The advantage of using the perfect sequence is to allow for the computation of all A^{-1} ≡ Σ^{-1}_{nsp≺G′(i),nsp≺G′(i)} at a total cost, across all nodes, of O(m^3): each set nsp≺G′(i) is guaranteed to be equal to {Y1, Y2, . . . , Y_lns}, where Y_lns is the last non-spouse of Yi in G′ that antecedes Yi. This follows from the result in the previous section, since all non-spouses of a node in a decomposable graph precede its spouses. Therefore, if we store the inverse covariance matrices for {Y1, Y2, . . . , Yi}, 1 ≤ i ≤ m, we can obtain the required matrices A^{-1}. This requires the storage of O(m) matrices, and each matrix can be obtained from the previous one by a low-rank update (20) with an O(m^2) cost. Arbitrary orderings do not guarantee such an incremental pattern and, hence, no efficient low-rank updates. Notice that, depending on the problem, many of these inverse matrices can be dynamically removed from memory if they are not used by any node placed after a particular position.

6.3 Remarks

In Gaussian undirected models, the problem of covariance matrix sampling can also be reduced to sampling within each prime graph at the cost of O(|P|^4), |P| being the size of the largest prime component (Atay-Kayis and Massam, 2005). Since both κ and κ_nsp are O(|P|), our procedure costs O(m^2 |P|^2 + |P|^4) per prime graph, plus a cost of O(m^2) per node to compute the importance weights. Considering a number of m/|P| prime graphs and |P| < m, the total cost is O(m^3 |P|), down from O(m^4). For undirected models, the corresponding cost per sampling step using the perfect ordering decomposition is O(m|P|^3). The higher-order dependency on m in bi-directed models is to be expected, since the Markov blanket of any node Yt in a connected bi-directed graph is V\{Yt}. It is clear that inference with a given bi-directed graph model will never scale at the same rate as an undirected model with the same adjacencies, but this does not justify adopting an undirected representation if it is ill-suited to the problem at hand. One also has to consider that in problems with directed and bi-directed edges, the actual maximum district size might be much smaller than the number of variables.

For large problems, however, further approximation schemes will be necessary. Drton and Richardson (2008b) describe some reduction techniques for transforming bi-directed edges into directed edges such that the resulting Gaussian model remains the same. As future work, such methods could be adapted to the G-inverse Wishart sampling procedures and combined with the ordering techniques developed here into a single framework. It will also be interesting to develop similar schemes for the Gibbs sampler.

7. Experiments

We now evaluate the advantages of the Gaussian and probit models in Bayesian inference on real problems.

7.1 Industrialization and Democratization Study

Bollen (1989) describes a structural equation model of political and democratization factors within nations. “Democratization” and “industrialization” levels are abstract notions, but nevertheless of clearly observable impact. They are tied to empirical observations through different sets of indicators. For instance, an indicator of industrialization level is the gross national product. Hence, democratization and industrialization levels are here defined as scalar latent variables never observed directly, while the observed data is composed of indicators.
In this model, there is a total of three indicators of industrialization, and four indicators of democratization. Democratization is measured in a longitudinal study, where data was collected in two years (1960 and 1965). The indicators of democratization are pooled expert opinions summarized in an ordinal number scaled from 1 to 10. Following Bollen, we will treat the model as multivariate Gaussian, which provides an excellent fit (a p-value greater than 0.3 using a chi-square test) for a sample of 75 countries. The corresponding mixed graph is depicted in Figure 10, along with a description of all indicators. The graph is taken from Bollen (1989). Other hidden common causes affect the democratization indicators over time, but the nature of such hidden variables is irrelevant to the problem at hand: that is, the bi-directed edges are motivated by unmeasured causes of variability in the observed indicators that exist over time. For instance, the records of freedom of press in 1960 (Y4) and 1965 (Y8) co-vary due to other unmeasured factors not accounted for by the democratization factors.

Figure 10: A directed mixed graph representing dependencies between 11 observed political and economical indicators and three latent concepts (shaded nodes: Industrialization 1960, Democratization 1960, Democratization 1965) (Dunson et al., 2005; Bollen, 1989). The indicators are: 1. Gross national product (GNP) 1960; 2. Energy consumption per capita 1960; 3. Percentage of labor force in industry 1960; 4. Freedom of press 1960; 5. Freedom of opposition 1960; 6. Fairness of elections 1960; 7. Elective nature of legislative body 1960; 8. Freedom of press 1965; 9. Freedom of opposition 1965; 10. Fairness of elections 1965; 11. Elective nature of legislative body 1965.

Figure 11: [Boxplot panel titled “Factor scores: countries in the latent space”; vertical axis: Democratization (Dem1960, Dem1965); horizontal axis: Country (ordered by industrialization factor).] An embedding of 75 countries in a two-dimensional latent space: democratization level in 1960 and 1965. Boxplots of the Bayesian posterior distribution of the projection in the two dimensions are depicted in the vertical axis. Countries are arranged in the horizontal axis by the increasing order of their posterior expected industrialization level. Figure adapted from Dunson et al. (2005).

An example of a Bayesian inference application is shown in Figure 11. Boxplots of the posterior values of Democratization Level 1960 and Democratization Level 1965 are generated. Dunson et al. (2005) use this information to, for instance, find clusters of countries in the latent space. An example of a cluster is the one formed by the bottom 16 countries in the industrialization level ranking: the growing trend of democratization levels after the first 16 countries is interrupted. This type of analysis might provide new insights to a political scientist, for example, by revealing particular characteristics of such a group of nations.

7.1.1 Evaluating the MCMC Algorithm for Different Models

In our analysis, we fix to unity the coefficients corresponding to the edges Industrialization 1960 → Y1, Democratization 1960 → Y4 and Democratization 1965 → Y8, since the scale and sign of the latent variables is arbitrary. The intercept terms of the equations for Y1, Y4 and Y8 are set to zero, since the mean of the latents is also arbitrary. The resulting model is identifiable.
We apply the Gibbs sampling procedure to three different models: the Gaussian DMG model as described in this paper, and two modified DAG models. The first DAG model is the one described by Dunson et al. (2005), where each bi-directed edge is substituted by an “ancillary” latent (as mentioned in Section 2.3). For instance, the pathway corresponding to Y4 ↔ Y8 is substituted by the chain Y4 ← D48 → Y8, where D48 is unobserved. Dunson et al. further assume that all covariances due to such ancillary latents are positive. As such, the coefficients from Dij into {Yi, Yj} are set to unity, with the variance of Dij corresponding to the residual covariance of {Yi, Yj} given their parents. Means of ancillary latents are fixed at zero. However, even for covariance matrices with positive covariances, this parameterization is not complete. This result is evident from the fact that the variances of Yi and Yj will both be larger than their covariance, which is not true of covariance matrices in general. For this particular problem, however, this extra restriction provides no measurable difference in terms of fitness. It does serve as a reminder, however, that “intuitive” parameterizations might hide undesirable constraints.

Figure 12: [Three density plots, for the edges Industrialization1960 → Democratization1960, Industrialization1960 → Democratization1965 and Democratization1960 → Democratization1965.] Posterior distribution of parameters associated with the respective edges in the industrialization/democratization domain. Smoothed posterior obtained using the output of our Gibbs sampler and the DENSITY function of R 2.6.0.

Figure 13: [Trace plots for Industrialization1960 → Democratization1960, Democratization1960 → Y5 and Y7 ↔ Y11, plus a density plot for Y7 ↔ Y11.] The first three plots show the initial 5,000 iterations of a run of the Gibbs sampling algorithm for the DMG model for three different parameters associated with edges in the graph. The last plot depicts the posterior distribution of the error covariance associated with the edge Y7 ↔ Y11 (smoothed with the kernel density estimator from the statistical software R).

The second DAG model is an extension of the DAG model suggested by Dunson et al., the only difference being that the coefficients corresponding to edges Dij → Yi, i < j, are free to vary (instead of being fixed to 1). In general, there are Gaussian DMG models that cannot be parameterized this way (Richardson and Spirtes, 2002). Notice also that because of chains such as Democratization 1960 → Y4 ↔ Y8 ← Democratization 1965, the set of independence constraints in this graph can only be represented by a DAG if we include the ancillary latents Dij. That is, there is no DAG with
exactly the same set of independence constraints as the given DMG, unless ancillary latent variables are added.

Figure 14: [Two boxplot panels, “MCMC comparison: DMG vs. positive covariance DAG” and “MCMC comparison: DMG vs. unconstrained DAG”; vertical axis: Effective Sample Size; horizontal axis: Parameter.] Comparison of the effective sample size of the MCMC algorithm applied to the three models, DMG, DAG with positive covariances (posDAG) and general DAG, as explained in the main text. The horizontal axis ranges over the 66 independent entries of the observed covariance matrix, with one boxplot per entry. The boxplots are obtained from 80 independent chains initialized randomly, where each chain runs for 50,000 iterations.

We study the behavior of the MCMC algorithm for these three models.12 It turns out that the mixing properties of the chain are considerably affected by the choice of model. Recall that, in the Gibbs sampling algorithm for the DMG model, a whole row of the error covariance matrix is sampled jointly, conditioning on the other parameters. For the DAG models, all entries of the error covariance matrix are independent and can be sampled jointly, but this requires conditioning on the ancillary latents, which do not exist in the DMG model and have to be sampled only in the DAG case.

For the majority of the covariance entries, the MCMC procedure mixed quite well, as illustrated in Figure 13. Notice how about 12% of the sampled DMG error covariances for Y7 ↔ Y11 were under zero, which could raise suspicion over the assumption of positive covariances. Autocorrelation is essentially zero for most parameters at a lag of 50. The degree of autocorrelation, however, varied significantly between the DMG model and each DAG model. The chains for the DMG model mixed considerably better.

12. A few technical notes: we used the priors suggested in Dunson et al. (2005), except that we changed the confidence in the prior of the covariance of the error terms V to be smaller (in order to minimize the influence of the priors in the models, since in this particular problem the DMG and DAG models are nearly likelihood equivalent but not posterior distribution equivalent − the priors belong to different families). We used 1 degree of freedom in our G-Inverse Wishart, with the matrix parameter being the expected value of Dunson et al.’s prior. For the DAG models, we also used the G-inverse Wishart prior for the error terms, but where all error terms are independent. For the DAG model with a free coefficient per ancillary latent, we assigned a standard Gaussian prior to such coefficients. The chains were initialized randomly by sampling standard Gaussians for the coefficients and latent variables. Error covariance matrices were initialized to diagonal matrices with diagonal entries sampled uniformly in [1, 2]. Coefficient parameters were sampled jointly given the error covariance matrix and latent variables. Latent variables were also sampled jointly, given the parameters.

Figure 15: [Two scatter plots titled “Comparison of average effective sample size”; vertical axis: DMG; horizontal axes: positive covariance DAG and unconstrained DAG.] Comparison of the effective sample size of the MCMC algorithm applied to the three models. Here we plot the average effective sample sizes over 80 trials of 50,000 samples for each of the 66 entries of the covariance matrix. Points over the line indicate parameters where the DMG approach performed better.
To summarize such behavior, we calculated the effective sample size of the samples obtained from several chains. The parameters of interest in this comparison are the independent entries in the 11 × 11 dimensional observed covariance matrix. This is a total of 66 parameters. The effective sample size statistics were obtained by 80 independent chains of 50,000 samples each, for the three models. For each chain and each parameter, we compute the desired statistic using the EFFECTIVE S IZE function implemented in the R package CODA, freely available in the Internet. Results are summarized by boxplots in Figure 14. Parameters are ordered in the x-axis following the upper triangular covariance matrix, scanning it in the order {σY1Y1 , σY1Y2 , . . . , σY1Y11 , σY2Y2 , . . . , σY11Y11 }. White boxplots correspond to the distribution of effective sample size statistics with the DMG model across the 80 independent chains. Gray boxplots correspond to the two DAG variants. There is no significant difference between the behaviour of the Gibbs sampling procedure for the two DAG models. The procedure with the DMG model is clearly better behaved. As a summary statistic, the average effective sample size over 80 trials was steadly larger in the DMG outcome than in the positive DAG outcome (61 out of 66 parameters) and unconstrained DAG (59 out of 66). The comparison of averages is illustrated by Figure 15. By caching the sufficient statistics of the data and factorizing the sampling procedure according to the districts of the graph, the running time for generating 50,000 samples out of the DMG model was of 34 seconds in a dual core Pentium IV 2.0 GHz. Depending on the connectivity of the bidirected components of the graph and on the implementation of matrix inversion, sampling from the 1220 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS DAG model might be faster than sampling from the DMG. In this particular study, sampling from the DAG models was substantially slower, an approximate average of 60 seconds for both variants. This can be explained by the fact that sampling latent variables is very expensive, especially considering that in the given DAG models all ancillary latents become dependent when conditioning on the data. To summarize, the DMG approach allowed for a complete parameterization with significantly better mixing properties, while still resulting in a faster MCMC procedure. 7.2 Structure Learning Applications When trying to find a point estimate of graphical structures (i.e., returning a single graph that explains the data well), simple approaches such as testing for marginal independencies are reasonable learning algorithms under the Gaussian assumption. The Bayesian approach, however, allows one to compute odds and distributions over graphs and graph statistics, for example, the joint probability of small substructures (Friedman and Koller, 2003). Moreover, it is not clear how the independence test procedure controls for the predictive ability of the model, which is not a straightforward function of the edges that are selected due to the quantitative aspects of the dependencies. We evaluate our Bayesian model selection contribution, focusing on the Monte Carlo sampler for bi-directed models. Jones et al. (2005) propose the following priors for graphs: P (G |β) = β|E| (1 − β)0.5m(m−1)−|E| where β is a hyperparameter, |E| is the number of edges in G , and m is the number of nodes. 
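This prior is straightforward to evaluate on the log scale; a minimal sketch (hypothetical helper name, not from the original text):

```python
import numpy as np

def log_graph_prior(num_edges, m, beta):
    """Log of the graph prior of Jones et al. (2005):
    P(G | beta) = beta^{|E|} * (1 - beta)^{m(m-1)/2 - |E|},
    where |E| is the number of edges and m the number of nodes."""
    total_pairs = m * (m - 1) / 2.0
    return num_edges * np.log(beta) + (total_pairs - num_edges) * np.log(1.0 - beta)

# Example with the hyperparameter choice discussed below, beta = 0.5 / (m - 1):
# log_graph_prior(num_edges=18, m=10, beta=0.5 / 9)
```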
As suggested by Jones et al., we choose β = 0.5/(m − 1), which puts more mass on graphs with O(m) edges than the uniform prior. We start with a brief synthetic study to compare the approach against a simple but effective approach based on the BIC approximation.13 An experiment with gene expression data closes this subsection.

7.2.1 Synthetic Studies

As a sanity check for the procedure, we generate synthetic 10-dimensional Gaussian data from models that are Markov with respect to a bi-directed graph. One hundred data sets of 50 datapoints each are generated, each coming from a different model.14 We initially find a structure by marginal independence tests using Fisher’s Z statistic at a 0.05 level. From this starting point, we perform two searches: one using the BIC score, and the other using the marginal likelihood with a G-IW prior.15 Given the best model for each procedure, we evaluate the predictive log-likelihood on a test set of 2,000 points which are independently sampled for each of the 100 models.

The average difference in log-likelihood prediction16 between the structure learned with the Bayesian prior and the BIC-induced model is depicted in Figure 16(a). This is computed by conditioning on the learned structures (fully Bayesian vs. BIC maximum a posteriori graphs) and marginalizing over the posterior of the parameters. The parameter priors are those used for the structure learning step. This might be unfair to the BIC procedure, since it is not designed to maximize the finite sample posterior: hence we also show in Figure 16(b) the results obtained when the predictions given the BIC model are computed using the maximum likelihood estimators of the parameters. The average difference in the first case is 400.07, and only slightly less for the second case (389.63). Most of the mass of the difference distribution is positive (85 out of 100 for the first case, 89 out of 100 in the second case), which passes a sign test at a 0.05 level. The difference is still relatively small, suggesting that informative prior knowledge might be necessary to produce substantially better predictions.

13. The BIC approach is an asymptotically consistent score for selecting the maximum a posteriori Gaussian bi-directed graph model (Richardson and Spirtes, 2002).

14. The details of the simulated data are as follows: we start with a DAG with no edges, with observed nodes {Y1, Y2, . . . , Y10} and hidden nodes {X1, X2, X3, X4}. Each individual edge Xi → Yj is added with probability 0.35, and no other edges are allowed. We reject graphs with fewer than 10 edges. All coefficient parameters are sampled from a standard Gaussian, and variances from a uniform distribution in [0, 1]. The model over Y corresponds to a bi-directed graph, where the edge Yi ↔ Yj exists if and only if Yi and Yj have a common latent parent Xk in the DAG. We then store 50 samples for the Y variables in a data set. The procedure is repeated 100 times with different parameters and graphical structures each time. The average number of edges in the resulting simulation was 18.4 edges per graph.

15. In both cases, we center the data at the empirical mean of the training set and assume the data to have been generated from a zero-mean Gaussian. The G-Inverse Wishart is an empirical prior: a diagonal matrix with the training variance of each variable used as the diagonal. The number of degrees of freedom is set to 1. The search is a standard greedy procedure: we evaluate the marginal log-likelihood or BIC score for each graph that differs from the current candidate by one edge (i.e., graphs with one more or one fewer edge) and pick the one with the highest score. We stop when no improvement is possible.

16. In terms of incorrect edge additions and deletions, the procedures behave about the same: an average of one third of the edges is missed, and 7% of edges are incorrectly added (individual percentages are with respect to the total number of possible mistakes in each graph). Unlike BIC, however, our procedure allows for different trade-offs by using different priors. It should also be pointed out that counting edge errors is just one possible measure. A more global quantitative score such as predictive log-likelihood takes into account, indirectly, the magnitude of the errors—although it is not a direct measure of model fitness.

Figure 16: [Two density plots, (a) and (b), of the difference in predictive log-likelihood (GIW Greedy − BIC Greedy).] The difference in predictive log-likelihood between models learned with the G-IW prior and the best BIC models found by greedy search. Although the difference per point is small, it reflects a persistent advantage of the full Bayesian approach. Figure (a) shows the estimated density of the distribution of differences when predicting points using the Bayesian predictive log-likelihood. Since the BIC search method does not attempt to maximize the finite sample posterior distribution, we provide Figure (b) for completeness: in this case, the predictive log-likelihood for the BIC model was calculated using the maximum likelihood estimator. The difference hardly changes, and the fully Bayesian model still wins (density estimates produced by the DENSITY(·) function of R 2.6.0).

7.2.2 Gene Expression Analysis

To illustrate the use of Bayesian model selection approaches, we analyse the gene expression data previously studied by Drton and Perlman (2007), also as Gaussian bi-directed models. As before, our goal will be to compare the predictive power of models learned by greedy search with BIC and greedy search with the Bayesian posterior. The data consists of 13 gene expression measurements from a metabolic network. A total of 118 points is available. Using all data, the BIC-induced graph has 39 edges, while the finite sample posterior graph had 44. The same procedure used in the synthetic studies for initializing graphs, choosing priors and centering the data was applied in this case, with two choices of degrees of freedom δ for the G-IW prior: δ = 1 and δ = 5. Preliminary experiments where 90% of the samples are assigned to the training set showed a negligible difference between methods. We then generate 10 random splits of the data, 50% of them assigned to the training set. Predictive results using the MCMC method for evaluating the Bayesian predictions (with half a million samples) are shown in Table 1. The BIC graphs are by definition the same in the three sets of evaluation, but parameters are learned in three different ways (maximum likelihood point estimation and Bayesian averaging with two different priors). There is a steady advantage for the Bayesian approach, although a small one.
Notice that using Bayesian averaging over parameters given the BIC graph improves prediction when compared to using the maximum likelihood point estimate, despite the simplistic choice of prior in this study. Notice also that the cases where the Monte Carlo method has small or no advantage over the BIC method were the ones where the maximum likelihood estimators produced their best results.

Table 1: Results for the 10 random splits of the gene expression data, with 50% of the points assigned to the training set. The first column shows the predictive log-likelihood for the graph learned with the BIC criterion and parameters fit by maximum likelihood. The next two columns show predictive log-likelihood results for the graphs learned with BIC and the Monte Carlo (MC) marginal likelihood method using a G-IW prior with degrees of freedom δ = 1. The last two columns are the results of a prior where δ = 5. Best results in bold.

Folder   MLE/BIC      BIC (δ=1)    MC (δ=1)     BIC (δ=5)    MC (δ=5)
1        -6578.44     -6382.45     -6308.19     -6342.82     -6296.14
2        -6392.67     -6284.64     -6277.94     -6279.54     -6285.26
3        -8194.54     -6567.89     -6433.51     -6553.88     -6452.15
4        -6284.00     -6265.16     -6285.77     -6252.54     -6258.42
5        -9428.00     -6473.93     -6400.51     -6483.43     -6469.45
6        -7111.45     -6573.85     -6572.74     -6528.76     -6513.02
7        -6411.43     -6329.53     -6317.18     -6313.05     -6309.18
8        -6350.44     -6319.87     -6295.19     -6299.53     -6297.80
9        -6374.31     -6307.13     -6308.21     -6297.47     -6304.25
10       -7247.82     -6584.96     -6468.51     -6528.61     -6444.55

7.2.3 Remarks

The procedure based on the sampler is feasible for reasonably sized problems, on the order of a few dozen variables, on desktop machines. Further improvements are necessary for larger problems. One aspect that was not explored here was re-using previous computations when calculating the probability of a new candidate, in a way similar to the local updates in DAG models (Chickering, 2002). How to combine local updates with the ordering-based improved sampler of Section 6 is left as future research. Several practical variations can also be implemented, such as vetoing the inclusion of edges associated with high p-values in the respective independence tests. Such tabu lists can significantly shrink the search space.

It is important to evaluate how the Monte Carlo procedure for computing normalizing constants behaves in practice. For all practical purposes, the procedure is an importance sampler and as such is not guaranteed to work within a reasonable amount of time for problems of high dimensionality (MacKay, 1998). We can, however, exploit the nature of the problem for our benefit. Notice that the procedure depends upon a choice of ordering ≺ for the variables. Different orderings correspond in general to different importance distributions. We can play with this feature to choose a suitable ordering. Consider the following algorithm for choosing an ordering given a bi-directed graph G (a code sketch of a simplified variant appears after the listing):

1. Let ≺ be an empty queue.

2. Let G′ be the graph complement of G, that is, the graph where {Yi, Yj} are neighbors if and only if they are not adjacent in G.

3. Let C be an arbitrary maximum clique of G′. Add all elements of C to the end of ≺ in any arbitrary order.

4. For each pair {Yi, Yj}, not intersecting C, such that the path Yi ↔ Yk ↔ Yj exists in G and Yk ∈ C, add the edge Yi ↔ Yj to G.

5. Remove all elements Yk ∈ C from G, including any edge into Yk.

6. Iterate Steps 2-5 until G is an empty graph.
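The following sketch corresponds to the simplified, greedy variant of this ordering heuristic described in footnote 17 (growing a set of mutually non-adjacent vertices instead of solving for a maximum clique of the complement). The adjacency-set representation and the function name are our own illustration:

```python
def choose_ordering(adj):
    """Heuristic ordering for the importance sampler.
    `adj` maps each variable to the set of its spouses (bi-directed neighbors)."""
    adj = {v: set(nb) for v, nb in adj.items()}
    ordering = []
    while adj:
        # Greedily build C: pairwise non-adjacent vertices (a clique of the complement of G).
        C = []
        for v in list(adj):
            if all(v not in adj[u] for u in C):
                C.append(v)
        ordering.extend(C)
        # Step 4: connect Y_i and Y_j whenever they have a common neighbor Y_k inside C.
        rest = [v for v in adj if v not in C]
        for k in C:
            nbrs = [u for u in adj[k] if u in rest]
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    adj[nbrs[i]].add(nbrs[j])
                    adj[nbrs[j]].add(nbrs[i])
        # Step 5: remove C and all edges into it, then iterate on the reduced graph.
        for k in C:
            del adj[k]
        for v in adj:
            adj[v] -= set(C)
    return ordering
```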
The resulting queue ≺ is an ordering that attempts to maximize the number of variables that are marginally independent given their common predecessors. This is just one possibility for simplifying the importance sampling distribution: perfect orderings and the approaches for simplifying maximum likelihood estimation described by Drton and Richardson (2008b) could be adapted to provide even better orderings, but we leave this as future work.17

17. In the actual implementation used in the experiments in this section, we adopted an even simpler approach: instead of finding maximum cliques, we start to build a clique from a particular node, "greedily" adding other nodes to the clique according to the column order of the data set. Each node generates a candidate clique, and we pick an arbitrary clique of maximal size to be our new set C.

Figure 17: An evaluation of the stability of the Monte Carlo normalizing function procedure. The top row depicts the marginal likelihood estimates for the gene problem using two different distributions implied by two different orderings (default ordering, left; optimized ordering, right), as explained in the text. Experiments with synthetic data are shown in the bottom row (123 parameters, left; 658 parameters, right), and the bottom-right figure illustrates major differences. (Each panel plots the running marginal likelihood estimate against the number of samples, up to 10^5; the synthetic-data panels compare an arbitrary ordering with the optimized ordering.)

Figure 17 illustrates the difference that a smart choice of ordering can make. The top left graph in Figure 17 depicts the progress of the marginal likelihood Monte Carlo estimator for the gene expression problem using the graph given by the hypothesis testing procedure. The model has 55 parameters. We obtain three estimates, each using a sample of 100,000 points, which allows us to observe how the estimates change at the initial stages. The variable ordering in this case is the ordering present in the original database (namely, DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, PPDS1 and PPDS2). The top right graph shows three runs using the optimized ordering criterion. Convergence is much faster in this case, and both samplers agree on the normalizing constant estimate. As an illustration of the power of the procedure and its limitations, we generated a synthetic sample of 1,000 training points from a graph with 25 nodes, using the same procedure as in Section 7.2.1. A run of two different samplers is shown at the bottom left of Figure 17. They are seemingly well-behaved, the ratio between the largest and median weight being on the order of one hundred in the "optimally" ordered case. In contrast, the bottom right corner of Figure 17 illustrates the method with a covariance matrix of 50 random variables and 1,000 training points. Notice this is a particularly dense graph. Much bigger jumps are observed in this case and there is no clear sign of convergence at 100,000 iterations.
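The stability checks referred to above (running estimates of the normalizing constant, and the ratio of the largest to the median importance weight) can be computed generically from the log importance weights. A minimal sketch with names of our own choosing is given below; comparing the running traces across several random orderings gives the permutation check discussed in the next paragraph.

```python
import numpy as np

def importance_diagnostics(log_weights):
    """Simple stability diagnostics for an importance-sampling estimate of a
    normalizing constant, given the log importance weights of the drawn samples."""
    lw = np.asarray(log_weights, dtype=float)
    m = lw.max()
    weights = np.exp(lw - m)                               # rescale for numerical stability
    # Running estimate of log Z: log of the running mean of the (unnormalized) weights.
    running_log_Z = m + np.log(np.cumsum(weights) / np.arange(1, lw.size + 1))
    max_to_median = np.exp(lw.max() - np.median(lw))       # ratio of largest to median weight
    ess = weights.sum() ** 2 / np.sum(weights ** 2)        # effective sample size of the weights
    return running_log_Z, max_to_median, ess
```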
While there is no foolproof criterion to evaluate the behavior of an importance sampler, the relationship between orderings provides a complementary technique: if the normalizing constant estimators vary substantially for a given set of random permutations of the variables, then the outcomes are arguably not to be trusted, even if the respective estimators appear to have converged.

Concerning the choice of priors, in this section we exploited empirical priors. The G-Inverse Wishart matrix hyperparameter is a diagonal matrix whose variance entries are the sample variances. While this adds an extra bias towards diagonal matrices, at least in our experiments we performed close to or better than other approaches—it is, however, not clear whether we could have done much better. It is still an open question which practical "default" hyperparameters will prove useful for the G-IW. Elicitation of subjective priors in the context of structural equation models can benefit from pre-existing work on Bayesian regression, although again practical matters might be different for the G-IW. Dunson et al. (2005) describe some limitations of default priors for structural equation models. A thorough evaluation of methods for eliciting subjective priors is outside the scope of this work, but existing work on inverse Wishart elicitation provides a starting point (Al-Awadhi and Garthwaite, 1998). As in the case of the inverse Wishart, the G-Inverse Wishart has a single hyperparameter for specifying degrees of freedom, a limitation which might motivate new types of priors (Brown et al., 1993).

7.3 Discrete Data Applications

We now show results on learning a discrete distribution that factorizes according to a mixed graph. Drton and Richardson (2008a) describe applications on real-world binary data modeled according to bi-directed graphs. The empirical contingency tables for the two studies can be found in the corresponding technical report (Drton and Richardson, 2005). Drton and Richardson used a complete parameterization for bi-directed binary models and a maximum likelihood estimation procedure. In this section, we analyze these two data sets to illustrate the behavior of our Bayesian procedure using the probit model. Our model imposes probit constraints that are not enforced by Drton and Richardson, but it allows us to obtain Bayesian credible intervals and predictions.

The graphs used in the two studies are depicted in Figure 18.

Figure 18: Two learning problems with discrete data. In (a), the graph shows dependencies concerning alcoholism (Ai) and depression (Di) symptoms for paired twins {1, 2}. In (b), a model for dependencies among features of a study on parole appeals, including the success of the parole, whether the type of offense was a person offense or not, and whether the offender had a dependency on drugs and was over 25 years old. All variables in these studies are binary, and further details and references are provided by Drton and Richardson (2008a). (Nodes in (a): A1, A2, D1, D2; nodes in (b): Success, Type of offense, Prior sentence, Drugs, Age.)

The first problem is a study on the dependence between alcoholism and depression, as shown in Figure 18(a). A data point is collected for a given pair of mono-zygotic twins. For each sibling Si, it is recorded whether Si is/is not alcoholic (Ai), and whether Si suffers/does not suffer from depression (Di). The hypothesis encoded by the graph is that alcoholism and depression do not share a common genetic cause, despite A and D having some hidden (but different) genetic causes.
If A and D did have genetic common causes, one would expect that the edges A1 ↔ D2 and A2 ↔ D1 would also be required. The compounded hypothesis of marginal independencies for Ai and Dj, i ≠ j, can be tested jointly by testing a bi-directed model. Notice that no reference to particular genetic hidden causes of alcoholism and depression is necessary, which again illustrates the power of modeling by marginalizing out latent variables.

Figure 19: Posterior distribution of some of the marginal contingency table entries for the twin model. (Panels show kernel density estimates for the entries A1A2D1D2 = 0000, 0001 and 1000, each from a chain of 5,000 points.)

The second study, as shown in Figure 18(b), concerns the dependencies among several variables in an application for parole. The model implies, for instance, that the success of a parole application (Success node, in the Figure) is independent of the age of the offender being under 25 (Age node). However, if it is known that the offender had a prior sentence, these two variables become dependent (through the path Success ↔ Prior sentence ↔ Age). As reported by Drton and Richardson, their binary bi-directed model passes a significance test. Drton and Richardson also attempted to learn an undirected (Markov) network structure with this data, but the outcome was a fully connected graph. This is expected, since Markov networks cannot represent marginal independencies unless the graph is disconnected, which would introduce all sorts of other independencies and possibly not fit the data well. If many marginal independencies exist in the data generating process, Markov networks might be a bad choice of representation. For problems with symmetries such as the twin study, DAGs are not a natural choice either.

7.3.1 Results

For the twin data problem, we used a simple prior for the covariance matrix of the underlying latent variables: a G-inverse Wishart with 1 degree of freedom and a complete covariance with a value of 2 for each element in the diagonal and 1 outside the diagonal. Thresholds are fixed at zero, since we have binary data. We present the expected posterior values of the contingency table entries in Figure 20. The outcome is essentially identical to the maximum likelihood estimates of Drton and Richardson despite the probit parameterization.

Figure 20: The posterior expected value of the 16 entries in the twin study table (E[Θ|D]). Results generated with a chain of 5,000 points. We also show the maximum likelihood estimates of Drton and Richardson (MLE) and the maximum likelihood values obtained using an unconstrained model (uMLE). Despite the probit parameterization, in this particular study there is a reasonable agreement between the Bayesian estimator and the estimator of Drton and Richardson.

Entry (A1 A2 D1 D2)   E[Θ|D]   MLE     uMLE
0000                  0.461    0.461   0.482
0001                  0.136    0.138   0.134
0010                  0.157    0.159   0.154
0011                  0.097    0.096   0.085
0100                  0.032    0.032   0.025
0101                  0.022    0.021   0.015
0110                  0.007    0.008   0.012
0111                  0.012    0.012   0.017
1000                  0.018    0.018   0.013
1001                  0.003    0.004   0.007
1010                  0.021    0.020   0.013
1011                  0.009    0.009   0.015
1100                  0.008    0.010   0.005
1101                  0.003    0.002   0.003
1110                  0.003    0.005   0.007
1111                  0.006    0.005   0.012
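For reference, under the probit parameterization each entry of the table above is the probability that a zero-mean Gaussian vector with the model's covariance matrix falls in the orthant selected by the binary pattern. The sketch below is our own illustration of that computation by simple Monte Carlo for a fixed covariance matrix (averaging it over posterior draws of the covariance would give E[Θ|D] as reported in Figure 20); the function name and the use of NumPy are assumptions.

```python
import numpy as np

def probit_cell_probability(Sigma, pattern, n_draws=200_000, seed=0):
    """P(Y = pattern) for a binary probit model with zero thresholds:
    Y_i = 1 iff Y*_i > 0, where Y* ~ N(0, Sigma). Estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    d = len(pattern)
    Ystar = rng.multivariate_normal(np.zeros(d), Sigma, size=n_draws)
    signs = Ystar > 0
    target = np.array(pattern, dtype=bool)
    return np.mean(np.all(signs == target, axis=1))

# Example: the entry "0000" of the twin table for some 4x4 covariance Sigma over
# the underlying variables (A1*, A2*, D1*, D2*):
#   probit_cell_probability(Sigma, [0, 0, 0, 0])
```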
Moreover, with our procedure we are able to generate Bayesian credible intervals, as illustrated in Figure 19. The results are very stable for a chain of 1,000 points.

For the parole data, we used a G-inverse Wishart prior for the covariance matrix of the underlying variables Y⋆ with 1 degree of freedom and the identity matrix as hyperparameters. We compare the effective sample size of the Gibbs sampler for our DMG model and for the DAG model obtained by using the ancillary latent parameterization of Section 7.1 for the underlying latent variable covariance matrix.18 Boxplots for the 16 contingency table entries of the twin network and the 32 entries of the parole study are shown in Figure 21. The setup is the same as in the democratization and industrialization experiment, where we run 80 independent chains and plot the distribution of the effective sample sizes to measure the mixing time. We ran a shorter chain of 2,000 points, since computing the contingency table entries is expensive. There is a substantial difference in effective sample size for the parole study. Notice that we are comparing MCMC samples for the entries in the contingency table, which in the DAG case requires integrating out not only the underlying latent variables implicit in the probit parameterization, but also the ancillary latents that account for the bi-directed edges. This hierarchy of latent variables, which does not exist in the DMG case, causes a considerable increase in the autocorrelation of the chain compared to the DMG model. The standard DMG parameterization can be seen as a way of obtaining a collapsed Gibbs sampler, where the parameterization by construction reflects latent variables that were analytically marginalized.

18. The priors used are as follows: the ancillary representation was given a prior with mean 1 and variance 1 for the coefficients Xij → Yj⋆, for j > i, and set constant to 1 otherwise. The means of the ancillary latents were fixed at 0. Variance parameters were given (0.5, 0.5) inverse gamma priors, which approximately matches the priors in the DMG model.

Figure 21: Comparison of effective sample sizes for the twin data (a) and the parole data (b). 80 independent chains of 2,000 points were obtained using the Gibbs sampling algorithm, and the respective box-plots are shown above. The Markov chain with the DMG approach easily dominates the DAG one. For the parole data, the average effective sample size for the DAG was as low as 60 points. (Each panel plots, for every contingency table entry, the distribution of effective sample sizes under the DMG and DAG parameterizations.)
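The effective sample sizes compared in Figure 21 are standard autocorrelation-based summaries. The text does not specify the exact estimator, so the sketch below uses one common choice (truncating the autocorrelation sum at the first non-positive lag); it is an assumption, not the authors' implementation.

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum_k rho_k), truncating the sum at the first
    non-positive autocorrelation (a simple, common truncation rule)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # biased autocovariances, lags 0..n-1
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)
```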
8. Conclusion

Directed mixed graph models are a generalization of directed graph models. Whenever a machine learning application requires directed graphs, one should first consider whether directed mixed graphs are a better choice of representation instead. DMGs represent conditional independencies of DAGs where hidden variables have been marginalized out. Given that in most applications it is unlikely that all relevant variables are known, DMGs are a natural representation to use. In this paper, we introduced priors and inference algorithms for Bayesian learning with two popular families of mixed graph models: Gaussian and probit. We discussed some implementations and approximations to scale up algorithms. We showed examples of applications with real data, and demonstrated that Bayesian inference in Gaussian and probit DMG models using MCMC can have substantially faster mixing than in comparable DAGs.

It is part of the machine learning folklore that factor graphs can subsume directed networks. In an important sense, this is known not to be true: undirected and factor graphs only allow for monotonic independence models, where explaining away is ruled out. This excludes a vast number of realistic, non-monotonic, models. While factor graphs are perhaps the data structures of choice for general message-passing algorithms (e.g., Yedidia et al., 2005), they are far from being universal modeling languages for independencies. What is true is that for any distribution that is Markov with respect to a DAG or DMG there is at least one corresponding factor graph model, but this is a vacuous claim of little interest: any distribution can be represented by a single-factor model involving all variables. Some distributions will require a factor with all variables, even under the presence of a large number of independence constraints. For instance, a factor graph corresponding to any given bi-directed chain will necessarily include a factor node adjacent to all variable nodes, as illustrated in Figure 22. When parameterizing a distribution with many marginal independencies (e.g., a bi-directed tree), the respective factor graph would be no more than an unhelpful drawing.

Figure 22: In (a), a simple bi-directed chain with four random variables. In (b), the respective factor graph that is obtained from a Bartlett parameterization using the ordering ≺ ≡ {Y1, Y2, Y3, Y4}. In this case, the factors are p(Y1) × p(Y2 | Y1) × p(Y3 | Y1, Y2) × p(Y4 | Y1, Y2, Y3). A different choice of ordering (e.g., the perfect ordering) could provide simpler factors on average, but the presence of a factor linked to all variables is unavoidable.

A better strategy for solving real-world problems is to define a family of models according to the (directed/undirected/factor) graphs of choice, and let the inference algorithm decide which re-expression of the model suits the problem. This has been traditional in the graphical modeling literature (Lauritzen, 1996). The strategy adopted in this paper followed this spirit. An alternative has been recently introduced by Huang and Frey (2008). That paper discusses graphical families of marginal independence constraints (essentially identical to bi-directed graphs, although other types of constraints might implicitly follow from the parameterization). Models are parameterized using a very different strategy. The idea is to parameterize cumulative distribution functions (CDFs) instead of densities or probability mass functions. A simple factorization criterion can be defined in the space of CDFs, but densities have to be computed by a novel message-passing scheme. The particular application discussed by Huang and Frey (2008) could in principle be approached using the Gaussian bi-directed probit model of Section 5, but the parameterization in Huang and Frey (2008) does not need to rely on Gaussian distributions. It is not clear, however, how to efficiently perform Bayesian inference in this case and which constraints are implicitly implied by the different choices of parameterization.
The different perspective given by products of CDFs is novel and promising. It should point to new directions in mixed graph modeling.

The structural equation modeling literature also describes several pragmatic ways of specifying non-linearities in the structural equations (Lee, 2007). Less common is the specification of non-Gaussian models for the joint density of the error terms. Silva and Ghahramani (2009) introduce a flexible mixture of Gaussians approach for models of marginal independence. There remains a need for ways to combine this approach with flexible families of structural equations in a computationally efficient way. Also, models with non-additive error terms remain to be explored.

Current interest in estimating sparse statistical models has led to approaches that estimate structured covariance matrices (e.g., Bickel and Levina, 2008). This development could also lead to new families of priors. In particular, different matrix decompositions have motivated different ways of specifying priors on covariance matrices. For instance, Chen and Dunson (2003) propose a modified Cholesky decomposition for the covariance matrix of random effect parameters: standard deviations are parameterized separately with a prior that puts positive mass on zero variances (effectively allowing the random effect to be neutralized). Wong et al. (2003) describe a prior for inverse correlation matrices that is uniform conditioned on the number of structural zeros. Metropolis-Hastings schemes are necessary in this case. Shrinkage methods have also been applied to the estimation of covariance matrices. A common approach, shrinkage towards a diagonal matrix (e.g., Daniels and Kass, 1999), could be generalized towards some sparse matrix corresponding to a bi-directed graph. Although shrinkage will not generate structural zeros in the resulting matrix, allowing for sparse shrinkage matrices other than the identity matrix could be interesting in prediction problems. Some approaches can exploit an ordering for the variables, which is natural in some domains such as time-series analysis. While the G-Inverse Wishart is invariant to a permutation of the variables, new types of priors that exploit a natural variable ordering should be of interest, as in the original work of Brown et al. (1993) that motivated our approach.

Other directions and applications are suggested by recent papers:

• learning measurement models: the industrialization and democratization problem of Section 7.1 provides an example of a measurement model. In such a family of problems, observed variables are children of latent variables, and connections from latents to observables define the measurement model. Sparsity in the measurement model can be exploited to allow for more general dependencies connecting latent variables. One role of the bi-directed component is to allow for extra dependencies connecting observed variables that are not accounted for by the explicit latent variables in the model. Silva et al. (2006) describes a learning algorithm for mixed graph measurement models using the "ancillary" parameterization. The natural question is which alternative optimization strategies could be used and how to scale them up;

• structural and relational learning: in prediction problems where, given an input vector X, we have to predict an output vector Y, the dependence structure of Y given X can also lie within the directed mixed graph family.
Silva et al. (2007) introduces mixed graph models within the context of relational classification, where Y are labels of different data points that are not independently distributed. In such a class of problems, novel kinds of parameterization are necessary since the dimensionality of the covariance matrix increases with the sample size. Structural features of the graph are used to propose different parameterizations of the dependencies, and many other alternatives are possible;

• causal inference: mixed graphs have been consistently used as a language for representing causal dependencies under unmeasured confounding. Zhang (2008) describes recent advances in identifying causal effects with ancestral graphs. Algorithms for learning mixed graph structures are described by Spirtes et al. (2000), and the recent advances in parameterizing such models should result in new algorithms.

Many challenges remain. For instance, more flexible discrete DMG models are being developed (Drton and Richardson, 2008a), but for large graphs they pose a formidable computational problem. An important question is which other less flexible, but more tractable, parameterizations could be used, and which approximation algorithms to develop. The probit family discussed here was a choice among many. The parameterization by Drton and Richardson (2008a) could be a starting point for trading off flexibility and computational effort. And while it is true that Gaussian copula models (Pitt et al., 2006) can be adapted to generalize the approach introduced here, it remains to be seen if other copula parameterizations easily lead to DMG models.

Acknowledgments

We thank the anonymous reviewers for their suggestions, Kenneth Bollen for providing us with the industrialization and democratization data set, and Robert Gramacy for helpful discussions. An earlier version of this paper (Silva and Ghahramani, 2006) appeared in the proceedings of the Uncertainty in Artificial Intelligence conference. This work was funded by the Gatsby Charitable Foundation and an EPSRC grant #EP/D065704/1.

Appendix A. Deriving the Sampling Distribution for the Monte Carlo Computation of Normalizing Constants

We give here the details on how to derive the sampling distribution used for computing normalizing constants I_G(δ, U), as described in Section 3.2.2. Let A_i ≡ Σ_{sp_≺(i), nsp_≺(i)} Σ^{-1}_{nsp_≺(i), nsp_≺(i)}. Recall from Equation (7) that B_{i, nsp_≺(i)} = −B_{i, sp_≺(i)} A_i. The original density p(B_i | γ_i), as given by Lemma 1, is a multivariate Gaussian with the following kernel:

\exp\left\{-\frac{1}{2\gamma_i}
\begin{bmatrix} (B_{i,\mathrm{sp}_\prec(i)} - M_{\mathrm{sp}_\prec(i)})^T \\ (B_{i,\mathrm{nsp}_\prec(i)} - M_{\mathrm{nsp}_\prec(i)})^T \end{bmatrix}^T
\begin{bmatrix} U_{ss} & U_{sn} \\ U_{ns} & U_{nn} \end{bmatrix}
\begin{bmatrix} (B_{i,\mathrm{sp}_\prec(i)} - M_{\mathrm{sp}_\prec(i)})^T \\ (B_{i,\mathrm{nsp}_\prec(i)} - M_{\mathrm{nsp}_\prec(i)})^T \end{bmatrix}\right\}
\qquad (22)

where U_{{i−1},{i−1}} in Lemma 1 was rearranged above as the partitioned matrix in (14). The pair {M_{sp_≺(i)}, M_{nsp_≺(i)}} corresponds to the respective partition of the mean vector M_i.
Plugging in the expression for B_{i, nsp_≺(i)} in (22), we obtain the modified kernel

\exp\left\{-\frac{1}{2\gamma_i}
\begin{bmatrix} B_{i,\mathrm{sp}_\prec(i)}^T - M_{\mathrm{sp}_\prec(i)}^T \\ -A_i^T B_{i,\mathrm{sp}_\prec(i)}^T - M_{\mathrm{nsp}_\prec(i)}^T \end{bmatrix}^T
\begin{bmatrix} U_{ss} & U_{sn} \\ U_{ns} & U_{nn} \end{bmatrix}
\begin{bmatrix} B_{i,\mathrm{sp}_\prec(i)}^T - M_{\mathrm{sp}_\prec(i)}^T \\ -A_i^T B_{i,\mathrm{sp}_\prec(i)}^T - M_{\mathrm{nsp}_\prec(i)}^T \end{bmatrix}\right\}
\qquad (23)

which can be rewritten as

p_b(B_{i,\mathrm{sp}_\prec(i)};\, K_i m_i,\, \gamma_i K_i)\ \times\ (2\pi)^{\#\mathrm{sp}_\prec(i)/2}\, |\gamma_i|^{\#\mathrm{sp}_\prec(i)/2}\, |K_i(\Phi_{i-1})|^{1/2}\ \times\ \exp\!\left(-\tfrac{1}{2}\gamma_i^{-1} U_i\right)
\qquad (24)

where #sp_≺(i) is the size of the set sp_≺(i), p_b(·; α, Σ) is the density function of a multivariate Gaussian distribution with mean α and covariance Σ, K_i(Φ_{i−1}) ≡ K_i to emphasize the contribution of previous parameters, and

m_i = (U_{ss} - A_i U_{ns}) M_{\mathrm{sp}_\prec(i)} + (U_{sn} - A_i U_{nn}) M_{\mathrm{nsp}_\prec(i)},
K_i^{-1} = U_{ss} - A_i U_{ns} - U_{sn} A_i^T + A_i U_{nn} A_i^T,
U_i = M_i^T U_{\{i-1\},\{i-1\}} M_i - m_i^T K_i m_i.

If sp_≺(i) = ∅, it follows that B_i = B_{i, nsp_≺(i)} = 0. The kernel (23) then reduces to exp(−0.5 U_i/γ_i), with U_i ≡ M_i^T U_{{i−1},{i−1}} M_i. If nsp_≺(i) = ∅, then the expression for the kernel does not change (U_i ≡ 0), and Equation (24) corresponds to the original kernel in Equation (11). Inserting the re-expressed kernel into the original function (11), we obtain

p_b(B_{i,\mathrm{sp}_\prec(i)};\, K_i m_i,\, \gamma_i K_i)\;
p_g\!\left(\gamma_i;\ \frac{\delta + i - 1 + \#\mathrm{nsp}_\prec(i)}{2},\ \frac{u_{ii.\{i-1\},\{i-1\}} + U_i}{2}\right) f_i(\Phi_{i-1})

where p_g(·; α, β) is an inverse gamma density function and

f_i(\Phi_{i-1}) \equiv (2\pi)^{-\frac{(i-1)-\#\mathrm{sp}_\prec(i)}{2}}\, |K_i(\Phi_{i-1})|^{1/2}\, |U_{\{i-1\},\{i-1\}}|^{1/2}
\times \frac{(u_{ii.\{i-1\},\{i-1\}}/2)^{(\delta+i-1)/2}}{\Gamma((\delta+i-1)/2)}\;
\frac{\Gamma((\delta+i-1+\#\mathrm{nsp}_\prec(i))/2)}{((u_{ii.\{i-1\},\{i-1\}} + U_i)/2)^{(\delta+i-1+\#\mathrm{nsp}_\prec(i))/2}}.

Appendix B. Variational Updates for Gaussian Mixed Graph Models

The variational updates for the coefficient and intercept parameters are essentially identical to their joint conditional distribution given V and X, where occurrences of V and X are substituted by the expectations ⟨V^{-1}⟩_{q(V)} and ⟨X⟩_{q(X)}, respectively. Let V̄_{ij} be the ij-th entry of ⟨V^{-1}⟩_{q(V)}. The covariance matrix of (B, α) is the covariance matrix of the vector vec(B, α). Such a vector is constructed using all (non-zero) coefficients and intercepts. We denote this covariance matrix by Σ_{B,α}. For simplicity of notation, we will treat α_i as the coefficient b_{i(m+1)}, m being the number of variables. We will also adopt the notation Y^{(d)}_{m+1} ≡ 1 in the following derivations. As an abuse of notation, let Y also refer to latent variables. In this case, if Y_i and Y_j refer to latent variables X_{h_i} and X_{h_j}, then define Y_i ≡ ⟨X_{h_i}⟩_{q(X)} and Y_i Y_j ≡ ⟨X_{h_i} X_{h_j}⟩_{q(X)}. Let b_{ij} and b_{tv} be the r-th and s-th entries of vec(B, α), respectively. The rs-th entry of the inverse matrix Σ^{-1}_{B,α} is given by

(\Sigma^{-1}_{B,\alpha})_{rs} = \bar V_{it} \sum_{d=1}^{n} Y^{(d)}_j Y^{(d)}_v + \frac{\mathbb{1}(i=t)\,\mathbb{1}(j=v)}{s^b_{ij}}

where b_{x p_x} ≡ 0 if no edge Y_x ← Y_{p_x} exists in the graph, 1(·) is the indicator function, and c^b_{ij}, s^b_{ij} are the given prior parameters defined in Section 4. Similarly to the factorization criterion explained in Section 6, the matrix ⟨V^{-1}⟩_{q(V)} will in general be block-diagonal, and this summation can be highly simplified. Define now a vector c^b analogous to the Gibbs sampling case, where

c^b_r = \sum_{t=1}^{m} \bar V_{it} \sum_{d=1}^{n} Y^{(d)}_j Y^{(d)}_t + \frac{c^b_{ij}}{s^b_{ij}}.

The variational distribution q(B, α) is then a N(Σ_{B,α} c^b, Σ_{B,α}). The variational distribution for the latent variables will be exactly the same as the Gibbs distribution, except that references to B, α and V^{-1} are substituted by ⟨B⟩_{q(B,α)}, ⟨α⟩_{q(B,α)} and ⟨V^{-1}⟩_{q(V)}.

Appendix C. Proofs

Proof of Lemma 2: Arrange the columns of the Jacobian such that their order corresponds to the sequence σ_{11}, σ_{21}, σ_{22}, σ_{31}, σ_{32}, σ_{33}, . . .
, σ_{mm}, excluding the entries σ_{ij} that are identically zero by construction. Arrange the rows of the Jacobian such that their order corresponds to the sequence γ_1, β_{21}, γ_2, β_{31}, β_{32}, . . . , γ_m, excluding the entries β_{ij} that are not in Φ_G (i.e., exclude any β_{ij} corresponding to a pair {Y_i, Y_j} that is not adjacent in the graph). By the definition of Bartlett's decomposition, Σ_{{i},{i}} and β_{st} are functionally independent for s > i. The same holds for Σ_{{i},{i}} and γ_s. As such, ∂σ_{ij}/∂β_{st} = 0 and ∂σ_{ij}/∂γ_s = 0 for s > i. This implies that J(Φ_G) is a (lower) block triangular matrix of 2m − 1 blocks: for k odd, the k-th block is the singleton ∂σ_{ii}/∂γ_i = 1, where i = (k + 1)/2. For k even, the k-th block is the Jacobian ∂Σ_{i,sp_≺(i)}/∂B_{i,sp_≺(i)}, where i = 1 + k/2 and Σ_{i,sp_≺(i)} is the vector of covariances of Y_i and its preceding spouses. From the interpretation given by Equation (8), it follows that B_{i,sp_≺(i)} can also be defined by the regression of Y_i on Z_i. That is,

B_{i,\mathrm{sp}_\prec(i)} = \Sigma_{Y_i, Z_i} \Sigma^{-1}_{Z_i, Z_i} \equiv \Sigma_{Y_i, Z_i} R_i^{-1}. \qquad (25)

However, Σ_{Y_i, Z_i} = Σ_{i,sp_≺(i)}, since Y_i is independent of its non-spouses. From (25) we get Σ_{i,sp_≺(i)} = B_{i,sp_≺(i)} R_i, and as such the submatrix ∂Σ_{i,sp_≺(i)}/∂B_{i,sp_≺(i)} turns out to be R_i. Since the determinant of the block triangular Jacobian J(Φ_G) is given by the product of the determinants of its blocks, this implies

|J(\Phi_G)| = \prod_{i=2}^{m} |R_i|.

By the matrix identity

\left|\begin{matrix} A & B \\ C & D \end{matrix}\right| = |A|\,|D - C A^{-1} B|,

we have

|\Sigma_{\{i-1\},\{i-1\}}| = |\Sigma_{\mathrm{nsp}_\prec(i),\mathrm{nsp}_\prec(i)}|\,\bigl|\Sigma_{\mathrm{sp}_\prec(i),\mathrm{sp}_\prec(i)} - \Sigma_{\mathrm{sp}_\prec(i),\mathrm{nsp}_\prec(i)} \Sigma^{-1}_{\mathrm{nsp}_\prec(i),\mathrm{nsp}_\prec(i)} \Sigma_{\mathrm{nsp}_\prec(i),\mathrm{sp}_\prec(i)}\bigr| \qquad (26)
\equiv |\Sigma_{\mathrm{nsp}_\prec(i),\mathrm{nsp}_\prec(i)}|\,|R_i|.

Since |\Sigma_{\{i-1\},\{i-1\}}| = \prod_{t=1}^{i-1} \gamma_t, the second equality holds.

Proof of Theorem 4: We first describe a mapping from each path in G to a path in G⋆, and vice versa (such mappings are not inverse functions of each other, since the number of paths in G⋆ is larger than in G). By construction, all bi-directed edges in G⋆ have two UVs as endpoints, with a one-to-one mapping between each Y_s⋆ ↔ Y_t⋆ in G⋆ and each Y_s ↔ Y_t in G. All directed edges in G⋆ are of two types: Y_s → Y_t⋆, with s ≠ t, or Y_s⋆ → Y_s. Therefore, one can define a unique path P in G as a function of a path P⋆ in G⋆, obtained by relabeling each Y⋆ as Y, and by collapsing any Y → Y edges that might result from this relabeling into a single vertex Y. A mapping in the opposite direction is analogous, as given by the construction rule of Type-II models. A collider in a path is any vertex within a head-to-head collision in the path, that is, any vertex Y_t where the preceding and the next vertex in the path are connected to Y_t by an edge (directed or bi-directed) into Y_t. Y_i and Y_j are m-separated by Z in an acyclic DMG if and only if there is no active path connecting Y_i and Y_j. Like in d-separation, a path is active if all of its colliders have some descendant in Z, and none of its non-colliders is in Z (Richardson, 2003). The mappings between paths P and P⋆ are such that Y_t is a collider in P if and only if Y_t is in P⋆ and is a collider, or Y_t⋆ is in P⋆ and is a collider. Since by construction any Y_t⋆ will have the same Y-descendants in G⋆ as Y_t has in G, and Z ⊂ Y, the result follows.

Proof of Theorem 7: The first of the two claims of the theorem trivially holds, since connectivity is a transitive property and as such this partition will always exist (where K(i) = 1 is a possibility). We will prove the validity of the second claim by induction.
Let {R_1, . . . , R_k} be the perfect sequence that generated our perfect ordering. The second claim automatically holds for all vertices in R_k, since R_k is a clique. Assume the second claim holds for the subsequence {R_{l+1}, R_{l+2}, . . . , R_k}. Let Y_i be an element of R_l. Assume there is some non-spouse Y_q of Y_i in R_{l′}, and some spouse Y_p of Y_i in R_{l′′}, such that l < l′ ≤ l′′. We will assume that both Y_q and Y_p belong to the same component V_t and show this leads to a contradiction. Without loss of generality, we can assume that Y_q and Y_p are adjacent: otherwise, the fact that Y_q and Y_p are in the connected set V_t will imply there is a path connecting Y_q and Y_p in the subgraph induced by {R_{l+1}, . . . , R_k}. We can redefine {Y_q, Y_p} to be the endpoints of the first edge in the path containing a non-spouse and a spouse of Y_i. It will still be the case that q > p, by the induction hypothesis. Since Y_p ∈ R_{l′′}, there is a separator S_{l′′} between H_{l′′}\S_{l′′} and R_{l′′}. But Y_i ∈ H_{l′′}, and Y_i is adjacent to Y_p, which implies Y_i ∈ S_{l′′}. If l′ < l′′, this will also imply that Y_q ∈ S_{l′′}, which is a contradiction, since S_{l′′} is a complete set. If l′ = l′′, this implies that Y_i and Y_q are both in Y_{P(l′′)}, which is also a contradiction, since Y_{P(l′′)} is a clique.

References

A. Al-Awadhi and P. Garthwaite. An elicitation method for multivariate normal distributions. Communications in Statistics - Theory and Methods, 27:1123–1142, 1998.

J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993.

A. Atay-Kayis and H. Massam. A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92:317–335, 2005.

D. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Arnold Publishers, 1999.

M. Beal. Variational algorithms for approximate Bayesian inference. PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

M. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1:793–832, 2006.

P. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36:2577–2604, 2008.

K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989.

P. Brown, N. Le, and J. Zidek. Inference for a covariance matrix. In P.R. Freeman, A.F.M. Smith (editors), Aspects of Uncertainty, a tribute to D. V. Lindley, pages 77–92, 1993.

Z. Chen and D. Dunson. Random effects selection in linear mixed models. Biometrics, 59:762–769, 2003.

D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

M. Daniels and R. Kass. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94:1254–1263, 1999.

M. Drton and M. Perlman. Multiple testing and error control in Gaussian graphical model selection. Statistical Science, pages 430–449, 2007.

M. Drton and T. Richardson. A new algorithm for maximum likelihood estimation in Gaussian models for marginal independence. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, 2003.

M. Drton and T. Richardson. Iterative conditional fitting for Gaussian ancestral graph models. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.

M. Drton and T. Richardson.
Binary models for marginal independence. Department of Statistics, University of Washington, Tech. report 474, 2005.

M. Drton and T. Richardson. Binary models for marginal independence. Journal of the Royal Statistical Society, Series B, 70:287–309, 2008a.

M. Drton and T. Richardson. Graphical methods for efficient likelihood inference in Gaussian covariance models. Journal of Machine Learning Research, pages 893–914, 2008b.

D. Dunson, J. Palomo, and K. Bollen. Bayesian structural equation modeling. Statistical and Applied Mathematical Sciences Institute, Technical Report #2005-5, 2005.

N. Friedman and D. Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal, 50:95–126, 2003.

P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. Neural Information Processing Systems, 2008.

J. Huang and B. Frey. Cumulative distribution networks and the derivative-sum-product algorithm. Proceedings of 24th Conference on Uncertainty in Artificial Intelligence, 2008.

B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, and M. West. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20:388–400, 2005.

M. Jordan. Learning in Graphical Models. MIT Press, 1998.

M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in Graphical Models, pages 105–162, 1998.

C. Kang and J. Tian. Local Markov property for models satisfying the composition axiom. Proceedings of 21st Conference on Uncertainty in Artificial Intelligence, 2005.

J. Kotecha and P. Djuric. Gibbs sampling approach for the generation of truncated multivariate Gaussian random variables. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1757–1760, 1999.

S. Lauritzen. Graphical Models. Oxford University Press, 1996.

S.-Y. Lee. Structural Equation Modeling: a Bayesian Approach. Wiley, 2007.

D. MacKay. Introduction to Monte Carlo methods. Learning in Graphical Models, pages 175–204, 1998.

I. Murray, Z. Ghahramani, and D. MacKay. MCMC for doubly-intractable distributions. Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

R. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.

R. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.

J. Pearl. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

M. Pitt, D. Chan, and R. Kohn. Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93:537–554, 2006.

T. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145–157, 2003.

T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002.

A. Roverato. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29:391–411, 2002.

R. Scheines, R. Hoijtink, and A. Boomsma. Bayesian estimation and testing of structural equation models. Psychometrika, 64:37–52, 1999.

R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models.
Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

R. Silva and Z. Ghahramani. Factorial mixtures of Gaussians and the marginal independence model. Artificial Intelligence & Statistics (AISTATS '09), 2009.

R. Silva and R. Scheines. Bayesian learning of measurement and structural models. 23rd International Conference on Machine Learning, 2006.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

R. Silva, W. Chu, and Z. Ghahramani. Hidden common cause relations in relational learning. Neural Information Processing Systems (NIPS '07), 2007.

J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1:833–860, 2006.

P. Spirtes. Directed cyclic graphical representations of feedback models. Proceedings of 11th Conference on Uncertainty in Artificial Intelligence, 1995.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.

R. Tarjan. Decomposition by clique separators. Discrete Mathematics, 55:221–232, 1985.

E. Webb and J. Forster. Bayesian model determination for multivariate ordinal and binary data. Technical report, Southampton Statistical Sciences Research Institute, 2006.

F. Wong, C. Carter, and R. Kohn. Efficient estimation of covariance selection models. Biometrika, 90:809–830, 2003.

S. Wright. Correlation and causation. Journal of Agricultural Research, pages 557–585, 1921.

J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.

J. Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, pages 1437–1474, 2008.