jmlr jmlr2009 jmlr2009-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
Reference: text
sentIndex sentText sentNum sentScore
1 In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. [sent-9, score-0.349]
2 In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. [sent-12, score-0.542]
3 Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. [sent-16, score-0.38]
4 Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. [sent-19, score-0.356]
5 Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation 1. [sent-20, score-0.405]
6 LDA models each of D documents in a collection as a mixture over K latent topics, with each topic being a multinomial distribution over a vocabulary of W words. [sent-46, score-0.379]
7 N_kj = ∑_w N_wkj and N_wk = ∑_j N_wkj are the two primary count arrays used in computations, representing the number of words assigned to topic k in document j, and the number of times word w is assigned to topic k in the corpus, respectively. [sent-64, score-0.554]
8 Given the observed words x = {xi j }, the task of Bayesian inference for LDA is to compute the posterior distribution over the latent topic assignments z = {zi j }, the mixing proportions θ j , and the topics φk . [sent-66, score-0.593]
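The conditional referred to later as Equation (3) is not reproduced in this excerpt; the sketch below assumes it is the standard collapsed Gibbs update built from the N_kj and N_wk count arrays, with symmetric hyperparameters α and β. All function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def gibbs_sweep(words, docs, z, Nkj, Nwk, Nk, alpha, beta, W, rng):
    """One sequential collapsed Gibbs sweep over all tokens.

    words[i], docs[i] : word id and document id of token i
    z[i]              : current topic assignment of token i
    Nkj[k, j], Nwk[w, k], Nk[k] : count arrays kept in sync with z
    """
    K = Nk.shape[0]
    for i in range(len(words)):
        w, j, k_old = words[i], docs[i], z[i]
        # remove token i from the counts
        Nkj[k_old, j] -= 1; Nwk[w, k_old] -= 1; Nk[k_old] -= 1
        # p(z_i = k | rest) ∝ (N_kj + alpha) * (N_wk + beta) / (N_k + W*beta)
        p = (Nkj[:, j] + alpha) * (Nwk[w, :] + beta) / (Nk + W * beta)
        k_new = rng.choice(K, p=p / p.sum())
        # add token i back with its new topic
        z[i] = k_new
        Nkj[k_new, j] += 1; Nwk[w, k_new] += 1; Nk[k_new] += 1
    return z
```

A sweep of this form is what each processor runs over its own local tokens in the distributed algorithms discussed below.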
9 While HDP is defined to have infinitely many topics, the sampling algorithm only instantiates topics as needed. [sent-89, score-0.342]
10 In this partially collapsed scheme, the latent variables zi j on each processor can be concurrently sampled, where the concurrency is over processors. [sent-94, score-0.414]
11 The topic assignments are partitioned across processors as {z_1, . . . , z_P}, where processor p stores x_p, the words from documents j = (p − 1)D/P + 1, . . . , pD/P. [sent-119, score-0.343]
12 To asymptotically sample from the posterior distribution, the update of any topic assignment zi j cannot be performed concurrently with the update of any other topic assignment zi′ j′ . [sent-127, score-0.57]
13 But given the typically large number of word tokens compared to the number of processors, to what extent will the update of one topic assignment zi j depend on the update of any other topic assignment zi′ j′ ? [sent-128, score-0.539]
14 If two processors are concurrently updating different words (i.e., wi j ≠ wi′ j′), then concurrent sampling will be very close to sequential sampling because the only term affecting the order of operations is the total count of topics ∑w Nwk in the denominator of (3). [sent-133, score-0.415]
15 After processor p has swept through its local data and updated topic assignments z p , the processor has modified count arrays Nk jp and Nwkp . [sent-136, score-0.942]
16 The topic-document counts Nk jp are distinct because of the document index, j, and will be consistent with the topic assignments z. [sent-137, score-0.562]
17 After the synchronization and update operations, each processor has the same values in the Nwkp array which are consistent with the global vector of topic assignments z. [sent-140, score-0.552]
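The exact reduction used in the synchronization step is not written out in this excerpt; the sketch below assumes each processor begins its sweep from the same broadcast copy of N_wk, so that summing the per-processor differences reconstructs a single global array consistent with the global vector of topic assignments z. Function and argument names are illustrative.

```python
import numpy as np

def adlda_synchronize(Nwk_old, Nwk_locals):
    """Rebuild one global word-topic count array from the per-processor copies.

    Nwk_old    : global counts broadcast at the start of the sweep, shape (W, K)
    Nwk_locals : list of per-processor copies N_wkp after their local sweeps
    """
    # each processor only changed the assignments of its own tokens, so its
    # net contribution is the difference from the broadcast counts
    Nwk_new = Nwk_old + sum(Nwkp - Nwk_old for Nwkp in Nwk_locals)
    return Nwk_new
```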
18 2 Hierarchical Distributed Latent Dirichlet Allocation In AD-LDA we constructed an algorithm where each processor is independently computing an LDA model, but at the end of each sweep through a processor’s data, a consistent global array of topic counts Nwk is reconstructed. [sent-147, score-0.541]
19 This global array of topic counts could be thought of as a parent topic distribution, from which each processor draws its own local topic distribution. [sent-148, score-0.936]
20 The generative process, shown in Figure 3 (graphical model for Hierarchical Distributed Latent Dirichlet Allocation), is given [sent-156, score-0.452]
21 by: β_k ∼ G[a, b], α_p ∼ G[c, d], θ_jp ∼ D[α_p], z_ijp ∼ θ_jp, Φ_k ∼ D[γ], ϕ_kp ∼ D[β_k Φ_k], x_ijp ∼ ϕ_{z_ijp, p}. (5) [sent-157, score-0.922]
22 The collapsed distribution of z_p and x_p on processor p is given by: p(z_p, x_p | α_p, β, Φ) = ∏_j [ Γ(Kα_p) / ∏_k Γ(α_p) ] [ ∏_k Γ(N_kjp + α_p) / Γ(N_jp + Kα_p) ] ∏_k [ Γ(β_k) / ∏_w Γ(β_k Φ_w|k) ] [ ∏_w Γ(N_wkp + β_k Φ_w|k) / Γ(N_kp + β_k) ]. (6) [sent-162, score-0.728]
23 From this we derive the conditional probability for sampling a topic assignment z_ijp. [sent-163, score-0.557]
24 Unlike AD-LDA, the topic assignments on any processor are now conditionally independent of the topic assignments on the other processors given Φ, thus allowing each processor to sample z p concurrently. [sent-164, score-1.161]
25 The conditional probability of z_ijp is p(z_ijp = k | z_¬ijp, x, α_p, β, Φ) ∝ (N_kjp^¬ijp + α_p) (N_wkp^¬ijp + β_k Φ_w|k) / (N_kp^¬ijp + β_k). [sent-165, score-1.356]
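A minimal sketch of evaluating this conditional locally on processor p. The normalizer N_kp + β_k is taken from Equation (6) (it follows because Φ_·|k sums to one), and all array names are illustrative rather than the authors' implementation.

```python
import numpy as np

def hdlda_topic_probs(w, j, Nkjp, Nwkp, Nkp, alpha_p, beta, Phi):
    """Conditional p(z_ijp = k | rest) on processor p, with the current token
    already removed from the local counts.

    Nkjp : (K, D_p) local document-topic counts
    Nwkp : (W, K)   local word-topic counts
    Nkp  : (K,)     local topic totals
    beta : (K,)     topic-level concentrations
    Phi  : (W, K)   shared parent topic distributions
    """
    # the prior beta_k * Phi_w|k replaces the single symmetric beta of AD-LDA
    p = (Nkjp[:, j] + alpha_p) * (Nwkp[w, :] + beta * Phi[w, :]) / (Nkp + beta)
    return p / p.sum()
```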
26 Algorithm 2 HD-LDA:
repeat
  for each processor p in parallel do
    Sample z_p locally: LDA-Gibbs-Iteration(x_p, z_p, N_kjp, N_wkp, α_p, β_k Φ_k)
    Sample α_p locally
  end for
  Synchronize
  Sample: β_k, Φ_k
  Broadcast: β_k, Φ_k
until termination criterion satisfied [sent-178, score-0.525]
27 Unlike AD-LDA, which uses a fixed number of topics, individual processors in AD-HDP may instantiate new topics during the sampling phase, according to the HDP sampling Equation (4). [sent-183, score-0.591]
28 During the synchronization and update step, instead of treating each processor’s new topics as distinct, we merge new topics that were instantiated on different processors. [sent-184, score-0.659]
29 Merging new topics helps limit unnecessary growth in the total number of topics and allows AD-HDP to produce more of a global model. [sent-185, score-0.573]
30 A simple way— inspired by AD-LDA—is to merge new topics based on their integer topic label. [sent-188, score-0.508]
31 A more complicated way is to match new topics across processors based on topic similarity. [sent-189, score-0.693]
32 In the first merging scheme, new topics are merged based on their integer topic label. [sent-190, score-0.526]
33 For example, assume that we have three processors, and at the end of a sweep through the data, processor one has 8 new topics, processor two has 6 new topics, and processor three has 7 new topics. [sent-191, score-0.725]
34 Then during synchronization, all these new topics would be aligned by topic label and their counts summed, producing 8 new global topics, as shown in Figure 4. [sent-192, score-0.55]
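A minimal sketch of merging by topic-id, assuming each processor reports its newly-created topics as rows of word counts; the array layout and names are illustrative.

```python
import numpy as np

def merge_new_topics_by_id(new_topic_counts):
    """Merge per-processor new topics by their integer label (simplest AD-HDP scheme).

    new_topic_counts : list over processors; element p is an array of shape
                       (T_p, W) holding word counts for the T_p topics newly
                       created on processor p during the sweep
    """
    T_max = max(counts.shape[0] for counts in new_topic_counts)
    W = new_topic_counts[0].shape[1]
    merged = np.zeros((T_max, W))
    for counts in new_topic_counts:
        merged[:counts.shape[0], :] += counts   # align by label, sum counts
    return merged                               # T_max new global topics
```

With 8, 6 and 7 new topics on three processors, T_max = 8 and the procedure yields 8 new global topics, matching the example above.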
35 We will also show that AD-HDP ultimately learns models with similar perplexity to HDP, irrespective of how new topics are merged. [sent-196, score-0.601]
36 We also investigate more complicated schemes for merging new topics in AD-HDP, beyond the simple approach of merging by topic-id. [sent-197, score-0.358]
37 Instead of aligning new topics based on topic-id, it is possible to align new topics using a similarity metric such as the symmetric Kullback-Leibler divergence. [sent-198, score-0.540]
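A small helper for the similarity computation, assuming topics are compared through smoothed, normalized word-count vectors; the smoothing constant is an assumption for numerical safety, not a value from the paper.

```python
import numpy as np

def symmetric_kl(counts_a, counts_b, eps=1e-10):
    """Symmetric KL divergence between two topics, each given as a vector of
    word counts (smoothed and normalized to distributions)."""
    p = (counts_a + eps) / (counts_a + eps).sum()
    q = (counts_b + eps) / (counts_b + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```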
38 In the bipartite matching scheme, we select a reference processor and perform bipartite matching between every processor’s new topics and the set of new topics of the reference processor. [sent-201, score-0.854]
39 In the greedy matching scheme, new topics on each processor are sequentially compared to a global set of new topics. [sent-204, score-0.594]
40 If a new topic is sufficiently different from every topic in the global set, the number of topics in the global set is incremented; otherwise, the counts for that new topic are added to those from the closest match in the global set. [sent-206, score-1.04]
41 The worst case complexity of this algorithm is O(P²T²); this is the case where every new topic is found to be different from every other new topic in the global set. [sent-208, score-0.457]
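A sketch of the greedy scheme as described above, reusing a distance such as the symmetric KL helper; the threshold that decides when a topic is "sufficiently different" is an assumed tuning parameter not specified in this excerpt.

```python
import numpy as np

def greedy_match_new_topics(new_topic_counts, threshold, distance):
    """Greedily match new topics against a growing global set of new topics.

    new_topic_counts : list over processors of (T_p, W) count arrays
    threshold        : distance above which a topic counts as genuinely new
    distance         : e.g., symmetric_kl from the sketch above
    """
    global_set = []                                   # list of (W,) count vectors
    for counts in new_topic_counts:
        for topic in counts:
            if not global_set:
                global_set.append(topic.copy())
                continue
            d = [distance(topic, g) for g in global_set]
            best = int(np.argmin(d))
            if d[best] > threshold:
                global_set.append(topic.copy())       # sufficiently different: new global topic
            else:
                global_set[best] += topic             # otherwise add counts to closest match
    return global_set
```

When every new topic exceeds the threshold, every pairwise comparison is made, which gives the O(P²T²) worst case noted above.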
42 The distributed algorithms were initialized by first randomly assigning topics to words in z, then counting topics in documents, Nk jp , and words in topics, Nwkp , for each processor. [sent-232, score-0.87]
43 In our perplexity experiments, multiple processors were simulated in software by separating data, running sequentially through each processor, and simulating the global synchronization and update steps. [sent-235, score-0.619]
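A high-level sketch of how such a software simulation could be organized; gibbs_local and synchronize are placeholders for the local sweep and the global update described earlier, not functions from the authors' code.

```python
def simulated_adlda(token_partitions, Nwk, gibbs_local, synchronize, n_iters):
    """Simulate P processors in software: sweep each partition in turn,
    then apply the global synchronization step.

    token_partitions : per-processor token/assignment state
    gibbs_local      : runs one local Gibbs sweep and returns updated N_wkp
    synchronize      : merges the P local copies back into one global N_wk
    """
    for _ in range(n_iters):
        Nwk_locals = []
        for part in token_partitions:          # sequential stand-in for "in parallel"
            Nwkp = gibbs_local(part, Nwk.copy())
            Nwk_locals.append(Nwkp)
        Nwk = synchronize(Nwk, Nwk_locals)     # global update step
    return Nwk
```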
44 For the speedup experiments, computations were run on 64 to 1024 processors on a 2000+ processor parallel supercomputer. [sent-236, score-0.553]
45 For AD-LDA, the perplexity computation exactly follows that of LDA, since a single set of topic counts Nwk is saved when a sample is taken. [sent-257, score-0.578]
46 In contrast, all P copies of Nwkp are required to compute perplexity for HD-LDA. [sent-258, score-0.331]
47 This perplexity computation, given in Equation (7), follows the standard practice of averaging over multiple chains when making predictions with LDA models trained via Gibbs sampling, as discussed in Griffiths and Steyvers (2004). [sent-260, score-0.331]
48 Averaging over ten samples significantly reduces perplexity compared to using a single sample from one chain. [sent-261, score-0.331]
49 Analogous perplexity calculations are used for HD-LDA and AD-HDP. [sent-263, score-0.331]
50 Each processor learns a document mixture θ jp using the fold-in part for each test document. [sent-265, score-0.552]
51 Computing perplexity in this manner prevents the possibility of seeing or using test words during the training and fold-in phases. [sent-268, score-0.371]
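A sketch of the perplexity computation, assuming the standard smoothed estimates θ̂_jk = (α + N_kj)/(Kα + N_j) and φ̂_wk = (β + N_wk)/(Wβ + N_k) inside Equation (7), with the predictive word probability averaged over the S saved samples before taking logs; all names are illustrative.

```python
import numpy as np

def perplexity(test_tokens, Nkj_samples, Nwk_samples, alpha, beta, W, K):
    """Test perplexity averaged over S saved samples (Griffiths and Steyvers, 2004).

    test_tokens : list of (doc_id j, word_id w) pairs for the held-out test words
    Nkj_samples : S arrays (K, D_test) of fold-in document-topic counts
    Nwk_samples : S arrays (W, K) of word-topic counts from training
    """
    S = len(Nkj_samples)
    log_prob, n = 0.0, 0
    for j, w in test_tokens:
        p = 0.0
        for s in range(S):
            Nkj, Nwk = Nkj_samples[s], Nwk_samples[s]
            theta = (Nkj[:, j] + alpha) / (Nkj[:, j].sum() + K * alpha)   # theta_hat_jk
            phi = (Nwk[w, :] + beta) / (Nwk.sum(axis=0) + W * beta)       # phi_hat_wk
            p += theta @ phi
        log_prob += np.log(p / S)                                         # average, then log
        n += 1
    return float(np.exp(-log_prob / n))
```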
52 1 Perplexity The perplexity results for KOS and NIPS in Figure 5 clearly show that the model perplexity is essentially the same for the distributed models AD-LDA and AD-HDP at P = 10 and P = 100 as their single-processor versions at P = 1. [sent-270, score-0.731]
53 The P = 1 perplexity is computed by LDA (circles) and HDP (triangles), and we use our distributed algorithms—AD-LDA (crosses), HD-LDA (squares), and AD-HDP (stars)—to compute the P = 10 and P = 100 perplexities. [sent-272, score-0.4]
54 The variability in perplexity as a function of the number of topics is much greater than the variability due to the number of processors. [sent-273, score-0.601]
55 Note that there is essentially no perplexity difference between AD-LDA and HD-LDA. [sent-274, score-0.331]
56 Even in the limit of a large number of processors, the perplexity for the distributed algorithms matches that for the sequential version. [sent-278, score-0.417]
57 Left plot shows perplexity for KOS and right plot shows perplexity for NIPS. [sent-281, score-0.702]
58 AD-HDP instantiates fewer topics but produces a similar perplexity to HDP. [sent-282, score-0.618]
59 The average number of topics instantiated by HDP on KOS was 669 while the average number of topics instantiated by AD-HDP was 490 (P = 10) and 471 (P = 100). [sent-283, score-0.638]
60 For NIPS, HDP instantiated 687 topics while AD-HDP instantiated 569 (P = 10) and 569 (P = 100) topics. [sent-284, score-0.368]
61 AD-HDP instantiates fewer topics because of the merging across processors of newly-created topics. [sent-285, score-0.542]
62 The similar perplexity results for AD-HDP compared to HDP, despite the fewer topics, is partly due to the relatively small probability mass in many of the topics. [sent-286, score-0.331]
63 We also tested both our distributed LDA algorithms with adversarial/non-random distributions of topics across processors using synthesized data. [sent-288, score-0.55]
64 One example of an adversarial distribution of documents is where each document only uses a single topic, and these documents are distributed such that processor p only has documents that are about topic p. [sent-289, score-0.847]
65 In this case the distributed topic models have to learn the correct set of P topics, even though each processor only sees local documents that pertain to just one of the topics. [sent-290, score-0.602]
66 We ran multiple experiments, starting with 1000 documents that were hard-assigned to K = 10 topics (i.e., each document contained words from only one topic). [sent-291, score-0.359]
67 The perplexity performance of AD-LDA and HD-LDA under these adversarial/non-random distributions of documents was as good as the performance when the documents were distributed randomly, and as good as the performance of single-processor LDA. [sent-294, score-0.578]
68 To demonstrate that the low perplexities obtained from the distributed algorithms with P = 100 processors are not just due to averaging effects, we split the NIPS corpus into one hundred 15-document collections, and ran LDA separately on each of these hundred collections. [sent-295, score-0.344]
69 The test perplexity at K = 40 computed by averaging 100 separate LDA models was 2117, significantly higher than the P = 100 test perplexity of 1575 for AD-LDA and HD-LDA. [sent-296, score-0.698]
70 As an example, Figure 7 shows test perplexity versus iteration of the Gibbs sampler for the NIPS data at K = 20 topics. [sent-303, score-0.44]
71 During burn-in, up to iteration 200, the distributed algorithms are actually converging slightly faster than single processor LDA. [sent-304, score-0.332]
72 Note that one iteration of AD-LDA or HD-LDA on a parallel multi-processor computer only takes a fraction (at best 1/P) of the wall-clock time of one iteration of LDA on a single processor computer. [sent-305, score-0.353]
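Measured per-iteration wall-clock times can be turned into the speedup and parallel-efficiency numbers reported in the speedup experiments; this helper is purely illustrative.

```python
def speedup_and_efficiency(t_single, t_parallel, P):
    """Speedup and parallel efficiency from measured per-iteration times.

    t_single   : wall-clock time per LDA iteration on one processor
    t_parallel : wall-clock time per AD-LDA/HD-LDA iteration on P processors
    """
    speedup = t_single / t_parallel
    efficiency = speedup / P          # 1.0 corresponds to the ideal 1/P scaling
    return speedup, efficiency
```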
73 The number of new topics instantiated in one pass of AD-HDP is limited to the maximum number of new topics instantiated on any one processor. [sent-312, score-0.638]
74 Correspondingly, at 500 iterations, the perplexity of HDP is lower than the perplexity of AD-HDP. [sent-314, score-0.662]
75 Figure 9 shows that performing the greedy matching scheme for new topics as described in Algorithm 4 significantly improves the rate of convergence for AD-HDP. [sent-318, score-0.346]
76 The number of topics increases at a faster rate for AD-HDP with matching, since the greedy matching scheme is more flexible in that the number of new topics at each iteration is not limited to the maximum number of new topics instantiated on any one processor. [sent-320, score-0.966]
77 In practice, only a few new topics are generated locally on each processor each iteration, and so the computational overhead of this heuristic matching scheme is minimal relative to the time for Gibbs sampling. [sent-322, score-0.577]
78 Left plot shows test perplexity versus iteration for HDP and AD-HDP. [sent-324, score-0.42]
79 Right plot shows number of topics versus iteration for HDP and AD-HDP. [sent-325, score-0.341]
80 Figure 10 shows the test perplexity computed on the NIPS data set, as a function of the number of topics, for the LDA algorithms and a fixed number of processors P = 10 (the results for the KOS data set were quite similar and therefore not shown). [sent-328, score-0.56]
81 This lower perplexity may be due to the fact that HD-LDA computes test perplexity using P sets of topic parameters, and thus has more parameters than AD-LDA with which to fit the data. [sent-331, score-0.892]
82 Left plot shows test perplexity versus iteration for HDP and AD-HDP. [sent-336, score-0.42]
83 Right plot shows number of topics versus iteration for HDP and AD-HDP. [sent-337, score-0.341]
84 Figure 10: Test perplexity versus number of topics using the NIPS data set (S=5), comparing LDA, AD-LDA (P=10), and HD-LDA (P=10). [sent-339, score-0.621]
85 For each test document, the training documents are ranked according to how probable the test document is under each training document’s mixture θ j and the set of topics φ. [sent-341, score-0.48]
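A sketch of this ranking, assuming the score for training document j is the log probability of the test document's tokens under θ_j and the shared topics φ, with tokens treated independently; names are illustrative.

```python
import numpy as np

def rank_training_documents(test_word_ids, theta, phi):
    """Rank training documents by how probable a test document is under each
    training document's mixture theta_j and the set of topics phi.

    test_word_ids : word ids of the test document's tokens
    theta         : (D_train, K) training document mixtures
    phi           : (K, W) topic-word distributions
    """
    # p(w | j) = sum_k theta[j, k] * phi[k, w], evaluated for every training doc j
    word_probs = theta @ phi[:, test_word_ids]          # (D_train, n_tokens)
    scores = np.log(word_probs + 1e-12).sum(axis=1)     # log p(test doc | j)
    return np.argsort(-scores)                          # best-matching training docs first
```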
86 We performed large-scale speedup experiments with just AD-LDA instead of all three of our distributed topic modeling algorithms because AD-LDA produces very similar results to HD-LDA, but with significantly less computation. [sent-377, score-0.332]
87 We ran AD-LDA on WIKIPEDIA using K = 1000 topics and PUBMED using K = 2000 topics distributed over P = 64, 128, 256, 512 and 1024 processors. [sent-385, score-0.609]
88 Figure 15 shows that AD-LDA is much closer to the LDA topics E[φLDA ] than either are to the true topics φ∗ , telling us that the sampling variation in learning LDA models from finite data sets is much greater than the variation between LDA and AD-LDA on the same data sets. [sent-514, score-0.595]
89 This can happen because the topics (with the same integer label) on each processor can drift far apart, so that topic k on one processor diverges from topic k on another processor. [sent-522, score-1.158]
90 The P = 2 processor case is the worst case for AD-LDA, since one half of the total words on each processor have the freedom to drift. [sent-525, score-0.486]
91 In contrast, with P = 100 processors, each processor can only locally modify 1/100th of the topic assignments, so the topics on each processor cannot drift far from the global set of topic counts at the previous iteration. [sent-526, score-1.243]
92 Bipartite matching significantly improves the perplexity in the P = 2 processor case, suggesting that the lack of communication has indeed caused the topics to drift apart. [sent-527, score-0.891]
93 Fortunately, topic drifting becomes less of a problem as more processors are used, and can be eliminated by frequent synchronization. [sent-528, score-0.423]
94 For example, when learning topics for a very large document collection, one is usually satisfied with mean values of word-topic distributions taken from a single chain. [sent-562, score-0.337]
95 There the authors perform topic modeling on a collection of books by learning a different topic model for each book and then clustering these learned topics together to find global topics. [sent-574, score-0.727]
96 With our approximate distributed algorithm, AD-LDA, we sample from an approximation to the posterior distribution by allowing different processors to concurrently sample topic assignments on their local subsets of the data. [sent-584, score-0.577]
97 With our hierarchical distributed model, HD-LDA, we adapt the underlying LDA model to map to the distributed processor architecture. [sent-586, score-0.4]
98 For HDP, our distributed algorithm eventually produced the same perplexity as the single-processor version of HDP. [sent-592, score-0.4]
99 Prior to reaching the converged perplexity result, AD-HDP had higher perplexity than HDP since the merging of new topics by label slows the rate of topic growth. [sent-593, score-1.188]
100 Using the expansions (8, 9) we introduce the auxiliary variables t and s: P(α_p, t, s | ·) ∝ α_p^(c−1) e^(−d α_p) ∏_j t_j^(Kα_p − 1) (1 − t_j)^(N_jp − 1) ∏_j ∏_k S(N_kjp, s_kjp) α_p^(s_kjp). [sent-613, score-0.695]
wordName wordTfidf (topN-words)
[('lda', 0.523), ('hdp', 0.348), ('perplexity', 0.331), ('topics', 0.27), ('processor', 0.232), ('jp', 0.217), ('topic', 0.212), ('processors', 0.211), ('nwkp', 0.157), ('nwk', 0.145), ('ad', 0.095), ('gibbs', 0.093), ('documents', 0.089), ('kos', 0.087), ('elling', 0.081), ('ewman', 0.081), ('myth', 0.081), ('suncion', 0.081), ('istributed', 0.075), ('opic', 0.075), ('dirichlet', 0.075), ('distributed', 0.069), ('nk', 0.067), ('document', 0.067), ('collapsed', 0.062), ('parallel', 0.059), ('pubmed', 0.058), ('sampling', 0.055), ('zi', 0.054), ('lgorithms', 0.053), ('speedup', 0.051), ('instantiated', 0.049), ('synchronization', 0.044), ('merging', 0.044), ('odels', 0.041), ('matching', 0.041), ('nkp', 0.041), ('perplexities', 0.041), ('swkp', 0.041), ('tkp', 0.041), ('sampler', 0.04), ('kp', 0.04), ('wikipedia', 0.04), ('counts', 0.035), ('latent', 0.035), ('samplers', 0.034), ('teh', 0.034), ('global', 0.033), ('allocation', 0.032), ('nips', 0.031), ('concurrently', 0.031), ('iteration', 0.031), ('assignments', 0.031), ('hierarchical', 0.03), ('sweep', 0.029), ('merge', 0.026), ('mode', 0.026), ('memory', 0.025), ('parallelize', 0.025), ('vocabulary', 0.025), ('ref', 0.023), ('synchronize', 0.023), ('ths', 0.023), ('xtest', 0.023), ('variational', 0.023), ('posterior', 0.023), ('word', 0.023), ('corpus', 0.023), ('sk', 0.022), ('hd', 0.022), ('words', 0.022), ('plot', 0.02), ('mcmc', 0.02), ('distributing', 0.02), ('blei', 0.02), ('grif', 0.02), ('supercomputer', 0.02), ('versus', 0.02), ('hyperparameters', 0.019), ('google', 0.019), ('assignment', 0.019), ('collections', 0.018), ('mixture', 0.018), ('test', 0.018), ('generative', 0.018), ('asuncion', 0.018), ('greedy', 0.018), ('count', 0.018), ('antoniak', 0.018), ('smyth', 0.018), ('ics', 0.017), ('instantiates', 0.017), ('pw', 0.017), ('steyvers', 0.017), ('sync', 0.017), ('synchronizations', 0.017), ('communication', 0.017), ('sequential', 0.017), ('scheme', 0.017), ('locally', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
2 0.11238275 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
3 0.07121072 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
4 0.040218644 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
5 0.039255198 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
Author: Ricardo Silva, Zoubin Ghahramani
Abstract: Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly. Keywords: graphical models, structural equation models, Bayesian inference, Markov chain Monte Carlo, latent variable models
Define functions sp(·), nsp(·) according to G 3. For i = 1, 2, 3, . . . , m 4. Sample γi ∼ IG((δ + (m − 1) + #nsp(i))/2, (uii.{\i},{\i} + U\i )/2) 5. Sample Bi,sp(i) from a N(K\i m\i , γi K\i ) 6. Set Bi,nsp(i) = −Bi,sp(i) Σsp(i),nsp(i) Σ−1 nsp(i),nsp(i) 7. Set ΣT = Σi,{\i} = Bi Σ{\i},{\i} {\i},i 8. Set σii = γi + Bi Σi,{\i} 9. Return Σ. Figure 5: A procedure for generating a sampled Σ within a Gibbs sampling procedure. procedure, we sample the whole i-th row of Σ, for each 1 ≤ i ≤ m, by conditioning on the remaining independent entries of the covariance matrix as obtained on the previous Markov chain iteration. The conditional densities required by the Gibbs sampler can be derived from (12), which for a particular ordering ≺ implies m p(Σ; δ, U, G ) ∝ pg (γ1 ) ∏ pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) fi (Φi−1 ). i=2 By an abuse of notation, we used Σ in the left-hand side and the Bartlett parameters in the righ-hand side. The conditional density of {Bm,sp≺ (m) , γm } given all other parameters is therefore p(Bm,sp≺ (m) , γm |ΦG \{Bm,sp≺ (m) , γm }) = pb (Bm,sp≺ (m) |γm , Φm−1 )pg (γm |Φm−1 ) from which we can reconstruct a new sample of the m-th row/column of Σ after sampling {Bm,sp≺ (m) , γm }. Sampling other rows can be done by redefining a new order where the corresponding target variable is the last one. More precisely: let {\i} denote the set {1, 2, . . . , i − 1, i + 1, . . . , m}. The Gibbs algorithm is analogous to the previous algorithms. Instead of sp≺ (i) and nsp≺ (i), we refer to the original sp(i) and nsp(i). Matrices Σ{\i},{\i} and U{\i},{\i} are defined by deleting the respective i-th row and i-th columns. Row vector Σi,{\i} and scalar uii.{\i} are defined accordingly, as well as any other vector and matrix originally required in the marginal likelihood/importance sampling procedure. The algorithm is described in Figure 5. The procedure can be interpreted as calling a modification of the importance sampler with a dynamic ordering ≺i which, at every step, moves Yi to the end of the global ordering ≺. 1201 S ILVA AND G HAHRAMANI 3.4 Remarks The importance sampler suffers from the usual shortcomings in high-dimensional problems, where a few very large weights dominate the procedure (MacKay, 1998). This can result in unstable estimates of functionals of the posterior and the normalizing constant. The stability of the importance sampler is not a simple function of the number of variables in the domain. For large but sparse graphs, the number of parameters might be small. For large but fairly dense graphs, the importance distribution might be a good match to the actual distribution since there are few constraints. In Section 7, we performe some experiments to evaluate the sampler. When used to compute functionals, the Gibbs sampler is more computationally demanding considering the cost per step, but we expect it to be more robust in high-dimensional problems. In problems that require repeated calculations of functionals (such as the variational optimization procedure of Section 4.3), it might be interesting to run a few preliminary comparisons between the estimates of the two samplers, and choose the (cheaper) importance sampler if the estimates are reasonably close. Na¨vely, the Gibbs sampler costs O(m4 ) per iteration, since for each step we have to invert the ı matrix Σnsp{\i},nsp{\i} , which is of size O(m) for sparse graphs. However, this inversion can cost much less than O(m3 ) if sparse matrix inversion methods are used. 
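For concreteness, the sketch below follows the steps of SAMPLE GIW-1 in Figure 4, using the expressions for A_i, m_i, K_i and U_i given above. It is an illustrative implementation rather than the one used in the experiments: the function name, the `spouses` adjacency structure and the handling of empty spouse or non-spouse sets via NumPy's empty-array arithmetic are illustrative choices, inverse gamma draws are obtained by inverting gamma draws, and the importance weight is returned on the log scale to avoid overflow. Averaging exp(log_w) over repeated calls gives the Monte Carlo estimate of I_G(δ, U) in (15); the Gibbs variant of Figure 5 has the same structure, but conditions on all remaining variables and carries no weight.

```python
import numpy as np
from scipy.special import gammaln

def sample_giw_importance(U, delta, spouses, rng=None):
    """Draw one importance sample (log_weight, Sigma) for a G-IW(delta, U)
    distribution, following the steps of SAMPLE GIW-1 (Figure 4). spouses[i]
    is the set of nodes joined to node i by a bi-directed edge; nodes are
    0-indexed and assumed already listed in the chosen ordering."""
    rng = np.random.default_rng() if rng is None else rng
    m = U.shape[0]
    Sigma = np.zeros((m, m))
    # First node: gamma_1 ~ IG(delta/2, u_11/2); its weight factor is f'_1.
    Sigma[0, 0] = 1.0 / rng.gamma(shape=delta / 2.0, scale=2.0 / U[0, 0])
    log_w = gammaln(delta / 2.0) - (delta / 2.0) * np.log(U[0, 0] / 2.0)
    for i in range(1, m):
        prev = np.arange(i)
        sp = np.array([k for k in prev if k in spouses[i]], dtype=int)
        nsp = np.array([k for k in prev if k not in spouses[i]], dtype=int)
        U_prev = U[np.ix_(prev, prev)]
        M = np.linalg.solve(U_prev, U[prev, i])            # M_i
        u_cond = U[i, i] - U[i, prev] @ M                  # u_{ii.{i-1},{i-1}}
        S_nn = Sigma[np.ix_(nsp, nsp)]
        A = (Sigma[np.ix_(sp, nsp)] @ np.linalg.inv(S_nn)
             if len(nsp) > 0 else np.zeros((len(sp), 0)))  # A_i in (13)
        Uss, Usn = U[np.ix_(sp, sp)], U[np.ix_(sp, nsp)]
        Uns, Unn = U[np.ix_(nsp, sp)], U[np.ix_(nsp, nsp)]
        m_i = (Uss - A @ Uns) @ M[sp] + (Usn - A @ Unn) @ M[nsp]
        K = np.linalg.inv(Uss - A @ Uns - Usn @ A.T + A @ Unn @ A.T) \
            if len(sp) > 0 else np.zeros((0, 0))
        U_i = (M @ U_prev @ M - m_i @ K @ m_i) if len(nsp) > 0 else 0.0
        a = (delta + i + len(nsp)) / 2.0                   # i preceding nodes
        b = (u_cond + U_i) / 2.0
        gamma_i = 1.0 / rng.gamma(shape=a, scale=1.0 / b)  # gamma_i ~ IG(a, b)
        B = np.zeros(i)
        if len(sp) > 0:
            B[sp] = rng.multivariate_normal(K @ m_i, gamma_i * K)
            B[nsp] = -B[sp] @ A                            # Equation (7)
        row = B @ Sigma[np.ix_(prev, prev)]
        Sigma[i, prev] = Sigma[prev, i] = row
        Sigma[i, i] = gamma_i + row @ B
        # log of f'_i / |Sigma_{nsp,nsp}| as in Theorem 3.
        log_det_K = np.linalg.slogdet(K)[1] if len(sp) > 0 else 0.0
        log_det_Snn = np.linalg.slogdet(S_nn)[1] if len(nsp) > 0 else 0.0
        log_w += (len(sp) / 2.0) * np.log(2.0 * np.pi) + 0.5 * log_det_K \
                 + gammaln(a) - a * np.log(b) - log_det_Snn
    return log_w, Sigma
```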
Still, the importance sampler can be even more optimized by using the methods of Section 6. 4. Gaussian Directed Mixed Graph Models As discussed in Section 2, Gaussian directed mixed graph models are parameterized by the set with parameters Θ = {V, B, α}. Our prior takes the form p(Θ) = p(B)p(α)p(V). We assign priors for the parameters of directed edges (non-zero entries of matrix B) in a standard way: each parameter bi j is given a Gaussian N(cBj , sBj ) prior, where all parameters are marginally independent in the i i prior, that is, p(B) = ∏i j p(bi j ). The prior for intercept parameters α is analogous, with αi being a Gaussian N(cα , sα ). i i Recall from Equation (1) that the implied covariance of the model is given by the matrix Σ(Θ) = (I − B)−1 V(I − B)−T . Similarly, we have the implied mean vector µ(Θ) ≡ (I − B)−1 α. The likelihood function for data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined as L (Θ; D ) = |Σ(Θ)|−n/2 ∏n exp − 1 (Y(d) − µ(Θ))T Σ(Θ)−1 (Y(d) − µ(Θ)) d=1 2 = |(I − B)−1 ||V||(I − B)−T | −n/2 exp − 1 tr(V−1 (I − B)S(I − B)T ) 2 , where now S ≡ ∑n (Y(d) − µ(Θ))(Y(d) − µ(Θ))T . d=1 Given a prior G -IW (δ, U) for V, it immediately follows that the posterior distribution of V given the data and other parameters is V | {B, α, D } ∼ G -IW (δ + n, U + (I − B)S(I − B)T ). Therefore it can be sampled using the results from the previous section. Notice this holds even if the directed mixed graph G is cyclic. Sampling αi given {D , Θ\{αi }} can also be done easily for both cyclic and acyclic models: the ′ ′ ′ posterior is given by a normal N(cα /sα , 1/sα ) where i i i 1202 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS sα i cα i ′ ≡ ′ ≡ 1 + n(V−1 )ii , sα i n m m cα (d) (d) −1 −1 i , − ∑ bt pt Ypt α − n ∑ (V )it αt + ∑ ∑ (V )it Yt si pt d=1 t=1 t=1,t=i with pt being an index running over the parents of Yt in G . However, sampling the non-zero entries of B results in two different cases depending whether G is cyclic or not. We deal with them separately. 4.1 Sampling from the Posterior: Acyclic Case The acyclic case is simplified by the fact that I − B can be rearranged in a way it becomes lower triangular, with each diagonal element being 1. This implies the identity |(I−B)−1 ||V||(I−B)−T | = |V|, with the resulting log-likelihood being a quadratic function of the non-zero elements of B. Since the prior for coefficient bi j is Gaussian, its posterior given the data and all other parameters will be ′ ′ ′ the Gaussian N(cbj /sbj , 1/sbj ) where i i i ′ sbj i ′ cbj i ≡ ≡ n 1 (d) + (V−1 )ii ∑ (Y j )2 , b si j d=1 m n cbj i (d) (d) (d) + ∑ Y j ∑ (V−1 )it Yt − ∑ bt pt Ypt − αt . b si j d=1 t=1 pt ,(t,pt )=(i, j) (16) As before, pt runs over the indices of the parents of Yt in G . Notice that in the innermost (d) summation we exclude bi jY j . We can then sample bi j accordingly. It is important to notice that, in practice, better mixing behavior can be obtained by sampling the coefficients (and intercepts) jointly. The joint distribution is Gaussian and can be obtained in a way similar to the above derivation. The derivation of the componentwise conditionals is nevertheless useful in the algorithm for cyclic networks. 4.2 Sampling from the Posterior: Cyclic Case Cyclic directed graph models have an interpretation in terms of causal systems in equilibrium. The simultaneous presence of directed paths Yi → · · · → Y j and Y j → · · · → Yi can be used to parameterize instantaneous causal effects in a feedback loop (Spirtes, 1995). 
This model appears also in the structural equation modeling literature (Bollen, 1989). In terms of cyclic graphs as families of conditional independence constraints, methods for reading off constraints in linear systems also exist (Spirtes et al., 2000). The computational difficulty in the cyclic case is that the determinant |I − B| is no longer a constant, but a multilinear function of coefficients {bi j }. Because bi j will appear outside the exponential term, its posterior will no longer be Gaussian. From the definition of the implied covariance matrix Σ(Θ), it follows that |Σ(Θ)|−n/2 = (|I − B||V|−1 |I − B|)n/2 . As a function of coefficient bi j , |I − B| = (−1)i+ j+1Ci j bi j + k=m ∑ k=1,k= j 1203 (−1)i+k+1Cik bik , S ILVA AND G HAHRAMANI where Ci j is the determinant of respective co-factor of I − B, bik ≡ 0 if there is no edge Yi ← Yk , and bii ≡ −1. The resulting density function of bi j given D and Θ\{bi j } is ′ p(bi j |Θ\{bi j }, D ) ∝ |bi j − κi j | exp − n ′ (bi j − cbj /sbj )2 i i ′ 2sbj i , where κi j ≡ Ci−1 j k=m ∑ (−1)k− j+1Cik bik k=1,k= j ′ ′ and {cbj , sbj } are defined as in Equation (16). Standard algorithms such as Metropolis-Hastings can i i be applied to sample from this posterior within a Gibbs procedure. 4.3 Marginal Likelihood: A Variational Monte Carlo Approach While model selection of bi-directed graphs can be performed using a simple Monte Carlo procedure as seen in the previous Section, the same is not true in the full Gaussian DMG case. Approaches such as nested sampling (Skilling, 2006) can in principle be adapted to deal with the full case. For problems where there are many possible candidates to be evaluated, such a computationally demanding sampling procedure might be undesirable (at least for an initial ranking of graphical structures). As an alternative, we describe an approximation procedure for the marginal likelihood p(D |G ) by combining variational bounds (Jordan et al., 1998) with the G -Inverse Wishart samplers, and therefore avoiding a Markov chain over the joint model of coefficients and error covariances. This is described for acyclic DMGs only. We adopt the following approximation in our variational approach, accounting also for possible latent variables X: n p(V, B, α, X|D ) ≈ q(V)q(B, α) ∏ q(X(d) ) ≡ q(V)q(B, α)q(X) d=1 with q(B, α) being a multivariate Gaussian density of the non-zero elements of B and α. Function q(X(d) ) is also a Gaussian density, and function q(V) is a G -Inverse Wishart density. From Jensen’s inequality, we obtain the following lower-bound (Beal, 2003, p. 47): ln p(D |G ) = ln p(Y, X|V, B, α)p(V, B, α) dX dB dV dα ≥ ln p(Y, X|V, B, α) q(V)q(B,α)q(X) + ln p(V)/q(V) q(V) + ln p(B, α)/q(B, α) q(B,α) − ln q(X) q(X) R (17) where this lower bound can be optimized with respect to functions q(V), q(B), q(X). This can be done by iterative coordinate ascent, maximizing the bound with respect to a single q(·) function at a time. The update of q(V) is given by qnew (V) = pG -IW (δ + d, U + (I − B)S(I − B)T 1204 q(X)q(B,α) ) BAYESIAN L EARNING WITH M IXED G RAPH M ODELS where pG -IW (·) is the density function for a G -Inverse Wishart, and S is the empirical second moment matrix summed over the completed data set (X, Y) (hence the expectation over q(X)) centered at µ(Θ). The updates for q(B, α) and q(X) are tedious but straightforward derivations, and described in Appendix B. The relevant fact about these updates is that they are functions of V−1 q(V) . 
Fortunately, we pay a relatively small cost to obtain these inverses using the Monte Carlo sampler of Figure 4: from the Bartlett parameters, define a lower triangular m × m matrix B (by placing on the ith line the row vector Bi , followed by zeroes) and a diagonal matrix Γ from the respective vector of γi ’s. The matrix V−1 can be computed from (I − B )T Γ−1 (I − B ), and the relevant expectation computed according to the importance sampling procedure. For problems of moderate dimensionality,7 the importance sampler might not be recommended, but the Gibbs sampler can be used. At the last iteration of the variational maximization, the (importance or posterior) samples from q(V) can then be used to compute the required averages in (17), obtaining a bound on the marginal log-likelihood of the model. Notice that the expectation ln p(V)/q(V) q(V) contains the entropy of q(V), which will require the computation of G -inverse Wishart normalizing constants. For large problems, the cost of this approximation might still be prohibitive. An option is to partially parameterize V in terms of ancillary latents and another submatrix distributed as a G -inverse Wishart, but details on how to best do this partition are left as future work (this approximation will be worse but less computationally expensive if ancillary latents are independent of the coefficient parameters in the variational density function q(·)). Laplace approximations might be an alternative, which have been successfully applied to undirected non-decomposable models (Roverato, 2002). We emphasize that the results present in this section are alternatives that did not exist before in previous approaches for learning mixed graph structures through variational methods (e.g., Silva and Scheines, 2006). It is true that the variational approximation for marginal likelihoods will tend to underfit the data, that is, generate models simpler than the true model in simulations. Despite the bias introduced by the method, this is less of a problem for large data sets (Beal and Ghahramani, 2006) and the method has been shown to be useful in model selection applications (Silva and Scheines, 2006), being consistently better than standard scores such as BIC when hidden variables are present (Beal and Ghahramani, 2006). An application in prediction using the variational posterior instead of MCMC samples is discussed by Silva and Ghahramani (2006). It is relevant to explore other approaches for marginal likelihood evaluation of DMG models using alternative methods such as annealed importance sampling (Neal, 2001) and nested sampling (Skilling, 2006), but it is unrealistic to expect that such methods can be used to evaluate a large number of candidate models. A pre-selection by approximations such as variational methods might be essential. 5. Discrete Models: The Probit Case Constructing a discrete mixed graph parameterization is not as easy as in the Gaussian case. Advances in this area are described by Drton and Richardson (2008a), where a complete parameterization of binary bi-directed graph models is given. In our Bayesian context, inference with the mixed graph discrete models of Drton and Richardson would not to be any computationally easier than the case for Markov random fields, which has been labeled as doubly-intractable (Murray et al., 2006). 7. We observed a high ratio of the highest importance weight divided by the median weight in problems with dimensionality as low as 15 nodes. 
However, notice that in practice the error covariance matrix V has a block diagonal structure, and only the size of the largest block is relevant. This is explained in more detail in Section 6. 1205 S ILVA AND G HAHRAMANI Instead, in this paper we will focus on a class of discrete models that has been widely used in practice: the probit model (Bartholomew and Knott, 1999). This model is essentially a projection of a Gaussian distribution into a discrete space. It also allows us to build on the machinery developed in the previous sections. We will describe the parameterization of the model for acyclic DMGs, and then proceed to describe algorithms for sampling from the posterior distribution. 5.1 Parameterizing Models of Observable Independencies A probit model for the conditional probability of discrete variable Yi given a set of variables {Yi1 , ..., Yik } can be described by the two following relationships: P (Yi = vi l Yi⋆ = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi | Yi⋆ ) = 1(τi ≤ Yi⋆ < τi ) l−1 l (18) where P (·) is the probability mass function of a given random variable, as given by the context, and 1(·) is the indicator function. Yi assumes values in {vi , vi , . . . , vi }. Thresholds {τi = −∞ < τi < 1 1 0 1 κ(i) τi < · · · < τi = ∞} are used to define the mapping from continuous Yi⋆ to discrete Yi . This model 2 κ(i) has a sensible interpretation for ordinal and binary values as the discretization of some underlying latent variable (UV) Yi⋆ . Such a UV is a conditionally Gaussian random variable, which follows by assuming normality of the error term εi . This formulation, however, is not appropriate for general discrete variables, which are out of the scope of this paper. Albert and Chib (1993) describe alternative Bayesian treatments of discrete distributions not discussed here. Given this binary/ordinal regression formulation, the natural step is how to define a graphical model accordingly. As a matter of fact, the common practice does not strictly follow the probit regression model. Consider the following example: for a given graph G , a respective graphical representation of a probit model can be built by first replicating G as a graph G ⋆ , where each vertex Yi is relabeled as Yi⋆ . Those vertices represent continuous underlying latent variables (UVs). To each vertex Yi⋆ in G ⋆ , we then add a single child Yi . We call this the Type-I UV model. Although there are arguments for this approach (see, for instance, the arguments by Webb and Forster (2006) concerning stability to ordinal encoding), this is a violation of the original modeling assumption as embodied by G : if the given graph is a statement of conditional independence constraints, it is expected that such independencies will be present in the actual model. The Type-I formulation does not fulfill this basic premise: by construction there are no conditional independence constraints among the set of variables Y (the marginal independencies are preserved, though). This is illustrated by Figure 6(b), where the conditional independence of Y1 and Y3 given Y2 disappears. An alternative is illustrated in Figure 6(c). Starting from the original graph G (as in Figure 6(a)), the probit graph model G ⋆ shown in the Figure is built from G by the following algorithm: 1. add to empty graph G ⋆ the vertices Y of G , and for each Yi ∈ Y, add a respective UV Yi⋆ and the edge Yi⋆ → Yi ; 2. for each Yi → Y j in G , add edge Yi → Y j⋆ to G ⋆ ; 3. 
for each Yi ↔ Y j in G , add edge Yi⋆ ↔ Y j⋆ to G ⋆ ; We call this the Type-II UV model, which has the following property (the proof is in Appendix C): 1206 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y* 1 Y1 Y2 (a) Y2 Y* 2 Y3 Y* 4 Y4 Y1 Y3 Y* 1 Y* 3 (b) Y2 Y* 2 Y* 4 Y4 Y1 Y3 Y* 3 Y4 (c) Figure 6: The model in (a) has at least two main representations as a probit network. In (b), the original structure is given to the underlying variables, with observed variables being children of their respective latents. In (c), the underlying variable inherits the parents of the original variable and the underlying latents of the spouses. Theorem 4 Suppose G is acyclic with vertex set Y. Yi and Y j are m-separated given Z ⊆ Y\{Yi ,Y j } in G if and only if Yi and Y j are m-separated given Z in G ⋆ . The parameterization of the Type-II UV model follows from the definition of probit regression: the conditional distribution Yi given its parents in {Yi1 , ...,Yik } in G is given as in Equation (18), while the error terms {ε1 , ε2 , . . . , εm } follow the multivariate Gaussian N(0, V). The entry corresponding to the covariance of εi and ε j is assumed to be zero if there is no bi-directed edge Yi ↔ Y j in G . In what follows, we discuss algorithms for Type-II models. The approach here described can be easily adapted to cover Type-I models. We say that Type-II models are models of observable independencies, since independencies hold even after marginalizing all UVs. 5.2 Algorithm As before, we provide a Gibbs sampling scheme to sample parameters Θ = {α, B, V, T } from the posterior distribution given data set D = {Y(1) , Y(2) , . . . , Y(n) }. The set T = {Ti } is the set of threshold parameters, Ti = {τi = −∞ < τi < τi < · · · < τi = ∞} for each random variable Yi with 2 1 0 κ(i) κ(i) different values. We will not discuss priors and algorithms for sampling T given the other parameters: this can be done by standard approaches (e.g., Albert and Chib, 1993).8 For the purposes of the Gibbs procedure, we augment the data set with the underlying variables ⋆ = {Y⋆(1) , Y⋆(2) , . . . , Y⋆(n) } at each sampling step. D From the set of structural equations Y⋆(d) = α + BY(d) + ε it follows that the conditional distribution of Y⋆(d) given the D ∪ Θ is a truncated Gaussian with mean α + BY(d) and covariance matrix V. The truncation levels are given by the thresholds and (d) ⋆(d) observed data Y(d) : for each Yi = vi , the range for Yi becomes [τi , τi ). Sampling from a l l−1 l truncated Gaussian is a standard procedure. We used the algorithm of Kotecha and Djuric (1999) in our implementation. To sample V from its conditional, we will rely on the following result. 8. In Section 7, we perform experiments with binary data only. In this case, the thresholds are set to fixed values: {τi = −∞, τi = 0, τi = ∞} for all 0 ≤ i ≤ m. 0 1 2 1207 S ILVA AND G HAHRAMANI Proposition 5 Let G be an acyclic DMG, and (α, B, V, T ) be the respective set of parameters that defines the probit model. For a fixed (α, B, T ), there is a bijective function fBαT (·) mapping Y⋆ to ε. This is not true in general if G is cyclic. Proof: If the graph is acyclic, this follows directly by recursively solving the model equations, starting from those corresponding to Y j⋆ vertices with no parents. This results in ε = Y⋆ − α − BY, as expected. For cyclic graphs, the following model provides a counter-example. Let the graph be Y1⋆ → Y1 → Y2⋆ → Y2 → Y1⋆ . Let the model be Y1⋆ = Y2 + ε1 ,Y2⋆ = Y1 + ε2 , that is, b12 = b21 = 1 and α = 0. 
Let the variables be binary, with a threshold at zero (Yi = 1 if and only if Yi⋆ ≥ 0). Then the two instantiations (Y1⋆ = −0.8,Y2⋆ = 0), (Y1⋆ = 0.2,Y2⋆ = 1) imply the same pair (ε1 = −0.8, ε2 = 0). The negative result for discrete models with cycles is the reason why such models are out of the scope of the paper. ⋆ Let Dε = {ε(1) , . . . , ε(n) }, where ε(d) = fBαT (y(d)⋆ ). Due to this bijection (and the determinism mapping Y⋆ to Y), the density p(V | Θ\V, D , D ⋆ ) = p(V | Θ\V, D ⋆ ) = p(V | Θ\V, y(1)⋆ , . . . , y(d)⋆ ) is equivalent to p(V | Θ\V, D ⋆ ) = = ∝ ∝ ⋆ p(V | α, B, T , D ⋆ , Dε ) ⋆) p(V | α, B, T , Dε ⋆ p(V | α, B, T )p(Dε | α, B, T , V) n (d) | V). p(V) ∏d=1 p(ε For the given data set D ∪ D ⋆ , define S⋆ as the sum of (Y⋆(d) − α − BY(d) )(Y⋆(d) − α − BY(d) )T over all d ∈ {1, 2, . . . , n}. Since p(ε | V) is normal with zero mean and covariance matrix V, the posterior for V given all other parameters and variables is V | {Θ\V, D , D ⋆ } ∼ G -IW (δ + n, U + S⋆ ). Sampling B and α is analogous to the Gaussian case, except that we have to consider that the left-hand side of the structural equations now refer to Y⋆ . We give the explicit conditional for αi , with the conditional for bi j being similarly adapted from Section 4. The posterior for αi is given by a normal N((s′ )−1 m′ , s′ ) where i i i sα i ′ cα i ′ = = 1 + n(V−1 )ii , sα i m n m cα (d) ⋆(d) i − n ∑ (V−1 )it αt + ∑ ∑ (V−1 )it Yt . − ∑ bt pt Ypt sα pt i d=1 t=1 t=1,t=i 5.3 A Note on Identifiability The scale of the underlying latent variables in the probit model is arbitrary. As such, it has been often suggested that such latents should have constant (e.g., unity) variance (Pitt et al., 2006). There are two usual arguments for fixing the variance: improving the interpretability of the model, and improving the mixing of the Markov chain. The interpretability argument is not particularly appealing within the Bayesian setting with proper priors, such as the one proposed in this paper: the posterior distribution of the parameters is well-defined by the prior uncertainty and the data. 1208 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The goal of improving the mixing of the chain might be important: if some parameters can assume arbitrary values and still allow for the same model over the observables, then fixing such parameters may help sampling by eliminating largely flat regions from the posterior (which will happen for large data sets and broad priors). In practice, however, scaling UVs might not be advantageous. In some cases it might increase the computational cost of each sampling step, while sampling from the non-scaled model might work just fine. Many MCMC algorithms work well on highly unidentifiable models such as multilayer perceptrons (Neal, 1996). In our experiments, we do not use any scaling. 5.4 Remarks It is clear that the given approach can be generalized to other generalized linear models by changing the link function that maps underlying latent variables (UVs) to observables. For instance, a model containing discrete and continuous variables can be constructed by using the identity link function instead of probit for the continuous variables. Notice that the continuous variables will not necessarily be marginally Gaussian if some of its parents are discrete. Other link functions will have different parameters besides thresholds, such as in multivalued (“polychotomous”) discrete distributions. A Bayesian account of Gaussian copula models is given by Pitt et al. 
(2006), to which a DMG-based family could in principle be defined. For continuous, marginally non-Gaussian, variables joined by a Gaussian copula, it is possible that all link functions are invertible. In this case, it is easier in principle to define cyclic models through Type-I UV models (e.g., Figure 6(b)) while preserving the observable independencies. It is important to point out that Type-II probit models with Markov equivalent graphs will not, in general, be likelihood equivalent. A simple example is given by the two-node graphs Y1 → Y2 and Y1 ↔ Y2 : if Y1 is binary, then the marginal for Y2 in the first case is equivalent to having an underlying latent variable that follows a mixture of two Gaussians. While some of these issues can be solved by adopting a mixture of Gaussians marginal independence model to account for bi-directed edges (Silva and Ghahramani, 2009), details need to be worked out. When the goal of model selection is to find causal structures (Spirtes et al., 2000), the usual guarantees of search methods based on Markov equivalence classes do not hold. However, it remains to be seen whether the parametric constraints implied by the Type-II formulation will allow for other consistent approaches for causal discovery, as shown in the case of non-linearities with additive noise (Hoyer et al., 2008). 6. Scaling Up: Factorizations and Perfect Sequences Each Monte Carlo sampling step for the given mixed graph models is theoretically tractable, but not necessarily practical when the dimensionality m of the data is high. By using clever factorizations of the graph and ordering of the variables, it is possible to sometimes scale to high-dimensional problems. In this section, we describe approaches to minimize the run-time of the marginal likelihood computation for bi-directed graphs, which is also important for computing variational bounds for DMG models. We start, however, with a discussion on factorizations of the posterior density for coefficient parameters B. The context is the Gibbs sampler for acyclic models. 1209 S ILVA AND G HAHRAMANI Y2 Y1 Y3 b31 Y3 Y1 Y2 Y3 Y4 Y5 Y1 b32 Y4 b54 Y5 (a) b21 Y4 b32 b43 Y6 Y2 b53 (b) (c) Figure 7: The coefficients b31 and b32 , represented as nodes in (a), become dependent after conditioning on Y. However, they are still independent of b54 . This a general property of DAG models. In DMG models, a sequence of bi-directed edges will connect extra coefficients. In graph (b), coefficients b21 , b32 and b43 will all be dependent given Y. Coefficients into nodes in different districts will still be independent. The graph in (c) has districts {Y1 ,Y2 ,Y3 ,Y4 } and {Y5 ,Y6 }. 6.1 Factorizations Our prior for coefficients {bi j } is fully factorized. In directed acyclic graphs, this is particularly advantageous: coefficients corresponding to edges into different nodes are independent in the posterior.9 One can then jointly sample a whole set of {bi j } coefficients with same i index, with no concern for the other coefficients. Figure 7(a) illustrates this factorization. This means that, in Equation (16), the summation over t does not go over all variables, but only for t = i. This also follows from the fact that (V)−1 = 0 unless i = t, since V is diagonal. it In ADMGs, however, this is not true anymore. For any pair of vertices linked by a path of bi-directed edges, for example, Yi ↔ Yi+1 ↔ · · · ↔ Yt , one will have in general that (V)−1 = 0. 
This it can be shown by using the graphical properties of the model when conditioning on some arbitrary datapoint Y: Proposition 6 Let G be an acyclic DMG with vertex set Y, and G ′ the DMG obtained by augmenting G with a vertex for each parameter bi j and a respective edge bi j → Yi . Then if there is a bi-directed path Yi ↔ · · · ↔ Yt in G , {bi j , btv } are not m-separated given Y in G ′ . Proof: The joint model for {Y, B} with independent priors on the non-zero entries of B is Markov with respect to G ′ . The sequence of bi-directed edges between Yi and Yt implies a path between bi j and btv where every vertex but the endpoints is a collider in this path. Since every collider is in Y, this path is active. This Proposition is illustrated by Figure 7(b). The practical implication is as follows: mconnection means that there is no further graphical property that would entail (V)−1 = 0 (i.e., only it particular cancellations on the expression of the inverse, unlikely to happen in practice, would happen to generate such zeroes). 9. Sampling in Gaussian DAG models is still necessary if the model includes latent variables (Dunson et al., 2005). 1210 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Consider the maximal sets of vertices in an ADMG such that each pair of elements in this set is connected by a path of bi-directed edges. Following Richardson (2003), we call this a district.10 It follows that is not possible in general to factorize the posterior of B beyond the set of districts of G . Figure 7(c) illustrates a factorization. Fortunately, for many DMG models with both directed and bi-directed edges found in practical applications (e.g., Bollen, 1989), the maximum district size tends to be considerably smaller than the dimensionality of the problem. 6.2 Perfect Sequences It is still important to speed up marginal likelihood (or variational bound) computations for models with districts of moderate size, particularly if many models are to be evaluated. Without loss of generality, assume our graph G is a bi-directed graph with a single district, since the problem can be trivially separated into the disjoint bi-directed components. We will consider the case where the bi-directed graph is sparse: otherwise there is little to be gained by exploring the graphical structure. In that case, we will assume that the largest number of spouses of any node in G is bounded by a constant κ that is independent of the total number of nodes, m. The goal is to derive algorithms that are of lower complexity in m than the original algorithms. The bottleneck of our procedure is the computation of the Σ−1 ≺ (i),nsp≺ (i) matrices, required nsp in the mapping between independent and dependent Bartlett parameters (Equation 7), as well as computing the determinants |Σnsp≺ (i),nsp≺ (i) |. Since in sparse districts nsp≺ (i) grows linearly with m, the cost of a na¨ve algorithm for a single sampling step is O(m3 ) per node. Iterating over all nodes ı implies a cost of O(m4 ) for a Monte Carlo sweep. Therefore, our goal is to find a procedure by which such mappings can be computed in less than O(m3 ) time. The general framework is reusing previous inverses and determinants instead of performing full matrix inversion and determinant calculation for each Yi . The difficulty on applying low-rank updates when we traverse the covariance matrix according to ≺ is that the sets of non-spouses nsp≺ (i) and nsp≺ (i + 1) might differ arbitrarily. 
We want sensible orderings where such sets vary slowly and allow for efficient low-rank updates, if any. The foundation of many scaling-up procedures for graphical models is the graph decomposition by clique separators (Tarjan, 1985), usually defined for undirected graphs. The definition for bi-directed graphs is analogous. Such a decomposition identifies overlapping prime subgraphs {GP(1) , GP(2) , . . . , GP(k) } of the original graph G . A prime graph is a graph that cannot be partitioned into a triple (Y′ , S, Y′′ ) of non-empty sets such that S is a complete separator (i.e., S is a clique and removing S disconnects the graph). Notice that a clique is also a prime subgraph. The prime components of a graph can be ordered in a perfect sequence {YP(1) , . . . , YP(k) } of subsets of Y (Roverato, 2002; Lauritzen, 1996). Define H j ≡ YP(1) ∪ · · · ∪ YP( j) as the history of the perfect sequence up to the j-th subgraph. Let R j ≡ YP( j) \H j−1 be the residual of this history (with R1 ≡ YP(1) ), and S j ≡ H j−1 ∩ YP( j) the separator. In a perfect sequence, the triple (H j−1 \S j , S j , R j ) forms a decomposition of the subgraph of G induced by the vertex set H j . Surprisingly, although bi-directed and undirected graph models have very different Markov properties (in undirected models, conditioning removes dependencies; in bi-directed models, it adds dependencies), perfect prime graph sequences prove to be also useful, but in an entirely different 10. Kang and Tian (2005) call such structures c-components and reserve the word “district” to refer to the function mapping a vertex to its respective c-component, as originally introduced by Richardson (2003). We choose to overload the word and call “district” both the structure and the mapping. 1211 S ILVA AND G HAHRAMANI Y7 Y3 Y1 Y5 Y2 Y6 V1 = {Y1 ,Y2 ,Y3 } V2 = {Y4 ,Y5 } V3 = {Y6 } Y4 Figure 8: On the left, we have a bi-directed graph of 7 vertices arranged and ordered such that nodes are numbered by a depth-first numbering starting from “root” Y7 , with {Y1 ,Y2 ,Y4 ,Y6 } 3 being leaves. Vertices {Y1 ,Y2 , . . . ,Y6 } can be partitioned as the union ∪t=1 Vt , as illustrated on the right. way. The next subsection describes the use of prime graph decompositions in a particularly interesting class of bi-directed graphs: the decomposable case. The general case is treated in the sequel. 6.2.1 D ECOMPOSABLE M ODELS In a recursively decomposable graph, all prime subgraphs are cliques. We will assume that any perfect sequence in this case contains all and only the (maximal) cliques of the graph. The resulting decomposition can be interpreted as a hypergraph where nodes are the maximal cliques of the original graph, and edges correspond to the separators. In the statistics literature, a decomposable model is defined as a model that is Markov with respect to a recursively decomposable undirected graph (Lauritzen, 1996). Its widespread presence on applications of Markov random fields is due to nice computational properties, with tree-structured distributions being a particular case. Our definition of bi-directed decomposable models is analogous: a model Markov with respect to a recursively decomposable bi-directed graph. Given the residual sequence {R1 , R2 , . . . , Rk } obtained through a perfect sequence of maximal cliques of G , we define a perfect ordering ≺ by numbering nodes in Rt before nodes in R1 , . . . , Rt−1 , 1 < t ≤ k and ordering nodes according to this numbering.11 Any ordering that satisfies this restriction is a perfect ordering. 
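Constructing a perfect ordering from a perfect sequence is mechanical, as the following sketch illustrates (the function name is illustrative; vertices can be arbitrary hashable labels).

```python
def perfect_ordering(prime_components):
    """Derive a perfect ordering from a perfect sequence of prime components
    (maximal cliques, in the decomposable case).  Nodes appearing in later
    residuals R_t are numbered before those in earlier residuals; the order
    inside a residual is arbitrary."""
    history = set()
    residuals = []
    for component in prime_components:
        residuals.append([v for v in component if v not in history])
        history.update(component)
    return [v for residual in reversed(residuals) for v in residual]
```

For the tree of Figure 9, taking its maximal cliques to be {YA, YC}, {YB, YC} and {YC, YD} in that perfect sequence, the residuals are R1 = {YA, YC}, R2 = {YB}, R3 = {YD}, and the function returns the ordering (YD, YB, YA, YC); the ordering (YD, YB, YC, YA) quoted in the caption of Figure 9 differs only in the arbitrary order within R1.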
Such an ordering has the following property. Theorem 7 Let G be a recursively decomposable bi-directed graph such that the indexing of its vertices Y = {Y1 ,Y2 , . . . ,Ym } follows a perfect ordering ≺. Then for each 1 < i ≤ m, the set K(i) {Y1 ,Y2 , . . . ,Yi−1 } can be partitioned as ∪t=1 Vt such that: 1. each Vt induces a connected subgraph of G , and for each Yt ∈ Vt and Yt ′ ∈ Vt ′ , t = t ′ , Yt is not adjacent to Yt ′ in G ; 11. Lauritzen (1996) describes other uses of perfect sequences in undirected graphs. Notice that the notion of perfect numbering described by Lauritzen (1996) is not equivalent to our notion of perfect ordering, which is always derived from a perfect sequence of prime graphs. 1212 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 2. for each {Yp ,Yq } ⊆ Vt , if Yp is a spouse of Yi , and Yq is a non-spouse of Yi , then p > q; The proof is in Appendix C. This result is easier to visualize in trees. One can take as a perfect ordering some depth-first ordering for a given choice of root. Then for each vertex Yi , the set {Y1 ,Y2 , . . . ,Yi−1 } is partitioned according to the different branches “rooted” at Yi . The starting point of each branch is a spouse of Yi , and all other vertices are non-spouses of Yi . The ordering result then follows directly from the definition of depth-first traversal, as illustrated in Figure 8. Let Σ be the covariance matrix of a bi-directed decomposable model with graph G , where Σ follows a G -inverse Wishart distribution. Let ≺ be a perfect ordering for G . By the construction of Bartlett’s decomposition, mapping between parameters is given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) , nsp the computational bottleneck being the inversion. Notice this corresponds to the multiple regression coefficients of sp≺ (i) on nsp≺ (i). But according to Theorem 7, using a perfect ordering implies that within each Vs for a fixed Yi , all preceding non-spouses of Yi are ordered before the preceding spouses. Elements {Yp ,Yq } in different Vs are marginally independent given {Y1 , . . . ,Yi−1 }\{Yp ,Yq }. This implies that the regression coefficient of spouse Yp on non-spouse Yq will be zero if Yp and Yq are on different components Vs , and will be identical to the previously computed B p,q if they are in the same component. Splitting the set {Y1 ,Y2 , . . .Yi−1 } into preceding spouses Ysp≺ (i) and nonspouses Ynsp≺ (i) , we have Ysp≺ (i) = Bsp≺ (i),sp≺ (i) Ysp≺ (i) + Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ⇒ Ysp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 (Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ) where each ε j is an independent Gaussian with variance γ j , and each element (p, q) in Bsp≺ (i),nsp≺ (i) corresponds to the known (i.e., previously computed) regression coefficient of the spouse Yp on the non-spouse Yq . Matrix Bsp≺ (i),sp≺ (i) is defined analogously. Hence, the regression coefficients of Ysp≺ (i) on Ynsp≺ (i) are given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 Bsp≺ (i),nsp≺ (i) . nsp (19) No inversion of Σnsp≺ (i),nsp≺ (i) is ever necessary. Moreover, the determinant |Σnsp≺ (i),nsp≺ (i) | is given by ∏{q s.t. Yq ∈nsp≺ (i)} γq , since all non-spouses precede the spouses (which means their marginal covariance matrix is given by the previously computed Bartlett parameters). 
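In code, the computation prescribed by Equation (19) can be sketched as follows. Here B is the lower-triangular matrix holding the previously computed coefficient rows B_j (padded with zeros), gammas is the vector of conditional variances γ_j, and sp and nsp are the index arrays of the preceding spouses and non-spouses of the current node, each in increasing order; all names are illustrative. When nsp is empty the product over γ_q is 1, matching the convention of Lemma 2.

```python
import numpy as np
from scipy.linalg import solve_triangular

def regression_on_nonspouses(B, gammas, sp, nsp):
    """Equation (19): Sigma_{sp,nsp} Sigma_{nsp,nsp}^{-1} and |Sigma_{nsp,nsp}|
    from previously computed Bartlett parameters, with no inversion of
    Sigma_{nsp,nsp} required."""
    B_ss = B[np.ix_(sp, sp)]        # strictly lower triangular
    B_sn = B[np.ix_(sp, nsp)]
    # (I - B_ss) is unit lower triangular, so a single triangular solve suffices.
    coeffs = solve_triangular(np.eye(len(sp)) - B_ss, B_sn,
                              lower=True, unit_diagonal=True)
    det_nsp = np.prod(gammas[nsp])  # |Sigma_{nsp,nsp}| = product of gamma_q over nsp
    return coeffs, det_nsp
```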
Hence, calculating Bi,nsp≺ (i) for all 1 ≤ i ≤ m according to a perfect ordering has as a bottleneck the inversion (of a triangular matrix) and multiplication in Equation (19), with a cost of O(κ2 +mκ2 ), κ being the maximum number of spouses for any given node. The cost of the remaining operations for the i-th stage in the importance sampler is O(κ3 ). As a function of m, the cost of the parameter sampling step falls from O(m3 ) to O(m). The cost of computing the weights is dominated by the computation of Ki from Equation (13), which is O(κ3 + κm2 ) = O(m2 ). Figure 9 illustrates the derivation of the new ordering in a tree-structured model. 6.2.2 N ON - DECOMPOSABLE M ODELS In a non-decomposable model, some prime graphs YP(t) will no longer be cliques. In what follows, we once again assume that ≺ is a perfect ordering. Unlike in the decomposable case, the product 1213 S ILVA AND G HAHRAMANI YA YC YD YA YC YC YB YC YC YC YD YB (a) (b) Figure 9: The tree-structured (i.e., cycle-free) bi-directed graph in (a) has as maximal cliques the adjacent pairs. Such cliques can be ordered in a perfect sequence as shown in (b), where rectangles indicate the separators. Notice that R1 = {YA ,YC }, R2 = {YB }, R3 = {YD }. One possible perfect ordering is {YD ,YB ,YC ,YA }. Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) does not simplify in general. Instead we will focus only on fast methods nsp to compute Σ−1 ≺ (i),nsp≺ (i) . nsp As we shall see, the function of the perfect sequence is now to provide a sensible choice of which inverse submatrices {Σ−1 }, W ⊆ Y, to cache and reuse when computing Σ−1 ≺ (i),nsp≺ (i) . W,W nsp The same can be done to compute determinants |Σnsp≺ (i),nsp≺ (i) |. A simple way of reusing the results from the previous section is by triangulating the nondecomposable graph G , transforming it into a decomposable one, G ′ , which is then used to generate the perfect sequence. We need to distinguish between the “true” spouses of a node Yi in G and the artificial spouses in G ′ that result from the extra edges added. Let nsp≺G ′ (i) be the non-spouses of Yi in G ′ that precede it according to ≺: by construction, these are also non-spouses of Yi in G . Let sp∆≺G ′ (i) be the spouses of Yi in G ′ that are not spouses of Yi in G . That is, the set of preceding non-spouses of Yi in G is given by nsp≺ (i) = nsp≺G ′ (i) ∪ sp∆≺G ′ (i). Recall that the inverse of a partitioned matrix can be given by the following identity: A B C D −1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 −A−1 B(D − CA−1 B)−1 −(D − CA−1 B)−1 CA−1 (D − CA−1 B)−1 . (20) In order to compute Σ−1 ≺ (i),nsp≺ (i) , we consider its partitioned version nsp Σ−1 ≺ (i),nsp≺ (i) nsp = Σnsp≺G ′ (i),nsp≺G ′ (i) Σnsp≺G ′ (i),sp∆≺G ′ (i) Σsp∆≺G ′ (i),nsp≺G ′ (i) Σsp∆≺G ′ (i),sp∆≺G ′ (i) −1 . (21) Let κnsp be the maximum number of non-spouses among all Yi within any prime subgraph induced by YP(t) . By using relation (20), where we assume for now that we know A−1 ≡ Σ−1 ′ (i),nsp ′ (i) , the cost of computing (21) is O(m2 κnsp ) + O(κ3 ) = O(m2 κnsp ) (the cost of comnsp nsp ≺G ≺G puting D − CA−1 B is O(m2 κnsp ) + O(κ2 ) = O(m2 κnsp ), while the cost of inverting it is O(κ3 )). nsp nsp Treating κnsp as a constant, this reduces the complexity of sampling the i-th row of Σ from O(m3 ) to O(m2 ). A similar procedure applies to the computation of the determinant |Σnsp≺ (i),nsp≺ (i) |, using in this case the relationship (26). 
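The block-inversion step can be sketched directly from identity (20); the function below is illustrative and its name is not from the text. In the update (21), A is Σ_{nsp≺G′(i),nsp≺G′(i)} (whose inverse is the cached quantity), D is Σ_{sp∆≺G′(i),sp∆≺G′(i)}, and B = Cᵀ is the cross-covariance block, so only a Schur complement of size #sp∆≺G′(i) is inverted from scratch.

```python
import numpy as np

def partitioned_inverse(A_inv, B, C, D):
    """Inverse of the block matrix [[A, B], [C, D]] given A^{-1}, via (20)."""
    S = D - C @ A_inv @ B                        # Schur complement of A
    S_inv = np.linalg.inv(S)
    AinvB = A_inv @ B
    return np.block([[A_inv + AinvB @ S_inv @ C @ A_inv, -AinvB @ S_inv],
                     [-S_inv @ C @ A_inv,                 S_inv]])
```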
1214 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The advantage of using the perfect sequence is to allow for the computation of all A−1 ≡ at a total cost, across all nodes, of O(m3 ): each set nsp≺G ′ (i) is guaranteed to Σ−1 ′ (i),nsp ′ (i) nsp≺G ≺G be equal to {Y1 ,Y2 , . . . ,Ylns } where Ylns is the last non-spouse of Yi in G ′ that antecedes Yi . This follows from the result in the previous section, since all non-spouses of a node in a decomposable graph precede its spouses. Therefore, if we store the inverse covariance matrices for {Y1 ,Y2 , . . . ,Yi }, 1 ≤ i ≤ m, we can obtain the required matrices A−1 . This requires the storage of O(m) matrices, and each matrix can be obtained by the previous one by a low-rank update (20) with a O(m2 ) cost. Arbitrary orderings do not guarantee such an incremental pattern and, hence, no efficient lowrank updates. Notice that depending on the problem, many of such inverse matrices can be dynamically removed from memory if they are not used by any node placed after a particular position. 6.3 Remarks In Gaussian undirected models, the problem of covariance matrix sampling can also be reduced to sampling within each prime graph at the cost of O(|P |4 ), |P | being the size of the largest prime component (Atay-Kayis and Massam, 2005). Since both κ and κnsp are O(|P |), our procedure costs O(m2 |P |2 + |P |4 ) per prime graph, plus a cost of O(m2 ) per node to compute the importance weights. Considering a number of m/|P | prime graphs and |P | < m, the total cost is O(m3 |P |), down from O(m4 ). For undirected models, the corresponding cost by sampling step using the perfect ordering decomposition is O(m|P |3 ). The higher-order dependency on m in bi-directed models is to be expected, since the Markov blanket of any node Yt in a connected bi-directed graph is V\{Yt }. It is clear that inference with a given bi-directed graph model will never scale at the same rate of a undirected model with the same adjacencies, but this does not justify adopting an undirected representation if it is ill-suited to the problem at hand. One has also to consider that in problems with directed and bi-directed edges, the actual maximum district size might be much smaller than the number of variables. For large problems, however, further approximation schemes will be necessary. Drton and Richardson (2008b) describe some reduction techniques for transforming bi-directed edges into directed edges such that the resulting Gaussian model remains the same. As future work, such methods could be adapted to the G -inverse Wishart sampling procedures and combined with the ordering techniques developed here into a single framework. It will also be interesting to develop similar schemes for the Gibbs sampler. 7. Experiments We now evaluate the advantages of the Gaussian and probit models in Bayesian inference on real problems. 7.1 Industrialization and Democratization Study Bollen (1989) describes a structural equation model of political and democratization factors within nations. “Democratization” and “industrialization” levels are abstract notions, but nevertheless of clearly observable impact. They are tied to empirical observations through different sets of indicators. For instance, an indicator of industrialization level is the gross national product. Hence, democratization and industrialization levels are here defined as scalar latent variables never observed directly, while the observed data is composed of indicators. 
In this model, there is a total of three indicators of industrialization, and four indicators of democratization. Democratization is 1215 S ILVA AND G HAHRAMANI Y1 Y2 Y3 Industrialization 1960 Democratization 1960 Y4 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Y5 Y6 Democratization 1965 Y7 Y8 Y9 Y 10 Y 11 Gross national product (GNP) 1960 Energy consumption per capita 1960 Percentage of labor force in industry 1960 Freedom of press 1960 Freedom of opposition 1960 Fairness of elections 1960 Elective nature of legislative body 1960 Freedom of press 1965 Freedom of opposition 1965 Fairness of elections 1965 Elective nature of legislative body 1965 Figure 10: A directed mixed graph representing dependencies between 11 observed political and economical indicators and three latent concepts (shaded nodes) (Dunson et al., 2005; Bollen, 1989). measured in a longitudinal study, where data was collected in two years (1960 and 1965). The indicators of democratization are pooled expert opinions summarized in an ordinal number scaled from 1 to 10. Following Bollen, we will treat the model as multivariate Gaussian, which provides an excellent fit (a p-value greater than 0.3 using a chi-square test) for a sample of 75 countries. The corresponding mixed graph is depicted in Figure 10, along with a description of all indicators. The graph is taken from Bollen (1989). Other hidden common causes affect the democratization indicators over time, but the nature of such hidden variables is irrelevant to the problem at hand: that is, the bi-directed edges are motivated by unmeasured causes of variability in the observed indicators that exist over time. For instance, the records of freedom of press in 1960 (Y4 ) and 1965 (Y8 ) co-vary due to other unmeasured factors not accounted by democratization factors. 1216 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Factor scores: countries in the latent space 5 0 Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) Figure 11: An embedding of 75 countries in a two-dimensional latent space: democratization level in 1960 and 1965. Boxplots of the Bayesian posterior distribution of the projection in the two dimensions are depicted in the vertical axis. Countries are arranged in the horizontal axis by the increasing order of their posterior expected industrialization level. Figure adapted from Dunson et al. (2005). An example of Bayesian inference application is shown in Figure 11. Boxplots of the posterior values of Democratization Level 1960 and Democratization Level 1965 are generated. Dunson et al. (2005) use this information to, for instance, find clusters of countries in the latent space. An example of a cluster is the one formed by the bottom 16 countries in the industrialization level ranking: the growing trend of democratization levels after the first 16 countries is interrupted. This type of analysis might provide new insights to a polical scientist, for example, by revealing particular characteristics for such a group of nations. 7.1.1 E VALUATING THE MCMC A LGORITHM FOR D IFFERENT M ODELS In our analysis, we fix to unity the coefficients corresponding to the edges Industrialization 1960 → Y1 , Democratization 1960 → Y4 and Democratization 1965 → Y8 , since the scale and sign of the latent variables is arbitrary. The intercept terms of the equations for Y1 ,Y4 and Y8 are set to zero, since the mean of the latents is also arbitrary. The resulting model is identifiable. 
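One way such identification constraints might be encoded in an implementation is sketched below; the dictionary layout and names are purely illustrative.

```python
# Illustrative encoding of the identification constraints of Section 7.1.1:
# one loading per latent variable is fixed at 1 and the corresponding
# intercept at 0, pinning down the scale, sign and location of each latent.
fixed_loadings = {
    ("Industrialization1960", "Y1"): 1.0,
    ("Democratization1960", "Y4"): 1.0,
    ("Democratization1965", "Y8"): 1.0,
}
fixed_intercepts = {"Y1": 0.0, "Y4": 0.0, "Y8": 0.0}
# A Gibbs implementation would treat these entries as constants, resampling
# only the remaining free entries of B and alpha.
```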
We apply the Gibbs sampling procedure to three different models. The Gaussian DMG model as described in this paper, and two modified DAG models. The first DAG model is the one described by Dunson et al. (2005), where each bi-directed edge is substituted by an “ancillary” latent (as mentioned in Section 2.3). For instance, the pathway corresponding to Y4 ↔ Y8 is substituted by the chain Y4 ← D48 → Y8 , where D48 is unobserved. Dunson et al. further assume that all covariances due to such ancillary latents are positive. As such, the coefficients from Di j into {Yi ,Y j } are set 1217 S ILVA AND G HAHRAMANI Industrialization1960 → Democratization1965 Democratization1960 → Democratization1965 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 2 Density 1 0 0.0 0.0 0.5 1.0 Density 1.0 0.5 Density 3 1.5 1.5 4 2.0 Industrialization1960 → Democratization1960 -0.5 N = 50000 Bandwidth = 0.02317 0.0 0.5 1.0 1.5 2.0 2.5 -0.5 N = 50000 Bandwidth = 0.01972 (a) 0.0 0.5 1.0 1.5 2.0 2.5 N = 50000 Bandwidth = 0.009335 (b) (c) Figure 12: Posterior distribution of parameters associated with the respective edges in the industrialization/democratization domain. Smoothed posterior obtained using the output of our Gibbs sampler and the DENSITY function of R 2.6.0. Democratization1960 → Y5 0.8 5 0.6 0.4 Density 0.2 0.0 -1 0.0 0.0 0 0.5 0.5 1 1.0 2 1.0 1.5 3 2.0 4 2.5 2.0 1.5 Y7 ↔ Y11 Y7 ↔ Y11 3.0 Industrialization1960 → Democratization1960 0 1000 2000 3000 Iteration 4000 5000 0 1000 2000 3000 4000 5000 0 Iteration 1000 2000 3000 Iteration 4000 5000 -1 0 1 2 3 4 5 N = 50000 Bandwidth = 0.04786 Figure 13: The first three plots show the initial 5,000 iterations of a run of the Gibbs sampling algorithm for the DMG model for three different parameters associated with edges in the graph. The last plot depicts the posterior distribution the error covariance associated with the edge Y7 ↔ Y11 (smoothed with the kernel density estimator from the statistical software R). to unity, with the variance of Di j corresponding to the residual covariance of {Yi ,Y j } given their parents. Means of ancillary latents are fixed at zero. However, even for covariance matrices with positive covariances, this parameterization is not complete. This result is evident from the fact that the variances of Yi and Y j will both be larger than their covariance, which is not true of covariance matrices in general. For this particular problem, however, this extra restriction provides no measurable difference in terms of fitness. It does serve as a reminder, however, that “intuitive” parameterizations might hide undesirable constraints. The second DAG model is an extension of the DAG model suggested by Dunson et al., the only difference being that the coefficients corresponding to edges Di j → Yi , i < j, are free to vary (instead of being fixed to 1). In general, there are Gaussian DMG models that cannot be parameterized this way (Richardson and Spirtes, 2002). Notice also that because of chains such as Democratization 1960 → Y4 ↔ Y8 ← Democratization 1965, the set of independence constraints in this graph can only be represented by a DAG if we include the ancillary latents Di j . That is, there is no DAG with 1218 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 50000 MCMC comparison: DMG vs. unconstrained DAG 30000 0 10000 20000 Effective Sample Size 30000 20000 10000 0 Effective Sample Size DMG DAG 40000 DMG posDAG 40000 50000 MCMC comparison: DMG vs. 
Figure 14: Comparison of the effective sample size of the MCMC algorithm applied to the three models, DMG, DAG with positive covariances (posDAG) and general DAG, as explained in the main text. Two panels are shown: DMG vs. positive covariance DAG and DMG vs. unconstrained DAG. The horizontal axis indexes the independent entries of the observed covariance matrix, 66 in total, with one boxplot of effective sample sizes per entry. The boxplots are obtained from 80 independent chains initialized randomly, where each chain runs for 50,000 iterations.

We study the behavior of the MCMC algorithm for these three models.12 It turns out that the mixing properties of the chain are considerably affected by the choice of model. Recall that, in the Gibbs sampling algorithm for the DMG model, a whole row of the error covariance matrix is sampled jointly, conditioning on the other parameters. For the DAG models all entries of the error covariance matrix are independent and can be sampled jointly, but this requires conditioning on the ancillary latents, which do not exist in the DMG model and have to be sampled only in the DAG case.

For the majority of the covariance entries, the MCMC procedure mixed quite well, as illustrated in Figure 13. Notice how about 12% of the sampled DMG error covariances for Y7 ↔ Y11 were under zero, which could raise suspicion over the assumption of positive covariances. Autocorrelation is essentially zero for most parameters at a lag of 50. The degree of autocorrelation, however, varied significantly between the DMG model and each DAG model. The chains for the DMG model mixed considerably better.

12. A few technical notes: we used the priors suggested in Dunson et al. (2005), except that we changed the confidence in the prior of the covariance of the error terms V to be smaller (in order to minimize the influence of the priors in the models, since in this particular problem the DMG and DAG models are nearly likelihood equivalent but not posterior distribution equivalent: the priors belong to different families). We used 1 degree of freedom in our G-Inverse Wishart, with the matrix parameter being the expected value of Dunson et al.'s prior. For the DAG models, we also used the G-inverse Wishart prior for the error terms, but where all error terms are independent. For the DAG model with a free coefficient per ancillary latent, we assigned a standard Gaussian prior to such coefficients. The chains were initialized randomly by sampling standard Gaussians for the coefficients and latent variables. Error covariance matrices were initialized to diagonal matrices with diagonal entries sampled uniformly in [1, 2]. Coefficient parameters were sampled jointly given the error covariance matrix and latent variables. Latent variables were also sampled jointly, given the parameters.

Figure 15: Comparison of the effective sample size of the MCMC algorithm applied to the three models (DMG vs. positive covariance DAG, and DMG vs. unconstrained DAG). Here we plot the average effective sample sizes over 80 trials of 50,000 samples for each of the 66 entries of the covariance matrix. Points over the line indicate parameters where the DMG approach performed better.
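The following paragraphs summarize this difference in mixing through effective sample sizes, which the paper computes with the EFFECTIVESIZE function of the R package CODA. As a rough illustration of what such a statistic measures, the sketch below is a minimal autocorrelation-based estimate of our own (not the spectral estimator CODA uses), applied to a single scalar chain:

```python
import numpy as np

def effective_sample_size(chain):
    """Crude effective-sample-size estimate for a 1-D MCMC trace.

    Uses n / (1 + 2 * sum of autocorrelations), truncating the sum at
    the first non-positive autocorrelation.  A simple stand-in for the
    spectral estimate used by CODA's effectiveSize.
    """
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    var = x.var()
    if var == 0.0:
        return float(n)
    acf_sum = 0.0
    for lag in range(1, n):
        rho = np.dot(x[:-lag], x[lag:]) / ((n - lag) * var)
        if rho <= 0.0:              # truncate at the first non-positive lag
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

# example: a strongly autocorrelated AR(1) chain of 50,000 draws
rng = np.random.default_rng(0)
z = np.zeros(50_000)
for t in range(1, z.size):
    z[t] = 0.9 * z[t - 1] + rng.normal()
print(effective_sample_size(z))     # far below the nominal 50,000 draws
```

Chains with high autocorrelation, like the AR(1) example above, retain only a small fraction of their nominal sample size.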
To summarize such behavior, we calculated the effective sample size of the samples obtained from several chains. The parameters of interest in this comparison are the independent entries in the 11 × 11 dimensional observed covariance matrix. This is a total of 66 parameters. The effective sample size statistics were obtained from 80 independent chains of 50,000 samples each, for the three models. For each chain and each parameter, we compute the desired statistic using the EFFECTIVESIZE function implemented in the R package CODA, freely available on the Internet. Results are summarized by boxplots in Figure 14. Parameters are ordered on the x-axis following the upper triangular covariance matrix, scanning it in the order {σY1Y1, σY1Y2, ..., σY1Y11, σY2Y2, ..., σY11Y11}. White boxplots correspond to the distribution of effective sample size statistics with the DMG model across the 80 independent chains. Gray boxplots correspond to the two DAG variants. There is no significant difference between the behaviour of the Gibbs sampling procedure for the two DAG models. The procedure with the DMG model is clearly better behaved. As a summary statistic, the average effective sample size over 80 trials was steadily larger in the DMG outcome than in the positive DAG outcome (61 out of 66 parameters) and the unconstrained DAG (59 out of 66). The comparison of averages is illustrated by Figure 15.

By caching the sufficient statistics of the data and factorizing the sampling procedure according to the districts of the graph, the running time for generating 50,000 samples from the DMG model was 34 seconds on a dual-core Pentium IV 2.0 GHz. Depending on the connectivity of the bi-directed components of the graph and on the implementation of matrix inversion, sampling from the DAG model might be faster than sampling from the DMG. In this particular study, sampling from the DAG models was substantially slower, an approximate average of 60 seconds for both variants. This can be explained by the fact that sampling latent variables is very expensive, especially considering that in the given DAG models all ancillary latents become dependent when conditioning on the data. To summarize, the DMG approach allowed for a complete parameterization with significantly better mixing properties, while still resulting in a faster MCMC procedure.

7.2 Structure Learning Applications

When trying to find a point estimate of graphical structures (i.e., returning a single graph that explains the data well), simple approaches such as testing for marginal independencies are reasonable learning algorithms under the Gaussian assumption. The Bayesian approach, however, allows one to compute odds and distributions over graphs and graph statistics, for example, the joint probability of small substructures (Friedman and Koller, 2003). Moreover, it is not clear how the independence test procedure controls for the predictive ability of the model, which is not a straightforward function of the edges that are selected due to the quantitative aspects of the dependencies. We evaluate our Bayesian model selection contribution, focusing on the Monte Carlo sampler for bi-directed models. Jones et al. (2005) propose the following prior for graphs:

$$P(\mathcal{G} \mid \beta) = \beta^{|E|} (1 - \beta)^{0.5m(m-1) - |E|}$$

where β is a hyperparameter, |E| is the number of edges in G, and m is the number of nodes.
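Written out on the log scale, this prior simply treats each of the $m(m-1)/2$ possible edges as being present independently with probability $\beta$:

$$\log P(\mathcal{G} \mid \beta) \;=\; |E|\log\beta \;+\; \Big(\tfrac{1}{2}m(m-1) - |E|\Big)\log(1-\beta).$$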
As suggested by Jones et al., we choose β = 0.5/(m − 1), which puts more mass on graphs with O(m) edges than the uniform prior. We start with a brief synthetic study to compare the approach against a simple but effective approach based on the BIC approximation.13 An experiment with gene expression data closes this subsection.

7.2.1 SYNTHETIC STUDIES

As a sanity check for the procedure, we generate synthetic 10-dimensional Gaussian data from models that are Markov with respect to a bi-directed graph. One hundred data sets of 50 datapoints each are generated, each coming from a different model.14 We initially find a structure by marginal independence tests using Fisher's Z statistic at a 0.05 level. From this starting point, we perform two searches: one using the BIC score, and the other using the marginal likelihood with a G-IW prior.15 Given the best model for each procedure, we evaluate the predictive log-likelihood on a test set of 2,000 points which are independently sampled for each of the 100 models.

13. The BIC approach is an asymptotically consistent score for selecting the maximum a posteriori Gaussian bi-directed graph model (Richardson and Spirtes, 2002).

14. The details of the simulated data are as follows: we start with a DAG with no edges, with observed nodes {Y1, Y2, ..., Y10} and hidden nodes {X1, X2, X3, X4}. Each individual edge Xi → Yj is added with probability 0.35, and no other edges are allowed. We reject graphs with fewer than 10 edges. All coefficient parameters are sampled from a standard Gaussian, and variances from a uniform distribution in [0, 1]. The model over Y corresponds to a bi-directed graph, where the edge Yi ↔ Yj exists if and only if Yi and Yj have a common latent parent Xk in the DAG. We then store 50 samples for the Y variables in a data set. The procedure is repeated 100 times with different parameters and graphical structures each time. The average number of edges in the resulting simulation was 18.4 edges per graph.

15. In both cases, we center the data at the empirical mean of the training set and assume the data to have been generated from a zero-mean Gaussian. The G-Inverse Wishart is an empirical prior: a diagonal matrix with the training variance of each variable used as the diagonal. The number of degrees of freedom is set to 1. The search is a standard greedy procedure: we evaluate the marginal log-likelihood or BIC score for each graph that differs from the current candidate by one edge (i.e., graphs with one more or one fewer edge) and pick the one with the highest score. We stop when no improvement is possible.

Figure 16: The difference in predictive log-likelihood between models learned with the G-IW prior and the best BIC models found by greedy search. Although the difference per point is small, it reflects a persistent advantage of the full Bayesian approach. Figure (a) shows the estimated density of the distribution of differences when predicting points using the Bayesian predictive log-likelihood. Since the BIC search method does not attempt to maximize the finite sample posterior distribution, we provide Figure (b) for completeness: in this case, the predictive log-likelihood for the BIC model was calculated using the maximum likelihood estimator. The difference hardly changes, and the fully Bayesian model still wins (density estimates produced by the DENSITY(·) function of R 2.6.0).

The average difference in log-likelihood prediction16 between the structure learned with the Bayesian prior and the BIC-induced model is depicted in Figure 16(a).
This is computed by conditioning on the learned structures (fully Bayesian vs. BIC maximum a posteriori graphs) and marginalizing over the posterior of the parameters. The parameter priors are those used for the structure learning step. This might be unfair for the BIC procedure, since it is not designed to maximize the finite sample posterior: hence we also show in Figure 16(b) the results obtained when the predictions given the BIC model are obtained by using the maximum likelihood estimators of the parameters. The average difference in the first case is 400.07, and only slightly less for the second case (389.63). Most of the mass of the difference distribution is positive (85 out of 100 for the first case, 89 out of 100 in the second case), which passes a sign test at a 0.05 level. The difference is still relatively small, suggesting that informative prior knowledge might be necessary to produce substantially better predictions.

16. In terms of incorrect edge additions and deletions, the procedures behave about the same: an average of one third of the edges is missed, and 7% of edges are incorrectly added (individual percentages are with respect to the total number of possible mistakes in each graph). Unlike BIC, however, our procedure allows for different trade-offs by using different priors. It should also be pointed out that counting edge errors is just one possible measure. A more global quantitative score such as predictive log-likelihood takes into account, indirectly, the magnitude of the errors, although it is not a direct measure of model fitness.

7.2.2 GENE EXPRESSION ANALYSIS

To illustrate the use of Bayesian model selection approaches, we analyse the gene expression data previously studied by Drton and Perlman (2007), also as Gaussian bi-directed models. As before, our goal will be to compare the predictive power of models learned by greedy search with BIC and greedy search with the Bayesian posterior. The data consists of 13 gene expression measurements from a metabolic network. A total of 118 points is available. Using all data, the BIC-induced graph has 39 edges, while the finite sample posterior graph had 44. The same procedure used in the synthetic studies, for initializing graphs and choosing priors and centering the data, was applied in this case with two choices of degrees of freedom δ for the G-IW prior: δ = 1 and δ = 5. Preliminary experiments where 90% of the samples are assigned to the training set showed a negligible difference between methods. We then generate 10 random splits of the data, 50% of them assigned to the training set. Predictive results using the MCMC method for evaluating the Bayesian predictions (with half a million samples) are shown in Table 1. The BIC graphs are by definition the same in the three sets of evaluation, but parameters are learned in three different ways (maximum likelihood point estimation and Bayesian averaging with two different priors). There is a steady advantage for the Bayesian approach, although a small one.
Notice that using Bayesian averaging over parameters given the BIC graph improves prediction when compared to using the maximum likelihood point estimate, despite the simplistic choice of prior in this study. Notice also that the cases where the Monte Carlo method has small or no advantage over the BIC method were the ones where the maximum likelihood estimators produced their best results.

Table 1: Results for the 10 random splits of the gene expression data, with 50% of the points assigned to the training set. The first column shows the predictive log-likelihood for the graph learned with the BIC criterion and parameters fit by maximum likelihood. The next two columns show predictive log-likelihood results for the graphs learned with BIC and the Monte Carlo (MC) marginal likelihood method using a G-IW prior with degrees of freedom δ = 1. The last two columns are the results of a prior where δ = 5. Best results in bold.

Folder   MLE BIC     δ=1 BIC     δ=1 MC      δ=5 BIC     δ=5 MC
1        -6578.44    -6382.45    -6308.19    -6342.82    -6296.14
2        -6392.67    -6284.64    -6277.94    -6279.54    -6285.26
3        -8194.54    -6567.89    -6433.51    -6553.88    -6452.15
4        -6284.00    -6265.16    -6285.77    -6252.54    -6258.42
5        -9428.00    -6473.93    -6400.51    -6483.43    -6469.45
6        -7111.45    -6573.85    -6572.74    -6528.76    -6513.02
7        -6411.43    -6329.53    -6317.18    -6313.05    -6309.18
8        -6350.44    -6319.87    -6295.19    -6299.53    -6297.80
9        -6374.31    -6307.13    -6308.21    -6297.47    -6304.25
10       -7247.82    -6584.96    -6468.51    -6528.61    -6444.55

7.2.3 REMARKS

The procedure based on the sampler is doable for reasonably sized problems, on the order of a few dozen variables, on desktop machines. Further improvements are necessary for larger problems. One aspect that was not explored here was re-using previous computations when calculating the probability of a new candidate, in a way similar to the local updates in DAG models (Chickering, 2002). How to combine local updates with the ordering-based improved sampler of Section 6 is left as future research. Several practical variations can also be implemented, such as vetoing the inclusion of edges associated with high p-values in the respective independence tests. Such tabu lists can significantly shrink the search space.

It is important to evaluate how the Monte Carlo procedure for computing normalizing constants behaves in practice. For all practical purposes, the procedure is an importance sampler and as such is not guaranteed to work within a reasonable amount of time for problems of high dimensionality (MacKay, 1998). We can, however, exploit the nature of the problem for our benefit. Notice that the procedure depends upon a choice of ordering ≺ for the variables. Different orderings correspond in general to different importance distributions. We can play with this feature to choose a suitable ordering. Consider the following algorithm for choosing an ordering given a bi-directed graph G (a code sketch follows the listing):

1. Let ≺ be an empty queue.
2. Let G′ be the graph complement of G, that is, the graph where {Yi, Yj} are neighbors if and only if they are not adjacent in G.
3. Let C be an arbitrary maximum clique of G′. Add all elements of C to the end of ≺ in any arbitrary order.
4. For each pair {Yi, Yj}, not intersecting C, such that the path Yi ↔ Yk ↔ Yj exists in G and Yk ∈ C, add the edge Yi ↔ Yj to G.
5. Remove all elements Yk ∈ C from G, including any edge into Yk.
6. Iterate Steps 2-5 until G is an empty graph.
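The sketch below is one possible Python rendering of Steps 1-6. It is our own illustration rather than the authors' implementation, and it replaces the maximum-clique step by the greedy clique construction described in footnote 17, since finding a true maximum clique is itself a hard problem. The dictionary `adj`, mapping each vertex to the set of its spouses in G, is a hypothetical input format.

```python
def bidirected_ordering(adj):
    """Heuristic ordering for the normalizing-constant sampler.

    adj: dict mapping each vertex to the set of its spouses in the
    bi-directed graph G (symmetric).  Follows Steps 1-6 in the text,
    with the maximum clique of the complement graph replaced by a
    greedy construction.
    """
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    order = []                                    # the queue "<"
    while adj:                                    # Step 6
        # Steps 2-3: greedily grow a clique C of the complement of G,
        # i.e., a set of vertices that are pairwise NON-adjacent in G
        vertices = list(adj)
        C = [vertices[0]]
        for v in vertices[1:]:
            if all(v not in adj[u] for u in C):
                C.append(v)
        order.extend(C)
        C = set(C)
        # Step 4: connect pairs outside C that share a spouse in C
        outside = [v for v in adj if v not in C]
        for i, vi in enumerate(outside):
            for vj in outside[i + 1:]:
                if adj[vi] & adj[vj] & C:
                    adj[vi].add(vj)
                    adj[vj].add(vi)
        # Step 5: remove C from G
        for v in C:
            del adj[v]
        for v in adj:
            adj[v] -= C
    return order
```

The returned list can then be used as the ordering ≺ that defines the importance distribution.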
The resulting queue ≺ is an ordering that attempts to maximize the number of variables that are marginally independent given their common predecessors. This is just one possibility to simplify the importance sampling distribution: perfect orderings and the approaches for simplifying maximum likelihood estimation described by Drton and Richardson (2008b) could be adapted to provide even better orderings, but we leave this as future work.17

17. In our actual implementation used in the experiments in this Section, we implemented an even simpler approach: instead of finding maximum cliques, we start to build a clique from a particular node, "greedily" adding other nodes to the clique according to the column order of the data set. Each node generates a candidate clique, and we pick an arbitrary clique of maximal size to be our new set C.

Figure 17: An evaluation of the stability of the Monte Carlo normalizing function procedure. The top row depicts the marginal likelihood estimates for the gene problem using two different distributions implied by two different orderings (default and optimized), as explained in the text. Experiments with synthetic data are shown in the bottom row (123 parameters and 658 parameters, each comparing an arbitrary and an optimized ordering), and the bottom-right figure illustrates major differences. Each panel plots the marginal likelihood estimate against the number of samples.

Figure 17 illustrates the difference that a smart choice of ordering can make. The top left graph in Figure 17 depicts the progress of the marginal likelihood Monte Carlo estimator for the gene expression problem using the graph given by the hypothesis testing procedure. The model has 55 parameters. We obtain three estimates, each using a sample of 100,000 points, which allows us to observe how the estimates change at the initial stages. The variable ordering in this case is the ordering present in the original database (namely, DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, PPDS1 and PPDS2). The top right graph shows three runs using the optimized ordering criterion. Convergence is much faster in this case, and both samplers agree on the normalizing constant estimate.

As an illustration of the power of the procedure and its limitations, we generated a synthetic sample of 1,000 training points from a graph with 25 nodes, using the same procedure as in Section 7.2.1. A run of two different samplers is shown at the bottom left of Figure 17. They are seemingly well-behaved, the ratio between the largest and median weight being on the order of one hundred in the "optimally" ordered case. In contrast, the bottom right corner of Figure 17 illustrates the method with a covariance matrix of 50 random variables and 1,000 training points. Notice this is a particularly dense graph. Much bigger jumps are observed in this case and there is no clear sign of convergence at 100,000 iterations.
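In practice one can monitor an importance sampler of this kind with simple summaries of the weights. The sketch below is our own illustration, not code from the paper: given a vector of log importance weights it returns the log normalizing-constant estimate, the largest-to-median weight ratio quoted above, and the effective sample size of the weights.

```python
import numpy as np

def weight_diagnostics(log_weights):
    """Simple health checks for a set of importance-sampling log-weights.

    Returns the log of the normalizing-constant estimate, the ratio of the
    largest to the median weight, and the effective sample size of the
    weights.  These are heuristics, not guarantees of convergence.
    """
    lw = np.asarray(log_weights, dtype=float)
    m = lw.max()
    w = np.exp(lw - m)                      # stabilised weights
    log_z = m + np.log(w.mean())            # log of the estimate
    max_over_median = w.max() / np.median(w)
    ess = w.sum() ** 2 / np.sum(w ** 2)
    return log_z, max_over_median, ess
```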
While there is no foolproof criterion to evaluate the behavior of an importance sampler, the relationship between orderings provides a complementary technique: if the normalizing constant estimators vary substantially for a given set of random permutations of the variables, then the outcomes are arguably not to be trusted even if the respective estimators appear to have converged.

Concerning the choice of priors, in this Section we exploited empirical priors. The G-Inverse Wishart matrix hyperparameter is a diagonal matrix where the variance entries are the sample variances. While this adds an extra bias towards diagonal matrices, at least in our experiments we performed close to or better than other approaches; it is, however, not clear whether we could have done much better. It is still an open question which practical "default" hyperparameters will prove useful for the G-IW. Elicitation of subjective priors in the context of structural equation models can benefit from pre-existing work on Bayesian regression, although again practical matters might be different for the G-IW. Dunson et al. (2005) describe some limitations of default priors for structural equation models. A thorough evaluation of methods for eliciting subjective priors is outside the scope of this work, but existing work on inverse Wishart elicitation provides a starting point (Al-Awadhi and Garthwaite, 1998). As in the case of the inverse Wishart, the G-Inverse Wishart has a single hyperparameter for specifying degrees of freedom, a limitation which might motivate new types of priors (Brown et al., 1993).

7.3 Discrete Data Applications

We now show results on learning a discrete distribution that factorizes according to a mixed graph. Drton and Richardson (2008a) describe applications on real-world binary data modeled according to bi-directed graphs. The empirical contingency tables for two studies can be found in the corresponding technical report (Drton and Richardson, 2005). Drton and Richardson used a complete parameterization for bi-directed binary models and a maximum likelihood estimation procedure. In this section, we analyze these two data sets to illustrate the behavior of our Bayesian procedure using the probit model. Our model imposes probit constraints that are not enforced by Drton and Richardson, but it allows us to obtain Bayesian credible intervals and predictions.

The graphs used in the two studies are depicted in Figure 18. The first problem is a study on the dependence between alcoholism and depression, as shown in Figure 18(a). A data point is collected for a given pair of mono-zygotic twins. For each sibling Si, it is recorded whether Si is/is not alcoholic (Ai), and whether Si suffers/does not suffer from depression (Di). The hypothesis encoded by the graph is that alcoholism and depression do not share a common genetic cause, despite A and D having some hidden (but different) genetic causes.

Figure 18: Two learning problems with discrete data. In (a), the graph shows dependencies concerning alcoholism (Ai) and depression (Di) symptoms for paired twins {1, 2}. In (b), a model for dependencies among features of a study on parole appeals (Success, Type of offense, Prior sentence, Drugs, Age), including the success of the parole, whether the type of offense was a person offense or not, and whether the offender had a dependency on drugs and was over 25 years old. All variables in these studies are binary and further details and references are provided by Drton and Richardson (2008a).
Figure 19: Posterior distribution of some of the marginal contingency table entries for the twin model (panels: A1A2D1D2 = 0000, 0001 and 1000; densities estimated from 5,000 samples).

If A and D did have genetic common causes, one would expect that the edges A1 ↔ D2 and A2 ↔ D1 would also be required. The compounded hypothesis of marginal independencies for Ai and Dj, i ≠ j, can be tested jointly by testing a bi-directed model. Notice that no reference to particular genetic hidden causes of alcoholism and depression is necessary, which again illustrates the power of modeling by marginalizing out latent variables.

The second study, as shown in Figure 18(b), concerns the dependencies among several variables in an application for parole. The model implies, for instance, that the success of a parole application (Success node, in the Figure) is independent of the age of the offender being under 25 (Age node). However, if it is known that the offender had a prior sentence, these two variables become dependent (through the path Success ↔ Prior sentence ↔ Age). As reported by Drton and Richardson, their binary bi-directed model passes a significance test.

Figure 20: The posterior expected value of the 16 entries in the twin study table (E[Θ|D]). Results generated with a chain of 5,000 points. We also show the maximum likelihood estimates of Drton and Richardson (MLE) and the maximum likelihood values obtained using an unconstrained model (uMLE). Despite the probit parameterization, in this particular study there is a reasonable agreement between the Bayesian estimator and the estimator of Drton and Richardson.

A1A2D1D2   E[Θ|D]   MLE     uMLE        A1A2D1D2   E[Θ|D]   MLE     uMLE
0000       0.461    0.461   0.482       1000       0.018    0.018   0.013
0001       0.136    0.138   0.134       1001       0.003    0.004   0.007
0010       0.157    0.159   0.154       1010       0.021    0.020   0.013
0011       0.097    0.096   0.085       1011       0.009    0.009   0.015
0100       0.032    0.032   0.025       1100       0.008    0.010   0.005
0101       0.022    0.021   0.015       1101       0.003    0.002   0.003
0110       0.007    0.008   0.012       1110       0.003    0.005   0.007
0111       0.012    0.012   0.017       1111       0.006    0.005   0.012

Drton and Richardson also attempted to learn an undirected (Markov) network structure with this data, but the outcome was a fully connected graph. This is expected, since Markov networks cannot represent marginal independencies unless the graph is disconnected, which would introduce all sorts of other independencies and possibly not fit the data well. If many marginal independencies exist in the data generating process, Markov networks might be a bad choice of representation. For problems with symmetries such as the twin study, DAGs are not a natural choice either.

7.3.1 RESULTS

For the twin data problem, we used a simple prior for the covariance matrix of the underlying latent variables: a G-inverse Wishart with 1 degree of freedom and a complete covariance with a value of 2 for each element in the diagonal and 1 outside the diagonal. Thresholds are fixed at zero, since we have binary data. We present the expected posterior values of the contingency table entries in Figure 20. The outcome is essentially identical to the maximum likelihood estimates of Drton and Richardson despite the probit parameterization.
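As an illustration of how such table entries arise from the probit parameterization, the sketch below estimates the 16 cell probabilities implied by a zero-threshold probit model by Monte Carlo. It is our own sketch, not the paper's procedure: `Sigma` is assumed to be a single draw (or the posterior mean) of the underlying 4 x 4 covariance matrix for (A1, A2, D1, D2) with zero means, each observed variable is taken to be the indicator of its underlying Gaussian exceeding zero, and the paper's actual estimates average such cell probabilities over the posterior.

```python
import numpy as np

def cell_probabilities(Sigma, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the binary contingency-table entries implied
    by a zero-threshold probit model with latent covariance Sigma.

    Cell probabilities are the orthant probabilities of N(0, Sigma),
    estimated here by sampling the underlying Gaussian and thresholding.
    """
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_draws)
    cells = (z > 0).astype(int)
    counts = {}
    for row in cells:
        key = "".join(map(str, row))        # e.g. "0101" for A1A2D1D2
        counts[key] = counts.get(key, 0) + 1
    return {k: v / n_draws for k, v in sorted(counts.items())}
```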
Moreover, with our procedure we are able to generate Bayesian confidence intervals, as illustrated in Figure 19. The results are very stable for a chain of 1,000 points.

For the parole data, we used a G-inverse Wishart prior for the covariance matrix of the underlying variables Y⋆ with 1 degree of freedom and the identity matrix as hyperparameters. We compare the effective sample size of the Gibbs sampler for our DMG model and the DAG model obtained by using the ancillary latent parameterization of Section 7.1 for the underlying latent variable covariance matrix.18 Boxplots for the 16 contingency table entries of the twin network and the 32 entries of the parole study are shown in Figure 21. The setup is the same as in the democratization and industrialization experiment, where we run 80 independent chains and plot the distribution of the effective sample sizes to measure the mixing time. We ran a shorter chain of 2,000 points, since computing the contingency table entries is expensive. There is a substantial difference in effective sample size for the parole study. Notice that we are comparing MCMC samples for the entries in the contingency table, which in the DAG case requires integrating out not only the underlying latent variables implicit in the probit parameterization, but also the ancillary latents that account for the bi-directed edges. This hierarchy of latent variables, which does not exist in the DMG case, causes a considerable increase in the autocorrelation of the chain compared to the DMG model. The standard DMG parameterization can be seen as a way of obtaining a collapsed Gibbs sampler, where the parameterization by construction reflects latent variables that were analytically marginalized.

18. The priors used are as follows: the ancillary representation was given a prior with mean 1 and variance 1 for the coefficients Xij → Yj⋆, for j > i, and set constant to 1, if i < j. The means of the ancillary latents were fixed at 0. Variance parameters were given (0.5, 0.5) inverse gamma priors, which approximately matches the priors in the DMG model.

Figure 21: Comparison of effective sample sizes for the twin data (a) and parole data (b). 80 independent chains of 2,000 points were obtained using the Gibbs sampling algorithm, and the respective box-plots shown. The Markov chain with the DMG approach easily dominates the DAG one. For the parole data, the average effective sample size for the DAG was as low as 60 points.

8. Conclusion

Directed mixed graph models are a generalization of directed graph models. Whenever a machine learning application requires directed graphs, one should first consider whether directed mixed graphs are a better choice of representation instead. DMGs represent conditional independencies of DAGs where hidden variables have been marginalized out. Given that in most applications it is unlikely that all relevant variables are known, DMGs are a natural representation to use.
In this paper, we introduced priors and inference algorithms for Bayesian learning with two popular families of mixed graph models: Gaussian and probit. We discussed some implementations and approximations to scale up algorithms. We showed examples of applications with real data, and demonstrated that Bayesian inference in Gaussian and probit DMG models using MCMC can have substantially faster mixing than in comparable DAGs.

It is part of the machine learning folklore that factor graphs can subsume directed networks. In an important sense, this is known not to be true: undirected and factor graphs only allow for monotonic independence models, where explaining away is ruled out. This excludes a vast number of realistic, non-monotonic, models. While factor graphs are perhaps the data structures of choice for general message-passing algorithms (e.g., Yedidia et al., 2005), they are far from being universal modeling languages for independencies. What is true is that for any distribution that is Markov with respect to a DAG or DMG there is at least one corresponding factor graph model, but this is a vacuous claim of little interest: any distribution can be represented by a single-factor model involving all variables. Some will require a factor with all variables, even under the presence of a large number of independence constraints. For instance, a factor graph corresponding to any given bi-directed chain will necessarily include a factor node adjacent to all variable nodes, as illustrated in Figure 22. When parameterizing a distribution with many marginal independencies (e.g., a bi-directed tree), the respective factor graph would be no more than an unhelpful drawing. A better strategy for solving real-world problems is to define a family of models according to the (directed/undirected/factor) graphs of choice, and let the inference algorithm decide which re-expression of the model suits the problem. This has been traditional in the graphical modeling literature (Lauritzen, 1996). The strategy adopted in this paper followed this spirit.

Figure 22: In (a), a simple bi-directed chain with four random variables. In (b), the respective factor graph that is obtained from a Bartlett parameterization using the ordering ≺ ≡ {Y1, Y2, Y3, Y4}. In this case, the factors are p(Y1) × p(Y2|Y1) × p(Y3|Y1,Y2) × p(Y4|Y1,Y2,Y3). A different choice of ordering (e.g., the perfect ordering) could provide simpler factors on average, but the presence of a factor linked to all variables is unavoidable.

An alternative has been recently introduced by Huang and Frey (2008). This paper discusses graphical families of marginal independence constraints (essentially identical to bi-directed graphs, although other types of constraints might implicitly follow from the parameterization). Models are parameterized using a very different strategy. The idea is to parameterize cumulative distribution functions (CDFs) instead of densities or probability mass functions. A simple factorization criterion can be defined in the space of CDFs, but densities have to be computed by a novel message-passing scheme. The particular application discussed by Huang and Frey (2008) could in principle be approached using the Gaussian bi-directed probit model of Section 5, but the parameterization in Huang and Frey (2008) does not need to rely on Gaussian distributions. It is not clear, however, how to efficiently perform Bayesian inference in this case and which constraints are implicitly implied by the different choices of parameterization.
The different perspective given by products of CDFs is novel and promising. It should point to new directions in mixed graph modeling. The structural equation modeling literature also describes several pragmatic ways of specifying non-linearities in the structural equations (Lee, 2007). Less common is the specification of non-Gaussian models for the joint density of the error terms. Silva and Ghahramani (2009) introduce a flexible mixture of Gaussians approach for models of marginal independence. There remains a need to combine this approach with flexible families of structural equations in a computationally efficient way. Also, models with non-additive error terms remain to be explored.

Current interest in estimating sparse statistical models has led to approaches that estimate structured covariance matrices (e.g., Bickel and Levina, 2008). This development could also lead to new families of priors. In particular, different matrix decompositions have motivated different ways of specifying priors on covariance matrices. For instance, Chen and Dunson (2003) propose a modified Cholesky decomposition for the covariance matrix of random effect parameters: standard deviations are parameterized separately with a prior that puts positive mass on zero variances (effectively allowing the random effect to be neutralized). Wong et al. (2003) describe a prior for inverse correlation matrices that is uniform conditioned on the number of structural zeros. Metropolis-Hastings schemes are necessary in this case. Shrinkage methods have also been applied to the estimation of covariance matrices. A common approach, shrinkage towards a diagonal matrix (e.g., Daniels and Kass, 1999), could be generalized towards some sparse matrix corresponding to a bi-directed graph. Although shrinkage will not generate structural zeros in the resulting matrix, allowing for sparse shrinkage matrices other than the identity matrix could be interesting in prediction problems. Some approaches can exploit an ordering for the variables, which is natural in some domains such as time-series analysis. While the G-Inverse Wishart is invariant to a permutation of the variables, new types of priors that exploit a natural variable ordering should be of interest, as in the original work of Brown et al. (1993) that motivated our approach.

Other directions and applications are suggested by recent papers:

• learning measurement models: the industrialization and democratization problem of Section 7.1 provides an example of a measurement model. In such a family of problems, observed variables are children of latent variables, and connections from latents to observables define the measurement model. Sparsity in the measurement can be exploited to allow for more general dependencies connecting latent variables. One role of the bi-directed component is to allow for extra dependencies connecting observed variables that are not accounted for by the explicit latent variables in the model. Silva et al. (2006) describes a learning algorithm for mixed graph measurement models using the "ancillary" parameterization. The natural question is which alternative optimization strategies could be used and how to scale them up;

• structural and relational learning: in prediction problems where given an input vector X we have to predict an output vector Y, the dependence structure of Y given X can also lie within the directed mixed graph family. Silva et al.
(2007) introduces mixed graph models within the context of relational classification, where Y are labels of different data points 1231 S ILVA AND G HAHRAMANI not independently distributed. In such a class of problems, novel kinds of parameterization are necessary since the dimensionality of the covariance matrix increases with the sample size. Structural features of the graph are used to propose different parameterizations of the dependencies, and many other alternatives are possible; • causal inference: mixed graphs have been consistently used as a language for representing causal dependencies under unmeasured confounding. Zhang (2008) describes recent advances in identifying causal effects with ancestral graphs. Algorithms for learning mixed graph structures are described by Spirtes et al. (2000) and the recent advances in parameterizing such models should result in new algorithms; Many challenges remain. For instance, more flexible models for DMG discrete models are being developed (Drton and Richardson, 2008a), but for large graphs they pose a formidable computational problem. An important question is which other less flexible, but more tractable, parameterizations could be used, and which approximation algorithms to develop. The probit family discussed here was a choice among many. The parameterization by Drton and Richardson (2008a) could be a starting point for trading-off flexibility and computational effort. And while it is true that Gaussian copula models (Pitt et al., 2006) can be adapted to generalize the approach introduced here, it remains to be seen if other copula parameterizations easily lead to DMG models. Acknowledgments We thank the anonymous reviewers for their suggestions, Kenneth Bollen for providing us with the industrialization and democratization data set, and Robert Gramacy for helpful discussions. An earlier version of this paper (Silva and Ghahramani, 2006) appeared in the proceedings of the Uncertainty in Artificial Intelligence conference. This work was funded by the Gatsby Charitable Foundation and a EPSRC grant #EP/D065704/1. Appendix A. Deriving the Sampling Distribution for the Monte Carlo Computation of Normalizing Constants We give here the details on how to derive the sampling distribution used for computing normalizing constants IG (δ, U), as described in Section 3.2.2. Let Ai ≡ Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . Recall from Equation (7) that Bi,nsp≺ (i) = −Bi,sp≺ (i) Ai . nsp The original density p(Bi | γi ), as given by Lemma 1, is a multivariate Gaussian with the following kernel: 1 exp − 2γi T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) (22) where U{i−1},{i−1} in Lemma 1 was rearranged above as the partitioned matrix in (14). The pair {Msp≺ (i) , Mnsp≺ (i) } corresponds to the respective partition of the mean vector Mi . 
Plugging in the expression for Bi,nsp≺ (i) in (22), we obtain the modified kernel 1232 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS exp − 1 2γi T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) (23) which can be rewritten as pb (Bi,sp≺ (i) ; Ki mi , γi Ki ) × (2π)#sp≺ (i)/2 |γi |#sp≺ (i)/2 |Ki (Φi−1 )|1/2 1 × exp − γ−1 Ui 2 i (24) where #sp≺ (i) is the size of set sp≺ (i), pb (·; α, Σ) is the density function of a multivariate Gaussian distribution with mean α and covariance Σ, Ki (Φi−1 ) ≡ Ki to emphasize the contribution of previous parameters, and mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i Ui = MT U{i−1},{i−1} Mi − mT Ki mi . i i / If sp≺ (i) = 0, it follows that Bi = Bi,nsp≺ (i) = 0. The kernel (23) reduces to exp(−0.5Ui /γi ), / and Ui ≡ MT U{i−1},{i−1} Mi . If nsp≺ (i) = 0, then the expression for the kernel does not change i (Ui ≡ 0), and Equation (24) corresponds to the original kernel in Equation (11). Inserting the re-expressed kernel into the original function (11), we obtain pb (Bi,sp≺ (i) ; Ki mi , γi Ki )pg γi ; δ + i − 1 + #nsp≺ (i) uii.{i−1},{i−1} + Ui , 2 2 fi (Φi−1 ) where pg (·; α, β) is an inverse gamma density function and (i−1)−#sp≺ (i) 2 |Ki (Φi−1 )|1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 Appendix B. Variational Updates for Gaussian Mixed Graph Models The variational updates for the coefficient and intercept parameters are essentially identical to their joint conditional distribution given V and X, where occurrences of V and X are substituted by expectations V−1 q(V) and X q(X) , respectively. Let Vi j be the i j-th entry of V−1 q(V) . The covariance matrix of (B, α) is the covariance matrix of the vector vec(B, α). Such vector is constructed using all (non-zero) coefficients and intercepts. We denote this covariance matrix by ΣB,α . For simplicity of notation, we will treat αi as the coefficient bi(m+1) , m being the number of variables. We will (d) also adopt the notation Ym+1 ≡ 1 in the following derivations. As an abuse of notation, let Y also 1233 S ILVA AND G HAHRAMANI refer to latent variables. In this case, if Yi and Y j refer to latent variables Xhi and Xh j , then define Yi ≡ Xhi q(X) , and YiY j ≡ Xhi Xh j q(X) . Let bi j and btv be the r-th and s-th entries of vec(B, α), respectively. The rs-th entry of the inverse matrix Σ−1 is given by Bα (Σ−1 )rs = Vit Bα n ∑ Yj (d) (d) Yv + 1(i = t)1( j = v) d=1 cbj i sbj i where bxpx ≡ 0 if no edge Yx ← Ypx exists in the graph, 1(·) is the indicator function, and cbj , sbj are i i the given prior parameters defined in Section 4. Similarly to the factorization criterion explained in Section 6, the matrix q(V) will in general be block-diagonal, and this summation can be highly simplified. Define now a vector cb analogous to the Gibbs sampling case, where m cb = r ∑ Vit t=1 n ∑ Y j Yt (d) (d) d=1 + cbj i sbj i . The variational distribution q(B, α) is then a N(ΣB,α c, ΣB,α ). The variational distribution for the latent variables will exactly the same as the Gibbs distribution, except that references to B, α, V−1 are substituted by B q(B,α) , α q(B,α) and V−1 q(V) . Appendix C. Proofs Proof of Lemma 2: Arrange the columns of the Jacobian such that their order corresponds to the sequence σ11 , σ21 , σ22 , σ31 , σ32 , σ33 , . . . 
, σmm , excluding the entries σi j that are identically zero by construction. Arrange the rows of the Jacobian such that their order corresponds to the sequence γ1 , β21 , γ2 , β31 , β32 , . . . , γm , excluding the entries βi j that are not in ΦG (i.e., exclude any βi j corresponding to a pair {Yi ,Y j } that is not adjacent in the graph). By the definition of Bartlett’s decomposition, Σ{i},{i} and βst are functionally independent for s > i. The same holds for Σ{i},{i} and γs . As such, ∂σi j /∂βst = 0 and ∂σi j /∂γs = 0 for s > i. This implies that J(ΦG ) is a (lower) block triangular matrix of 2m − 1 blocks: for k odd, the k-th block is the singleton ∂σii /∂γi = 1, where i = (k + 1)/2. For k even, the k-th block is the Jacobian ∂Σi,sp≺(i) /∂Bi,sp≺(i) , where i = 1 + k/2 and Σi,sp≺(i) is the vector of covariances of Yi and its preceding spouses. From the interpretation given by Equation (8), it follows that Bi,sp≺ (i) can also be defined by the regression of Yi on Zi . That is Bi,sp≺ (i) = ΣYi ,Zi Σ−1 i ≡ ΣYi ,Zi R−1 . i Zi ,Z (25) However, ΣYi ,Zi = Σi,sp≺ (i) , since Yi is independent of its non-spouses. From (25) we get Σi,sp≺ (i) = Bi,sp≺ (i) Ri , and as such the submatrix ∂Σi,sp≺(i) /∂Bi,sp≺(i) turns out to be Ri . Since the determinant of the block triangular Jacobian J(ΦG ) is given by the determinant of the blocks, this implies m |J(ΦG )| = ∏ |Ri |. i=2 By the matrix identity 1234 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS A B C D |Σ{i−1},{i−1} | = = |A||D − CA−1 B|, |Σnsp≺(i) ,nsp≺(i) ||Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) | nsp (26) ≡ i−1 |Σnsp≺(i) ,nsp≺(i) ||Ri |. Since |Σ{i−1},{i−1} | = ∏t=1 γt , the second equality holds. Proof of Theorem 4: We first describe a mapping from each path in G to a path in G ⋆ , and viceversa (such mappings are not inverse functions of each other, since the number of paths in G ⋆ is larger than in G ). By construction, all bi-directed edges in G ∗ have two UVs as endpoints, with an one-to-one mapping between each Ys⋆ ↔ Yt⋆ in G ⋆ and each Ys ↔ Yt in G . All directed edges in G ⋆ are of two types: Ys → Yt⋆ , with s = t, or Ys⋆ → Ys . Therefore, one can define an unique path P in G as a function of a path P⋆ in G ⋆ , obtained by relabeling each Y ⋆ as Y , and by collapsing any Y → Y edges that might result from this relabeling into a single vertex Y . A mapping in the opposite direction is analogous as given by the construction rule of Type-II models. A collider in a path is any vertex within a head-to-head collision in the path, that is, any vertex Yt where the preceding and the next vertex in the path are connected to Yt with an edge (directed or bi-directed) into Yt . Yi and Y j are m-separated by Z in an acyclic DMG if and only if there is no active path connecting Yi and Y j . Like in d-separation, a path is active if all of its colliders have some descendant in Z, and none of its non-colliders is in Z (Richardson, 2003). The mappings between paths P and P⋆ are such that Yt is a collider in P if and only if Yt is in P⋆ and is a collider, or Yt⋆ is in P⋆ and is a collider. Since by construction any Yt⋆ will have the same Y-descendants in G ⋆ as Yt has in G , and Z ⊂ Y, the result follows. Proof of Theorem 7: The first of the two claims of the theorem trivially holds, since connectivity is a transitive property and as such this partition will always exist (where K(i) = 1 is a possibility). We will prove the validity of the second claim by induction. Let {R1 , . . . 
, Rk } be the perfect sequence that generated our perfect ordering. The second claim automatically holds for all vertices in Rk , since Rk is a clique. Assume the second claim holds for the subsequence {Rl+1 , Rl+2 , . . . , Rk }. Let Yi be an element of Rl . Assume there is some non-spouse Yq of Yi in Rl ′ , and some spouse Yp of Yi in Rl ′′ , such that l < l ′ ≤ l ′′ . We will assume that both Yq and Yp belong to the same component Vt and show this leads to a contradiction. Without loss of generality, we can assume that Yq and Yp are adjacent: otherwise, the fact that Yq and Yp are in the connected set Vt will imply there is a path connecting Yq and Yp in the subgraph induced by {Rl+1 , . . . , Rk }. We can redefine {Yq ,Yp } to be the endpoints of the first edge in the path containing a non-spouse and a spouse of Yi . It will still be the case that q > p, by the induction hypothesis. Since Yp ∈ Rl ′′ , there is a separator Sl ′′ between Hl ′′ \Sl ′′ and Rl ′′ . But Yi ∈ Hl ′′ , and Yi is adjacent to Yp , which implies Yi ∈ Sl ′′ . If l ′ < l ′′ , this will also imply that Yq ∈ Sl ′′ , which is a contradiction, since Sl ′′ is a complete set. If l ′ = l ′′ , this implies that Yi and Yq are both in YP(l ′′ ) , which is also a contradiction since YP(l ′′ ) is a clique. 1235 S ILVA AND G HAHRAMANI References A. Al-Awadhi and P. Garthwaite. An elicitation method for multivariate normal distributions. Communications in Statistics - Theory and Methods, 27:1123–1142, 1998. J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993. A. Atay-Kayis and H. Massam. A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92:317–335, 2005. D. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Arnold Publishers, 1999. M. Beal. Variational algorithms for approximate Bayesian inference. PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. M. Beal and Z. Ghahramani. Variational Baeysian learning of directed graphical models with hidden variables. Bayesian Analysis, 1:793–832, 2006. P. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36:2577– 2604, 2008. K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989. P. Brown, N. Le, and J. Zidek. Inference for a covariance matrix. In P.R. Freeman, A.F.M. Smith (editors), Aspects of Uncertainty, a tribute to D. V. Lindley, pages 77–92, 1993. Z. Chen and D. Dunson. Random effects selection in linear mixed models. Biometrics, 59:762–769, 2003. D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002. M. Daniels and R. Kass. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94:1254–1263, 1999. M. Drton and M. Perlman. Multiple testing and error control in Gaussian graphical model selection. Statistical Science, pages 430–449, 2007. M. Drton and T. Richardson. A new algorithm for maximum likelihood estimation in Gaussian models for marginal independence. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, 2003. M. Drton and T. Richardson. Iterative conditional fitting for Gaussian ancestral graph models. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004. M. Drton and T. Richardson. 
Binary models for marginal independence. Department of Statistics, University of Washington, Tech. report 474, 2005. M. Drton and T. Richardson. Binary models for marginal independence. Journal of the Royal Statistical Society, Series B, 70:287–309, 2008a. 1236 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS M. Drton and T. Richardson. Graphical methods for efficient likelihood inference in Gaussian covariance models. Journal of Machine Learning Research, pages 893–914, 2008b. D. Dunson, J. Palomo, and K. Bollen. Bayesian structural equation modeling. Statistical and Applied Mathematical Sciences Institute, Technical Report #2005-5, 2005. N. Friedman and D. Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal, 50:95–126, 2003. P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Sch¨ lkopf. Nonlinear causal discovery with additive o noise models. Neural Information Processing Systems, 2008. J. Huang and B. Frey. Cumulative distribution networks and the derivative-sum-product algorithm. Proceedings of 24th Conference on Uncertainty in Artificial Intelligence, 2008. B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, and M. West. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20:388–400, 2005. M. Jordan. Learning in Graphical Models. MIT Press, 1998. M. Jordan, Z. Ghaharamani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in Graphical Models, pages 105–162, 1998. C. Kang and J. Tian. Local Markov property for models satisfying the composition axiom. Proceedings of 21st Conference on Uncertainty in Artificial Intelligence, 2005. J. Kotecha and P. Djuric. Gibbs sampling approach for the generation of truncated multivariate Gaussian random variables. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1757–1760, 1999. S. Lauritzen. Graphical Models. Oxford University Press, 1996. S.-Y. Lee. Structural Equation Modeling: a Bayesian Approach. Wiley, 2007. D. MacKay. Introduction to Monte Carlo methods. Learning in Graphical Models, pages 175–204, 1998. I. Murray, Z. Ghahramani, and D. MacKay. MCMC for doubly-intractable distributions. Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006. R. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. R. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. J. Pearl. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000. M. Pitt, D. Chan, and R. Kohn. Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93:537–554, 2006. 1237 S ILVA AND G HAHRAMANI T. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145–157, 2003. T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. A. Roverato. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29: 391–411, 2002. R. Scheines, R. Hoijtink, and A. Boomsma. Bayesian estimation and testing of structural equation models. Psychometrika, 64:37–52, 1999. R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. 
Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006. R. Silva and Z. Ghahramani. Factorial mixtures of Gaussians and the marginal independence model. Artificial Intelligence & Statistics (AISTATS ’09), 2009. R. Silva and R. Scheines. Bayesian learning of measurement and structural models. 23rd International Conference on Machine Learning, 2006. R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006. R. Silva, W. Chu, and Z. Ghahramani. Hidden common cause relations in relational learning. Neural Information Processing Systems (NIPS ’07), 2007. J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1:833–860, 2006. P. Spirtes. Directed cyclic graphical representations of feedback models. Proceedings of 11th Conference on Uncertainty in Artificial Intelligence, 1995. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000. R. Tarjan. Decomposition by clique separators. Discrete Mathematics, 55:221–232, 1985. E. Webb and J. Forster. Bayesian model determination for multivariate ordinal and binary data. Technical report, Southampton Statistical Sciences Research Institute, 2006. F. Wong, C. Carter, and R. Kohn. Efficient estimation of covariance selection models. Biometrika, 90:809–830, 2003. S. Wright. Correlation and causation. Journal of Agricultural Research, pages 557–585, 1921. J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005. J. Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, pages 1437–1474, 2008. 1238
6 0.037360832 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
7 0.032890234 38 jmlr-2009-Hash Kernels for Structured Data
8 0.028345184 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
9 0.025116263 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
10 0.023372985 50 jmlr-2009-Learning When Concepts Abound
11 0.023314962 23 jmlr-2009-Discriminative Learning Under Covariate Shift
12 0.022785081 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
13 0.021220911 15 jmlr-2009-Cautious Collective Classification
14 0.020753633 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
15 0.017723843 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.017426532 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
17 0.017127536 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
18 0.016874313 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
19 0.016690087 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
20 0.015639203 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
topicId topicWeight
[(0, 0.102), (1, -0.054), (2, 0.058), (3, -0.025), (4, 0.038), (5, -0.077), (6, 0.0), (7, -0.003), (8, -0.027), (9, 0.079), (10, -0.014), (11, 0.011), (12, -0.043), (13, -0.211), (14, -0.125), (15, 0.206), (16, 0.002), (17, 0.058), (18, 0.231), (19, 0.148), (20, -0.028), (21, 0.113), (22, 0.279), (23, -0.086), (24, 0.067), (25, -0.007), (26, -0.204), (27, -0.021), (28, -0.289), (29, -0.229), (30, 0.054), (31, -0.034), (32, -0.149), (33, -0.123), (34, 0.016), (35, 0.072), (36, 0.155), (37, 0.052), (38, -0.073), (39, 0.031), (40, 0.134), (41, -0.088), (42, 0.094), (43, 0.06), (44, 0.045), (45, -0.084), (46, -0.022), (47, 0.08), (48, 0.188), (49, -0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.97841167 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
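The global count merge used by the first (approximate) distributed LDA algorithm in this abstract is simple enough to sketch. Below is a minimal numpy sketch under my own naming and array shapes; it is not the authors' implementation, only the merge step as I read its description.

```python
import numpy as np

def merge_topic_counts(global_counts, local_counts_after):
    """Global update after concurrent local Gibbs sweeps (a sketch).

    global_counts:      (W, K) word-topic counts broadcast to every
                        processor before the sweep.
    local_counts_after: list of (W, K) arrays, one per processor, holding
                        that processor's counts after sampling its own
                        documents.
    Each processor only moved counts belonging to its own documents, so
    adding the per-processor deltas back onto the broadcast copy yields
    the merged counts used in the next iteration.
    """
    delta = sum(counts - global_counts for counts in local_counts_after)
    return global_counts + delta
```

In an actual run this merge would simply alternate with the concurrent local sampling sweeps until convergence.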
2 0.52935296 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, large-scale learning, interior point method
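The data-partitioning idea behind this entry can be illustrated with a small sketch. This is not the paper's hybrid MPI/OpenMP interior point solver; assuming mpi4py is available and using my own variable names, it only shows how per-rank slices of the data can each contribute a small dense feature-space matrix that is then reduced across processors, so the data itself never has to move.

```python
import numpy as np
from mpi4py import MPI  # assumes mpi4py; launch with mpiexec -n <ranks>

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(rank)
X_local = rng.standard_normal((1000, 20))  # this rank's slice of the examples
d_local = rng.random(1000) + 0.1           # positive per-example scalings

# Each rank forms its (d x d) contribution to X^T diag(d) X; summing these
# contributions across ranks gives the only data-dependent matrix a
# linear-SVM step of this kind needs, so a small dense matrix crosses the
# network instead of the training data.
local_term = X_local.T @ (d_local[:, None] * X_local)
global_term = np.empty_like(local_term)
comm.Allreduce(local_term, global_term, op=MPI.SUM)
```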
3 0.21413547 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
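As a small illustration of the metric view in this abstract, the sketch below shows how a learned linear map L would be applied at prediction time. It is not the semidefinite-program training procedure, and the function names are mine.

```python
import numpy as np

def mahalanobis_sq(x, y, L):
    """Squared distance under M = L^T L, i.e. Euclidean distance after the
    global linear transformation L that the learned metric corresponds to."""
    diff = L @ (np.asarray(x) - np.asarray(y))
    return float(diff @ diff)

def knn_predict(x, X_train, y_train, L, k=3):
    """k-nearest-neighbour vote using the learned metric (a sketch);
    X_train and y_train are numpy arrays of examples and labels."""
    d2 = [mahalanobis_sq(x, xi, L) for xi in X_train]
    nearest = np.argsort(d2)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```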
4 0.20911546 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
Author: Christian Rieger, Barbara Zwicknagl
Abstract: We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of ε- and ν-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples. Keywords: sampling inequality, radial basis functions, approximation theory, reproducing kernel Hilbert space, Sobolev space
5 0.20615101 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
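One piece of this entry that is easy to make concrete is the Dirichlet process itself. The sketch below draws truncated stick-breaking mixture weights; it covers only the weight construction, not the paper's joint model of response and covariates, and the truncation level K is an illustrative choice of mine.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng=None):
    """Truncated stick-breaking draw of Dirichlet process mixture weights:
    beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining  # sums to just under 1; the tail is truncated
```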
6 0.1697102 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
7 0.15375698 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
8 0.14203897 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
9 0.1390135 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
10 0.11901313 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
12 0.11393002 48 jmlr-2009-Learning Nondeterministic Classifiers
13 0.10663994 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
14 0.10460705 46 jmlr-2009-Learning Halfspaces with Malicious Noise
15 0.10440341 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
16 0.10115675 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
17 0.097150944 50 jmlr-2009-Learning When Concepts Abound
18 0.096643142 23 jmlr-2009-Discriminative Learning Under Covariate Shift
19 0.095913053 38 jmlr-2009-Hash Kernels for Structured Data
20 0.093490735 15 jmlr-2009-Cautious Collective Classification
topicId topicWeight
[(8, 0.017), (11, 0.019), (19, 0.026), (24, 0.012), (26, 0.012), (38, 0.026), (47, 0.01), (52, 0.064), (55, 0.039), (58, 0.036), (66, 0.061), (68, 0.027), (75, 0.014), (79, 0.463), (88, 0.016), (90, 0.034), (96, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.7754218 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
2 0.2395827 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
3 0.23046315 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
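A minimal sketch of the kind of update this abstract describes is given below. The exact truncation operator, its schedule, and the parameter names here are my own paraphrase rather than the paper's definition: the idea is a gradient step followed by shrinking already-small coordinates toward zero, clipping at zero so that they can become exactly sparse.

```python
import numpy as np

def truncated_gradient_step(w, grad, eta, gravity, theta):
    """One sparsifying online update (a sketch): gradient step, then shrink
    coordinates with |w_i| <= theta toward zero by eta * gravity, clipping
    at zero so that small weights are driven to exactly zero over time."""
    w = w - eta * grad
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - eta * gravity, 0.0)
    return w
```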
4 0.22965452 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
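The hashing idea in this entry is easy to sketch. The following is a generic hashing-trick feature map with signed buckets, not the paper's exact construction or its deviation bounds; the choice of hash function and dimension is arbitrary.

```python
import hashlib
from collections import defaultdict

def hashed_features(tokens, dim=2**18):
    """Map a stream of string features into a fixed-size signed-bucket
    vector; inner products of these sparse vectors approximate the kernel."""
    vec = defaultdict(float)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        sign = 1.0 if h & 1 else -1.0
        vec[(h >> 1) % dim] += sign
    return vec
```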
5 0.22672662 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kind of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
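To make the prediction rule concrete, here is a greedy sketch of a nondeterministic prediction built from posterior probabilities. The paper derives the subset minimizing an expected information-retrieval loss; the coverage threshold used here is only a stand-in for that criterion.

```python
import numpy as np

def predict_class_set(posterior, coverage=0.95):
    """Return the smallest set of classes, taken in decreasing posterior
    order, whose total probability reaches the coverage target (a sketch)."""
    order = np.argsort(posterior)[::-1]
    chosen, mass = [], 0.0
    for k in order:
        chosen.append(int(k))
        mass += float(posterior[k])
        if mass >= coverage:
            break
    return chosen
```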
6 0.22578186 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
7 0.22493154 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
8 0.22356483 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
9 0.22113657 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
10 0.22071531 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
11 0.21975651 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
12 0.21957669 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
13 0.21930991 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
14 0.21928866 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
15 0.21849063 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
17 0.21777567 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
18 0.21674861 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
19 0.21490045 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
20 0.2147456 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification