nips nips2013 nips2013-58 knowledge-graph by maker-knowledge-mining

58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent


Source: pdf

Author: Yuening Hu, Jordan Boyd-Graber, Hal Daumé III, Z. Irene Ying

Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman's coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman's coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman's coalescent and is qualitatively better at capturing data in bushy hierarchies.

1 The Need For Bushy Hierarchical Clustering

Hierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies.

We review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman's coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman's coalescent has attractive properties—it is probabilistic and has edge "lengths" that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. However, naïve inference is impractical, because bushy trees are more complex: we need to consider all possible subsets of nodes to construct each internal node in the hierarchy.

Our first contribution is a generalization of the belief propagation framework [9] for the beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman's coalescent (Section 5).

2 Bayesian Clustering Approaches

Recent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build binary trees based on the marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Ryan et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes (footnote 1). However, these models do not distinguish edge lengths, an important property in distinguishing how "tight" the clustering is at particular nodes.

Footnote 1: This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2.
Hierarchical models can be divided into complementary "fragmentation" and "coagulation" frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths.

In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents.

Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time t_i, a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained.

Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λ_n^k, the rate at which k out of n nodes merge into a parent node. From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ^{−2} Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate

\lambda_n^k = \int_0^1 \gamma^{k-2} (1-\gamma)^{n-k} \, \Lambda(d\gamma), \quad 2 \le k \le n.   (1)

Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman's coalescent [6], where λ_n^k is 1 when k = 2 and zero otherwise. Because this gives zero probability to non-binary coalescent events, this only creates binary trees.

Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman's coalescent. If we have n_{i−1} nodes at time t_{i−1} in a beta coalescent, the rate λ_{n_{i−1}}^{k_i} for a children set of k_i nodes at time t_i and the total rate λ_{n_{i−1}} of any children set merging—summing over all possible mergers—is

\lambda_{n_{i-1}}^{k_i} = \frac{\Gamma(k_i - \alpha)\,\Gamma(n_{i-1} - k_i + \alpha)}{\Gamma(2-\alpha)\,\Gamma(\alpha)\,\Gamma(n_{i-1})}
\quad\text{and}\quad
\lambda_{n_{i-1}} = \sum_{k_i=2}^{n_{i-1}} \binom{n_{i-1}}{k_i} \lambda_{n_{i-1}}^{k_i}.   (2)

Each coalescent event also has an edge length—duration—δ_i. The duration of an event comes from an exponential distribution, δ_i ∼ Exp(λ_{n_{i−1}}), and the parent node forms at time t_i = t_{i−1} − δ_i. Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman's coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, ..., m, merging k_i children after duration δ_i:

p(\pi) = \prod_{i=1}^{m} p(k_i \mid n_{i-1}) \, p(\delta_i \mid k_i, n_{i-1})
       = \prod_{i=1}^{m} \underbrace{\lambda_{n_{i-1}}^{k_i}}_{\text{merge } k_i \text{ nodes}} \cdot \underbrace{\exp(-\lambda_{n_{i-1}} \delta_i)}_{\text{after duration } \delta_i}.   (3)
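To make Equations 1-3 concrete, here is a minimal numerical sketch (our own illustration, not the authors' code) of the beta coalescent rates and of drawing one coalescent event from the prior. It assumes NumPy and SciPy; the function names are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_rate(n, k, alpha):
    """log lambda_n^k for the beta coalescent (Equation 2), with Lambda = Beta(2 - alpha, alpha)."""
    return (gammaln(k - alpha) + gammaln(n - k + alpha)
            - gammaln(2 - alpha) - gammaln(alpha) - gammaln(n))

def log_total_rate(n, alpha):
    """log of the total rate lambda_n = sum_k C(n, k) lambda_n^k (Equation 2)."""
    ks = np.arange(2, n + 1)
    log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
    return np.logaddexp.reduce(log_binom + log_rate(n, ks, alpha))

def sample_event(n, alpha, rng):
    """Sample one coalescent event from the prior: a duration and a merger size k (Equation 3)."""
    ks = np.arange(2, n + 1)
    log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
    log_terms = log_binom + log_rate(n, ks, alpha)
    log_lam = np.logaddexp.reduce(log_terms)
    delta = rng.exponential(scale=np.exp(-log_lam))    # delta ~ Exp(lambda_n)
    k = rng.choice(ks, p=np.exp(log_terms - log_lam))  # p(k | n) proportional to C(n, k) lambda_n^k
    return delta, k

rng = np.random.default_rng(0)
# Kingman limit: as alpha -> 2, the pairwise rate tends to 1 and larger mergers vanish.
print(np.exp(log_rate(10, 2, 1.99)), np.exp(log_rate(10, 3, 1.99)))
print(sample_event(10, 1.2, rng))
```

The final two lines check the limit noted above: as α approaches 2, the pairwise rate tends to 1 while rates for mergers of more than two nodes become vanishingly small, recovering Kingman's coalescent.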
Algorithm 1: MCMC inference for generating a tree

 1: for particle s = 1, 2, ..., S do
 2:   Initialize n^s = n, i = 0, t_0^s = 0, w_0^s = 1.
 3:   Initialize the node set V^s = {ρ_0, ρ_1, ..., ρ_n}.
 4: while ∃ s ∈ {1, ..., S} where n^s > 1 do
 5:   Update i = i + 1.
 6:   for particle s = 1, 2, ..., S do
 7:     if n^s == 1 then
 8:       Continue.
 9:     Propose a duration δ_i^s by Equation 10.
10:     Set coalescent time t_i^s = t_{i−1}^s − δ_i^s.
11:     Sample partitions p_i^s from the DPMM.
12:     Propose a set ρ_{c_i}^s according to Equation 11.
13:     Update weight w_i^s by Equation 13.
14:     Update n^s = n^s − |ρ_{c_i}^s| + 1.
15:     Remove ρ_{c_i}^s from V^s, add ρ_i^s to V^s.
16:   Compute effective sample size ESS [16].
17:   if ESS < S/2 then
18:     Resample particles [17].

Figure 1: (a) Kingman's coalescent; (b) the beta coalescent. The beta coalescent can merge four similar nodes at once, while Kingman's coalescent only merges two each time.
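The control flow of Algorithm 1 can be transliterated into a short Python skeleton. The version below is a simplified sketch of ours, not the authors' implementation: the duration proposal, the children-set proposal, and the weight update are replaced by clearly marked stand-ins so that the skeleton runs on its own, whereas the paper uses Equations 10, 11, and 13 for those steps.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)
S, n = 5, 8                                    # number of particles and observed leaves

def propose_duration(n_nodes):
    # Stand-in: unit-rate exponential; the paper proposes from Exp(lambda_{n_{i-1}}) (Equation 10).
    return rng.exponential(1.0)

def propose_children_set(nodes):
    # Stand-in: a uniformly random subset of size 2..4; the paper proposes from the DPMM
    # restriction set, weighted by the local normalizers (Equation 11).
    k = rng.integers(2, min(4, len(nodes)) + 1)
    return set(int(x) for x in rng.choice(sorted(nodes), size=k, replace=False))

def log_weight_increment(children, duration):
    # Stand-in returning 0.0; the paper's update (Equation 13) uses the coalescent rates
    # and the belief-propagation normalizers Z_{rho_i}.
    return 0.0

particles = [{"nodes": set(range(n)), "t": 0.0, "logw": 0.0, "events": []} for _ in range(S)]
next_id, i = n, 0
while any(len(p["nodes"]) > 1 for p in particles):
    i += 1                                      # step counter, as in Algorithm 1
    for p in particles:
        if len(p["nodes"]) == 1:
            continue
        delta = propose_duration(len(p["nodes"]))
        p["t"] -= delta                         # time runs backward from t = 0
        children = propose_children_set(p["nodes"])
        p["logw"] += log_weight_increment(children, delta)
        parent, next_id = next_id, next_id + 1
        p["nodes"] = (p["nodes"] - children) | {parent}
        p["events"].append((parent, sorted(children), p["t"]))
    w = np.exp([p["logw"] for p in particles]); w /= w.sum()
    if 1.0 / np.sum(w ** 2) < S / 2:            # effective sample size check, then resampling
        particles = [copy.deepcopy(particles[j]) for j in rng.choice(S, size=S, p=w)]
        for p in particles:
            p["logw"] = 0.0

print(particles[0]["events"])                   # one sampled sequence of coalescent events
```

Even in this stripped-down form the bookkeeping matches the algorithm: time runs backward, every merge replaces a children set with a single parent, and a low effective sample size triggers resampling.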
3 Beta Coalescent Belief Propagation

The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node's children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees.

An internal node ρ_i is defined as the merger of other nodes. The children set of node ρ_i, denoted ρ_{c_i}, coalesces into a new node ρ_i ≡ ∪_{b∈c_i} ρ_b. This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector y_i.

Two terms specify the relationship between nodes' features: an initial distribution p_0(y_i) and a transition kernel κ_{t_i t_b}(y_i, y_b). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child's feature y_b (at time t_b) to resemble feature y_i (formed at t_i); shorter durations t_b − t_i increase the resemblance. Intuitively, the transition kernel can be thought of as a similarity score; the more similar the features are, the more likely the nodes are to be merged. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is decided by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations.

To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman's coalescent; this is a more scalable alternative to other approaches for computing the probability of a beta coalescent tree [19]. We define a subtree structure θ_i = {θ_{i−1}, δ_i, ρ_{c_i}}; thus the tree θ_m after the final coalescent event m is a complete tree π. The message for node ρ_i marginalizes over the features of the nodes in its children set (footnote 2). The total message for a parent node ρ_i is

M_{\rho_i}(y_i) = Z_{\rho_i}(x \mid \theta_i)^{-1} \prod_{b \in c_i} \int \kappa_{t_i t_b}(y_i, y_b) \, M_{\rho_b}(y_b) \, dy_b,   (4)

where Z_{ρ_i}(x | θ_i) is the local normalizer, which can be computed as the combination of the initial distribution and the messages from a set of children,

Z_{\rho_i}(x \mid \theta_i) = \int p_0(y_i) \prod_{b \in c_i} \int \kappa_{t_i t_b}(y_i, y_b) \, M_{\rho_b}(y_b) \, dy_b \, dy_i.   (5)

Footnote 2: When ρ_b is a leaf, the message M_{ρ_b}(y_b) is a delta function centered on the observation.

Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root,

Z_{-\infty}(x \mid \theta_m) = \int \int p_0(y_{-\infty}) \, \kappa_{-\infty, t_m}(y_{-\infty}, y_m) \, M_{\rho_m}(y_m) \, dy_m \, dy_{-\infty},   (6)

where p_0(y_{−∞}) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree,

p(x \mid \pi) = Z_{-\infty}(x \mid \theta_m) \prod_{i=1}^{m} Z_{\rho_i}(x \mid \theta_i).   (7)

The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7),

p(x, \pi) = Z_{-\infty}(x \mid \theta_m) \prod_{i=1}^{m} \lambda_{n_{i-1}}^{k_i} \exp(-\lambda_{n_{i-1}} \delta_i) \cdot Z_{\rho_i}(x \mid \theta_i).   (8)

3.1 Sequential Monte Carlo Inference

Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster (footnote 3); in subsequent particles (i > 0), points coalesce into more complicated tree structures θ_i^s, where s is the particle index and we add the superscript s to all related notation to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each particle s at time t_i, denoted as w_i^s.

Footnote 3: The relationship between time and particles is non-intuitive. Time t goes backward with subsequent particles. When we use time-specific adjectives for particles, this is with respect to inference.

The weights from SIR approximate the posterior. Computing the weights requires a conditional distribution of data given a latent state p(x | θ_i^s), a transition distribution between latent states p(θ_i^s | θ_{i−1}^s), and a proposal distribution f(θ_i^s | θ_{i−1}^s, x). Together, these distributions define the weights

w_i^s = w_{i-1}^s \, \frac{p(x \mid \theta_i^s) \, p(\theta_i^s \mid \theta_{i-1}^s)}{f(\theta_i^s \mid \theta_{i-1}^s, x)}.   (9)

Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles.

To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle space structure. The sth particle represents a subtree θ_{i−1}^s at time t_{i−1}^s, and a transition to a new subtree θ_i^s takes a set of nodes ρ_{c_i}^s from θ_{i−1}^s and merges them at t_i^s, where t_i^s = t_{i−1}^s − δ_i^s and θ_i^s = {θ_{i−1}^s, δ_i^s, ρ_{c_i}^s}. Our proposal distribution must provide the duration δ_i^s and the children set ρ_{c_i}^s to merge based on the previous subtree θ_{i−1}^s.

We propose the duration δ_i^s from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers (footnote 4). This is the "priorpost" method in Teh et al. [9]. However, this approach is intractable. Given n_{i−1} nodes at time t_i, we must consider all possible children sets, \binom{n_{i-1}}{2} + \binom{n_{i-1}}{3} + \cdots + \binom{n_{i-1}}{n_{i-1}}. The computational complexity grows from O(n_{i−1}^2) (Kingman's coalescent) to O(2^{n_{i−1}}) (beta coalescent).

Footnote 4: This is a special case of Section 4.2's algorithm, where the restriction set Ω_i is all possible subsets.

4 Efficiently Finding Children Sets with DPMM

We need a more efficient way to consider possible children sets. Even for Kingman's coalescent, which only considers pairs of nodes, Görür et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied.
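A quick count (ours, for illustration) of why a restriction set matters:

```python
from math import comb

for n in (10, 20, 30):
    pairs = comb(n, 2)                                   # Kingman: O(n^2) candidate pairs
    all_sets = sum(comb(n, k) for k in range(2, n + 1))  # beta: 2^n - n - 1 candidate children sets
    print(n, pairs, all_sets)
# For n = 20 nodes: 190 pairs versus 1,048,555 candidate children sets.
```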
In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference.

The DPMM is defined by a concentration β and a base distribution G_0. A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G_0). Each observation x_i is assigned to a mixture component µ_i drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µ_i = µ_j). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material.

4.1 Attractive Properties of DPMMs

DPMMs and Coalescents: Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents' allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect the DPMM with coalescents by setting the kernel covariance Σ = µI with µ = β/2, i.e., Σ = (β/2)I.

The base distribution G_0 is also related to the nodes' features. The base distribution G_0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution which fits the distribution of the samples in the coalescent process. For example, if we use a Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution.

Effectiveness as a Proposal: The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with the DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: it accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions; varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood, i.e., it ensures the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)).

4.2 Incorporating DPMM in SMC Proposals

To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ω_i = {ω_{i1}, ω_{i2}, ...}, where each ω_{ij} is a subset of the n_{i−1} nodes. A standard Gibbs sampler provides these partitions (see supplemental).
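The paper defers the exact DPMM sampling equations to its supplemental material. The sketch below is our own minimal collapsed Gibbs sampler under assumptions we chose for concreteness (a spherical Gaussian likelihood with known variance sigma2 and a conjugate zero-mean Gaussian base distribution with variance tau2); it only illustrates how cluster assignments, and therefore candidate children sets, could be sampled.

```python
import numpy as np
from collections import defaultdict

def _log_normal(x, mean, var):
    # Log density of an isotropic Gaussian with per-dimension variance `var`.
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def dpmm_gibbs_partition(X, beta=2.0, sigma2=1.0, tau2=1.0, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for a DPMM with base N(0, tau2 I) and likelihood N(mu, sigma2 I).
    Returns the clusters from the final sweep, i.e. one sampled partition of the rows of X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = np.zeros(n, dtype=int)                     # initialize with a single cluster
    for _ in range(n_sweeps):
        for i in range(n):
            z[i] = -1                              # remove point i from its cluster
            labels = sorted(c for c in set(z) if c >= 0)
            logp = []
            for c in labels:
                members = X[z == c]
                m = len(members)
                post_var = tau2 * sigma2 / (sigma2 + m * tau2)   # conjugate posterior over the mean
                post_mean = m * tau2 * members.mean(axis=0) / (sigma2 + m * tau2)
                logp.append(np.log(m) + _log_normal(X[i], post_mean, post_var + sigma2))
            # New cluster: CRP weight beta, predictive N(0, (tau2 + sigma2) I).
            logp.append(np.log(beta) + _log_normal(X[i], np.zeros(d), tau2 + sigma2))
            logp = np.array(logp)
            choice = rng.choice(len(logp), p=np.exp(logp - np.logaddexp.reduce(logp)))
            z[i] = labels[choice] if choice < len(labels) else max(z) + 1
    clusters = defaultdict(list)
    for i, c in enumerate(z):
        clusters[c].append(i)
    return [tuple(v) for v in clusters.values()]

# Three well-separated 2-D clumps; the sampled clusters are candidate children sets for Omega_i.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc, 0.3, size=(5, 2)) for loc in (-2.0, 0.0, 2.0)])
print(dpmm_gibbs_partition(X))
```

In the paper, clusters from the sampled partitions are pooled into the restriction set Ω_i, and the concentration β is tied to the kernel's mutation rate (µ = β/2, Section 4.1) and updated by slice sampling (Section 5).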
With this restriction set Ω_i, we propose the duration δ_i^s from the exponential distribution and propose a children set ρ_{c_i}^s based on the local normalizers:

f_i(\delta_i^s) = \lambda_{n_{i-1}}^s \exp(-\lambda_{n_{i-1}}^s \delta_i^s),   (10)

f_i(\rho_{c_i}^s \mid \delta_i^s, \theta_{i-1}^s) = \frac{Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_{c_i}^s)}{Z_0} \cdot \mathbb{I}\left[\rho_{c_i}^s \in \Omega_i^s\right],   (11)

where Ω_i^s restricts the candidate children sets, \mathbb{I} is the indicator, and we replace Z_{ρ_i}(x | θ_i^s) with Z_{ρ_i}(x | θ_{i−1}^s, δ_i^s, ρ_{c_i}^s) since they are equivalent here. The normalizer is

Z_0 = \sum_{\rho_c} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c) \cdot \mathbb{I}\left[\rho_c \in \Omega_i^s\right] = \sum_{\rho_c \in \Omega_i^s} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c).   (12)

Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9),

w_i^s = w_{i-1}^s \, \frac{\lambda_{n_{i-1}}^{|\rho_{c_i}^s|} \cdot \sum_{\rho_c \in \Omega_i^s} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c)}{\lambda_{n_{i-1}}^s},   (13)

where |ρ_{c_i}^s| is the size of the children set ρ_{c_i}^s, λ_{n_{i−1}}^{|ρ_{c_i}^s|} is the rate of the children set ρ_{c_i}^s (Equation 2), and λ_{n_{i−1}} is the rate of all possible sets given a total number of nodes n_{i−1} (Equation 2).

We can view this new proposal as a coarse-to-fine process: the DPMM proposes candidate children sets; SMC selects a children set from the DPMM to coalesce. Since the coarse step is faster and filters "bad" children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ω_i has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters); in such cases it is simpler to enumerate all children sets.

4.3 Example Transition Kernel: Brownian Diffusion

This section uses Brownian diffusion as an example for the message passing framework. The initial distribution p_0(y) of each node is N(0, ∞); the transition kernel κ_{t_i t_b}(y, ·) is a Gaussian centered at y with variance (t_b − t_i)Σ, where Σ = µI, µ = β/2, and β is the concentration parameter of the DPMM. Then the local normalizer Z_{ρ_i}(x | θ_i) is

Z_{\rho_i}(x \mid \theta_i) = \int N(y_i; 0, \infty) \prod_{b \in c_i} N\!\left(y_i; \hat{y}_{\rho_b}, \Sigma(v_{\rho_b} + t_b - t_i)\right) dy_i,   (14)

and the node message M_{ρ_i}(y_i) is normally distributed, M_{ρ_i}(y_i) ∼ N(y_i; ŷ_{ρ_i}, Σ v_{ρ_i}), where

v_{\rho_i} = \left( \sum_{b \in c_i} (v_{\rho_b} + t_b - t_i)^{-1} \right)^{-1}, \qquad
\hat{y}_{\rho_i} = \left( \sum_{b \in c_i} \frac{\hat{y}_{\rho_b}}{v_{\rho_b} + t_b - t_i} \right) v_{\rho_i}.
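The message-parameter recursion above has a particularly simple form. Here is a one-dimensional sketch (ours, with function names of our own choosing) in which leaf messages are treated as delta functions (v = 0 at the observations); it mirrors the v_{ρ_i} and ŷ_{ρ_i} updates rather than the full normalizer in Equation 14.

```python
import numpy as np

def merge_messages(children, t_parent):
    """Combine child messages into the parent message parameters (yhat_rho_i, v_rho_i).

    children: list of (yhat_b, v_b, t_b) tuples; leaves have v_b = 0 and t_b = 0.
    Implements v_rho_i = (sum_b (v_b + t_b - t_i)^-1)^-1 and
    yhat_rho_i = v_rho_i * sum_b yhat_b / (v_b + t_b - t_i)  (Section 4.3).
    """
    inv = [1.0 / (v_b + t_b - t_parent) for (_, v_b, t_b) in children]
    v_parent = 1.0 / sum(inv)
    yhat_parent = v_parent * sum(w * y for w, (y, _, _) in zip(inv, children))
    return yhat_parent, v_parent

# Three similar leaves observed at t = 0 merge into a parent at t = -0.5
# (time runs backward, so the parent forms earlier than its children).
leaves = [(1.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.9, 0.0, 0.0)]
yhat, v = merge_messages(leaves, t_parent=-0.5)
print(yhat, v)   # roughly the children's average, with variance shrunk by the merge
```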
5 Experiments: Finding Bushy Trees

In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman's coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show that beta performs best and can capture data in more interpretable, bushier trees.

Setup: The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, the bushier the tree is, and we set α = 1.2 (footnote 5). We set the mutation rate as 1; thus the DPMM parameter is initialized as β = 2 and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference, with 30 more iterations after each coalescent event (forming a new particle).

Footnote 5: With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values.

Metrics: We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree, and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. For any two leaf nodes, we find the least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with the same class labels. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains.
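To make the purity and subtree metrics concrete, here is a small self-contained sketch on a toy bushy tree. The tree encoding, the pair-averaged reading of dendrogram purity, and the leaf-based reading of the subtree score are our own choices, since the paper defers exact definitions to [28, 10] and [9]; the path length score is omitted.

```python
import itertools

def leaves_under(tree, node):
    """All leaf descendants of `node`; `tree` maps each internal node to its children list."""
    if node not in tree:
        return [node]
    return [leaf for child in tree[node] for leaf in leaves_under(tree, child)]

def lca(tree, root, a, b):
    """Least common subsumer of leaves a and b (small trees only; O(n) per query)."""
    for child in tree.get(root, []):
        under = set(leaves_under(tree, child))
        if a in under and b in under:
            return lca(tree, child, a, b)
    return root

def dendrogram_purity(tree, root, labels):
    """Average, over same-class leaf pairs, of the same-class fraction under their least common subsumer."""
    pairs = [(a, b) for a, b in itertools.combinations(labels, 2) if labels[a] == labels[b]]
    total = 0.0
    for a, b in pairs:
        under = leaves_under(tree, lca(tree, root, a, b))
        total += sum(labels[x] == labels[a] for x in under) / len(under)
    return total / len(pairs)

def subtree_score(tree, labels):
    """Fraction of internal nodes whose leaves all share one class (one reading of the subtree score)."""
    pure = sum(len({labels[x] for x in leaves_under(tree, node)}) == 1 for node in tree)
    return pure / len(tree)

# A bushy toy tree: root 'r' with children 'u' (leaves 0, 1, 2 of class A) and 'v' (leaves 3, 4 of class B).
tree = {"r": ["u", "v"], "u": [0, 1, 2], "v": [3, 4]}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
print(dendrogram_purity(tree, "r", labels), subtree_score(tree, labels))
```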
5.1 Synthetic Hierarchies

To test our inference method, we generated synthetic data with edge lengths (full details available in the supplemental material); we also assume each child of the root has a unique label and the descendants have the same label as their parent node (except the root node). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and feature dimensions (Figure 2(b)). In both cases, beta is comparable to kingman and hac (no edge length). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffices to construct a good tree.

[Figure 2: purity and path-length scores for beta, kingman, hac, and enum across panels (a) increasing observations, (b) increasing dimension, and (c) beta vs. enum.] Figure 2: Figures 2(a) and 2(b) show the effect of changing the underlying data size or the number of dimensions. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum).

To evaluate the effectiveness of using our DPMM as a proposal distribution, we compare against exhaustively enumerating all children set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieves similar accuracy (Figure 2(c)) while greatly improving efficiency.

5.2 Human Tissue Development

Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways is a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression.

To investigate this question we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182727 dimensions via principal component analysis [30, PCA]. We use five chains with five particles per chain.

Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds up a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. It also successfully captures ectodermal tissue lineage. However, mesodermic and endodermic tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together.

[Figure 3 shows a sample tree over the 27 tissues (Stomach, Pancreas, Placenta, Bone Marrow, Thyroid, Colon, Kidney, Heart, several brain regions, and others), grouped by germ layer.] Figure 3: One sample hierarchy of human tissue from beta. Color indicates germ layer origin of tissue. Dashed border indicates secretory function. While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The cluster also preferred placing secretory tissues together and higher in the tree.

5.3 Clustering 20-newsgroups Data

Following Heller et al. [10], we also compare the three models on 20-newsgroups (footnote 6), a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey (footnote 7). This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus, and use the topic distribution for each document as the document feature. We use five chains with eighty particles per chain.

Footnote 6: http://qwone.com/~jason/20Newsgroups/
Footnote 7: We use the "rec.autos", "rec.sport.baseball", "rec.sport.hockey", and "sci.space" newsgroups but also—in contrast to Heller et al. [10]—added "soc.religion.christian".

[Figure 4 shows the recovered document hierarchy, with each document colored by its class label (rec.sport.baseball, rec.autos, rec.sport.hockey, sci.space, soc.religion.christian) and the true tree inset at the bottom right.] Figure 4: One sample hierarchy of the 20-newsgroups data from beta. Each small square is a document colored by its class label. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three "rec" groups are merged together first, and then merged with the religion and space groups.

Table 1: Comparing the three models: beta performs best on all three scores.

                        Biological Data                               20-newsgroups Data
             hac      kingman            beta                hac      kingman            beta
purity  ↑    0.453    0.474 ± 0.029      0.492 ± 0.028       0.465    0.510 ± 0.047      0.565 ± 0.081
subtree ↑    0.240    0.302 ± 0.033      0.331 ± 0.050       0.571    0.651 ± 0.013      0.720 ± 0.013
length  ↓    −        0.654 ± 0.041      0.586 ± 0.051       −        0.477 ± 0.027      0.333 ± 0.047

As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovered the true hierarchy. Documents within a newsgroup merge first, then the three "rec" groups, followed by the "space" and "religion" groups. We only use the topic distribution as features, so better results could be possible with more comprehensive features.
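A sketch of the document-feature pipeline described in Section 5.3, using scikit-learn as a stand-in for whatever toolkit the authors used; the newsgroup list matches footnote 7, while the vectorizer settings are arbitrary choices of ours.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

categories = ["rec.autos", "rec.sport.baseball", "rec.sport.hockey",
              "sci.space", "soc.religion.christian"]
docs = fetch_20newsgroups(subset="train", categories=categories,
                          remove=("headers", "footers", "quotes"))

counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs.data)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
theta = lda.fit_transform(counts)   # per-document topic distributions, one 50-dim feature vector each

print(theta.shape)                  # (num_documents, 50); these vectors feed the coalescent models
```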
6 Conclusion

This paper generalizes Bayesian hierarchical clustering, moving from Kingman's coalescent to the beta coalescent. Our novel inference scheme based on SMC and DPMM makes this generalization practical and efficient. This new model provides a bushier tree, often a more realistic view of data.

While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions.

Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic).

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV, 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matematicos, vol. 16, 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI, 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI, 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling. Technical report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrucken. Importance sampling for lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. De Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman's coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappe, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. McMaster University, [Online], 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS, 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL, 2008.
[34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP, 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, 2009.

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('coalescent', 0.701), ('kingman', 0.28), ('dpmm', 0.241), ('beta', 0.24), ('children', 0.194), ('bushy', 0.14), ('ni', 0.125), ('nodes', 0.092), ('coalescents', 0.089), ('hac', 0.089), ('ci', 0.087), ('tree', 0.082), ('subtree', 0.079), ('smc', 0.079), ('tissues', 0.079), ('tb', 0.077), ('enum', 0.076), ('ti', 0.076), ('yb', 0.074), ('particle', 0.068), ('partitions', 0.065), ('purity', 0.065), ('bushier', 0.064), ('kii', 0.064), ('proposal', 0.061), ('diffusion', 0.061), ('clustering', 0.058), ('dirichlet', 0.058), ('ki', 0.057), ('trees', 0.055), ('particles', 0.053), ('node', 0.052), ('tissue', 0.052), ('ectoderm', 0.051), ('endoderm', 0.051), ('sir', 0.051), ('subsumer', 0.051), ('hierarchical', 0.048), ('duration', 0.048), ('merge', 0.047), ('dpmms', 0.045), ('scores', 0.044), ('parent', 0.044), ('mutation', 0.041), ('transition', 0.039), ('coagulation', 0.038), ('coalesce', 0.038), ('germ', 0.038), ('mesoderm', 0.038), ('placenta', 0.038), ('heller', 0.037), ('developmental', 0.037), ('brownian', 0.035), ('message', 0.035), ('yi', 0.034), ('fragmentation', 0.034), ('rec', 0.034), ('religion', 0.034), ('singleton', 0.034), ('hierarchy', 0.033), ('normalizer', 0.031), ('hierarchies', 0.031), ('observations', 0.03), ('internal', 0.03), ('monte', 0.028), ('kernel', 0.027), ('ts', 0.027), ('inference', 0.027), ('base', 0.027), ('exhaustively', 0.027), ('edge', 0.026), ('human', 0.026), ('length', 0.026), ('sequential', 0.026), ('bayesian', 0.026), ('berestycki', 0.025), ('dyb', 0.025), ('dyi', 0.025), ('embryonic', 0.025), ('secretory', 0.025), ('merges', 0.025), ('carlo', 0.024), ('daum', 0.023), ('equation', 0.023), ('wi', 0.023), ('restriction', 0.023), ('mergers', 0.022), ('crest', 0.022), ('teh', 0.022), ('considers', 0.022), ('merged', 0.022), ('event', 0.022), ('lengths', 0.022), ('propagation', 0.021), ('lters', 0.021), ('topic', 0.021), ('ns', 0.021), ('multilingual', 0.021), ('blundell', 0.021), ('root', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying

Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies. 1 The Need For Bushy Hierarchical Clustering Hierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies. We review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman’s coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman’s coalescent has attractive properties—it is probabilistic and has edge “lengths” that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. However, na¨ve inference is impractical, because bushy trees are more complex: we need ı to consider all possible subsets of nodes to construct each internal nodes in the hierarchy. Our first contribution is a generalization of the belief propagation framework [9] for beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman’s coalescent (Section 5). 2 Bayesian Clustering Approaches Recent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build 1 binary trees based on the marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Ryan et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes.1 However, these models do not distinguish edge lengths, an important property in distinguishing how “tight” the clustering is at particular nodes. 
Hierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents. Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time ti , a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained. Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λk , the rate at which k out of n nodes merge into a parent node. n From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ −2 Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate 1 λk = n γ k−2 (1 − γ)n−k Λ(dγ) (2 ≤ k ≤ n). (1) 0 Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λk is 1 when k = 2 and zero otherwise. Because this n gives zero probability to non-binary coalescent events, this only creates binary trees. Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have ni−1 nodes at time ti−1 in a beta coalescent, the rate λkii−1 for a children set of ki nodes at time n ti and the total rate λni−1 of any children set merging—summing over all possible mergers—is λkii−1 = n Γ(ki − α)Γ(ni−1 − ki + α) and λni−1 = Γ(2 − α)Γ(α)Γ(ni−1 ) ni−1 ni−1 ki λkii−1 . n (2) ki =2 Each coalescent event also has an edge length—duration—δi . The duration of an event comes from an exponential distribution, δi ∼ exp(λni−1 ), and the parent node forms at time ti = ti−1 − δi . Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . 
Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of its constituent coalescent events i = 1, . . . , m, each merging k_i children after duration δ_i,

    p(\pi) = \prod_{i=1}^{m} p(k_i \mid n_{i-1}) \, p(\delta_i \mid k_i, n_{i-1})
           = \prod_{i=1}^{m} \lambda_{n_{i-1}}^{k_i} \exp(-\lambda_{n_{i-1}} \delta_i),    (3)

where in each term the first factor merges k_i nodes and the second is the probability of waiting the duration δ_i.

[Footnote 1] This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2.

Algorithm 1: MCMC inference for generating a tree
 1: for particle s = 1, 2, · · · , S do
 2:   Initialize n^s = n, i = 0, t_0^s = 0, w_0^s = 1.
 3:   Initialize the node set V^s = {ρ_0, ρ_1, · · · , ρ_n}.
 4: while ∃ s ∈ {1 · · · S} where n^s > 1 do
 5:   Update i = i + 1.
 6:   for particle s = 1, 2, · · · , S do
 7:     if n^s == 1 then
 8:       Continue.
 9:     Propose a duration δ_i^s by Equation 10.
10:     Set coalescent time t_i^s = t_{i−1}^s − δ_i^s.
11:     Sample partitions p_i^s from the DPMM.
12:     Propose a set ρ_{c_i}^s according to Equation 11.
13:     Update weight w_i^s by Equation 13.
14:     Update n^s = n^s − |ρ_{c_i}^s| + 1.
15:     Remove ρ_{c_i}^s from V^s, add ρ_i^s to V^s.
16:   Compute effective sample size ESS [16].
17:   if ESS < S/2 then
18:     Resample particles [17].

[Figure 1: (a) Kingman’s coalescent; (b) the beta coalescent. The beta coalescent can merge four similar nodes at once, while Kingman’s coalescent only merges two each time.]

3 Beta Coalescent Belief Propagation

The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees.

An internal node ρ_i is defined as the merger of other nodes. The children set of node ρ_i, denoted ρ_{c_i}, coalesces into a new node ρ_i ≡ ∪_{b∈c_i} ρ_b. This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector y_i.

Two terms specify the relationship between nodes’ features: an initial distribution p_0(y_i) and a transition kernel κ_{t_i t_b}(y_i, y_b). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature y_b (at time t_b) to resemble feature y_i (formed at t_i); shorter durations t_b − t_i increase the resemblance. Intuitively, the transition kernel can be thought of as a similarity score: the more similar the features, the more likely the nodes are to merge. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is determined by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations.
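To make the Brownian kernel concrete, the following small sketch evaluates the Gaussian transition density described above, assuming a spherical covariance Σ = µI. It is our own illustration; the function name and the scipy dependency are assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def brownian_kernel_logpdf(y_parent, y_child, t_parent, t_child, mu=1.0):
    """log kappa_{t_i t_b}(y_i, y_b) for Brownian diffusion: the child's feature
    is Gaussian around the parent's, with covariance (t_child - t_parent) * mu * I.
    Time runs backward from the observations, so t_parent < t_child <= 0."""
    duration = t_child - t_parent
    assert duration > 0, "the parent must form earlier (more negative time) than its children"
    cov = duration * mu * np.eye(len(y_parent))
    return multivariate_normal.logpdf(y_child, mean=y_parent, cov=cov)

# A shorter duration concentrates the kernel, so the child resembles the parent more.
parent_feature = np.zeros(3)
child_feature = parent_feature + 0.1
print(brownian_kernel_logpdf(parent_feature, child_feature, t_parent=-0.5, t_child=0.0))
print(brownian_kernel_logpdf(parent_feature, child_feature, t_parent=-0.05, t_child=0.0))
```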
To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a beta coalescent tree [19]. We define a subtree structure θ_i = {θ_{i−1}, δ_i, ρ_{c_i}}; thus the tree θ_m after the final coalescent event m is a complete tree π. The message for node ρ_i marginalizes over the features of the nodes in its children set [Footnote 2]. The total message for a parent node ρ_i is

    M_{\rho_i}(y_i) = Z_{\rho_i}^{-1}(x \mid \theta_i) \prod_{b \in c_i} \int \kappa_{t_i t_b}(y_i, y_b) \, M_{\rho_b}(y_b) \, dy_b,    (4)

where Z_{\rho_i}(x \mid \theta_i) is the local normalizer, which can be computed as the combination of the initial distribution and the messages from the set of children,

    Z_{\rho_i}(x \mid \theta_i) = \int p_0(y_i) \prod_{b \in c_i} \int \kappa_{t_i t_b}(y_i, y_b) \, M_{\rho_b}(y_b) \, dy_b \, dy_i.    (5)

[Footnote 2] When ρ_b is a leaf, the message M_{ρ_b}(y_b) is a delta function centered on the observation.

Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root,

    Z_{-\infty}(x \mid \theta_m) = \int \int p_0(y_{-\infty}) \, \kappa_{-\infty, t_m}(y_{-\infty}, y_m) \, M_{\rho_m}(y_m) \, dy_m \, dy_{-\infty},    (6)

where p_0(y_{−∞}) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree,

    p(x \mid \pi) = Z_{-\infty}(x \mid \theta_m) \prod_{i=1}^{m} Z_{\rho_i}(x \mid \theta_i).    (7)

The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7),

    p(x, \pi) = Z_{-\infty}(x \mid \theta_m) \prod_{i=1}^{m} \lambda_{n_{i-1}}^{k_i} \exp(-\lambda_{n_{i-1}} \delta_i) \cdot Z_{\rho_i}(x \mid \theta_i).    (8)

3.1 Sequential Monte Carlo Inference

Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster [Footnote 3]; in subsequent iterations (i > 0), points coalesce into more complicated tree structures θ_i^s, where s is the particle index and we add the superscript s to all related notation to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each particle s at time t_i, with weight denoted w_i^s.

[Footnote 3] The relationship between time and particles is non-intuitive. Time t goes backward with subsequent iterations. When we use time-specific adjectives for particles, this is with respect to inference.

The weights from SIR approximate the posterior. Computing the weights requires a conditional distribution of data given a latent state p(x | θ_i^s), a transition distribution between latent states p(θ_i^s | θ_{i−1}^s), and a proposal distribution f(θ_i^s | θ_{i−1}^s, x). Together, these distributions define the weights

    w_i^s = w_{i-1}^s \, \frac{p(x \mid \theta_i^s) \, p(\theta_i^s \mid \theta_{i-1}^s)}{f(\theta_i^s \mid \theta_{i-1}^s, x)}.    (9)

Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles.

To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle space structure. The sth particle represents a subtree θ_{i−1}^s at time t_{i−1}^s, and a transition to a new subtree θ_i^s takes a set of nodes ρ_{c_i}^s from θ_{i−1}^s and merges them at t_i^s, where t_i^s = t_{i−1}^s − δ_i^s and θ_i^s = {θ_{i−1}^s, δ_i^s, ρ_{c_i}^s}. Our proposal distribution must provide the duration δ_i^s and the children set ρ_{c_i}^s to merge, based on the previous subtree θ_{i−1}^s.

We propose the duration δ_i^s from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers [Footnote 4]. This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given n_{i−1} nodes at time t_i, we must consider all possible children sets: \binom{n_{i-1}}{2} + \binom{n_{i-1}}{3} + \cdots + \binom{n_{i-1}}{n_{i-1}}. The computational complexity grows from O(n_{i−1}^2) (Kingman’s coalescent) to O(2^{n_{i−1}}) (beta coalescent).

[Footnote 4] This is a special case of Section 4.2’s algorithm, where the restriction set Ω_i is all possible subsets.
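The resampling step in Algorithm 1 (lines 16 to 18) is standard SMC bookkeeping. The following generic sketch of weight normalization, effective sample size, and multinomial resampling is our own illustration of that standard machinery, not the authors' implementation; the helper names are assumptions.

```python
import numpy as np

def normalized_weights(log_w):
    """Normalize unnormalized log importance weights (Equation 9)."""
    w = np.exp(log_w - np.max(log_w))
    return w / w.sum()

def effective_sample_size(log_w):
    """ESS of the current particle weights."""
    w = normalized_weights(log_w)
    return 1.0 / np.sum(w ** 2)

def multinomial_resample(particles, log_w, rng):
    """Draw S particles with replacement proportional to their weights and
    reset the weights to uniform (Algorithm 1, lines 17-18)."""
    w = normalized_weights(log_w)
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx], np.zeros(len(particles))

# Resample only when the ESS drops below S/2, as in Algorithm 1 (line 17).
rng = np.random.default_rng(0)
S = 4
log_w = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
particles = ["tree_a", "tree_b", "tree_c", "tree_d"]
if effective_sample_size(log_w) < S / 2:
    particles, log_w = multinomial_resample(particles, log_w, rng)
```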
4 Efficiently Finding Children Sets with DPMM

We need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Görür et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied.

In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs into SMC inference.

The DPMM is defined by a concentration β and a base distribution G_0. A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G_0). Each observation x_i is assigned to a mixture component µ_i drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µ_i = µ_j). When this happens, the points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material.

4.1 Attractive Properties of DPMMs

DPMMs and Coalescents: Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the sets of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect the DPMM with coalescents by setting the kernel covariance Σ = µI with µ = β/2, i.e., Σ = (β/2)I. The base distribution G_0 is also related to nodes’ features: the base distribution G_0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in that block. As a result, we can select a base distribution that fits the distribution of the samples in the coalescent process. For example, if we use a Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution.

Effectiveness as a Proposal: The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with the DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: it accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions; varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood, i.e., it ensures that the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)).
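Section 4.2 below takes a union of the clusters appearing in sampled DPMM partitions to form the restriction set. As a rough illustration of that step, the sketch below collects candidate children sets. Note that the paper's own approach is a Gibbs sampler over DPMM partitions (detailed in its supplement); this stand-in uses scikit-learn's variational, truncated Dirichlet process mixture, so it approximates the idea rather than reproducing the authors' method. The function name and defaults are our own assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def candidate_children_sets(features, n_samples=5, beta=2.0, seed=0):
    """Collect candidate children sets: cluster the current nodes with a
    DP-style mixture several times and keep the union of all clusters of
    size >= 2 as frozensets of node indices (singletons cannot coalesce)."""
    omega = set()
    for r in range(n_samples):
        dpmm = BayesianGaussianMixture(
            n_components=min(len(features), 10),          # truncation level (assumed)
            weight_concentration_prior_type="dirichlet_process",
            weight_concentration_prior=beta,
            random_state=seed + r,
        )
        labels = dpmm.fit_predict(features)
        for c in np.unique(labels):
            members = frozenset(np.flatnonzero(labels == c).tolist())
            if len(members) >= 2:
                omega.add(members)
    return omega

# Two well-separated groups of nodes should dominate the candidate sets.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
print(candidate_children_sets(features))
```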
4.2 Incorporating DPMM in SMC Proposals

To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ω_i = {ω_{i1}, ω_{i2}, · · · }, where each ω_{ij} is a subset of the n_{i−1} nodes. A standard Gibbs sampler provides these partitions (see the supplemental material).

With this restriction set Ω_i, we propose the duration δ_i^s from the exponential distribution and propose a children set ρ_{c_i}^s based on the local normalizers:

    f_i(\delta_i^s) = \lambda^s_{n_{i-1}} \exp(-\lambda^s_{n_{i-1}} \delta_i^s),    (10)

    f_i(\rho_{c_i}^s \mid \delta_i^s, \theta_{i-1}^s) = \frac{Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_{c_i}^s)}{Z_0} \cdot \mathbb{I}\!\left[\rho_{c_i}^s \in \Omega_i^s\right],    (11)

where Ω_i^s restricts the candidate children sets, \mathbb{I} is the indicator function, and we replace Z_{\rho_i}(x \mid \theta_i^s) with Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_{c_i}^s) since they are equivalent here. The normalizer is

    Z_0 = \sum_{\rho_c} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c) \cdot \mathbb{I}\!\left[\rho_c \in \Omega_i^s\right]
        = \sum_{\rho_c \in \Omega_i^s} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c).    (12)

Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9),

    w_i^s = w_{i-1}^s \, \frac{\lambda_{n_{i-1}}^{|\rho_{c_i}^s|} \cdot \sum_{\rho_c \in \Omega_i^s} Z_{\rho_i}(x \mid \theta_{i-1}^s, \delta_i^s, \rho_c)}{\lambda^s_{n_{i-1}}},    (13)

where |ρ_{c_i}^s| is the size of the children set ρ_{c_i}^s, λ_{n_{i−1}}^{|ρ_{c_i}^s|} is the rate of the children set ρ_{c_i}^s (Equation 2), and λ_{n_{i−1}} is the total rate of all possible sets given n_{i−1} nodes (Equation 2).

We can view this new proposal as a coarse-to-fine process: the DPMM proposes candidate children sets, and SMC selects one of them to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower fine step considers fewer children sets, saving computation time (Algorithm 1). If Ω_i contains all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller node sets, the DPMM is sometimes impractical (it may only provide singleton clusters); in such cases it is simpler to enumerate all children sets.

4.3 Example Transition Kernel: Brownian Diffusion

This section uses Brownian diffusion as an example for the message passing framework. The initial distribution p_0(y) of each node is N(0, ∞); the transition kernel κ_{t_i t_b}(y, ·) is a Gaussian centered at y with variance (t_b − t_i)Σ, where Σ = µI, µ = β/2, and β is the concentration parameter of the DPMM. Then the local normalizer Z_{ρ_i}(x|θ_i) is

    Z_{\rho_i}(x \mid \theta_i) = \int N(y_i; 0, \infty) \prod_{b \in c_i} N\!\left(y_i; \hat{y}_{\rho_b}, \Sigma (v_{\rho_b} + t_b - t_i)\right) dy_i,    (14)

and the node message M_{ρ_i}(y_i) is normally distributed, M_{\rho_i}(y_i) \sim N(y_i; \hat{y}_{\rho_i}, \Sigma v_{\rho_i}), where

    v_{\rho_i} = \left[\sum_{b \in c_i} (v_{\rho_b} + t_b - t_i)^{-1}\right]^{-1},
    \qquad
    \hat{y}_{\rho_i} = v_{\rho_i} \sum_{b \in c_i} \frac{\hat{y}_{\rho_b}}{v_{\rho_b} + t_b - t_i}.
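As an illustration of the message update just above, the sketch below combines children messages into the parent's parameters v_ρi and ŷ_ρi (the message covariance is Σ times the returned scale). It is our own sketch; the data layout and function name are assumptions.

```python
import numpy as np

def combine_children_messages(children, t_parent):
    """Gaussian upward message for a parent formed at time t_parent, following
    the v_rho_i and y_hat_rho_i updates above.
    Each child is a tuple (y_hat, v, t): message mean, variance scale, and the
    time the child formed (observed leaves have v = 0 and t = 0)."""
    precisions = [1.0 / (v + t - t_parent) for (_, v, t) in children]
    v_parent = 1.0 / sum(precisions)
    y_parent = v_parent * sum(p * y for p, (y, _, _) in zip(precisions, children))
    return y_parent, v_parent

# Merging three similar leaves (observed at t = 0) into a parent at t = -0.3:
leaves = [(np.array([1.0, 2.0]), 0.0, 0.0),
          (np.array([1.1, 1.9]), 0.0, 0.0),
          (np.array([0.9, 2.1]), 0.0, 0.0)]
y_hat, v = combine_children_messages(leaves, t_parent=-0.3)
print(y_hat, v)   # roughly [1.0, 2.0] and 0.1
```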
5 Experiments: Finding Bushy Trees

In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show that beta performs best and can capture data in more interpretable, bushier trees.

Setup: The parameter α for the beta coalescent lies between 1 and 2; the closer α is to 1, the bushier the tree, and we set α = 1.2 [Footnote 5]. We set the mutation rate to 1, thus the DPMM parameter is initialized as β = 2 and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference, with 30 more iterations after each coalescent event (forming a new particle).

[Footnote 5] With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values.

Metrics: We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree, and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class: for any two leaf nodes, we find their least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with the same class label. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains.
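For concreteness, here is a small sketch of one common reading of the dendrogram purity score just described (averaging over same-class leaf pairs, as in Heller et al. [10]), assuming the tree is given as a parent map. It is an illustration, not the authors' evaluation code.

```python
import itertools
import numpy as np

def dendrogram_purity(parent, labels):
    """For each pair of leaves with the same label, find their least common
    subsumer (LCS) and measure the fraction of leaves under the LCS sharing
    that label; average over all such pairs.
    `parent` maps node id -> parent id (root -> None);
    `labels` maps leaf id -> class label."""
    def ancestors(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    anc = {leaf: ancestors(leaf) for leaf in labels}
    scores = []
    for a, b in itertools.combinations(labels, 2):
        if labels[a] != labels[b]:
            continue
        anc_b = set(anc[b])
        lcs = next(u for u in anc[a] if u in anc_b)
        under = [leaf for leaf in labels if lcs in anc[leaf]]
        scores.append(np.mean([labels[u] == labels[a] for u in under]))
    return float(np.mean(scores))

# Tiny example: leaves 0-3 under internal nodes 4 and 5, with root 6.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
labels = {0: "rec", 1: "rec", 2: "sci", 3: "rec"}
print(dendrogram_purity(parent, labels))   # (1.0 + 0.75 + 0.75) / 3
```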
5.1 Synthetic Hierarchies

To test our inference method, we generated synthetic data with edge lengths (full details are available in the supplemental material); we also assume each child of the root has a unique label and that descendants have the same label as their parent node (except the root node). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and the feature dimension (Figure 2(b)). In both cases, beta is comparable to kingman and hac (which has no edge lengths). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffices to construct a good tree.

[Figure 2: purity and path length scores as a function of (a) the number of observations and (b) the feature dimension, plus (c) beta vs. enum. Figures 2(a) and 2(b) show the effect of changing the underlying data size or number of dimensions. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum).]

To evaluate the effectiveness of using our DPMM as a proposal distribution, we compare against exhaustively enumerating all children set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieves similar accuracy (Figure 2(c)) while greatly improving efficiency.

5.2 Human Tissue Development

Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways has been a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression.

To investigate this question we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182,727 dimensions via principal component analysis [30, PCA]. We use five chains with five particles per chain.

Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. The recovered tree also successfully captures the ectodermal tissue lineage. However, mesodermal and endodermal tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together.

[Figure 3: One sample hierarchy of human tissue from beta. Color indicates the germ layer origin of each tissue; a dashed border indicates secretory function. While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The clustering also preferred placing secretory tissues together and higher in the tree.]

5.3 Clustering 20-newsgroups Data

Following Heller et al. [10], we also compare the three models on 20-newsgroups [Footnote 6], a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey [Footnote 7]. This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus and use the topic distribution of each document as the document feature. We use five chains with eighty particles per chain.

[Footnote 6] http://qwone.com/~jason/20Newsgroups/
[Footnote 7] We use the “rec.autos”, “rec.sport.baseball”, “rec.sport.hockey”, and “sci.space” newsgroups but also—in contrast to Heller et al. [10]—added “soc.religion.christian”.

[Figure 4: One sample hierarchy of the 20-newsgroups data from beta. Each small square is a document colored by its class label (rec.sport.baseball, rec.autos, rec.sport.hockey, sci.space, soc.religion.christian); the true tree is inset at the bottom right. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three “rec” groups are merged together first, and then merged with the religion and space groups.]

Table 1: Comparing the three models: beta performs best on all three scores.

                      Biological Data                          20-newsgroups Data
             hac      kingman          beta              hac      kingman          beta
purity ↑     0.453    0.474 ± 0.029    0.492 ± 0.028     0.465    0.510 ± 0.047    0.565 ± 0.081
subtree ↑    0.240    0.302 ± 0.033    0.331 ± 0.050     0.571    0.651 ± 0.013    0.720 ± 0.013
length ↓     −        0.654 ± 0.041    0.586 ± 0.051     −        0.477 ± 0.027    0.333 ± 0.047

As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovered the true hierarchy.
Documents within a newsgroup merge first, then the three “rec” groups, followed by the “space” and “religion” groups. We only use the topic distribution as features, so better results may be possible with more comprehensive features.

6 Conclusion

This paper generalizes Bayesian hierarchical clustering, moving from Kingman’s coalescent to the beta coalescent. Our novel inference scheme based on SMC and the DPMM makes this generalization practical and efficient. The new model provides a bushier tree, often a more realistic view of data.

While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as text in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions.

Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic).

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to the human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV, 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matemáticos, vol. 16, 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI, 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI, 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling. Technical report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrücken. Importance sampling for lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. de Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman’s coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappé, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. McMaster University, [Online], 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Research, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS, 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL, 2008.
[34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP, 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, 2009.


same-paper 1 0.93623936 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying

Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies. 1 The Need For Bushy Hierarchical Clustering Hierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies. We review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman’s coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman’s coalescent has attractive properties—it is probabilistic and has edge “lengths” that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. However, na¨ve inference is impractical, because bushy trees are more complex: we need ı to consider all possible subsets of nodes to construct each internal nodes in the hierarchy. Our first contribution is a generalization of the belief propagation framework [9] for beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman’s coalescent (Section 5). 2 Bayesian Clustering Approaches Recent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build 1 binary trees based on the marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Ryan et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes.1 However, these models do not distinguish edge lengths, an important property in distinguishing how “tight” the clustering is at particular nodes. 
Hierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents. Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time ti , a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained. Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λk , the rate at which k out of n nodes merge into a parent node. n From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ −2 Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate 1 λk = n γ k−2 (1 − γ)n−k Λ(dγ) (2 ≤ k ≤ n). (1) 0 Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λk is 1 when k = 2 and zero otherwise. Because this n gives zero probability to non-binary coalescent events, this only creates binary trees. Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have ni−1 nodes at time ti−1 in a beta coalescent, the rate λkii−1 for a children set of ki nodes at time n ti and the total rate λni−1 of any children set merging—summing over all possible mergers—is λkii−1 = n Γ(ki − α)Γ(ni−1 − ki + α) and λni−1 = Γ(2 − α)Γ(α)Γ(ni−1 ) ni−1 ni−1 ki λkii−1 . n (2) ki =2 Each coalescent event also has an edge length—duration—δi . The duration of an event comes from an exponential distribution, δi ∼ exp(λni−1 ), and the parent node forms at time ti = ti−1 − δi . Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . 
m, merging ki children after duration δi , m m λkii−1 · exp(−λni−1 δi ). n p(ki |ni−1 ) · p(δi |ki , ni−1 ) = p(π) = i=1 Merge ki nodes After duration δi (3) i=1 1 This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2. 2 Algorithm 1 MCMC inference for generating a tree 1: for Particle s = 1, 2, · · · , S do s 2: Initialize ns = n, i = 0, ts = 0, w0 = 1. 0 s 3: Initialize the node set V = {ρ0 , ρ1 , · · · , ρn }. 4: while ∃s ∈ {1 · · · S} where ns > 1 do 5: Update i = i + 1. 6: for Particle s = 1, 2, · · · , S do (a) Kingman’s coalescent 7: if ns == 1 then 8: Continue. s 9: Propose a duration δi by Equation 10. s 10: Set coalescent time ts = ts − δi . i i−1 11: Sample partitions ps from DPMM. i s 12: Propose a set ρci according to Equation 11. s 13: Update weight wi by Equation 13. s s s 14: Update n = n − |ρci | + 1. s (b) the beta coalescent 15: Remove ρci from V s , add ρs to V s . i 16: Compute effective sample size ESS [16]. Figure 1: The beta coalescent can merge four simi17: if ESS < S/2 then lar nodes at once, while Kingman’s coalescent only 18: Resample particles [17]. merges two each time. 3 Beta Coalescent Belief Propagation The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees. An internal node ρi is defined as the merger of other nodes. The children set of node ρi , ρci , coalesces into a new node ρi ≡ ∪b∈ci ρb . This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector yi . Two terms specify the relationship between nodes’ features: an initial distribution p0 (yi ) and a transition kernel κti tb (yi , yb ). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature yb (at time tb ) to resemble feature yi (formed at ti ); shorter durations tb − ti increase the resemblance. Intuitively, the transition kernel can be thought as a similarity score; the more similar the features are, the more likely nodes are. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is decided by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations. To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a Beta coalescent tree [19]. We define a subtree structure θi = {θi−1 , δi , ρci }, thus the tree θm after the final coalescent event m is a complete tree π. The message for node ρi marginalizes over the features of the nodes in its children set.2 The total message for a parent node ρi is −1 Mρi (yi ) = Zρi (x|θi ) κti tb (yi , yb )Mρb (yb )dyb . 
(4) b∈ci where Zρi (x|θi ) is the local normalizer, which can be computed as the combination of initial distribution and messages from a set of children, Zρi (x|θi ) = p0 (yi ) κti tb (yi , yb )Mρb (yb )dyb dyi . b∈ci 2 When ρb is a leaf, the message Mρb (yb ) is a delta function centered on the observation. 3 (5) Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root, Z−∞ (x|θm ) = p0 (y−∞ )κ−∞,tm (y−∞ , ym )Mρm (ym )dym dy−∞ (6) where p0 (y−∞ ) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree, m p(x|π) = Z−∞ (x|θm ) Zρi (x|θi ), (7) i=1 The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7), m λkii−1 exp(−λni−1 δi ) · Zρi (x|θi ). n p(x, π) = Z−∞ (x|θm ) (8) i=1 3.1 Sequential Monte Carlo Inference Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster;3 in subsequent particles (i > 0), points coalesce into more complicated tree s structures θi , where s is the particle index and we add superscript s to all the related notations to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each s particle s at time ti , denoted as wi . The weights from SIR approximate the posterior. Computing the weights requires a conditional distris s s bution of data given a latent state p(x|θi ), a transition distribution between latent states p(θi |θi−1 ), s s and a proposal distribution f (θi |θi−1 , x). Together, these distributions define weights s s wi = wi−1 s s s p(x | θi )p(θi | θi−1 ) . s s f (θi | θi−1 , x) (9) Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles. To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle s space structure. The sth particle represents a subtree θi−1 at time ts , and a transition to a new i−1 s s s s subtree θi takes a set of nodes ρci from θi−1 , and merges them at ts , where ts = ts − δi and i i i−1 s s s s s θi = {θi−1 , δi , ρci }. Our proposal distribution must provide the duration δi and the children set ρsi c s to merge based on the previous subtree θi−1 . s We propose the duration δi from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers. 4 This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given ni−1 nodes at time ti , we must consider all possible n children sets ni−1 + ni−1 + · · · + ni−1 . The computational complexity grows from O(n2 ) i−1 i−1 2 3 ni−1 (Kingman’s coalescent) to O(2 ) (beta coalescent). 4 Efficiently Finding Children Sets with DPMM We need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Gorur et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). 
While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied. 3 The relationship between time and particles is non-intuitive. Time t goes backward with subsequent particles. When we use time-specific adjectives for particles, this is with respect to inference. 4 This is a special case of Section 4.2’s algorithm, where the restriction set Ωi is all possible subsets. 4 In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference. The DPMM is defined by a concentration β and a base distribution G0 . A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G0 ). Each observation xi is assigned to a mixture component µi drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µi = µj ). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material. 4.1 Attractive Properties of DPMMs DPMM s and Coalescents Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect DPMM with coalescents by setting the kernel covariance Σ = µI to Σ = β/2I. The base distribution G0 is also related with nodes’ feature. The base distribution G0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution which fits the distribution of the samples in coalescent process. For example, if we use Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution. Effectiveness as a Proposal The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions. Varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood; i.e., ensuring the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)). 4.2 Incorporating DPMM in SMC Proposals To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. 
Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ωi = {ωi1 , ωi2 , · · · }, where each ωij is a subset of the ni−1 nodes. A standard Gibbs sampler provides these partitions (see supplemental). s With this restriction set Ωi , we propose the duration time δi from the exponential distribution and s propose a children set ρci based on the local normalizers s s Zρi (x|θi−1 , δi , ρsi ) c s s s s s s · I ρci ∈ Ωs , (11) fi (ρci |δi , θi−1 ) = fi (δi ) = λs i−1 exp(−λs i−1 δi ) (10) i n n Z0 s where Ωs restricts the candidate children sets, I is the indicator, and we replace Zρi (x|θi ) with i s s s Zρi (x|θi−1 , δi , ρci ) since they are equivalent here. The normalizer is Z0 = ρc s s Zρi (x|θi−1 , δi , ρc ) · I [ρc ∈ Ωs ] = i ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ). (12) Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9), |ρs | c s s wi = wi−1 i λni−1 · ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ) λs i−1 n 5 , (13) |ρs | c s i where |ρsi | is the size of children set ρci ; parameter λni−1 is the rate of the children set ρsi (Equac c s tion 2); and λni−1 is the rate of all possible sets given a total number of nodes ni−1 (Equation 2). We can view this new proposal as a coarse-to-fine process: DPMM proposes candidate children sets; SMC selects a children set from DPMM to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ωi has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters). In such cases it is simpler to enumerate all children sets. 4.3 Example Transition Kernel: Brownian Diffusion This section uses Brownian diffusion as an example for message passing framework. The initial distribution p0 (y) of each node is N (0, ∞); the transition kernel κti tb (y, ·) is a Gaussian centered at y with variance (ti − tb )Σ, where Σ = µI, µ = β/2, β is the concentration parameter of DPMM. Then the local normalizer Zρi (x|θi ) is Zρi (x|θi ) = N (yi ; 0, ∞) b∈ci N (yi ; yb , Σ(vρb + tb − ti ))dyi , ˆ (14) and the node message Mρi (yi ) is normally distributed Mρi (yi ) ∼ N (yi ; yρi , Σvρi ), where ˆ vρi = 5 b∈ci (vρb + tb − ti )−1 −1 , yρi = ˆ b∈ci yρb ˆ vρ b + t b − t i vρ i . Experiments: Finding Bushy Trees In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show beta performs best and can capture data in more interpretable, bushier trees. Setup The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, bushier the tree is, and we set α = 1.2.5 We set the mutation rate as 1, thus the DPMM parameter is initialized as β = 2, and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference with 30 more iterations after each coalescent event (forming a new particle). Metrics We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. 
For any two leaf nodes, we find the least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with same class labels. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains. 5.1 Synthetic Hierarchies To test our inference method, we generated synthetic data with edge length (full details available in the supplemental material); we also assume each child of the root has a unique label and the descendants also have the same label as their parent node (except the root node). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and feature dimensions (Figure 2(b)). In both cases, beta is comparable to kingman and hac (no edge length). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffice to construct a good tree. 5 With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values. 6 0.8 0.6 beta hac kingman 0.4 0.8 0.6 beta hac kingman 8 60 80 100 beta kingman length Number of Observations Scores 0.6 beta kingman 0.4 0.2 0.0 4 6 length Dimension 0.6 0.4 0.2 0.0 20 40 60 80 100 0.6 beta 2 4 6 length Dimension 0.6 8 enum beta 10 enum 0.4 0.2 0.0 2 Number of Observations (a) Increasing observations 0.8 0.4 2 Scores 40 1.0 10 0.4 20 Scores purity 1.0 Scores purity Scores Scores purity 1.0 4 6 8 Dimension (b) Increasing dimension 10 2 4 6 8 10 Dimension (c) beta v.s. enum Figure 2: Figure 2(a) and 2(b) show the effect of changing the underlying data size or number of dimension. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum). To evaluate the effectiveness of using our DPMM as a proposal distribution, we compare exhaustively enumerating all children set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieved similar accuracy (Figure 2(c)) while greatly improving efficiency. 5.2 Human Tissue Development Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways is a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression. 
5.2 Human Tissue Development

Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways has been a painstaking process requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression.

To investigate this question, we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182,727 dimensions via principal component analysis [30, PCA]. We use five chains with five particles per chain. Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds a bushy hierarchy more similar to the true tree.

The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. The tree also successfully captures the ectodermal tissue lineage. However, mesodermal and endodermal tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together.

[Figure 3 shows the recovered hierarchy over tissues such as Stomach, Pancreas, Placenta, Bone Marrow, Thyroid, Colon, Kidney, Heart, several brain regions, and others.] Figure 3: One sample hierarchy of human tissue from beta. Color indicates the germ layer origin of the tissue. A dashed border indicates secretory function. While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The clustering also preferred placing secretory tissues together and higher in the tree.

5.3 Clustering 20-newsgroups Data

Following Heller et al. [10], we also compare the three models on 20-newsgroups (http://qwone.com/~jason/20Newsgroups/), a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey. We use the "rec.autos", "rec.sport.baseball", "rec.sport.hockey", and "sci.space" newsgroups but also—in contrast to Heller et al. [10]—add "soc.religion.christian". This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus and use the topic distribution of each document as the document feature. We use five chains with eighty particles per chain.

[Figure 4 shows the recovered hierarchy over documents labeled rec.sport.baseball, rec.autos, rec.sport.hockey, sci.space, and soc.religion.christian, with the true tree inset.] Figure 4: One sample hierarchy of the 20-newsgroups data from beta. Each small square is a document colored by its class label. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three "rec" groups are merged together first and then merged with the religion and space groups.

                        Biological Data                          20-newsgroups Data
            hac      kingman          beta              hac      kingman          beta
purity ↑    0.453    0.474 ± 0.029    0.492 ± 0.028     0.465    0.510 ± 0.047    0.565 ± 0.081
subtree ↑   0.240    0.302 ± 0.033    0.331 ± 0.050     0.571    0.651 ± 0.013    0.720 ± 0.013
length ↓    −        0.654 ± 0.041    0.586 ± 0.051     −        0.477 ± 0.027    0.333 ± 0.047

Table 1: Comparing the three models: beta performs best on all three scores.

As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovers the true hierarchy: documents within a newsgroup merge first, then the three "rec" groups, followed by the "space" and "religion" groups. We only use topic distributions as features, so better results may be possible with more comprehensive features.
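For reference, a short sketch of the 20-newsgroups feature pipeline using scikit-learn (not the authors' code): the five newsgroups and the 50 topics follow the text, while the vocabulary size, header stripping, and random seed are our own assumptions. The tissue experiment would analogously use sklearn.decomposition.PCA on the expression matrix.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

groups = ["rec.autos", "rec.sport.baseball", "rec.sport.hockey",
          "sci.space", "soc.religion.christian"]
docs = fetch_20newsgroups(subset="train", categories=groups,
                          remove=("headers", "footers", "quotes"))

# Bag-of-words counts, then a 50-topic LDA; each row of `features` is the
# topic distribution used as that document's feature vector.
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs.data)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
features = lda.fit_transform(counts)
labels = [docs.target_names[t] for t in docs.target]
```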
6 Conclusion

This paper generalizes Bayesian hierarchical clustering, moving from Kingman's coalescent to the beta coalescent. Our novel inference scheme based on SMC and the DPMM makes this generalization practical and efficient. This new model provides a bushier tree, often a more realistic view of data.

While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions.

Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic).

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to the human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV. 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matemáticos, vol. 16. 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS. 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML. 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI. 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS. 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI. 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling. Technical Report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrücken. Importance sampling for Lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. de Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman's coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappé, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. McMaster University, [Online], 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Research, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS. 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL. 2008.
[34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML. 2009.

2 0.70542413 47 nips-2013-Bayesian Hierarchical Community Discovery

Author: Charles Blundell, Yee Whye Teh

Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independent of the number of nested communities. In practice, the run time of our algorithms are two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1

3 0.64510918 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

4 0.61465168 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar

Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1

5 0.60626334 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi

Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1

6 0.54154974 340 nips-2013-Understanding variable importances in forests of randomized trees

7 0.53155786 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

8 0.53153324 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

9 0.52606529 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

10 0.5251193 66 nips-2013-Computing the Stationary Distribution Locally

11 0.51921344 344 nips-2013-Using multiple samples to learn mixture models

12 0.51721978 355 nips-2013-Which Space Partitioning Tree to Use for Search?

13 0.50785637 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks

14 0.49654073 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs

15 0.48677495 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

16 0.47753084 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

17 0.47638756 161 nips-2013-Learning Stochastic Inverses

18 0.47493356 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

19 0.47303012 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

20 0.47069955 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.103), (33, 0.119), (34, 0.132), (41, 0.023), (49, 0.027), (56, 0.087), (70, 0.025), (85, 0.041), (87, 0.258), (89, 0.022), (93, 0.042), (99, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79141051 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying


2 0.68611783 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

Author: Moustapha M. Cisse, Nicolas Usunier, Thierry Artières, Patrick Gallinari

Abstract: This paper presents an approach to multilabel classification (MLC) with a large number of labels. Our approach is a reduction to binary classification in which label sets are represented by low dimensional binary vectors. This representation follows the principle of Bloom filters, a space-efficient data structure originally designed for approximate membership testing. We show that a naive application of Bloom filters in MLC is not robust to individual binary classifiers’ errors. We then present an approach that exploits a specific feature of real-world datasets when the number of labels is large: many labels (almost) never appear together. Our approach is provably robust, has sublinear training and inference complexity with respect to the number of labels, and compares favorably to state-of-the-art algorithms on two large scale multilabel datasets. 1

3 0.65992355 233 nips-2013-Online Robust PCA via Stochastic Optimization

Author: Jiashi Feng, Huan Xu, Shuicheng Yan

Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1

4 0.65660495 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1

5 0.64298517 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan

Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1

6 0.63677567 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

7 0.63498658 245 nips-2013-Pass-efficient unsupervised feature selection

8 0.6338163 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

9 0.63358575 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.63337719 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

11 0.62994426 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

12 0.62967175 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

13 0.62470448 86 nips-2013-Demixing odors - fast inference in olfaction

14 0.62392145 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

15 0.6233089 350 nips-2013-Wavelets on Graphs via Deep Learning

16 0.62251341 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

17 0.62110317 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

18 0.62096286 201 nips-2013-Multi-Task Bayesian Optimization

19 0.61886317 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

20 0.61719835 5 nips-2013-A Deep Architecture for Matching Short Texts