nips nips2013 nips2013-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Charles Blundell, Yee Whye Teh
Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independently of the number of nested communities. In practice, our algorithms run two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. [sent-7, score-0.325]
2 We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. [sent-9, score-0.287]
3 In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independently of the number of nested communities. [sent-10, score-0.095]
4 Consequently the structure found in social networks is often studied by inferring these groups. [sent-14, score-0.081]
5 Using community membership one may then make predictions about the presence or absence of unobserved connectivity in the social network. [sent-15, score-0.233]
6 For example, within science, the community of physicists may be split into those working on various branches of physics, and each branch refined repeatedly until finally reaching the particular specialisation of an individual physicist. [sent-17, score-0.15]
7 Much previous work on social networks has focused on discovering flat community structure. [sent-18, score-0.234]
8 Both models require the number of flat communities to be known and are parametric methods. [sent-20, score-0.13]
9 Greedy hierarchical clustering has previously been applied directly to discovering hierarchical community structure [3]. [sent-21, score-0.307]
10 These methods do not require the community structure to be flat or the number of communities to be known. [sent-22, score-0.28]
11 Such schemes are often computationally efficient, scaling quadratically in the number of individuals for a dense network, or linearly in the number of edges for a sparse network [4]. [sent-23, score-0.094]
12 Several authors have also proposed Bayesian approaches to inferring community structure. [sent-25, score-0.176]
13 The Infinite Relational Model (IRM; [5, 6, 7]) infers a flat community structure. [sent-26, score-0.15]
14 [9] and [10] propose methods limited to hierarchies of depth two, whilst [11] propose methods limited to hierarchies of known depth. [sent-28, score-0.165]
15 The Mondrian process [12] provides a flexible prior on trees and a likelihood model for relational data. [sent-30, score-0.195]
16 ∗ Part of the work was done whilst at the Gatsby Unit, University College London. [sent-32, score-0.073]
17 Such schemes can take a large number of iterations to converge on an adequate solution whilst each iteration often scales unfavourably in the number of communities or vertices. [sent-33, score-0.203]
18 We shall describe a greedy Bayesian hierarchical clustering method for discovering community structure in social networks. [sent-34, score-0.395]
19 Our work builds upon Bayesian approaches to greedy hierarchical clustering [13, 14] extending these approaches to relational data. [sent-35, score-0.164]
20 BHCD produces good results two orders of magnitude faster than a single iteration of the IRM, and its worst case run-time is quadratic in the number of vertices of the graph and independent of the number of communities. [sent-37, score-0.095]
21 In Section 4 we describe an efficient scheme for inferring hierarchical community structure with our model. [sent-41, score-0.24]
22 Suppose V = {a, b, c, d}; then one way to partition the vertices would be to form clusters ab, c and d, which we shall write as φ = ab|c|d, where | denotes a split between clusters. [sent-44, score-0.211]
23 Similar approaches have been taken to adapt the IRM to community detection [7], where non-conjugate priors were used at increased computational cost. [sent-48, score-0.15]
24 The hyperparameters are picked such that α > β and δ < λ, which corresponds to a prior belief of a higher density of edges within a community than across communities. [sent-51, score-0.209]
25 Integrating out the edge appearance parameters, we obtain the following likelihood of a particular partition φ: P(D|φ) = ∏_{p∈φ} B(α + n^1_pp, β + n^0_pp) / B(α, β) · ∏_{p,q∈φ, p≠q} B(δ + n^1_pq, λ + n^0_pq) / B(δ, λ)   (2), where B(·, ·) is the Beta function. [sent-52, score-0.464]
26 We use σ_pq to denote the sufficient statistics from a (p, q)-block of the adjacency matrix: all of the elements whose row indices are in cluster p and whose column indices are in cluster q. [sent-54, score-0.075]
27 For clarity of exposition, we shall focus on modelling undirected or symmetric networks with no self-edges, so σ_pq = σ_qp and σ_{x}{x} = 0 for each vertex x, but in general this restriction is not necessary. [sent-59, score-0.358]
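The following is a minimal sketch of how the partition likelihood in (2) can be computed, assuming D is a dense 0/1 numpy adjacency matrix and a partition is a list of lists of vertex indices; the hyperparameter defaults are illustrative values satisfying α > β and δ < λ, not the paper's settings.

```python
import numpy as np
from scipy.special import betaln

def block_stats(D, rows, cols, diagonal=False):
    """Return (n1, n0): counts of present and absent edges in the (rows, cols) block."""
    block = D[np.ix_(rows, cols)]
    if diagonal:
        # undirected network, no self-edges: count each unordered pair once
        block = block[np.triu_indices(len(rows), k=1)]
    n1 = int(block.sum())
    return n1, block.size - n1

def log_partition_likelihood(D, partition, alpha=1.0, beta=0.2, delta=0.2, lam=1.0):
    """log P(D | phi), integrating out the per-block edge-appearance probabilities."""
    logp = 0.0
    for i, p in enumerate(partition):
        n1, n0 = block_stats(D, p, p, diagonal=True)        # within-community block
        logp += betaln(alpha + n1, beta + n0) - betaln(alpha, beta)
        for q in partition[i + 1:]:                         # across-community blocks
            n1, n0 = block_stats(D, p, q)
            logp += betaln(delta + n1, lam + n0) - betaln(delta, lam)
    return logp
```

For example, log_partition_likelihood(D, [[0, 1], [2], [3]]) scores the partition ab|c|d from the text.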
28 One approach to inferring φ is to fix the number of communities K and then use maximum likelihood estimation or Bayesian inference to assign vertices to each of the communities [1, 15]. [sent-60, score-0.415]
29 Another approach is to use variational Bayes, combined with an upper bound on the number of communities, to determine the number of communities and community assignments [16]. [sent-61, score-0.28]
30 Figure 1: Hierarchical decomposition of the adjacency matrix into tree-consistent partitions. [sent-62, score-0.075]
31 3 Bayesian Hierarchical Communities. In this section we shall develop a Bayesian nonparametric approach to community discovery. [sent-64, score-0.231]
32 Our model organises the communities into a nested hierarchy T , with all vertices in one community at the root and singleton vertices at the leaves. [sent-65, score-0.512]
33 Each vertex belongs to all communities along the path from the root to the leaf containing it. [sent-66, score-0.13]
34 We describe the probabilistic model relating the hierarchy of communities to the observed network connectivity data here, whilst in the next section we will develop a greedy model selection procedure for learning the hierarchy T from data. [sent-67, score-0.365]
35 We begin with the marginal probability of the adjacency matrix D under a stochastic blockmodel: p(D) = Σ_φ p(φ) p(D|φ)   (4). If the Chinese restaurant process (CRP) is used as the prior on partitions p(φ), then (4) corresponds to the marginal likelihood of the IRM. [sent-68, score-0.192]
36 Computing (4) typically requires an approximation: the space of partitions φ is large and so the number of partitions in the above sum grows at least exponentially in the number of vertices. [sent-69, score-0.166]
37 We shall take a different approach: we use a tree to define a prior on partitions, where only partitions that are consistent with the tree are included in the sum. [sent-70, score-0.423]
38 The tree will represent the hierarchical community structure discovered in the network. [sent-72, score-0.357]
39 Each internal node of the tree corresponds to a community and the leaves of the tree are the vertices of the adjacency matrix, D. [sent-73, score-0.676]
40 Figure 1 shows how a tree defines a collection of partitions for inclusion in the sum in (4). [sent-74, score-0.226]
41 The adjacency matrix on the left is explained by our model, conditioned upon the tree on the upper left, by its five tree-consistent partitions. [sent-75, score-0.218]
42 Various blocks within the adjacency matrix are explained either by the on-diagonal model f or the off-diagonal model g, according to each partition. [sent-76, score-0.075]
43 Note that the block structure of the off-diagonal model depends on the structure of the tree T , not just on the partition φ. [sent-77, score-0.171]
44 The model always includes the trivial partition of all vertices in a single community and also the singleton partition of all vertices in separate communities. [sent-78, score-0.396]
45 More precisely, we shall denote trees as a nested collection of sets of vertices. [sent-79, score-0.158]
46 For example, the tree in Figure 1 is T = {{a, b}, {c, d, e}, f }. [sent-80, score-0.143]
47 The set of partitions consistent with a tree T may be expressed formally as in [14]: Φ(T) = {leaves(T)} ∪ {φ_1 | ··· | φ_{n_T} : φ_i ∈ Φ(T_i), T_i ∈ ch(T)}   (5). [sent-81, score-0.226]
48 Here leaves(T) are the leaves of the tree T, ch(T) are its children, and so T_i is the ith subtree of tree T. [sent-84, score-0.356]
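As an illustration of (5), here is a small sketch that enumerates the tree-consistent partitions of a tree given as nested tuples; the tuple representation and leaf labels are assumptions for the example, not the paper's data structures.

```python
from itertools import product

def leaves(tree):
    if not isinstance(tree, tuple):                    # a leaf vertex
        return [tree]
    return [x for child in tree for x in leaves(child)]

def tree_consistent_partitions(tree):
    """Phi(T): each partition is a list of clusters (lists of leaf labels)."""
    if not isinstance(tree, tuple):
        return [[[tree]]]                              # only the singleton partition
    parts = [[leaves(tree)]]                           # the whole subtree as one cluster
    for combo in product(*(tree_consistent_partitions(c) for c in tree)):
        parts.append([cluster for child_part in combo for cluster in child_part])
    return parts

# The tree of Figure 1, T = {{a, b}, {c, d, e}, f}, yields the five partitions
# abcdef, ab|cde|f, ab|c|d|e|f, a|b|cde|f and a|b|c|d|e|f.
T = (("a", "b"), ("c", "d", "e"), "f")
print(["|".join("".join(c) for c in phi) for phi in tree_consistent_partitions(T)])
```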
49 The marginal likelihood of the tree T can be written as: p(D|T) = p(D_TT|T) = Σ_φ p(φ|T) p(D_TT|φ, T)   (6), where the notation D_TT is short for D_{leaves(T),leaves(T)}, the block of D whose rows and columns correspond to the leaves of T. [sent-85, score-0.247]
50 Our prior on partitions p(φ|T ) is motivated by the following generative process: Begin at the root of the tree, S = T . [sent-86, score-0.083]
51 Otherwise, with probability 1 − πS , generate all inter-cluster edges between the children of the current node according to g, and recurse on each child of the current tree S. [sent-88, score-0.229]
52 The resulting prior on tree-consistent partitions p(φ|T) factorises as: p(φ|T) = ∏_{S∈ancestor_T(φ)} (1 − π_S) ∏_{S∈subtree_T(φ)} π_S   (7), where subtree_T(φ) are the subtrees in T corresponding to the clusters in partition φ and ancestor_T(φ) are the ancestors of trees in subtree_T(φ). [sent-89, score-0.276]
53 The prior probability of partitions not consistent with T is zero. [sent-90, score-0.083]
54 This choice of πS gives higher likelihood to non-binary trees over cascading binary trees when the data has no hierarchical structure [14]. [sent-92, score-0.306]
55 However, due to the structure of the prior on partitions (7) and the block model (8), the marginal likelihood (6) may be calculated by dynamic programming, in O(n + m) time where n is the number of vertices and m is the number of edges. [sent-94, score-0.212]
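A minimal sketch of the kind of recursion that makes this dynamic programming possible is given below. It assumes a rose-tree-style decomposition p(D_TT|T) = π_T f(σ_TT) + (1 − π_T) g(σ^¬ch_TT) ∏_{C∈ch(T)} p(D_CC|C), with f and g the Beta-Bernoulli block marginals of Section 2; the exact equations (8) to (10) are not reproduced in this extract, so tree.is_leaf, tree.children, stats and log_pi are assumed interfaces and the whole thing should be read as an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.special import betaln, logsumexp

def log_f(n1, n0, alpha=1.0, beta=0.2):
    return betaln(alpha + n1, beta + n0) - betaln(alpha, beta)

def log_g(n1, n0, delta=0.2, lam=1.0):
    return betaln(delta + n1, lam + n0) - betaln(delta, lam)

def log_tree_evidence(tree, stats, log_pi):
    """log p(D_TT | T) under the assumed rose-tree-style recursion.
    stats(tree) -> ((n1, n0) for the whole on-diagonal block,
                    (n1, n0) for the off-diagonal blocks between children)."""
    if tree.is_leaf():
        return 0.0                                     # no edges within a single vertex
    (n1_all, n0_all), (n1_off, n0_off) = stats(tree)
    children = sum(log_tree_evidence(c, stats, log_pi) for c in tree.children)
    lp = log_pi(tree)
    return logsumexp([
        lp + log_f(n1_all, n0_all),                                  # one flat community
        np.log1p(-np.exp(lp)) + log_g(n1_off, n0_off) + children,    # recurse into children
    ])
```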
56 The problem is treated as one of greedy model selection: each tree T is a different model, and we wish to find the model that best explains the data. [sent-96, score-0.186]
57 The tree is built in a bottom-up greedy agglomerative fashion, starting from a forest consisting of n trivial trees, each corresponding to exactly one vertex. [sent-97, score-0.27]
58 Each iteration then merges two of the trees in the forest. [sent-98, score-0.148]
59 At each iteration, each vertex in the network is a leaf of exactly one tree in the forest. [sent-99, score-0.178]
60 The initial forest, F^(0), consists of a singleton tree for each vertex of the network. [sent-102, score-0.143]
61 At each iteration a pair of trees in the forest F is chosen to be merged, resulting in a new forest F′. [sent-103, score-0.212]
62 Which pair of trees to merge, and how to merge these trees, is determined by considering which pair and type of merger yields the largest Bayes factor improvement over the current model. [sent-104, score-0.223]
63 That is, we consider the probability of the connectivity data only among the leaves of the newly merged tree. [sent-108, score-0.102]
64 This permits efficient local computations and makes the assumption that local community structure should depend only on the local connectivity structure. [sent-109, score-0.15]
65 We consider three possible mergers of two trees I and J into M . [sent-110, score-0.128]
66 (Algorithm excerpt: for each tree J ∈ F \ {I}, compute σ_IJ, σ_MM, and σ^¬ch_MM using (13); the loop repeats until only one tree remains in F, which is returned.) [sent-116, score-0.143]
67 Figure 2: Different merge operations, illustrated on subtrees T_a, T_b, T_c, T_d and T_e. [sent-118, score-0.223]
68 The algorithm maintains a forest F of trees, the likelihood p_I = p(D_II|I) of each tree I ∈ F according to (10), and the sufficient statistics {σ_II}_{I∈F}, {σ^¬ch_II}_{I∈F} and {σ_IJ}_{I,J∈F} needed for efficient computation. [sent-123, score-0.231]
69 It also maintains a heap of potential merges ordered by the SCORE (12), used for determining the ordering of merges. [sent-124, score-0.077]
70 At each iteration, the best potential merge, say of trees X and Y resulting in tree I, is picked off the heap. [sent-125, score-0.247]
71 If either X or Y is not in F , this means that the tree has been used in a previous merge, so that the potential merge is discarded and the next potential merge is considered. [sent-126, score-0.303]
72 After a successful merge, the sufficient statistics associated with the new tree I are computed using the previously computed ones: σ_IJ = σ_XJ + σ_YJ for J ∈ F, J ≠ I. [sent-127, score-0.143]
73 Finally, potential mergers of I with other trees J in the forest are considered and added onto the heap. [sent-130, score-0.182]
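A schematic version of this greedy agglomerative loop is sketched below. The helper score(X, Y), which should return the Bayes-factor improvement of the best merger of two trees together with the merged tree, stands in for equations (12) and (13) and the three merge types; it is an assumed interface, not the paper's code, and the sufficient-statistics bookkeeping described above is omitted for brevity.

```python
import heapq
import itertools

def bhcd_greedy(vertices, score):
    """Greedy agglomerative construction of a single tree from singleton trees."""
    forest = {v: ("leaf", v) for v in vertices}          # tree id -> tree
    counter = itertools.count(len(vertices))             # fresh ids, also a heap tie-breaker
    heap = []
    for x, y in itertools.combinations(vertices, 2):
        s, merged = score(forest[x], forest[y])
        heapq.heappush(heap, (-s, next(counter), x, y, merged))

    while len(forest) > 1 and heap:
        _, _, x, y, merged = heapq.heappop(heap)
        if x not in forest or y not in forest:           # a tree used by an earlier merge
            continue
        del forest[x], forest[y]
        new_id = next(counter)
        forest[new_id] = merged
        for j in list(forest):                           # propose mergers with the new tree
            if j != new_id:
                s, m = score(forest[new_id], forest[j])
                heapq.heappush(heap, (-s, next(counter), new_id, j, m))
    return next(iter(forest.values()))
```

The sparse variation discussed next amounts to never pushing candidate pairs whose trees share no edges.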
74 4.1 Variations. BHCD will consider merging trees that have no edges between them if the merge score (12) is high enough. [sent-136, score-0.243]
75 This does not seem to be a reasonable behaviour as communities that are completely disconnected should not be merged. [sent-137, score-0.13]
76 We can alter BHCD by simply prohibiting such merges between trees that have no edges between them. [sent-138, score-0.207]
77 The resulting algorithm we call BHCD sparse, as it only needs to perform computations on the parts of the network with edges present. [sent-139, score-0.094]
78 In particular, at the first iteration all pairs of vertices connected by an edge have the same score. [sent-142, score-0.123]
79 Where we want a single tree, we use R restarts (R = 50 in experiments) and pick the tree with the highest likelihood according to (10). [sent-145, score-0.177]
80 4.2 Predictions. For link prediction, we wish to obtain the predictive distribution of a previously unobserved element of the adjacency matrix. [sent-148, score-0.12]
81 This is easily achieved by traversing one path of the tree from the root towards the leaves, hence the computational complexity is linear in the depth of the tree. [sent-149, score-0.143]
82 r_S = π_S f(σ_SS) / p(D_SS|S)   (14), where r_S is the probability that the cluster consisting of leaves(S) is present if the cluster corresponding to its parent is not present, given the data D and the tree T. [sent-151, score-0.143]
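The traversal described above can be sketched as follows. The mixture weights used here (the cluster at S contributes with probability r_S times the product of (1 − r_A) over its ancestors A) are an interpretation of the description around (14), and the helpers r, within_pred and between_pred (Beta-Bernoulli predictive means for on- and off-diagonal blocks) are assumed interfaces rather than the paper's equations.

```python
def predict_edge(tree, x, y, r, within_pred, between_pred):
    """Predictive probability of an edge between vertices x and y, obtained by walking
    the single root-to-leaf path that keeps x and y in the same subtree."""
    prob, survive = 0.0, 1.0
    S = tree
    while True:
        rS = r(S)
        prob += survive * rS * within_pred(S)            # x and y share community leaves(S)
        survive *= 1.0 - rS
        child = next((c for c in S.children
                      if x in c.leaf_set and y in c.leaf_set), None)
        if child is None:                                # x and y are separated below S
            return prob + survive * between_pred(S)
        S = child
```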
83 But, as we shall see below, the binary hierarchies found can be more difficult to interpret than the non-binary ones. [sent-165, score-0.146]
84 Sampson's Monastery Network. Figure 3 shows the result of running six variants of BHCD on time point four of Sampson's monastery network [17]. [sent-166, score-0.103]
85 As can be seen in Figure 3, most BHCD variants find clear block diagonal structure in the adjacency matrix. [sent-170, score-0.075]
86 The log likelihoods of the trees in Figure 3 are −6. [sent-173, score-0.131]
87 In contrast, the log likelihoods of the binary trees in Figure 3 are −8. [sent-176, score-0.131]
88 BHCD finds the most likely tree, and rose trees typically better explain the data than binary trees. [sent-179, score-0.142]
89 BHCD finds the Young Turks and Loyal Opposition groups and chooses to merge some members of the Interstitial group into the Loyal Opposition and Amand into the Outcasts. [sent-180, score-0.08]
90 Mark, however, is placed in a separate community: although Mark has a positive affiliation with Gregory, Mark also has a negative affiliation with John Bosco and so BHCD elects to create a new community to account for this discrepancy. [sent-181, score-0.15]
(Figure residue: an axis labelled "Run time (s)", legend entries Sparse, Binary, Binary Rose and Rose, and the monastery member labels Victor, Ramuald, Peter, Bonaventure, Berthold, Louis, Ambrose, Simplicius, Elias, Amand, Boniface, Winfrid, John Bosco, Hugh, Gregory, Basil, Mark and Albert.)
94 This latter factor, I, which corresponds to the number of MCMC steps or the number of greedy search steps, may be large and may need to scale as the number of vertices increases. [sent-208, score-0.138]
95 Full NIPS. The full NIPS dataset has 2864 vertices and 9466 edges. [sent-218, score-0.095]
96 We cut the tree and retained the top portion of the hierarchy, shown above the clusters. [sent-221, score-0.143]
97 We merged all the leaves of a subtree T into a flat cluster when r_T ∏_{A∈ancestor_T} (1 − r_A) > 0. [sent-222, score-0.102]
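A small sketch of this cut is given below, under the assumption that the rule descends from the root and emits a flat cluster for the first subtree whose weight r_T ∏_{A∈ancestors(T)} (1 − r_A) exceeds the threshold quoted (truncated) in the text; tree.is_leaf, tree.children, tree.leaf_set and r are assumed interfaces.

```python
def cut_tree(tree, r, threshold, survive=1.0):
    """Flatten a hierarchy into clusters using the r_T * prod(1 - r_A) rule."""
    if tree.is_leaf() or r(tree) * survive > threshold:
        return [tree.leaf_set]                        # merge all leaves of this subtree
    clusters = []
    for child in tree.children:
        clusters.extend(cut_tree(child, r, threshold, survive * (1.0 - r(tree))))
    return clusters
```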
98 6 Discussion and Future Work. We proposed an efficient Bayesian procedure for discovering hierarchical communities in social networks. [sent-227, score-0.278]
99 Experimentally our procedure discovers reasonable hierarchies and is able to make predictions about two orders of magnitude faster than one of the fastest existing Bayesian nonparametric schemes, whilst attaining comparable performance. [sent-228, score-0.174]
100 Another way to scale up the model would be to investigate parameterising the network using the sufficient statistics of triangles, instead of edges as in [21]. [sent-232, score-0.094]
wordName wordTfidf (topN-words)
[('bhcd', 0.548), ('ch', 0.189), ('amand', 0.164), ('bosco', 0.164), ('irm', 0.157), ('pq', 0.152), ('ambrose', 0.151), ('basil', 0.151), ('bonaventure', 0.151), ('boniface', 0.151), ('ramuald', 0.151), ('simplicius', 0.151), ('winfrid', 0.151), ('community', 0.15), ('tree', 0.143), ('hugh', 0.133), ('communities', 0.13), ('qq', 0.124), ('elias', 0.122), ('qqq', 0.119), ('berthold', 0.109), ('albert', 0.105), ('trees', 0.104), ('louis', 0.101), ('vertices', 0.095), ('gregory', 0.095), ('victor', 0.094), ('qqqq', 0.094), ('partitions', 0.083), ('sampson', 0.082), ('merge', 0.08), ('adjacency', 0.075), ('whilst', 0.073), ('mark', 0.073), ('ss', 0.071), ('leaves', 0.07), ('monastery', 0.068), ('hierarchical', 0.064), ('blockmodel', 0.062), ('dxy', 0.06), ('edges', 0.059), ('ir', 0.057), ('relational', 0.057), ('social', 0.055), ('ancestort', 0.055), ('erge', 0.055), ('subtreet', 0.055), ('shall', 0.054), ('forest', 0.054), ('peter', 0.054), ('hierarchies', 0.046), ('predictive', 0.045), ('merges', 0.044), ('greedy', 0.043), ('te', 0.042), ('hierarchy', 0.042), ('john', 0.041), ('dss', 0.041), ('ila', 0.041), ('loyal', 0.041), ('bayesian', 0.04), ('rose', 0.038), ('rs', 0.038), ('ta', 0.037), ('liation', 0.037), ('tb', 0.037), ('lfrm', 0.036), ('opposition', 0.036), ('ij', 0.036), ('pp', 0.035), ('network', 0.035), ('td', 0.034), ('tc', 0.034), ('likelihood', 0.034), ('clusters', 0.034), ('heap', 0.033), ('merged', 0.032), ('dii', 0.031), ('dt', 0.031), ('agglomerative', 0.03), ('discovering', 0.029), ('dm', 0.029), ('predictions', 0.028), ('edge', 0.028), ('charles', 0.028), ('partition', 0.028), ('blockmodels', 0.027), ('factorises', 0.027), ('liations', 0.027), ('mondrian', 0.027), ('turks', 0.027), ('likelihoods', 0.027), ('children', 0.027), ('nonparametric', 0.027), ('af', 0.026), ('inferring', 0.026), ('dcc', 0.024), ('djj', 0.024), ('interstitial', 0.024), ('mergers', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 47 nips-2013-Bayesian Hierarchical Community Discovery
Author: Charles Blundell, Yee Whye Teh
Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independently of the number of nested communities. In practice, our algorithms run two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1
2 0.11977096 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
3 0.11937173 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
4 0.09908694 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1
5 0.090523131 196 nips-2013-Modeling Overlapping Communities with Node Popularities
Author: Prem Gopalan, Chong Wang, David Blei
Abstract: We develop a probabilistic approach for accurate network modeling using node popularities within the framework of the mixed-membership stochastic blockmodel (MMSB). Our model integrates two basic properties of nodes in social networks: homophily and preferential connection to popular nodes. We develop a scalable algorithm for posterior inference, based on a novel nonconjugate variant of stochastic variational inference. We evaluate the link prediction accuracy of our algorithm on nine real-world networks with up to 60,000 nodes, and on simulated networks with degree distributions that follow a power law. We demonstrate that the AMP predicts significantly better than the MMSB. 1
6 0.087164842 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
7 0.08585991 355 nips-2013-Which Space Partitioning Tree to Use for Search?
8 0.069853634 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.069483273 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
10 0.065208182 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games
11 0.060109969 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
12 0.05885293 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
13 0.056062408 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
14 0.052541237 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
15 0.052223183 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
16 0.049126938 63 nips-2013-Cluster Trees on Manifolds
17 0.048540935 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
18 0.047564387 340 nips-2013-Understanding variable importances in forests of randomized trees
19 0.046553876 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
20 0.046223633 186 nips-2013-Matrix factorization with binary components
topicId topicWeight
[(0, 0.112), (1, 0.033), (2, -0.049), (3, -0.006), (4, 0.052), (5, 0.059), (6, 0.073), (7, -0.049), (8, 0.021), (9, 0.013), (10, 0.006), (11, -0.022), (12, 0.155), (13, -0.011), (14, -0.006), (15, 0.003), (16, 0.008), (17, 0.037), (18, -0.01), (19, 0.025), (20, 0.117), (21, 0.067), (22, -0.117), (23, 0.01), (24, 0.093), (25, -0.069), (26, 0.019), (27, -0.052), (28, 0.143), (29, 0.004), (30, 0.033), (31, 0.006), (32, -0.018), (33, -0.086), (34, 0.012), (35, 0.034), (36, 0.026), (37, -0.076), (38, 0.066), (39, 0.069), (40, -0.011), (41, -0.014), (42, -0.043), (43, -0.045), (44, 0.043), (45, 0.013), (46, -0.059), (47, -0.126), (48, 0.054), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.92161167 47 nips-2013-Bayesian Hierarchical Community Discovery
Author: Charles Blundell, Yee Whye Teh
Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, our algorithms scale quadratically in the number of vertices of the network, but independently of the number of nested communities. In practice, our algorithms run two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1
2 0.61129481 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
Author: Yuening Hu, Jordan Boyd-Graber, Hal Daume III, Z. Irene Ying
Abstract: Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution—Kingman’s coalescent—provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman’s coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data that show the beta coalescent outperforms Kingman’s coalescent and is qualitatively better at capturing data in bushy hierarchies. 1 The Need For Bushy Hierarchical Clustering Hierarchical clustering is a fundamental data analysis problem: given observations, what hierarchical grouping of those observations effectively encodes the similarities between observations? This is a critical task for understanding and describing observations in many domains [1, 2], including natural language processing [3], computer vision [4], and network analysis [5]. In all of these cases, natural and intuitive hierarchies are not binary but are instead bushy, with more than two children per parent node. Our goal is to provide efficient algorithms to discover bushy hierarchies. We review existing nonparametric probabilistic clustering algorithms in Section 2, with particular focus on Kingman’s coalescent [6] and its generalization, the beta coalescent [7, 8]. While Kingman’s coalescent has attractive properties—it is probabilistic and has edge “lengths” that encode how similar clusters are—it only produces binary trees. The beta coalescent (Section 3) does not have this restriction. However, na¨ve inference is impractical, because bushy trees are more complex: we need ı to consider all possible subsets of nodes to construct each internal nodes in the hierarchy. Our first contribution is a generalization of the belief propagation framework [9] for beta coalescent to compute the joint probability of observations and trees (Section 3). After describing sequential Monte Carlo posterior inference for the beta coalescent, we develop efficient inference strategies in Section 4, where we use proposal distributions that draw on the connection between Dirichlet processes—a ubiquitous Bayesian nonparametric tool for non-hierarchical clustering—and hierarchical coalescents to make inference tractable. We present results on both synthetic and real data that show the beta coalescent captures bushy hierarchies and outperforms Kingman’s coalescent (Section 5). 2 Bayesian Clustering Approaches Recent hierarchical clustering techniques have been incorporated inside statistical models; this requires formulating clustering as a statistical—often Bayesian—problem. Heller et al. [10] build 1 binary trees based on the marginal likelihoods, extended by Blundell et al. [11] to trees with arbitrary branching structure. Ryan et al. [12] propose a tree-structured stick-breaking process to generate trees with unbounded width and depth, which supports data observations at leaves and internal nodes.1 However, these models do not distinguish edge lengths, an important property in distinguishing how “tight” the clustering is at particular nodes. 
Hierarchical models can be divided into complementary “fragmentation” and “coagulation” frameworks [7]. Both produce hierarchical partitions of a dataset. Fragmentation models start with a single partition and divide it into ever more specific partitions until only singleton partitions remain. Coagulation frameworks repeatedly merge singleton partitions until only one partition remains. Pitman-Yor diffusion trees [13], a generalization of Dirichlet diffusion trees [14], are an example of a bushy fragmentation model, and they model edge lengths and build non-binary trees. Instead, our focus is on bottom-up coalescent models [8], one of the coagulation models and complementary to diffusion trees, which can also discover hierarchies and edge lengths. In this model, n nodes are observed (we use both observed to emphasize that nodes are known and leaves to emphasize topology). These observed nodes are generated through some unknown tree with latent edges and unobserved internal nodes. Each node (both observed and latent) has a single parent. The convention in such models is to assume our observed nodes come at time t = 0, and at time −∞ all nodes share a common ur-parent through some sequence of intermediate parents. Consider a set of n individuals observed at the present (time t = 0). All individuals start in one of n singleton sets. After time ti , a set of these nodes coalesce into a new node. Once a set merges, their parent replaces the original nodes. This is called a coalescent event. This process repeats until there is only one node left, and a complete tree structure π (Figure 1) is obtained. Different coalescents are defined by different probabilities of merging a set of nodes. This is called the coalescent rate, defined by a general family of coalescents: the lambda coalescent [7, 15]. We represent the rate via the symbol λk , the rate at which k out of n nodes merge into a parent node. n From a collection of n nodes, k ≤ n can coalesce at some coalescent event (k can be different for different coalescent events). The rate of a fraction γ of the nodes coalescing is given by γ −2 Λ(dγ), where Λ(dγ) is a finite measure on [0, 1]. So k nodes merge at rate 1 λk = n γ k−2 (1 − γ)n−k Λ(dγ) (2 ≤ k ≤ n). (1) 0 Choosing different measures yields different coalescents. A degenerate Dirac delta measure at 0 results in Kingman’s coalescent [6], where λk is 1 when k = 2 and zero otherwise. Because this n gives zero probability to non-binary coalescent events, this only creates binary trees. Alternatively, using a beta distribution BETA(2 − α, α) as the measure Λ yields the beta coalescent. When α is closer to 1, the tree is bushier; as α approaches 2, it becomes Kingman’s coalescent. If we have ni−1 nodes at time ti−1 in a beta coalescent, the rate λkii−1 for a children set of ki nodes at time n ti and the total rate λni−1 of any children set merging—summing over all possible mergers—is λkii−1 = n Γ(ki − α)Γ(ni−1 − ki + α) and λni−1 = Γ(2 − α)Γ(α)Γ(ni−1 ) ni−1 ni−1 ki λkii−1 . n (2) ki =2 Each coalescent event also has an edge length—duration—δi . The duration of an event comes from an exponential distribution, δi ∼ exp(λni−1 ), and the parent node forms at time ti = ti−1 − δi . Shorter durations mean that the children more closely resemble their parent (the mathematical basis for similarity is specified by a transition kernel, Section 3). Analogous to Kingman’s coalescent, the prior probability of a complete tree π is the product of all of its constituent coalescent events i = 1, . . . 
m, merging ki children after duration δi , m m λkii−1 · exp(−λni−1 δi ). n p(ki |ni−1 ) · p(δi |ki , ni−1 ) = p(π) = i=1 Merge ki nodes After duration δi (3) i=1 1 This is appropriate where the entirety of a population is known—both ancestors and descendants. We focus on the case where only the descendants are known. For a concrete example, see Section 5.2. 2 Algorithm 1 MCMC inference for generating a tree 1: for Particle s = 1, 2, · · · , S do s 2: Initialize ns = n, i = 0, ts = 0, w0 = 1. 0 s 3: Initialize the node set V = {ρ0 , ρ1 , · · · , ρn }. 4: while ∃s ∈ {1 · · · S} where ns > 1 do 5: Update i = i + 1. 6: for Particle s = 1, 2, · · · , S do (a) Kingman’s coalescent 7: if ns == 1 then 8: Continue. s 9: Propose a duration δi by Equation 10. s 10: Set coalescent time ts = ts − δi . i i−1 11: Sample partitions ps from DPMM. i s 12: Propose a set ρci according to Equation 11. s 13: Update weight wi by Equation 13. s s s 14: Update n = n − |ρci | + 1. s (b) the beta coalescent 15: Remove ρci from V s , add ρs to V s . i 16: Compute effective sample size ESS [16]. Figure 1: The beta coalescent can merge four simi17: if ESS < S/2 then lar nodes at once, while Kingman’s coalescent only 18: Resample particles [17]. merges two each time. 3 Beta Coalescent Belief Propagation The beta coalescent prior only depends on the topology of the tree. In real clustering applications, we also care about a node’s children and features. In this section, we define the nodes and their features, and then review how we use message passing to compute the probabilities of trees. An internal node ρi is defined as the merger of other nodes. The children set of node ρi , ρci , coalesces into a new node ρi ≡ ∪b∈ci ρb . This encodes the identity of the nodes that participate in specific coalescent events; Equation 3, in contrast, only considers the number of nodes involved in an event. In addition, each node is associated with a multidimensional feature vector yi . Two terms specify the relationship between nodes’ features: an initial distribution p0 (yi ) and a transition kernel κti tb (yi , yb ). The initial distribution can be viewed as a prior or regularizer for feature representations. The transition kernel encourages a child’s feature yb (at time tb ) to resemble feature yi (formed at ti ); shorter durations tb − ti increase the resemblance. Intuitively, the transition kernel can be thought as a similarity score; the more similar the features are, the more likely nodes are. For Brownian diffusion (discussed in Section 4.3), the transition kernel follows a Gaussian distribution centered at a feature. The covariance matrix Σ is decided by the mutation rate µ [18, 9], the probability of a mutation in an individual. Different kernels (e.g., multinomial, tree kernels) can be applied depending on modeling assumptions of the feature representations. To compute the probability of the beta coalescent tree π and observed data x, we generalize the belief propagation framework used by Teh et al. [9] for Kingman’s coalescent; this is a more scalable alternative to other approaches for computing the probability of a Beta coalescent tree [19]. We define a subtree structure θi = {θi−1 , δi , ρci }, thus the tree θm after the final coalescent event m is a complete tree π. The message for node ρi marginalizes over the features of the nodes in its children set.2 The total message for a parent node ρi is −1 Mρi (yi ) = Zρi (x|θi ) κti tb (yi , yb )Mρb (yb )dyb . 
(4) b∈ci where Zρi (x|θi ) is the local normalizer, which can be computed as the combination of initial distribution and messages from a set of children, Zρi (x|θi ) = p0 (yi ) κti tb (yi , yb )Mρb (yb )dyb dyi . b∈ci 2 When ρb is a leaf, the message Mρb (yb ) is a delta function centered on the observation. 3 (5) Recursively performing this marginalization through message passing provides the joint probability of a complete tree π and the observations x. At the root, Z−∞ (x|θm ) = p0 (y−∞ )κ−∞,tm (y−∞ , ym )Mρm (ym )dym dy−∞ (6) where p0 (y−∞ ) is the initial feature distribution and m is the number of coalescent events. This gives the marginal probability of the whole tree, m p(x|π) = Z−∞ (x|θm ) Zρi (x|θi ), (7) i=1 The joint probability of a tree π combines the prior (Equation 3) and likelihood (Equation 7), m λkii−1 exp(−λni−1 δi ) · Zρi (x|θi ). n p(x, π) = Z−∞ (x|θm ) (8) i=1 3.1 Sequential Monte Carlo Inference Sequential Monte Carlo (SMC)—often called particle filters—estimates a structured sequence of hidden variables based on observations [20]. For coalescent models, this estimates the posterior distribution over tree structures given observations x. Initially (i = 0) each observation is in a singleton cluster;3 in subsequent particles (i > 0), points coalesce into more complicated tree s structures θi , where s is the particle index and we add superscript s to all the related notations to distinguish between particles. We use sequential importance resampling [21, SIR] to weight each s particle s at time ti , denoted as wi . The weights from SIR approximate the posterior. Computing the weights requires a conditional distris s s bution of data given a latent state p(x|θi ), a transition distribution between latent states p(θi |θi−1 ), s s and a proposal distribution f (θi |θi−1 , x). Together, these distributions define weights s s wi = wi−1 s s s p(x | θi )p(θi | θi−1 ) . s s f (θi | θi−1 , x) (9) Then we can approximate the posterior distribution of the hidden structure using the normalized weights, which become more accurate with more particles. To apply SIR inference to belief propagation with the beta coalescent prior, we first define the particle s space structure. The sth particle represents a subtree θi−1 at time ts , and a transition to a new i−1 s s s s subtree θi takes a set of nodes ρci from θi−1 , and merges them at ts , where ts = ts − δi and i i i−1 s s s s s θi = {θi−1 , δi , ρci }. Our proposal distribution must provide the duration δi and the children set ρsi c s to merge based on the previous subtree θi−1 . s We propose the duration δi from the prior exponential distribution and propose a children set from the posterior distribution based on the local normalizers. 4 This is the “priorpost” method in Teh et al. [9]. However, this approach is intractable. Given ni−1 nodes at time ti , we must consider all possible n children sets ni−1 + ni−1 + · · · + ni−1 . The computational complexity grows from O(n2 ) i−1 i−1 2 3 ni−1 (Kingman’s coalescent) to O(2 ) (beta coalescent). 4 Efficiently Finding Children Sets with DPMM We need a more efficient way to consider possible children sets. Even for Kingman’s coalescent, which only considers pairs of nodes, Gorur et al. [22] do not exhaustively consider all pairs. Instead, they use data structures from computational geometry to select the R closest pairs as their restriction set, reducing inference to O(n log n). 
While finding closest pairs is a traditional problem in computational geometry, discovering arbitrary-sized sets is less studied. 3 The relationship between time and particles is non-intuitive. Time t goes backward with subsequent particles. When we use time-specific adjectives for particles, this is with respect to inference. 4 This is a special case of Section 4.2’s algorithm, where the restriction set Ωi is all possible subsets. 4 In this section, we describe how we use a Dirichlet process mixture model [23, DPMM] to discover a restriction set Ω, integrating DPMMs into the SMC proposal. We first briefly review what DPMMs are, describe why they are attractive, and then describe how we incorporate DPMMs in SMC inference. The DPMM is defined by a concentration β and a base distribution G0 . A distribution over mixtures is drawn from a Dirichlet process (DP): G ∼ DP(β, G0 ). Each observation xi is assigned to a mixture component µi drawn from G. Because the Dirichlet process is a discrete distribution, observations i and j can have the same mixture component (µi = µj ). When this happens, points are said to be in the same partition. Posterior inference can discover a distribution over partitions. A full derivation of these sampling equations appears in the supplemental material. 4.1 Attractive Properties of DPMMs DPMM s and Coalescents Berestycki et al. [8] showed that the distribution over partitions in a Dirichlet process is equivalent to the distribution over coalescents’ allelic partitions—the set of members that have the same feature representation—when the mutation rate µ of the associated kernel is half of the Dirichlet concentration β (Section 3). For Brownian diffusion, we can connect DPMM with coalescents by setting the kernel covariance Σ = µI to Σ = β/2I. The base distribution G0 is also related with nodes’ feature. The base distribution G0 of a Dirichlet process generates the probability measure G for each block, which generates the nodes in a block. As a result, we can select a base distribution which fits the distribution of the samples in coalescent process. For example, if we use Gaussian distribution for the transition kernel and prior, a Gaussian is also appropriate as the DPMM base distribution. Effectiveness as a Proposal The necessary condition for a valid proposal [24] is that it should have support on a superset of the true posterior. In our case, the distribution over partitions provided by the DPMM considers all possible children sets that could be merged in the coalescent. Thus the new proposal with DPMM satisfies this requirement, and it is a valid proposal. In addition, Chen [25] gives a set of desirable criteria for a good proposal distribution: accounts for outliers, considers the likelihood, and lies close to the true posterior. The DPMM fulfills these criteria. First, the DPMM provides a distribution over all partitions. Varying the concentration parameter β can control the length of the tail of the distribution over partitions. Second, choosing the base distribution of the DPMM appropriately models the feature likelihood; i.e., ensuring the DPMM places similar nodes together in a partition with high probability. Third, the DPMM qualitatively provides reasonable children sets when compared with exhaustively considering all children sets (Figure 2(c)). 4.2 Incorporating DPMM in SMC Proposals To address the inference intractability in Section 3.1, we use the DPMM to obtain a distribution over partitions of nodes. 
Each partition contains clusters of nodes, and we take a union over all partitions to create a restriction set Ωi = {ωi1 , ωi2 , · · · }, where each ωij is a subset of the ni−1 nodes. A standard Gibbs sampler provides these partitions (see supplemental). s With this restriction set Ωi , we propose the duration time δi from the exponential distribution and s propose a children set ρci based on the local normalizers s s Zρi (x|θi−1 , δi , ρsi ) c s s s s s s · I ρci ∈ Ωs , (11) fi (ρci |δi , θi−1 ) = fi (δi ) = λs i−1 exp(−λs i−1 δi ) (10) i n n Z0 s where Ωs restricts the candidate children sets, I is the indicator, and we replace Zρi (x|θi ) with i s s s Zρi (x|θi−1 , δi , ρci ) since they are equivalent here. The normalizer is Z0 = ρc s s Zρi (x|θi−1 , δi , ρc ) · I [ρc ∈ Ωs ] = i ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ). (12) Applying the true distribution (the ith multiplicand from Equation 8) and the proposal distribution (Equation 10 and Equation 11) to the SIR weight update (Equation 9), |ρs | c s s wi = wi−1 i λni−1 · ρc ∈Ωs i s s Zρi (x|θi−1 , δi , ρc ) λs i−1 n 5 , (13) |ρs | c s i where |ρsi | is the size of children set ρci ; parameter λni−1 is the rate of the children set ρsi (Equac c s tion 2); and λni−1 is the rate of all possible sets given a total number of nodes ni−1 (Equation 2). We can view this new proposal as a coarse-to-fine process: DPMM proposes candidate children sets; SMC selects a children set from DPMM to coalesce. Since the coarse step is faster and filters “bad” children sets, the slower finer step considers fewer children sets, saving computation time (Algorithm 1). If Ωi has all children sets, it recovers exhaustive SMC. We estimate the effective sample size [16] and resample [17] when needed. For smaller sets, the DPMM is sometimes impractical (and only provides singleton clusters). In such cases it is simpler to enumerate all children sets. 4.3 Example Transition Kernel: Brownian Diffusion This section uses Brownian diffusion as an example for message passing framework. The initial distribution p0 (y) of each node is N (0, ∞); the transition kernel κti tb (y, ·) is a Gaussian centered at y with variance (ti − tb )Σ, where Σ = µI, µ = β/2, β is the concentration parameter of DPMM. Then the local normalizer Zρi (x|θi ) is Zρi (x|θi ) = N (yi ; 0, ∞) b∈ci N (yi ; yb , Σ(vρb + tb − ti ))dyi , ˆ (14) and the node message Mρi (yi ) is normally distributed Mρi (yi ) ∼ N (yi ; yρi , Σvρi ), where ˆ vρi = 5 b∈ci (vρb + tb − ti )−1 −1 , yρi = ˆ b∈ci yρb ˆ vρ b + t b − t i vρ i . Experiments: Finding Bushy Trees In this section, we compare trees built by the beta coalescent (beta) against those built by Kingman’s coalescent (kingman) and hierarchical agglomerative clustering [26, hac] on both synthetic and real data. We show beta performs best and can capture data in more interpretable, bushier trees. Setup The parameter α for the beta coalescent is between 1 and 2. The closer α is to 1, bushier the tree is, and we set α = 1.2.5 We set the mutation rate as 1, thus the DPMM parameter is initialized as β = 2, and updated using slice sampling [27]. All experiments use 100 initial iterations of DPMM inference with 30 more iterations after each coalescent event (forming a new particle). Metrics We use three metrics to evaluate the quality of the trees discovered by our algorithm: purity, subtree and path length. The dendrogram purity score [28, 10] measures how well the leaves in a subtree belong to the same class. 
For any two leaf nodes, we find the least common subsumer node s and—for the subtree rooted at s—measure the fraction of leaves with same class labels. The subtree score [9] is the ratio between the number of internal nodes with all children in the same class and the total number of internal nodes. The path length score is the average difference—over all pairs—of the lowest common subsumer distance between the true tree and the generated tree, where the lowest common subsumer distance is the distance between the root and the lowest common subsumer of two nodes. For purity and subtree, higher is better, while for length, lower is better. Scores are in expectation over particles and averaged across chains. 5.1 Synthetic Hierarchies To test our inference method, we generated synthetic data with edge length (full details available in the supplemental material); we also assume each child of the root has a unique label and the descendants also have the same label as their parent node (except the root node). We compared beta against kingman and hac by varying the number of observations (Figure 2(a)) and feature dimensions (Figure 2(b)). In both cases, beta is comparable to kingman and hac (no edge length). While increasing the feature dimension improves both scores, more observations do not: for synthetic data, a small number of observations suffice to construct a good tree. 5 With DPMM proposals, α has a negligible effect, so we elide further analysis for different α values. 6 0.8 0.6 beta hac kingman 0.4 0.8 0.6 beta hac kingman 8 60 80 100 beta kingman length Number of Observations Scores 0.6 beta kingman 0.4 0.2 0.0 4 6 length Dimension 0.6 0.4 0.2 0.0 20 40 60 80 100 0.6 beta 2 4 6 length Dimension 0.6 8 enum beta 10 enum 0.4 0.2 0.0 2 Number of Observations (a) Increasing observations 0.8 0.4 2 Scores 40 1.0 10 0.4 20 Scores purity 1.0 Scores purity Scores Scores purity 1.0 4 6 8 Dimension (b) Increasing dimension 10 2 4 6 8 10 Dimension (c) beta v.s. enum Figure 2: Figure 2(a) and 2(b) show the effect of changing the underlying data size or number of dimension. Figure 2(c) shows that our DPMM proposal for children sets is comparable to an exhaustive enumeration of all possible children sets (enum). To evaluate the effectiveness of using our DPMM as a proposal distribution, we compare exhaustively enumerating all children set candidates (enum) while keeping the SMC otherwise unchanged; this experiment uses ten data points (enum is completely intractable on larger data). Beta uses the DPMM and achieved similar accuracy (Figure 2(c)) while greatly improving efficiency. 5.2 Human Tissue Development Our first real dataset is based on the developmental biology of human tissues. As a human develops, tissues specialize, starting from three embryonic germ layers: the endoderm, ectoderm, and mesoderm. These eventually form all human tissues. For example, one developmental pathway is ectoderm → neural crest → cranial neural crest → optic vesicle → cornea. Because each germ layer specializes into many different types of cells at specific times, it is inappropriate to model this development as a binary tree, or with clustering models lacking path lengths. Historically, uncovering these specialization pathways is a painstaking process, requiring inspection of embryos at many stages of development; however, massively parallel sequencing data make it possible to efficiently form developmental hypotheses based on similar patterns of gene expression. 
To investigate this question we use the transcriptome of 27 tissues with known, unambiguous, time-specific lineages [29]. We reduce the original 182727 dimensions via principle component analysis [30, PCA]. We use five chains with five particles per chain. Using reference developmental trees, beta performs better on all three scores (Table 1) because beta builds up a bushy hierarchy more similar to the true tree. The tree recovered by beta (Figure 3) reflects human development. The first major differentiation is the division of embryonic cells into three layers of tissue: endoderm, mesoderm, and ectoderm. These go on to form almost all adult organs and cells. The placenta (magenta), however, forms from a fourth cell type, the trophoblast; this is placed in its own cluster at the root of the tree. It also successfully captures ectodermal tissue lineage. However, mesodermic and endodermic tissues, which are highly diverse, do not cluster as well. Tissues known to secrete endocrine hormones (dashed borders) cluster together. 5.3 Clustering 20-newsgroups Data Following Heller et al. [10], we also compare the three models on 20-newsgroups,6 a multilevel hierarchy first dividing into general areas (rec, space, and religion) before specializing into areas such as baseball or hockey.7 This true hierarchy is inset in the bottom right of Figure 4, and we assume each edge has the same length. We apply latent Dirichlet allocation [31] with 50 topics to this corpus, and use the topic distribution for each document as the document feature. We use five chains with eighty particles per chain. 6 http://qwone.com/˜jason/20Newsgroups/ 7 We use “rec.autos”, “rec.sport.baseball”, “rec.sport.hockey”, “sci.space” newsgroups but also—in contrast to Heller et al. [10]—added “soc.religion.christian”. 7 ectoderm Stomach Pancreas mesoderm placenta Placenta endoderm Bone Marrow Thyroid Colon Kidney Heart PeripheralBlood Lymphocytes Brain Hypothalamus Brain Amygdala Prostate Uterus Lung Brain Thalamus BrainCorpus Callosum Spleen Thymus Spinal Cord Brain Cerebellum BrainCaudate Nucleus Doc Label rec.sport.baseball rec.autos rec.sport.hocky sci.space soc.religion.christian Trachea Small Intestine Retina Monocytes Mammary Gland ... purity ↑ subtree ↑ length ↓ ... ... ... ... Bladder Figure 3: One sample hierarchy of human tissue from beta. Color indicates germ layer origin of tissue. Dashed border indicates secretory function. While neural tissues from the ectoderm were clustered correctly, some mesoderm and endoderm tissues were commingled. The cluster also preferred placing secretory tissues together and higher in the tree. hac 0.453 0.240 − True Tree Figure 4: One sample hierarchy of the 20newsgroups from beta. Each small square is a document colored by its class label. Large rectangles represent a subtree with all the enclosed documents as leaf nodes. Most of the documents from the same group are clustered together; the three “rec” groups are merged together first, and then merged with the religion and space groups. Biological Data kingman beta 0.474 ± 0.029 0.492 ± 0.028 0.302 ± 0.033 0.331 ± 0.050 0.654 ± 0.041 0.586 ± 0.051 hac 0.465 0.571 − 20-newsgroups Data kingman beta 0.510 ± 0.047 0.565 ± 0.081 0.651 ± 0.013 0.720 ± 0.013 0.477 ± 0.027 0.333 ± 0.047 Table 1: Comparing the three models: beta performs best on all three scores. As with the biological data, beta performs best on all scores for 20-newsgroups. Figure 4 shows a bushy tree built by beta, which mostly recovered the true hierarchy. 
Documents within a newsgroup merge first, then the three “rec” groups, followed by “space” and “religion” groups. We only use topic distribution as features, so better results could be possible with more comprehensive features. 6 Conclusion This paper generalizes Bayesian hierarchical clustering, moving from Kingman’s coalescent to the beta coalescent. Our novel inference scheme based on SMC and DPMM make this generalization practical and efficient. This new model provides a bushier tree, often a more realistic view of data. While we only consider real-valued vectors, which we model through the ubiquitous Gaussian, other likelihoods might be better suited to other applications. For example, for discrete data such as in natural language processing, a multinomial likelihood may be more appropriate. This is a straightforward extension of our model via other transition kernels and DPMM base distributions. Recent work uses the coalescent as a means of producing a clustering in tandem with a downstream task such as classification [32]. Hierarchies are often taken a priori in natural language processing. Particularly for linguistic tasks, a fully statistical model like the beta coalescent that jointly learns the hierarchy and a downstream task could improve performance in dependency parsing [33] (clustering parts of speech), multilingual sentiment [34] (finding sentiment-correlated words across languages), or topic modeling [35] (finding coherent words that should co-occur in a topic). Acknowledgments We would like to thank the anonymous reviewers for their helpful comments, and thank H´ ctor e Corrada Bravo for pointing us to human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990. [2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010. [3] Brown, P. F., V. J. D. Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18:18–4, 1990. [4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV. 1992. [5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821– 7826, 2002. [6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. [7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999. [8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matematicos, vol. 16. 2009. e [9] Teh, Y. W., H. Daum´ III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS. 2008. [10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML. 2005. [11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI. 2010. [12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS. 2010. [13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI. 2011. [14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003. [15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999. [16] Neal, R. M. Annealed importance sampling. 
Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments, and thank Héctor Corrada Bravo for pointing us to the human tissue data. This research was supported by NSF grant #1018625. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] Kaufman, L., P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[2] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[3] Brown, P. F., V. J. Della Pietra, P. V. deSouza, et al. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Bergen, J., P. Anandan, K. Hanna, et al. Hierarchical model-based motion estimation. In ECCV, 1992.
[5] Girvan, M., M. E. J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[6] Kingman, J. F. C. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982.
[7] Pitman, J. Coalescents with multiple collisions. The Annals of Probability, 27:1870–1902, 1999.
[8] Berestycki, N. Recent progress in coalescent theory. In Ensaios Matemáticos, vol. 16, 2009.
[9] Teh, Y. W., H. Daumé III, D. M. Roy. Bayesian agglomerative clustering with coalescents. In NIPS, 2008.
[10] Heller, K. A., Z. Ghahramani. Bayesian hierarchical clustering. In ICML, 2005.
[11] Blundell, C., Y. W. Teh, K. A. Heller. Bayesian rose trees. In UAI, 2010.
[12] Adams, R., Z. Ghahramani, M. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.
[13] Knowles, D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI, 2011.
[14] Neal, R. M. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629, 2003.
[15] Sagitov, S. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36:1116–1125, 1999.
[16] Neal, R. M. Annealed importance sampling. Technical Report 9805, University of Toronto, 1998.
[17] Fearnhead, P. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford, 1998.
[18] Felsenstein, J. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25(5):471–492, 1973.
[19] Birkner, M., J. Blath, M. Steinrücken. Importance sampling for Lambda-coalescents in the infinitely many sites model. Theoretical Population Biology, 79(4):155–173, 2011.
[20] Doucet, A., N. de Freitas, N. Gordon, eds. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[21] Gordon, N., D. Salmond, A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.
[22] Görür, D., L. Boyles, M. Welling. Scalable inference on Kingman's coalescent using pair similarity. JMLR, 22:440–448, 2012.
[23] Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[24] Cappé, O., S. Godsill, E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[25] Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. Technical report, McMaster University, 2003.
[26] Eads, D. Hierarchical clustering (scipy.cluster.hierarchy). SciPy, 2007.
[27] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[28] Powers, D. M. W. Unsupervised learning of linguistic structure: an empirical evaluation. International Journal of Corpus Linguistics, 2:91–131, 1997.
[29] Jongeneel, C., M. Delorenzi, C. Iseli, et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Research, 15:1007–1014, 2005.
[30] Shlens, J. A tutorial on principal component analysis. Technical report, Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[31] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[32] Rai, P., H. Daumé III. The infinite hierarchical factor regression model. In NIPS, 2008.
[33] Koo, T., X. Carreras, M. Collins. Simple semi-supervised dependency parsing. In ACL, 2008.
[34] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP, 2010.
[35] Andrzejewski, D., X. Zhu, M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, 2009.
3 0.56159574 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and "stretching" of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
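To illustrate the kind of pruning rule such a tree uses, the sketch below shows the standard VP-tree child-selection test with optional "stretch" factors applied to the triangle-inequality bound. It is a generic illustration under our own assumptions (the coefficients `alpha_inner` and `alpha_outer` stand in for learned quantities); the learned pruners in the paper may take a different form.

```python
# Generic VP-tree pruning test with a "stretched" triangle-inequality bound.
# Illustration only; alpha_inner / alpha_outer stand in for learned stretch factors.
def children_to_visit(dist_q_pivot, radius, tau, alpha_inner=1.0, alpha_outer=1.0):
    """Decide which child partitions of a VP-tree node must be searched.

    dist_q_pivot: distance from the query to the node's vantage point.
    radius:       split radius (the inner child holds points within it).
    tau:          current search radius (distance to the best candidate so far).
    alpha_*:      stretch factors; 1.0 recovers the exact metric-space rule,
                  other values trade accuracy for pruning in non-metric spaces.
    """
    visit = []
    if dist_q_pivot < radius + tau * alpha_inner:
        visit.append("inner")
    if dist_q_pivot >= radius - tau * alpha_outer:
        visit.append("outer")
    return visit
```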
4 0.55774981 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi
Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1
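The objective named in the abstract above, the weighted sum of label entropies at the leaves, is easy to state in code. The sketch below is a generic illustration of that objective only, not the paper's node-splitting and node-merging search.

```python
# Weighted sum of label entropies over the leaves of one DAG level (generic
# illustration of the objective; the joint optimization over features and
# structure described in the abstract is not shown).
import numpy as np

def weighted_leaf_entropy(leaf_of_example, labels, n_classes):
    leaf_of_example = np.asarray(leaf_of_example)
    labels = np.asarray(labels)
    n_total = len(labels)
    objective = 0.0
    for leaf in np.unique(leaf_of_example):
        leaf_labels = labels[leaf_of_example == leaf]
        counts = np.bincount(leaf_labels, minlength=n_classes)
        p = counts / counts.sum()
        entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
        objective += (len(leaf_labels) / n_total) * entropy
    return objective
```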
5 0.55275637 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on principal component regression and its application to high-dimensional non-Gaussian data. The major contributions are two-fold. First, in low dimensions and under the Gaussian model, by borrowing strength from recent developments in minimax-optimal principal component estimation, we sharply characterize, for the first time, the potential advantage of classical principal component regression over least squares estimation. Second, we propose and analyze a new robust sparse principal component regression for high-dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well-known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy-tailed and to have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients at the optimal parametric rate and is therefore a good alternative to the Gaussian-based methods. Experiments on synthetic and real-world data are conducted to illustrate the empirical usefulness of the proposed method. 1
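For orientation, classical principal component regression (the baseline discussed in the abstract above) can be sketched in a few lines. This is our own minimal version, not the robust sparse elliptical estimator the paper proposes.

```python
# Classical principal component regression: project onto the top-k principal
# directions of the training covariates, then ordinary least squares.
# Baseline illustration only, not the paper's robust sparse estimator.
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, k):
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = principal directions
    V = Vt[:k].T                                        # top-k loadings, shape (p, k)
    Z = Xc @ V                                          # training scores
    A = np.c_[np.ones(len(Z)), Z]                       # add intercept column
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    Z_test = (X_test - mu) @ V
    return np.c_[np.ones(len(Z_test)), Z_test] @ beta
```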
6 0.5264942 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
7 0.52150464 340 nips-2013-Understanding variable importances in forests of randomized trees
8 0.51682782 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
9 0.50953794 355 nips-2013-Which Space Partitioning Tree to Use for Search?
10 0.49644357 196 nips-2013-Modeling Overlapping Communities with Node Popularities
11 0.46809384 161 nips-2013-Learning Stochastic Inverses
12 0.44318488 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
13 0.44009304 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
14 0.43206015 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
15 0.42253545 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
16 0.41019288 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games
17 0.3961902 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
18 0.39421046 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
19 0.39064169 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
20 0.36973855 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
topicId topicWeight
[(2, 0.016), (16, 0.07), (32, 0.362), (33, 0.089), (34, 0.07), (36, 0.01), (41, 0.015), (49, 0.025), (56, 0.06), (70, 0.035), (85, 0.067), (89, 0.015), (93, 0.051), (95, 0.013), (99, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.71727771 47 nips-2013-Bayesian Hierarchical Community Discovery
Author: Charles Blundell, Yee Whye Teh
Abstract: We propose an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks. Our model is a tree-structured mixture of potentially exponentially many stochastic blockmodels. We describe a family of greedy agglomerative model selection algorithms that take just one pass through the data to learn a fully probabilistic, hierarchical community model. In the worst case, Our algorithms scale quadratically in the number of vertices of the network, but independent of the number of nested communities. In practice, the run time of our algorithms are two orders of magnitude faster than the Infinite Relational Model, achieving comparable or better accuracy. 1
2 0.55120283 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Abstract: Conditional random fields, which model the distribution of a multivariate response conditioned on a set of covariates using undirected graphs, are widely used in a variety of multivariate prediction applications. Popular instances of this class of models, such as categorical-discrete CRFs, Ising CRFs, and conditional-Gaussian-based CRFs, are not well suited to the varied types of response variables in many applications, including count-valued responses. We thus introduce a novel subclass of CRFs, derived by imposing node-wise conditional distributions of response variables, conditioned on the rest of the responses and the covariates, as arising from univariate exponential families. This allows us to derive novel multivariate CRFs given any univariate exponential distribution, including the Poisson, negative binomial, and exponential distributions. In particular, it also addresses the common CRF problem of specifying "feature" functions determining the interactions between response variables and covariates. We develop a class of tractable penalized M-estimators to learn these CRF distributions from data, as well as a unified sparsistency analysis for this general class of CRFs showing that exact structure recovery can be achieved with high probability. 1
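As one concrete instance of the node-wise construction described above, a count-valued response can be given a Poisson node-conditional with a log-linear natural parameter. The sketch below writes out that single-node negative log pseudo-likelihood under our own notational assumptions (the parameter names theta_node, theta_edges, and phi are hypothetical, and the penalty term is omitted).

```python
# Negative log pseudo-likelihood of one node of a Poisson CRF: the node's
# response given the other responses and the covariates is Poisson with a
# log-linear natural parameter. Parameter names are hypothetical; no penalty.
import numpy as np
from scipy.special import gammaln

def poisson_node_nll(y_node, y_others, x, theta_node, theta_edges, phi):
    """y_node: (n,) counts; y_others: (n, p-1) other responses; x: (n, d) covariates."""
    eta = theta_node + y_others @ theta_edges + x @ phi   # natural parameter per sample
    return np.sum(np.exp(eta) - y_node * eta + gammaln(y_node + 1.0))
```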
3 0.40039679 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
Author: Dae Il Kim, Prem Gopalan, David Blei, Erik Sudderth
Abstract: Stochastic block models characterize observed network relationships via latent community memberships. In large social networks, we expect entities to participate in multiple communities, and the number of communities to grow with the network size. We introduce a new model for these phenomena, the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. To allow scalable learning, we derive an online stochastic variational inference algorithm. Focusing on assortative models of undirected networks, we also propose an efficient structured mean field variational bound, and online methods for automatically pruning unused communities. Compared to state-of-the-art online learning methods for parametric relational models, we show significantly improved perplexity and link prediction accuracy for sparse networks with tens of thousands of nodes. We also showcase an analysis of LittleSis, a large network of who-knows-who at the heights of business and government. 1
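For intuition, the assortative edge likelihood that such mixed-membership relational models build on can be written in one line. The sketch below is a generic assortative blockmodel term under our own assumptions (in particular the background probability eps), not the paper's hierarchical Dirichlet process relational model or its online variational updates.

```python
# Generic assortative mixed-membership edge probability (illustration only).
import numpy as np

def assortative_edge_prob(pi_i, pi_j, beta, eps=1e-3):
    """P(edge between nodes i and j).

    pi_i, pi_j: community-membership probability vectors (each sums to 1).
    beta:       within-community link probabilities, one per community.
    eps:        small background link probability when communities differ.
    """
    same_community = np.asarray(pi_i) * np.asarray(pi_j)  # mass on matching communities
    return float(same_community @ np.asarray(beta) + (1.0 - same_community.sum()) * eps)
```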
4 0.39636734 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
5 0.39384952 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
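The abstract above describes a pattern rather than a single algorithm. The sketch below is a minimal serial simulation of one optimistic round under our own placeholder interfaces (propose, conflicts, and resolve are hypothetical callbacks), not the paper's clustering, feature-learning, or facility-location algorithms.

```python
# Minimal illustration of optimistic concurrency control: updates are proposed
# assuming no conflicts, validated, and only the conflicting items are passed
# to a (serial) resolution step. The callbacks here are placeholders.
def occ_round(items, propose, conflicts, resolve, state):
    proposals = [(item, propose(item, state)) for item in items]   # optimistic phase
    accepted, deferred = [], []
    for item, update in proposals:
        if conflicts(update, accepted, state):
            deferred.append(item)          # validation failed
        else:
            accepted.append(update)
    for item in deferred:                  # conflict-resolution phase
        accepted.append(resolve(item, accepted, state))
    return accepted
```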
6 0.3911427 350 nips-2013-Wavelets on Graphs via Deep Learning
7 0.38996595 168 nips-2013-Learning to Pass Expectation Propagation Messages
8 0.38931912 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
9 0.38922724 196 nips-2013-Modeling Overlapping Communities with Node Popularities
10 0.38870129 232 nips-2013-Online PCA for Contaminated Data
11 0.38863221 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
12 0.38745528 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
13 0.38658407 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
14 0.38638219 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
15 0.38629407 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.38627955 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
17 0.3862074 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
18 0.38551399 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
19 0.38498959 201 nips-2013-Multi-Task Bayesian Optimization
20 0.3849687 245 nips-2013-Pass-efficient unsupervised feature selection