nips nips2011 nips2011-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A Bayesian approach to partitioning distance matrices is presented. [sent-4, score-0.214]
2 It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. [sent-5, score-0.269]
3 1 Introduction In cluster analysis we are concerned with identifying subsets of n objects that share some similarity and therefore potentially belong to the same sub-population. [sent-10, score-0.118]
4 Many practical applications leave us without direct access to vectorial representations and instead only supply pairwise distance measures collected in a matrix D. [sent-11, score-0.227]
5 This poses a serious challenge, because much of the geometric information that could otherwise help to discover hidden structure is thereby lost. [sent-12, score-0.129]
6 Its main drawback, however, is the high computational cost of order O(n3 ) per sweep of a Gibbs sampler, limiting its applicability to relatively small data sets. [sent-15, score-0.117]
7 In this work we present an alternative method which shares all the positive properties of the TIWD while reducing the computational workload to O(n2 ) per Gibbs sweep. [sent-16, score-0.119]
8 The main idea is to solve the problem of missing geometric information by a normalisation procedure, which chooses one particular geometric embedding of the distance data and allows us to use a simple probabilistic model for inferring the unknown underlying partition. [sent-18, score-0.353]
9 The construction we use is guaranteed to give the optimal such geometric embedding if the true partition was known. [sent-19, score-0.226]
10 Of course, this is only a hypothetical precondition, but we show that even rough prior estimates of the true partition significantly outperform ‘naive’ embedding strategies. [sent-20, score-0.181]
11 Using a simple hierarchical clustering model to produce such prior estimates leads to clusterings being at least of the same quality as those obtained by the original TIWD. [sent-21, score-0.108]
12 The algorithmic contribution here is an efficient algorithm for performing this normalisation procedure in O(n2 ) time, which makes the whole pipeline from distance matrix to inferred partition an O(n2 ) process (assuming a constant number of Gibbs sweeps). [sent-22, score-0.392]
13 In the next section we introduce a probabilistic model for partitioning inner product matrices, which is generalised in section 3 to distance matrices using a preprocessing step that breaks the geometric symmetry inherent in distance representations. [sent-30, score-0.523]
14 2 A Wishart Model for Partitioning Inner Product Matrices Suppose there is a matrix X ∈ Rn×d representing n objects in Rd that belong to one of k subpopulations. [sent-32, score-0.115]
15 For convenience we define the generalised central Wishart distribution, which also allows rank-deficient S and/or Σ, as p(S|Ψ, d) ∝ det(S)^{(d−n−1)/2} det(Ψ)^{d/2} exp(−(d/2) tr(ΨS)), (1) where det(•) is the product of non-zero eigenvalues and Ψ denotes the (generalised) inverse of Σ. [sent-43, score-0.264]
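To make the density in (1) concrete, here is a minimal sketch (ours, not code from the paper) of evaluating its log-density with the pseudo-determinant; the eigenvalue tolerance `tol` is an arbitrary choice.

```python
import numpy as np

def log_generalised_wishart(S, Psi, d, tol=1e-10):
    """Unnormalised log-density of the generalised central Wishart in (1).

    det(.) is the product of non-zero eigenvalues, so rank-deficient S and/or
    Psi are allowed; `tol` (our choice) decides which eigenvalues count as zero.
    """
    n = S.shape[0]

    def log_pseudo_det(A):
        ev = np.linalg.eigvalsh(A)
        ev = ev[ev > tol * ev.max()]
        return np.sum(np.log(ev))

    return (0.5 * (d - n - 1) * log_pseudo_det(S)
            + 0.5 * d * log_pseudo_det(Psi)
            - 0.5 * d * np.trace(Psi @ S))
```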
16 Note that the inverse of a block diagonal matrix is also block diagonal, so we can formulate the prior in terms of Σ, which is easier to parametrise. [sent-49, score-0.269]
17 For this purpose we adapt the method in [2] using a Multinomial-Dirichlet process model [3, 4, 5] to define a flexible prior distribution over block matrices without specifying the exact number of blocks. [sent-50, score-0.212]
18 A partition B ∈ Bn can be represented in matrix form as B(i, j) = 1 if y(i) = y(j) and B(i, j) = 0 otherwise, with y being a function that maps [n] to some label set L. [sent-53, score-0.167]
19 A partition process is a series of distributions Pn on the set Bn in which Pn is the marginal of Pn+1 . [sent-55, score-0.133]
20 In this prior (cf. the Chinese Restaurant Process), nb is the number of objects in block b and kB ≤ k is the total number of blocks in B. [sent-63, score-0.311]
21 The prior is exchangeable meaning rows and columns can be (jointly) permuted arbitrarily and therefore partition matrices can always be brought to block diagonal form. [sent-64, score-0.287]
22 The final block diagonal covariance matrix used in (2) has the form Σ = Ψ−1 = α(In + θB), with θ := β/α. [sent-67, score-0.183]
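As a small illustration (our own helper names, not the authors' code), this is how a label vector induces the partition matrix B and the covariance Σ = α(In + θB):

```python
import numpy as np

def partition_matrix(labels):
    """B(i, j) = 1 if objects i and j share a label, else 0."""
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(float)

def block_covariance(labels, alpha, theta):
    """Sigma = alpha * (I_n + theta * B) as used in the Wishart model."""
    n = len(labels)
    return alpha * (np.eye(n) + theta * partition_matrix(labels))

# example: 5 objects in 3 blocks
Sigma = block_covariance([0, 0, 1, 1, 2], alpha=1.0, theta=5.0)
```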
23 Multiplying the Wishart likelihood (2), the prior over partitions (3) and suitable priors over α, θ gives the joint posterior. [sent-69, score-0.112]
24 Note that the (usually unknown) degree of freedom d has the formal role of an annealing parameter, and it can indeed be used to ‘cool’ the Markov chain by increasing d, if desired, until a partition is ‘frozen’. [sent-76, score-0.104]
25 In one sweep of the Gibbs sampler, we have to iteratively compute the membership probability of one object indexed by i to the kB currently existing blocks in partition B (plus one new block), given the assignments for the n − 1 remaining ones denoted by the superscript (−i) [7, 8]. [sent-78, score-0.269]
26 In every step of this inner loop over kB existing blocks we have to evaluate the Wishart likelihood, i.e. [sent-79, score-0.122]
27 Given the trace tr^(−i), we update Sbb^(−i) for the kB blocks b ∈ B, which in total needs O(n) operations. [sent-82, score-0.108]
28 Given the determinant det^(−i), the computation of all kB updated determinants induces costs of O(kB). [sent-83, score-0.163]
29 In total, there are n objects, so a full sweep requires O(n2 + n kB) operations, which is equal to O(n2) since the maximum number of blocks is n, i.e. kB ≤ n. [sent-84, score-0.165]
30 Compared to the original TIWD, the worst case complexity in the Dirichlet process model with an infinite number of blocks in the population, k = ∞, is reduced from O(n3 ) to O(n2 ) . [sent-88, score-0.107]
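The following sketch (ours, not the authors' implementation) spells out the bookkeeping behind this complexity claim. It uses per-block closed forms for det(Ψ) and tr(ΨS) that follow from the block structure Σ = α(In + θB) via Sherman-Morrison; the CRP-style prior weights and the full re-evaluation per candidate block are simplifications for readability (the actual sampler only updates the affected block's contribution).

```python
import numpy as np

def wishart_loglik(block_sums, block_sizes, trace_S, alpha, theta, d, n):
    """(d/2) log det(Psi) - (d/2) tr(Psi S) for Sigma = alpha (I_n + theta B).

    Our own Sherman-Morrison rewriting:
      log det(Sigma) = n log(alpha) + sum_b log(1 + theta n_b)
      tr(Psi S)      = (tr(S) - theta sum_b s_b / (1 + theta n_b)) / alpha,
    where s_b is the sum of all entries of S inside block b.
    """
    s = np.asarray(block_sums, dtype=float)
    m = np.asarray(block_sizes, dtype=float)
    logdet_psi = -(n * np.log(alpha) + np.sum(np.log1p(theta * m)))
    tr_psi_S = (trace_S - theta * np.sum(s / (1.0 + theta * m))) / alpha
    return 0.5 * d * logdet_psi - 0.5 * d * tr_psi_S

def gibbs_scores(S, labels, i, alpha, theta, d, mass=1.0):
    """Unnormalised log scores for assigning object i to each existing block or
    a new block, given the other n-1 assignments.  With per-block statistics
    cached across Gibbs steps this costs O(n + kB) per object; here they are
    rebuilt from scratch only for readability."""
    labels = np.asarray(labels)
    n = S.shape[0]
    rest = np.delete(np.arange(n), i)
    blocks = np.unique(labels[rest])
    sizes = np.array([(labels[rest] == b).sum() for b in blocks], dtype=float)
    sums, rows = np.empty(len(blocks)), np.empty(len(blocks))
    for k, b in enumerate(blocks):
        idx = rest[labels[rest] == b]
        sums[k] = S[np.ix_(idx, idx)].sum()      # s_b without object i
        rows[k] = S[i, idx].sum()                # sum_{j in b} S_ij
    trace_S = np.trace(S)
    scores = []
    for k in range(len(blocks)):                 # move i into existing block k
        new_sums, new_sizes = sums.copy(), sizes.copy()
        new_sums[k] += 2.0 * rows[k] + S[i, i]
        new_sizes[k] += 1.0
        ll = wishart_loglik(new_sums, new_sizes, trace_S, alpha, theta, d, n)
        scores.append(np.log(sizes[k]) + ll)     # CRP-style prior weight (sketch)
    # alternatively, open a new singleton block for object i
    ll = wishart_loglik(np.append(sums, S[i, i]), np.append(sizes, 1.0),
                        trace_S, alpha, theta, d, n)
    scores.append(np.log(mass) + ll)
    return np.array(scores)
```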
31 3 The fastTIWD Model for Partitioning Distance Matrices Consider now the case where S is not accessible, but only squared pairwise distances D ∈ Rn×n : D(i, j) = S(i, i) + S(j, j) − 2 S(i, j). [sent-89, score-0.139]
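A quick numerical check (ours) of this conversion from S to D, and of the translation invariance discussed in the following sentences: shifting every row of X by the same vector changes S but leaves the squared distances unchanged.

```python
import numpy as np

def squared_distances(S):
    """D(i, j) = S(i, i) + S(j, j) - 2 S(i, j)."""
    s = np.diag(S)
    return s[:, None] + s[None, :] - 2.0 * S

# shifting every row of X by the same vector c changes S but not D
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
c = rng.standard_normal(3)
D1 = squared_distances(X @ X.T)
D2 = squared_distances((X + c) @ (X + c).T)
assert np.allclose(D1, D2)
```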
32 In fact, if S∗ ∼ W(Σ), the distribution of a general S ∈ S is non-central Wishart, which can be easily seen as follows: S is exactly the set of inner product matrices that can be constructed by varying c ∈ Rd in a modified matrix normal model X ∼ N (M, Σ ⊗ Id ) with mean matrix M = 1n ct . [sent-94, score-0.332]
33 Note further that ‘shifts’ ci do not affect pairwise distances between rows in X. [sent-96, score-0.139]
34 The modified matrix normal distribution implies that S = (1/d) XX^t is non-central Wishart, S ∼ W(Σ, Θ), with non-centrality matrix Θ := Σ^{−1} M M^t. [sent-97, score-0.171]
35 This normalisation procedure can be implemented solely based on S, leading to the well-known centering procedure in kernel PCA [10]: Sc = QI S QI^t, (9) with projection QI = I − (1/n) 1 1^t. [sent-106, score-0.14]
36 Contrary to the PCA setting, however, this column normalisation induced by QI does not work well here, because the elements of a column vector in X are not independent. [sent-107, score-0.14]
37 In doing so, we not only remove the shifts ci but also alter the distribution: the non-centrality matrix does not vanish in general and, as a result, Sc is no longer central Wishart distributed. [sent-109, score-0.19]
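For reference, a sketch (ours) of what this naive column normalisation computes, via the classical MDS identity Sc = −(1/2) QI D QI^t; as just argued, it yields an admissible embedding of D but in general not a central-Wishart-distributed one.

```python
import numpy as np

def naive_centering(D):
    """The column normalisation of (9): Sc = Q_I S Q_I^t with Q_I = I - 11^t/n,
    computed directly from the squared-distance matrix D."""
    n = D.shape[0]
    Q = np.eye(n) - np.full((n, n), 1.0 / n)
    return -0.5 * (Q @ D @ Q.T)
```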
38 In the following we present a solution to the problem of finding a candidate matrix S∗ that recasts inference based on the translation-invariant Wishart distribution as a method to reconstruct the optimal S∗ . [sent-110, score-0.116]
39 Our proposal is guided by a particular analogy between trees and partition matrices and aims at exploiting a tree-structure to guarantee low computational costs. [sent-111, score-0.217]
40 Assuming that S∗ ∼ Wd (Σ), the distribution of an arbitrary member S ∈ S(D) can be derived analytically as a generalised central Wishart distribution with a rank-deficient covariance, see [2]. [sent-115, score-0.221]
41 Its likelihood in the rank-deficient inverse covariance matrix Ψ̃ is L(Ψ̃) ∝ det(Ψ̃)^{d/2} exp(−(d/2) tr(Ψ̃ S∗)) = det(Ψ̃)^{d/2} exp((d/4) tr(Ψ̃ D)), (10) with Ψ̃ = Ψ − (1^t Ψ 1)^{−1} Ψ 1 1^t Ψ. [sent-116, score-0.195]
42 By treating Q as a fixed matrix, expression (12) can also be seen as a central Wishart in the transformed matrix S∗ = QSQ^t, parametrised by the full-rank matrix Ψ, if det(Ψ) is substituted by the appropriate normalisation term det(Ψ̃). [sent-130, score-0.4]
43 From this viewpoint, inference using the translation-invariant Wishart distribution can be interpreted as finding a (rank-deficient) representative S∗ = QSQt ∈ S(D) which follows a generalised central Wishart distribution with full-rank inverse covariance matrix Ψ. [sent-131, score-0.344]
44 Thus S∗ can be seen as an optimal candidate inner-product matrix in the set S(D) for a central Wishart model parametrised by Ψ. [sent-133, score-0.25]
45 If, on the other hand, we had some initial estimate of Ψ, we could find a reasonable transformation Q and thereby a reasonable candidate S∗. [sent-136, score-0.169]
46 Note that even if the estimate of Ψ is far from the true inverse covariance, the pairwise distances are at least guaranteed not to change under the induced transformation S → Q S Q^t. [sent-137, score-0.176]
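A small sketch (ours, with hypothetical helper names) of this construction: build Q from an estimate of Ψ and verify numerically that S∗ = QSQ^t encodes the same squared distances as S.

```python
import numpy as np

def transform_from_psi(S, Psi):
    """Q = I - (1^t Psi 1)^{-1} 1 1^t Psi built from an estimate of Psi,
    and S* = Q S Q^t.  Only the matrix-vector product a = Psi @ 1 is needed."""
    n = S.shape[0]
    one = np.ones((n, 1))
    a = Psi @ one
    Q = np.eye(n) - (one @ a.T) / (one.T @ a).item()
    return Q @ S @ Q.T

# distance preservation holds for an arbitrary SPD estimate of Psi
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
S = A @ A.T
Psi = rng.standard_normal((6, 6)); Psi = Psi @ Psi.T + np.eye(6)
sqdist = lambda M: np.diag(M)[:, None] + np.diag(M)[None, :] - 2.0 * M
assert np.allclose(sqdist(S), sqdist(transform_from_psi(S, Psi)))
```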
47 Our construction is guided by an analogy between binary trees and weighted sums of cut matrices, which are binary complements of partition matrices with two blocks. [sent-142, score-0.292]
48 We use a binary tree with n leaves representing n objects. [sent-143, score-0.128]
49 It encodes a path distance matrix Dtree between those n objects, and for an optimal tree Dtree = D. [sent-144, score-0.247]
50 Such an optimal tree exists only if D is additive, and the task of finding an approximation is a well-studied problem. [sent-145, score-0.128]
51 We will not discuss the various tree reconstruction algorithms, but only mention that there exist algorithms for reconstructing the closest ultrametric tree (in the ∞ norm) in O(n2 ) time, [14]. [sent-146, score-0.256]
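As a stand-in for this tree-reconstruction step, one cheap option is agglomerative single linkage via SciPy (assuming D holds squared distances). This is only an illustrative choice on our part, not necessarily the algorithm used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def single_linkage_tree(D):
    """Build a hierarchical tree approximating the distance matrix D.
    Returns the (n-1) x 4 merge matrix produced by scipy."""
    # scipy expects a condensed vector of (non-squared) distances
    condensed = squareform(np.sqrt(D), checks=False)
    return linkage(condensed, method='single')
```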
52 Figure 1: From left to right: Unknown samples X, pairwise distances collected in D, closest tree structure and an exemplary building block. [sent-147, score-0.267]
53 A tree metric induced by Dtree is composed of elementary cut (pseudo-)metrics. [sent-148, score-0.165]
54 Any such metric lies in the metric space L1 and is also a member of (L2 )2 , which is the metric part of the space of squared Euclidean distance matrices D. [sent-149, score-0.167]
55 In fact, any matrix Stree has a canonical decomposition into a weighted sum of 2-block partition matrices, which is constructed by cutting all edges (2n − 2 for a rooted tree) and observing the resulting classification of leaf nodes. [sent-151, score-0.167]
56 Suppose we keep track of such an assignment with an indicator 1j induced by a single cut j; then the inner product matrix is Stree = Σ_{j=1}^{2n−2} λj (1j 1j^t + 1̄j 1̄j^t), (13) where λj is the weight of edge j to be cut and 1̄j ∈ {0, 1}^n is the complementary assignment, i.e. [sent-152, score-0.217]
57 Each term (1j 1j^t + 1̄j 1̄j^t) is a 2-block partition matrix. [sent-155, score-0.104]
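A sketch (ours) of assembling Stree from the cut decomposition (13), starting from a SciPy linkage matrix; taking the edge weight λj as the height difference between a node and its parent is our reading of the tree metric, not a prescription from the paper.

```python
import numpy as np

def stree_from_linkage(Z, n):
    """S_tree = sum_j lambda_j (1_j 1_j^t + 1bar_j 1bar_j^t), Eq. (13),
    built from a scipy linkage matrix Z over n leaves."""
    n_nodes = 2 * n - 1                       # n leaves + n-1 internal nodes
    members = [np.zeros(n) for _ in range(n_nodes)]
    for i in range(n):
        members[i][i] = 1.0                   # leaf indicator vectors
    height = np.zeros(n_nodes)
    parent_height = np.zeros(n_nodes)
    for k, row in enumerate(Z):
        a, b, h = int(row[0]), int(row[1]), row[2]
        node = n + k
        members[node] = members[a] + members[b]
        height[node] = h
        parent_height[a] = parent_height[b] = h
    S = np.zeros((n, n))
    for j in range(n_nodes - 1):              # every non-root node has one edge above it
        lam = parent_height[j] - height[j]    # lambda_j (our convention)
        ind = members[j]
        S += lam * (np.outer(ind, ind) + np.outer(1.0 - ind, 1.0 - ind))
    return S
```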
58 The remaining panels show the single-linkage clustering tree, all 2n − 2 = 48 weighted 2-block partition matrices, and the final Stree (= sum of all individual 2-block matrices, rescaled to full gray-value range). [sent-159, score-0.143]
59 Note that single-linkage fails to identify the clusters in the three branches closest to root, but still the structure of B is clearly visible in Stree . [sent-160, score-0.147]
60 Left to right: Partition matrix B for n = 25 objects in 3 clusters, single-linkage tree, all weighted 2-block partition matrices, final Stree . [sent-162, score-0.219]
61 For the proof we need the following lemma: Lemma 1. [sent-166, score-0.128]
62 (of lemma 1) Restating (13) and defining m := 2n − 2, we have Stree y = Σ_{j=1}^{m} λj (1j 1j^t + 1̄j 1̄j^t) y = Σ_{j=1}^{m} λj 1j Σ_{l=1}^{n} 1jl yl + Σ_{j=1}^{m} λj 1̄j Σ_{l=1}^{n} 1̄jl yl = Σ_{j=1}^{m} λj 1j Σ_{l=1}^{n} 1jl yl + Σ_{j=1}^{m} λj 1̄j (Σ_{l=1}^{n} yl − Σ_{l=1}^{n} 1jl yl). (14) [sent-169, score-0.78]
63 Furthermore, assume Ri is the set of all nodes on the branch starting from node i and leading to the tree’s root; then (Stree y)i = Σ_{l=1}^{n} yl Σ_{j∉Ri} λj + Σ_{j∈Ri} λj Σ_{l: j∈Rl} yl − Σ_{j∉Ri} λj Σ_{l: j∈Rl} yl = Σ_{l=1}^{n} yl (Σ_{j=1}^{m} λj − Σ_{j∈Ri} λj) + 2 Σ_{j∈Ri} λj Σ_{l: j∈Rl} yl − Σ_{j=1}^{m} λj Σ_{l: j∈Rl} yl. (15) [sent-171, score-0.71]
64 Note that Σ_{l} yl, Σ_{j} λj and Σ_{j} λj Σ_{l: j∈Rl} yl are constants and can be computed in O(n) time. [sent-172, score-0.199]
65 This can be done directly on the tree structure in two separate traversals: 1. [sent-174, score-0.128]
66 It is important to stress that the above two tree traversals fully describe the complete algorithm. [sent-181, score-0.171]
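To make the two traversals explicit, here is a sketch (ours, with an assumed parent-array tree layout, not the authors' implementation) of computing Stree y in O(n), following the decomposition used in lemma 1.

```python
import numpy as np

def stree_matvec(parent, lam, y):
    """S_tree @ y in O(n) via two tree traversals.

    parent[j]: parent index of node j (root: -1); lam[j]: weight of the edge
    above node j (0 for the root).  Leaves are nodes 0..n-1 and every child
    has a smaller index than its parent (e.g. a scipy linkage ordering).
    """
    y = np.asarray(y, dtype=float)
    n, n_nodes = len(y), len(parent)
    root = n_nodes - 1

    # pass 1 (leaves -> root): t[j] = sum of y over leaves below node j
    t = np.zeros(n_nodes)
    t[:n] = y
    for j in range(root):                 # children come before parents
        t[parent[j]] += t[j]

    # pass 2 (root -> leaves): path sums of lam and lam*t from each node to the root
    L = np.zeros(n_nodes)                 # sum of lam along the path to the root
    P = np.zeros(n_nodes)                 # sum of lam*t along the path to the root
    for j in range(root - 1, -1, -1):     # parents come before children
        L[j] = L[parent[j]] + lam[j]
        P[j] = P[parent[j]] + lam[j] * t[j]

    Y = y.sum()
    Lambda = sum(lam[j] for j in range(root))
    T = sum(lam[j] * t[j] for j in range(root))
    # (S_tree y)_i = 2 P_i + Y (Lambda - L_i) - T for each leaf i, cf. (15)
    idx = np.arange(n)
    return 2.0 * P[idx] + Y * (Lambda - L[idx]) - T
```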
67 (of theorem 1) First, note that only the matrix-vector product a := Ψtree 1 is needed in Qtree S Qtree^t = (I − (1^t Ψtree 1)^{−1} 1 1^t Ψtree) S (I − (1^t Ψtree 1)^{−1} Ψtree 1 1^t) = S − (1/1^t a) 1 a^t S − (1/1^t a) S a 1^t + (1/1^t a)^2 1 a^t S a 1^t. (16) [sent-183, score-0.164]
68 Due to lemma 1, a can be computed in O(n2 ) time and is used in (16) to compute S∗ = Qtree SQt (only matrix-vector products, so O(n2 ) complexity tree is maintained). [sent-186, score-0.128]
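Given a = Ψtree 1, the transformation (16) itself reduces to a few rank-one corrections; a sketch (ours):

```python
import numpy as np

def apply_qtree(S, a):
    """S* = Q_tree S Q_tree^t as in (16), using only matrix-vector products,
    with a = Psi_tree @ 1 (obtainable in O(n^2), e.g. via the tree mat-vec above)."""
    n = S.shape[0]
    one = np.ones(n)
    s = float(one @ a)          # 1^t a
    Sa = S @ a                  # O(n^2)
    aSa = float(a @ Sa)         # a^t S a
    return (S
            - np.outer(one, Sa) / s
            - np.outer(Sa, one) / s
            + (aSa / s**2) * np.outer(one, one))
```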
69 A partition matrix B of size n = 200 containing k = 3 blocks is sampled from which we construct ΣB = α(I + θB). [sent-189, score-0.245]
70 The experiment was repeated 200 times and the quality of the inferred clusters was measured by the adjusted Rand index w.r.t. the true labels. [sent-194, score-0.188]
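For orientation, a sketch (ours; the constants are illustrative, not the paper's exact settings) of the generative model and of scoring an inferred partition with the adjusted Rand index:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, k, d, alpha, theta = 200, 3, 500, 1.0, 5.0   # illustrative values only

labels = rng.integers(k, size=n)                 # sample a partition with k blocks
B = (labels[:, None] == labels[None, :]).astype(float)
Sigma = alpha * (np.eye(n) + theta * B)

# X ~ N(0, Sigma (x) I_d): columns are i.i.d. N(0, Sigma)
Lchol = np.linalg.cholesky(Sigma)
X = Lchol @ rng.standard_normal((n, d))
S = X @ X.T / d
D = np.diag(S)[:, None] + np.diag(S)[None, :] - 2.0 * S   # input to the method

# evaluate an inferred partition (here a dummy one) against the truth
inferred = rng.integers(k, size=n)
print(adjusted_rand_score(labels, inferred))
```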
71 For the hierarchical methods we report two different performance values: splitting the tree such that the ‘true’ number k = 3 of clusters is obtained, and computing the best value among all possible splits into [2, n] clusters (marked ‘*’). [sent-198, score-0.456]
72 The reader should notice that both values are in favour of the hierarchical algorithms, since neither the true k nor the true labels are used for inferring the clusters in the Wishart-type methods. [sent-200, score-0.181]
73 From Fig. 3 we conclude that (i) both ‘naive’ normalisation strategies WD C and WD R are clearly outperformed by TIWD and fastTIWD (‘fTIWD’ in the boxplot). [sent-202, score-0.14]
74 Left half: Partition matrix (top), distance matrix (bottom) and 2D-PCA embedding of a dataset drawn from the generative model. [sent-206, score-0.224]
75 For this purpose we sample again 3 clusters in d = 300 dimensions, but now use a log-normal distribution that tends to produce a high number of ‘atypical’ samples. [sent-212, score-0.147]
76 Note that such a distribution should not induce severe problems for hierarchical methods when optimising the Rand index over all possible tree cuttings, since the ‘atypical’ samples are likely to form singleton clusters while the main structure is still visible in other branches of the tree. [sent-213, score-0.449]
77 As for the fastTIWD model, we want to test if the prior over partitions is flexible enough to introduce additional singleton clusters: In the experiment, it performed at least as well as Ward’s method, and clearly outperformed single- and complete-linkage. [sent-215, score-0.111]
78 We also compared it to the affinity-propagation method (AP), which, however, has severe problems on this dataset, even when optimising the input preference parameter that affects the number of clusters in the partition. [sent-216, score-0.246]
79 As large-scale application we present a semisupervised clustering example which is an upscaled version of an experiment with protein sequences presented in [1]. [sent-222, score-0.153]
80 While traditional semi-supervised classifiers assume at least one labelled object per class, our model is flexible enough to allow additional new clusters that have no counterpart in the subset of labelled objects. [sent-223, score-0.147]
81 We apply this idea on two different databases, one being high quality due to manual annotation with a stringent review process (SwissProt) while the other contains automatically annotated proteins and is not reviewed (TrEMBL). [sent-224, score-0.191]
82 The annotations in SwissProt are used as supervision information resulting in a set of class labels, whereas the proteins in TrEMBL are treated as unlabelled objects, potentially forming new clusters. [sent-225, score-0.129]
83 In contrast to a relatively small set of globin sequences in [1], we extract a total of 12,290 proteins (manually or automatically) annotated as having some role in oxygen transport or binding. [sent-226, score-0.265]
84 The proteins are represented as a matrix of pairwise alignment scores. [sent-228, score-0.257]
85 A subset of 1731 annotated sequences is from SwissProt, resulting in 356 protein classes. [sent-229, score-0.147]
86 Among the 10,559 TrEMBL sequences we could identify 23 new clusters which are dissimilar to any SwissProt proteins, see Fig. [sent-230, score-0.207]
87 Most of the newly identified clusters contain sequences sharing some rare and specific properties. [sent-232, score-0.207]
88 In accordance with the results in [1], we find a large new cluster containing flavohemoglobins from specific species of fungi and bacteria that share a certain domain architecture composed of a globin domain fused with ferredoxin reductase-like FAD- and NAD-binding modules. [sent-233, score-0.142]
89 An additional example is a cluster of proteins with a chemotaxis methyl-accepting receptor domain from a very special class of magnetic bacteria that use it to orient themselves along the earth’s magnetic field. [sent-234, score-0.294]
90 The domain architecture of these proteins involving 6 domains is unique among all sequences in our dataset. [sent-235, score-0.189]
91 Another cluster contains iron-sulfur cluster repair di-iron proteins that build on a polymetallic system, the di-iron center, constituted by two iron ions bridged by two sulfide ions. [sent-236, score-0.261]
92 Figure 5: Partition of all 12,290 proteins into 379 clusters: 356 predefined by sequences from SwissProt and 23 new formed by sequences from TrEMBL (red box). [sent-238, score-0.249]
93 5 Conclusion We have presented a new model for partitioning pairwise distance data, which is motivated by the great success of the TIWD model, shares all its positive properties, and additionally reduces the computational workload from O(n3 ) to O(n2 ) per sweep of the Gibbs sampler. [sent-242, score-0.404]
94 Compared to vectorial representations, pairwise distances do not convey information about translations and rotations of the underlying coordinate system. [sent-243, score-0.212]
95 We show that our construction principle for selecting S∗ among all inner product matrices corresponding to an observed distance matrix D finds an optimal candidate if the true covariance is known. [sent-248, score-0.424]
96 Although it is a pure theoretical guarantee, it is successfully exploited by a simple hierarchical cluster method to produce an initial covariance estimate—all without specifying the number of clusters, which is one of the model’s key properties. [sent-249, score-0.153]
97 On the algorithmic side, we prove that S∗ can be computed in O(n2 ) time using tree traversals. [sent-250, score-0.128]
98 Assuming the number of Gibbs sweeps necessary is independent of n (which, of course, depends on the problem), we now have a probabilistic algorithm for partitioning distance matrices running in O(n2 ) time. [sent-251, score-0.315]
99 Our experiment containing ≈ 12,000 proteins shows that fastTIWD can be successfully used to mine large relational datasets and leads to automatic identification of protein clusters sharing rare structural properties. [sent-254, score-0.363]
100 Assuming that in most clustering problems it is acceptable to obtain a solution within some hours, any further size increase of the input matrix will become more and more a problem of memory capacity rather than computation time. [sent-255, score-0.102]
wordName wordTfidf (topN-words)
[('stree', 0.368), ('tiwd', 0.325), ('wishart', 0.311), ('kb', 0.216), ('sbb', 0.195), ('fasttiwd', 0.173), ('qsqt', 0.173), ('det', 0.163), ('yl', 0.156), ('clusters', 0.147), ('normalisation', 0.14), ('proteins', 0.129), ('tree', 0.128), ('nb', 0.114), ('swissprot', 0.108), ('partition', 0.104), ('wd', 0.104), ('generalised', 0.1), ('tr', 0.091), ('central', 0.091), ('gibbs', 0.088), ('hereby', 0.087), ('sweep', 0.087), ('dtree', 0.087), ('qtree', 0.087), ('sqt', 0.087), ('trembl', 0.087), ('workload', 0.087), ('matrices', 0.081), ('blocks', 0.078), ('partitioning', 0.077), ('distances', 0.074), ('ward', 0.07), ('sweeps', 0.07), ('block', 0.067), ('cluster', 0.066), ('pairwise', 0.065), ('matrix', 0.063), ('dirichlet', 0.061), ('sequences', 0.06), ('ri', 0.06), ('distance', 0.056), ('rand', 0.055), ('sc', 0.055), ('protein', 0.054), ('candidate', 0.053), ('covariance', 0.053), ('optimising', 0.052), ('objects', 0.052), ('pn', 0.05), ('severe', 0.047), ('normal', 0.045), ('xx', 0.045), ('inner', 0.044), ('yj', 0.043), ('globin', 0.043), ('parametrised', 0.043), ('traversals', 0.043), ('vectorial', 0.043), ('geometric', 0.042), ('likelihood', 0.042), ('embedding', 0.042), ('adjusted', 0.041), ('singleton', 0.041), ('qi', 0.041), ('rn', 0.039), ('clustering', 0.039), ('construction', 0.038), ('boxplot', 0.038), ('bn', 0.038), ('cut', 0.037), ('inverse', 0.037), ('determinant', 0.037), ('product', 0.036), ('shifts', 0.036), ('prior', 0.035), ('partitions', 0.035), ('atypical', 0.035), ('basel', 0.035), ('hierarchical', 0.034), ('annotated', 0.033), ('bacteria', 0.033), ('magnetic', 0.033), ('mine', 0.033), ('shares', 0.032), ('toy', 0.032), ('analogy', 0.032), ('normalised', 0.031), ('rightmost', 0.031), ('probabilistic', 0.031), ('hours', 0.03), ('id', 0.03), ('trace', 0.03), ('member', 0.03), ('applicability', 0.03), ('sinica', 0.03), ('statistica', 0.03), ('rotations', 0.03), ('transformation', 0.029), ('process', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
2 0.1065561 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
3 0.10117716 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
4 0.088503458 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
Author: Le Song, Eric P. Xing, Ankur P. Parikh
Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1
5 0.081537284 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
6 0.078648426 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
7 0.071628094 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
8 0.069855839 66 nips-2011-Crowdclustering
9 0.06975621 198 nips-2011-On U-processes and clustering performance
10 0.068888307 54 nips-2011-Co-regularized Multi-view Spectral Clustering
11 0.067012258 22 nips-2011-Active Ranking using Pairwise Comparisons
12 0.065786421 226 nips-2011-Projection onto A Nonnegative Max-Heap
13 0.064187072 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
14 0.063943423 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.062793605 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
16 0.061772242 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations
17 0.061334196 104 nips-2011-Generalized Beta Mixtures of Gaussians
18 0.06132808 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
19 0.061068334 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
20 0.058205094 221 nips-2011-Priors over Recurrent Continuous Time Processes
topicId topicWeight
[(0, 0.202), (1, 0.047), (2, -0.042), (3, -0.03), (4, -0.04), (5, -0.082), (6, -0.02), (7, -0.035), (8, 0.09), (9, -0.037), (10, 0.021), (11, 0.072), (12, -0.006), (13, -0.042), (14, 0.007), (15, 0.052), (16, -0.054), (17, -0.01), (18, 0.034), (19, -0.055), (20, 0.066), (21, -0.027), (22, 0.016), (23, -0.016), (24, -0.121), (25, 0.017), (26, 0.023), (27, 0.035), (28, -0.004), (29, -0.017), (30, -0.052), (31, 0.058), (32, -0.026), (33, 0.049), (34, -0.059), (35, 0.04), (36, 0.028), (37, 0.045), (38, -0.104), (39, 0.047), (40, -0.024), (41, -0.098), (42, 0.004), (43, -0.057), (44, 0.042), (45, 0.019), (46, 0.042), (47, -0.023), (48, 0.053), (49, -0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.93945223 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
2 0.60485089 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
Author: Yee W. Teh, Charles Blundell, Lloyd Elliott
Abstract: We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. 1
3 0.60280472 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
Author: Jiayu Zhou, Jianhui Chen, Jieping Ye
Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1
4 0.58504313 226 nips-2011-Projection onto A Nonnegative Max-Heap
Author: Jun Liu, Liang Sun, Jieping Ye
Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative maxheap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal roottree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p2 ) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1
5 0.57586104 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
Author: Patrick O. Perry, Michael W. Mahoney
Abstract: Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which 2 -regularized or 1 -regularized 2 -regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the MahoneyOrecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm. 1
6 0.56659043 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
7 0.54465443 221 nips-2011-Priors over Recurrent Continuous Time Processes
8 0.54451007 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
9 0.54382265 186 nips-2011-Noise Thresholds for Spectral Clustering
10 0.54051441 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations
11 0.53671467 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
12 0.53310764 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
13 0.52778202 95 nips-2011-Fast and Accurate k-means For Large Datasets
14 0.525392 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
15 0.51508808 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression
16 0.51415247 131 nips-2011-Inference in continuous-time change-point models
17 0.51271337 241 nips-2011-Scalable Training of Mixture Models via Coresets
18 0.50970447 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
19 0.50797403 55 nips-2011-Collective Graphical Models
20 0.50156897 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
topicId topicWeight
[(0, 0.037), (3, 0.267), (4, 0.048), (20, 0.055), (26, 0.016), (31, 0.065), (33, 0.052), (43, 0.05), (45, 0.074), (57, 0.057), (74, 0.074), (83, 0.054), (84, 0.028), (99, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.7629236 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
2 0.7334677 101 nips-2011-Gaussian process modulated renewal processes
Author: Yee W. Teh, Vinayak Rao
Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1
3 0.66691232 33 nips-2011-An Exact Algorithm for F-Measure Maximization
Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier
Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1
4 0.63966173 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
Author: Jun Zhu, Ning Chen, Eric P. Xing
Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
5 0.53065312 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei
Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.
6 0.52659392 227 nips-2011-Pylon Model for Semantic Segmentation
7 0.51584911 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
8 0.51256686 303 nips-2011-Video Annotation and Tracking with Active Learning
9 0.51176119 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
10 0.51099825 186 nips-2011-Noise Thresholds for Spectral Clustering
11 0.51081866 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
12 0.51025283 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
13 0.51001287 66 nips-2011-Crowdclustering
14 0.509148 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
15 0.50799412 276 nips-2011-Structured sparse coding via lateral inhibition
16 0.50752324 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
17 0.50736713 55 nips-2011-Collective Graphical Models
18 0.50615352 102 nips-2011-Generalised Coupled Tensor Factorisation
19 0.50598603 219 nips-2011-Predicting response time and error rates in visual search
20 0.50399238 127 nips-2011-Image Parsing with Stochastic Scene Grammar