nips nips2011 nips2011-43 knowledge-graph by maker-knowledge-mining

43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data


Source: pdf

Author: David Adametz, Volker Roth

Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: A Bayesian approach to partitioning distance matrices is presented. [sent-4, score-0.214]

2 It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. [sent-5, score-0.269]

3 1 Introduction In cluster analysis we are concerned with identifying subsets of n objects that share some similarity and therefore potentially belong to the same sub-population. [sent-10, score-0.118]

4 Many practical applications leave us without direct access to vectorial representations and instead only supply pairwise distance measures collected in a matrix D. [sent-11, score-0.227]

5 This poses a serious challenge, because much of the geometric information that could otherwise help to discover hidden structures is lost. [sent-12, score-0.129]

6 Its main drawback, however, is the high computational cost of order O(n3 ) per sweep of a Gibbs sampler, limiting its applicability to relatively small data sets. [sent-15, score-0.117]

7 In this work we present an alternative method which shares all the positive properties of the TIWD while reducing the computational workload to O(n2 ) per Gibbs sweep. [sent-16, score-0.119]

8 The main idea is to solve the problem of missing geometric information by a normalisation procedure, which chooses one particular geometric embedding of the distance data and allows us to use a simple probabilistic model for inferring the unknown underlying partition. [sent-18, score-0.353]

9 The construction we use is guaranteed to give the optimal such geometric embedding if the true partition were known. [sent-19, score-0.226]

10 Of course, this is only a hypothetical precondition, but we show that even rough prior estimates of the true partition significantly outperform ‘naive’ embedding strategies. [sent-20, score-0.181]

11 Using a simple hierarchical clustering model to produce such prior estimates leads to clusterings that are at least of the same quality as those obtained by the original TIWD. [sent-21, score-0.108]

12 The algorithmic contribution here is an efficient algorithm for performing this normalisation procedure in O(n2 ) time, which makes the whole pipeline from distance matrix to inferred partition an O(n2 ) process (assuming a constant number of Gibbs sweeps). [sent-22, score-0.392]

13 In the next section we introduce a probabilistic model for partitioning inner product matrices, which is generalised in section 3 to distance matrices using a preprocessing step that breaks the geometric symmetry inherent in distance representations. [sent-30, score-0.523]

14 2 A Wishart Model for Partitioning Inner Product Matrices Suppose there is a matrix X ∈ Rn×d representing n objects in Rd that belong to one of k subpopulations. [sent-32, score-0.115]

15 For convenience we define the generalised central Wishart distribution, which also allows rank-deficient S and/or Σ, as $p(S\,|\,\Psi, d) \propto \det(S)^{\frac{1}{2}(d-n-1)}\, \det(\Psi)^{\frac{d}{2}} \exp\left(-\tfrac{d}{2}\,\mathrm{tr}(\Psi S)\right)$ (1), where det(•) is the product of non-zero eigenvalues and Ψ denotes the (generalised) inverse of Σ. [sent-43, score-0.264]
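
To make the density in (1) concrete, here is a small numpy sketch of how the generalised log-density could be evaluated; the pseudo-determinant helper logpdet, the symmetry assumption and the tolerance value are our own illustrative choices, not part of the paper.

    import numpy as np

    def logpdet(A, tol=1e-10):
        """Log pseudo-determinant: sum of the logs of the non-zero eigenvalues."""
        w = np.linalg.eigvalsh(A)      # A is assumed symmetric
        return np.log(w[w > tol]).sum()

    def log_gen_wishart(S, Psi, d):
        """Unnormalised log-density of the generalised central Wishart in (1)."""
        n = S.shape[0]
        return (0.5 * (d - n - 1) * logpdet(S)
                + 0.5 * d * logpdet(Psi)
                - 0.5 * d * np.trace(Psi @ S))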

16 Note that the inverse of a block diagonal matrix is also block diagonal, so we can formulate the prior in terms of Σ, which is easier to parametrise. [sent-49, score-0.269]

17 For this purpose we adapt the method in [2] using a Multinomial-Dirichlet process model [3, 4, 5] to define a flexible prior distribution over block matrices without specifying the exact number of blocks. [sent-50, score-0.212]

18 A partition B ∈ Bn can be represented in matrix form as B(i, j) = 1 if y(i) = y(j) and B(i, j) = 0 otherwise, with y being a function that maps [n] to some label set L. [sent-53, score-0.167]
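
As a small illustration (not taken from the paper), a label function y stored as an integer vector yields the matrix form of B by a single outer comparison:

    import numpy as np

    y = np.array([0, 0, 1, 2, 1])                  # hypothetical label assignment
    B = (y[:, None] == y[None, :]).astype(float)   # B(i, j) = 1 iff y(i) = y(j)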

19 A partition process is a series of distributions Pn on the set Bn in which Pn is the marginal of Pn+1 . [sent-55, score-0.133]

20 Chinese Restaurant Process), nb is the number of objects in block b and kB ≤ k is the total number of blocks in B. [sent-63, score-0.311]

21 The prior is exchangeable meaning rows and columns can be (jointly) permuted arbitrarily and therefore partition matrices can always be brought to block diagonal form. [sent-64, score-0.287]

22 The final block diagonal covariance matrix used in (2) has the form $\Sigma = \Psi^{-1} = \alpha(I_n + \theta B)$, with $\theta := \beta/\alpha$. [sent-67, score-0.183]
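
Continuing the sketch above, the implied covariance can be assembled directly from B; the values of alpha and theta below are arbitrary placeholders:

    alpha, theta = 1.0, 2.0                 # illustrative values; theta = beta / alpha
    n = B.shape[0]
    Sigma = alpha * (np.eye(n) + theta * B) # block diagonal once objects are sorted by label
    Psi = np.linalg.inv(Sigma)              # the inverse shares the same block structure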

23 Multiplying the Wishart likelihood (2), the prior over partitions (3) and suitable priors over α, θ gives the joint posterior. [sent-69, score-0.112]

24 Note that the (usually unknown) degree of freedom d has the formal role of an annealing parameter, and it can indeed be used to ‘cool’ the Markov chain by increasing d, if desired, until a partition is ‘frozen’. [sent-76, score-0.104]

25 In one sweep of the Gibbs sampler, we have to iteratively compute the membership probability of one object indexed by i to the kB currently existing blocks in partition B (plus one new block), given the assignments for the n − 1 remaining ones denoted by the superscript (−i) [7, 8]. [sent-78, score-0.269]

26 In every step of this inner loop over the kB existing blocks we have to evaluate the Wishart likelihood. [sent-79, score-0.122]

27 Given the trace tr^{(−i)}, we update S_{bb}^{(−i)} for the kB blocks b ∈ B, which in total needs O(n) operations. [sent-82, score-0.108]

28 Given det^{(−i)}, the computation of all kB updated determinants induces costs of O(kB). [sent-83, score-0.163]

29 In total, there are n objects, so a full sweep requires O(n2 + nkB) operations, which is equal to O(n2) since the maximum number of blocks is n, i.e. kB ≤ n. [sent-84, score-0.165]

30 Compared to the original TIWD, the worst case complexity in the Dirichlet process model with an infinite number of blocks in the population, k = ∞, is reduced from O(n3 ) to O(n2 ) . [sent-88, score-0.107]

31 3 The fastTIWD Model for Partitioning Distance Matrices Consider now the case where S is not accessible, but only squared pairwise distances D ∈ Rn×n : D(i, j) = S(i, i) + S(j, j) − 2 S(i, j). [sent-89, score-0.139]
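
The mapping from an inner product matrix S to the squared pairwise distances D is elementary; a one-line numpy version (our own sketch) is:

    import numpy as np

    def squared_distances(S):
        """D(i, j) = S(i, i) + S(j, j) - 2 S(i, j)."""
        s = np.diag(S)
        return s[:, None] + s[None, :] - 2.0 * S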

32 In fact, if S∗ ∼ W(Σ), the distribution of a general S ∈ S(D) is non-central Wishart, which can be easily seen as follows: S(D) is exactly the set of inner product matrices that can be constructed by varying c ∈ Rd in a modified matrix normal model X ∼ N(M, Σ ⊗ Id) with mean matrix $M = \mathbf{1}_n c^t$. [sent-94, score-0.332]

33 Note further that ‘shifts’ ci do not affect pairwise distances between rows in X. [sent-96, score-0.139]

34 The modified matrix normal distribution implies that $S = \frac{1}{d} X X^t$ is non-central Wishart, $S \sim \mathcal{W}(\Sigma, \Theta)$, with non-centrality matrix $\Theta := \Sigma^{-1} M M^t$. [sent-97, score-0.171]

35 This normalisation procedure can be implemented solely based on S, leading to the well-known centering procedure in kernel PCA [10]: $S_c = Q_I S Q_I^t$, with projection $Q_I = I - (1/n)\mathbf{1}\mathbf{1}^t$ (9). [sent-106, score-0.14]

36 Contrary to the PCA setting, however, this column normalisation induced by QI does not work well here, because the elements of a column vector in X are not independent. [sent-107, score-0.14]
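
For comparison, the naive kernel-PCA centering of (9), which the paper argues against in this setting, is a minimal sketch:

    import numpy as np

    def center_kernel(S):
        """S_c = Q_I S Q_I^t with Q_I = I - (1/n) 1 1^t, the column normalisation of (9)."""
        n = S.shape[0]
        Q = np.eye(n) - np.ones((n, n)) / n
        return Q @ S @ Q.T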

37 In doing so, we not only remove the shifts ci, but also alter the distribution: the non-centrality matrix does not vanish in general and, as a result, Sc is no longer central Wishart distributed. [sent-109, score-0.19]

38 In the following we present a solution to the problem of finding a candidate matrix S∗, which recasts inference based on the translation-invariant Wishart distribution as a method to reconstruct the optimal S∗. [sent-110, score-0.116]

39 Our proposal is guided by a particular analogy between trees and partition matrices and aims at exploiting a tree-structure to guarantee low computational costs. [sent-111, score-0.217]

40 Assuming that S∗ ∼ Wd (Σ), the distribution of an arbitrary member S ∈ S(D) can be derived analytically as a generalised central Wishart distribution with a rank-deficient covariance, see [2]. [sent-115, score-0.221]

41 Its likelihood in the rank-deficient inverse covariance matrix $\widetilde{\Psi}$ is $L(\widetilde{\Psi}) \propto \det(\widetilde{\Psi})^{\frac{d}{2}} \exp\left(-\tfrac{d}{2}\,\mathrm{tr}(\widetilde{\Psi} S^*)\right) = \det(\widetilde{\Psi})^{\frac{d}{2}} \exp\left(\tfrac{d}{4}\,\mathrm{tr}(\widetilde{\Psi} D)\right)$ (10), with $\widetilde{\Psi} = \Psi - (\mathbf{1}^t \Psi \mathbf{1})^{-1}\, \Psi \mathbf{1}\mathbf{1}^t \Psi$. [sent-116, score-0.195]
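
A sketch of how the log of (10) could be evaluated directly from D and a given Ψ, reusing the logpdet helper from above; this is our own reading of the formula, not code from the paper:

    def tiwd_loglik(Psi, D, d):
        """Log of (10): likelihood in Psi_tilde, evaluated from the distance matrix D."""
        one = np.ones(Psi.shape[0])
        a = Psi @ one                               # Psi 1  (Psi is symmetric)
        Psi_t = Psi - np.outer(a, a) / (one @ a)    # Psi_tilde = Psi - (1^t Psi 1)^{-1} Psi 1 1^t Psi
        return 0.5 * d * logpdet(Psi_t) + 0.25 * d * np.trace(Psi_t @ D)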

42 By treating Q as a fixed matrix, this expression can also be seen as a central Wishart in the transformed matrix $S^* = Q S Q^t$, parametrised by the full-rank matrix Ψ if det(Ψ) is substituted by the appropriate normalisation term $\det(\widetilde{\Psi})$. [sent-130, score-0.4]

43 From this viewpoint, inference using the translation-invariant Wishart distribution can be interpreted as finding a (rank-deficient) representative S∗ = QSQt ∈ S(D) which follows a generalised central Wishart distribution with full-rank inverse covariance matrix Ψ. [sent-131, score-0.344]

44 Thus S∗ can be seen as an optimal candidate inner-product matrix in the set S(D) for a central Wishart model parametrised by Ψ. [sent-133, score-0.25]

45 If, on the other hand, we had some initial estimate of Ψ, we could find a reasonable transformation Q and thereby a reasonable candidate S∗. [sent-136, score-0.169]

46 Note that even if the estimate of Ψ is far away from the true inverse covariance, the pairwise distances are at least guaranteed not to change under $Q' S (Q')^t$. [sent-137, score-0.176]

47 Our construction is guided by an analogy between binary trees and weighted sums of cut matrices, which are binary complements of partition matrices with two blocks. [sent-142, score-0.292]

48 We use a binary tree with n leaves representing n objects. [sent-143, score-0.128]

49 It encodes a path distance matrix Dtree between those n objects, and for an optimal tree Dtree = D. [sent-144, score-0.247]

50 Such an optimal tree exists only if D is additive, and the task of finding an approximation is a well-studied problem. [sent-145, score-0.128]

51 We will not discuss the various tree reconstruction algorithms, but only mention that there exist algorithms for reconstructing the closest ultrametric tree (in the $\ell_\infty$ norm) in O(n2) time, [14]. [sent-146, score-0.256]
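
One concrete (but not prescribed) way to obtain such a tree approximation from D is standard agglomerative clustering; a SciPy sketch for a single-linkage tree is shown below, where taking the square root of the squared distances is our own choice and does not change the tree topology:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # D: symmetric matrix of squared pairwise distances with zero diagonal
    Z = linkage(squareform(np.sqrt(D), checks=False), method='single')
    # Z encodes a rooted binary tree with n leaves and 2n - 2 edges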

52 Figure 1: From left to right: Unknown samples X, pairwise distances collected in D, closest tree structure and an exemplary building block. [sent-147, score-0.267]

53 A tree metric induced by Dtree is composed of elementary cut (pseudo-)metrics. [sent-148, score-0.165]

54 Any such metric lies in the metric space $\ell_1$ and is also a member of $(\ell_2)^2$, which is the metric part of the space of squared Euclidean distance matrices D. [sent-149, score-0.167]

55 In fact, any matrix Stree has a canonical decomposition into a weighted sum of 2-block partition matrices, which is constructed by cutting all edges (2n − 2 for a rooted tree) and observing the resulting classification of leaf nodes. [sent-151, score-0.167]

56 Suppose we keep track of such an assignment with the indicator $1_j$ induced by a single cut j; then the inner product matrix is $S_{tree} = \sum_{j=1}^{2n-2} \lambda_j \left(1_j 1_j^t + \bar{1}_j \bar{1}_j^t\right)$ (13), where $\lambda_j$ is the weight of edge j to be cut and $\bar{1}_j \in \{0, 1\}^n$ is the complementary assignment, i.e. $\bar{1}_j = \mathbf{1} - 1_j$. [sent-152, score-0.217]

57 Each term $\left(1_j 1_j^t + \bar{1}_j \bar{1}_j^t\right)$ is a 2-block partition matrix. [sent-155, score-0.104]
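
A naive sketch of the decomposition in (13): given a rooted tree stored as a parent array with edge weights and, for every node, a mask of the leaves below it, S_tree is accumulated cut by cut. The representation and names are hypothetical, and the loop is for exposition only, not part of the O(n2) pipeline.

    import numpy as np

    def stree_from_cuts(parent, weight, leaves_under, n):
        """Assemble S_tree as the weighted sum of 2-block partition matrices, as in (13)."""
        S = np.zeros((n, n))
        for j in range(len(parent)):
            if parent[j] < 0:                      # the root has no edge above it
                continue
            ind = leaves_under[j].astype(float)    # 1_j: leaves in the subtree cut off by edge j
            S += weight[j] * (np.outer(ind, ind) + np.outer(1 - ind, 1 - ind))
        return S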

58 The remaining panels show the single-linkage clustering tree, all 2n − 2 = 48 weighted 2-block partition matrices, and the final Stree (= sum of all individual 2-block matrices, rescaled to full gray-value range). [sent-159, score-0.143]

59 Note that single-linkage fails to identify the clusters in the three branches closest to root, but still the structure of B is clearly visible in Stree . [sent-160, score-0.147]

60 Left to right: Partition matrix B for n = 25 objects in 3 clusters, single-linkage tree, all weighted 2-block partition matrices, final Stree . [sent-162, score-0.219]

61 For the proof we need the following lemma: Lemma 1. [sent-166, score-0.128]

62 (of lemma 1) Restating (13) and defining m := 2n − 2, we have $S_{tree}\, y = \sum_{j=1}^{m} \lambda_j \left(1_j 1_j^t + \bar{1}_j \bar{1}_j^t\right) y = \sum_{j=1}^{m} \lambda_j 1_j \sum_{l=1}^{n} 1_{jl}\, y_l + \sum_{j=1}^{m} \lambda_j \bar{1}_j \sum_{l=1}^{n} \bar{1}_{jl}\, y_l$ (14). [sent-169, score-0.78]

63 Furthermore, assume Ri is the set of all nodes on the branch starting from node i and leading to the tree’s root; then $(S_{tree}\, y)_i = \left(\sum_{l=1}^{n} y_l\right)\left(\sum_{j=1}^{m} \lambda_j - \sum_{j \in R_i} \lambda_j\right) + 2\sum_{j \in R_i} \lambda_j \tilde{y}_j - \sum_{j=1}^{m} \lambda_j \tilde{y}_j$ (15), where $\tilde{y}_j := \sum_{l:\, j \in R_l} y_l$ is the sum of y over the leaves below edge j. [sent-171, score-0.71]

64 Note that $\sum_{l=1}^{n} y_l$, $\sum_{j=1}^{m} \lambda_j$ and $\sum_{j=1}^{m} \lambda_j \tilde{y}_j$ are constants and can be computed in O(n) time. [sent-172, score-0.199]
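
Below is a sketch of the two traversals behind Lemma 1, under our own node-numbering convention (children have smaller indices than their parents, the root comes last, weight[root] = 0, and y is ordered by leaf node index); it realises the reconstructed formula (15) with one bottom-up and one top-down pass.

    import numpy as np

    def stree_matvec(parent, weight, is_leaf, y):
        """Compute S_tree @ y in linear time via two tree traversals (sketch of Lemma 1)."""
        m = len(parent)
        ysub = np.zeros(m)                  # bottom-up: subtree sums y~_v
        ysub[is_leaf] = y
        for v in range(m - 1):              # children are visited before their parents
            ysub[parent[v]] += ysub[v]

        lam_path = np.zeros(m)              # top-down: sum of lambda_j over R_i
        lamy_path = np.zeros(m)             # top-down: sum of lambda_j * y~_j over R_i
        for v in range(m - 2, -1, -1):      # parents are visited before their children
            lam_path[v] = lam_path[parent[v]] + weight[v]
            lamy_path[v] = lamy_path[parent[v]] + weight[v] * ysub[v]

        tot_y, tot_lam, tot_lamy = y.sum(), weight.sum(), (weight * ysub).sum()
        out = tot_y * (tot_lam - lam_path) + 2.0 * lamy_path - tot_lamy
        return out[is_leaf]                 # entries of S_tree y at the n leaves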

65 This can be done directly on the tree structure in two separate traversals: 1. [sent-174, score-0.128]

66 It is important to stress that the above two tree traversals fully describe the complete algorithm. [sent-181, score-0.171]

67 (of theorem 1) First, note that only the matrix-vector product $a := \Psi_{tree} \mathbf{1}$ is needed in $Q_{tree} S Q_{tree}^t = \left(I - \tfrac{1}{\mathbf{1}^t \Psi_{tree} \mathbf{1}} \mathbf{1}\mathbf{1}^t \Psi_{tree}\right) S \left(I - \tfrac{1}{\mathbf{1}^t \Psi_{tree} \mathbf{1}} \Psi_{tree} \mathbf{1}\mathbf{1}^t\right) = S - (1/\mathbf{1}^t a)\, \mathbf{1} a^t S - (1/\mathbf{1}^t a)\, S a \mathbf{1}^t + (1/\mathbf{1}^t a)^2\, \mathbf{1} a^t S a \mathbf{1}^t$ (16). [sent-183, score-0.164]

68 Due to lemma 1, a can be computed in O(n2) time and is used in (16) to compute $S^* = Q_{tree} S Q_{tree}^t$ (only matrix-vector products, so O(n2) complexity is maintained). [sent-186, score-0.128]
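
A sketch of (16) using only matrix-vector products, given a = Psi_tree 1; the function name and argument layout are ours:

    import numpy as np

    def transform_S(S, a):
        """S* = Q_tree S Q_tree^t as in (16), keeping the O(n^2) cost."""
        one = np.ones(S.shape[0])
        c = 1.0 / (one @ a)                  # 1 / (1^t a)
        Sa = S @ a                           # one O(n^2) matrix-vector product
        aSa = a @ Sa                         # scalar a^t S a
        return (S
                - c * np.outer(one, Sa)      # - (1/1^t a) 1 a^t S
                - c * np.outer(Sa, one)      # - (1/1^t a) S a 1^t
                + (c ** 2) * aSa * np.outer(one, one))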

69 A partition matrix B of size n = 200 containing k = 3 blocks is sampled, from which we construct ΣB = α(I + θB). [sent-189, score-0.245]

70 The experiment was repeated 200 times and the quality of the inferred clusters was measured by the adjusted Rand index w.r.t. the true labels. [sent-194, score-0.188]
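
The adjusted Rand index itself is available off the shelf, e.g. in scikit-learn; labels_true and labels_inferred below are placeholders for the true and the sampled partitions:

    from sklearn.metrics import adjusted_rand_score

    ari = adjusted_rand_score(labels_true, labels_inferred)   # 1.0 means perfect agreement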

71 For the hierarchical methods we report two different performance values: splitting the tree such that the ‘true’ number k = 3 of clusters is obtained and computing the best value among all possible splits into [2, n] clusters (‘*. [sent-198, score-0.456]

72 The reader should notice that both values are in favour of the hierarchical algorithms, since neither the true k nor the true labels are used for inferring the clusters in the Wishart-type methods. [sent-200, score-0.181]

73 From Fig. 3 we conclude that (i) both ‘naive’ normalisation strategies WD C and WD R are clearly outperformed by TIWD and fastTIWD (‘fTIWD’ in the boxplot). [sent-202, score-0.108]

74 Left half: Partition matrix (top), distance matrix (bottom) and 2D-PCA embedding of a dataset drawn from the generative model. [sent-206, score-0.224]

75 For this purpose we again sample 3 clusters in d = 300 dimensions, but now use a log-normal distribution that tends to produce a high number of ‘atypical’ samples. [sent-212, score-0.147]

76 Note that such a distribution should not induce severe problems for hierarchical methods when optimising the Rand index over all possible tree cuttings, since the ‘atypical’ samples are likely to form singleton clusters while the main structure is still visible in other branches of the tree. [sent-213, score-0.449]

77 As for the fastTIWD model, we want to test if the prior over partitions is flexible enough to introduce additional singleton clusters: In the experiment, it performed at least as well as Ward’s method, and clearly outperformed single- and complete-linkage. [sent-215, score-0.111]

78 We also compared it to the affinity-propagation method (AP), which, however, has severe problems on this dataset, even when optimising the input preference parameter that affects the number of clusters in the partition. [sent-216, score-0.246]

79 As a large-scale application we present a semi-supervised clustering example which is an upscaled version of an experiment with protein sequences presented in [1]. [sent-222, score-0.153]

80 While traditional semi-supervised classifiers assume at least one labelled object per class, our model is flexible enough to allow additional new clusters that have no counterpart in the subset of labelled objects. [sent-223, score-0.147]

81 We apply this idea to two different databases: one is of high quality due to manual annotation with a stringent review process (SwissProt), while the other contains automatically annotated proteins and is not reviewed (TrEMBL). [sent-224, score-0.191]

82 The annotations in SwissProt are used as supervision information resulting in a set of class labels, whereas the proteins in TrEMBL are treated as unlabelled objects, potentially forming new clusters. [sent-225, score-0.129]

83 In contrast to the relatively small set of globin sequences in [1], we extract a total of 12,290 proteins that are (manually or automatically) annotated to have some role in oxygen transport or binding. [sent-226, score-0.265]

84 The proteins are represented as a matrix of pairwise alignment scores. [sent-228, score-0.257]

85 A subset of 1731 annotated sequences is from SwissProt, resulting in 356 protein classes. [sent-229, score-0.147]

86 Among the 10,559 TrEMBL sequences we could identify 23 new clusters which are dissimilar to any SwissProt proteins, see Fig. 5. [sent-230, score-0.207]

87 Most of the newly identified clusters contain sequences sharing some rare and specific properties. [sent-232, score-0.207]

88 In accordance with the results in [1], we find a large new cluster containing flavohemoglobins from specific species of fungi and bacteria that share a certain domain architecture composed of a globin domain fused with ferredoxin reductase-like FAD- and NAD-binding modules. [sent-233, score-0.142]

89 An additional example is a cluster of proteins with a chemotaxis methyl-accepting receptor domain from a special class of magnetic bacteria that use it to orient themselves according to the earth’s magnetic field. [sent-234, score-0.294]

90 The domain architecture of these proteins involving 6 domains is unique among all sequences in our dataset. [sent-235, score-0.189]

91 Another cluster contains iron-sulfur cluster repair di-iron proteins that build on a polymetallic system, the di-iron center, constituted by two iron ions bridged by two sulfide ions. [sent-236, score-0.261]

92 Figure 5: Partition of all 12,290 proteins into 379 clusters: 356 predefined by sequences from SwissProt and 23 new ones formed by sequences from TrEMBL (red box). [sent-238, score-0.249]

93 5 Conclusion We have presented a new model for partitioning pairwise distance data, which is motivated by the great success of the TIWD model, shares all its positive properties, and additionally reduces the computational workload from O(n3 ) to O(n2 ) per sweep of the Gibbs sampler. [sent-242, score-0.404]

94 Compared to vectorial representations, pairwise distances do not convey information about translations and rotations of the underlying coordinate system. [sent-243, score-0.212]

95 We show that our construction principle for selecting S∗ among all inner product matrices corresponding to an observed distance matrix D finds an optimal candidate if the true covariance were known. [sent-248, score-0.424]

96 Although this is a purely theoretical guarantee, it is successfully exploited by a simple hierarchical cluster method to produce an initial covariance estimate, all without specifying the number of clusters, which is one of the model’s key properties. [sent-249, score-0.153]

97 On the algorithmic side, we prove that S∗ can be computed in O(n2 ) time using tree traversals. [sent-250, score-0.128]

98 Assuming the number of Gibbs sweeps necessary is independent of n (which, of course, depends on the problem), we now have a probabilistic algorithm for partitioning distance matrices running in O(n2 ) time. [sent-251, score-0.315]

99 Our experiment containing ≈ 12,000 proteins shows that fastTIWD can be successfully used to mine large relational datasets and leads to automatic identification of protein clusters sharing rare structural properties. [sent-254, score-0.363]

100 Assuming that in most clustering problems it is acceptable to obtain a solution within a few hours, any further increase in the size of the input matrix will increasingly become a problem of memory capacity rather than computation time. [sent-255, score-0.102]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stree', 0.368), ('tiwd', 0.325), ('wishart', 0.311), ('kb', 0.216), ('sbb', 0.195), ('fasttiwd', 0.173), ('qsqt', 0.173), ('det', 0.163), ('yl', 0.156), ('clusters', 0.147), ('normalisation', 0.14), ('proteins', 0.129), ('tree', 0.128), ('nb', 0.114), ('swissprot', 0.108), ('partition', 0.104), ('wd', 0.104), ('generalised', 0.1), ('tr', 0.091), ('central', 0.091), ('gibbs', 0.088), ('hereby', 0.087), ('sweep', 0.087), ('dtree', 0.087), ('qtree', 0.087), ('sqt', 0.087), ('trembl', 0.087), ('workload', 0.087), ('matrices', 0.081), ('blocks', 0.078), ('partitioning', 0.077), ('distances', 0.074), ('ward', 0.07), ('sweeps', 0.07), ('block', 0.067), ('cluster', 0.066), ('pairwise', 0.065), ('matrix', 0.063), ('dirichlet', 0.061), ('sequences', 0.06), ('ri', 0.06), ('distance', 0.056), ('rand', 0.055), ('sc', 0.055), ('protein', 0.054), ('candidate', 0.053), ('covariance', 0.053), ('optimising', 0.052), ('objects', 0.052), ('pn', 0.05), ('severe', 0.047), ('normal', 0.045), ('xx', 0.045), ('inner', 0.044), ('yj', 0.043), ('globin', 0.043), ('parametrised', 0.043), ('traversals', 0.043), ('vectorial', 0.043), ('geometric', 0.042), ('likelihood', 0.042), ('embedding', 0.042), ('adjusted', 0.041), ('singleton', 0.041), ('qi', 0.041), ('rn', 0.039), ('clustering', 0.039), ('construction', 0.038), ('boxplot', 0.038), ('bn', 0.038), ('cut', 0.037), ('inverse', 0.037), ('determinant', 0.037), ('product', 0.036), ('shifts', 0.036), ('prior', 0.035), ('partitions', 0.035), ('atypical', 0.035), ('basel', 0.035), ('hierarchical', 0.034), ('annotated', 0.033), ('bacteria', 0.033), ('magnetic', 0.033), ('mine', 0.033), ('shares', 0.032), ('toy', 0.032), ('analogy', 0.032), ('normalised', 0.031), ('rightmost', 0.031), ('probabilistic', 0.031), ('hours', 0.03), ('id', 0.03), ('trace', 0.03), ('member', 0.03), ('applicability', 0.03), ('sinica', 0.03), ('statistica', 0.03), ('rotations', 0.03), ('transformation', 0.029), ('process', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

Author: David Adametz, Volker Roth

Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1

2 0.1065561 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

3 0.10117716 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

4 0.088503458 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

5 0.081537284 258 nips-2011-Sparse Bayesian Multi-Task Learning

Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau

Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1

6 0.078648426 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

7 0.071628094 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

8 0.069855839 66 nips-2011-Crowdclustering

9 0.06975621 198 nips-2011-On U-processes and clustering performance

10 0.068888307 54 nips-2011-Co-regularized Multi-view Spectral Clustering

11 0.067012258 22 nips-2011-Active Ranking using Pairwise Comparisons

12 0.065786421 226 nips-2011-Projection onto A Nonnegative Max-Heap

13 0.064187072 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

14 0.063943423 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

15 0.062793605 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

16 0.061772242 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

17 0.061334196 104 nips-2011-Generalized Beta Mixtures of Gaussians

18 0.06132808 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

19 0.061068334 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

20 0.058205094 221 nips-2011-Priors over Recurrent Continuous Time Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, 0.047), (2, -0.042), (3, -0.03), (4, -0.04), (5, -0.082), (6, -0.02), (7, -0.035), (8, 0.09), (9, -0.037), (10, 0.021), (11, 0.072), (12, -0.006), (13, -0.042), (14, 0.007), (15, 0.052), (16, -0.054), (17, -0.01), (18, 0.034), (19, -0.055), (20, 0.066), (21, -0.027), (22, 0.016), (23, -0.016), (24, -0.121), (25, 0.017), (26, 0.023), (27, 0.035), (28, -0.004), (29, -0.017), (30, -0.052), (31, 0.058), (32, -0.026), (33, 0.049), (34, -0.059), (35, 0.04), (36, 0.028), (37, 0.045), (38, -0.104), (39, 0.047), (40, -0.024), (41, -0.098), (42, 0.004), (43, -0.057), (44, 0.042), (45, 0.019), (46, 0.042), (47, -0.023), (48, 0.053), (49, -0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93945223 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

Author: David Adametz, Volker Roth

Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1

2 0.60485089 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

Author: Yee W. Teh, Charles Blundell, Lloyd Elliott

Abstract: We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. 1

3 0.60280472 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

Author: Jiayu Zhou, Jianhui Chen, Jieping Ye

Abstract: Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i.e., tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms. 1

4 0.58504313 226 nips-2011-Projection onto A Nonnegative Max-Heap

Author: Jun Liu, Liang Sun, Jieping Ye

Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative maxheap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal roottree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p2 ) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1

5 0.57586104 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

Author: Patrick O. Perry, Michael W. Mahoney

Abstract: Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which 2 -regularized or 1 -regularized 2 -regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the MahoneyOrecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm. 1

6 0.56659043 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure

7 0.54465443 221 nips-2011-Priors over Recurrent Continuous Time Processes

8 0.54451007 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

9 0.54382265 186 nips-2011-Noise Thresholds for Spectral Clustering

10 0.54051441 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations

11 0.53671467 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

12 0.53310764 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

13 0.52778202 95 nips-2011-Fast and Accurate k-means For Large Datasets

14 0.525392 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

15 0.51508808 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

16 0.51415247 131 nips-2011-Inference in continuous-time change-point models

17 0.51271337 241 nips-2011-Scalable Training of Mixture Models via Coresets

18 0.50970447 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

19 0.50797403 55 nips-2011-Collective Graphical Models

20 0.50156897 211 nips-2011-Penalty Decomposition Methods for Rank Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.037), (3, 0.267), (4, 0.048), (20, 0.055), (26, 0.016), (31, 0.065), (33, 0.052), (43, 0.05), (45, 0.074), (57, 0.057), (74, 0.074), (83, 0.054), (84, 0.028), (99, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7629236 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

Author: David Adametz, Volker Roth

Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1

2 0.7334677 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

3 0.66691232 33 nips-2011-An Exact Algorithm for F-Measure Maximization

Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier

Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1

4 0.63966173 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

Author: Jun Zhu, Ning Chen, Eric P. Xing

Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

5 0.53065312 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei

Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.

6 0.52659392 227 nips-2011-Pylon Model for Semantic Segmentation

7 0.51584911 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

8 0.51256686 303 nips-2011-Video Annotation and Tracking with Active Learning

9 0.51176119 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

10 0.51099825 186 nips-2011-Noise Thresholds for Spectral Clustering

11 0.51081866 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors

12 0.51025283 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

13 0.51001287 66 nips-2011-Crowdclustering

14 0.509148 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

15 0.50799412 276 nips-2011-Structured sparse coding via lateral inhibition

16 0.50752324 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

17 0.50736713 55 nips-2011-Collective Graphical Models

18 0.50615352 102 nips-2011-Generalised Coupled Tensor Factorisation

19 0.50598603 219 nips-2011-Predicting response time and error rates in visual search

20 0.50399238 127 nips-2011-Image Parsing with Stochastic Scene Grammar