nips nips2012 nips2012-342 knowledge-graph by maker-knowledge-mining

342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models


Source: pdf

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The variational hierarchical EM algorithm for clustering hidden Markov models Emanuele Coviello ECE Dept. [sent-1, score-0.464]

2 In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. [sent-12, score-0.284]

3 We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. [sent-13, score-0.329]

4 We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. [sent-16, score-0.56]

5 A discrete-time hidden state process, which evolves as a Markov chain, encodes the dynamics of the signal, and an observation process, at each time conditioned on the current state, encodes the appearance of the signal. [sent-18, score-0.199]

6 HMMs have successfully served a variety of applications, including speech recognition [1], music analysis [2] and identification [3], and clustering of time series data [4, 5]. [sent-19, score-0.316]

7 Various applications motivate the design of HMM clustering algorithms, ranging from hierarchical clustering of sequential data (e. [sent-23, score-0.339]

8 , speech or motion sequences modeled by HMMs [4]), over hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large datasets (e. [sent-25, score-0.292]

9 , semantic annotation models for music and video) — by clustering HMMs, efficiently estimated from many small subsets of the data, into a more compact mixture model of all data. [sent-27, score-0.464]

10 While this approach has proven successful in grouping HMMs into similar clusters [4], it does not allow generating novel HMMs as cluster centers. [sent-37, score-0.234]

11 i.e., the HMM that the spectral clustering procedure maps closest to each spectral clustering center. [sent-40, score-0.364]

12 Spectral clustering can be based on affinity scores between HMM distributions other than the Bhattacharyya affinity, such as the KL divergence approximated with sampling [7]. [sent-44, score-0.179]
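For concreteness, a sampling-based affinity of this kind can be built from a Monte Carlo estimate of the KL divergence between two sequence models; the sketch below is illustrative only, and `sample` / `loglik` are placeholder methods of a generic HMM object rather than an API from the paper.

```python
# Illustrative Monte Carlo estimate of KL(p || q) between two sequence models,
# as used by sampling-based affinities; not code from the paper.
import numpy as np

def kl_divergence_mc(hmm_p, hmm_q, n_samples=500, tau=25):
    # Draw sequences from p and average the log-likelihood ratio.
    seqs = [hmm_p.sample(tau) for _ in range(n_samples)]
    return float(np.mean([hmm_p.loglik(y) - hmm_q.loglik(y) for y in seqs]))
```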

13 Instead, in this paper we propose to cluster HMMs directly with respect to the probability distributions they represent. [sent-45, score-0.138]

14 We derive a hierarchical expectation maximization (HEM) algorithm that, starting from a group of HMMs, estimates a smaller mixture model that concisely represents and clusters the input HMMs (i. [sent-46, score-0.338]

15 This algorithm starts from a Gaussian mixture model (GMM) and reduces it to another GMM with fewer components, where each of the mixture components of the reduced GMM represents, i.e., clusters, a group of the original components. [sent-50, score-0.223]
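As a concrete illustration of this GMM-reduction step, the following sketch implements the standard HEM recipe for GMMs with full covariances: the E-step soft-assigns each base component to a reduced component using the expected log-likelihood of its virtual samples (available in closed form for Gaussians), and the M-step re-estimates the reduced components from weighted base statistics. This is illustrative code following the usual HEM-GMM updates, not the authors' implementation, and all names are ours.

```python
# Illustrative HEM-for-GMMs sketch (full covariances); NOT the authors' code.
import numpy as np

def expected_gauss_loglik(mu_b, Sigma_b, mu_r, Sigma_r):
    """Closed form of E_{N(mu_b, Sigma_b)}[ log N(y; mu_r, Sigma_r) ]."""
    d = len(mu_b)
    Sr_inv = np.linalg.inv(Sigma_r)
    diff = mu_b - mu_r
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma_r)[1]
                   + np.trace(Sr_inv @ Sigma_b) + diff @ Sr_inv @ diff)

def hem_gmm(w_b, mu_b, Sigma_b, K_r, N=1000, iters=50, seed=0):
    """Reduce a base GMM (w_b, mu_b, Sigma_b) to K_r components."""
    rng = np.random.default_rng(seed)
    K_b, d = mu_b.shape
    idx = rng.choice(K_b, size=K_r, replace=False)   # init from base components
    w_r = np.full(K_r, 1.0 / K_r)
    mu_r, Sigma_r = mu_b[idx].copy(), Sigma_b[idx].copy()
    for _ in range(iters):
        # E-step: soft-assign base components to reduced components, using
        # N_i = N * w_b[i] virtual samples in expectation (no sampling needed).
        logz = np.zeros((K_b, K_r))
        for i in range(K_b):
            for j in range(K_r):
                logz[i, j] = np.log(w_r[j]) + N * w_b[i] * expected_gauss_loglik(
                    mu_b[i], Sigma_b[i], mu_r[j], Sigma_r[j])
        logz -= logz.max(axis=1, keepdims=True)
        z = np.exp(logz)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: re-estimate the reduced components from weighted base statistics.
        for j in range(K_r):
            wz = z[:, j] * w_b + 1e-12
            w_r[j] = wz.sum()
            mu_r[j] = (wz[:, None] * mu_b).sum(axis=0) / wz.sum()
            diff = mu_b - mu_r[j]
            Sigma_r[j] = np.einsum('i,ijk->jk', wz,
                                   Sigma_b + diff[:, :, None] * diff[:, None, :]) / wz.sum()
        w_r /= w_r.sum()
    return w_r, mu_r, Sigma_r, z
```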

16 HEM has been applied successfully to many machine learning tasks for images [10], video [9] and music [11, 12]. [sent-57, score-0.145]

17 The HEM algorithm is similar in spirit to Bregman-clustering [13], which is based on assigning points to cluster centers using KL-divergence. [sent-58, score-0.201]

18 To extend the HEM framework for GMMs to hidden Markov mixture models (H3Ms), additional marginalization of the hidden-state processes is required, as for DTMs. [sent-59, score-0.2]

19 Therefore, in this work, we derive a variational formulation of the HEM algorithm (VHEM), and then leverage a variational approximation derived in [14] (which has not been used in a learning context so far) to make the inference in the E-step tractable. [sent-61, score-0.306]

20 The proposed VHEM algorithm for H3Ms (VHEM-H3M) allows clustering hidden Markov models, while also learning novel HMM centers that are representative of each cluster, in a way that is consistent with the underlying generative model of the input HMMs. [sent-62, score-0.371]

21 The resulting VHEM algorithm can be generalized to handle other classes of graphical models, for which exact computation of the E-step in standard HEM would be intractable, by leveraging similar variational approximations. [sent-63, score-0.141]

22 The efficacy of the VHEM-H3M algorithm is demonstrated on hierarchical motion clustering and semantic music annotation and retrieval. [sent-64, score-0.568]

23 We review the hidden Markov model (HMM) and the hidden Markov mixture model (H3M) in Section 2. [sent-66, score-0.318]

24 2 The hidden Markov (mixture) model - A hidden Markov model (HMM) $\mathcal{M}$ assumes a sequence of $\tau$ observations $y_{1:\tau}$ is generated by a doubly embedded stochastic process. [sent-68, score-0.275]

25 The hidden state process $x_{1:\tau}$ is a first-order Markov chain on $S$ states, with transition matrix $A$ whose entries are $a_{\beta,\gamma} = P(x_{t+1}=\gamma \mid x_t=\beta)$, and initial state distribution $\pi = [\pi_1, \ldots, \pi_S]$. [sent-69, score-0.206]

26 Each state $\beta$ generates observations according to an emission probability density function $p(y|x=\beta,\mathcal{M})$, which here we assume time-invariant and modeled as a Gaussian mixture with $M$ components, i.e., $p(y|x=\beta,\mathcal{M}) = \sum_{m=1}^{M} c_{\beta,m}\, p(y|\zeta=m,\mathcal{M})$. [sent-73, score-0.184]

27 Here $\zeta \sim \text{multinomial}(c_{\beta,1}, \ldots, c_{\beta,M})$ is the hidden variable that selects the mixture component, $c_{\beta,m}$ is the mixture weight of the $m$th Gaussian component, and $p(y|\zeta=m,\mathcal{M}) = \mathcal{N}(y; \mu_{\beta,m}, \Sigma_{\beta,m})$ is the probability density function of a multivariate Gaussian distribution with mean $\mu_{\beta,m}$ and covariance matrix $\Sigma_{\beta,m}$. [sent-78, score-0.282]
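For readers who want a concrete picture of this generative process, here is a minimal numeric sketch (not the authors' code) that samples a sequence from an HMM with GMM emissions, using the notation above and a made-up toy parameterization.

```python
# Minimal sketch (not the authors' code) of the generative process above:
# initial distribution pi, transition matrix A with entries a_{beta,gamma},
# and a per-state GMM emission with weights c, means mu and covariances Sigma.
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm_gmm(pi, A, c, mu, Sigma, tau):
    """Draw a length-tau observation sequence y_{1:tau}."""
    S = len(pi)
    x = rng.choice(S, p=pi)                   # initial hidden state
    ys = []
    for _ in range(tau):
        m = rng.choice(len(c[x]), p=c[x])     # select a mixture component
        ys.append(rng.multivariate_normal(mu[x][m], Sigma[x][m]))
        x = rng.choice(S, p=A[x])             # Markov transition
    return np.stack(ys)

# Toy model: S = 2 states, M = 2 components, 1-D observations.
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
c = np.array([[0.5, 0.5], [0.7, 0.3]])
mu = np.array([[[0.0], [1.0]], [[4.0], [6.0]]])
Sigma = np.tile(np.eye(1) * 0.25, (2, 2, 1, 1))
y = sample_hmm_gmm(pi, A, c, mu, Sigma, tau=25)
```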

28 A hidden Markov mixture model (H3M) models a set of observation sequences as samples from a group of K hidden Markov models, each associated with a specific sub-behavior [5]. [sent-80, score-0.483]

29 To reduce clutter, here we assume that all the HMMs have the same number S of hidden states and that all emission probabilities have M mixture components, though our derivation could be easily extended to the more general case, and in the remainder of the paper we use the notation in Table 1. [sent-84, score-0.315]

30 We now derive the variational hierarchical EM algorithm for clustering HMMs (VHEM-H3M). [sent-94, score-0.522]

31 Let $\mathcal{M}^{(b)} = \{\omega_i^{(b)}, \mathcal{M}_i^{(b)}\}_{i=1}^{K^{(b)}}$ be a base hidden Markov mixture model (H3M) with $K^{(b)}$ components. [sent-95, score-0.464]

32 The goal of the VHEM-H3M algorithm is to find a reduced hidden Markov mixture model $\mathcal{M}^{(r)} = \{\omega_j^{(r)}, \mathcal{M}_j^{(r)}\}_{j=1}^{K^{(r)}}$ with fewer components (i.e., $K^{(r)} < K^{(b)}$). [sent-96, score-0.259]

33 At a high level, the VHEM-H3M algorithm estimates the reduced H3M model $\mathcal{M}^{(r)}$ from virtual samples distributed according to the base H3M model $\mathcal{M}^{(b)}$. [sent-99, score-0.245]

34 From this estimation procedure, the VHEM algorithm provides: (i) a (soft) clustering of the original $K^{(b)}$ HMMs into $K^{(r)}$ groups, encoded in assignment variables $z_{i,j}$, and (ii) novel HMM cluster centers, i.e., HMMs that are representative of each group. [sent-100, score-0.32]

35 Finally, because we take the expectation over the virtual samples, the estimation is carried out in an efficient manner that requires only knowledge of the parameters of the base model without the need of generating actual virtual samples. [sent-103, score-0.36]

36 Parameter estimation - We consider a set $Y$ of $N$ virtual samples distributed according to the base model $\mathcal{M}^{(b)}$, such that the $N_i = N\omega_i^{(b)}$ samples $Y_i = \{y_{1:\tau}^{(i,m)}\}_{m=1}^{N_i}$ are from the $i$th component (i.e., $\mathcal{M}_i^{(b)}$). [sent-105, score-0.311]

37 We denote the entire set of samples as $Y = \{Y_i\}_{i=1}^{K^{(b)}}$ and, in order to obtain a consistent clustering of the input HMMs $\mathcal{M}_i^{(b)}$, we assume the entirety of samples $Y_i$ is assigned to the same component of the reduced model [8]. [sent-108, score-0.268]

38 Note that, in this formulation, we are not using virtual samples $\{x_{1:\tau}^{(i,m)}, y_{1:\tau}^{(i,m)}\}$ for each base component, according to its joint distribution $p(x_{1:\tau}, y_{1:\tau}|\mathcal{M}_i^{(b)})$, but we treat $X_i = \{x_{1:\tau}^{(i,m)}\}_{m=1}^{N_i}$ as “missing” information, and estimate them in the E-step. [sent-109, score-0.219]

39 The reason is that a basis mismatch between components of $\mathcal{M}_i^{(b)}$ will cause problems when the parameters of $\mathcal{M}_j^{(r)}$ are computed from virtual samples of the hidden states of $\{\mathcal{M}_i^{(b)}\}_{i=1}^{K^{(b)}}$. [sent-110, score-0.335]

40 …$\log p(Y|\mathcal{M}^{(r)}) = \sum_{i=1}^{K^{(b)}} \log p(Y_i|\mathcal{M}^{(r)})$, with respect to $\mathcal{M}^{(r)}$, and uses the law of large numbers to turn the virtual samples into an expectation over the base model components $\mathcal{M}_i^{(b)}$. [sent-113, score-0.368]

41 To estimate $\mathcal{M}^{(r)}$, we will maximize the expected log-likelihood of the virtual samples, $J(\mathcal{M}^{(r)}) = E_{\mathcal{M}^{(b)}}\big[\log p(Y|\mathcal{M}^{(r)})\big] = \sum_{i=1}^{K^{(b)}} E_{\mathcal{M}_i^{(b)}}\big[\log p(Y_i|\mathcal{M}^{(r)})\big]$ (2), where the expectation is over the base model components $\mathcal{M}_i^{(b)}$. [sent-115, score-0.354]

42 A general framework for maximum likelihood estimation in the presence of hidden variables (which is the case for H3Ms) is the EM algorithm [15]. [sent-116, score-0.166]

43 In this work, we take a variational perspective [16, 17, 18], which views both E- and M-step as a maximization step. [sent-117, score-0.141]

44 The variational E-step first obtains a family of lower bounds to the log-likelihood (i. [sent-118, score-0.188]

45 , to equation 2), indexed by variational parameters, and then optimizes over the variational parameters to find the tightest bound. [sent-120, score-0.282]

46 The corresponding M-step then maximizes the lower bound (with the variational parameters fixed) with respect to the model parameters. [sent-121, score-0.228]

47 One advantage of the variational formulation is that it allows replacing a difficult inference in the E-step with a variational approximation, by restricting the maximization to a smaller domain for which the lower bound is tractable. [sent-122, score-0.369]

48 Lower bound to an expected log-likelihood - Before proceeding with the derivation of VHEM for H3Ms, we first need to derive a lower bound to an expected log-likelihood term (e.g., the terms in (2)). [sent-125, score-0.145]

49 In all generality, let $\{O, H\}$ be the observation and hidden variables of a probabilistic model, respectively, where $p(H)$ is the distribution of the hidden variables, $p(O|H)$ is the conditional likelihood of the observations, and $p(O) = \sum_H p(O|H)\,p(H)$ is the observation likelihood. [sent-129, score-0.337]

50 We can define a variational lower bound to the observation log-likelihood [18, 19]: $\log p(O) \ge \log p(O) - D(q(H)\,\|\,p(H|O)) = \sum_H q(H) \log \frac{p(H)\,p(O|H)}{q(H)}$ (3), where $p(H|O)$ is the posterior distribution of $H$ given observation $O$, and $q(H)$ is the variational distribution (i.e., an approximation to the posterior). [sent-130, score-0.569]

51 When the variational distribution equals the true posterior, $q(H) = p(H|O)$, the KL divergence is zero, and hence the lower bound reaches $\log p(O)$. [sent-134, score-0.203]
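A tiny numeric check of this statement (purely illustrative, with a made-up binary hidden variable and a fixed observation) is:

```python
# Illustrative check of the bound in (3): the lower bound equals log p(O)
# exactly when q is the true posterior, and is smaller otherwise.
import numpy as np

p_H = np.array([0.3, 0.7])             # p(H)
p_O_given_H = np.array([0.9, 0.2])     # p(O = o | H)
p_O = float((p_H * p_O_given_H).sum()) # marginal likelihood p(O = o)
posterior = p_H * p_O_given_H / p_O    # true posterior p(H | O = o)

def lower_bound(q):
    """sum_H q(H) log[ p(H) p(O|H) / q(H) ]  -- the right-hand side of (3)."""
    return float((q * np.log(p_H * p_O_given_H / q)).sum())

print(np.log(p_O), lower_bound(posterior))   # equal at the true posterior
print(lower_bound(np.array([0.5, 0.5])))     # strictly smaller otherwise
```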

52 Since $p_b(O)$ is non-negative, taking the expectation on both sides of (4) yields $E_b[\log p(O)] \ge E_b\Big[\max_{q\in\mathcal{Q}} \sum_H q(H) \log \frac{p(H)\,p(O|H)}{q(H)}\Big] \ge \max_{q\in\mathcal{Q}} E_b\Big[\sum_H q(H) \log \frac{p(H)\,p(O|H)}{q(H)}\Big]$ (5) $= \max_{q\in\mathcal{Q}} \sum_H q(H) \Big\{\log \frac{p(H)}{q(H)} + E_b[\log p(O|H)]\Big\}$ (6), where (5) follows from Jensen's inequality (i.e., the expectation of a maximum is no smaller than the maximum of the expectations). [sent-137, score-0.181]

53 Variational lower bound - We now derive the lower bound of the expected log-likelihood cost function in (2). [sent-142, score-0.221]

54 The derivation proceeds by successively applying the lower bound from (6) on each arising expected log-likelihood term, which results in a set of nested lower bounds. [sent-143, score-0.212]

55 The second lower bound, $\mathcal{L}_{HMM}^{i,j}$, is on the expected log-likelihood of an HMM $\mathcal{M}_j^{(r)}$, marginalized over observation sequences from a different HMM $\mathcal{M}_i^{(b)}$. [sent-146, score-0.176]

56 Although the data log-likelihood $\log p(y_{1:\tau}|\mathcal{M}_j^{(r)})$ can be computed exactly using the forward algorithm [1], calculating its expectation is not analytically tractable, since $y_{1:\tau} \sim \mathcal{M}_i^{(b)}$ is essentially an observation from a mixture with $O(S^\tau)$ components. [sent-147, score-0.161]
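For reference, a compact sketch of the scaled forward recursion for the exact HMM log-likelihood is given below; it assumes single-Gaussian emissions to keep the example short, and it is not the paper's code.

```python
# Illustrative scaled forward recursion for log p(y_{1:tau} | M);
# single-Gaussian emissions are assumed to keep the sketch short.
import numpy as np
from scipy.stats import multivariate_normal

def hmm_loglik(y, pi, A, mu, Sigma):
    tau, S = len(y), len(pi)
    # B[t, s] = p(y_t | x_t = s)
    B = np.array([[multivariate_normal.pdf(y[t], mu[s], Sigma[s])
                   for s in range(S)] for t in range(tau)])
    alpha = pi * B[0]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, tau):
        alpha = (alpha @ A) * B[t]   # predict, then weight by the emission
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()         # rescale to avoid numerical underflow
    return loglik
```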

57 The third lower bound is between GMM emission densities Mi,βt and Mj,ρt . [sent-148, score-0.145]

58 H3M lower bound - Looking at an individual term in (2), $p(Y_i|\mathcal{M}^{(r)})$ is a mixture of HMMs, and thus the observation variable is $Y_i$ and the hidden variable is $z_i$ (the assignment of $Y_i$ to a component $\mathcal{M}_j^{(r)}$). [sent-149, score-0.427]

59 Hence, introducing the variational distribution $q_i(z_i)$ and applying (6), we have $E_{\mathcal{M}_i^{(b)}}\big[\log p(Y_i|\mathcal{M}^{(r)})\big] \ge \max_{q_i} \sum_j q_i(z_i=j) \Big\{\log \frac{p(z_i=j)}{q_i(z_i=j)} + N_i\, E_{\mathcal{M}_i^{(b)}}[\log p(y_{1:\tau}|\mathcal{M}_j^{(r)})]\Big\} \ge \max_{q_i} \sum_j q_i(z_i=j) \Big\{\log \frac{p(z_i=j)}{q_i(z_i=j)} + N_i\, \mathcal{L}_{HMM}^{i,j}\Big\} \triangleq \mathcal{L}_{H3M}^{i}$. [sent-150, score-0.778]

60 …samples, and we use the lower bound (8) for the expectation of $\log p(y_{1:\tau}|\mathcal{M}_j^{(r)})$, which is the observation log-likelihood of an HMM and hence its expectation cannot be calculated directly. [sent-154, score-0.23]

61 To compute $\mathcal{L}_{H3M}^{i}$, we will restrict the variational distributions to the form $q_i(z_i=j) = z_{ij}$ for all $i$, where $\sum_{j=1}^{K^{(r)}} z_{ij} = 1$ and $z_{ij} \ge 0$ for all $j$. [sent-155, score-0.386]

62 HMM lower bound - For the HMM likelihood $p(y_{1:\tau}|\mathcal{M}_j^{(r)})$, the observation variable is $y_{1:\tau}$ and the hidden variable is its state sequence $\rho_{1:\tau}$. [sent-156, score-0.352]

63 The variational distribution $q^{i,j}_{\beta_{1:\tau}}(\rho_{1:\tau})$ (14), with $\sum_{\rho_{1:\tau}} q^{i,j}_{\beta_{1:\tau}}(\rho_{1:\tau}) = 1$ for each value of $\beta_{1:\tau}$, assigns state sequences $\rho_{1:\tau} \sim \mathcal{M}_j^{(r)}$ to state sequences $\beta_{1:\tau} \sim \mathcal{M}_i^{(b)}$, based on how well (in expectation) the state sequence $\rho_{1:\tau}$ of $\mathcal{M}_j^{(r)}$ can explain an observation sequence generated by HMM $\mathcal{M}_i^{(b)}$ evolving through state sequence $\beta_{1:\tau}$. [sent-159, score-0.526]

64 GMM lower bound - In [20] we derive the lower bound (9) by marginalizing $E_{\mathcal{M}_{i,\beta_t}^{(b)}}$ over the GMM assignment $m$, introducing the variational distributions $q^{i,j}_{\beta,\rho}(\zeta = l|m)$, and applying (6). [sent-162, score-0.47]

65 We will restrict the variational distributions to $q^{i,j}_{\beta,\rho}(\zeta = l|m) = \eta^{(i,\beta_t),(j,\rho_t)}_{l|m}$, where $\sum_{l} \eta^{(i,\beta_t),(j,\rho_t)}_{l|m} = 1$ and $\eta^{(i,\beta),(j,\rho)}_{l|m} \ge 0$ for all $l$ and $m$. [sent-163, score-0.166]

66 Intuitively, $\eta^{(i,\beta_t),(j,\rho_t)}$ is the responsibility matrix between Gaussian observation components for state $\beta_t$ in $\mathcal{M}_i^{(b)}$ and state $\rho_t$ in $\mathcal{M}_j^{(r)}$, where $\eta^{(i,\beta_t),(j,\rho_t)}_{l|m}$ is the probability that an observation from component $m$ of the former corresponds to component $l$ of the latter. [sent-164, score-0.192]

67 E-step - The variational E-step (see [20] for details) calculates the variational parameters $z_{ij}$, $\phi^{i,j}(\rho_{1:\tau}|\beta_{1:\tau}) = \phi^{i,j}_{\beta_1}(\rho_1) \prod_{t=2}^{\tau} \phi^{i,j}_{\beta_t}(\rho_t|\rho_{t-1})$ (13), and $\eta^{(i,\beta),(j,\rho)}$ for the lower bounds in (9) and (10). [sent-168, score-0.378]

68 The quantity $\hat{\nu}^{i,j}(\rho, \beta)$ is the expected number of times that the HMM $\mathcal{M}_j^{(r)}$ is in state $\rho$ when the HMM $\mathcal{M}_i^{(b)}$ is in state $\beta$, when both are modeling sequences generated by $\mathcal{M}_i^{(b)}$. [sent-177, score-0.18]

69 Similarly, the quantity $\hat{\xi}^{i,j}(\rho, \rho')$ is the expected number of transitions from state $\rho$ to state $\rho'$ of $\mathcal{M}_j^{(r)}$, when modeling sequences generated by $\mathcal{M}_i^{(b)}$. [sent-178, score-0.18]
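To make these two statistics concrete, the sketch below estimates them by Monte Carlo instead of the closed-form variational E-step used in the paper; `sample_with_states`, `state_posteriors` and `pairwise_posteriors` are hypothetical methods standing in for sampling and forward-backward routines of generic HMM objects.

```python
# Monte Carlo illustration (not the paper's closed-form E-step) of what the
# aggregate statistics nu and xi count.
import numpy as np

def mc_occupancy_stats(base_hmm, reduced_hmm, S_b, S_r, n=200, tau=25):
    nu = np.zeros((S_r, S_b))    # nu[rho, beta]
    xi = np.zeros((S_r, S_r))    # xi[rho, rho']
    for _ in range(n):
        x, y = base_hmm.sample_with_states(tau)    # hidden path and observations
        gamma = reduced_hmm.state_posteriors(y)    # shape (tau, S_r)
        pair = reduced_hmm.pairwise_posteriors(y)  # shape (tau - 1, S_r, S_r)
        for t in range(tau):
            nu[:, x[t]] += gamma[t]
        xi += pair.sum(axis=0)
    return nu / n, xi / n
```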

70 [4] cluster a collection of HMMs by applying spectral clustering to a probability product kernel (PPK) matrix between HMMs. [sent-185, score-0.295]
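A sketch of this baseline pipeline is shown below, assuming the PPK (or any other) affinity matrix between the HMMs has already been computed; scikit-learn's spectral clustering is used here for illustration and is not necessarily the implementation of [4].

```python
# Spectral clustering of HMMs from a precomputed affinity matrix (illustrative).
from sklearn.cluster import SpectralClustering

def cluster_hmms_from_affinity(affinity_matrix, n_clusters):
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(affinity_matrix)   # cluster label for each input HMM
```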

71 While this has been proven successful in grouping HMMs into similar clusters, it cannot learn novel HMM cluster centers and therefore is suboptimal for hierarchical estimation of mixture models (see Section 4. [sent-186, score-0.404]

72 The VHEM-H3M algorithm clusters a collection of HMMs directly through the distributions they represent, by estimating a smaller mixture of novel HMMs that concisely models the distribution represented by the input HMMs. [sent-190, score-0.243]

73 As a result, the VHEM cluster centers are consistent with the underlying generative probabilistic framework. [sent-192, score-0.201]

74 As a first advantage, since VHEM-H3M estimates novel HMM cluster centers, we expect the learned cluster centers to retain more information on the clusters’ structure and VHEM-H3M to produce better hierarchical clusterings than [4], which suffers from out-of-sample limitations. [sent-193, score-0.435]

75 Note that in the second stage we could use the spectral clustering algorithm in [4] instead of VHEM — run spectral clustering over the intermediate models pooled together, and form the final H3M with the HMMs mapped closest to the K cluster centers. [sent-203, score-0.502]

76 VHEM, however, is expected to do better since it learns novel cluster centers. [sent-204, score-0.165]

77 As an alternative to VHEM, we tested a version of HEM that, instead of marginalizing over virtual samples, uses actual sampling and the EM algorithm [5] to learn the reduced H3M. [sent-205, score-0.151]

78 [Figure 1 / Table 2 residue: hierarchical motion-clustering levels over the motion classes (jog, sit, walk, run, soccer, basket, jump) and retrieval MAP / P@10 values.] [sent-222, score-0.335]

79 The Rand-index is the probability that any pair of motion sequences are correctly clustered with respect to each other. [sent-276, score-0.206]
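A small helper that computes the Rand index exactly as described (illustrative, pure Python):

```python
# Rand index: fraction of pairs on which two labelings agree (same cluster in
# both, or different clusters in both).
from itertools import combinations

def rand_index(labels_true, labels_pred):
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total
```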

80 Experiment on hierarchical motion clustering (VHEM-H3M algorithm). [sent-278, score-0.32]

81 We tested the VHEM algorithm on hierarchical motion clustering, where each of the input HMMs to be clustered is estimated on a sequence of motion capture data from the Motion Capture dataset (http://mocap. [sent-281, score-0.362]

82 In particular, we start from K1 = 56 motion examples from 8 different classes (“jump”, “run”, “jog”, “walk 1” and “walk 2”, which are from two different subjects, “basket”, “soccer”, “sit”), and learn an HMM for each of them, forming the first level of the hierarchy. [sent-285, score-0.162]

83 A tree-structure is formed by successively clustering HMMs with the VHEM algorithm, and using the learned cluster centers as the representative HMMs at the new level. [sent-286, score-0.399]
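Schematically, this hierarchy construction can be written as the following loop, where `vhem_cluster` is a placeholder for the VHEM-H3M clustering step (returning cluster-center HMMs and assignments) and is not implemented here.

```python
# Schematic of the hierarchy construction described above (illustrative only).
def build_hierarchy(input_hmms, level_sizes, vhem_cluster):
    levels = [input_hmms]
    for k in level_sizes:                   # e.g. [8, 4, 2] above 56 inputs
        centers, assignments = vhem_cluster(levels[-1], k)
        levels.append(centers)              # centers become the next level
    return levels
```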

84 The hierarchical clustering obtained with VHEM is illustrated in Figure 1 (top). [sent-288, score-0.205]

85 At Level 2, the 8 HMM clusters are shown with vertical bars, with the colors indicating the proportions of the motion classes in the cluster. [sent-290, score-0.205]

86 At Level 2, VHEM produces clusters with examples from a single motion class (e. [sent-291, score-0.185]

87 Moving up the hierarchy, VHEM clusters similar motion classes together (as indicated by the arrows), and at Level 4 it creates a dichotomy between “sit” and the other (more dynamic) motion classes. [sent-294, score-0.185]

88 On the bottom, in Figure 1, the same experiment is repeated using spectral clustering in tandem with PPK similarity (PPK-SC). [sent-295, score-0.182]

89 PPK-SC clusters motion sequences properly; however, at Level 2 it incorrectly aggregates “sit” and “soccer”, which have quite different dynamics, and its Level 4 is not as interpretable as the one produced by VHEM. [sent-296, score-0.282]

90 This suggests that the novel HMM cluster centers learned by VHEM-H3M retain more information on the clusters’ structure than the spectral cluster centers, which is increasingly visible moving up the hierarchy. [sent-309, score-0.412]

91 Experiment on automatic music tagging - We evaluated VHEM-H3M on content-based music auto-tagging on the CAL500 [11], a collection of 502 songs annotated with respect to a vocabulary V of 149 tags. [sent-312, score-0.391]

92 We formulate music auto-tagging as supervised multi-class labeling [10], where each class is a tag from V and is modeled as an H3M probability distribution estimated from audio sequences (of T = 125 audio features, i. [sent-317, score-0.244]
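A sketch of how such per-tag models would be used at annotation time is given below (hypothetical `loglik` method and `tag_models` mapping; not the paper's API): rank the tags by the average likelihood their models assign to a song's audio fragments.

```python
# Illustrative tag ranking with per-tag sequence models.
import numpy as np

def annotate(song_sequences, tag_models, top_k=10):
    scores = {tag: float(np.mean([model.loglik(seq) for seq in song_sequences]))
              for tag, model in tag_models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```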

93 The last two use an efficient HEM algorithm for learning, and are state-of-the-art baselines for music tagging. [sent-323, score-0.145]

94 Designed for clustering HMMs, PPK-SC does not produce accurate annotation models, since it discards information on the clusters’ structure by approximating it with one of the original HMMs. [sent-330, score-0.207]

95 Instead, VHEM-H3M generates novel HMM cluster centers that effectively summarize each cluster. [sent-331, score-0.23]

96 Hence, VHEM-H3M outperforms HEM-DTM on the human MoCap data (see Table 2), which has non-linear dynamics, while the two perform similarly on the music data (differences were statistically significant only on annotation P), where the audio features are stationary over short time frames. [sent-335, score-0.275]

97 Conclusion - We presented a variational HEM algorithm that clusters HMMs through the distributions they represent and generates novel HMM cluster centers. [sent-337, score-0.442]

98 The efficacy of the algorithm was demonstrated on hierarchical motion clustering and automatic music tagging, with improvement over current methods. [sent-338, score-0.491]

99 Semantic annotation and retrieval of music and sound effects. [sent-435, score-0.249]

100 A view of the em algorithm that justifies incremental, sparse, and other variants. [sent-477, score-0.147]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hmm', 0.461), ('hmms', 0.376), ('vhem', 0.341), ('mj', 0.278), ('hem', 0.273), ('gmm', 0.166), ('em', 0.147), ('music', 0.145), ('variational', 0.141), ('mi', 0.139), ('clustering', 0.134), ('virtual', 0.125), ('hidden', 0.118), ('motion', 0.115), ('cluster', 0.113), ('centers', 0.088), ('mixture', 0.082), ('hm', 0.075), ('qi', 0.073), ('annotation', 0.073), ('hierarchical', 0.071), ('clusters', 0.07), ('eb', 0.069), ('sequences', 0.069), ('soccer', 0.06), ('chan', 0.059), ('emission', 0.058), ('markov', 0.058), ('base', 0.057), ('audio', 0.057), ('sit', 0.055), ('songs', 0.055), ('ni', 0.051), ('basket', 0.051), ('coviello', 0.051), ('lgmtm', 0.051), ('zij', 0.049), ('spectral', 0.048), ('lower', 0.047), ('level', 0.047), ('zi', 0.046), ('jog', 0.045), ('ppk', 0.045), ('state', 0.044), ('yt', 0.043), ('log', 0.042), ('tag', 0.042), ('bound', 0.04), ('sequence', 0.039), ('samples', 0.037), ('concisely', 0.037), ('observation', 0.037), ('speech', 0.037), ('yi', 0.035), ('derivation', 0.035), ('jebara', 0.034), ('component', 0.034), ('cityu', 0.034), ('expecta', 0.034), ('eyt', 0.034), ('components', 0.033), ('walk', 0.032), ('expectation', 0.032), ('retrieval', 0.031), ('song', 0.031), ('semantic', 0.03), ('ldss', 0.03), ('jump', 0.029), ('novel', 0.029), ('mz', 0.028), ('aggregates', 0.028), ('likelihood', 0.027), ('automatic', 0.026), ('gert', 0.026), ('reduced', 0.026), ('ece', 0.025), ('distributions', 0.025), ('intermediate', 0.025), ('averages', 0.024), ('derive', 0.024), ('gmms', 0.024), ('expected', 0.023), ('pb', 0.023), ('bhattacharyya', 0.023), ('diego', 0.023), ('assignment', 0.023), ('representative', 0.023), ('group', 0.022), ('clustered', 0.022), ('li', 0.022), ('states', 0.022), ('estimation', 0.021), ('kong', 0.021), ('learned', 0.021), ('divergence', 0.02), ('tagging', 0.02), ('successively', 0.02), ('colors', 0.02), ('acknowledges', 0.02), ('hong', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1

2 0.19975181 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process prior on the dynamics and emission functions respectively enable the complex dynamic structure and long-range dependency of speech to be better represented than that by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on the synthetic and benchmark datasets. 1

3 0.11969399 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

4 0.1133197 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

Author: Daniel Zoran, Yair Weiss

Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of number of mixture components including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models. 1 GMMs and natural image statistics models Many models for the statistics of natural image patches have been suggested in recent years. Finding good models for natural images is important to many different research areas - computer vision, biological vision and neuroscience among others. Recently, there has been a growing interest in comparing different aspects of models for natural images such as log-likelihood and multi-information reduction performance, and much progress has been achieved [1,2, 3,4,5, 6]. Out of these results there is one which is particularly interesting: simple, unconstrained Gaussian Mixture Models (GMMs) with a relatively small number of mixture components learned from image patches are extraordinarily good in modeling image statistics [6, 4]. This is a surprising result due to the simplicity of GMMs and their ubiquity. Another surprising aspect of this result is that many of the current models may be thought of as GMMs with an exponential or infinite number of components, having different constraints on the covariance structure of the mixture components. In this work we study the nature of GMMs learned from natural image patches. We start with a thorough comparison to some popular and cutting edge image models. We show that indeed, GMMs are excellent performers in modeling natural image patches. We then analyze what properties of natural images these GMMs capture, their dependence on the number of components in the mixture and their relation to the structure of the world around us. Finally, we show that the learned GMM suggests a strong connection between natural image statistics and a simple variant of the dead leaves model [7, 8], explicitly modeling occlusions and explaining some of the success of GMMs in modeling natural images.

5 0.10436205 69 nips-2012-Clustering Sparse Graphs

Author: Yudong Chen, Sujay Sanghavi, Huan Xu

Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well. 1

6 0.10268874 126 nips-2012-FastEx: Hash Clustering with Exponential Families

7 0.1022199 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

8 0.10161258 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

9 0.096216843 247 nips-2012-Nonparametric Reduced Rank Regression

10 0.093149319 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

11 0.092749536 180 nips-2012-Learning Mixtures of Tree Graphical Models

12 0.086530246 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

13 0.082539208 188 nips-2012-Learning from Distributions via Support Measure Machines

14 0.078831278 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

15 0.076048881 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

16 0.075515755 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models

17 0.074141406 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

18 0.073676959 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

19 0.073193513 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

20 0.071479589 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.048), (2, -0.014), (3, 0.026), (4, -0.104), (5, -0.076), (6, -0.025), (7, -0.029), (8, 0.031), (9, -0.031), (10, 0.102), (11, -0.072), (12, -0.012), (13, -0.022), (14, 0.065), (15, -0.065), (16, -0.062), (17, 0.019), (18, -0.013), (19, -0.119), (20, -0.135), (21, -0.061), (22, -0.05), (23, 0.023), (24, 0.033), (25, 0.049), (26, 0.045), (27, -0.125), (28, 0.003), (29, 0.081), (30, 0.047), (31, -0.005), (32, -0.018), (33, -0.051), (34, -0.045), (35, 0.095), (36, 0.105), (37, -0.21), (38, 0.052), (39, 0.127), (40, 0.064), (41, 0.115), (42, -0.193), (43, -0.045), (44, 0.031), (45, -0.021), (46, -0.002), (47, -0.055), (48, 0.025), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94462425 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1

2 0.59351534 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process prior on the dynamics and emission functions respectively enable the complex dynamic structure and long-range dependency of speech to be better represented than that by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on the synthetic and benchmark datasets. 1

3 0.56749171 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detectionbased discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines. 1

4 0.52449429 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

5 0.51416034 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

6 0.49292269 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

7 0.48975953 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

8 0.47451341 26 nips-2012-A nonparametric variable clustering model

9 0.4736698 359 nips-2012-Variational Inference for Crowdsourcing

10 0.46682203 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

11 0.46484402 126 nips-2012-FastEx: Hash Clustering with Exponential Families

12 0.46474332 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

13 0.45997575 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models

14 0.44964653 294 nips-2012-Repulsive Mixtures

15 0.44957736 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

16 0.44639906 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

17 0.44494474 155 nips-2012-Human memory search as a random walk in a semantic network

18 0.43967396 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

19 0.42895657 2 nips-2012-3D Social Saliency from Head-mounted Cameras

20 0.42526695 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.117), (21, 0.035), (38, 0.099), (42, 0.036), (54, 0.02), (55, 0.012), (71, 0.241), (74, 0.048), (76, 0.131), (80, 0.113), (92, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7791934 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging. 1

2 0.77616292 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function. 1

3 0.74628663 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method as learning with target priors (LTP). Specifically, LTP learning seeks parameter θ that maximizes the log likelihood of fθ (X) on a uncorresponded training set with regards to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes/reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video. 1

4 0.69576192 192 nips-2012-Learning the Dependency Structure of Latent Factors

Author: Yunlong He, Yanjun Qi, Koray Kavukcuoglu, Haesun Park

Abstract: In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model SLFA is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lowerdimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic data and two real-world data sets demonstrate that pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve the state-of-the-art classification performance. 1

5 0.69167995 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture. 1

6 0.68910819 233 nips-2012-Multiresolution Gaussian Processes

7 0.68754888 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

8 0.6812073 179 nips-2012-Learning Manifolds with K-Means and K-Flats

9 0.66619736 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

10 0.6656211 282 nips-2012-Proximal Newton-type methods for convex optimization

11 0.66527879 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

12 0.66512376 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

13 0.66436106 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

14 0.66360468 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

15 0.65966165 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

16 0.65929145 12 nips-2012-A Neural Autoregressive Topic Model

17 0.65839022 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.65434784 200 nips-2012-Local Supervised Learning through Space Partitioning

19 0.65406251 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

20 0.65106171 47 nips-2012-Augment-and-Conquer Negative Binomial Processes