nips nips2012 nips2012-19 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1
Reference: text
sentIndex sentText sentNum sentScore
1 gov Abstract Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. [sent-10, score-0.213]
2 This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. [sent-11, score-0.229]
3 This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). [sent-12, score-0.229]
4 For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i. [sent-13, score-0.263]
5 The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). [sent-16, score-0.609]
6 Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. [sent-17, score-0.213]
7 1 Introduction Topic models use latent variables to explain the observed (co-)occurrences of words in documents. [sent-18, score-0.24]
8 They posit that each document is associated with a (possibly sparse) mixture of active topics, and that each word in the document is accounted for (in fact, generated) by one of these active topics. [sent-19, score-0.493]
9 In Latent Dirichlet Allocation (LDA) [1], a Dirichlet prior gives the distribution of active topics in documents. [sent-20, score-0.147]
10 LDA and related models possess a rich representational power because they allow for documents to be comprised of words from several topics, rather than just a single topic. [sent-21, score-0.188]
11 This increased representational power comes at the cost of a more challenging unsupervised estimation problem, when only the words are observed and the corresponding topics are hidden. [sent-22, score-0.255]
12 In practice, the most common unsupervised estimation procedures for topic models are based on finding maximum likelihood estimates, through either local search or sampling based methods, e. [sent-23, score-0.229]
13 For document modeling, a typical goal is to form a sparse decomposition of a term-by-document matrix, which represents the word counts in each document. (∗ Contributions to this work by NIST, an agency of the US government, are not subject to copyright laws.) [sent-27, score-0.448]
14 This decomposition has two parts: one which specifies the active topics in each document and the other which specifies the distributions of words under each topic. [sent-28, score-0.361]
15 This work provides an alternative approach to parameter recovery based on the method of moments [7], which attempts to match the observed moments with those posited by the model. [sent-29, score-0.702]
16 Our approach does this efficiently through a particular decomposition of the low-order observable moments, which can be extracted using singular value decompositions (SVDs). [sent-30, score-0.24]
17 This method is simple and efficient to implement, and is guaranteed to recover the parameters of a wide class of topic models, including the LDA model. [sent-31, score-0.229]
18 We exploit exchangeability of the observed variables and, more generally, the availability of multiple views drawn independently from the same hidden component. [sent-32, score-0.155]
19 1 Summary of contributions We present an approach called Excess Correlation Analysis (ECA) based on the low-order (cross) moments of observed variables. [sent-34, score-0.349]
20 These observed variables are assumed to be exchangeable (and, more generally, drawn from a multi-view model). [sent-35, score-0.123]
21 The method looks for directions in which the moments are in excess of those suggested by a Gaussian distribution. [sent-38, score-0.374]
22 The SVDs are performed only on k×k matrices, where k is the number of latent factors; note that the number of latent factors (topics) k is typically much smaller than the dimension of the observed space d (number of words). [sent-39, score-0.372]
23 The method is applicable to a wide class of latent variable models including exchangeable and multiview models. [sent-40, score-0.202]
24 We first consider the class of exchangeable variables with independent latent factors. [sent-41, score-0.232]
25 We show that the (exact) low-order moments permit a decomposition that recovers the parameters for this model class, and that this decomposition can be computed using two SVD computations. [sent-42, score-0.504]
26 We then consider LDA and show that the same decomposition of a modified third-order moment correctly recovers both the probability distribution of words under each topic, as well as the parameters of the Dirichlet prior. [sent-43, score-0.228]
27 We note that in order to estimate third-order moments in the LDA model, it suffices for each document to contain at least three words. [sent-44, score-0.456]
28 While the methods described assume exact moments, it is straightforward to write down the analogous “plug-in” estimators based on empirical moments from sampled data. [sent-45, score-0.323]
29 We provide a simple sample complexity analysis that shows that estimating the third-order moments is not as difficult as it might naïvely seem, since we only need a k × k matrix to be accurate. [sent-46, score-0.323]
30 Finally, we remark that the moment decomposition can also be obtained using other techniques, including tensor decomposition methods and simultaneous matrix diagonalization methods. [sent-47, score-0.428]
31 2 Related work Under the assumption that a single active topic occurs in each document, the work of [9] provides the first provable guarantees for recovering the topic distributions (i.e., the distribution of words under each topic), albeit with a rather stringent separation condition (where the words in each topic are essentially non-overlapping). [sent-51, score-0.596]
33 Understanding what separation conditions permit efficient learning is a natural question; in the clustering literature, a line of work has focussed on understanding the relationship between the separation of the mixture components and the complexity of learning. [sent-54, score-0.3]
34 For the topic modeling problem in which only a single active topic is present per document, [22] provides an algorithm for learning topics with no separation requirement, but under a certain full rank assumption on the topic probability matrix. [sent-56, score-0.971]
35 For the case of LDA (where each document may be about multiple topics), the recent work of [23] provides the first provable result under a natural separation condition. [sent-57, score-0.308]
36 The condition requires that each topic be associated with “anchor words” that only occur in documents about that topic. [sent-58, score-0.298]
37 Under this assumption, [23] provide the first provably correct algorithm for learning the topic distributions. [sent-60, score-0.258]
38 Their work also justifies the use of non-negative matrix factorization (NMF) as a provable procedure for this problem (the original motivation for NMF was as a topic modeling algorithm, though, prior to this work, formal guarantees as such were rather limited). [sent-61, score-0.323]
39 Furthermore, [23] provides results for certain correlated topic models. [sent-62, score-0.229]
40 Our approach makes further progress on this problem by relaxing the need for this separation condition and establishing a much simpler procedure for parameter estimation. [sent-63, score-0.144]
41 The idea has been extended to other discrete mixture models such as discrete hidden Markov models (HMMs) and mixture models with a single active topic in each document (see [22, 25, 26]). [sent-67, score-0.614]
42 For such single topic models, the work in [22] demonstrates the generality of the eigenvector method and the irrelevance of the noise model for the observations, making it applicable to both discrete models like HMMs as well as certain Gaussian mixture models. [sent-68, score-0.354]
43 Another set of related techniques is the body of algebraic methods used for the problem of blind source separation [27]. [sent-69, score-0.164]
44 These approaches are tailored for independent source separation with additive noise (usually Gaussian) [28]. [sent-70, score-0.211]
45 Much of the literature focuses on understanding the effects of measurement noise, which often requires more sophisticated algebraic tools (typically, knowledge of noise statistics or the availability of multiple views of the latent factors is not assumed). [sent-71, score-0.313]
46 In contrast, we exploit the availability of multiple views (e.g., multiple words in a document), which are drawn independently conditioned on the same hidden state. [sent-75, score-0.157]
47 Furthermore, the exchangeability assumption permits us to have an arbitrary noise model (rather than additive Gaussian noise, which is not appropriate for multinomial and other discrete distributions). [sent-78, score-0.221]
48 This construction bridges the gap between the single topic models (as in [22, 24]) and the independent latent factors model. [sent-80, score-0.472]
49 More generally, the multi-view approach has been exploited in previous works for semi-supervised learning and for learning mixtures of well-separated distributions (e. [sent-81, score-0.126]
50 These previous works essentially use variants of canonical correlation analysis [34] between the two views. [sent-84, score-0.152]
51 2 The independent latent factors and LDA models Let h = (h1 , h2 , . . . , hk ) ∈ Rk be a random vector specifying the latent factors (i.e., the hidden state) of a model, where hi is the value of the i-th factor. [sent-86, score-0.243]
54 The goal is to estimate the matrix O, sometimes referred to as the topic matrix. [sent-109, score-0.229]
55 1 Independent latent factors model In the independent latent factors model, we assume h has a product distribution, i.e., the coordinates h1 , . . . , hk are independent. [sent-112, score-0.456]
56 Multiple mixtures of Gaussians: Suppose xv = Oh + η, where η is Gaussian noise and h is a binary vector (under a product distribution). [sent-119, score-0.459]
57 This generalizes the classic mixture of k Gaussians, as the model now permits any number of Gaussians to be responsible for generating the hidden state. [sent-121, score-0.171]
58 Multiple mixtures of Poissons: Suppose [Oh]j specifies the Poisson rate of counts for [xv ]j . [sent-127, score-0.13]
59 For example, xv could be a vector of word counts in the v-th sentence of a document. [sent-128, score-0.444]
60 Here, O would be a matrix with positive entries, and hi would scale the rate at which topic i generates words in a sentence (as specified by the i-th column of O). [sent-129, score-0.469]
61 Here, multiple topics may be responsible for generating the words in each sentence. [sent-131, score-0.217]
62 As α0 → 0, the distribution degenerates to one over pure topics (i.e., h concentrates on a single coordinate, so each document is effectively generated by a single topic). [sent-138, score-0.139]
63 Here, xv represents what the v-th word in a document is, so d represents the number of words in the language. [sent-146, score-0.622]
64 The sampling process for a document is as follows. [sent-148, score-0.133]
65 First, the topic mixture h is drawn from the Dirichlet distribution. [sent-149, score-0.308]
66 Then, the v-th word in the document (for v = 1, 2, . . .) is generated by (i) drawing a topic t ∈ {1, 2, . . . , k} according to the discrete distribution specified by h, then (ii) drawing xv according to the discrete distribution specified by Ot (the t-th column of O). [sent-150, score-0.221]
68 For this model to fit in our setting, we use the “one-hot” encoding for xv from [22]: xv ∈ {0, 1}d with [xv ]j = 1 iff the v-th word in the document is the j-th word in the vocabulary. [sent-158, score-0.949]
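To make the generative process above concrete, here is a minimal sketch of sampling a single LDA document with the one-hot encoding; the function and variable names are ours, not the paper's, and O is assumed column-stochastic.

```python
# A minimal sketch (ours) of the LDA sampling process described above,
# using the one-hot encoding for each word x_v.
import numpy as np

def sample_lda_document(O, alpha, doc_len, rng):
    """O: d x k topic matrix (column t is the word distribution of topic t);
    alpha: length-k Dirichlet parameter; returns doc_len one-hot word vectors."""
    d, k = O.shape
    h = rng.dirichlet(alpha)              # topic mixture for this document
    words = np.zeros((doc_len, d))
    for v in range(doc_len):
        t = rng.choice(k, p=h)            # (i) draw a topic according to h
        j = rng.choice(d, p=O[:, t])      # (ii) draw a word from column O_t
        words[v, j] = 1.0                 # one-hot: [x_v]_j = 1
    return words

rng = np.random.default_rng(0)
O = rng.dirichlet(np.ones(50), size=3).T          # d = 50 words, k = 3 topics
doc = sample_lda_document(O, np.array([0.5, 0.3, 0.2]), doc_len=3, rng=rng)
```

With doc_len = 3, each document already provides the trigram statistics (third-order moments) that the method needs.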
69 3 Excess Correlation Analysis (ECA) We now present efficient algorithms for exactly recovering O from low-order moments of the observed variables. [sent-161, score-0.349]
70 The algorithm is based on two singular value decompositions: the first SVD whitens the data (based on the correlation between two variables), and the second SVD is carried out on a k × k matrix formed from the whitened third-order moments. Algorithm 1 (ECA, with skewed factors) takes as input a vector θ ∈ Rk and the moments Pairs and Triples. [sent-162, score-0.655]
71 SVD: Let Ξ be the set of left singular vectors of W⊤ Triples(W θ)W corresponding to non-repeated singular values (i.e., singular values of multiplicity one). [sent-170, score-0.202]
72 We start with the case of independent factors, as these algorithms make the basic diagonalization approach clear. [sent-176, score-0.125]
73 1 Independent and skewed latent factors Define the following moments: µ := E[x1 ], Pairs := E[(x1 − µ) ⊗ (x2 − µ)], Triples := E[(x1 − µ) ⊗ (x2 − µ) ⊗ (x3 − µ)] (here ⊗ denotes the tensor product, so µ ∈ Rd , Pairs ∈ Rd×d , and Triples ∈ Rd×d×d ). [sent-179, score-0.363]
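The "plug-in" estimators mentioned in the summary of contributions simply replace these expectations with sample averages over N independent triples of views; a sketch follows (pooling the three views to estimate µ is our choice, not dictated by the text).

```python
# A sketch of plug-in estimators for mu, Pairs, and Triples as defined above,
# given N independent triples of exchangeable views (rows of X1, X2, X3).
import numpy as np

def empirical_moments(X1, X2, X3):
    """X1, X2, X3: N x d arrays. Returns estimates of (mu, Pairs, Triples)."""
    n = X1.shape[0]
    mu = (X1.mean(axis=0) + X2.mean(axis=0) + X3.mean(axis=0)) / 3.0
    Y1, Y2, Y3 = X1 - mu, X2 - mu, X3 - mu
    pairs = Y1.T @ Y2 / n                                  # estimates E[(x1-mu) (x) (x2-mu)]
    triples = np.einsum('ni,nj,nl->ijl', Y1, Y2, Y3) / n   # d x d x d tensor estimate
    return mu, pairs, triples
```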
74 To see the latter, observe the distribution of any xv is unaltered if, for any i ∈ [k], we multiply the i-th column of O by a scalar c ≠ 0 and divide the variable hi by the same scalar c. [sent-183, score-0.479]
75 In canonical form, O is unique up to a signed column permutation. [sent-194, score-0.143]
76 Let µi,p := E[(hi − E[hi ])^p] denote the p-th central moment of hi , so the variance and skewness of hi are given by σi² := µi,2 and γi := µi,3 /σi³ . [sent-195, score-0.331]
77 Under the independent latent factor model, the following hold. [sent-201, score-0.163]
78 • No False Positives: For all θ ∈ Rk , Algorithm 1 returns a subset of the columns of O, in canonical form up to sign. [sent-202, score-0.177]
79 If θ ∈ Rk is drawn uniformly at random from the unit sphere S k−1 , then with probability 1, Algorithm 1 returns all columns of O, in canonical form up to sign. [sent-204, score-0.205]
80 Under the independent latent factor model, Pairs = Σi=1..k σi² Oi ⊗ Oi = O diag(σ1², σ2², . . . , σk²) O⊤. [sent-208, score-0.163]
81 Since M is orthogonal, the above is an eigendecomposition of W⊤ Triples(W θ)W, and hence the set of left singular vectors corresponding to non-repeated singular values is uniquely defined up to sign. [sent-233, score-0.202]
82 Each such singular vector ξ is of the form si M ei = si W⊤ O ei = si W⊤ Oi for some i ∈ [k] and si ∈ {±1}, so (W+)⊤ ξ = si W (W⊤ W)−1 W⊤ Oi = si Oi (because range(W) = range(U) = range(O)). [sent-234, score-0.383]
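Putting these pieces together, the following is a minimal sketch of Algorithm 1 acting on exact or plug-in estimates of Pairs and Triples; under the canonical form discussed above it returns the columns of O up to sign. All names and defaults are ours, not the paper's.

```python
# A sketch of ECA (Algorithm 1, skewed factors): whiten Pairs, take the second
# SVD of the k x k matrix W^T Triples(W theta) W, and map singular vectors
# back through (W^+)^T. Returns candidate columns of O up to sign.
import numpy as np

def eca_skewed(pairs, triples, k, theta=None, rng=None):
    # Whiten: W is d x k with W.T @ pairs @ W (approximately) the k x k identity.
    U, S, _ = np.linalg.svd(pairs)
    U, S = U[:, :k], S[:k]
    W = U / np.sqrt(S)
    if theta is None:
        rng = rng or np.random.default_rng()
        theta = rng.standard_normal(k)
        theta /= np.linalg.norm(theta)        # uniform direction on the sphere
    # Triples(eta) contracts the third mode of the tensor with eta.
    M = W.T @ np.einsum('ijl,l->ij', triples, W @ theta) @ W
    Xi, _, _ = np.linalg.svd(M)               # left singular vectors (distinct w.p. 1)
    # Reconstruct: (W^+)^T xi = U diag(sqrt(S)) xi recovers +/- O_i.
    B = U * np.sqrt(S)
    return B @ Xi
```

For simplicity this sketch whitens via a full SVD of the d × d matrix Pairs; the scalability claim in the abstract relies on arranging the computation so that the SVDs act only on k × k matrices, which this sketch does not attempt.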
83 Even though the distribution on h is proportional to the product h1^(α1−1) h2^(α2−1) · · · hk^(αk−1), the hi are not independent because h is constrained to live in the simplex. [sent-244, score-0.15]
84 These mild dependencies suggest using a certain correction of the moments with ECA. [sent-245, score-0.323]
85 Algorithm 2 (ECA for Latent Dirichlet Allocation) takes as input a vector θ ∈ Rk and the modified moments Pairsα0 and Triplesα0 . [sent-254, score-0.323]
86 In this case (α0 → 0), the modified moments tend to the raw (cross) moments: Pairsα0 → E[x1 ⊗ x2 ] and Triplesα0 → E[x1 ⊗ x2 ⊗ x3 ] as α0 → 0. [sent-264, score-0.415]
87 At the other extreme, as α0 → ∞ the modified moments tend to the central moments: Pairsα0 → E[(x1 − µ) ⊗ (x2 − µ)] and Triplesα0 → E[(x1 − µ) ⊗ (x2 − µ) ⊗ (x3 − µ)] (to see this, expand the central moment and use exchangeability: E[x1 ⊗ x2 ] = E[x2 ⊗ x3 ] = E[x1 ⊗ x3 ]). [sent-266, score-0.442]
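For reference, here is a sketch of computing the modified moments from the raw cross moments. The text above does not spell out the correction coefficients, so the specific form below is our assumption; it is chosen so that it reproduces exactly the two limits just stated (raw moments as α0 → 0, central moments as α0 → ∞).

```python
# A sketch of the modified LDA moments built from the raw cross moments.
# The coefficients are an assumption on our part (not given in the text above),
# checked against the stated limits: alpha0 -> 0 gives the raw moments and
# alpha0 -> infinity gives the central moments.
import numpy as np

def lda_modified_moments(E12, E123, mu, alpha0):
    """E12 ~ E[x1 (x) x2] (d x d), E123 ~ E[x1 (x) x2 (x) x3] (d x d x d), mu ~ E[x1]."""
    pairs_a0 = E12 - (alpha0 / (alpha0 + 1.0)) * np.outer(mu, mu)
    # By exchangeability, E[x1 (x) x2] = E[x1 (x) x3] = E[x2 (x) x3] = E12, so all
    # three "pairwise-with-mu" terms can be built from E12 and mu.
    cross = (np.einsum('ij,l->ijl', E12, mu)
             + np.einsum('il,j->ijl', E12, mu)
             + np.einsum('jl,i->ijl', E12, mu))
    triples_a0 = (E123
                  - (alpha0 / (alpha0 + 2.0)) * cross
                  + (2.0 * alpha0 ** 2 / ((alpha0 + 1.0) * (alpha0 + 2.0)))
                  * np.einsum('i,j,l->ijl', mu, mu, mu))
    return pairs_a0, triples_a0
```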
88 The proof is similar to that of Theorem 3.1, except here we must use the specific properties of the Dirichlet distribution to show that the corrections to the raw (cross) moments have the desired effect. [sent-282, score-0.351]
89 This is akin to O being in canonical form as per the skewed factor model of Theorem 3.1. [sent-289, score-0.161]
90 ECA seamlessly interpolates between the single topic model (α0 → 0) of [22] and the skewness-based ECA, Algorithm 1 (α0 → ∞). [sent-295, score-0.229]
91 1 Discussion Sample complexity It is straightforward to derive a “plug-in” variant of Algorithm 2 based on empirical moments rather than exact population moments. [sent-297, score-0.378]
92 The empirical moments are formed using the word co-occurrence statistics for documents in a corpus. [sent-298, score-0.445]
93 Let pmin := mini αi /α0 and let σk (O) denote the smallest (non-zero) singular value of O. [sent-304, score-0.147]
94 Suppose that we obtain N ≥ C1 · ((α0 + 1)/(pmin σk (O)²))² independent samples of x1 , x2 , x3 in the LDA model, which are used to form empirical moments Pairsα0 and Triplesα0 . [sent-305, score-0.353]
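With the one-hot encoding described earlier, the empirical raw moments are just normalized word co-occurrence counts over documents with at least three words. The following minimal sketch uses only the first three words of each document (our simplification; averaging over all word triples per document is a natural lower-variance alternative).

```python
# A minimal sketch of empirical raw cross moments from a corpus where every
# document has at least three words; the outputs can then be passed through
# the alpha0 corrections sketched earlier to form Pairs_alpha0 and Triples_alpha0.
import numpy as np

def empirical_lda_moments(docs, d):
    """docs: list of lists of word indices (each of length >= 3); d: vocabulary size."""
    n = len(docs)
    mu = np.zeros(d)
    E12 = np.zeros((d, d))
    E123 = np.zeros((d, d, d))
    for doc in docs:
        w1, w2, w3 = doc[0], doc[1], doc[2]
        mu[w1] += 1.0            # one-hot encoding: E[x1] is a word-frequency vector
        E12[w1, w2] += 1.0       # co-occurrence counts estimate E[x1 (x) x2]
        E123[w1, w2, w3] += 1.0  # trigram counts estimate E[x1 (x) x2 (x) x3]
    return mu / n, E12 / n, E123 / n
```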
95 Alternative decomposition methods Algorithm 1 is a theoretically efficient and simple-to-state method for obtaining the desired decomposition of the tensor Triples = Σi=1..k µi,3 Oi ⊗ Oi ⊗ Oi (a similar tensor form for Triplesα0 in the case of LDA can also be given). [sent-311, score-0.244]
96 However, in practice the method is not particularly stable, due to the use of internal randomization to guarantee strict separation of singular values. [sent-312, score-0.21]
97 This tensor structure can be exploited to derive very efficient estimation algorithms for all the models considered here (and others) based on a tensor power iteration. [sent-316, score-0.125]
98 We have used a simplified version of this tensor power iteration in preliminary experiments for estimating topic models, and found the results (Appendix A) to be very encouraging, especially due to the speed and robustness of the algorithm. [sent-317, score-0.322]
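For completeness, here is a simplified sketch of a tensor power iteration with deflation of the kind alluded to above. It is not the exact procedure used for the paper's preliminary experiments, and it assumes the input tensor has an (approximately) orthogonal symmetric decomposition, as is the case after whitening; restart and iteration counts are illustrative.

```python
# A simplified tensor power iteration with deflation for a symmetric k x k x k
# tensor T ~ sum_i lam_i v_i (x) v_i (x) v_i with (near-)orthonormal v_i.
import numpy as np

def tensor_power_method(T, k, n_restarts=10, n_iters=100, rng=None):
    rng = rng or np.random.default_rng()
    T = T.copy()
    lams, vecs = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijl,j,l->i', T, v, v)        # power update T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijl,i,j,l->', T, v, v, v)      # eigenvalue estimate
            if lam > best_lam:
                best_v, best_lam = v, lam
        vecs.append(best_v)
        lams.append(best_lam)
        T -= best_lam * np.einsum('i,j,l->ijl', best_v, best_v, best_v)  # deflate
    return np.array(lams), np.array(vecs)
```

In practice one would apply this to the whitened k × k × k tensor formed from Triples and W (as in Algorithm 1) and then map the recovered vectors back through (W+)⊤, rather than to the raw d × d × d tensor.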
99 Two SVDs Suffice: Spectral Decompositions for Probabilistic Topic Models and Latent Dirichlet Allocation, 2012. [sent-375, score-0.752]
100 A Method of Moments for Mixture Models and Hidden Markov Models. [sent-457, score-0.422]
wordName wordTfidf (topN-words)
[('triples', 0.409), ('moments', 0.323), ('xv', 0.32), ('topic', 0.229), ('oi', 0.226), ('eca', 0.209), ('lda', 0.187), ('dirichlet', 0.156), ('latent', 0.133), ('document', 0.133), ('hi', 0.12), ('separation', 0.109), ('svds', 0.107), ('oh', 0.105), ('canonical', 0.104), ('topics', 0.103), ('singular', 0.101), ('diagonalization', 0.095), ('mixtures', 0.094), ('tensor', 0.093), ('rk', 0.093), ('focs', 0.09), ('word', 0.088), ('diag', 0.085), ('pairs', 0.085), ('decompositions', 0.081), ('words', 0.081), ('factors', 0.08), ('allocation', 0.073), ('exchangeable', 0.069), ('hsu', 0.067), ('kakade', 0.066), ('provable', 0.066), ('anandkumar', 0.062), ('svd', 0.06), ('decomposition', 0.058), ('skewed', 0.057), ('moment', 0.055), ('algebraic', 0.055), ('exchangeability', 0.053), ('moitra', 0.053), ('excess', 0.051), ('ge', 0.051), ('mixture', 0.051), ('colt', 0.049), ('arora', 0.049), ('correlation', 0.048), ('hidden', 0.048), ('chaudhuri', 0.047), ('ej', 0.047), ('pmin', 0.046), ('whitens', 0.046), ('spectral', 0.046), ('noise', 0.045), ('representational', 0.045), ('rd', 0.044), ('active', 0.044), ('returns', 0.043), ('ei', 0.042), ('si', 0.04), ('permits', 0.039), ('dean', 0.039), ('column', 0.039), ('remark', 0.038), ('counts', 0.036), ('degenerates', 0.036), ('skewness', 0.036), ('condition', 0.035), ('recovers', 0.034), ('documents', 0.034), ('responsible', 0.033), ('lim', 0.032), ('pr', 0.032), ('modi', 0.032), ('exploited', 0.032), ('simultaneous', 0.031), ('nmf', 0.031), ('oe', 0.031), ('permit', 0.031), ('cross', 0.03), ('independent', 0.03), ('positives', 0.03), ('learnability', 0.03), ('sham', 0.03), ('columns', 0.03), ('recovery', 0.03), ('discrete', 0.029), ('foster', 0.029), ('kalai', 0.029), ('provably', 0.029), ('microsoft', 0.029), ('drawn', 0.028), ('award', 0.028), ('claims', 0.028), ('rather', 0.028), ('assumption', 0.028), ('raw', 0.028), ('additive', 0.027), ('variant', 0.027), ('observed', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1
2 0.25500116 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
Author: Michael Paul, Mark Dredze
Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1
3 0.18247375 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees
Author: Percy Liang, Daniel J. Hsu, Sham M. Kakade
Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods [1] cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. 1
4 0.14934739 12 nips-2012-A Neural Autoregressive Topic Model
Author: Hugo Larochelle, Stanislas Lauly
Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1
5 0.13876833 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
Author: James T. Kwok, Ryan P. Adams
Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1
6 0.13867143 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
7 0.13658768 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
8 0.12727492 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
9 0.12419671 180 nips-2012-Learning Mixtures of Tree Graphical Models
10 0.12129839 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features
11 0.12076014 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
12 0.1197231 147 nips-2012-Graphical Models via Generalized Linear Models
13 0.11894894 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models
14 0.11050036 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
15 0.11024596 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis
16 0.10496527 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
17 0.10296921 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
18 0.095984511 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
19 0.0874734 126 nips-2012-FastEx: Hash Clustering with Exponential Families
20 0.079369575 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
topicId topicWeight
[(0, 0.205), (1, 0.081), (2, 0.001), (3, -0.04), (4, -0.217), (5, -0.034), (6, -0.017), (7, -0.021), (8, 0.054), (9, -0.047), (10, 0.255), (11, 0.13), (12, 0.046), (13, -0.028), (14, 0.029), (15, 0.103), (16, 0.153), (17, 0.112), (18, 0.087), (19, 0.076), (20, -0.026), (21, 0.008), (22, -0.069), (23, -0.053), (24, 0.022), (25, 0.023), (26, 0.053), (27, -0.02), (28, 0.06), (29, -0.012), (30, 0.083), (31, 0.028), (32, 0.048), (33, 0.079), (34, 0.01), (35, -0.014), (36, -0.092), (37, 0.034), (38, 0.175), (39, 0.032), (40, 0.013), (41, -0.065), (42, -0.006), (43, 0.007), (44, -0.011), (45, -0.045), (46, 0.052), (47, 0.024), (48, 0.046), (49, -0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.96383464 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1
2 0.8242507 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models
Author: Michael Paul, Mark Dredze
Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1
3 0.80218208 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis
Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing
Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1
4 0.78135431 12 nips-2012-A Neural Autoregressive Topic Model
Author: Hugo Larochelle, Stanislas Lauly
Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1
5 0.69700861 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features
Author: Xianxing Zhang, Lawrence Carin
Abstract: A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring interpretable latent binary features for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. Advantages of the proposed model are demonstrated for prediction of votes on a new piece of legislation, based only on the observed text of legislation. The coupling of the text and legislation is also shown to yield insight into the properties of the matrix decomposition for roll-call data. 1
6 0.69464844 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
7 0.63429809 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees
8 0.60883015 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models
9 0.58142632 345 nips-2012-Topic-Partitioned Multinetwork Embeddings
10 0.57076877 192 nips-2012-Learning the Dependency Structure of Latent Factors
11 0.53586918 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
12 0.5303874 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
13 0.52550983 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior
14 0.52174866 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
15 0.51419973 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
16 0.48626921 180 nips-2012-Learning Mixtures of Tree Graphical Models
17 0.47912249 22 nips-2012-A latent factor model for highly multi-relational data
18 0.47218198 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
19 0.47166425 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
20 0.46409708 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
topicId topicWeight
[(0, 0.106), (14, 0.198), (21, 0.019), (38, 0.143), (42, 0.044), (54, 0.04), (55, 0.011), (74, 0.062), (76, 0.145), (80, 0.107), (92, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.8631084 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1
2 0.85565889 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
Author: Florian T. Pokorny, Hedvig Kjellström, Danica Kragic, Carl Ek
Abstract: We present a novel method for learning densities with bounded support which enables us to incorporate ‘hard’ topological constraints. In particular, we show how emerging techniques from computational algebraic topology and the notion of persistent homology can be combined with kernel-based methods from machine learning for the purpose of density estimation. The proposed formalism facilitates learning of models with bounded support in a principled way, and – by incorporating persistent homology techniques in our approach – we are able to encode algebraic-topological constraints which are not addressed in current state of the art probabilistic models. We study the behaviour of our method on two synthetic examples for various sample sizes and exemplify the benefits of the proposed approach on a real-world dataset by learning a motion model for a race car. We show how to learn a model which respects the underlying topological structure of the racetrack, constraining the trajectories of the car. 1
3 0.82395887 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
Author: Aaron Defazio, Tibério S. Caetano
Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.
4 0.80404419 254 nips-2012-On the Sample Complexity of Robust PCA
Author: Matthew Coudron, Gilad Lerman
Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N −0.5+ε ) for arbitrarily small ε > 0 (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., O(N −0.5 ). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D2+δ ) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with Frobenius norm is O(D2 )). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm. 1
5 0.78271824 192 nips-2012-Learning the Dependency Structure of Latent Factors
Author: Yunlong He, Yanjun Qi, Koray Kavukcuoglu, Haesun Park
Abstract: In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model SLFA is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lower-dimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic data and two real-world data sets demonstrate that pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve the state-of-the-art classification performance. 1
6 0.78255451 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
7 0.77368647 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
8 0.77334869 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
9 0.77087718 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
10 0.77004236 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
11 0.76998311 233 nips-2012-Multiresolution Gaussian Processes
12 0.7680741 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
13 0.76548165 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
14 0.76391846 69 nips-2012-Clustering Sparse Graphs
15 0.76345354 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables
16 0.76204896 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System
17 0.76002961 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
18 0.75973731 282 nips-2012-Proximal Newton-type methods for convex optimization
19 0.75947434 292 nips-2012-Regularized Off-Policy TD-Learning
20 0.75911337 199 nips-2012-Link Prediction in Graphs with Autoregressive Features