nips nips2012 nips2012-220 knowledge-graph by maker-knowledge-mining

220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models


Source: pdf

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. [sent-9, score-0.743]

2 This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. [sent-10, score-0.369]

3 In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. [sent-12, score-0.583]

4 1 Introduction Topic models, such as Latent Dirichlet Allocation (LDA) [3], have shown great promise in discovering latent semantic representations of large collections of text documents. [sent-14, score-0.12]

5 One notable extension is supervised topic models, which were developed to incorporate supervising side information for discovering predictive latent topic representations. [sent-16, score-0.654]

6 MedLDA differs from its counterpart supervised topic models by imposing discriminative constraints (i. [sent-18, score-0.358]

7 , max-margin constraints) directly on the desired posterior distributions, instead of defining a normalized likelihood model as in sLDA and DiscLDA. [sent-20, score-0.082]

8 Such topic models with max-margin posterior constraints have shown superior performance in various settings [16, 14, 13, 9]. [sent-21, score-0.361]

9 Previous inference methods for such models with max-margin posterior constraints have been based exclusively on variational methods [7], usually with a strict mean-field assumption. [sent-23, score-0.213]

10 In this paper, we develop efficient Monte Carlo methods for max-margin supervised topic models, which we believe is crucial for a highly scalable implementation and for further performance enhancement of this class of models. [sent-30, score-0.287]

11 Specifically, we first provide a new and equivalent formulation of the MedLDA as a regularized Bayesian model with max-margin posterior constraints, based on Zellner’s interpretation of Bayes’ rule as a learning model [15] and the recent development of regularized Bayesian inference [17]. [sent-31, score-0.202]

12 This interpretation is arguably more natural than the original formulation of MedLDA as a hybrid max-likelihood and max-margin learning, where the log-likelihood is approximated by a variational upper bound for computational tractability. [sent-32, score-0.107]

13 Then, we deal with the set of soft max-margin constraints with convex duality methods and derive the optimal solutions of the desired posterior distributions. [sent-33, score-0.124]

14 To effectively reduce the size of the sampling space, we develop two samplers, namely, an importance sampler and a collapsed Gibbs sampler [4, 1], with a much weaker assumption on the desired posterior distribution compared to the mean field methods in [16]. [sent-34, score-0.485]

15 We note that the work [11] presents a duality method to handle moment matching constraints in maximum entropy models. [sent-35, score-0.094]

16 Our work is an extension of their results to learn topic models, which have nontrivially structured latent variables and also use the general soft margin constraints. [sent-36, score-0.309]

17 2 Latent Dirichlet Allocation LDA [3] is a hierarchical Bayesian model that posits each document as an admixture of K topics, where each topic Φk is a multinomial distribution over a V -word vocabulary. [sent-37, score-0.276]

18 For document d, its topic proportion θ d is a multinomial distribution drawn from a Dirichlet prior. [sent-38, score-0.276]

19 Let w_d = {w_{dn}}_{n=1}^{N_d} denote the words appearing in document d. [sent-39, score-0.091]

20 For the n-th word w_{dn}, a topic assignment z_{dn} = k is drawn from θ_d and w_{dn} is drawn from Φ_k. [sent-40, score-0.761]

21 In short, the generative process of document d is θ_d ∼ Dir(α), z_{dn} = k ∼ Mult(θ_d), w_{dn} ∼ Mult(Φ_k), (1) where Dir(·) is a Dirichlet distribution and Mult(·) is a multinomial. [sent-41, score-0.416]

22 For fully-Bayesian LDA, the topics are also random samples drawn from a Dirichlet prior, i. [sent-42, score-0.084]
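
To make the generative process above concrete, here is a minimal NumPy sketch of sampling a corpus from fully-Bayesian LDA. The corpus sizes, vocabulary size, and hyperparameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's settings).
K, V, D, N_d = 5, 1000, 100, 50
alpha, beta = np.full(K, 0.1), np.full(V, 0.01)

rng = np.random.default_rng(0)

# Fully-Bayesian LDA: topics Phi_k ~ Dir(beta), per-document theta_d ~ Dir(alpha).
Phi = rng.dirichlet(beta, size=K)       # K x V topic-word distributions
theta = rng.dirichlet(alpha, size=D)    # D x K document-topic proportions

# For each word position: z_dn ~ Mult(theta_d), w_dn ~ Mult(Phi_{z_dn}).
Z = np.array([rng.choice(K, size=N_d, p=theta[d]) for d in range(D)])
W = np.array([[rng.choice(V, p=Phi[Z[d, n]]) for n in range(N_d)] for d in range(D)])
```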

23 Let W = {w_d}_{d=1}^D denote all the words in a corpus with D documents, and define z_d = {z_{dn}}_{n=1}^{N_d}, Z = {z_d}_{d=1}^D, Θ = {θ_d}_{d=1}^D. [sent-45, score-0.093]

24 The goal of LDA is to infer the posterior distribution p(Θ, Z, Φ|W, α, β) = p0(Θ, Z, Φ|α, β) p(W|Θ, Z, Φ) / p(W|α, β). (2) [sent-46, score-0.1]

25 Since inferring the true posterior distribution is intractable, researchers must resort to variational [3] or Monte Carlo [5] approximate methods. [sent-47, score-0.144]

26 3 MedLDA: a supervised topic model with max-margin constraints MedLDA extends LDA by integrating max-margin learning into the procedure of discovering latent topic representations, so as to learn representations that are good for predicting the class labels or rating scores of a document. [sent-53, score-0.772]

27 Empirically, MedLDA and its various extensions [14, 13, 9] have demonstrated promise in learning more discriminative topic representations. [sent-54, score-0.266]

28 The original MedLDA was designed as a hybrid max-likelihood and max-margin learning problem, where the intractable log-likelihood is approximated by a variational bound. [sent-55, score-0.101]

29 To derive our sampling methods, we present a new interpretation of MedLDA from the perspective of regularized Bayesian inference [17]. [sent-56, score-0.105]

30 As shown in Eq. (2), Bayesian inference is an information-processing rule that projects the prior p0 and empirical evidence to a post-data posterior distribution via Bayes' rule. [sent-59, score-0.131]

31 It is the core for likelihood-based supervised topic models [2, 12]. [sent-60, score-0.287]

32 Specifically, Zellner showed that the posterior distribution by Bayes’ rule is the solution of an optimization problem. [sent-62, score-0.104]

33 For instance, the posterior p(Θ, Z, Φ|W) of LDA is equivalent to the optimum solution of min_{p(Θ,Z,Φ)∈P} KL[p(Θ, Z, Φ) ∥ p0(Θ, Z, Φ)] − E_p[log p(W|Θ, Z, Φ)], (3) where KL(q||p) is the Kullback-Leibler divergence from q to p, and P is the space of probability distributions. [sent-63, score-0.113]

34 Let D = {(w_d, y_d)}_{d=1}^D be a given fully-labeled training set, where the response variable Y takes values from a finite set Y = {1, . [sent-67, score-0.089]

35 Since our goal is to discover latent representations Z that are good for classification, one natural solution is to connect Z directly to our ultimate goal. [sent-75, score-0.086]

36 Then, we want to infer the joint posterior distribution p(η, Θ, Z, Φ|D). [sent-80, score-0.1]

37 With the above definitions, a natural prediction rule is ŷ = argmax_y F(y; w), (5) and we would like to “regularize” the properties of the latent topic representations to make them suitable for a classification task. [sent-82, score-0.37]
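
As a small illustration of the prediction rule, the sketch below scores each class with the posterior-mean classifier κ applied to the expected topic proportions E[z̄] of a document. The particular form F(y; w) = κ_y · E[z̄] is an assumption consistent with the later description of κ, not a verbatim definition from the paper.

```python
import numpy as np

def predict_label(kappa, z_bar):
    """kappa: (num_classes, K) posterior mean of eta; z_bar: (K,) expected topic proportions.
    Returns argmax_y of the assumed discriminant F(y; w) = kappa[y] . z_bar."""
    scores = kappa @ z_bar
    return int(np.argmax(scores))
```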

38 Let L(p) = KL(p || p0(η, Θ, Z, Φ)) − E_p[log p(W|Z, Φ)] and ∆f(y, z̄_d) = f(y_d, z̄_d) − f(y, z̄_d). [sent-84, score-0.279]

39 s.t.: E_p[η^⊤ ∆f(y, z̄_d)] ≥ ℓ_d(y) − ξ_d, ξ_d ≥ 0, ∀d, ∀y, where the prior is p0(η, Θ, Z, Φ) = p0(η) p0(Θ, Z, Φ). [sent-87, score-0.093]

40 Also, problem (7) can be equivalently written as min_{p(η,Θ,Z,Φ)∈P} L(p(η, Θ, Z, Φ)) + C R(p(η, Θ, Z, Φ)), (8) where R = (1/D) ∑_d max_y (ℓ_d(y) − E_p[η^⊤ ∆f(y, z̄_d)]) is the hinge loss, an upper bound of the prediction error on training data. [sent-89, score-0.139]
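
A minimal sketch of the hinge-loss term R for one document follows, assuming the discriminant is the dot product between the class-specific classifier and the expected topic features, and that ℓ_d(y) is a 0/1 cost; both assumptions are for illustration only.

```python
import numpy as np

def hinge_loss_doc(kappa, z_bar, y_true):
    """Multi-class hinge loss for one document under a 0/1 cost (assumption)."""
    scores = kappa @ z_bar                    # assumed per-class discriminant values
    margins = scores[y_true] - scores         # E[eta^T Delta f(y, z_bar)] per class
    costs = np.ones_like(scores)
    costs[y_true] = 0.0                       # ell_d(y) = 0 for the true label
    return float(np.max(costs - margins))     # max_y (ell_d(y) - margin_y)
```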

41 4 Monte Carlo methods for MedLDA As in other variants of topic models, it is intractable to solve problem (7) or the equivalent problem (8) directly. [sent-90, score-0.278]

42 It is easy to show that the variational EM method in [16] is a coordinate descent algorithm to solve problem (7), with the additional fully-factorized mean-field constraint p(η, Θ, Z, Φ) = p(η) (∏_d p(θ_d) ∏_n p(z_{dn})) (∏_k p(Φ_k)). (9) [sent-92, score-0.084]

43 Below, we present two MC sampling methods to solve the MedLDA problem with much weaker constraints on p, and thus they could be expected to produce more accurate solutions. [sent-93, score-0.117]

44 By using the Lagrangian method with multipliers λ, we have the optimum posterior distribution p(η) ∝ p0(η) exp(η^⊤ ∑_{d=1}^D ∑_y λ_d^y ∆f(y, E[z̄_d])). (11) [sent-101, score-0.15]

45 Note that κ is the posterior mean of the classifier parameters η, and the element κ_{yk} represents the contribution of topic k in classifying a data point to category y. [sent-109, score-0.319]
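
For intuition, the sketch below computes a posterior mean κ implied by Eq. (11) under the additional assumption of a standard normal prior p0(η), in which case the posterior is Gaussian with mean equal to the λ-weighted sum of the feature differences. The Gaussian-prior choice and the data structures are assumptions for illustration, not taken from the extracted text.

```python
import numpy as np

def posterior_mean_eta(lambdas, delta_f, dim):
    """lambdas[d][y]: dual variables lambda_d^y; delta_f[d][y]: vectors Delta f(y, E[z_bar_d]).
    Under a standard normal prior p0(eta) (an assumption), Eq. (11) gives a Gaussian
    posterior whose mean kappa is the lambda-weighted sum of the Delta f vectors."""
    kappa = np.zeros(dim)
    for duals, feats in zip(lambdas, delta_f):
        for y, lam in duals.items():
            kappa += lam * np.asarray(feats[y])
    return kappa
```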

46 Although in theory we can solve this subproblem again using Lagrangian dual methods, it would be hard to derive the dual objective function (if possible at all). [sent-115, score-0.117]

47 It is easy to show that by fixing ξ at ξ*, we will have the optimum solution p(Θ, Z, Φ) ∝ p(W, Z, Θ, Φ) exp((κ*)^⊤ ∑_{d,y} (λ_d^y)* ∆f(y, z̄_d)). (14) The differences between MedLDA and LDA lie in the above posterior distribution. [sent-117, score-0.113]

48 The first term is the same as the posterior of LDA (the evidence p(W) can be absorbed into the normalization constant). [sent-118, score-0.082]

49 The second term indicates the regularization effects due to the max-margin posterior constraints, which is consistent with our intuition. [sent-119, score-0.082]

50 For data that are hard to classify (i.e., around the decision boundary or misclassified), the second term will bias the model towards a new posterior distribution that favors more discriminative representations on these “hard” data points. [sent-122, score-0.147]

51 Now, the remaining problem is how to efficiently draw samples from p(Θ, Z, Φ) and estimate the expectations E[z̄] as accurately as possible, which are needed in learning the classification model. [sent-123, score-0.088]

52 Below, we present two representative samplers – an importance sampler and a collapsed Gibbs sampler. [sent-124, score-0.247]

53 1 Importance sampler To avoid dealing with the intractable normalization constant of p(Θ, Z, Φ), one natural choice is to use importance sampling. [sent-126, score-0.184]

54 However, directly applying importance sampling to p(Θ, Z, Φ) may cause some issues since importance sampling suffers from severe limitations in large sample spaces. [sent-128, score-0.207]

55 Since Eq. (14) has the factorization form p(Θ, Z, Φ) = p0(Θ, Φ) p(Z|Θ, Φ), another possible method is to adopt the ancestral sampling strategy: draw a sample (Θ̂, Φ̂) from p0(Θ, Φ) and then draw samples from p(Z|Θ̂, Φ̂). [sent-130, score-0.089]

56 Although it is easy to draw a sample from the Dirichlet prior p0(Θ, Φ) = Dir(α) Dir(β), it would require a large number of samples to get a robust estimate of the expectations E[Z]. [sent-131, score-0.2]

57 One feasible method to reduce the sample space is to collapse (Θ, Φ) out and directly draw samples from the marginal distribution p(Z). [sent-133, score-0.11]

58 However, this will cause tight couplings between the elements of Z and make the number of samples needed to estimate the expectation grow exponentially with the dimensionality of Z for an importance sampler. [sent-134, score-0.114]

59 A practical sampler for this collapsed distribution would be a Markov chain, as we present in the next section. [sent-135, score-0.185]

60 Here, we propose to use the MAP estimate of (Θ, Φ) as their “single sample” and proceed to draw samples of Z. [sent-136, score-0.088]

61 The difference (κ*_{y_d k} − κ*_{yk}) represents the different contributions of topic k in classifying d to the true category y_d and to a wrong category y. [sent-138, score-0.422]

62 If the difference is positive, topic k contributes to making a correct prediction for d; otherwise, it contributes to making a wrong prediction. [sent-139, score-0.283]

63 The above two steps are repeated until convergence, initializing (Θ̂, Φ̂) to be uniform, and the samples from the last iteration are used to estimate the expectation statistics needed in the problem of inferring p(η). [sent-141, score-0.129]

64 2 Collapsed Gibbs sampler As we have stated, another way to effectively reduce the sample space is to integrate out the intermediate variables (Θ, Φ) and build a Markov chain whose equilibrium distribution is the resulting marginal distribution p(Z). [sent-143, score-0.148]

65 We propose to use collapsed Gibbs sampling, which has been successfully used for LDA [5]. [sent-144, score-0.082]

66 For those data on the margin or misclassified (with non-zero Lagrange multipliers), the last term is non-zero and acts as a regularizer directly affecting the topic assignments of these difficult data. [sent-150, score-0.259]
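
The sketch below shows the shape of a collapsed Gibbs update consistent with this description: the standard LDA conditional multiplied by an extra factor coming from the max-margin regularizer for documents with non-zero Lagrange multipliers. The exact form of that factor is not reproduced in the extracted text, so `margin_term` is a hypothetical stand-in.

```python
import numpy as np

def gibbs_resample_token(C_dk, C_kt, C_k, d, t, alpha, beta, V, margin_term, rng):
    """One collapsed Gibbs update for a single token (its current topic is assumed
    to have already been removed from the counts).
    C_dk: doc-topic counts, C_kt: topic-word counts, C_k: topic totals.
    margin_term(k): log of the discriminative factor (hypothetical stand-in);
    it is 0 for documents with zero Lagrange multipliers."""
    K = C_dk.shape[1]
    log_p = (np.log(C_dk[d] + alpha)
             + np.log(C_kt[:, t] + beta)
             - np.log(C_k + V * beta)
             + np.array([margin_term(k) for k in range(K)]))
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return rng.choice(K, p=p)
```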

67 Once the Markov chain has converged (i.e., finished the burn-in stage), we draw J samples {Z^{(j)}} and estimate the expectation statistics as in Eq. (22). [sent-155, score-0.088]

68 (22): E[z̄_{dk}] = (1/N_d) ∑_{n=1}^{N_d} E[z_{dn}^k], ∀ z̄_{dk} ∈ z̄_d, and E[z_{dn}^k] = (1/J) ∑_{j=1}^J z_{dn}^{k,(j)}.
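
A small NumPy sketch of the estimator in Eq. (22), assuming the J post-burn-in samples for one document are stored as an integer array of topic assignments; array names are illustrative.

```python
import numpy as np

def expected_topic_proportions(Z_samples, K):
    """Z_samples: (J, N_d) topic assignments for one document over J post-burn-in samples.
    Returns E[z_bar_d], the K-vector of expected topic proportions (Eq. 22)."""
    one_hot = np.eye(K)[Z_samples]      # (J, N_d, K) indicators z_dn^k
    E_z_dn = one_hot.mean(axis=0)       # average over the J samples
    return E_z_dn.mean(axis=0)          # average over the N_d word positions
```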

69 3 Prediction To make predictions on unlabeled testing data using the prediction rule (5), we take the approach that has been adopted for the variational MedLDA, which uses a point estimate of the topics Φ̂ from training data and makes predictions based on it. [sent-157, score-0.252]

70 For the collapsed Gibbs sampler, an estimate of Φ using the samples is φ̂_{kt} ∝ (1/J) ∑_{j=1}^J C_k^{t(j)} + β_t, where C_k^{t(j)} is the number of times that term t is assigned to topic k in the j-th sample. [sent-161, score-0.371]

71 Given a new document w to be predicted, for the importance sampler the importance weight should be altered as γ_n^j = ∏_{k=1}^K (θ̂_k φ̂_{k w_n} / g(k))^{I(z_n^j = k)}. [sent-162, score-0.163]
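
A sketch of these altered importance weights for a test document follows, under the additional assumption of a uniform proposal g(k) = 1/K; variable names are illustrative.

```python
import numpy as np

def importance_weights(theta_hat, phi_hat, words, z_samples):
    """theta_hat: (K,), phi_hat: (K, V), words: (N,) word ids, z_samples: (J, N) proposal draws.
    Per-token weight gamma[j, n] = theta_hat[z] * phi_hat[z, w_n] / g(z), with uniform g(z) = 1/K."""
    K = theta_hat.shape[0]
    z = z_samples                                    # (J, N)
    return theta_hat[z] * phi_hat[z, words] * K      # elementwise per-token weights
```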

72 For the Gibbs sampler, we infer the latent components z of the document using the obtained Φ̂ as p(z_n = k | z_{¬n}) ∝ φ̂_{k w_n} (C_{¬n}^k + α_k), where C_{¬n}^k is the number of times that the terms in this document w are assigned to topic k with the n-th term excluded. [sent-165, score-0.344]
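
A minimal sketch of this Gibbs inference for a new document with the topics fixed at the point estimate Φ̂, following the conditional above; the number of sweeps and the random initialization are illustrative choices.

```python
import numpy as np

def infer_new_document(phi_hat, words, alpha, n_sweeps=50, rng=None):
    """Sample topic assignments z for a test document using
    p(z_n = k | z_{-n}) ∝ phi_hat[k, w_n] * (C_{-n}^k + alpha_k)."""
    rng = rng or np.random.default_rng()
    K = phi_hat.shape[0]
    z = rng.integers(K, size=len(words))
    counts = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_sweeps):
        for n, w in enumerate(words):
            counts[z[n]] -= 1.0                  # exclude the n-th term
            p = phi_hat[:, w] * (counts + alpha)
            p /= p.sum()
            z[n] = rng.choice(K, p=p)
            counts[z[n]] += 1.0
    return z, counts / counts.sum()
```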

73 5 Experiments We empirically evaluate the importance sampler and the Gibbs sampler for MedLDA (denoted by iMedLDA and gMedLDA, respectively) on the 20 Newsgroups data set with a standard list of stop words removed. [sent-168, score-0.268]

74 We use the cutting-plane algorithm [6] to solve the multi-class SVM to infer p(η) and solve for the Lagrange multipliers λ in MedLDA. [sent-171, score-0.127]

75 We draw J (e.g., J = 3 × K) samples {Z^{(j)}}_{j=1}^J from g(z) outside the iteration loop and only update the importance weights to save time. [sent-175, score-0.094]

76 We use symmetric Dirichlet priors for all LDA topic models, i. [sent-184, score-0.237]

77 To measure the sparsity of the latent representations of documents, we compute the average entropy over test documents: (1/|D_t|) ∑_{d∈D_t} H(θ_d). [sent-197, score-0.16]

78 We also measure the sparsity of the inferred topic distributions Φ in terms of the average entropy over topics, i.e., (1/K) ∑_{k=1}^K H(Φ_k). [sent-198, score-0.311]
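
Both sparsity measures amount to average Shannon entropies; a small sketch follows, assuming the θ rows (documents) and Φ rows (topics) are stored as NumPy arrays.

```python
import numpy as np

def avg_entropy(dists, eps=1e-12):
    """Average entropy of a set of discrete distributions, given row-wise as (n, dim)."""
    p = np.clip(dists, eps, 1.0)                 # clip to avoid log(0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

# avg_entropy(theta_test)  -> average entropy of document representations over test docs
# avg_entropy(phi)         -> average entropy of the K topic-word distributions
```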

79 1 Performance with different topic numbers This section compares gMedLDA and iMedLDA with baseline methods. [sent-206, score-0.237]

80 The baselines include the LDA model that uses collapsed Gibbs sampling [5] (denoted by gLDA) and the LDA model that uses fully-factorized variational methods [3] (denoted by fLDA). [sent-218, score-0.174]

81 For LDA models, we discover the latent representations of the training documents and use them to build a multi-class SVM classifier. [sent-219, score-0.146]
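
This two-stage baseline pipeline can be sketched as follows, with scikit-learn's LinearSVC standing in for the multi-class SVM; the regularization constant and the use of LinearSVC (rather than the cutting-plane solver of [6]) are illustrative substitutions.

```python
from sklearn.svm import LinearSVC

def train_lda_then_svm(theta_train, y_train, C=1.0):
    """Two-stage baseline: topic proportions from LDA are used as features for a
    multi-class linear SVM (LinearSVC is a stand-in for the cutting-plane solver in [6])."""
    clf = LinearSVC(C=C)
    clf.fit(theta_train, y_train)
    return clf

# accuracy = train_lda_then_svm(theta_train, y_train).score(theta_test, y_test)
```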

82 We set the Dirichlet hyperparameters to 0.01 for the other topic models, just as in the literature [5]. [sent-226, score-0.237]

83 Figure 1(b) shows the average entropy of the latent representations Θ over test documents. [sent-234, score-0.138]

84 This implies that the sampling methods for MedLDA can effectively concentrate the probability mass on just a few topics and thus discover more predictive topic representations. [sent-236, score-0.319]

85 However, fMedLDA yields the smallest entropy, which is mainly because the fully-factorized variational methods tend to produce overly compact results, e. [sent-237, score-0.083]

86 Figure 1(c) shows the average entropy of the topic distributions Φ over topics. [sent-241, score-0.289]

87 This is probably because many of the samples (topic assignments) generated by the proposal distribution are “incorrect”, but the importance sampler still assigns weights to these samples. [sent-247, score-0.242]

88 As a result, the inferred topic distributions are very dense and thus have a large entropy. [sent-248, score-0.237]

89 This is probably because, with a finite number of samples, the Gibbs sampler tends to produce an overly sparse estimate of E[Z], and a slightly stronger prior is helpful for dealing with the sample-sparsity issue. [sent-261, score-0.189]

90 In contrast, the importance sampler avoids this sparsity issue by using a uniform proposal distribution, which could make the samples cover all topic dimensions well. [sent-262, score-0.48]

91 We found that the sample size in the training process has almost no influence on the prediction accuracy, even when it equals 1. [sent-286, score-0.12]

92 In this experiment, we don’t bound the maximum number of iterations and allow the Gibbs sampler to run until the tolerance ϵ is reached. [sent-295, score-0.103]

93 This is mainly because gMedLDA alternately updates the posterior distribution and the Lagrange multipliers. [sent-299, score-0.101]

94 For instance, when the topic number is 90, gMedLDA takes 11,986 seconds on training when ϵ = 10^{-4} but 1,795 seconds when ϵ = 10^{-2}. [sent-302, score-0.258]

95 Figure 2(d) shows the classification accuracy of MedLDA with various inference methods as a function of iteration when the topic number is set at 30. [sent-306, score-0.296]

96 Figure 3 reports the total training time of the different models, which includes two phases – inferring the latent topic representations and training the SVMs. [sent-316, score-0.365]

97 6 Conclusions We have presented two Monte Carlo methods for MedLDA, a supervised topic model that uses max-margin constraints directly on the desired posterior distributions for discovering predictive latent topic representations. [sent-324, score-0.732]

98 Experimental results on the 20 Newsgroups data set show that Monte Carlo methods are robust to hyper-parameters and could yield very competitive results for such max-margin topic models. [sent-326, score-0.237]

99 A combination of topic models with max-margin learning for relation detection. [sent-402, score-0.237]

100 MedLDA: maximum margin supervised topic models for regression and classification. [sent-442, score-0.309]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gmedlda', 0.493), ('medlda', 0.439), ('imedlda', 0.355), ('zdn', 0.308), ('topic', 0.237), ('fmedlda', 0.2), ('lda', 0.137), ('glda', 0.123), ('wdn', 0.108), ('sampler', 0.103), ('zd', 0.093), ('gmedldak', 0.093), ('imedldak', 0.093), ('flda', 0.088), ('posterior', 0.082), ('collapsed', 0.082), ('dn', 0.077), ('yd', 0.068), ('gibbs', 0.066), ('variational', 0.062), ('importance', 0.062), ('ck', 0.054), ('carlo', 0.054), ('topics', 0.052), ('entropy', 0.052), ('wd', 0.052), ('monte', 0.051), ('dir', 0.05), ('latent', 0.05), ('supervised', 0.05), ('yk', 0.048), ('dk', 0.048), ('supervising', 0.046), ('zellner', 0.046), ('dirichlet', 0.045), ('constraints', 0.042), ('ep', 0.042), ('slda', 0.041), ('cd', 0.04), ('classi', 0.039), ('document', 0.039), ('documents', 0.039), ('sensitivity', 0.039), ('multipliers', 0.037), ('representations', 0.036), ('draw', 0.036), ('tsinghua', 0.035), ('discovering', 0.034), ('samples', 0.032), ('accuracy', 0.032), ('optimum', 0.031), ('mult', 0.031), ('mc', 0.031), ('disclda', 0.031), ('inseparable', 0.031), ('sampling', 0.03), ('discriminative', 0.029), ('lagrange', 0.028), ('eld', 0.028), ('inference', 0.027), ('svm', 0.026), ('bayes', 0.026), ('dual', 0.026), ('interpretation', 0.025), ('prediction', 0.025), ('bayesian', 0.025), ('kt', 0.024), ('proposal', 0.024), ('subproblem', 0.024), ('china', 0.023), ('regularized', 0.023), ('weaker', 0.023), ('sample', 0.023), ('newsgroups', 0.022), ('solve', 0.022), ('rule', 0.022), ('chain', 0.022), ('stable', 0.022), ('sparsity', 0.022), ('margin', 0.022), ('kw', 0.021), ('wrong', 0.021), ('cation', 0.021), ('training', 0.021), ('zn', 0.021), ('probably', 0.021), ('judge', 0.021), ('get', 0.021), ('lagrangian', 0.02), ('kl', 0.02), ('hybrid', 0.02), ('estimate', 0.02), ('dim', 0.019), ('collapse', 0.019), ('blei', 0.019), ('hard', 0.019), ('alternately', 0.019), ('intractable', 0.019), ('equals', 0.019), ('infer', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

2 0.1599406 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

3 0.12505659 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

4 0.12013762 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

5 0.11894894 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

6 0.10588497 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

7 0.095423929 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.094814375 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

9 0.090864867 126 nips-2012-FastEx: Hash Clustering with Exponential Families

10 0.085049711 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

11 0.077799082 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

12 0.076561831 12 nips-2012-A Neural Autoregressive Topic Model

13 0.076358959 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

14 0.073279686 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

15 0.069834083 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

16 0.06824553 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

17 0.064890839 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

18 0.062740244 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

19 0.06006686 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

20 0.058621142 111 nips-2012-Efficient Sampling for Bipartite Matching Problems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.149), (1, 0.054), (2, -0.019), (3, 0.006), (4, -0.193), (5, -0.054), (6, -0.0), (7, 0.027), (8, 0.112), (9, -0.072), (10, 0.113), (11, 0.142), (12, 0.069), (13, 0.037), (14, 0.03), (15, 0.006), (16, 0.006), (17, 0.007), (18, 0.007), (19, 0.065), (20, 0.036), (21, 0.013), (22, -0.032), (23, 0.009), (24, -0.013), (25, -0.065), (26, 0.004), (27, 0.067), (28, 0.013), (29, -0.02), (30, -0.021), (31, -0.019), (32, 0.036), (33, 0.0), (34, -0.009), (35, -0.034), (36, -0.011), (37, 0.053), (38, -0.007), (39, -0.056), (40, 0.072), (41, -0.039), (42, 0.028), (43, -0.009), (44, 0.06), (45, 0.064), (46, 0.011), (47, 0.019), (48, 0.017), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93674368 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

2 0.80513948 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

3 0.77893806 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

Author: Kosuke Fukumasu, Koji Eguchi, Eric P. Xing

Abstract: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models. 1

4 0.75776321 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

5 0.74934983 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

Author: Sean Gerrish, David M. Blei

Abstract: We develop a probabilistic model of legislative data that uses the text of the bills to uncover lawmakers’ positions on specific political issues. Our model can be used to explore how a lawmaker’s voting patterns deviate from what is expected and how that deviation depends on what is being voted on. We derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we demonstrate both improvement in heldout predictive performance and the model’s utility in interpreting an inherently multi-dimensional space. 1

6 0.74850279 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

7 0.73822534 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.73751432 12 nips-2012-A Neural Autoregressive Topic Model

9 0.68881023 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

10 0.65711242 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.64261317 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

12 0.64216393 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

13 0.5844183 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

14 0.58293861 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

15 0.56957304 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

16 0.516366 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

17 0.51436216 192 nips-2012-Learning the Dependency Structure of Latent Factors

18 0.49506271 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

19 0.49381581 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

20 0.47789142 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.087), (9, 0.01), (21, 0.032), (38, 0.112), (39, 0.012), (42, 0.026), (45, 0.263), (54, 0.023), (55, 0.023), (68, 0.014), (74, 0.049), (76, 0.123), (80, 0.078), (92, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82557297 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

same-paper 2 0.75025666 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

3 0.69595426 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

4 0.69500059 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

5 0.67133999 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

6 0.62482727 192 nips-2012-Learning the Dependency Structure of Latent Factors

7 0.61629707 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

8 0.6158582 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

9 0.61388797 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

10 0.61312103 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.61294502 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

12 0.61200875 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

13 0.61153007 233 nips-2012-Multiresolution Gaussian Processes

14 0.60923606 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

15 0.60887724 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

16 0.60701847 260 nips-2012-Online Sum-Product Computation Over Trees

17 0.60607141 69 nips-2012-Clustering Sparse Graphs

18 0.60596991 65 nips-2012-Cardinality Restricted Boltzmann Machines

19 0.60584062 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

20 0.6047762 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity