nips nips2013 nips2013-287 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Logistic-normal topic models can effectively discover correlation structures among latent topics. [sent-10, score-0.346]
2 However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. [sent-11, score-0.326]
3 Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. [sent-12, score-0.049]
4 This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. [sent-13, score-0.052]
5 To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. [sent-14, score-0.269]
6 1 Introduction In Bayesian models, though conjugate priors normally result in easier inference problems, nonconjugate priors could be more expressive in capturing desired model properties. [sent-16, score-0.091]
7 One popular example is admixture topic models which have obtained much success in discovering latent semantic structures from data. [sent-17, score-0.322]
8 For the most popular latent Dirichlet allocation (LDA) [5], a Dirichlet distribution is used as the conjugate prior for multinomial mixing proportions. [sent-18, score-0.116]
9 But a Dirichlet prior is unable to model topic correlation, which is important for understanding/visualizing the semantic structures of complex data, especially in large-scale applications. [sent-19, score-0.292]
10 One elegant extension of LDA is the logistic-normal topic models (aka correlated topic models, CTMs) [3], which use a logisticnormal prior to capture the correlation structures among topics effectively. [sent-20, score-0.685]
11 Along this line, many subsequent extensions have been developed, including dynamic topic models [4] that deal with time series via a dynamic linear system on the Gaussian variables and infinite CTMs [11] that can resolve the number of topics from data. [sent-21, score-0.37]
12 Although significant progress has been made on developing scalable inference algorithms for LDA using either distributed [10, 16, 1] or online [7] learning methods, the inference of logistic-normal topic models still remains a challenge, due to the non-conjugate priors. [sent-23, score-0.351]
13 Existing algorithms on learning logistic-normal topic models mainly rely on approximate techniques, e. [sent-24, score-0.232]
14 Furthermore, existing algorithms can only deal with small corpora and learn a limited number of topics. [sent-29, score-0.047]
15 It is important to develop scalable algorithms in order to apply the models to large collections of documents, which are becoming increasingly common in both scientific and engineering fields. [sent-30, score-0.065]
16 To address the limitations listed above, we develop a scalable Gibbs sampling algorithm for logisticnormal topic models, without making any restricting assumptions on the posterior distribution. [sent-31, score-0.388]
17 Moreover, the auxiliary variables are locally associated with each individual document, and this locality naturally allows us to develop a distributed sampler by splitting the documents into multiple subsets and allocating them to multiple machines. [sent-33, score-0.151]
18 We successfully apply the scalable inference algorithm to learning a correlation graph of thousands of topics on large corpora with millions of documents. [sent-35, score-0.299]
19 These results are the largest automatically learned topic correlation structures to our knowledge. [sent-36, score-0.326]
20 2 Logistic-Normal Topic Models Let W = {wd }D be a set of documents, where wd = {wdn }Nd denote the words appearing n=1 d=1 in document d of length Nd . [sent-37, score-0.074]
21 A hierarchical Bayesian topic model posits each document as an admixture of K topics, where each topic Φk is a multinomial distribution over a V -word vocabulary. [sent-38, score-0.557]
22 For Bayesian CTM, the topics are samples drawn from a prior, e. [sent-42, score-0.106]
23 Given a set of documents W, CTM infers the posterior distribution p(η, Z, Φ|W) ∝ p0 (η, Z, Φ)p(W|Z, Φ) by the Bayes’ rule. [sent-46, score-0.088]
24 This problem is generally hard because of the nonconjugacy between the normal prior and the logistic transformation function (can be seen as a likelihood model for θ). [sent-47, score-0.12]
25 3 Gibbs Sampling with Data Augmentation We now present a block-wise Gibbs sampling algorithm for logistic-normal topic models. [sent-50, score-0.268]
26 To improve mixing rates, we first integrate out the Dirichlet variables Φ, by exploring the conjugacy between a Dirichlet prior and multinomial likelihood. [sent-51, score-0.075]
27 Sampling Topic Assignments When the variables η = {ηd }D are given, we draw samples from p(Z|η, W). [sent-54, score-0.064]
28 The local conditional distribution is: k k k p(zdn = 1|Z¬n , wdn , W¬dn , η) ∝ p(wdn |zdn = 1, Z¬n , W¬dn )eηd ∝ w dn Ck,¬n + βwdn V j=1 j Ck,¬n + V j=1 k βj eηd ,(1) · where C·,¬n indicates that term n is excluded from the corresponding document or topic. [sent-56, score-0.199]
29 2 Sampling Logistic-Normal Parameters When the topic assignments Z are given, we draw samples from the posterior distribution p(η|Z, W) ∝ D d=1 Nd n=1 e d ηz n K j=1 e ηd j N (ηd |µ, Σ), which is a Bayesian logistic regression model 2 with Z as the multinomial observations. [sent-58, score-0.429]
30 Though it is hard to draw samples directly due to nonconjugacy, we can leverage recent advances in data augmentation to solve this inference task efficiently, with analytical local conditionals for Gibbs sampling, as detailed below. [sent-59, score-0.189]
31 Specifically, we have the likelihood of “observing” the topic assignments zd for document d p(zd |ηd ) = Nd n=1 zdn eη d K j=1 j eη d . [sent-60, score-0.563]
32 Λ = Σ−1 is the precision matrix of a d kk kk Gaussian distribution. [sent-63, score-0.044]
33 k This is a posterior distribution of a Bayesian logistic model with a Gaussian prior, where zdn are binary response variables. [sent-64, score-0.306]
34 Due to the non-conjugacy between the normal prior and logistic likelihood, we do not have analytical form of this posterior distribution. [sent-65, score-0.129]
35 , rejection sampling) can be applied, they normally require a good proposal distribution and may have the trouble to deal with accept/reject rates. [sent-68, score-0.058]
36 , [8] presented a two layer data augmentation representation with logistic distributions and [9] applied another data augmentation with uniform variables and truncated Gaussian distributions, which may involve sophisticated accept/reject strategies [14]. [sent-71, score-0.247]
37 Below, we develop a simple exact sampling method without a proposal distribution. [sent-72, score-0.052]
38 Our method is based on a new data augmentation representation, following the recent developments in Bayesian logistic regression [13], which is a direct data augmentation scheme with only one layer of auxiliary variables and does not need to tune in order to get optimal performance. [sent-73, score-0.276]
39 Specifically, for the above posterior inference problem, we can show the following lemma. [sent-74, score-0.075]
40 d d 2 Therefore, we can draw samples from the complete distribution. [sent-78, score-0.064]
41 By discarding the augmented k ¬k variable λk , we get the samples of the posterior distribution p(ηd |ηd , Z, W). [sent-79, score-0.079]
42 To draw samples from the Polya-Gamma distribution, note that a naive implementation of the sampling using the infinite sum-of-Gamma representation is not efficient and it also involves a potentially inaccurate step of truncating the infinite sum. [sent-83, score-0.116]
43 Here we adopt the exact method prok posed in [13], which draws the samples through drawing Nd samples from PG(1, ηd ). [sent-84, score-0.044]
44 1 Due to the independence, we can treat documents separately. [sent-86, score-0.048]
45 3 1 2 3 ηk ∼ P(ηk | ηd¬ k, Z, W ) d d (b) Fully-Bayesian Models Figure 3: (a) frequency of f (z) with z ∼ PG(m, ρ); and (b) frequency of samples k k ¬k from ηd ∼ p(ηd |ηd , Z, W). [sent-91, score-0.068]
46 3, this involves: 1) computing N IW posterior parameters, and 2) sampling from Eq. [sent-104, score-0.092]
47 Finally, the topic assignments zd are conditionally independent given the topic counts Ck . [sent-108, score-0.495]
48 We synchronize Ck globally by leveraging the recent advances on scalable inference of LDA [1, 16], which implemented a general framework to synchronize such counts. [sent-109, score-0.134]
49 To further speed up the inference algorithm, we designed a fast approximate sampling method to draw PG(n, ρ) samples, reducing the time complexity from O(n) in [13] to O(1). [sent-110, score-0.145]
50 However, this sampler can be slow when n is large. [sent-114, score-0.074]
51 For our Gibbs sampler, n is the document length, often around hundreds. [sent-115, score-0.057]
52 Fortunately, an effective approximation can be developed to achieve constant time sampling of PG. [sent-116, score-0.052]
53 If both m and n are large, y and z should be both samples from normal distribution. [sent-121, score-0.043]
54 In practice, we found that even when m = 1, the algorithm k ¬k still can draw good samples from p(ηd |ηd , Z, W) (See Fig. [sent-125, score-0.064]
55 Hence, we are able to speed up the Polya-Gamma sampling process significantly by applying this approximation. [sent-127, score-0.068]
56 4 Furthermore, we can perform sparsity-aware fast sampling [19] in the Gibbs sampler. [sent-129, score-0.052]
57 Specifically, let Ak = V j=1 wdn Ck,¬n j Ck,¬n + βwdn k V j=1 βj eη d , B k = V j=1 j Ck,¬n + k V j=1 βj eηd , then Eq. [sent-130, score-0.107]
58 (1) can be written as k p(zdn = 1|Z¬n , wdn , W¬dn , η) ∝ Ak + Bk . [sent-131, score-0.107]
59 We can show B A that the sampling of zdn can be done by sampling from Mult( ZA ) or Mult( ZB ), due to the fact: k p(zdn = 1|Z¬n , wdn , W¬dn , η) = Bk Ak Bk Ak + = (1 − p) +p , ZA + ZB ZA + ZB ZA ZB (4) where p = ZAZB B . [sent-133, score-0.438]
60 Thus a sample of zdn can be drawn by flipping a coin with probability p being head. [sent-136, score-0.227]
61 If B A it is tail, we draw zdn from Mult( ZA ); otherwise from Mult( ZB ). [sent-137, score-0.269]
62 In fact, A has few non-zero entries due to the sparsity of the topic counts Ck . [sent-139, score-0.216]
63 With the above techniques, the time complexity per document of the Gibbs sampler is O(Nd s(K)) 2 for sampling zd , O(K 2 ) for computing (µk , σk ), and O(SK) for sampling ηd with Eq. [sent-144, score-0.274]
64 (2), d k where S is the number of sub-burn-in steps over sampling ηd . [sent-145, score-0.052]
65 5 Experiments We now present qualitative and quantitative evaluation to demonstrate the efficacy and scalability of the Gibbs sampler for CTM (denoted by gCTM). [sent-147, score-0.111]
66 Data Sets: Experiments are conducted on several benchmark data sets, including NIPS paper abstracts, 20Newsgroups, and NYTimes (New York Times) corpora from [2] and the Wikipedia corpus from [20]. [sent-159, score-0.057]
67 Following the settings in [3], we partition each document in the testing set into an observed part and a held-out part. [sent-161, score-0.057]
68 1 Qualitative Evaluation We first examine the correlation structure of 1,000 topics learned by CTM using our scalable sampler on the NYTimes corpus with 285,000 documents. [sent-163, score-0.32]
69 Since the entire correlation graph is too large, we build a 3-layer hierarchy by clustering the learned topics, with their learned correlation strength as the similarity measure. [sent-164, score-0.162]
70 4 shows a part of the hierarchy2 , where the subgraph A represents the top layer with 10 clusters. [sent-166, score-0.044]
71 The subgraphs B and C are two second layer clusters; and D and E are two correlation subgraphs consisting of leaf nodes (i. [sent-167, score-0.152]
72 To represent their semantic meanings, we present 4 most frequent words for each topic; and for each topic cluster, we also show most frequent words by building a hyper-topic that aggregates all the included topics. [sent-170, score-0.274]
73 Clearly, we can see that many topics have strong correlations and the structure is useful to help humans understand/browse the large collection of topics. [sent-172, score-0.084]
74 With 40 machines, our parallel Gibbs sampler finishes the training in 2 hours, which means that we are able to process real world corpus in considerable speed. [sent-173, score-0.184]
75 2 The entire correlation graph can be found on http://ml-thu. [sent-175, score-0.064]
76 net/˜scalable-ctm 5 denotes the number of topics a cluster contains. [sent-176, score-0.1]
77 113 113 B 47 82 31 130 4 5 A 6 12 6 314 248 22 27 D 41 C 48 42 17 13 4 E 12 17 6 7 4 12 7 5 4 3 3 3 Figure 4: A hierarchical visualization of the correlation graph with 1,000 topics learned from 285,000 articles of the NYTimes. [sent-177, score-0.165]
78 A denotes the top-layer subgraph with 10 big clusters; B and C denote two second-layer clusters; and D and E are two subgraphs with leaf nodes (i. [sent-178, score-0.046]
79 We present most frequent words of each topic cluster. [sent-181, score-0.236]
80 Edges denote a correlation (above some threshold) and the distance between two nodes represents the strength of their correlation. [sent-182, score-0.064]
81 The node size of a cluster is determined by the number of topics included in that cluster. [sent-183, score-0.1]
82 6 1800 4500 4000 perplexity 2000 time (s) perplexity 10 vCTM gCTM (M=1, P=1) gCTM (M=1, P=12) 2 10 vCTM gCTM (M=1, P=1) gCTM (M=1, P=12) 1600 1400 0 20 40 60 K 80 100 (a) 10 20 40 60 K 80 6 10 gCTM (M=1, P=12) gCTM (M=40, P=480) Y! [sent-184, score-0.292]
83 9 min of vCTM and gCTM on different sized 20NG 11K 200 16 hr 9 min data sets. [sent-203, score-0.047]
84 5 hr NYTimes 285K 400 immediately becomes impractical when 17 hr Wiki 6M 1000 N/A* the data size reaches 285K, while by uti*not finished within 1 week. [sent-205, score-0.094]
85 5(c)&(d) show the results on the NYTimes corpus, which contains over 285K training documents and cannot be handled well by non-parallel methods. [sent-211, score-0.065]
86 We can see that: 1) both versions of gCTM obtain comparable perplexity to Y! [sent-214, score-0.146]
87 LDA is faster than gCTM because of the model difference — LDA does not learn correlation structure among topics. [sent-218, score-0.064]
88 Our main observations are twofold: 1) despite various S, all versions of gCTMs reach a similar level of perplexity that is better than vCTM; and 2) a moderate number of sub-iterations, e. [sent-225, score-0.146]
89 We adopt Cauchy’s criterion [15] for convergence: given an ǫ > 0, an algorithm converges at iteration T if ∀i, j ≥ T, |Perp i − Perp j | < ǫ, where Perp i and Perp j are perplexity at iteration i and j respectively. [sent-229, score-0.146]
90 We can see that for both K = 50 and K = 100, the perplexity is invariant under a wide range of prior settings. [sent-237, score-0.175]
91 6 Conclusions and Discussions We present a scalable Gibbs sampling algorithm for logistic-normal topic models. [sent-253, score-0.317]
92 Our method builds on a novel data augmentation formulation and addresses the non-conjugacy without making strict mean-field assumptions. [sent-254, score-0.09]
93 The algorithm is naturally parallelizable and can be further boosted by approximate sampling techniques. [sent-255, score-0.052]
94 We are also interested in developing scalable sampling algorithms of other logistic-normal topic models, e. [sent-259, score-0.317]
95 Finally, the fast sampler of Poly-Gamma distributions can be used in relational and supervised topic models [6, 21]. [sent-262, score-0.322]
96 Bayesian auxiliary variable models for binary and multinomial regression. [sent-314, score-0.091]
97 Gibbs sampling for logistic normal topic models with graph-based priors. [sent-320, score-0.344]
98 The discrete infinite logistic normal distribution for mixedmembership modeling. [sent-333, score-0.06]
99 Bayesian inference for logistic models using PolyaGamma latent variables. [sent-349, score-0.111]
100 Efficient methods for topic model inference on streaming document collections. [sent-383, score-0.308]
wordName wordTfidf (topN-words)
[('gctm', 0.759), ('vctm', 0.263), ('zdn', 0.227), ('topic', 0.216), ('perplexity', 0.146), ('pg', 0.145), ('ctm', 0.139), ('lda', 0.12), ('wdn', 0.107), ('mult', 0.097), ('ck', 0.091), ('augmentation', 0.09), ('iw', 0.088), ('topics', 0.084), ('gibbs', 0.079), ('zb', 0.075), ('sampler', 0.074), ('za', 0.07), ('nytimes', 0.068), ('correlation', 0.064), ('document', 0.057), ('perp', 0.055), ('sampling', 0.052), ('tsinghua', 0.05), ('scalable', 0.049), ('documents', 0.048), ('hr', 0.047), ('multinomial', 0.046), ('draw', 0.042), ('cd', 0.041), ('posterior', 0.04), ('logistic', 0.039), ('zd', 0.039), ('dirichlet', 0.037), ('scalability', 0.037), ('normally', 0.036), ('polson', 0.036), ('dn', 0.035), ('inference', 0.035), ('ak', 0.034), ('ar', 0.033), ('corpus', 0.032), ('ctms', 0.031), ('logisticnormal', 0.031), ('nonconjugacy', 0.031), ('subgraphs', 0.03), ('bk', 0.03), ('auxiliary', 0.029), ('structures', 0.029), ('prior', 0.029), ('parallel', 0.028), ('layer', 0.028), ('distributively', 0.027), ('bayesian', 0.026), ('blei', 0.026), ('synchronize', 0.025), ('variational', 0.025), ('corpora', 0.025), ('zhu', 0.024), ('china', 0.024), ('assignments', 0.024), ('frequency', 0.023), ('kk', 0.022), ('deal', 0.022), ('millions', 0.022), ('samples', 0.022), ('admixture', 0.022), ('normal', 0.021), ('latent', 0.021), ('mimno', 0.021), ('nd', 0.02), ('frequent', 0.02), ('cing', 0.02), ('conjugate', 0.02), ('thousands', 0.02), ('zheng', 0.019), ('machines', 0.019), ('clusters', 0.019), ('sacri', 0.019), ('sk', 0.019), ('semantic', 0.018), ('nips', 0.018), ('dim', 0.017), ('lab', 0.017), ('wd', 0.017), ('wikipedia', 0.017), ('training', 0.017), ('learned', 0.017), ('considerable', 0.017), ('augmented', 0.017), ('eld', 0.017), ('dir', 0.017), ('world', 0.016), ('cluster', 0.016), ('relational', 0.016), ('speed', 0.016), ('dynamic', 0.016), ('national', 0.016), ('subgraph', 0.016), ('models', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
2 0.12395062 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ 1=ݐ 2=ݐ 1=ݐ 2=ݐ 3=ݐ 1=ݐ 2=ݐ ݐௗ௦ 2=ݏ 1=ݏ ܵ = ݏଵ 3=ݏ 2=ݏ 1=ݏ ݀=1 ݇ௗ௦ ܵ = ݏଶ ܵ = ݏ ݀=2 ߶ଵ ߟଵ ݀=ܦ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 1=ݏ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) = 1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x } Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝ L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1 γl∗ −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . The results are unsuprising, although the lexical regression for the Congressional debates is less clear-cut than other 10 All of the nodes at the second level have slightly negative values for the regression parameters mainly due to the very skewed distribution of the review ratings in Amazon. 7 datasets. As we saw in Section 5, for similar datasets, S H L DA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents. Dataset Floor Debates Amazon Reviews Movie Reviews Top 10 words with positive weights bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability Top 10 words with negative weights bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring Table 3: Top words based on the global lexical regression coefficient, τ . For the floor debates, positive τ ’s are Republican-leaning while negative τ ’s are Democrat-leaning. 7 Related Work S H L DA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA , which jointly models documents’ hierarchy of labels and topics. HSLDA ’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike S H L DA’s continuous response variable. Finally, another body related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words. 8 Conclusion We have introduced S H L DA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by S H L DA with political science collaborators and more generally to study how S H L DA and related models can help analyze and discover useful insights from political discourse. Acknowledgments This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (BoydGraber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002. [2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001. [3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008. [4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003. [5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010. [6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012. [7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010. [8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006. [9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012. [10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013. [11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012. [12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007. [13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989. [14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006. [15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004. [16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008. [17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005. [18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999. [19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008. [21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009. [22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011. [23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012. [24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008. [26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011. [27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010. [28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011. [29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011. [30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008. [31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008. [32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009. [33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010. [34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013. 9
3 0.11608213 301 nips-2013-Sparse Additive Text Models with Low Rank Background
Author: Lei Shi
Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1
Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade
Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.
5 0.096181333 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
6 0.094356015 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
7 0.093763962 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
8 0.092680812 317 nips-2013-Streaming Variational Bayes
9 0.086322129 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
10 0.079047799 70 nips-2013-Contrastive Learning Using Spectral Methods
11 0.077991292 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
12 0.075641364 98 nips-2013-Documents as multiple overlapping windows into grids of counts
13 0.065571941 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
14 0.062755518 215 nips-2013-On Decomposing the Proximal Map
15 0.059823152 88 nips-2013-Designed Measurements for Vector Count Data
16 0.058685187 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
17 0.055720929 5 nips-2013-A Deep Architecture for Matching Short Texts
18 0.055251207 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
19 0.054891441 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
20 0.054176535 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
topicId topicWeight
[(0, 0.119), (1, 0.053), (2, -0.049), (3, 0.008), (4, 0.048), (5, 0.058), (6, 0.123), (7, 0.019), (8, 0.141), (9, -0.04), (10, 0.012), (11, 0.101), (12, 0.002), (13, -0.022), (14, -0.003), (15, -0.093), (16, 0.159), (17, -0.039), (18, 0.001), (19, -0.026), (20, -0.056), (21, -0.103), (22, 0.042), (23, 0.054), (24, 0.038), (25, -0.131), (26, 0.093), (27, -0.045), (28, -0.116), (29, 0.036), (30, 0.037), (31, 0.037), (32, -0.042), (33, -0.038), (34, 0.045), (35, -0.095), (36, 0.085), (37, 0.014), (38, 0.052), (39, -0.095), (40, 0.012), (41, -0.076), (42, -0.038), (43, 0.078), (44, 0.01), (45, 0.043), (46, 0.053), (47, -0.0), (48, -0.048), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.92270315 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
2 0.77440387 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ 1=ݐ 2=ݐ 1=ݐ 2=ݐ 3=ݐ 1=ݐ 2=ݐ ݐௗ௦ 2=ݏ 1=ݏ ܵ = ݏଵ 3=ݏ 2=ݏ 1=ݏ ݀=1 ݇ௗ௦ ܵ = ݏଶ ܵ = ݏ ݀=2 ߶ଵ ߟଵ ݀=ܦ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 1=ݏ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) = 1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x } Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝ L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1 γl∗ −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . The results are unsuprising, although the lexical regression for the Congressional debates is less clear-cut than other 10 All of the nodes at the second level have slightly negative values for the regression parameters mainly due to the very skewed distribution of the review ratings in Amazon. 7 datasets. As we saw in Section 5, for similar datasets, S H L DA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents. Dataset Floor Debates Amazon Reviews Movie Reviews Top 10 words with positive weights bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability Top 10 words with negative weights bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring Table 3: Top words based on the global lexical regression coefficient, τ . For the floor debates, positive τ ’s are Republican-leaning while negative τ ’s are Democrat-leaning. 7 Related Work S H L DA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA , which jointly models documents’ hierarchy of labels and topics. HSLDA ’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike S H L DA’s continuous response variable. Finally, another body related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words. 8 Conclusion We have introduced S H L DA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by S H L DA with political science collaborators and more generally to study how S H L DA and related models can help analyze and discover useful insights from political discourse. Acknowledgments This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (BoydGraber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002. [2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001. [3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008. [4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003. [5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010. [6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012. [7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010. [8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006. [9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012. [10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013. [11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012. [12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007. [13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989. [14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006. [15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004. [16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008. [17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005. [18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999. [19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008. [21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009. [22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011. [23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012. [24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008. [26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011. [27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010. [28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011. [29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011. [30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008. [31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008. [32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009. [33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010. [34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013. 9
3 0.71075404 301 nips-2013-Sparse Additive Text Models with Low Rank Background
Author: Lei Shi
Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1
4 0.70834184 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
Author: Sam Patterson, Yee Whye Teh
Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1
5 0.6449126 70 nips-2013-Contrastive Learning Using Spectral Methods
Author: James Y. Zou, Daniel Hsu, David C. Parkes, Ryan P. Adams
Abstract: In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis. 1
7 0.59663814 98 nips-2013-Documents as multiple overlapping windows into grids of counts
8 0.59244841 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
9 0.57789284 317 nips-2013-Streaming Variational Bayes
10 0.53956872 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
11 0.49718446 88 nips-2013-Designed Measurements for Vector Count Data
12 0.49153998 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
13 0.46075243 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
14 0.43926707 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
15 0.42956179 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
16 0.41739666 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
17 0.41119668 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
18 0.3805249 86 nips-2013-Demixing odors - fast inference in olfaction
19 0.37438083 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
20 0.36389101 160 nips-2013-Learning Stochastic Feedforward Neural Networks
topicId topicWeight
[(16, 0.077), (33, 0.127), (34, 0.092), (36, 0.013), (41, 0.019), (49, 0.083), (56, 0.072), (60, 0.232), (70, 0.014), (85, 0.041), (89, 0.024), (93, 0.095), (95, 0.011)]
simIndex simValue paperId paperTitle
1 0.82488579 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, stateof-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n4 ), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n2 (m + log n + δ 2 + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
same-paper 2 0.76921868 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
3 0.67629749 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
Author: Shahin Shahrampour, Sasha Rakhlin, Ali Jadbabaie
Abstract: This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time. 1
4 0.66602755 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
5 0.6473012 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
Author: Nikhil Rao, Christopher Cox, Rob Nowak, Timothy T. Rogers
Abstract: Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multisubject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso. 1
6 0.64392704 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
7 0.64227444 121 nips-2013-Firing rate predictions in optimal balanced networks
8 0.63956612 5 nips-2013-A Deep Architecture for Matching Short Texts
9 0.63955653 64 nips-2013-Compete to Compute
10 0.63815349 99 nips-2013-Dropout Training as Adaptive Regularization
11 0.63531977 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
12 0.63380933 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
13 0.63176531 251 nips-2013-Predicting Parameters in Deep Learning
14 0.63149303 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
15 0.63144153 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
16 0.630265 301 nips-2013-Sparse Additive Text Models with Low Rank Background
17 0.63019496 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
18 0.62971377 183 nips-2013-Mapping paradigm ontologies to and from the brain
19 0.62968278 201 nips-2013-Multi-Task Bayesian Optimization
20 0.62810016 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation