nips nips2013 nips2013-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. [sent-2, score-0.106]
2 The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. [sent-3, score-1.046]
3 In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. [sent-6, score-0.641]
4 We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. [sent-7, score-0.106]
5 1 Introduction A collection of documents, each consisting of a disorganized bag of words, is often modeled compactly using mixture or admixture models, such as Latent Semantic Analysis (LSA) [4] and Latent Dirichlet Allocation (LDA) [1]. [sent-8, score-0.242]
6 The data is represented by a small number of semantically tight topics, and a document is assumed to have a mix of words from an even smaller subset of these topics. [sent-9, score-0.204]
7 There are no strong constraints on how the topics are mixed [5]. [sent-10, score-0.115]
8 Recently, an orthogonal approach emerged: it has been observed that for many text corpora documents evolve into one another in a smooth way, with some words dropping and new ones being introduced. [sent-11, score-0.189]
9 By using large windows to collate many grid distributions from a large grid, the CG model can be a very large mixture without overtraining, as these distributions are highly correlated. [sent-13, score-0.433]
10 The LDA model does not have this benefit, and thus has to deal with a smaller number of topics to avoid overtraining. [sent-14, score-0.115]
11 In Fig. 1a we show an excerpt of a grid learned from cooking recipes from around the world. [sent-16, score-0.474]
12 Each position in the grid is characterized by a distribution over the words in a vocabulary, and for each position we show the 3 words with the highest probability, whenever they exceed a threshold. [sent-17, score-0.538]
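The per-cell word display described above can be reproduced directly from a learned grid. The sketch below is a minimal illustration, assuming the grid is stored as a NumPy array pi of shape (E1, E2, Z) normalized over the vocabulary axis; the names pi and vocab and the threshold value are illustrative, not taken from the paper.

```python
import numpy as np

def top_words_per_cell(pi, vocab, k=3, threshold=0.02):
    """For each grid position, return up to k words whose probability exceeds
    `threshold`. `pi` has shape (E1, E2, Z) and sums to 1 over the last axis;
    `vocab` maps word indices to strings."""
    E1, E2, Z = pi.shape
    cells = {}
    for i in range(E1):
        for j in range(E2):
            order = np.argsort(pi[i, j])[::-1][:k]  # k most probable words in cell (i, j)
            cells[(i, j)] = [vocab[z] for z in order if pi[i, j, z] > threshold]
    return cells

# Toy usage with a random (normalized) grid
rng = np.random.default_rng(0)
pi = rng.random((4, 4, 10))
pi /= pi.sum(axis=-1, keepdims=True)
vocab = [f"w{z}" for z in range(10)]
print(top_words_per_cell(pi, vocab)[(0, 0)])
```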
13 The shaded positions are characterized by the presence, with a non-zero probability, of the word “bake”1 . [sent-18, score-0.119]
14 On the grid we also show the windows W of size 4 × 5 for 5 recipes. [sent-19, score-0.433]
15 Nomi (1), an Afghan egg-based bread, is close to the recipe of the usual pugliese bread (2), as indeed they share most of the ingredients and procedure and their windows largely overlap. [sent-20, score-0.418]
16 Note how moving from (1) to (2) the word “egg” is dropped. [sent-21, score-0.458]
(Footnote 1: which may or may not be in the top three.)
17 Figure 1: a) A close-up of an E = 30 × 30 componential counting grid πi learned over a corpus of recipes. [sent-30, score-0.815]
18 In each cell we show the 0-3 most probable words whose probability exceeds a threshold. [sent-31, score-0.083]
19 For each component c, placed at position j, we show the words generated in each window, Σj∈Wi πj(z). [sent-38, score-0.289]
20 Moving to the right we encounter the basic pizza (3), whose dough is very similar to the bread’s. [sent-39, score-0.183]
21 Continuing to the right, words often associated with desserts, like sugar, almond, etc., emerge. [sent-40, score-0.133]
22 Finally, further up, we encounter other desserts which do not require baking, like tiramisu (5) or chocolate crepes. [sent-42, score-0.1]
23 This is an example of a “topical shift”; others appear in different portions of the full grid, which is included in the additional material. [sent-43, score-0.256]
24 The major drawback of counting grids is that they are essentially a mixture model, assuming only one source for all features in the bag; moreover, the topology of the space highly constrains the document mappings, resulting in local minima or suboptimal grids. [sent-44, score-0.584]
25 For example, more structured recipes like Grecian Chicken Gyros Pizza or Tex-Mex pizza would have very low likelihood, as words related to meat, which is abundant in both, are hard to generate in the baking area where these recipes would naturally go. [sent-45, score-0.553]
26 As a first contribution, we extend the counting grid model so that each document can be represented by multiple latent windows, rather than just one. [sent-46, score-0.602]
27 In this way, we create a substantially more flexible admixture model, the componential counting grid (CCG), which becomes a direct generalization of LDA as it does allow multiple sources (e. [sent-47, score-0.894]
28 But the equivalent of LDA topics are windows in a counting grid, which allows the model to have a very large number of topics that are highly related, as a shift in the grid only slightly refines any topic. [sent-50, score-0.908]
29 Starting from the same grid just described, we recomputed the mapping of each recipe, which can now be described by multiple windows if needed. [sent-51, score-0.328]
30 Another example is the Caesar salad, which has a component in the salad/vegetable area and borrows the croutons from the bread area. [sent-56, score-0.131]
31 In Fig. 1b, one can also notice how the embedding produced by CCGs yields a similarity measure based on the grid usage of each sample. [sent-58, score-0.356]
32 For example, words relating to the three pizzas are generated from windows that overlap; therefore they share word usage and thus they are “similar”. [sent-59, score-0.378]
33 As a second contribution, we exploited this fact to define a novel generative kernel, which largely outperformed similar classification strategies based on LDA’s topic usage [1, 2]. [sent-60, score-0.157]
34 We evaluated componential counting grids, and in particular the kernel, on the 20-Newsgroup dataset [6], on a novel dataset of recipes which we will make available to the community, and on the recent “Wikipedia picture of the day” dataset [7]. [sent-61, score-0.9]
35 2 Counting Grids and Componential Counting Grids The basic Counting Grid πi is a set of distributions over the vocabulary on the N-dimensional discrete grid indexed by i, where each id ∈ [1 . [sent-64, score-0.693]
36 . . Ed] and E describes the extent of the counting grid in d dimensions. [sent-67, score-0.501]
37 The index z indexes a particular word in the vocabulary, z ∈ [1 . [sent-68, score-0.311]
38 . . Z]; for example, πi(“Pizza”) is the probability of the word “Pizza” at the location i. [sent-72, score-0.172]
39 Since π is a grid of distributions, Σz πi(z) = 1 everywhere on the grid. [sent-73, score-0.457]
40 Each bag of words t is represented by a list of words {wnt} and each word wnt takes a value between 1 and Z. [sent-74, score-0.799]
41 Counting Grids assume that each bag follows a word distribution found somewhere in the counting grid; in particular, using windows of dimensions W, a bag can be generated by first averaging all counts in the window Wi (starting at grid location i and extending in each direction d by Wd grid positions) to form the histogram hi(z) = (1/∏d Wd) Σj∈Wi πj(z), and then generating a set of features in the bag (see Fig. 2). [sent-76, score-0.186] [sent-79, score-0.693]
Figure 2: a) Plate notation representing the CCG model. b) CCG generative process for one word: pick a window W from θ, pick a location within the window W, pick a word. c) Illustration of U^W and Λ^W relative to the particular θ shown in plate b). [sent-77, score-0.76] [sent-78, score-0.17]
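The generative process summarized in Figure 2b (pick a window location from θ, pick a cell inside the window, pick a word from that cell's distribution) can be sketched as follows. This is a toy illustration under stated assumptions: the grid is treated as toroidal so windows wrap around, positions inside a window are chosen uniformly, and all array and function names are hypothetical.

```python
import numpy as np

def sample_ccg_document(pi, theta, W, n_words, rng):
    """Sample one bag of words from a componential counting grid.
    pi:    (E1, E2, Z) word distributions, one per grid cell.
    theta: (E1, E2) document-specific distribution over window locations.
    W:     (W1, W2) window extent; positions inside a window are uniform.
    """
    E1, E2, Z = pi.shape
    flat_theta = theta.ravel() / theta.sum()
    words = []
    for _ in range(n_words):
        l = rng.choice(E1 * E2, p=flat_theta)       # pick a window location l from theta
        l1, l2 = divmod(l, E2)
        k1 = (l1 + rng.integers(W[0])) % E1         # pick a cell k inside the window (toroidal wrap)
        k2 = (l2 + rng.integers(W[1])) % E2
        words.append(rng.choice(Z, p=pi[k1, k2]))   # pick a word from pi_k
    return words

rng = np.random.default_rng(0)
pi = rng.random((6, 6, 20)); pi /= pi.sum(-1, keepdims=True)
theta = rng.random((6, 6))
print(sample_ccg_document(pi, theta, (2, 2), n_words=5, rng=rng))
```

With W = (1, 1) the cell always coincides with the window location, which is the LDA limit mentioned later in the text.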
45 The ratio of the two volumes, |E|/|W|, is called the capacity of the model in terms of an equivalent number of topics, as this is how many non-overlapping windows can be fit onto the grid. [sent-82, score-0.213]
46 Finally, with Wi we indicate the particular window placed at location i. [sent-83, score-0.223]
47 Componential Counting Grids As seen in the previous section, counting grids generate words from a distribution in a window W , placed at location i in the grid. [sent-84, score-0.732]
48 Windows close in the grid generate similar features because they share many cells: As we move the window on the grid, some new features appear while others are dropped. [sent-85, score-0.426]
49 On the other hand, componential models like [1] represent the standard way of modeling text corpora. [sent-86, score-0.362]
50 Componential counting grids get the best of both worlds: being based on the counting grid geometry, they capture smooth shifts of topics, while their componential nature allows documents to be generated by several windows (akin to LDA’s topics). [sent-88, score-1.476]
51 The number of windows need not be specified a-priori. [sent-89, score-0.177]
52 It is worth noticing that when W = 1 × 1, ln = kn and the model becomes Latent Dirichlet Allocation. [sent-95, score-0.349]
53 Finally, p(ln|θ) = θ(ln) is the prior distribution over the window locations, and p(θ|α) = Dir(θ; α) is a Dirichlet distribution with parameters α. [sent-99, score-0.177]
54 log P ≥ F = Σt ( Σn Σln,kn qt(kn) · qt(ln) · log( πkn(wn) · U^W(kn − ln) · θln · p(θ|α) ) + H(qt) ), where H(q) is the entropy of the distribution q. [sent-103, score-0.296]
55 Eq. 10 highlights further relationships between CCGs and LDA: CCGs can be thought of as an LDA model whose topics live on the space defined by the counting grid geometry. [sent-115, score-0.541]
56 The new updates for the cell distribution q(k) and the window distribution q(l) require only a single convolution and, more importantly, they don’t directly depend on each other. [sent-116, score-0.17]
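The observation that the updates need only a single convolution follows from the fact that summing over a window is a convolution with the uniform kernel U^W. A rough sketch of that window-sum operation, and of the counting-grid histogram hi(z) built from it, is given below; the toroidal wrap-around and the function names are assumptions of this illustration, not the paper's implementation.

```python
import numpy as np

def window_sum(grid, W):
    """Sum of `grid` over a W[0] x W[1] window anchored at each location,
    with toroidal wrap-around: the convolution with the uniform window
    kernel U^W that the q(k)/q(l) updates rely on."""
    out = np.zeros_like(grid, dtype=float)
    for d1 in range(W[0]):
        for d2 in range(W[1]):
            out += np.roll(np.roll(grid, -d1, axis=0), -d2, axis=1)
    return out

def window_histograms(pi, W):
    """Counting-grid histograms h_i(z) = (1 / prod_d W_d) * sum_{j in W_i} pi_j(z),
    computed per word by the same window-sum operation."""
    h = np.stack([window_sum(pi[:, :, z], W) for z in range(pi.shape[2])], axis=-1)
    return h / (W[0] * W[1])
```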
57 [12] worked on sources positioned in a plane at real-valued locations, with the idea that sources within a radius would be combined to produce topics in an LDA-like model. [sent-121, score-0.205]
58 They used an expensive sampling algorithm that aimed at moving the sources in the plane and determining the circular window size. [sent-122, score-0.215]
59 The grid placement of sources of CCG yields much more efficient algorithms and denser packing. [sent-123, score-0.301]
60 2.1 A Kernel based on CCG embedding Hybrid generative-discriminative classification paradigms have been shown to be a practical and effective way to get the best of both worlds in approaching classification [13–15]. [sent-125, score-0.139]
61 In the context of topic models a simple but effective kernel is defined as the product of the topic proportions of each document. [sent-126, score-0.21]
62 This kernel measures the similarity between topic usage of each sample and it proved to be effective on several tasks [15–17]. [sent-127, score-0.163]
63 We observe, in fact, that by construction each point in the grid depends on its neighborhood, defined by W, and this information is not captured using θ, but using Λ^W_θ, which is defined by spreading θ in the appropriate window (Eq.
64 More formally, given two samples t and u, we define a kernel based on the CCG embedding as K(t, u) = Σi S( Λ^W_θt(i), Λ^W_θu(i) ), where Λ^W_θ(i) = Σj U^W(i − j) · θ(j) (11), and S(·, ·) is any similarity measure which defines a kernel. [sent-131, score-0.111]
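A sketch of the kernel in Eq. 11: each document's window-location distribution θ is spread over its window to obtain Λ^W_θ, and the kernel sums a per-cell similarity S of the two spread maps. Histogram intersection (element-wise minimum) is used here only as one possible choice of S; the function names and the toroidal treatment of the grid are assumptions of this sketch.

```python
import numpy as np

def spread_theta(theta, W):
    """Lambda^W_theta(i) = sum_j U^W(i - j) * theta(j): spread the window-location
    distribution theta uniformly over each W[0] x W[1] window (toroidal grid)."""
    out = np.zeros_like(theta, dtype=float)
    for d1 in range(W[0]):
        for d2 in range(W[1]):
            out += np.roll(np.roll(theta, d1, axis=0), d2, axis=1)
    return out / (W[0] * W[1])

def ccg_kernel(theta_t, theta_u, W, S=np.minimum):
    """K(t, u) = sum_i S(Lambda_t(i), Lambda_u(i)); S defaults to histogram
    intersection (element-wise min), one valid similarity choice."""
    lam_t, lam_u = spread_theta(theta_t, W), spread_theta(theta_u, W)
    return S(lam_t, lam_u).sum()

rng = np.random.default_rng(1)
a = rng.random((6, 6)); a /= a.sum()
b = rng.random((6, 6)); b /= b.sum()
print(ccg_kernel(a, b, (2, 2)))
```

Because Λ^W_θ is a proper distribution over grid cells, any kernel that compares histograms (linear, intersection, Bhattacharyya, etc.) can be plugged in as S.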
65 Finally, as both its parameters and the latent variables live in a compact space of dimensionality and size chosen by the user, our learning algorithm can be evaluated as an embedding method that lends itself to data visualization applications. [sent-136, score-0.1]
66 In all the tests we considered square grids of size E = [40 × 40, 50 × 50, . [sent-138, score-0.181]
67 . . , 90 × 90] and windows of size W = [2 × 2, 4 × 4, . [sent-141, score-0.177]
68 We represented the grid size E using gray levels (see the text). [sent-161, score-0.256]
69 Document Classification We compared componential counting grids (CCGs) with counting grids [3] (CGs), latent Dirichlet allocation [1] (LDA) and the spherical admixture model [2] (SAM), following the validation paradigm previously used in [2, 3]. [sent-164, score-1.235]
70 Each data sample consists of a bag of words and a label. [sent-165, score-0.175]
71 The bags were used without labels to train a model that captures covariation in word occurrences, with CGs mostly modeling thematic shifts, LDA and SAM modeling topic mixing, and CCGs both aspects. [sent-166, score-0.268]
72 (Eq. 12), which for LDA reduces to using a linear kernel on the topic proportions. [sent-168, score-0.128]
73 , θ) and for W > [1 × 1] can be thought of as an LDA model whose topics, hi, are much finer, as they are computed from overlapping windows (see also Eq. [sent-176, score-0.211]
74 The same capacity can be obtained with different choices of E and W; therefore we represented the grid size E using gray levels: the lighter the marker, the bigger the grid. [sent-198, score-0.256]
75 The capacity is roughly equivalent to the number of LDA topics, as it represents the number of independent windows that can be fit in the grid, and we compared with LDA using this parallelism [18]. [sent-199, score-0.584]
76 Componential counting grids outperform Latent Dirichlet Allocation across the whole spectrum, and the accuracy rises regularly with the capacity, independently of the grid size3 . [sent-200, score-0.426]
77 As visible, CCG outperforms the other models, with a larger margin on the more challenging “same” and “similar” datasets, where we would indeed expect that quilting the topics to capture fine-grained similarities and differences would be most helpful. [sent-220, score-0.115]
78 Then we extracted the words of each recipe from its ingredients and cooking instructions, and we used the origin of the recipe to divide the dataset into 15 classes4 . [sent-222, score-0.251]
79 The resulting dataset has a vocabulary size of 12538 unique words and a total of ∼1M tokens. [sent-223, score-0.127]
80 To classify the recipes we used 10-fold cross-validation with 5 repetitions, picking 80 random recipes per class for each repetition. [sent-224, score-0.353]
81 As for the previous test, CCG classification accuracy grows regularly with the capacity, independently of the grid size E. [sent-227, score-0.256]
82 Componential models (i.e., LDA and CCGs) performed significantly better, as to correctly classify the origin of a recipe, spice palettes, cooking style and procedures must be identified. [sent-230, score-0.124]
83 For example, while most Asian cuisines use similar ingredients and cooking procedures, they definitely have different spice palettes. [sent-231, score-0.195]
84 Counting Grids, being mixtures, cannot capture this, as they map a recipe to a single location, which heavily depends on the ingredients used. [sent-232, score-0.163]
85 Although we used this simple procedure without directly training a multimodal model, CCGs outperform LDA, CorrLDA [19] and the multimodal document random field model presented in [7], and set a new state of the art. [sent-243, score-0.146]
86 4b, relative to the extract of the counting grid shown in Fig. [sent-254, score-0.501]
87 To define the importance of each word in each position we weighted πi(z) with the inverse document frequency. [sent-257, score-0.221]
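A possible way to implement this idf weighting, assuming a document-by-word count matrix is available; the particular idf definition log(D/df) is a standard one and is an assumption here, since the paper does not spell out its exact form.

```python
import numpy as np

def idf_weighted_grid(pi, doc_word_counts):
    """Weight pi_i(z) by idf(z) = log(D / df(z)), where df(z) is the number of
    documents containing word z. doc_word_counts: (D, Z) count matrix;
    pi: (E1, E2, Z) grid of word distributions."""
    D = doc_word_counts.shape[0]
    df = np.maximum((doc_word_counts > 0).sum(axis=0), 1)  # avoid division by zero
    idf = np.log(D / df)
    return pi * idf  # broadcasts idf over the two grid dimensions
```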
88 A user can zoom in to see the content of particular cells/areas, until he reaches the highest level of zoom when most of the words generated in a position are visible, Fig. [sent-260, score-0.221]
89 We also propose a simple search strategy: once a keyword ẑ is selected, each word z in each position j is weighted with word- and position-dependent weights. [sent-262, score-0.274]
90 The first is equal to 1 if z co-occurs with ẑ in some document, and 0 otherwise, while the latter is the sum of πi(ẑ) over all positions i for which there exists a window Wk that contains both i and j. [sent-263, score-0.17]
91 As a result, this strategy highlights some areas and words related to ẑ on the grid, and in each area words related (similar in topic) to ẑ appear. [sent-265, score-0.339]
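The two weights just described (a 0/1 co-occurrence indicator with the query word ẑ, and a position weight that accumulates πi(ẑ) over every cell i sharing a window with position j) could be computed as in the sketch below; the toroidal neighborhood and all names are assumptions of this illustration.

```python
import numpy as np

def search_weights(pi, query_z, cooccur, W):
    """Return an (E1, E2, Z) array of weights for highlighting words on the grid.
    cooccur[z] is 1 if word z co-occurs with the query word in some document, else 0.
    The position weight at j sums pi_i(query_z) over every cell i that shares at
    least one W-sized window with j (toroidal grid)."""
    E1, E2, Z = pi.shape
    pos = np.zeros((E1, E2))
    r1, r2 = W[0] - 1, W[1] - 1          # i and j share a window iff |i - j| <= W - 1 per axis
    for d1 in range(-r1, r1 + 1):
        for d2 in range(-r2, r2 + 1):
            pos += np.roll(np.roll(pi[:, :, query_z], d1, axis=0), d2, axis=1)
    return pos[:, :, None] * cooccur[None, None, :]
```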
92 Fig. 5 shows the result of the search for ẑ = “fry”: the general frying area is well separated from “deep frying” and “stir frying”. Figure 5: Search result for the word “fry” (the figure labels areas such as DEEP FRY and STIR FRY). [sent-268, score-0.185]
93 Presenting search results as islands on a 2-dimensional grid apparently improves on the standard strategy, a linear list of hits, in which recipes relative to the three frying styles would have been mixed, even though tempura has little to do with pan-fried noodles. [sent-270, score-0.226]
94 4 Conclusion In this paper we presented the componential counting grid model – which bridges the topic model and counting grid worlds – together with a similarity measure based on it. [sent-271, score-1.432]
95 For every restart, the grids qualitatively always appeared very similar, and some of the more salient similarity relationships were captured by all the runs. [sent-276, score-0.181]
96 The word embedding produced by CCG also has advantages w.r.t. previous co-occurrence embedding methods. [sent-277, score-0.184]
97 Moreover, [21, 23] only embed documents or words, while CG/CCGs provide both embeddings. [sent-282, score-0.141]
98 Finally as opposed to previous co-occurrence embedding methods that consider all pairs of words, our representation naturally captures the same word appearing in multiple locations where it has a different meaning based on context. [sent-283, score-0.226]
99 The word “memory” in the Science magazine corpus is a striking example (memory in neuroscience, memory in electronic devices, immunologic memory). [sent-284, score-0.119]
100 : Multidimensional counting grids: Inferring word order from disordered bags of words. [sent-298, score-0.431]
wordName wordTfidf (topN-words)
[('componential', 0.314), ('grid', 0.256), ('ccg', 0.248), ('counting', 0.245), ('ccgs', 0.231), ('wn', 0.211), ('kn', 0.201), ('lda', 0.191), ('grids', 0.181), ('windows', 0.177), ('window', 0.17), ('recipes', 0.16), ('ln', 0.148), ('bread', 0.131), ('word', 0.119), ('pizza', 0.117), ('topics', 0.115), ('bag', 0.092), ('chicken', 0.088), ('words', 0.083), ('fry', 0.083), ('preheat', 0.083), ('removable', 0.083), ('topic', 0.082), ('pour', 0.081), ('sheet', 0.073), ('recipe', 0.072), ('bags', 0.067), ('cgs', 0.067), ('dough', 0.066), ('frying', 0.066), ('perina', 0.066), ('spice', 0.066), ('document', 0.066), ('embedding', 0.065), ('sam', 0.064), ('egg', 0.063), ('afghan', 0.058), ('cooking', 0.058), ('documents', 0.058), ('mix', 0.055), ('jojic', 0.054), ('yx', 0.054), ('location', 0.053), ('zoom', 0.051), ('bake', 0.05), ('boil', 0.05), ('chocolate', 0.05), ('crumb', 0.05), ('desserts', 0.05), ('mastercook', 0.05), ('oven', 0.05), ('soupe', 0.05), ('wok', 0.05), ('text', 0.048), ('pick', 0.047), ('cook', 0.046), ('kernel', 0.046), ('sources', 0.045), ('dish', 0.044), ('inch', 0.044), ('vocabulary', 0.044), ('cg', 0.043), ('locations', 0.042), ('generative', 0.04), ('multimodal', 0.04), ('dirichlet', 0.04), ('ingredients', 0.038), ('wikipedia', 0.037), ('position', 0.036), ('capacity', 0.036), ('latent', 0.035), ('usage', 0.035), ('posts', 0.034), ('admixture', 0.034), ('worlds', 0.034), ('hi', 0.034), ('day', 0.034), ('classi', 0.034), ('airtight', 0.033), ('almond', 0.033), ('baking', 0.033), ('beef', 0.033), ('bowl', 0.033), ('brush', 0.033), ('chili', 0.033), ('cornstarch', 0.033), ('crossevaluation', 0.033), ('cuisines', 0.033), ('devein', 0.033), ('disorganized', 0.033), ('ginger', 0.033), ('grease', 0.033), ('grecian', 0.033), ('griddle', 0.033), ('gyros', 0.033), ('knead', 0.033), ('marinade', 0.033), ('meat', 0.033), ('mixer', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999833 98 nips-2013-Documents as multiple overlapping windows into grids of counts
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. 1
2 0.11826329 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (SHLDA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. SHLDA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1
3 0.10284644 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
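Since the abstract names a stochastic proximal gradient method for large-scale ℓ1 logistic regression as one instantiation of the framework, here is a minimal generic sketch of such an update (a stochastic gradient step followed by soft-thresholding). The decaying step-size schedule and the plain per-sample surrogate are illustrative assumptions, not Mairal's exact algorithm.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def stochastic_prox_grad_logreg(X, y, lam=0.01, n_epochs=5, step0=1.0):
    """l1-regularized logistic regression via stochastic proximal gradient.
    X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = step0 / np.sqrt(t)                        # assumed decaying step size
            margin = y[i] * X[i].dot(w)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))    # gradient of the logistic loss
            w = soft_threshold(w - eta * grad, eta * lam)   # proximal step for the l1 term
    return w

# Tiny usage example on synthetic sparse data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50); w_true[:5] = 1.0
y = np.sign(X.dot(w_true) + 0.1 * rng.normal(size=200))
w_hat = stochastic_prox_grad_logreg(X, y)
```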
4 0.095886484 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
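Because the abstract describes training log-bilinear models with noise-contrastive estimation, the sketch below evaluates the per-example NCE objective with k noise samples drawn from a unigram noise distribution. Scoring words by a plain inner product of embedding vectors and treating the model's partition function as a constant equal to 1 are simplifying assumptions, not the authors' exact model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(target_vec, context_vec, noise_vecs, p_noise_target, p_noise_samples, k):
    """Negative NCE objective for one (context, target word) pair.
    Log-scores are inner products of embeddings (log-bilinear style)."""
    s_target = context_vec.dot(target_vec)        # model log-score of the observed word
    s_noise = noise_vecs.dot(context_vec)         # log-scores of the k noise words
    # Binary-classification view: is the word from the data or from the noise distribution?
    loss = -np.log(sigmoid(s_target - np.log(k * p_noise_target)))
    loss -= np.sum(np.log(sigmoid(-(s_noise - np.log(k * p_noise_samples)))))
    return loss

# Illustrative call with random embeddings and a flat noise distribution
rng = np.random.default_rng(1)
d, k = 50, 5
loss = nce_loss(rng.normal(size=d), rng.normal(size=d),
                rng.normal(size=(k, d)),
                p_noise_target=1e-4,
                p_noise_samples=np.full(k, 1e-4), k=k)
```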
5 0.086159185 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
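As a rough illustration of why a variance-dependent bound can be tighter when the empirical variance is small, the sketch below compares a Hoeffding-style deviation term with the standard empirical Bernstein deviation (Maurer and Pontil flavour) for [0, 1]-valued losses. This is not the PAC-Bayesian inequality from the paper, which, roughly speaking, replaces the log(1/δ)-style term with a complexity term involving the KL divergence between posterior and prior; the constants below are those of the non-PAC-Bayes bound.

```python
import numpy as np

def hoeffding_term(n, delta):
    # Hoeffding-style deviation for losses bounded in [0, 1]
    return np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def empirical_bernstein_term(losses, delta):
    # Variance-dependent deviation for losses bounded in [0, 1]
    n = len(losses)
    v = np.var(losses, ddof=1)                     # sample variance
    return (np.sqrt(2.0 * v * np.log(2.0 / delta) / n)
            + 7.0 * np.log(2.0 / delta) / (3.0 * (n - 1)))

rng = np.random.default_rng(2)
losses = rng.binomial(1, 0.02, size=2000) * 1.0    # small empirical loss and variance
print(hoeffding_term(len(losses), 0.05))
print(empirical_bernstein_term(losses, 0.05))      # much smaller when the variance is small
```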
6 0.085729018 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
7 0.082683936 224 nips-2013-On the Sample Complexity of Subspace Learning
9 0.076647528 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
10 0.075641364 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
11 0.074436173 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
12 0.074409321 301 nips-2013-Sparse Additive Text Models with Low Rank Background
13 0.073653437 317 nips-2013-Streaming Variational Bayes
14 0.072926626 70 nips-2013-Contrastive Learning Using Spectral Methods
15 0.072364584 88 nips-2013-Designed Measurements for Vector Count Data
16 0.072127305 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
17 0.071974546 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
18 0.068759345 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
19 0.065450296 5 nips-2013-A Deep Architecture for Matching Short Texts
20 0.064433798 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
topicId topicWeight
[(0, 0.149), (1, 0.066), (2, -0.051), (3, -0.035), (4, 0.07), (5, -0.017), (6, 0.063), (7, 0.019), (8, 0.055), (9, 0.02), (10, -0.079), (11, 0.095), (12, -0.033), (13, -0.037), (14, 0.051), (15, -0.082), (16, 0.18), (17, 0.051), (18, 0.011), (19, 0.039), (20, -0.039), (21, -0.104), (22, 0.002), (23, 0.063), (24, -0.051), (25, -0.088), (26, 0.044), (27, 0.049), (28, -0.129), (29, -0.011), (30, -0.027), (31, -0.063), (32, -0.06), (33, 0.01), (34, -0.032), (35, -0.031), (36, 0.029), (37, -0.016), (38, 0.047), (39, -0.082), (40, 0.073), (41, 0.01), (42, 0.036), (43, -0.083), (44, -0.023), (45, 0.018), (46, 0.028), (47, -0.012), (48, 0.058), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.95371854 98 nips-2013-Documents as multiple overlapping windows into grids of counts
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. 1
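To make the window-into-a-grid idea above concrete, here is a minimal sketch that averages the word distributions inside a window of a counting grid and scores a document's word counts against that averaged histogram. The array shapes, the wrap-around indexing, and the uniform averaging over the window are illustrative assumptions rather than the exact formulation in the paper.

```python
import numpy as np

def window_distribution(pi, i, j, h, w):
    """Average the grid's word distributions over an h-by-w window anchored at (i, j).
    pi: (H, W, V) array; each pi[x, y] is a distribution over the vocabulary."""
    H, W, _ = pi.shape
    rows = [(i + a) % H for a in range(h)]         # wrap-around is an assumed convenience
    cols = [(j + b) % W for b in range(w)]
    window = pi[np.ix_(rows, cols)]                # (h, w, V)
    return window.mean(axis=(0, 1))                # mixture of the histograms in the window

def doc_log_likelihood(counts, pi, i, j, h, w):
    """Log-likelihood of a bag-of-words count vector under the window at (i, j)."""
    p = window_distribution(pi, i, j, h, w)
    return np.sum(counts * np.log(p + 1e-12))

# Toy example: random grid of distributions, then find the best-fitting window location
rng = np.random.default_rng(3)
H, W, V = 10, 10, 30
pi = rng.dirichlet(np.ones(V), size=(H, W))        # (H, W, V)
counts = rng.multinomial(100, np.ones(V) / V)
best = max(((i, j) for i in range(H) for j in range(W)),
           key=lambda loc: doc_log_likelihood(counts, pi, *loc, 3, 3))
```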
2 0.76627266 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. 
For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. 
Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ 1=ݐ 2=ݐ 1=ݐ 2=ݐ 3=ݐ 1=ݐ 2=ݐ ݐௗ௦ 2=ݏ 1=ݏ ܵ = ݏଵ 3=ݏ 2=ݏ 1=ݏ ݀=1 ݇ௗ௦ ܵ = ݏଶ ܵ = ݏ ݀=2 ߶ଵ ߟଵ ݀=ܦ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 1=ݏ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. Including both hierarchical and lexical parameters is important. 
For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) = 1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x } Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). 
The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝ L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1 γl∗ −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. 
After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 
5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. 
For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). 
transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . 
3 0.69171429 70 nips-2013-Contrastive Learning Using Spectral Methods
Author: James Y. Zou, Daniel Hsu, David C. Parkes, Ryan P. Adams
Abstract: In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis. 1
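As a loose illustration of the contrastive idea described above (using background statistics to correct moments estimated from foreground data), the sketch below subtracts a scaled background word co-occurrence matrix from the foreground one and keeps the leading eigenvectors. The scaling constant and the use of a plain matrix eigendecomposition instead of the paper's tensor decomposition machinery are simplifying assumptions.

```python
import numpy as np

def cooccurrence(doc_term):
    """Word-word co-occurrence (second moment) from a document-term count matrix."""
    probs = doc_term / doc_term.sum(axis=1, keepdims=True)
    return probs.T.dot(probs) / probs.shape[0]

def contrastive_components(fg_counts, bg_counts, gamma=1.0, k=5):
    """Leading directions of the background-corrected foreground moment."""
    m = cooccurrence(fg_counts) - gamma * cooccurrence(bg_counts)
    m = (m + m.T) / 2.0                            # keep the corrected moment symmetric
    vals, vecs = np.linalg.eigh(m)
    order = np.argsort(vals)[::-1]                 # largest eigenvalues first
    return vecs[:, order[:k]]

# Toy foreground and background corpora (shifted by 1 to avoid empty documents)
rng = np.random.default_rng(5)
fg = 1 + rng.poisson(1.0, size=(200, 40))
bg = 1 + rng.poisson(1.0, size=(500, 40))
components = contrastive_components(fg.astype(float), bg.astype(float))
```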
4 0.67087239 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to better explain video data containing complex content and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state-of-the-art performance and outperforms other supervised topic models in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
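As a schematic of the kind of mid-level representation described above (rectified-linear topic activations on top of bag-of-words features, fed to a classifier with sparse weights), here is a minimal sketch. The random projection standing in for learned topics and the L1-penalized logistic classifier are placeholders; this is not the paper's Replicated-Softmax-based model or its variational EM procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relu_topic_features(bow, topic_matrix, bias):
    """Rectified-linear 'topic' activations from bag-of-words counts."""
    return np.maximum(0.0, bow.dot(topic_matrix) + bias)

rng = np.random.default_rng(6)
n_docs, vocab, n_topics = 300, 500, 40
bow = rng.poisson(0.2, size=(n_docs, vocab)).astype(float)
labels = rng.integers(0, 2, size=n_docs)

topic_matrix = rng.normal(scale=0.1, size=(vocab, n_topics))   # placeholder for learned topics
bias = np.zeros(n_topics)
features = relu_topic_features(bow, topic_matrix, bias)

# Sparse-weight classifier on top of the mid-level features
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(features, labels)
```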
5 0.63701123 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
Author: Jianfei Chen, June Zhu, Zi Wang, Xun Zheng, Bo Zhang
Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1
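To illustrate the source of the non-conjugacy mentioned above, the sketch below draws per-document topic proportions from a logistic-normal prior with a full covariance (so correlations between topics are captured) and then samples words. This shows only the generative side; it says nothing about the paper's partially collapsed Gibbs sampler or its data augmentation scheme.

```python
import numpy as np

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def sample_document(mu, sigma, topics, n_words, rng):
    """mu, sigma: mean and covariance of the logistic-normal prior over K topics.
    topics: (K, V) topic-word distributions."""
    eta = rng.multivariate_normal(mu, sigma)
    theta = softmax(eta)                           # correlated topic proportions
    z = rng.choice(len(theta), size=n_words, p=theta)
    words = np.array([rng.choice(topics.shape[1], p=topics[k]) for k in z])
    return theta, words

rng = np.random.default_rng(7)
K, V = 5, 100
mu = np.zeros(K)
sigma = 0.5 * np.eye(K) + 0.2                      # positive correlations between topics
topics = rng.dirichlet(np.ones(V), size=K)
theta, words = sample_document(mu, sigma, topics, n_words=50, rng=rng)
```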
6 0.61939764 301 nips-2013-Sparse Additive Text Models with Low Rank Background
8 0.54643977 88 nips-2013-Designed Measurements for Vector Count Data
9 0.53051931 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
10 0.51268655 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
11 0.50819874 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
12 0.49405009 317 nips-2013-Streaming Variational Bayes
13 0.4889434 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
14 0.47965884 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
15 0.47425324 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
16 0.4354493 294 nips-2013-Similarity Component Analysis
17 0.42588681 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
18 0.4133721 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
19 0.403936 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
20 0.40368018 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
topicId topicWeight
[(16, 0.031), (33, 0.102), (34, 0.12), (41, 0.023), (49, 0.07), (56, 0.063), (69, 0.314), (70, 0.033), (85, 0.027), (89, 0.043), (93, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.75367308 98 nips-2013-Documents as multiple overlapping windows into grids of counts
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. 1
2 0.59192133 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically, we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
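Since the approach above adapts a graph-based semi-supervised learner to spread (zero-/few-shot) label information over a neighborhood graph built on mid-level representations, here is a minimal sketch of standard label propagation on a kNN affinity graph. The symmetric normalization, the fixed α, and the Gaussian affinities follow the common Zhou-et-al.-style recipe and are assumptions about the details, not the exact Propagated Semantic Transfer algorithm.

```python
import numpy as np

def knn_graph(X, k=10):
    """Symmetric kNN affinity matrix from representations X of shape (n, d)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma2 = np.median(d2) + 1e-12
    W = np.exp(-d2 / sigma2)
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(len(W))[:, None], idx] = True
    # keep an edge if either point lists the other among its k nearest neighbors
    return W * (mask | mask.T)

def propagate(W, Y, alpha=0.9, n_iter=50):
    """Y: (n, C) one-hot rows for labeled points, zero rows for unlabeled points."""
    d = W.sum(1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)        # symmetrically normalized affinities
    F = Y.copy().astype(float)
    for _ in range(n_iter):
        F = alpha * S.dot(F) + (1 - alpha) * Y     # diffuse labels, anchor the labeled rows
    return F.argmax(1)
```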
3 0.51779234 121 nips-2013-Firing rate predictions in optimal balanced networks
Author: David G. Barrett, Sophie Denève, Christian K. Machens
Abstract: How are firing rates in a spiking network related to neural input, connectivity and network function? This is an important problem because firing rates are a key measure of network activity, in both the study of neural computation and neural network dynamics. However, it is a difficult problem, because the spiking mechanism of individual neurons is highly non-linear, and these individual neurons interact strongly through connectivity. We develop a new technique for calculating firing rates in optimal balanced networks. These are particularly interesting networks because they provide an optimal spike-based signal representation while producing cortex-like spiking activity through a dynamic balance of excitation and inhibition. We can calculate firing rates by treating balanced network dynamics as an algorithm for optimising signal representation. We identify this algorithm and then calculate firing rates by finding the solution to the algorithm. Our firing rate calculation relates network firing rates directly to network input, connectivity and function. This allows us to explain the function and underlying mechanism of tuning curves in a variety of systems. 1
4 0.51740611 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin
Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-the-art. Via exploratory data analysis (using data with partial ground truth as well as two novel data sets) we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1
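The classical two-step pipeline that the abstract contrasts with (threshold the trace to detect putative spikes, then cluster the waveform snippets into units) can be sketched as below. The threshold rule, window length, and the use of PCA plus k-means are conventional baseline choices, not the paper's Gamma process model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def detect_spikes(trace, thresh_sd=4.0, window=32):
    """Return waveform snippets around negative threshold crossings of a 1-D voltage trace."""
    sd = np.median(np.abs(trace)) / 0.6745             # robust noise-level estimate
    crossings = np.where(trace < -thresh_sd * sd)[0]
    if crossings.size == 0:
        return np.empty((0, window))
    keep = crossings[np.insert(np.diff(crossings) > window, 0, True)]  # one event per run
    half = window // 2
    snippets = [trace[t - half:t + half] for t in keep
                if t - half >= 0 and t + half <= len(trace)]
    return np.array(snippets)

def sort_spikes(snippets, n_units=3, n_components=3):
    """Cluster detected waveforms into putative single units (the classical second step)."""
    feats = PCA(n_components=n_components).fit_transform(snippets)
    return KMeans(n_clusters=n_units, n_init=10).fit_predict(feats)

# Toy usage on synthetic noise with a few injected spike-like events
rng = np.random.default_rng(8)
trace = rng.normal(scale=1.0, size=20000)
for t in rng.integers(100, 19900, size=60):
    trace[t:t + 10] -= 8.0 * np.exp(-np.arange(10) / 3.0)
labels = sort_spikes(detect_spikes(trace))
```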
5 0.51657999 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
Author: Nikhil Rao, Christopher Cox, Rob Nowak, Timothy T. Rogers
Abstract: Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multisubject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso. 1
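As a rough schematic of the penalty structure discussed above (select a small number of feature sets, and allow further sparsity within each selected set), the function below scores a weight vector with a combined set-level ℓ2 and within-set ℓ1 penalty over possibly overlapping groups. This particular functional form is an assumption made for illustration; it is not claimed to be the exact SOSlasso penalty or its optimization procedure.

```python
import numpy as np

def overlapping_sets_penalty(w, groups, lam_set=1.0, lam_within=0.5):
    """groups: list of index lists, possibly overlapping.
    Set-level sparsity via the l2 norm of each group, within-set sparsity via l1."""
    total = 0.0
    for g in groups:
        wg = w[np.asarray(g)]
        total += lam_set * np.linalg.norm(wg, 2) + lam_within * np.linalg.norm(wg, 1)
    return total

# Example: two overlapping sets of features over a 6-dimensional weight vector
w = np.array([0.0, 2.0, 0.0, -1.0, 0.0, 0.0])
groups = [[0, 1, 2, 3], [2, 3, 4, 5]]
print(overlapping_sets_penalty(w, groups))
```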
6 0.5133453 64 nips-2013-Compete to Compute
7 0.51272887 148 nips-2013-Latent Maximum Margin Clustering
8 0.51194328 5 nips-2013-A Deep Architecture for Matching Short Texts
9 0.50818819 183 nips-2013-Mapping paradigm ontologies to and from the brain
10 0.50764573 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
11 0.50718677 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
12 0.50714242 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
13 0.50675368 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
14 0.5050773 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.50458837 173 nips-2013-Least Informative Dimensions
16 0.50399351 86 nips-2013-Demixing odors - fast inference in olfaction
17 0.50394434 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
18 0.50275427 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
19 0.50219584 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?
20 0.50174159 99 nips-2013-Dropout Training as Adaptive Regularization