nips nips2013 nips2013-274 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
Reference: text
sentIndex sentText sentNum sentScore
1 cn Abstract Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. [sent-5, score-0.438]
2 To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. [sent-6, score-0.522]
3 In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i. [sent-7, score-0.262]
4 , Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. [sent-9, score-0.547]
5 Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. [sent-10, score-0.751]
6 An efficient variational EM algorithm is presented for model parameter estimation and inference. [sent-11, score-0.095]
7 Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. [sent-12, score-0.255]
8 1 Introduction The explosive growth of web videos makes automatic video classification important for online video search and indexing. [sent-13, score-0.658]
9 Classifying short video clips which contain simple motions and actions has been solved well in standard datasets (such as KTH [1], UCF-Sports [2] and UCF50 [3]). [sent-14, score-0.262]
10 However, detecting complex activities, specially social group activities [4], in web videos is a more difficult task because of unstructured activity context and complex multi-object interaction. [sent-15, score-0.525]
11 In this paper, we focus on the task of automatic classification of unstructured social group activities (e. [sent-16, score-0.269]
12 , wedding dance, birthday party and graduation ceremony in Figure 1), where the low-level features have innate limitations in semantic description of the underlying video data and only a few labeled training videos are available. [sent-18, score-0.831]
13 Thus, a common method is to learn human-defined (or semi-human-defined) semantic concepts as mid-level representations to help video classification [4]. [sent-19, score-0.331]
14 To discover more powerful representations for classification, we propose a novel supervised topic model called “relevance topic model” to automatically extract latent “relevance” topics from bag-of-words (BoW) video representations and simultaneously learn a classifier with sparse weights. [sent-21, score-1.053]
15 Our model is built on Replicated Softmax [5], an undirected topic model which can be viewed as a family of different-sized Restricted Boltzmann Machines that share parameters. [sent-22, score-0.224]
16 Sparse Bayesian learning [6] is incorporated to guide the topic model towards learning more predictive topics which are associated with sparse classifier weights. [sent-23, score-0.475]
17 We refer to those topics corresponding to non-zero weights as “relevance” topics. [sent-24, score-0.278]
18 explaining video data containing complex content and also makes variational inference tractable for the proposed model. [sent-26, score-0.358]
19 Furthermore, by using a simple quadratic bound on the log-sum-exp function [8], an efficient variational EM algorithm is developed for parameter estimation and inference. [sent-27, score-0.095]
20 Our model is able to be naturally extended to deal with multi-modal data without changing the learning and inference procedures, which is beneficial for video classification tasks. [sent-28, score-0.237]
21 2 Related work The problems of activity analysis and recognition have been widely studied. [sent-29, score-0.072]
22 However, most of the existing works [9, 10] were done on constrained videos with limited contents (e. [sent-30, score-0.178]
23 Complex activity recognition in web videos, such as social group activity, is not much explored. [sent-33, score-0.208]
24 Most relevant to our work is a recent work that learns video attributes to analyze social group activity [4]. [sent-34, score-0.455]
25 In [4], a semi-latent attribute space is introduced, which consists of user-defined attributes, class-conditional and background latent attributes, and an extended Latent Dirichlet Allocation (LDA) [11] is used to model those attributes as topics. [sent-35, score-0.142]
26 Different from that, our work discovers a set of discriminative latent topics without extra human annotations on videos. [sent-36, score-0.382]
27 From the view of graphical models, most similar to our model are the maximum entropy discrimination LDA (MedLDA) [12] and the generative Classification Restricted Boltzmann Machines (gClassRBM) [13], both of which have been successfully applied to document semantic analysis. [sent-37, score-0.05]
28 MedLDA integrates the max-margin learning and hierarchical directed topic models by optimizing a single objective function with a set of expected margin constraints. [sent-38, score-0.186]
29 MedLDA tries to estimate parameters and find latent topics in a max-margin sense, which is different from our model relying on the principle of automatic relevance determination [14]. [sent-39, score-0.501]
30 The gClassRBM used to model word count data is actually a supervised Replicated Softmax. [sent-40, score-0.048]
31 Different from the gClassRBM, instead of point estimation of classifier parameters, our proposed model learns a sparse posterior distribution over parameters within a Bayesian paradigm. [sent-41, score-0.038]
32 3 Models and algorithms We start with the description of Replicated Softmax, and then by integrating it with sparse Bayesian learning, propose the relevance topic model for videos. [sent-42, score-0.348]
33 Finally, we develop an efficient variational algorithm for inference and parameter estimation. [sent-43, score-0.095]
34 1 Replicated Softmax The Replicated Softmax model is a two-layer undirected graphical model, which can be used to model sparse word count data and extract latent semantic topics from document collections. [sent-45, score-0.477]
35 The bottom layer represents a multinomial visible unit sampled K times (K is the total number of words in a document) and the top layer represents binary stochastic hidden units. [sent-48, score-0.036]
36 W W2 ηc v y v C Figure 2: Left: Replicated Softmax model: an undirected graphical model. [sent-52, score-0.065]
37 The undirected part models the marginal distribution of video BoW vectors v and the directed part models the conditional distribution of video classes y given latent topics tr by using a hierarchical prior on weights η . [sent-54, score-1.152]
38 Let a word count vector v ∈ NN be the visible unit (N is the size of the vocabulary), and a binary topic vector h ∈ {0, 1}F be the hidden units. [sent-55, score-0.195]
39 Then the energy function of the state {v, h} is defined as follows: N F N Wij vi hj − E(v, h; θ) = − i=1 j=1 F ai vi − K i=1 bj hj , (1) j=1 where θ = {W, a, b}, Wij is the weight connected with vi and hj , ai and bj are the bias terms of visible and hidden units respectively. [sent-56, score-0.391]
40 The joint distribution over the visible and hidden units is defined by: P (v, h; θ) = 1 exp(−E(v, h; θ)), Z(θ) = Z(θ) exp(−E(v, h; θ)), v (2) h where Z(θ) is the partition function. [sent-57, score-0.105]
41 Since exact maximum likelihood learning is intractable, the contrastive divergence [15] approximation is often used to estimate model parameters in practice. [sent-58, score-0.032]
42 2 Relevance topic model The relevance topic model (RTM) is an integration of sparse Bayesian learning and Replicated Softmax, the main idea of which is to jointly learn discriminative topics as mid-level video representations and sparse discriminant function as a video classifier. [sent-60, score-1.375]
43 We represent the video dataset with class labels y ∈ {1, . [sent-61, score-0.237]
44 , C} as D = {(vm , ym )}M , where m=1 each video is represented as a BoW vector v ∈ NN . [sent-64, score-0.41]
45 Consider modeling video BoW vectors using the Replicated Softmax. [sent-65, score-0.237]
46 , tr ] denotes a F-dimensional topic vector of one video. [sent-69, score-0.389]
47 1 F According to Equation 2, the marginal distribution over the BoW vector v is given by: P (v; θ) = 1 Z(θ) exp(−E(v, tr ; θ)), (3) tr Since videos contain more complex and diverse contents than documents, binary topics are far from ideal to explain video data. [sent-70, score-1.179]
48 We replace binary hidden units in the original Replicated Softmax with rectified linear units which are given by: N tr j = max(0, tj ), P (tj |v; θ) = N (tj |Kbj + Wij vi , 1), (4) i=1 where N (·|µ, τ ) denotes a Gaussian distribution with mean µ and variance τ . [sent-71, score-0.453]
49 The rectified linear units taking nonnegative real values can preserve information about relative importance of topics. [sent-72, score-0.069]
50 This facilitates the development of variational algorithms for posterior inference and parameter estimation, which we describe in Section 3. [sent-74, score-0.095]
51 We define the discriminant function as a linear combination of topics: F (y, tr , η ) = η T tr . [sent-77, score-0.46]
52 The conditional distribution of classes is y 3 defined as follows: P (y|tr , η ) = exp(F (y, tr , η )) C y =1 exp(F (y , tr , η )) , (5) and the classifier is given by: y = arg max E[F (y, tr , η )|v]. [sent-78, score-0.722]
53 The hyperpriors over α are given by Gamma distributions: C F α P (α ) = C F −1 c c−1 −dα d αyj e , P (αyj ) = Γ(c) y=1 j=1 (8) y=1 j=1 where Γ(c) is the Gamma function. [sent-80, score-0.033]
54 This hierarchical prior, which is a type of automatic relevance determination prior [14], enables the posterior probability of the weights η to be concentrated at zero and thus effectively switch off the corresponding topics that are considered to be irrelevant to classification. [sent-84, score-0.455]
55 Finally, given the parameters θ, RTM defines the joint distribution: C F P (v, y, tr , η , α ; θ) = P (v; θ)P (y|tr , η ) F P (ηyj |αyj )P (αyj ) . [sent-85, score-0.23]
56 P (tj |v; θ) (9) y=1 j=1 j=1 Figure 2 (right) illustrates RTM as a mixed graphical model with undirected and directed edges. [sent-86, score-0.092]
57 The undirected part models the marginal distribution of video data and the directed part models the conditional distribution of classes given latent topics. [sent-87, score-0.407]
58 We can naturally extend RTM to Multimodal RTM by using the undirected part to model the multimodal data v = {vmodl }L . [sent-88, score-0.154]
59 Since exactly computing P (D; θ) is intractable, we employ variational methods to optimize a lower bound L on the log likelihood by introducing a variational distribution to approximate P ({tm }M , η , α |D; θ): m=1 M F Q({tm }M , η , α ) = m=1 η α q(tmj ) q(η )q(α ). [sent-94, score-0.214]
60 (11) m=1 j=1 Using Jensens inequality, we have: log P (D; θ) log Q({tm }M , η , α ) m=1 M m=1 ηα α P (vm ; θ)P (ym |tr , η )P (tm |vm ; θ) P (η |α )P (α ) m Q({tm }M , η , α ) m=1 d{tm }M dη dα . [sent-95, score-0.048]
61 (12) m=1 η α Note that P (ym |tr , η ) is not conjugate to the Gaussian prior, which makes it intractable to compute m η the variational factors q(η ) and q(tmj ). [sent-96, score-0.095]
62 , (tr )T η C−1 ], ym = I(ym = c) is the one-of-C encoding of class m m m C−1 label ym and lse(x) log(1 + y =1 exp(xy )) (we set η C = 0 to ensure identifiability). [sent-101, score-0.346]
63 Substituting J(ym , tr , η , ϕ m ) into Equation m 11, we can obtain a further lower bound: M log P (D; θ) M L(θ, ϕ ) = J(ym , tr , η , ϕ m ) m log P (vm ; θ) + EQ m=1 m=1 M ηα α log P (tm |vm ; θ) + log P (η |α ) + log P (α ) − Q({tm }M , η , α ) . [sent-104, score-0.58]
64 (17) m=1 + m=1 Now we convert the problem of model training into maximizing the lower bound L(θ, ϕ ) with η α respect to the variational posteriors q(η ), q(α ) and q(t) = {q(tmj )} as well as the parameters ϕ θ and ϕ = {ϕ m }. [sent-105, score-0.156]
65 Given θ, through repeating the updates of Equations 18-20 and 23 to maximize L(θ, ϕ ), we can η α η α obtain the variational posteriors q(η ), q(α ) and q(t). [sent-111, score-0.131]
66 This leads to the following variational EM algorithm: η α E-step: Calculate variational posteriors q(η ), q(α ) and q(t). [sent-113, score-0.226]
67 After the learning is completed, according to Equation 6 the prediction for new videos can be easily obtained: y = arg max η T q(η ) tr p(t|v;θ) . [sent-117, score-0.362]
68 ˆ (27) y η y∈C 4 Experiments We test our models on the Unstructured Social Activity Attribute (USAA) dataset 1 for social group activity recognition. [sent-118, score-0.182]
69 Firstly, we present quantitative evaluations of RTM in the case of different modalities and comparisons with other supervised topic models (namely MedLDA and gClassRBM). [sent-119, score-0.231]
70 Secondly, we compare Multimodal RTM with some baselines in the case of plentiful and sparse training data respectively. [sent-120, score-0.117]
71 In all experiments, the contrastive divergence is used to efficiently approximate the derivatives of the marginal log likelihood and the unsupervised training on Replicated Softmax is used to initialize θ. [sent-121, score-0.081]
72 1 Dataset and video representation The USAA dataset consists of 8 semantic classes of social activity videos collected from the Internet. [sent-123, score-0.597]
73 The eight classes are: birthday party, graduation party, music performance, non-music performance, parade, wedding ceremony, wedding dance and wedding reception. [sent-124, score-0.519]
74 The dataset contains a total of 1466 videos and approximate 100 videos per-class for training and testing respectively. [sent-125, score-0.289]
75 These videos range from 20 seconds to 8 minutes averaging 3 minutes and contain very complex and diverse contents, which brings significant challenges for content analysis. [sent-126, score-0.158]
76 2 Model comparisons To evaluate the discriminative power of video topics learned by RTM, we present quantitative classification results compared with other supervised topic models (MedLDA and gClassRBM) in the case of different modalities. [sent-133, score-0.756]
77 6 Table 1: Classification accuracy of different supervised topic models for single-modal features. [sent-140, score-0.207]
78 Feature SIFT Model Accuracy (%) 20 topics 30 topics 40 topics 50 topics 60 topics Med LDA 44. [sent-141, score-1.39]
79 33 Table 2: Classification accuracy of different methods for multimodal features. [sent-186, score-0.089]
80 Method 100 Inst Accuracy (%) 10 Inst Multimodal RTM 60 topics 90 topics 120 topics 150 topics 180 topics 60 topics 90 topics 120 topics 150 topics 180 topics 60. [sent-187, score-2.78]
81 3 Baseline comparisons We compare Multimodal RTM with the baselines in [4] which are the best results on the USAA dataset: Direct Direct SVM or KNN classification on raw video BoW vectors (14000 dimensions), where SVM is used for experiments with more than 10 instances and KNN otherwise. [sent-216, score-0.261]
82 SVM-UD+LR SVM attribute classifiers learn 69 user-defined attributes, and then a logistic regression (LR) classifier is performed according to the attribute classifier outputs. [sent-217, score-0.12]
83 SLAS+LR Semi-latent attribute space is learned, and then a LR classifier is performed based on the 69 user-defined, 8 class-conditional and 8 latent topics. [sent-218, score-0.106]
84 Besides, we also perform a comparison with another baseline where different modal topics extracted by Replicated Softmax are connected together as video representations, and then a multi-class SVM classifier [21] is learned from the representations. [sent-219, score-0.515]
85 Here the number of topics of each modality is assumed to be the same. [sent-222, score-0.305]
86 When the labeled training data is plentiful (100 instances per class), the classification performance of Multimodal RTM is similar to the baselines in [4]. [sent-223, score-0.102]
87 However, We argue that our model learns a lower dimensional latent semantic space which provides efficient video representations and is able to be better generalized to a larger or new dataset because extra human defined concepts are not required in our model. [sent-224, score-0.401]
88 , 90) of topics because the sparsity of relevance topics learned by RTM can effectively prevent overfitting to specific training instances. [sent-227, score-0.732]
89 In addition, our model outperforms RS+SVM in both cases, which demonstrates the advantage of jointly learning topics and classifier weights through sparse Bayesian learning. [sent-228, score-0.316]
90 It is also interesting to examine the sparsity of relevance topics. [sent-229, score-0.151]
91 Figure 3 illustrates the degree of correlation between topics and two different classes. [sent-230, score-0.278]
92 We can see that the learned relevance topics are very sparse, which leads to good generalisation for new instances and robustness for small datasets. [sent-231, score-0.429]
93 7 Figure 3: Relevance topics discovered by RTM for two different classes. [sent-232, score-0.278]
94 5 Conclusion This paper has proposed a supervised topic model, the relevance topic model (RTM), to jointly learn discriminative latent topical representations and a sparse classifer for recognizing unstructured social group activity. [sent-234, score-0.905]
95 In RTM, sparse Bayisian learning is integrated with an undirected topic model to discover sparse relevance topics. [sent-235, score-0.451]
96 Rectified linear units are employed to better fit complex video data and facilitate the learning of the model. [sent-236, score-0.332]
97 Efficient variational methods are developed for parameter estimation and inference. [sent-237, score-0.095]
98 To further improve video classification performance, RTM is also extended to deal with multimodal data. [sent-238, score-0.326]
99 Experimental results demonstrate that RTM can find more predictive video topics than other supervised topic models and achieve state of the art classification performance, particularly in the scenario of lacking labeled training videos. [sent-239, score-0.77]
100 A sequential topic model for mining recurrent activities from long term video logs. [sent-306, score-0.441]
wordName wordTfidf (topN-words)
[('rtm', 0.48), ('topics', 0.278), ('tmj', 0.277), ('video', 0.237), ('tr', 0.23), ('replicated', 0.185), ('ym', 0.173), ('topic', 0.159), ('relevance', 0.151), ('neg', 0.135), ('pos', 0.134), ('videos', 0.132), ('softmax', 0.13), ('tm', 0.116), ('lse', 0.114), ('medlda', 0.114), ('gclassrbm', 0.111), ('yj', 0.108), ('bow', 0.099), ('recti', 0.098), ('wedding', 0.098), ('variational', 0.095), ('multimodal', 0.089), ('unstructured', 0.088), ('vm', 0.077), ('social', 0.074), ('classi', 0.074), ('graduation', 0.074), ('kbj', 0.074), ('stip', 0.074), ('usaa', 0.074), ('vmi', 0.074), ('activity', 0.072), ('units', 0.069), ('lr', 0.066), ('birthday', 0.065), ('ceremony', 0.065), ('undirected', 0.065), ('party', 0.062), ('attribute', 0.06), ('wij', 0.059), ('erfc', 0.055), ('gclass', 0.055), ('dance', 0.054), ('lda', 0.053), ('semantic', 0.05), ('vi', 0.049), ('mfcc', 0.049), ('atr', 0.049), ('er', 0.048), ('supervised', 0.048), ('latent', 0.046), ('contents', 0.046), ('activities', 0.045), ('svm', 0.045), ('representations', 0.044), ('rbm', 0.043), ('sift', 0.042), ('med', 0.04), ('rs', 0.039), ('sparse', 0.038), ('boltzmann', 0.037), ('dyj', 0.037), ('inst', 0.037), ('modl', 0.037), ('slas', 0.037), ('vmodl', 0.037), ('group', 0.036), ('tj', 0.036), ('attributes', 0.036), ('visible', 0.036), ('posteriors', 0.036), ('discriminative', 0.034), ('hyperpriors', 0.033), ('contrastive', 0.032), ('sm', 0.032), ('classes', 0.032), ('ijcv', 0.031), ('cepstral', 0.03), ('plentiful', 0.03), ('hj', 0.029), ('topical', 0.028), ('modality', 0.027), ('directed', 0.027), ('automatic', 0.026), ('bj', 0.026), ('bayesian', 0.026), ('complex', 0.026), ('web', 0.026), ('training', 0.025), ('motions', 0.025), ('modalities', 0.024), ('jmlr', 0.024), ('human', 0.024), ('baselines', 0.024), ('log', 0.024), ('ic', 0.023), ('knn', 0.023), ('gamma', 0.023), ('labeled', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
2 0.14133644 148 nips-2013-Latent Maximum Margin Clustering
Author: Guang-Tong Zhou, Tian Lan, Arash Vahdat, Greg Mori
Abstract: We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches. 1
Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade
Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.
4 0.11651324 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ 1=ݐ 2=ݐ 1=ݐ 2=ݐ 3=ݐ 1=ݐ 2=ݐ ݐௗ௦ 2=ݏ 1=ݏ ܵ = ݏଵ 3=ݏ 2=ݏ 1=ݏ ݀=1 ݇ௗ௦ ܵ = ݏଶ ܵ = ݏ ݀=2 ߶ଵ ߟଵ ݀=ܦ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 1=ݏ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) = 1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x } Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝ L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1 γl∗ −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . The results are unsuprising, although the lexical regression for the Congressional debates is less clear-cut than other 10 All of the nodes at the second level have slightly negative values for the regression parameters mainly due to the very skewed distribution of the review ratings in Amazon. 7 datasets. As we saw in Section 5, for similar datasets, S H L DA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents. Dataset Floor Debates Amazon Reviews Movie Reviews Top 10 words with positive weights bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability Top 10 words with negative weights bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring Table 3: Top words based on the global lexical regression coefficient, τ . For the floor debates, positive τ ’s are Republican-leaning while negative τ ’s are Democrat-leaning. 7 Related Work S H L DA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA , which jointly models documents’ hierarchy of labels and topics. HSLDA ’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike S H L DA’s continuous response variable. Finally, another body related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words. 8 Conclusion We have introduced S H L DA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by S H L DA with political science collaborators and more generally to study how S H L DA and related models can help analyze and discover useful insights from political discourse. Acknowledgments This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (BoydGraber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002. [2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001. [3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008. [4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003. [5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010. [6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012. [7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010. [8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006. [9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012. [10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013. [11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012. [12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007. [13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989. [14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006. [15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004. [16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008. [17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005. [18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999. [19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008. [21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009. [22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011. [23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012. [24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008. [26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011. [27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010. [28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011. [29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011. [30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008. [31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008. [32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009. [33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010. [34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013. 9
5 0.11633131 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.1126247 301 nips-2013-Sparse Additive Text Models with Low Rank Background
7 0.1101885 70 nips-2013-Contrastive Learning Using Spectral Methods
8 0.10837596 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
9 0.094458498 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
10 0.09444163 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
11 0.086322129 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
12 0.085729018 98 nips-2013-Documents as multiple overlapping windows into grids of counts
13 0.081390575 335 nips-2013-Transfer Learning in a Transductive Setting
14 0.074063629 331 nips-2013-Top-Down Regularization of Deep Belief Networks
15 0.073475145 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
16 0.071568519 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
17 0.068579271 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.065306954 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.065198116 5 nips-2013-A Deep Architecture for Matching Short Texts
20 0.064961366 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
topicId topicWeight
[(0, 0.154), (1, 0.078), (2, -0.101), (3, -0.032), (4, 0.097), (5, -0.06), (6, 0.054), (7, 0.022), (8, 0.09), (9, -0.014), (10, -0.02), (11, 0.067), (12, -0.007), (13, -0.033), (14, -0.01), (15, -0.084), (16, 0.135), (17, 0.024), (18, 0.033), (19, -0.117), (20, -0.043), (21, -0.028), (22, 0.087), (23, 0.063), (24, 0.068), (25, -0.185), (26, 0.06), (27, -0.017), (28, -0.116), (29, -0.016), (30, 0.004), (31, -0.067), (32, -0.002), (33, -0.001), (34, 0.025), (35, 0.005), (36, -0.022), (37, -0.073), (38, -0.061), (39, -0.019), (40, 0.073), (41, 0.095), (42, 0.014), (43, -0.017), (44, -0.024), (45, 0.014), (46, 0.044), (47, 0.086), (48, 0.012), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.94780397 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
2 0.71876639 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ 1=ݐ 2=ݐ 1=ݐ 2=ݐ 3=ݐ 1=ݐ 2=ݐ ݐௗ௦ 2=ݏ 1=ݏ ܵ = ݏଵ 3=ݏ 2=ݏ 1=ݏ ݀=1 ݇ௗ௦ ܵ = ݏଶ ܵ = ݏ ݀=2 ߶ଵ ߟଵ ݀=ܦ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 1=ݏ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) = 1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x } Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝ L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1 γl∗ −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . The results are unsuprising, although the lexical regression for the Congressional debates is less clear-cut than other 10 All of the nodes at the second level have slightly negative values for the regression parameters mainly due to the very skewed distribution of the review ratings in Amazon. 7 datasets. As we saw in Section 5, for similar datasets, S H L DA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents. Dataset Floor Debates Amazon Reviews Movie Reviews Top 10 words with positive weights bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability Top 10 words with negative weights bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring Table 3: Top words based on the global lexical regression coefficient, τ . For the floor debates, positive τ ’s are Republican-leaning while negative τ ’s are Democrat-leaning. 7 Related Work S H L DA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA , which jointly models documents’ hierarchy of labels and topics. HSLDA ’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike S H L DA’s continuous response variable. Finally, another body related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words. 8 Conclusion We have introduced S H L DA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by S H L DA with political science collaborators and more generally to study how S H L DA and related models can help analyze and discover useful insights from political discourse. Acknowledgments This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (BoydGraber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002. [2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001. [3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008. [4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003. [5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010. [6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012. [7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010. [8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006. [9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012. [10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013. [11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012. [12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007. [13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989. [14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006. [15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004. [16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008. [17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005. [18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999. [19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008. [21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009. [22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011. [23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012. [24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008. [26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011. [27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010. [28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011. [29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011. [30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008. [31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008. [32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009. [33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010. [34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013. 9
3 0.71523863 301 nips-2013-Sparse Additive Text Models with Low Rank Background
Author: Lei Shi
Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1
4 0.70049638 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
Author: George H. Chen, Stanislav Nikolov, Devavrat Shah
Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1
5 0.68090826 98 nips-2013-Documents as multiple overlapping windows into grids of counts
Author: Alessandro Perina, Nebojsa Jojic, Manuele Bicego, Andrzej Truski
Abstract: In text analysis documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1, 2]. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document’s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks. 1
7 0.653418 70 nips-2013-Contrastive Learning Using Spectral Methods
8 0.61858571 148 nips-2013-Latent Maximum Margin Clustering
9 0.61203617 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
10 0.5909372 75 nips-2013-Convex Two-Layer Modeling
11 0.5820967 294 nips-2013-Similarity Component Analysis
12 0.55907029 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
13 0.55897528 85 nips-2013-Deep content-based music recommendation
14 0.53753436 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
15 0.48188427 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.47026962 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
17 0.44920534 88 nips-2013-Designed Measurements for Vector Count Data
18 0.44446686 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
19 0.4424282 65 nips-2013-Compressive Feature Learning
20 0.43674454 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
topicId topicWeight
[(16, 0.023), (33, 0.122), (34, 0.075), (41, 0.016), (49, 0.516), (56, 0.052), (70, 0.025), (85, 0.03), (89, 0.014), (93, 0.047)]
simIndex simValue paperId paperTitle
1 0.88266754 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do
Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, thus real world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem, and present experimental results with Probabilistic-FF planner. 1
2 0.87622494 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
Author: Jasper Snoek, Richard Zemel, Ryan P. Adams
Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data. 1
same-paper 3 0.84262168 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1
4 0.80319172 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani
Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a lowrank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allow both to scale to very high-dimensional neural data. 1
5 0.74358624 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
Author: Suvrit Sra, Reshad Hosseini
Abstract: Hermitian positive definite (hpd) matrices recur throughout machine learning, statistics, and optimisation. This paper develops (conic) geometric optimisation on the cone of hpd matrices, which allows us to globally optimise a large class of nonconvex functions of hpd matrices. Specifically, we first use the Riemannian manifold structure of the hpd cone for studying functions that are nonconvex in the Euclidean sense but are geodesically convex (g-convex), hence globally optimisable. We then go beyond g-convexity, and exploit the conic geometry of hpd matrices to identify another class of functions that remain amenable to global optimisation without requiring g-convexity. We present key results that help recognise g-convexity and also the additional structure alluded to above. We illustrate our ideas by applying them to likelihood maximisation for a broad family of elliptically contoured distributions: for this maximisation, we derive novel, parameter free fixed-point algorithms. To our knowledge, ours are the most general results on geometric optimisation of hpd matrices known so far. Experiments show that advantages of using our fixed-point algorithms. 1
6 0.72245151 70 nips-2013-Contrastive Learning Using Spectral Methods
7 0.70012426 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
8 0.64346015 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
9 0.62558293 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
10 0.60026222 121 nips-2013-Firing rate predictions in optimal balanced networks
11 0.54782814 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
12 0.50808525 246 nips-2013-Perfect Associative Learning with Spike-Timing-Dependent Plasticity
13 0.50385356 64 nips-2013-Compete to Compute
14 0.5015524 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
15 0.46259776 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
16 0.45924258 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?
17 0.45896092 148 nips-2013-Latent Maximum Margin Clustering
18 0.45724726 301 nips-2013-Sparse Additive Text Models with Low Rank Background
19 0.45696172 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
20 0.4553932 174 nips-2013-Lexical and Hierarchical Topic Regression