nips nips2013 nips2013-345 knowledge-graph by maker-knowledge-mining

345 nips-2013-Variance Reduction for Stochastic Gradient Optimization


Source: pdf

Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing

Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. [sent-6, score-0.473]

2 However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. [sent-7, score-0.44]

3 In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. [sent-8, score-0.537]

4 We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. [sent-10, score-0.571]

5 One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. [sent-11, score-0.297]

6 Thus, stochastic gradient algorithms can run many more iterations in a limited time budget. [sent-18, score-0.272]

7 However, if the noisy gradient has a large variance, the stochastic gradient algorithm might spend much time bouncing around, leading to slower convergence and worse performance. [sent-19, score-0.619]

8 Taking a mini-batch with a larger size for computing the noisy gradient could help to reduce its variance; but if the mini-batch size is too large, it can undermine the advantage in efficiency of stochastic gradient optimization. [sent-20, score-0.565]
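As a hedged illustration of this trade-off (synthetic data and plain numpy, not the paper's experiments), the sketch below shows that averaging a mini-batch of B noisy gradients cuts the variance by roughly a factor of B, at the cost of B gradient evaluations per step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p = 10000, 5
X = rng.normal(size=(D, p))          # stand-ins for per-example gradients
full = X.mean(axis=0)                # "true" gradient from the entire dataset

def minibatch_gradient(batch_size):
    idx = rng.integers(0, D, size=batch_size)
    return X[idx].mean(axis=0)

for B in (1, 10, 100):
    est = np.array([minibatch_gradient(B) for _ in range(2000)])
    var = np.mean(np.sum((est - full) ** 2, axis=1))
    print(f"batch size {B:3d}: E||g_B - g||^2 ~ {var:.4f}")   # shrinks roughly like 1/B
```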

9 In this paper, we propose a general remedy to the “noisy gradient” problem ubiquitous to all stochastic gradient optimization algorithms for different models. [sent-21, score-0.272]

10 Our approach builds on a variance reduction technique, which makes use of control variates [3] to augment the noisy gradient and thereby reduce its variance. [sent-22, score-0.721]

11 For such control variates to be effective and sound, they must satisfy the following key requirements: 1) they have a high correlation with the noisy gradient, and 2) their expectation (with respect to random data samples) is inexpensive to compute. [sent-24, score-0.458]

12 We show that such control variates can be constructed via low-order approximations to the noisy gradient so that their expectation only depends on low-order moments of the data. [sent-25, score-0.634]

13 The intuition is that these low-order moments roughly characterize the empirical data distribution, and can be used to form the control variate to correct the noisy gradient to a better direction. [sent-26, score-0.602]

14 In §2, we describe the general formulation and the theoretical property of variance reduction via control variates in stochastic gradient optimization. [sent-29, score-0.7]

15 In §3, we present two examples to show how one can construct control variates for practical algorithms. [sent-30, score-0.287]

16 These include a convex problem—the MAP estimation for logistic regression, and a non-convex problem—stochastic variational inference for latent Dirichlet allocation [22]. [sent-32, score-0.352]

17 2 Variance reduction for general stochastic gradient optimization. We begin with a description of the general formulation of variance reduction via control variates for stochastic gradient optimization. [sent-35, score-1.058]

18 Gradient-based algorithms can be used to maximize L(w) at the expense of computing the gradient over the entire training set. [sent-39, score-0.217]

19 Instead, stochastic gradient (SG) methods use the noisy gradient estimated from random data samples. [sent-40, score-0.55]

20 Suppose data index $d$ is selected uniformly from $\{1, \dots, D\}$ at step $t$: $g(w; x_d) = \nabla_w R(w) + \nabla_w f(w; x_d)$ (1) and $w_{t+1} = w_t + \rho_t\, g(w; x_d)$ (2), where $g(w; x_d)$ is the noisy gradient that only depends on $x_d$ and $\rho_t$ is a proper step size. [sent-41, score-1.168]
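A minimal sketch of the update in Eqs. 1-2, not the authors' code; `grad_R` and `grad_f` are hypothetical callables standing in for $\nabla_w R(w)$ and $\nabla_w f(w; x_d)$:

```python
import numpy as np

def sgd(w0, X, grad_R, grad_f, step_sizes, seed=0):
    """Plain stochastic gradient ascent: w_{t+1} = w_t + rho_t * g(w_t; x_d).

    grad_R(w) and grad_f(w, x) are assumed to return arrays shaped like w;
    step_sizes is any iterable of step sizes rho_t.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    D = X.shape[0]
    for rho in step_sizes:
        d = rng.integers(D)                 # data index drawn uniformly
        g = grad_R(w) + grad_f(w, X[d])     # noisy gradient, Eq. 1
        w = w + rho * g                     # update, Eq. 2
    return w
```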

21 To make notation simple, we use $g_d(w) \triangleq g(w; x_d)$. [sent-42, score-0.713]

22 Following the standard stochastic optimization literature [1, 4], we require that the expectation of the noisy gradient $g_d$ equal the true gradient, $\mathbb{E}_d[g_d(w)] = \nabla_w L(w)$ (3), to ensure the convergence of the stochastic gradient algorithm. [sent-43, score-1.235]

23 When the variance of $g_d(w)$ is large, the algorithm could suffer from slow convergence. [sent-44, score-0.631]

24 The basic idea of using control variates for variance reduction is to construct a new random vector that has the same expectation as the target expectation but with smaller variance. [sent-45, score-0.512]

25 In previous work [5], control variates were used to improve the estimate of the intractable integral in variational Bayesian inference which was then used to compute the gradient of the variational lower bound. [sent-46, score-0.819]

26 In our context, we employ a random vector $h_d(w)$ of length $p$ to reduce the variance of the sampled gradient: $\tilde g_d(w) = g_d(w) - A^\top (h_d(w) - h(w))$ (4), where $A$ is a $p \times p$ matrix and $h(w) \triangleq \mathbb{E}_d[h_d(w)]$. [sent-47, score-1.633]
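A minimal sketch of the correction in Eq. 4 (illustrative only; the matrix A and the expectation h(w) are assumed to be supplied by the caller):

```python
import numpy as np

def controlled_gradient(g_d, h_d, h_bar, A):
    """Eq. 4: g~_d(w) = g_d(w) - A^T (h_d(w) - h(w)).

    g_d, h_d, h_bar are length-p vectors and A is a p x p matrix.  Because
    E_d[h_d(w)] = h_bar, the corrected vector keeps the same expectation as
    g_d while, for a well-chosen A, having a smaller variance.
    """
    return g_d - A.T @ (h_d - h_bar)
```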

27 (We will show how to choose $h_d(w)$ later, but it usually depends on the form of $g_d(w)$.) [sent-48, score-0.98]

28 The random vector $\tilde g_d(w)$ has the same expectation as the noisy gradient $g_d(w)$ in Eq. [sent-49, score-1.383]

29 1, and thus can be used to replace $g_d(w)$ in the SG update in Eq. [sent-50, score-0.538]

30 That is, $A^* = \operatorname{argmin}_A \operatorname{Tr}(\operatorname{Var}_d[\tilde g_d(w)]) = (\operatorname{Var}_d[h_d(w)])^{-1}\big(\operatorname{Cov}_d[g_d(w), h_d(w)] + \operatorname{Cov}_d[h_d(w), g_d(w)]\big)/2$ (5). [sent-53, score-0.965]

31 Now we show that $\tilde g_d(w)$ is a better "stochastic gradient" under the $\ell_2$-norm. [sent-56, score-0.538]

32 In the first-order stochastic oracle model, we normally assume that there exists a constant $\sigma$ such that for any estimate $w$ in its domain [6, 7]: $\mathbb{E}_d\,\|g_d(w) - \mathbb{E}_d[g_d(w)]\|_2^2 = \operatorname{Tr}(\operatorname{Var}_d[g_d(w)]) \le \sigma^2$. [sent-57, score-0.635]

33 Now suppose that we can find a random vector $h_d(w)$ and compute $A^*$ according to Eq. [sent-60, score-0.427]

34 5, $\mathbb{E}_d\,\|\tilde g_d(w) - \mathbb{E}_d[\tilde g_d(w)]\|_2^2 = \operatorname{Tr}(\operatorname{Var}_d[\tilde g_d(w)])$, where $\operatorname{Var}_d[\tilde g_d(w)] = \operatorname{Var}_d[g_d(w)] - \operatorname{Cov}_d[g_d(w), h_d(w)](\operatorname{Var}_d[h_d(w)])^{-1}\operatorname{Cov}_d[h_d(w), g_d(w)]$. [sent-63, score-1.503]

35 For any estimate $w$, $\operatorname{Cov}_d(g_d, h_d)\,(\operatorname{Cov}_d(h_d, h_d))^{-1}\,\operatorname{Cov}_d(h_d, g_d)$ is a positive semi-definite matrix. [sent-64, score-1.392]

36 Therefore, its trace, which equals the sum of the eigenvalues, is positive (or zero when $h_d$ and $g_d$ are uncorrelated), and hence $\mathbb{E}_d\,\|\tilde g_d(w) - \mathbb{E}_d[\tilde g_d(w)]\|_2^2 \le \mathbb{E}_d\,\|g_d(w) - \mathbb{E}_d[g_d(w)]\|_2^2$. [sent-65, score-2.041]
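A quick numerical sanity check of this inequality on synthetic Gaussian vectors (not the paper's experiments; numpy only): estimate $A^*$ from samples and compare the trace variances.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 50000
g = rng.normal(size=(n, p))                 # samples playing the role of g_d(w)
h = g + 0.5 * rng.normal(size=(n, p))       # correlated control variate h_d(w)

cov_gh = np.cov(g.T, h.T)[:p, p:]           # Cov_d[g, h]
var_h = np.cov(h.T)                         # Var_d[h]
A = np.linalg.solve(var_h, (cov_gh + cov_gh.T) / 2)   # optimal A*, Eq. 5

g_tilde = g - (h - h.mean(axis=0)) @ A      # row-vector form of Eq. 4
print("Tr Var[g]       :", np.trace(np.cov(g.T)))
print("Tr Var[g_tilde] :", np.trace(np.cov(g_tilde.T)))   # noticeably smaller
```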

37 In other words, it is possible to find a constant $\tau \le \sigma$ such that $\mathbb{E}_d\,\|\tilde g_d(w) - \mathbb{E}_d[\tilde g_d(w)]\|_2^2 \le \tau^2$ for all $w$. [sent-66, score-0.538]

38 Therefore, when applying stochastic gradient methods, we could improve the optimal convergence rate from $O(\sigma/\sqrt{t})$ to $O(\tau/\sqrt{t})$ for convex problems; and from $O(\sigma^2/(\mu t))$ to $O(\tau^2/(\mu t))$ for strongly convex problems. [sent-67, score-0.326]

39 $a^*_{ii} = [\operatorname{Cov}_d(g_d(w), h_d(w))]_{ii} / [\operatorname{Var}_d(h_d(w))]_{ii}$ (7). This formulation avoids the computation of the matrix inverse, and leads to a significant reduction of computational cost since only the diagonal elements of $\operatorname{Cov}_d(g_d(w), h_d(w))$ and $\operatorname{Var}_d(h_d(w))$, instead of the full matrices, need to be evaluated. [sent-77, score-0.501]

40 5, the optimal $a^*$ is simply: $a^* = \operatorname{Tr}(\operatorname{Cov}_d(g_d(w), h_d(w))) / \operatorname{Tr}(\operatorname{Var}_d(h_d(w)))$ (9). [sent-83, score-0.427]

41 To estimate the optimal $A^*$ or its surrogates, we need to evaluate $\operatorname{Cov}_d(g_d(w), h_d(w))$ and $\operatorname{Var}_d(h_d(w))$ (or their diagonal elements), which can be approximated by the sample covariance and variance from mini-batch samples while running the stochastic gradient algorithm. [sent-84, score-0.792]
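A minimal sketch of this mini-batch estimate for the scalar surrogate of Eq. 9 (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def scalar_control_coefficient(G, H):
    """Estimate a* = Tr(Cov_d[g, h]) / Tr(Var_d[h]) from a mini-batch.

    G and H are (batch_size, p) arrays holding g_d(w) and h_d(w) for the
    sampled data points; only per-dimension statistics are needed.
    """
    Gc = G - G.mean(axis=0)
    Hc = H - H.mean(axis=0)
    tr_cov_gh = np.sum(Gc * Hc) / (G.shape[0] - 1)
    tr_var_h = np.sum(Hc * Hc) / (H.shape[0] - 1)
    return tr_cov_gh / tr_var_h

def corrected_minibatch_gradient(G, H, h_bar):
    """Average of the corrected gradients g~_d over the mini-batch."""
    a = scalar_control_coefficient(G, H)
    return (G - a * (H - h_bar)).mean(axis=0)
```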

42 8, we observe that when the Pearson's correlation coefficient between $g_d(w)$ and $h_d(w)$ is higher, the control variate $h_d(w)$ will lead to a more significant level of variance reduction and hence faster convergence. [sent-87, score-1.884]

43 In the maximal correlation case, one could set $h_d(w) = g_d(w)$ to obtain zero variance. [sent-88, score-0.999]

44 In practice, one should construct $h_d(w)$ such that it is highly correlated with $g_d(w)$. [sent-90, score-1.01]

45 In the next section, we will show how to construct control variates for both convex and non-convex problems. [sent-91, score-0.314]

46 The solid (red) line is the final gradient direction the algorithm will follow. [sent-94, score-0.205]

47 (a) The exact gradient direction computed using the entire dataset. [sent-95, score-0.225]

48 (b) The noisy gradient direction computed from the sampled subset, which can have high variance. [sent-96, score-0.33]

49 (c) The improved noisy gradient direction with data statistics, such as low-order moments of the entire data. [sent-97, score-0.379]

50 These low-order moments roughly characterize the data distribution, and are used to form the control variate to aid the noisy gradient. [sent-98, score-0.427]

51 As we discussed in §2, the higher the correlation between $g_d(w)$ and $h_d(w)$, the lower the variance is. [sent-100, score-1.092]

52 Therefore, to apply the variance reduction technique in practice, the key is to construct a random vector $h_d(w)$ such that it has high correlations with $g_d(w)$, but its expectation $h(w) = \mathbb{E}_d[h_d(w)]$ is inexpensive to compute. [sent-101, score-1.218]

53 Thus they can be pre-computed when processing the data or estimated online while running the stochastic gradient algorithm. [sent-104, score-0.272]

54 We will use this principle throughout the paper to construct control variates for variance reduction under different scenarios. [sent-106, score-0.454]

55 3.1 SG with variance reduction for logistic regression. Logistic regression is widely used for classification [15]. [sent-108, score-0.275]

56 Given data $(x_d, y_d)$, $d = 1, \dots, D$, where $y_d = 1$ or $y_d = -1$ indicates class labels, the probability of $y_d$ is $p(y_d \mid x_d, w) = \sigma(y_d\, w^\top x_d)$, where $\sigma(z) = 1/(1 + \exp(-z))$ is the logistic function. [sent-112, score-0.895]

57 The averaged log likelihood of the training data is $\ell(w) = \frac{1}{D}\sum_{d=1}^{D}\big[ y_d\, w^\top x_d - \log\big(1 + \exp(y_d\, w^\top x_d)\big) \big]$ (10). [sent-113, score-0.552]

58 An SG algorithm employs the following noisy gradient: $g_d(w) = y_d\, x_d\, \sigma(-y_d\, w^\top x_d)$ (11). [sent-114, score-1.154]
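A hedged sketch of Eqs. 10-11 (labels assumed to be in {+1, -1}; not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_likelihood(w, X, y):
    """Eq. 10: (1/D) sum_d [ y_d w^T x_d - log(1 + exp(y_d w^T x_d)) ]."""
    z = y * (X @ w)
    return np.mean(z - np.logaddexp(0.0, z))

def noisy_gradient(w, x_d, y_d):
    """Eq. 11: g_d(w) = y_d * x_d * sigmoid(-y_d w^T x_d)."""
    return y_d * x_d * sigmoid(-y_d * np.dot(w, x_d))
```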

59 Now we show how to construct our control variate for logistic regression. [sent-115, score-0.355]

60 Let $z = -w^\top x_d$, and we define our control variate $h^{(1)}_d(w)$ for $y_d = 1$ as $h^{(1)}_d(w) \triangleq x_d\, \sigma(\hat z)\, \big(1 + \sigma(-\hat z)(z - \hat z)\big) = x_d\, \sigma(\hat z)\, \big(1 + \sigma(-\hat z)(-w^\top x_d - \hat z)\big)$. [sent-121, score-1.99]

61 Its expectation given $y_d = 1$ can be computed in closed form as $\mathbb{E}_d[h^{(1)}_d(w) \mid y_d = 1] = \sigma(\hat z)\big[\bar x^{(1)}(1 - \sigma(-\hat z)\hat z) - \sigma(-\hat z)\big(\operatorname{Var}^{(1)}[x_d] + \bar x^{(1)}(\bar x^{(1)})^\top\big)\, w\big]$. Taylor expansion is not the only way to obtain control variates. [sent-122, score-0.441]

62 We can similarly derive the control variate $h^{(-1)}_d(w)$ for negative examples and we omit the details. [sent-127, score-0.7]
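A sketch of this construction for positive examples, assuming the reconstructed closed-form expectation above; the expansion point z_hat and the class-conditional moments x_bar_pos and var_pos are assumed to be pre-computed inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def control_variate_pos(w, x_d, z_hat):
    """h_d^{(1)}(w) = x_d * sigmoid(z_hat) * (1 + sigmoid(-z_hat) * (-w^T x_d - z_hat))."""
    z = -np.dot(w, x_d)
    return x_d * sigmoid(z_hat) * (1.0 + sigmoid(-z_hat) * (z - z_hat))

def expected_control_variate_pos(w, z_hat, x_bar_pos, var_pos):
    """Closed-form E[h_d^{(1)}(w) | y_d = 1], using only the positive-class
    mean and covariance of x_d (low-order moments of the data)."""
    second_moment = var_pos + np.outer(x_bar_pos, x_bar_pos)    # E[x x^T]
    return sigmoid(z_hat) * (x_bar_pos * (1.0 - sigmoid(-z_hat) * z_hat)
                             - sigmoid(-z_hat) * (second_moment @ w))
```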

63 With the Taylor approximation, we expect our control variate to be highly correlated with the noisy gradient. [sent-129, score-0.395]

64 3.2 SVI with variance reduction for latent Dirichlet allocation. The stochastic variational inference (SVI) algorithm used for latent Dirichlet allocation (LDA) [22] is also a form of stochastic gradient optimization; therefore it can also benefit from variance reduction. [sent-132, score-0.953]

65 The basic idea is to stochastically optimize the variational objective for LDA, using stochastic mean field updates augmented by control variates derived from low-order moments on the data. [sent-133, score-0.611]

66 Given the observed words $w \triangleq w_{1:D}$, we want to estimate the posterior distribution of the latent variables, including topics $\beta \triangleq \beta_{1:K}$, topic proportions $\theta \triangleq \theta_{1:D}$ and topic assignments $z \triangleq z_{1:D}$: $p(\beta, \theta, z \mid w) \propto \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}})$. [sent-155, score-0.201]

67 Mean-field variational inference is a popular approach for the approximation [19]. [sent-158, score-0.214]

68 Mean-field variational inference posits a family of distributions (called variational distributions) indexed by free variational parameters and then optimizes these parameters to minimize the KL divergence between the variational distribution and the true posterior. [sent-160, score-0.721]

69 For LDA, the variational distribution is $q(\beta, \theta, z) = \prod_{k=1}^{K} q(\beta_k \mid \lambda_k) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$ (13), where the variational parameters are $\lambda_k$ (Dirichlet), $\gamma_d$ (Dirichlet), and $\phi_{dn}$ (multinomial). [sent-161, score-0.338]

70 This is equivalent to maximizing the lower bound of the log marginal likelihood of the data, $\log p(w) \ge \mathbb{E}_q[\log p(\beta, \theta, z, w)] - \mathbb{E}_q[\log q(\beta, \theta, z)] \triangleq \mathcal{L}(q)$ (14), where $\mathbb{E}_q[\cdot]$ denotes the expectation with respect to the variational distribution $q(\beta, \theta, z)$. [sent-165, score-0.215]

71 Setting the gradient of the lower bound L(q) with respect to the variational parameters to zero gives the following coordinate ascent algorithm [17]. [sent-166, score-0.383]

72 For each document $d \in \{1, \dots, D\}$, we run local variational inference using the following updates until convergence: $\phi^k_{dv} \propto \exp\big\{\Psi(\gamma_{dk}) + \Psi(\lambda_{k,v}) - \Psi\big(\sum_{v} \lambda_{kv}\big)\big\}$ (15) and $\gamma_d = \alpha + \sum_{v=1}^{V} n_{dv}\, \phi_{dv}$ (16). [sent-170, score-0.374]
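A hedged sketch of these local updates for one document (illustrative code, assuming scipy's digamma for Ψ; the initialization is a common choice, not necessarily the authors'):

```python
import numpy as np
from scipy.special import digamma

def local_variational_inference(n_d, lam, alpha, n_iters=50):
    """Iterate Eqs. 15-16 for one document.

    n_d  : length-V word-count vector for document d
    lam  : K x V matrix of topic Dirichlet parameters lambda_{kv}
    Returns phi (K x V responsibilities) and gamma (length-K Dirichlet parameter).
    """
    K, V = lam.shape
    gamma = np.full(K, alpha) + n_d.sum() / K            # simple initialization
    E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    for _ in range(n_iters):
        log_phi = digamma(gamma)[:, None] + E_log_beta   # Eq. 15, unnormalized, in log space
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                           # normalize over topics k
        gamma = alpha + (n_d * phi).sum(axis=1)          # Eq. 16
    return phi, gamma
```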

73 After finding the variational parameters for each document, we update the variational Dirichlet for each topic, $\lambda_{kv} = \eta + \sum_{d=1}^{D} n_{dv}\, \phi^k_{dv}$ (17). [sent-177, score-0.405]

74 The whole coordinate ascent variational algorithm iterates over Eq. [sent-178, score-0.301]

75 Stochastic variational inference solves this problem using stochastic optimization. [sent-183, score-0.311]

76 Instead of using the coordinate ascent algorithm, SVI optimizes the variational lower bound L(q) using stochastic optimization [22]. [sent-185, score-0.305]

77 It draws random samples from the corpus and uses these samples to form the noisy estimate of the natural gradient [20]. [sent-186, score-0.322]

78 Then the algorithm follows that noisy natural gradient with a decreasing step size until convergence. [sent-187, score-0.278]

79 The noisy gradient only depends on the sampled data and it is inexpensive to compute. [sent-188, score-0.346]

80 This leads to a much faster algorithm than the traditional coordinate ascent variational inference algorithm. [sent-189, score-0.271]

81 According to [22], for LDA the noisy natural gradient with respect to the topic variational parameters is $g_d(\lambda_{kv}) \triangleq -\lambda_{kv} + \eta + D\, n_{dv}\, \phi^k_{dv}$ (18), where the $\phi^k_{dv}$ are obtained from the local variational inference by iterating over Eq. [sent-196, score-1.451]

82 With a step size $\rho_t$, SVI uses the following update: $\lambda_{kv} \leftarrow \lambda_{kv} + \rho_t\, g_d(\lambda_{kv})$. [sent-198, score-0.538]
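A minimal sketch of Eq. 18 and this update, assuming phi for the sampled document has already been computed by the local step sketched earlier:

```python
import numpy as np

def svi_step(lam, n_d, phi, eta, D, rho_t):
    """One SVI update of the topic parameters lambda (K x V).

    Eq. 18: g_d(lambda_kv) = -lambda_kv + eta + D * n_dv * phi_dv^k,
    followed by lambda <- lambda + rho_t * g_d(lambda).
    """
    natural_grad = -lam + eta + D * (n_d[None, :] * phi)   # K x V
    return lam + rho_t * natural_grad
```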

83 However, the sampled natural gradient $g_d(\lambda_{kv})$ in Eq. [sent-199, score-0.735]

84 Now we show how to construct control variates for the noisy gradient to reduce its variance. [sent-203, score-0.58]

85 18, the noisy gradient $g_d(\lambda_{kv})$ is a function of the topic assignment parameters $\phi_{dv}$, which in turn depend on $w_d$, the words in document $d$, through the iterative updates in Eq. [sent-205, score-0.97]

86 In logistic regression, the gradient is an analytical function of the training data (Eq. [sent-209, score-0.253]

87 11), while in LDA, the natural gradient directly depends on the optimal local variational parameters (Eq. [sent-210, score-0.359]

88 18), which then depends on the training data through the local variational inference (Eq. [sent-211, score-0.251]

89 15 and 16 leads to the update $\phi^{k(1)}_{dv} = \dfrac{\big(\sum_{u=1}^{V} f_{du}\, \phi^{k(0)}_u\big)\, \phi^{k(0)}_v}{\sum_{k=1}^{K} \sum_{u=1}^{V} f_{du}\, \phi^{k(0)}_u\, \phi^{k(0)}_v} \approx \dfrac{\big(\sum_{u=1}^{V} f_{du}\, \phi^{k(0)}_u\big)\, \phi^{k(0)}_v}{\sum_{k=1}^{K} \sum_{u=1}^{V} \bar f_u\, \phi^{k(0)}_u\, \phi^{k(0)}_v}$ (19), where $f_{dv} = n_{dv}/n_d$ is the empirical frequency of term $v$ in document $d$. [sent-222, score-0.419]

90 We define our control variate as $h_d(\lambda_{kv}) \triangleq D\, n_{dv}\, \phi^{k(1)}_{dv}$, whose expectation is $\mathbb{E}_d[h_d(\lambda_{kv})] = \frac{D}{m_v} \sum_{u=1}^{V} \overline{n_v f_u}\, \phi^{k(0)}_u\, \phi^{k(0)}_v$, where $\overline{n_v f_u} \triangleq (1/D) \sum_d n_{du}\, f_{dv} = (1/D) \sum_d n_{du}\, n_{dv} / n_d$. [sent-226, score-1.021]
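A hedged sketch of pre-computing the corpus statistic $(1/D)\sum_d n_{du}\, n_{dv}/n_d$ that this expectation relies on (sparse-matrix code is illustrative; variable names are not from the paper, and every document is assumed non-empty):

```python
import numpy as np
import scipy.sparse as sp

def low_order_moment(counts):
    """Compute M[v, u] = (1/D) * sum_d n_{dv} * n_{du} / n_d.

    counts is a D x V sparse matrix of word counts n_{dv}.  The result is a
    V x V matrix of corpus statistics that can be computed once, before
    running stochastic variational inference.
    """
    counts = sp.csr_matrix(counts, dtype=float)
    D = counts.shape[0]
    doc_len = np.asarray(counts.sum(axis=1)).ravel()     # n_d
    freqs = counts.multiply(1.0 / doc_len[:, None])      # f_{dv} = n_{dv} / n_d
    return (counts.T @ freqs).toarray() / D              # dense V x V; fine for modest V
```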

91 Similar ideas can be used to derive control variates for the hierarchical Dirichlet process [12, 13] and nonnegative matrix factorization [14]. [sent-230, score-0.261]

92 It is usually high, indicating the control variate is highly correlated with the noisy gradient, leading to a large variance reduction. [sent-276, score-0.488]

93 4 Experiments. In this section, we conducted experiments on the MAP estimation for logistic regression and stochastic variational inference for LDA. [sent-278, score-0.393]

94 4.1 Logistic regression. We evaluate our algorithm on stochastic gradient (SG) for logistic regression. [sent-282, score-0.354]

95 Our method consistently performs better than the standard stochastic variational inference. [sent-330, score-0.266]

96 Figure 3 shows the mean of Pearson's correlation coefficient between the control variate and noisy gradient, which is quite high—the control variate is highly correlated with the noisy gradient, leading to a large variance reduction. [sent-334, score-0.898]

97 4.2 Stochastic variational inference for LDA. We evaluate our algorithm on stochastic variational inference for LDA. [sent-336, score-0.525]

98 5 Discussions and future work. In this paper, we show that variance reduction with control variates can be used to improve stochastic gradient optimization. [sent-358, score-0.7]

99 Since the control variate and noisy gradient are vectors, we use the mean of the Pearson's correlation coefficients computed for each dimension between these two vectors. [sent-367, score-0.551]
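A small sketch of this diagnostic (assumed inputs: G and H hold per-sample noisy gradients and control variates as rows):

```python
import numpy as np

def mean_pearson_correlation(G, H, eps=1e-12):
    """Mean over dimensions of the Pearson correlation between the columns of
    G (noisy gradients) and H (control variates), both of shape (n, p)."""
    Gc = G - G.mean(axis=0)
    Hc = H - H.mean(axis=0)
    corr = (Gc * Hc).sum(axis=0) / (
        np.sqrt((Gc ** 2).sum(axis=0)) * np.sqrt((Hc ** 2).sum(axis=0)) + eps)
    return corr.mean()
```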

100 A variational approach to Bayesian logistic regression models and their extensions. [sent-458, score-0.251]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gd', 0.538), ('hd', 0.427), ('vard', 0.305), ('covd', 0.229), ('variate', 0.187), ('xd', 0.175), ('variates', 0.175), ('gradient', 0.175), ('variational', 0.169), ('yd', 0.163), ('kv', 0.144), ('noisy', 0.103), ('sg', 0.102), ('stochastic', 0.097), ('variance', 0.093), ('dv', 0.093), ('control', 0.086), ('svi', 0.081), ('fdu', 0.076), ('reduction', 0.074), ('lda', 0.074), ('ndv', 0.067), ('topic', 0.066), ('dirichlet', 0.064), ('zdn', 0.062), ('dk', 0.056), ('logistic', 0.056), ('pearson', 0.054), ('moments', 0.051), ('document', 0.047), ('fu', 0.046), ('dtest', 0.046), ('inference', 0.045), ('corpus', 0.044), ('blei', 0.038), ('tr', 0.037), ('wdn', 0.035), ('decayed', 0.035), ('correlation', 0.034), ('inexpensive', 0.031), ('bouncing', 0.031), ('dndv', 0.031), ('fdv', 0.031), ('ndu', 0.031), ('stripe', 0.031), ('direction', 0.03), ('documents', 0.03), ('expectation', 0.029), ('allocation', 0.028), ('dn', 0.028), ('unreachable', 0.027), ('dtrain', 0.027), ('vocabulary', 0.027), ('latent', 0.027), ('convex', 0.027), ('construct', 0.026), ('regression', 0.026), ('wd', 0.026), ('minus', 0.026), ('taylor', 0.024), ('topics', 0.024), ('ascent', 0.023), ('predictive', 0.023), ('training', 0.022), ('sampled', 0.022), ('ii', 0.021), ('mult', 0.021), ('convergence', 0.021), ('eq', 0.021), ('ld', 0.02), ('chong', 0.02), ('paisley', 0.02), ('wang', 0.02), ('entire', 0.02), ('var', 0.02), ('cing', 0.02), ('nv', 0.02), ('legend', 0.019), ('sacri', 0.019), ('correlated', 0.019), ('faster', 0.018), ('coef', 0.018), ('lan', 0.018), ('proportions', 0.018), ('spend', 0.017), ('wikipedia', 0.017), ('augmented', 0.017), ('draw', 0.017), ('likelihood', 0.017), ('explore', 0.017), ('rates', 0.016), ('corpora', 0.016), ('objective', 0.016), ('coordinate', 0.016), ('eric', 0.016), ('jaakkola', 0.016), ('reduce', 0.015), ('depends', 0.015), ('plugging', 0.015), ('hoffman', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing

Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1

2 0.17347072 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

3 0.12455993 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

Author: Le Song, Bo Dai

Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1

4 0.12429318 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is $O(1/\sqrt{T})$, which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of $O(1/T^2)$. In this work, we consider a new setup for optimizing smooth functions, termed as Mixed Optimization, which allows to access both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with an $O(\ln T)$ calls to the full gradient oracle and an $O(T)$ calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of $O(1/T)$. 1

5 0.10522963 174 nips-2013-Lexical and Hierarchical Topic Regression

Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik

Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ‫ݐ‬ௗ௦ ‫ݖ‬ௗ௦௡ ‫ݓ‬ௗ௦௡ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ‫ݕ‬ௗ ‫ܦ‬ ߱ ߟ௞ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. 
For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶௞ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. 
Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ ‫1=ݐ‬ ‫2=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫3=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫ݐ‬ௗ௦ ‫2=ݏ 1=ݏ‬ ‫ܵ = ݏ‬ଵ ‫3=ݏ 2=ݏ 1=ݏ‬ ݀=1 ݇ௗ௦௡ ‫ܵ = ݏ‬ଶ ‫ܵ = ݏ‬஽ ݀=2 ߶ଵ ߟଵ ݀=‫ܦ‬ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 ‫1=ݏ‬ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. 
Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) =  1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x }  Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). 
The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝       L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1  γl∗    −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. 
After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 
5 Floor Debates House-Senate Senate-House PCC ↑ MSE ↓ PCC ↑ MSE ↓ Amazon Reviews PCC ↑ MSE ↓ Movie Reviews PCC ↑ MSE ↓ SVM - LDA 10 SVM - LDA 30 SVM - LDA 50 SVM - VOC SVM - LDA - VOC 0.173 0.172 0.169 0.336 0.256 0.861 0.840 0.832 1.549 0.784 0.08 0.155 0.215 0.131 0.246 1.247 1.183 1.135 1.467 1.101 0.157 0.277 0.245 0.373 0.371 1.241 1.091 1.130 0.972 0.965 0.327 0.365 0.395 0.584 0.585 0.970 0.938 0.906 0.681 0.678 MLR - LDA 10 MLR - LDA 30 MLR - LDA 50 MLR - VOC MLR - LDA - VOC 0.163 0.160 0.150 0.322 0.319 0.735 0.737 0.741 0.889 0.873 0.068 0.162 0.248 0.191 0.194 1.151 1.125 1.081 1.124 1.120 0.143 0.258 0.234 0.408 0.410 1.034 1.065 1.114 0.869 0.860 0.328 0.367 0.389 0.568 0.581 0.957 0.936 0.914 0.721 0.702 SLDA 10 SLDA 30 SLDA 50 0.154 0.174 0.254 0.729 0.793 0.897 0.090 0.128 0.245 1.145 1.188 1.184 0.270 0.357 0.241 1.113 1.146 1.939 0.383 0.433 0.503 0.953 0.852 0.772 S H L DA 0.356 0.753 0.303 1.076 0.413 0.891 0.597 0.673 Models Table 2: Regression results for Pearson’s correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Subscripts denote the number of topics for parametric models. For SVM - LDA - VOC and MLR - LDA - VOC, only best results across K ∈ {10, 30, 50} are reported. Best results are in bold. the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA , the variance of the regression is set to 0.5. For S H L DA , we use trees with maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson’s correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate and vice versa. Results and analysis Table 2 shows the performance of all models on our three datasets. Methods that only use topic-based features such as SVM - LDA and MLR - LDA do poorly. Methods only based on lexical features like SVM - VOC and MLR - VOC outperform methods that are based only on topic features significantly for the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in S H L DA. In all cases, S H L DA achieves strong performance results. For the two cases where S H L DA was second best in MSE score (Amazon reviews and House-Senate), it outperforms other methods in PCC. Doing well in PCC for these two datasets is important since achieving low MSE is relatively easier due to the response variables’ bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews. 
For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House. 6 Qualitative Analysis: Agendas and Framing/Perspective Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and B illustrate agendas—issues introduced into political discourse—associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, and text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral “tax” topic, reveal how parties frame taxes as gains in terms of new social services (Democrats) and losses for job creators (Republicans). 6 E flag constitution freedom supreme_court elections rights continuity american_flag constitutional_amendm ent gses credit_rating fannie_mae regulator freddie_mac market financial_services agencies competition investors fannie bill speaker time amendment chairman people gentleman legislation congress support R:1.1 R:0 A minimum_wage commission independent_commissio n investigate hurricane_katrina increase investigation R:1.0 B percent tax economy estate_tax capital_gains money taxes businesses families tax_cuts pay tax_relief social_security affordable_housing housing manager fund activities funds organizations voter_registration faithbased nonprofits R:0.4 D:1.7 C death_tax jobs businesses business family_businesses equipment productivity repeal_permanency employees capital farms D REPUBLICAN billion budget children cuts debt tax_cuts child_support deficit education students health_care republicans national_debt R:4.3 D:2.2 DEMOCRAT D:4.5 Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the “tax” topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes (“children”, “education”, “health care”), while its Republican-leaning child (C) focuses on business implications (“death tax”, “jobs”, “businesses”). The number below each topic denotes the magnitude of the learned regression parameter associated with that topic. Colors and the numbers beneath each topic show the regression parameter η associated with the topic. Figure 4 shows the topic structure discovered by S H L DA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspective. For example, Node A concerns washbasins for infants, and has two polarized children nodes: reviewers take a positive perspective when their children enjoy the product (Node B: “loves”, “splash”, “play”) but have negative reactions when it leaks (Node C: “leak(s/ed/ing)”). 
transmitter ipod car frequency iriver product transmitters live station presets itrip iriver_aft charges international_mode driving P:6.6 tried waste batteries tunecast rabbit_ears weak terrible antenna hear returned refund returning item junk return A D router setup network expander set signal wireless connect linksys connection house wireless_router laptop computer wre54g N:2.2 N:1.0 tivo adapter series adapters phone_line tivo_wireless transfer plugged wireless_adapter tivos plug dvr tivo_series tivo_box tivo_unit P:5.1 tub baby water bath sling son daughter sit bathtub sink newborn months bath_tub bathe bottom N:8.0 months loves hammock splash love baby drain eurobath hot fits wash play infant secure slip P:7.5 NEGATIVE N:0 N:2.7 B POSITIVE time bought product easy buy love using price lot able set found purchased money months transmitter car static ipod radio mp3_player signal station sound music sound_quality volume stations frequency frequencies C leaks leaked leak leaking hard waste snap suction_cups lock tabs difficult bottom tub_leaks properly ring N:8.9 monitor radio weather_radio night baby range alerts sound sony house interference channels receiver static alarm N:1.7 hear feature static monitors set live warning volume counties noise outside alert breathing rechargeable_battery alerts P:6.2 version hours phone F firmware told spent linksys tech_support technical_supportcusto mer_service range_expander support return N:10.6 E router firmware ddwrt wrt54gl version wrt54g tomato linksys linux routers flash versions browser dlink stable P:4.8 z22 palm pda palm_z22 calendar software screen contacts computer device sync information outlook data programs N:1.9 headphones sound pair bass headset sound_quality ear ears cord earbuds comfortable hear head earphones fit N:1.3 appointments organized phone lists handheld organizer photos etc pictures memos track bells books purse whistles P:5.8 noise_canceling noise sony exposed noise_cancellation stopped wires warranty noise_cancelling bud pay white_noise disappointed N:7.6 bottles bottle baby leak nipples nipple avent avent_bottles leaking son daughter formula leaks gas milk comfortable sound phones sennheiser bass px100 px100s phone headset highs portapros portapro price wear koss N:2.0 leak formula bottles_leak feeding leaked brown frustrating started clothes waste newborn playtex_ventaire soaked matter N:7.9 P:5.7 nipple breast nipples dishwasher ring sippy_cups tried breastfeed screwed breastfeeding nipple_confusion avent_system bottle P:6.4 Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the firstlevel topics have no specific polarity and are associated with a broad class of products such as “routers” (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of “router” focuses on upgradable firmware such as “tomato” and “ddwrt” (Node E, positive) while another focuses on poor “tech support” and “customer service” (Node F, negative). The number below each topic is the regression parameter learned with that topic. In addition to the per-topic regression parameters, S H L DA also associates each word with a lexical regression parameter τ . Table 3 shows the top ten words with highest and lowest τ . 
The results are unsuprising, although the lexical regression for the Congressional debates is less clear-cut than other 10 All of the nodes at the second level have slightly negative values for the regression parameters mainly due to the very skewed distribution of the review ratings in Amazon. 7 datasets. As we saw in Section 5, for similar datasets, S H L DA’s context-specific regression is more useful when global lexical weights do not readily differentiate documents. Dataset Floor Debates Amazon Reviews Movie Reviews Top 10 words with positive weights bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability Top 10 words with negative weights bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring Table 3: Top words based on the global lexical regression coefficient, τ . For the floor debates, positive τ ’s are Republican-leaning while negative τ ’s are Democrat-leaning. 7 Related Work S H L DA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these “upstream” models [25], Perotte et al. [26] propose a “downstream” model called HSLDA , which jointly models documents’ hierarchy of labels and topics. HSLDA ’s topic structure is flat, however, and the response variable is a hierarchy of labels associated with each document, unlike S H L DA’s continuous response variable. Finally, another body related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words. 8 Conclusion We have introduced S H L DA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. 
Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors' perspectives on those issues. We plan to analyze the topic structure produced by SHLDA with political science collaborators and, more generally, to study how SHLDA and related models can help analyze and discover useful insights from political discourse.
Acknowledgments
This research was supported in part by NSF under grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.
References
[1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002.
[2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001.
[3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008.
[4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012.
[7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012.
[10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013.
[11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012.
[12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007.
[13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989.
[14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006.
[15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004.
[16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008.
[17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005.
[18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999.
[19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008.
[21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009.
[22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011.
[23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012.
[24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008.
[26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011.
[27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010.
[28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011.
[29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011.
[30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008.
[31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.
[32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009.
[33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. 2010.
[34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013.

6 0.10505725 301 nips-2013-Sparse Additive Text Models with Low Rank Background

7 0.095763423 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

8 0.09175837 88 nips-2013-Designed Measurements for Vector Count Data

9 0.085298084 89 nips-2013-Dimension-Free Exponentiated Gradient

10 0.085216254 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

11 0.077991292 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

12 0.069156706 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

13 0.066700481 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

14 0.064431094 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

15 0.064359523 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

16 0.063651122 75 nips-2013-Convex Two-Layer Modeling

17 0.062981255 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

18 0.062913351 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

19 0.059687573 143 nips-2013-Integrated Non-Factorized Variational Inference

20 0.05846348 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, 0.038), (2, 0.002), (3, 0.011), (4, 0.024), (5, 0.056), (6, 0.024), (7, 0.106), (8, 0.172), (9, 0.004), (10, -0.036), (11, 0.156), (12, -0.028), (13, -0.083), (14, 0.021), (15, -0.081), (16, 0.099), (17, 0.047), (18, -0.051), (19, -0.101), (20, 0.02), (21, -0.073), (22, -0.021), (23, 0.026), (24, -0.062), (25, -0.102), (26, 0.098), (27, 0.008), (28, 0.04), (29, -0.001), (30, 0.044), (31, 0.027), (32, -0.064), (33, 0.056), (34, 0.051), (35, 0.008), (36, 0.048), (37, 0.095), (38, -0.01), (39, -0.048), (40, -0.013), (41, -0.034), (42, -0.076), (43, 0.043), (44, 0.036), (45, -0.022), (46, -0.042), (47, 0.068), (48, 0.107), (49, 0.033)]
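The simValue scores in the list that follows are presumably derived by comparing topic-weight vectors like the one above, though this page does not state the exact metric. A cosine similarity over sparse (topicId, topicWeight) vectors is one common choice; the sketch below works under that assumption and uses truncated, made-up weight vectors purely for illustration.

    import math

    # Hedged sketch: rank candidate papers by cosine similarity of their
    # (topicId -> topicWeight) dictionaries against this paper's topic weights.
    def cosine_similarity(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    this_paper = {0: 0.144, 1: 0.038, 7: 0.106, 8: 0.172, 11: 0.156}   # truncated example
    candidates = {"317": {7: 0.09, 8: 0.15, 11: 0.12}, "312": {8: 0.11, 16: 0.10}}
    ranked = sorted(candidates, key=lambda pid: cosine_similarity(this_paper, candidates[pid]), reverse=True)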

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93566966 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing

Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1
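Since this is the paper the page indexes, a minimal sketch of the control-variate idea summarized in the abstract above may be useful. It illustrates only the general principle on SGD for logistic regression: the control variate used here, h(x) = x with its expectation given by a precomputed data mean, is chosen purely for simplicity and is not the authors' actual construction; any h whose full-data expectation is cheap to compute behaves the same way, with stronger correlation to the noisy gradient giving larger variance reduction.

    import numpy as np

    def logistic_grad(w, x, y, lam=1e-3):
        # Per-example gradient of the regularized logistic loss (MAP objective).
        p = 1.0 / (1.0 + np.exp(-x @ w))
        return (p - y) * x + lam * w

    def cv_corrected_gradient(w, X_batch, y_batch, x_mean, a=1.0, lam=1e-3):
        # Noisy minibatch gradient.
        g = np.mean([logistic_grad(w, x, y, lam) for x, y in zip(X_batch, y_batch)], axis=0)
        # Control variate h(x) = x; its expectation over the full dataset is the
        # precomputed mean x_mean (a low-order data moment). Subtracting
        # a * (h - E[h]) leaves the estimator unbiased for any scalar a; choosing a
        # according to the covariance between h and the gradient reduces variance.
        h = X_batch.mean(axis=0)
        return g - a * (h - x_mean)

    # Toy usage: one corrected SGD step on random data (illustrative only).
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)
    w, x_mean = np.zeros(5), X.mean(axis=0)
    idx = rng.choice(1000, size=32, replace=False)
    w -= 0.1 * cv_corrected_gradient(w, X[idx], y[idx], x_mean)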

2 0.71875459 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

3 0.69467276 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

Author: Sam Patterson, Yee Whye Teh

Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1

4 0.67266631 301 nips-2013-Sparse Additive Text Models with Low Rank Background

Author: Lei Shi

Abstract: The sparse additive model for text modeling involves sum-of-exp computations, whose cost becomes prohibitive at large scale. Moreover, the assumption of an equal background across all classes/topics may be too strong. This paper proposes the sparse additive model with low-rank background (SAM-LRB) and derives a simple yet efficient estimation procedure. Particularly, employing a double majorization bound, we approximate the log-likelihood by a quadratic lower bound without the log-sum-exp terms. The constraints of low rank and sparsity are then simply embodied by nuclear-norm and ℓ1-norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real datasets demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1

5 0.58548796 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

Author: Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, Bo Zhang

Abstract: Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. 1

6 0.57819355 174 nips-2013-Lexical and Hierarchical Topic Regression

7 0.54118335 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

8 0.5253495 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

9 0.50102824 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.49741802 70 nips-2013-Contrastive Learning Using Spectral Methods

11 0.49527347 98 nips-2013-Documents as multiple overlapping windows into grids of counts

12 0.48177937 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

13 0.47321111 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

14 0.46396679 88 nips-2013-Designed Measurements for Vector Count Data

15 0.46036041 86 nips-2013-Demixing odors - fast inference in olfaction

16 0.45995969 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

17 0.45181173 143 nips-2013-Integrated Non-Factorized Variational Inference

18 0.44719711 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

19 0.43810159 193 nips-2013-Mixed Optimization for Smooth Functions

20 0.4289344 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.037), (33, 0.145), (34, 0.1), (36, 0.03), (41, 0.024), (49, 0.358), (56, 0.094), (60, 0.013), (70, 0.02), (85, 0.016), (89, 0.011), (93, 0.044), (95, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95762962 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models

Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do

Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task; thus, real-world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem, and present experimental results with the Probabilistic-FF planner. 1

2 0.95480603 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data

Author: Jasper Snoek, Richard Zemel, Ryan P. Adams

Abstract: Point processes are popular models of neural spiking behavior as they provide a statistical distribution over temporal sequences of spikes and help to reveal the complexities underlying a series of recorded action potentials. However, the most common neural point process models, the Poisson process and the gamma renewal process, do not capture interactions and correlations that are critical to modeling populations of neurons. We develop a novel model based on a determinantal point process over latent embeddings of neurons that effectively captures and helps visualize complex inhibitory and competitive interaction. We show that this model is a natural extension of the popular generalized linear model to sets of interacting neurons. The model is extended to incorporate gain control or divisive normalization, and the modulation of neural spiking based on periodic phenomena. Applied to neural spike recordings from the rat hippocampus, we see that the model captures inhibitory relationships, a dichotomy of classes of neurons, and a periodic modulation by the theta rhythm known to be present in the data. 1

3 0.93170846 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan

Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1

4 0.91255629 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations

Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani

Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a low-rank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allows both to scale to very high-dimensional neural data. 1

5 0.86950189 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

Author: Suvrit Sra, Reshad Hosseini

Abstract: Hermitian positive definite (hpd) matrices recur throughout machine learning, statistics, and optimisation. This paper develops (conic) geometric optimisation on the cone of hpd matrices, which allows us to globally optimise a large class of nonconvex functions of hpd matrices. Specifically, we first use the Riemannian manifold structure of the hpd cone for studying functions that are nonconvex in the Euclidean sense but are geodesically convex (g-convex), hence globally optimisable. We then go beyond g-convexity, and exploit the conic geometry of hpd matrices to identify another class of functions that remain amenable to global optimisation without requiring g-convexity. We present key results that help recognise g-convexity and also the additional structure alluded to above. We illustrate our ideas by applying them to likelihood maximisation for a broad family of elliptically contoured distributions: for this maximisation, we derive novel, parameter-free fixed-point algorithms. To our knowledge, ours are the most general results on geometric optimisation of hpd matrices known so far. Experiments show the advantages of using our fixed-point algorithms. 1

6 0.8545056 70 nips-2013-Contrastive Learning Using Spectral Methods

7 0.83933514 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

8 0.79550159 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

same-paper 9 0.78489369 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

10 0.75222868 121 nips-2013-Firing rate predictions in optimal balanced networks

11 0.7136656 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

12 0.67138958 64 nips-2013-Compete to Compute

13 0.66870749 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit

14 0.64933938 246 nips-2013-Perfect Associative Learning with Spike-Timing-Dependent Plasticity

15 0.63826686 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

16 0.63790309 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

17 0.63570899 301 nips-2013-Sparse Additive Text Models with Low Rank Background

18 0.63129532 148 nips-2013-Latent Maximum Margin Clustering

19 0.63017935 208 nips-2013-Neural representation of action sequences: how far can a simple snippet-matching model take us?

20 0.62986517 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity