nips nips2013 nips2013-298 knowledge-graph by maker-knowledge-mining

298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models


Source: pdf

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that—particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. [sent-5, score-0.424]

2 We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. [sent-6, score-0.176]

3 This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. [sent-8, score-0.164]

4 We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. [sent-9, score-0.403]

5 A key property of such algorithms is that—particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. [sent-10, score-0.152]

6 A recent thread of research has considered small-variance asymptotics of latent-variable models as a way to capture the benefits of rich graphical models while also providing a framework for designing more scalable combinatorial optimization algorithms. [sent-14, score-0.388]

7 This approach has recently been extended beyond the standard Gaussian mixture in various ways—to Dirichlet process mixtures and hierarchical Dirichlet processes [8], to non-Gaussian observations in the nonparametric setting [7], and to feature learning via the Beta process [5]. [sent-16, score-0.274]

8 In essence, small-variance asymptotics provides a link connecting some probabilistic graphical models with non-probabilistic combinatorial counterparts. [sent-18, score-0.396]

9 Thus far, small-variance asymptotics has been applied only to fairly simple latent-variable models. [sent-19, score-0.328]

10 In particular, to our knowledge there has been no such analysis for sequential data models such as the Hidden Markov Model (HMM) nor its nonparametric counterpart, the infinite-state HMM (iHMM). [sent-20, score-0.154]

11 HMMs are one of the most widely used probabilistic models for discrete sequence data, with diverse applications including DNA sequence analysis, natural language processing and speech recognition [4]. [sent-21, score-0.132]

12 HMMs consist of a discrete hidden state sequence that evolves according to Markov assumptions, along with independent observations at each time step depending on the hidden state. [sent-22, score-0.433]

13 To develop scalable algorithms for sequential data, we begin by applying small-variance analysis to the standard parametric HMM. [sent-24, score-0.215]

14 In the small-variance limit, we obtain a penalized k-means problem where the penalties capture the state transitions. [sent-25, score-0.23]

15 Further, a special case of the resulting model yields segmental k-means [9]. [sent-26, score-0.205]

16 For the nonparametric model we obtain an objective that effectively combines the asymptotics from the parametric HMM with the asymptotics for the Hierarchical Dirichlet Process [8]. [sent-27, score-0.954]

17 We obtain a k-means-like objective with three penalties: one for state transitions, one for the number of reachable states out of each state, and one for the total number of states. [sent-28, score-0.262]

18 The key aspect of our resulting formulation is that, unlike the standard sampler for the infinite-state HMM, dynamic programming can be used. [sent-29, score-0.125]

19 For the finite case (i.e., the standard HMM), several algorithms exist for maximum likelihood (ML) estimation, such as the Baum-Welch algorithm (a special instance of the EM algorithm) and the segmental k-means algorithm [9]. [sent-35, score-0.235]

20 Infinite-state HMMs [3, 12] are nonparametric Bayesian extensions of the finite HMMs where hierarchical Dirichlet process (HDP) priors are used to allow for an unspecified number of states. [sent-36, score-0.173]

21 [6] presents a Beam sampling method which bypasses this obstacle via slice sampling, where only a finite number of hidden states are considered in each iteration. [sent-39, score-0.179]

22 Thus infinite-state HMMs, while desirable from a modeling perspective, have been limited by their inability to scale to large data sets—this is precisely the situation in which small-variance asymptotics has the potential to be beneficial. [sent-41, score-0.328]

23 Asymptotics of the finite-state HMM: We begin, as a warm-up, with the simpler parametric (finite-state) HMM model, and show that small-variance asymptotics on the joint log likelihood yields a penalized k-means objective, and on the EM algorithm yields a generalized segmental k-means algorithm. [sent-44, score-0.802]

24 The tools developed in this section will then be used for the more involved nonparametric model. [sent-45, score-0.116]

25 The Model: The Hidden Markov Model assumes a hidden state sequence Z = {z1, . . . , zN} [sent-47, score-0.267]

26 drawn from a finite discrete state space {1, . . . , K}. [sent-50, score-0.12]

27 The resulting generative model defines a probability distribution over the hidden state sequence Z and the observation sequence X . [sent-57, score-0.366]

28 Let T ∈ RK×K be the stationary transition probability matrix of the hidden state sequence with Ti. [sent-58, score-0.425]

29 For clarity of presentation, we will use a binary 1-of-K representation for the latent state assignments. [sent-60, score-0.163]

30 That is, we will write the event of the latent state at time step t being k as ztk = 1 and ztl = 0 for all l ≠ k. [sent-61, score-0.362]

31 Then the transition probabilities can be written as Tij = Pr(ztj = 1|zt−1,i = 1). [sent-65, score-0.22]

32 The Markov structure dictates that zt ∼ Mult(πzt−1 ), and the observations follow xt ∼ Φ(θzt ). [sent-67, score-0.587]

33 In the following, we present the asymptotic treatment for the finite HMM with Gaussian emission densities Pr(xt | ztk = 1) = N(xt | µk, σ²Id). [sent-69, score-0.208]
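
As a concrete illustration of this generative model, a minimal sampling sketch is given below; the function name, the isotropic-noise parameterization, and the initial distribution pi0 are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def sample_hmm(trans, mus, sigma, N, pi0, seed=None):
    """Sample a length-N sequence from a finite HMM with Gaussian emissions
    N(mu_k, sigma^2 I). trans is the K x K transition matrix, mus a K x d
    array of emission means, pi0 the initial state distribution."""
    rng = np.random.default_rng(seed)
    K, d = mus.shape
    z = np.empty(N, dtype=int)
    x = np.empty((N, d))
    z[0] = rng.choice(K, p=pi0)
    x[0] = rng.normal(mus[z[0]], sigma)
    for t in range(1, N):
        z[t] = rng.choice(K, p=trans[z[t - 1]])   # z_t ~ Mult(pi_{z_{t-1}})
        x[t] = rng.normal(mus[z[t]], sigma)       # x_t ~ N(mu_{z_t}, sigma^2 I_d)
    return z, x
```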

34 The first approach is to examine small-variance asymptotics directly on the joint probability distribution of the HMM, as done in [5] for clustering and feature learning problems. [sent-73, score-0.37]

35 Exponential Family Transitions: Our main analysis relies on appropriate manipulation of the transition probabilities, where we will use the bijection between exponential families and Bregman divergences established in [2]. [sent-79, score-0.214]

36 This, with an appropriate scaling to enable small-variance asymptotics as mentioned in [7], allows us to combine the emission and transition distributions into a simple objective function. [sent-81, score-0.655]

37 We denote Tjk = Pr(ztk = 1 | zt−1,j = 1) as before, and write the multinomial distribution for the latent state at time step t as Pr(zt | zt−1,j = 1) = ∏_{k=1}^{K} Tjk^{ztk} (1). [sent-82, score-0.362]

38 In order to apply small-variance asymptotics, we must allow the variance in the transition probabilities to go to zero in a reasonable way. [sent-83, score-0.251]

39 In particular, if we introduce a new parameter β and generalize the transition probabilities to Pr(zt | zt−1,j = 1) = exp(−β dφ(zt, mj)) b_φ̃(zt), where φ̃ = βφ, then the mean of the scaled distribution is unchanged (namely, mj) while its variance is scaled. [sent-88, score-0.56]

40 The next step is to link the emission and transition probabilities so that the variance is scaled appropriately in both. [sent-90, score-0.41]

41 The Gaussian emission densities can now be written as Pr(xt | ztk = 1) = exp(−β ‖xt − µk‖²) f(β), and the transition probabilities as Pr(zt | zt−1,j = 1) = exp(−λβ dφ(zt, mj)) b_φ̃(zt). [sent-92, score-0.588]

42 Joint Probability Asymptotics: We now have all the background development required to perform small-variance asymptotics on the HMM joint probability, and derive the segmental k-means algorithm. [sent-96, score-0.542]

43 The parameters to be estimated are the state assignments Z, the means µ = [µ1, . . . , µK], and the transition parameter matrix T. [sent-103, score-0.158]

44 The exp-family probabilities above allow us to rewrite this joint likelihood as p(X, Z) ∝ exp(−β [Σ_{t=1}^{N} ‖xt − µzt‖² + λ Σ_{t=2}^{N} KL(zt, m_{zt−1})] + log p(z1)) (3). [sent-105, score-0.242]

45 To obtain the corresponding non-probabilistic objective from small-variance asymptotics, we consider the MAP estimate obtained by maximizing the joint likelihood with respect to the parameters asymptotically as σ² goes to zero (β goes to ∞). [sent-106, score-0.224]

46 This yields, equivalently, the problem min_{Z,µ,T} Σ_{t=1}^{N} ‖xt − µzt‖² + λ Σ_{t=2}^{N} KL(zt, m_{zt−1}) (5); note that, as mentioned above, mj = [Tjk]_{k=1}^{K}. [sent-109, score-0.22]
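
Objective (5) is straightforward to evaluate for a candidate solution. The sketch below scores an assignment, replacing KL(zt, m_{zt−1}) with its 1-of-K equivalent −log T_{z_{t−1}, z_t}; the function and argument names are illustrative assumptions.

```python
import numpy as np

def penalized_kmeans_objective(x, z, mus, trans, lam, eps=1e-12):
    """Objective (5): squared distances to assigned state means plus lam times
    the negative log transition probabilities along the state path (the KL
    penalty for 1-of-K indicator vectors)."""
    fit = np.sum((x - mus[z]) ** 2)
    penalty = -np.sum(np.log(trans[z[:-1], z[1:]] + eps))
    return fit + lam * penalty
```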

47 We can view the above objective function as a “penalized” k-means problem, where the penalties are given by the transitions from state to state. [sent-110, score-0.371]

48 Keeping Z and T fixed, the update on µk is easily seen to be the equiweighted average of the data points which have been assigned to latent state k in the MAP estimate (it is the same minimization as in k-means for updating cluster means). [sent-114, score-0.163]

49 Finally, since KL(zt, mj) ∝ −Σ_{k=1}^{K} ztk log Tjk, minimization with respect to T simply yields the empirical transition probabilities, that is, Tjk,new = (# of transitions from state j to state k) / (# of transitions from state j), with both counts taken from the MAP path computed during the maximization w.r.t. Z. [sent-115, score-0.842]

50 We observe that, when λ = 1, the iterative minimization algorithm to solve (5) is exactly the segmental k-means algorithm, also known as Viterbi re-estimation. [sent-118, score-0.172]
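
A minimal sketch of this iterative minimization is given below: a Viterbi-style dynamic program over the penalized cost, followed by k-means-style mean updates and empirical transition re-estimation. With lam = 1 it reduces to segmental k-means. The initialization and the small eps smoothing are assumptions made for the sketch, not the authors' exact implementation.

```python
import numpy as np

def penalized_viterbi(x, mus, trans, lam, eps=1e-12):
    """Minimum-cost state path under the penalized k-means objective (5)."""
    N, K = x.shape[0], mus.shape[0]
    emit = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # N x K squared distances
    tcost = -lam * np.log(trans + eps)                       # K x K transition penalties
    alpha = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    alpha[0] = emit[0]
    for t in range(1, N):
        cand = alpha[t - 1][:, None] + tcost                 # cost of arriving from each previous state
        back[t] = cand.argmin(0)
        alpha[t] = cand.min(0) + emit[t]
    z = np.empty(N, dtype=int)
    z[-1] = alpha[-1].argmin()
    for t in range(N - 2, -1, -1):
        z[t] = back[t + 1, z[t + 1]]
    return z

def asymp_hmm(x, K, lam, iters=50, seed=0):
    """Alternating minimization: hard path, equiweighted mean updates,
    empirical transition re-estimation (generalized segmental k-means)."""
    rng = np.random.default_rng(seed)
    mus = x[rng.choice(len(x), K, replace=False)].copy()     # k-means-style init (assumption)
    trans = np.full((K, K), 1.0 / K)
    for _ in range(iters):
        z = penalized_viterbi(x, mus, trans, lam)
        for k in range(K):
            if np.any(z == k):
                mus[k] = x[z == k].mean(0)                   # mean of points assigned to state k
        counts = np.zeros((K, K))
        np.add.at(counts, (z[:-1], z[1:]), 1.0)              # transition counts along the MAP path
        trans = counts / np.maximum(counts.sum(1, keepdims=True), 1.0)
    return z, mus, trans
```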

51 EM algorithm asymptotics: We can reach the same algorithm alternatively by writing down the steps of the EM algorithm and exploring the small-variance limit of these steps, analogous to the approach of [8] for a Dirichlet process mixture. [sent-121, score-0.328]

52 Asymptotics of the Infinite Hidden Markov Model: We now tackle the more complex nonparametric model. [sent-123, score-0.116]

53 We will derive the objective function directly as in the parametric case but, unlike the parametric version, we will not apply asymptotics to the existing sampler algorithms. [sent-124, score-0.756]

54 The Model: The iHMM, also known as the HDP-HMM [3, 12], is a nonparametric Bayesian extension to the HMM, where an HDP prior is used to allow for an unspecified number of states. [sent-128, score-0.116]

55 To apply HDPs to sequential data, the iHMM can be formulated as follows: β ∼ GEM(γ), πk |β ∼ DP(α, β), zt |zt−1 ∼ Mult(πzt−1 ), θk ∼ H, xt ∼ Φ(θzt ). [sent-132, score-0.585]
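
For intuition, a truncated stick-breaking rendering of this generative process is sketched below; the truncation level L, the Gaussian base measure H, and all names are assumptions made purely for illustration (the model itself is not truncated).

```python
import numpy as np

def sample_ihmm_truncated(gamma, alpha, L, N, d=1, sigma=0.5, seed=0):
    """Truncated approximation of the iHMM: beta ~ GEM(gamma),
    pi_k | beta ~ DP(alpha, beta), z_t ~ Mult(pi_{z_{t-1}}), x_t ~ N(theta_{z_t}, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, gamma, L)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # stick-breaking weights
    beta /= beta.sum()                                            # renormalize after truncation
    pi = rng.dirichlet(alpha * beta, size=L)                      # finite-dimensional stand-in for DP(alpha, beta)
    theta = rng.normal(0.0, 3.0, size=(L, d))                     # base measure H taken to be a broad Gaussian
    z = np.empty(N, dtype=int)
    x = np.empty((N, d))
    z[0] = rng.choice(L, p=beta)
    x[0] = rng.normal(theta[z[0]], sigma)
    for t in range(1, N):
        z[t] = rng.choice(L, p=pi[z[t - 1]])
        x[t] = rng.normal(theta[z[t]], sigma)
    return z, x
```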

56 Following the discussion in the parametric case, our goal is to write down the full joint likelihood in the above model. [sent-134, score-0.196]

57 As discussed in [12], the Hierarchical Dirichlet Process yields assignments that follow the Chinese Restaurant Franchise (CRF), and thus the iHMM model additionally incorporates a term in the joint likelihood involving the prior probability of a set of state assignments arising from the CRF. [sent-135, score-0.229]

58 Suppose an assignment of observations to states has K different states (i.e., the number of restaurants in the franchise), [sent-136, score-0.225]

59 si is the number of states that can be reached from state i in one step (i.e., the number of tables in restaurant i), [sent-138, score-0.227]

60 and ni is the number of observations in each state i (i.e., the number of customers in restaurant i). [sent-140, score-0.201]

61 Then the probability of an assignment in the HDP can be written as (after integrating out mixture weights [1, 11], and if we only consider terms that would not be constants after we do the asymptotic analysis [5]): p(Z | α, γ, λ) ∝ γ^{K−1} · [Γ(γ + 1) / Γ(γ + Σ_{k=1}^{K} sk)] · ∏_{k=1}^{K} α^{sk−1} Γ(α + 1) / Γ(α + nk). [sent-143, score-0.158]

62 For the likelihood, we follow the same assumption as in the parametric case: the observation densities are Gaussians with a shared covariance matrix σ²Id. [sent-144, score-0.185]

63 Therefore, maximizing the generative probability is asymptotically equivalent to the following optimization problem: min_{K,Z,µ,T} Σ_{t=1}^{N} ‖xt − µzt‖² + λ Σ_{t=2}^{N} KL(zt, m_{zt−1}) + λ1 Σ_{k=1}^{K} (sk − 1) + λ2 (K − 1). [sent-152, score-0.129]

64 The first is the same as in the parametric case—a penalty based on the transitions from state to state. [sent-154, score-0.392]

65 The second penalty penalizes the number of transitions out of each state, and the third penalty penalizes the total number of states. [sent-155, score-0.241]
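
As a sketch, the combined objective can be scored for a candidate solution as follows; here sk is approximated by the number of distinct successor states used out of state k along the path, and the names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def ihmm_objective(x, z, mus, trans, lam, lam1, lam2, eps=1e-12):
    """k-means fit + lam * transition penalty + lam1 * sum_k (s_k - 1) + lam2 * (K - 1),
    with K the number of occupied states and s_k the number of distinct states
    reached out of state k."""
    fit = np.sum((x - mus[z]) ** 2)
    tpen = -np.sum(np.log(trans[z[:-1], z[1:]] + eps))
    states = np.unique(z)
    s = np.array([len(np.unique(z[1:][z[:-1] == k])) for k in states])
    return fit + lam * tpen + lam1 * np.sum(np.maximum(s, 1) - 1) + lam2 * (len(states) - 1)
```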

66 We follow the alternating minimization framework as in the parametric case, with some slight tweaks. [sent-161, score-0.12]

67 For each state i (i = 1, . . . , K), check if the set of observations to any state j that are reached by transitioning out of i can form a new dedicated hidden state and lower the objective function in the process. [sent-172, score-0.485]

68 There are two key changes to the algorithm beyond the standard parametric case. [sent-176, score-0.149]

69 In the forward-backward routine (step 2), we compute the usual K × N matrix α, where α(c, t) represents the minimum cost over paths of length t from the beginning of the sequence that reach state c at time step t. [sent-177, score-0.309]

70 We use the term “cost” to refer to the sum of the distances of points to state means, as well as the additive penalties incurred. [sent-178, score-0.19]

71 However, to see why it is difficult to compute the exact value of α in the nonparametric case, suppose we have computed the minimum cost of paths up to step t − 1 and we would like to compute the values of α for step t. [sent-179, score-0.17]

72 The cost of a path that ends in state c is obtained by examining, for all states i, the cost of a path that ends at i at step t − 1 and then transitions to state c at step t. [sent-180, score-0.605]

73 Thus, we must consider the transition from i to c. [sent-181, score-0.158]

74 If there are existing transitions from state i to state c, then we proceed as in a standard forward-backward algorithm. [sent-182, score-0.418]

75 However, we are also interested in two other cases—one where there are no existing transitions from i to c but we consider this transition along with a penalty λ1 , and another where an entirely new state is formed and we pay a penalty λ2 . [sent-183, score-0.493]

76 In the first case, the standard forward-backward routine faces an immediate problem: when we try to compute the cost of the path given by α(c, t), the cost will be infinite, as there is a −log(0) term from the transition probability. [sent-184, score-0.332]

77 We must therefore alter the forward-backward routine, or there will never be new states created nor transitions to an existing state which previously had no transitions. [sent-185, score-0.349]

78 The main idea is to derive and use bounds on how much the transition matrix can change under the above scenarios. [sent-186, score-0.158]

79 As long as the values we obtain for α are upper bounds, the objective function will decrease after the forward-backward routine, since the existing sequence of states is also a valid path (with no new incurred penalties). [sent-187, score-0.251]
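
One way to realize the modified recursion is sketched below: for each arrival into a state the step takes the cheaper of an observed transition (cost −λ log Tic) or a previously unused transition (flat penalty λ1), and separately prices opening a brand-new state (penalty λ2, with zero emission cost since its mean would be the point itself). This is a simplified rendering under those assumptions, not the authors' exact bounds.

```python
import numpy as np

def penalized_forward_step(alpha_prev, emit_t, trans, counts, lam, lam1, lam2, eps=1e-12):
    """One step of the modified forward recursion over the K existing states.
    alpha_prev: minimum path costs at time t-1 (length K); emit_t: squared
    distance of x_t to each existing state mean (length K); counts: K x K
    observed transition counts along the current path."""
    step = np.where(counts > 0, -lam * np.log(trans + eps), lam1)  # unused transition pays lam1 instead of -log(0)
    alpha_t = (alpha_prev[:, None] + step).min(axis=0) + emit_t
    new_state_cost = alpha_prev.min() + lam2                       # open a new state centered at x_t (assumption)
    return alpha_t, new_state_cost
```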

80 This step determines if the objective will decrease if we create a new global state in a certain fashion; in particular, for each existing state j, we compute the change in objective that occurs when data points that transition from j to some state k are given their own new global state. [sent-189, score-0.699]
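
A simplified sketch of that check: gather the points that arrive via the j → k transition, give them a dedicated mean, and compare the fit improvement against the λ2 penalty. The change in transition penalties is deliberately ignored here, so this is only an assumption-laden approximation of the test described above.

```python
import numpy as np

def new_state_gain(x, z, mus, j, k, lam2):
    """Fit improvement from giving the points that transition j -> k their own
    dedicated state, minus the new-state penalty (positive => worth creating)."""
    idx = np.where((z[:-1] == j) & (z[1:] == k))[0] + 1  # indices of points reached via j -> k
    if len(idx) == 0:
        return 0.0
    pts = x[idx]
    old_fit = np.sum((pts - mus[k]) ** 2)
    new_fit = np.sum((pts - pts.mean(0)) ** 2)
    return old_fit - new_fit - lam2
```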

81 Namely, we will show that our methods have benefits over the existing parametric and nonparametric HMM algorithms in terms of speed and accuracy. [sent-193, score-0.266]

82 First we compare our nonparametric algorithm with the Beam Sampler for the iHMM. [sent-195, score-0.116]

83 A sequence of length 3000 was generated over a varying number of hidden states with the all-zeros transition matrix except that Ti,i+1 = 0. [sent-196, score-0.385]

84 For our nonparametric algorithm, we performed a grid search over all three parameters and selected the parameters using a heuristic (see the supplementary material for a discussion of this heuristic). [sent-206, score-0.146]

85 The accuracy reported for the Beam sampler is the highest among all the selected samples. [sent-222, score-0.125]

86 The accuracy is shown in terms of the normalized mutual information (NMI) score (in the range [0,1]), since the sampler may output a different number of states than the ground truth, and NMI can handle this situation. [sent-223, score-0.205]
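
For reference, NMI between a decoded path and the ground-truth labels can be computed with a standard library call, for example with scikit-learn (the toy label arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

z_true = np.array([0, 0, 1, 1, 2, 2])  # hypothetical ground-truth states
z_pred = np.array([1, 1, 0, 0, 0, 2])  # hypothetical decoded states (different labeling, different K)
nmi = normalized_mutual_info_score(z_true, z_pred)  # value in [0, 1]
```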

87 For these datasets, we also observe that the EM algorithm for the standard HMM (not reported in the figure) can easily output a smaller number of states than the ground truth, which yields a smaller NMI score. [sent-225, score-0.142]

88 Next we demonstrate the effect of the compensation parameter λ in the parametric asymptotic model, along with comparisons to the standard HMM. [sent-228, score-0.252]

89 We will call the generalized segmental k-means of Section 2 the ‘asymp-HMM’ algorithm, shortened to ‘AHMM’ as appropriate. [sent-229, score-0.172]

90 In our ground-truth transition kernel, state i had an 80% prob. [sent-232, score-0.278]

91 For λ = 1, corresponding to standard segmental k-means, we obtain results similar to the standard HMM, which obtains an NMI of . [sent-242, score-0.23]

92 Thus, the parametric method offers some added flexibility via the new λ parameter. [sent-245, score-0.12]

93 As before, the mixture means were initialized with k-means and the transition kernels were given random initial values. [sent-256, score-0.256]

94 For the asymp-HMM and the standard HMM, the number of latent states was empirically chosen to be 5. [sent-257, score-0.152]

95 For predicting observation T + 1 given observations up to step T, we used the average of the learned state means, weighted by the transition probabilities out of the decoded state at time T. [sent-259, score-0.575]
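
A sketch of this prediction rule, with hypothetical variable names (z_last is the decoded state at time T, trans the learned transition matrix, mus the learned state means):

```python
import numpy as np

def predict_next(z_last, mus, trans):
    """Predicted observation at T+1: transition-weighted average of state means."""
    return trans[z_last] @ mus
```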

96 We ran the standard HMM along with both the parametric and non-parametric asymptotic algorithms on this data (the Beam sampler was too slow to run over this data, as each individual prediction took on the order of minutes). [sent-260, score-0.335]

97 Both the parametric and non-parametric asymptotic algorithms perform noticeably better than the standard HMM; they are able to better approximate the actual curve across all kinds of temporal fluctuations. [sent-262, score-0.258]

98 The goal was to obtain non-probabilistic formulations inspired by the HMM, in order to expand small-variance asymptotics to sequential models. [sent-275, score-0.366]

99 We view our main contribution as a novel dynamic-programming-based algorithm for sequential data with a non-fixed number of states that is derived from the iHMM model. [sent-276, score-0.118]

100 Small-variance asymptotics for exponential family Dirichlet process mixture models. [sent-327, score-0.385]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zt', 0.469), ('hmm', 0.384), ('asymptotics', 0.328), ('beam', 0.178), ('mzt', 0.172), ('segmental', 0.172), ('ztk', 0.172), ('transition', 0.158), ('mj', 0.142), ('nmi', 0.14), ('ihmm', 0.123), ('tjk', 0.123), ('parametric', 0.12), ('state', 0.12), ('transitions', 0.119), ('nonparametric', 0.116), ('emission', 0.107), ('hidden', 0.099), ('hdp', 0.098), ('sampler', 0.096), ('hmms', 0.082), ('dirichlet', 0.081), ('states', 0.08), ('xt', 0.078), ('penalties', 0.07), ('routine', 0.065), ('probabilities', 0.062), ('objective', 0.062), ('kulis', 0.061), ('pr', 0.061), ('asymptotic', 0.06), ('id', 0.059), ('kl', 0.056), ('gaussians', 0.055), ('bregman', 0.051), ('em', 0.05), ('markov', 0.049), ('asymihmm', 0.049), ('forwardbackward', 0.049), ('noticably', 0.049), ('sequence', 0.048), ('transitioning', 0.044), ('franchise', 0.043), ('compensation', 0.043), ('latent', 0.043), ('joint', 0.042), ('densities', 0.041), ('restaurant', 0.041), ('observations', 0.04), ('penalized', 0.04), ('sk', 0.04), ('initialized', 0.038), ('gk', 0.038), ('bayesian', 0.038), ('sequential', 0.038), ('probabilistic', 0.036), ('index', 0.036), ('likelihood', 0.034), ('mult', 0.034), ('penalty', 0.033), ('mixture', 0.033), ('yields', 0.033), ('map', 0.032), ('combinatorial', 0.032), ('unspeci', 0.032), ('bijection', 0.032), ('treatment', 0.031), ('goes', 0.031), ('path', 0.031), ('dp', 0.031), ('variance', 0.031), ('supplementary', 0.03), ('existing', 0.03), ('prediction', 0.03), ('base', 0.029), ('hierarchical', 0.029), ('standard', 0.029), ('accuracy', 0.029), ('priors', 0.028), ('scalable', 0.028), ('beal', 0.028), ('trading', 0.028), ('penalizes', 0.028), ('step', 0.027), ('mixtures', 0.027), ('means', 0.027), ('generative', 0.027), ('jiang', 0.026), ('days', 0.026), ('exp', 0.026), ('predicted', 0.026), ('ts', 0.026), ('ends', 0.025), ('scaled', 0.025), ('pca', 0.025), ('assignment', 0.025), ('observation', 0.024), ('asymptotically', 0.024), ('rk', 0.024), ('exponential', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that—particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

2 0.14871103 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

Author: Tzu-Kuo Huang, Jeff Schneider

Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such as Alzheimer’s, or certain biological processes. Existing methods for learning dynamic models from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover first-order Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1

3 0.11492986 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

4 0.11236294 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1

5 0.099988282 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon

Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multipliers (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1

6 0.099091671 89 nips-2013-Dimension-Free Exponentiated Gradient

7 0.098090038 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

8 0.093853213 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

9 0.091535665 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.087135501 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

11 0.079429559 158 nips-2013-Learning Multiple Models via Regularized Weighting

12 0.077270411 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

13 0.073745921 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

14 0.070462078 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

15 0.068901353 188 nips-2013-Memory Limited, Streaming PCA

16 0.068736427 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

17 0.066208988 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

18 0.063804545 70 nips-2013-Contrastive Learning Using Spectral Methods

19 0.063613407 299 nips-2013-Solving inverse problem of Markov chain with partial observations

20 0.058965661 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, 0.02), (2, 0.004), (3, 0.017), (4, -0.018), (5, 0.059), (6, 0.107), (7, 0.065), (8, 0.074), (9, -0.078), (10, -0.029), (11, -0.052), (12, -0.009), (13, 0.051), (14, -0.03), (15, 0.02), (16, 0.08), (17, 0.001), (18, 0.045), (19, 0.024), (20, 0.002), (21, 0.037), (22, 0.002), (23, -0.017), (24, 0.033), (25, 0.027), (26, -0.035), (27, -0.072), (28, -0.029), (29, 0.025), (30, 0.02), (31, 0.098), (32, 0.003), (33, -0.009), (34, -0.088), (35, -0.029), (36, -0.05), (37, 0.087), (38, 0.0), (39, 0.017), (40, 0.023), (41, 0.056), (42, -0.009), (43, 0.001), (44, -0.024), (45, 0.014), (46, -0.018), (47, -0.036), (48, -0.01), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94719881 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that—particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

2 0.70279962 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1

3 0.6843152 299 nips-2013-Solving inverse problem of Markov chain with partial observations

Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide

Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.

4 0.66717529 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent

Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty). 1

5 0.6556868 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations

Author: Botond Cseke, Manfred Opper, Guido Sanguinetti

Abstract: We propose an approximate inference algorithm for continuous time Gaussian Markov process models with both discrete and continuous time likelihoods. We show that the continuous time limit of the expectation propagation algorithm exists and results in a hybrid fixed point iteration consisting of (1) expectation propagation updates for discrete time terms and (2) variational updates for the continuous time term. We introduce post-inference correction methods that improve on the marginals of the approximation. This approach extends the classical Kalman-Bucy smoothing procedure to non-Gaussian observations, enabling continuous-time inference in a variety of models, including spiking neuronal models (state-space models with point process observations) and box likelihood models. Experimental results on real and simulated data demonstrate high distributional accuracy and significant computational savings compared to discrete-time approaches in a neural application. 1

6 0.65162331 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

7 0.65148026 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

8 0.65088028 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

9 0.64635259 252 nips-2013-Predictive PAC Learning and Process Decompositions

10 0.63627076 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

11 0.61787987 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

12 0.59497333 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

13 0.59326714 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

14 0.59088129 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

15 0.57975453 277 nips-2013-Restricting exchangeable nonparametric distributions

16 0.57668304 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

17 0.5741483 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

18 0.5570218 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models

19 0.5537914 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

20 0.55168408 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.051), (33, 0.145), (34, 0.163), (36, 0.017), (41, 0.019), (49, 0.017), (55, 0.232), (56, 0.138), (70, 0.025), (85, 0.036), (89, 0.043), (93, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94772577 174 nips-2013-Lexical and Hierarchical Topic Regression

Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik

Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ‫ݐ‬ௗ௦ ‫ݖ‬ௗ௦௡ ‫ݓ‬ௗ௦௡ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ‫ݕ‬ௗ ‫ܦ‬ ߱ ߟ௞ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. 
For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶௞ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. 
Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ ‫1=ݐ‬ ‫2=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫3=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫ݐ‬ௗ௦ ‫2=ݏ 1=ݏ‬ ‫ܵ = ݏ‬ଵ ‫3=ݏ 2=ݏ 1=ݏ‬ ݀=1 ݇ௗ௦௡ ‫ܵ = ݏ‬ଶ ‫ܵ = ݏ‬஽ ݀=2 ߶ଵ ߟଵ ݀=‫ܦ‬ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 ‫1=ݏ‬ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. 
Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) =  1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x }  Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). 
The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝       L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1  γl∗    −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. 
After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 
                    Floor Debates                         Amazon Reviews     Movie Reviews
Model               House-Senate       Senate-House
                    PCC↑     MSE↓      PCC↑     MSE↓     PCC↑     MSE↓      PCC↑     MSE↓
SVM-LDA 10          0.173    0.861     0.08     1.247    0.157    1.241     0.327    0.970
SVM-LDA 30          0.172    0.840     0.155    1.183    0.277    1.091     0.365    0.938
SVM-LDA 50          0.169    0.832     0.215    1.135    0.245    1.130     0.395    0.906
SVM-VOC             0.336    1.549     0.131    1.467    0.373    0.972     0.584    0.681
SVM-LDA-VOC         0.256    0.784     0.246    1.101    0.371    0.965     0.585    0.678
MLR-LDA 10          0.163    0.735     0.068    1.151    0.143    1.034     0.328    0.957
MLR-LDA 30          0.160    0.737     0.162    1.125    0.258    1.065     0.367    0.936
MLR-LDA 50          0.150    0.741     0.248    1.081    0.234    1.114     0.389    0.914
MLR-VOC             0.322    0.889     0.191    1.124    0.408    0.869     0.568    0.721
MLR-LDA-VOC         0.319    0.873     0.194    1.120    0.410    0.860     0.581    0.702
SLDA 10             0.154    0.729     0.090    1.145    0.270    1.113     0.383    0.953
SLDA 30             0.174    0.793     0.128    1.188    0.357    1.146     0.433    0.852
SLDA 50             0.254    0.897     0.245    1.184    0.241    1.939     0.503    0.772
SHLDA               0.356    0.753     0.303    1.076    0.413    0.891     0.597    0.673

Table 2: Regression results for Pearson's correlation coefficient (PCC, higher is better ↑) and mean squared error (MSE, lower is better ↓). Results on Amazon product reviews and movie reviews are averaged over 5 folds. The number following a model name denotes the number of topics for parametric models. For SVM-LDA-VOC and MLR-LDA-VOC, only the best results across K ∈ {10, 30, 50} are reported. Best results are in bold.

We use symmetric Dirichlet priors in both LDA and SLDA, initialize the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA, the variance of the regression is set to 0.5. For SHLDA, we use trees with a maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, both to speed inference (since the root is associated with every document) and because it is reasonable to assume that it would not change the response variable. To compare the performance of different methods, we compute Pearson's correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate, and vice versa.

Results and analysis. Table 2 shows the performance of all models on our three datasets. Methods that use only topic-based features, such as SVM-LDA and MLR-LDA, do poorly. Methods based only on lexical features, like SVM-VOC and MLR-VOC, significantly outperform the topic-only methods on the two review datasets, but are comparable or worse on the congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in SHLDA. In all cases, SHLDA achieves strong performance. For the two cases where SHLDA was second best in MSE (Amazon reviews and House-Senate), it outperforms the other methods in PCC. Doing well in PCC for these two datasets is important, since achieving low MSE is relatively easier due to the response variables' bimodal distribution in the floor debates and positively skewed distribution in Amazon reviews.
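The regression step in this setup is the maximization of Equation 7 with the fixed values µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5 given above. The sketch below is one plausible way to carry out that step with SciPy's L-BFGS; it is not the authors' code. In particular, Z_bar (per-document empirical node proportions), W_bar (normalized word counts), the smoothing of the non-differentiable L1 term, and the reliance on numerical gradients are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Fixed hyperparameters from the experimental setup above.
MU, SIGMA, OMEGA, RHO = 0.0, 0.5, 0.5, 0.5

def neg_log_lik(params, Z_bar, W_bar, y, eps=1e-8):
    """Negative of Equation 7. Z_bar is D x K+ (per-document node proportions),
    W_bar is D x V (normalized word counts), y has length D. The L1 penalty on
    tau is smoothed as sqrt(tau^2 + eps) so a quasi-Newton method can be used."""
    K = Z_bar.shape[1]
    eta, tau = params[:K], params[K:]
    resid = y - Z_bar @ eta - W_bar @ tau
    ll = (-(resid ** 2).sum() / (2 * RHO)
          - ((eta - MU) ** 2).sum() / (2 * SIGMA)
          - np.sqrt(tau ** 2 + eps).sum() / OMEGA)
    return -ll

def optimize_eta_tau(Z_bar, W_bar, y):
    """One optimization pass over eta and tau with L-BFGS.
    The gradient is left to finite differences for brevity."""
    K, V = Z_bar.shape[1], W_bar.shape[1]
    x0 = np.zeros(K + V)
    res = minimize(neg_log_lik, x0, args=(Z_bar, W_bar, y), method="L-BFGS-B")
    return res.x[:K], res.x[K:]
```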
For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House.

6 Qualitative Analysis: Agendas and Framing/Perspective

Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing the agenda/framing structure discussed in Section 1. In Figure 3, a portion of the topic hierarchy induced from the Congressional debate corpus, Nodes A and E illustrate agendas, issues introduced into political discourse that are associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, while text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral "tax" topic, reveal how the parties frame taxes as gains in terms of new social services (Democrats) or as losses for job creators (Republicans).

[Figure 3: per-node top-word lists and regression values omitted; see caption.]

Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower-level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the "tax" topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes ("children", "education", "health care"), while its Republican-leaning child (C) focuses on business implications ("death tax", "jobs", "businesses"). Colors and the numbers beneath each topic show the learned regression parameter η associated with that topic.

Figure 4 shows the topic structure discovered by SHLDA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters (all of the nodes at the second level have slightly negative values, mainly due to the very skewed distribution of the review ratings in Amazon). These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspectives. For example, Node A concerns washbasins for infants and has two polarized child nodes: reviewers take a positive perspective when their children enjoy the product (Node B: "loves", "splash", "play") but have negative reactions when it leaks (Node C: "leak(s/ed/ing)").
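A qualitative inspection like Figures 3 and 4 only requires walking the learned tree and printing each node's top words alongside its regression parameter. The sketch below uses a hypothetical Node container, since the text does not specify the sampler's internal state at this level of detail; all field names and the sign-to-party convention are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    """Hypothetical container for one topic node in the learned tree."""
    words: Dict[str, float]                 # word -> count (or probability)
    eta: float                              # learned per-node regression parameter
    children: List["Node"] = field(default_factory=list)

def print_hierarchy(node, depth=0, top_n=8):
    """Walk the tree and show each node's top words with its eta,
    mimicking the per-node summaries in Figures 3 and 4."""
    top = sorted(node.words, key=node.words.get, reverse=True)[:top_n]
    lean = "R" if node.eta > 0 else "D"     # sign convention assumed for the debates
    print("  " * depth + f"[{lean}:{abs(node.eta):.1f}] " + " ".join(top))
    for child in sorted(node.children, key=lambda c: -abs(c.eta)):
        print_hierarchy(child, depth + 1, top_n)
```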
[Figure 4: per-node top-word lists and regression values omitted; see caption.]

Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the first-level topics have no specific polarity and are associated with a broad class of products such as "routers" (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of "router" focuses on upgradable firmware such as "tomato" and "ddwrt" (Node E, positive), while another focuses on poor "tech support" and "customer service" (Node F, negative). The number below each topic is the regression parameter learned with that topic.

In addition to the per-topic regression parameters, SHLDA also associates each word with a lexical regression parameter τ. Table 3 shows the top ten words with the highest and lowest τ.
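Producing a Table 3-style listing amounts to sorting the vocabulary by the learned lexical weights. A minimal sketch, assuming tau is the fitted weight vector and vocab the corresponding word list (both names are ours):

```python
import numpy as np

def top_lexical_words(tau, vocab, n=10):
    """Return the n words with the most positive and most negative lexical
    regression weights tau, as in Table 3."""
    order = np.argsort(np.asarray(tau))
    negative = [vocab[i] for i in order[:n]]        # lowest tau
    positive = [vocab[i] for i in order[::-1][:n]]  # highest tau
    return positive, negative

# Usage: pos, neg = top_lexical_words(tau, vocab)
# For the floor debates, pos would be Republican-leaning terms and neg
# Democrat-leaning terms, per the sign convention noted in Table 3.
```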
The results are unsurprising, although the lexical regression for the Congressional debates is less clear-cut than for the other datasets. As we saw in Section 5, for similar datasets, SHLDA's context-specific regression is more useful when global lexical weights do not readily differentiate documents.

Table 3: Top words based on the global lexical regression coefficient τ. For the floor debates, positive τ's are Republican-leaning while negative τ's are Democrat-leaning.

Floor Debates
  Top 10 words with positive weights: bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien
  Top 10 words with negative weights: bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine

Amazon Reviews
  Top 10 words with positive weights: highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy
  Top 10 words with negative weights: waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible

Movie Reviews
  Top 10 words with positive weights: hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability
  Top 10 words with negative weights: bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring

7 Related Work

SHLDA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these "upstream" models [25], Perotte et al. [26] propose a "downstream" model called HSLDA, which jointly models documents' hierarchy of labels and topics. HSLDA's topic structure is flat, however, and its response variable is a hierarchy of labels associated with each document, unlike SHLDA's continuous response variable. Finally, another related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words.

8 Conclusion

We have introduced SHLDA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis.
Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors’ perspectives on those issues. We plan to analyze the topic structure produced by S H L DA with political science collaborators and more generally to study how S H L DA and related models can help analyze and discover useful insights from political discourse. Acknowledgments This research was supported in part by NSF under grant #1211153 (Resnik) and #1018625 (BoydGraber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. 8 References [1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002. [2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing public life. 2001. [3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008. [4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003. [5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010. [6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012. [7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010. [8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006. [9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012. [10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013. [11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012. [12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007. [13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989. [14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006. [15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004. [16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008. [17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005. [18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999. [19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003. [20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008. [21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009. [22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011. [23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012. [24] Boyd-Graber, J., P. Resnik. 
Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010. [25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008. [26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011. [27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010. [28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011. [29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011. [30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008. [31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008. [32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009. [33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010. [34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013. 9

2 0.87547892 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms

Author: Christophe Schulke, Francesco Caltagirone, Florent Krzakala, Lenka Zdeborova

Abstract: Compressed sensing (CS) is a concept that allows to acquire compressible signals with a small number of measurements. As such it is very attractive for hardware implementations. Therefore, correct calibration of the hardware is a central issue. In this paper we study the so-called blind calibration, i.e. when the training signals that are available to perform the calibration are sparse but unknown. We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. 1

same-paper 3 0.83211714 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that— particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

4 0.80908883 301 nips-2013-Sparse Additive Text Models with Low Rank Background

Author: Lei Shi

Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1

5 0.74707174 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

Author: David J. Weiss, Ben Taskar

Abstract: Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task. 1

6 0.74594355 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

7 0.7456997 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

8 0.74521732 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

9 0.74357551 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

10 0.74235928 348 nips-2013-Variational Policy Search via Trajectory Optimization

11 0.74173504 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

12 0.7405135 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

13 0.73983169 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

14 0.73899591 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

15 0.73834598 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

16 0.73791438 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

17 0.73779607 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

18 0.73756874 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

19 0.73722857 158 nips-2013-Learning Multiple Models via Regularized Weighting

20 0.73673171 249 nips-2013-Polar Operators for Structured Sparse Estimation