nips nips2012 nips2012-354 knowledge-graph by maker-knowledge-mining

354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes


Source: pdf

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In practice, however, conventional batch and online variational methods quickly become trapped in local optima. [sent-6, score-0.567]

2 In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. [sent-7, score-0.991]

3 We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. [sent-8, score-1.321]

4 For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics. [sent-9, score-0.257]

5 For example, the hierarchical Dirichlet process (HDP) [1] provides a general approach to joint clustering of grouped data, and leads to effective nonparametric topic models. [sent-11, score-0.299]

6 While nonparametric methods are best motivated by their potential to capture the details of large datasets, practical applications have been limited by the poor computational scaling of conventional Monte Carlo learning algorithms. [sent-12, score-0.132]

7 Mean field variational methods provide an alternative, optimization-based framework for nonparametric learning [2, 3]. [sent-13, score-0.419]

8 Aiming at larger-scale applications, recent work [4] has extended online variational methods [5] for the parametric, latent Dirichlet allocation (LDA) topic model [6] to the HDP. [sent-14, score-0.629]

9 While this online approach can produce reasonable models of large data streams, we show that the variational posteriors of existing algorithms often converge to poor local optima. [sent-15, score-0.49]

10 Furthermore, by applying a fixed truncation to the number of posterior topics or clusters, conventional variational methods limit the ability of purportedly nonparametric models to fully adapt to the data. [sent-17, score-0.89]

11 In this paper, we propose novel split-merge moves for online variational inference for the HDP (oHDP) which result in much better predictive performance. [sent-18, score-0.577]

12 Additionally, the inclusion of split-merge moves during posterior inference allows us to dynamically vary the truncation level throughout learning. [sent-21, score-0.223]

13 While conservative truncations can be theoretically justified for batch analysis of fixed-size datasets [2], our data-driven adaptation of the truncation level is far better suited to large-scale analysis of streaming data. [sent-22, score-0.179]

14 They have also been used for maximum likelihood and variational analysis of ... (Figure 1: Directed graphical representation of a hierarchical Dirichlet process topic model, in which an unbounded collection of topics φk model the Nj words in each of D documents.) [sent-24, score-1.091]

15 These deterministic algorithms validate split-merge proposals by evaluating a batch objective on the entire dataset, an approach which is unexplored for nonparametric models and infeasible for online learning. [sent-27, score-0.332]

16 We instead optimize the variational objective via stochastic gradient ascent, and split or merge based on only a noisy estimate of the variational lower bound. [sent-28, score-0.985]

17 Over time, these local decisions lead to global estimates of the number of topics present in a given corpus. [sent-29, score-0.334]

18 We review the HDP and conventional variational methods in Sec. [sent-30, score-0.405]

19 Note that ψjn = φzjn for some discrete indicator zjn . [sent-46, score-0.144]

20 The topics φk ∼ Dirichlet(η) are distributions on a vocabulary of W words. [sent-48, score-0.338]

21 The global topic weights, β ∼ GEM(γ), are still drawn from a stick-breaking prior. [sent-49, score-0.244]

22 For each document j, document-specific topic frequencies are drawn πj ∼ DP(αβ). [sent-50, score-0.271]

23 Then for each word index n in document j, a topic indicator is drawn zjn ∼ Categorical(πj ), and finally a word is drawn wjn ∼ Categorical(φzjn ). [sent-51, score-0.626]
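
As a concrete illustration of this generative process, the sketch below samples documents from the model. It is our own illustration rather than the authors' code: the infinite GEM(γ) draw is replaced by a finite stick-breaking approximation, and all sizes, hyperparameter values, and variable names are assumptions.

```python
# Sketch of the HDP topic model's generative process (finite stick-breaking approximation).
import numpy as np

rng = np.random.default_rng(0)
W, D = 50, 5                      # vocabulary size, number of documents (illustrative values)
gamma, alpha, eta = 1.0, 1.0, 0.5
K_max = 100                       # truncation used only for this simulation

# Global topic weights beta ~ GEM(gamma), approximated by K_max stick-breaking steps.
sticks = rng.beta(1.0, gamma, size=K_max)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()

# Topics phi_k ~ Dirichlet(eta) over the W-word vocabulary.
phi = rng.dirichlet(np.full(W, eta), size=K_max)

docs = []
for j in range(D):
    N_j = rng.poisson(80) + 1
    pi_j = rng.dirichlet(alpha * beta)                     # pi_j ~ DP(alpha * beta), restricted to the K_max atoms
    pi_j /= pi_j.sum()
    z = rng.choice(K_max, size=N_j, p=pi_j)                # topic indicators z_jn ~ Categorical(pi_j)
    w = np.array([rng.choice(W, p=phi[k]) for k in z])     # words w_jn ~ Categorical(phi_{z_jn})
    docs.append(w)
```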

24 In our update derivations below, we use ϕjw to denote the shared ϕjn for all word tokens in document j of type w. [sent-56, score-0.217]

25 Selection of an appropriate truncation strategy is crucial to the accuracy of variational methods for nonparametric models. [sent-57, score-0.506]

26 Here, we truncate the topic indicator distributions by fixing q(zjn = k) = 0 for k > K, where K is a threshold which varies dynamically in our later algorithms. [sent-58, score-0.213]

27 With this assumption, the topic distributions with indices greater than K are conditionally independent of the observed data; we may thus ignore them and tractably update the remaining parameters with respect to the true, infinite model. [sent-59, score-0.214]

28 A similar truncation has been previously used in the context of an otherwise more complex collapsed variational method [3]. [sent-60, score-0.462]

29 Desirably, this truncation is nested such that increasing K always gives potentially improved bounds, but does not require the computation of infinite sums, as in [16]. [sent-61, score-0.122]
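
To make the nested property concrete, the sketch below shows one way such a truncation could be grown by one topic. This is our own illustration under an assumed parameter layout (a (K, W) array `lam` of topic Dirichlet parameters and a point-estimated weight vector `beta_star`), not the paper's procedure.

```python
# Growing a nested truncation from K to K+1: the new topic is appended at its
# Dirichlet(eta) prior with a small weight, so existing factors are left in place
# and the current bound value remains attainable.
import numpy as np

def grow_truncation(lam, beta_star, eta, new_weight=1e-3):
    """lam: (K, W) topic Dirichlet parameters; beta_star: length-K point-estimated topic weights."""
    W = lam.shape[1]
    lam_new = np.vstack([lam, np.full((1, W), eta)])               # appended topic sits at its prior
    beta_new = np.append((1.0 - new_weight) * beta_star, new_weight)
    return lam_new, beta_new
```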

30 In contrast, approximations based on truncations of the stick-breaking topic frequency prior [2, 4] are not nested, and their artifactual placement of extra mass on the final topic K is less suitable for our split-merge online variational inference. [sent-62, score-0.843]

31 The expectations are with respect to the variational distribution. [sent-64, score-0.346]

32 Each expectation is dependent on only a subset of the variational parameters; we leave off particular subscripts for notational clarity. [sent-65, score-0.346]

33 Note that the expansion of the variational lower bound in (2) contains all terms inside a summation over documents. [sent-66, score-0.346]

34 This is the key observation that allowed [5] to develop an online inference algorithm for LDA. [sent-67, score-0.162]

35 A full expansion of the variational objective is given in the supplemental material. [sent-68, score-0.372]

36 Drawing a parallel between variational inference and the expectation maximization (EM) algorithm, we label the document-specific updates of (ϕj , θj ) the E-step, and the corpus-wide updates of (λ, β) the M-step. [sent-74, score-0.479]
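
The document-specific E-step can be sketched as the usual mean-field coordinate ascent on the word responsibilities and the document topic parameters. The code below follows the standard LDA-style update pattern and is only a hedged approximation of the paper's direct-assignment updates; all names (`e_step`, `lam`, `beta_star`) are our own.

```python
# Per-document "E-step" sketch: alternate between word responsibilities and theta_j,
# holding the corpus-wide topic parameters lam and the point estimate beta_star fixed.
import numpy as np
from scipy.special import digamma

def e_step(word_ids, word_counts, lam, beta_star, alpha, n_iter=50):
    """word_ids / word_counts list the unique word types in document j and their counts."""
    K, W = lam.shape
    Elog_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # E_q[log phi_kw]
    theta = alpha * beta_star + word_counts.sum() / K                   # rough initialization
    for _ in range(n_iter):
        Elog_theta = digamma(theta) - digamma(theta.sum())
        log_resp = Elog_theta[:, None] + Elog_phi[:, word_ids]          # responsibility of topic k for word type w
        log_resp -= log_resp.max(axis=0, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=0, keepdims=True)
        theta = alpha * beta_star + resp @ word_counts                  # document-level topic parameters
    return resp, theta
```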

37 We expect β to have small posterior variance in large datasets, and using a point estimate β∗ simplifies variational derivations for our direct assignment formulation. [sent-75, score-0.346]

38 As empirically explored for the HDP-PCFG [15], updates to the global topic weights have much less predictive impact than improvements to topic distributions. [sent-76, score-0.482]

39 Online Variational Inference: Batch variational inference requires a full pass through the data at each iteration, making it computationally infeasible for large datasets and impossible for streaming data. [sent-78, score-0.528]

40 To remedy this, we adapt and improve recent work on online variational inference algorithms [4, 5]. [sent-79, score-0.53]

41 The form of the lower bound in (2), as a scaled expectation with respect to the document collection, suggests an online learning algorithm. [sent-80, score-0.157]

42 Given a learning rate ρt satisfying Σ_{t=0}^∞ ρt = ∞ and Σ_{t=0}^∞ ρt² < ∞, we can optimize the variational objective stochastically. [sent-81, score-0.372]

43 The candidate topic weights β̂ are found via gradient-based optimization on S. [sent-84, score-0.207]

44 The resulting inference algorithm is similar to conventional batch methods, but is applicable to streaming, big data. [sent-85, score-0.185]
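
A minimal sketch of such a stochastic update is given below: minibatch sufficient statistics are rescaled as if the minibatch were the whole corpus and blended into the current topic parameters with a decaying rate. The ρt schedule shown is the common (τ0 + t)^(−κ) choice from online LDA [5]; it, the function name, and the array layout are assumptions, not the paper's exact implementation.

```python
# Online "M-step" sketch: blend minibatch-based candidate parameters into lam.
import numpy as np

def online_m_step(lam, minibatch_stats, D, S_size, eta, t, tau0=1.0, kappa=0.7):
    """minibatch_stats[k, w] = sum over docs j in S of n_jw * resp_jwk from the E-step."""
    rho_t = (tau0 + t) ** (-kappa)                  # satisfies sum rho_t = inf, sum rho_t^2 < inf
    lam_hat = eta + (D / S_size) * minibatch_stats  # candidate corpus-wide Dirichlet parameters
    return (1.0 - rho_t) * lam + rho_t * lam_hat    # stochastic (natural-gradient style) update
```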

45 Split-Merge Updates for Online Variational Inference: We develop a data-driven split-merge algorithm for online variational inference for the HDP, referred to as oHDP-SM. [sent-86, score-0.508]

46 The algorithm dynamically expands and contracts the truncation level K by splitting and merging topics during specialized moves which are interleaved with standard online variational updates. [sent-87, score-0.97]

47 The resulting model truly allows the number of topics to grow with the data. [sent-88, score-0.303]

48 As such, we do not have to employ the technique of [4, 3] and other truncated variational approaches of setting K above the expected number of topics and relying on the inference to infer a smaller number. [sent-89, score-0.712]

49 Instead, we initialize with small K and let the inference discover new topics as it progresses, similar to the approach used in [17]. [sent-90, score-0.406]

50 One can see how this property would be desirable in an online setting, as documents seen after many inference steps may still create new topics. [sent-91, score-0.248]

51 Split (creation of new topics): Given the result of analyzing one mini-batch, q∗ = ((ϕj, θj)j=1..|S|, λ, β∗), and the corresponding value of the lower bound L(q∗), we consider splitting topic k into two topics k′, k′′. [sent-93, score-0.518]

52 Initialize new variational posteriors: To break symmetry, we initialize the new topic posteriors (λk′, λk′′) and topic weights (β∗k′, β∗k′′) using sufficient statistics from the previous iteration: λk′ = (1 − ρt)λk, λk′′ = ρt λ̂k, β∗k′ = (1 − ρt)β∗k, β∗k′′ = ρt β̂k. [sent-95, score-0.867]

53 Intuitively, we expect the sufficient statistics to provide insight into how a topic was actually used during the E-step. [sent-96, score-0.184]
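
A minimal sketch of this symmetry-breaking initialization, assuming topics are stored as rows of a (K, W) array `lam` with point-estimated weights `beta_star`, and with `lam_hat_k` and `beta_hat_k` standing in for the previous iteration's sufficient-statistic-based candidates; the names and layout are ours, not the authors'.

```python
# Split initialization sketch: topic k is replaced by k' and a new topic k'' is appended,
# with the new topic seeded from the rho_t-weighted candidate sufficient statistics.
import numpy as np

def init_split(lam, beta_star, lam_hat_k, beta_hat_k, k, rho_t):
    new_lam = np.vstack([lam, (rho_t * lam_hat_k)[None, :]])  # lambda_{k''} appended at the end
    new_lam[k] = (1.0 - rho_t) * lam[k]                       # lambda_{k'} replaces topic k
    new_beta = np.append(beta_star, rho_t * beta_hat_k)       # beta*_{k''}
    new_beta[k] = (1.0 - rho_t) * beta_star[k]                # beta*_{k'}
    return new_lam, new_beta                                  # any renormalization/gradient step would follow
```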

54 Technically, we replace topic k with topic k′ and add k′′ as a new topic. [sent-98, score-0.368]

55 In practice, we found that the order of topics in the global stick-breaking distribution had little effect on overall algorithm performance. [sent-99, score-0.334]

56 The restricted iteration consists of restricted analogues to both the E-step and the M-step, where all parameters except those for the new topics are held constant. [sent-101, score-0.397]

57 It is important to note that even though these values are not updated, they are still used in the calculations for both the variational expectation of πj and the normalization of ϕ. [sent-104, score-0.346]

58 The expected log word probabilities Eq [log φk′ w ] and Eq [log φk′′ w ] are computed using the newly updated λ values. [sent-106, score-0.136]

59 Let θsplit, λsplit, and βsplit be defined similarly. [sent-108, score-0.147]

60 If we wish to allow the model to expand the number of topics more quickly, we can increase this number. [sent-113, score-0.303]

61 Finally, it is important to note that all aspects of the split procedure are driven by the data — the new topics are initialized using data-driven proposals, refined by re-running the variational E-step, and accepted based on an unbiased estimate of the change in the variational objective. [sent-114, score-1.142]

62 Merge (removal of redundant topics): Consider a candidate merge of two topics, k′ and k′′, into a new topic k. [sent-116, score-0.304]

63 Because many terms cancel, computing this bound change is fairly computationally inexpensive, but it can still be computationally infeasible to consider all pairs of topics for large K. [sent-118, score-0.336]

64 Instead, we identify potential merge candidates by looking at the sample covariance of the θj vectors across the corpus (or minibatch). [sent-119, score-0.158]

65 Intuitively, if there are two copies of a topic or a topic is split into two pieces, they should tend to be used together, and therefore have positive covariance. [sent-121, score-0.515]
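
A hedged sketch of this candidate-selection heuristic: compute the sample covariance of per-document topic proportions over a minibatch and propose the most positively covarying topic pairs. The exact ranking or thresholding used in the paper may differ, and the names are ours.

```python
# Merge-candidate sketch: rank topic pairs by how positively their usage covaries across documents.
import numpy as np

def merge_candidates(theta, top_m=5):
    """theta: (num_docs, K) array of document-level topic parameters."""
    props = theta / theta.sum(axis=1, keepdims=True)    # per-document topic proportions
    cov = np.cov(props, rowvar=False)                   # (K, K) sample covariance across documents
    K = cov.shape[0]
    iu = np.triu_indices(K, k=1)                        # each unordered pair once
    vals = cov[iu]
    order = np.argsort(vals)[::-1][:top_m]              # most positively covarying pairs first
    return [(int(iu[0][i]), int(iu[1][i])) for i in order if vals[i] > 0]
```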

66 For consistency in notation, we call the model state with topics k′ and k′′ merged q^merge(k′,k′′). [sent-122, score-0.303]

67 Combining this merge procedure with the previous split proposals leads to the online variational method of Algorithm 2. [sent-123, score-0.75]

68 In an online setting, we can only compute unbiased noisy estimates of the true difference in the variational objective; split or merge moves that increase the expected variational objective are not guaranteed to do so for the objective evaluated over the entire corpus. [sent-124, score-1.154]

69 Algorithm 2 (Online variational inference for the HDP + split-merge) begins: 1: initialize (λ, β∗); 2: for t = 1, 2, ... [sent-125, score-0.449]

70 (Algorithm 2, continued) ..., K do; 16: compute L(q^split(k)) via restricted iteration; 17: if L(q^split(k)) > L(q) then 18: q ← q^split(k); 19: end if; 20: end for; 21: end for. The uncertainty associated with the online method can be mitigated to some extent by using large minibatches. [sent-131, score-0.226]
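
The accept/reject step in lines 16–19 of Algorithm 2 can be sketched as below; `restricted_iteration` and `estimate_bound` are placeholders for the paper's restricted E/M updates and minibatch-based bound estimate, not functions from any released code.

```python
# Split acceptance sketch: keep a candidate split only if the noisy, minibatch-based
# estimate of the variational lower bound improves.
def maybe_accept_split(q, k, minibatch, restricted_iteration, estimate_bound):
    q_split = restricted_iteration(q, k, minibatch)        # refine only the two new topics k', k''
    if estimate_bound(q_split, minibatch) > estimate_bound(q, minibatch):
        return q_split                                     # accept: truncation level grows by one
    return q                                               # reject: keep the current state
```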

71 Confidence intervals for the expected change in the variational objective can be computed, and might be useful in a more sophisticated acceptance rule. [sent-132, score-0.402]

72 Note that our usage of a nested family of variational bounds is key to the accuracy and stability of our split-merge acceptance rules. [sent-133, score-0.411]

73 Experimental Results: To demonstrate the effectiveness of our split-merge moves, we compare three algorithms: batch variational inference (bHDP), online variational inference without split-merge (oHDP), and online variational inference with split-merge (oHDP-SM). [sent-134, score-1.488]

74 We test the models on one synthetic and two real datasets. Bars: A 20-topic bars dataset of the type introduced in [18], where topics can be viewed as bars on a 10 × 10 grid. [sent-136, score-0.429]

75 While the truncation level K is different from the actual number of topics assigned non-negligible mass, the split-merge model tends to merge away unused topics, so these numbers are usually fairly close. [sent-147, score-0.51]

76 We randomly split each test document in Dtest into 80%-20% pieces, wj1 and wj2 . [sent-170, score-0.205]

77 Then, using φ̄ as the variational expectation of the topics from training, we learn π̄j on wj1 and approximate the probability of wj2 as ∏_{w∈wj2} Σ_k π̄jk φ̄kw. [sent-171, score-0.649]
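
A sketch of this document-completion evaluation under our own assumptions: each test document is split 80/20, document proportions are fit on the first piece with a placeholder `fit_pi` routine, and the per-word log likelihood of the held-out piece is averaged.

```python
# Held-out evaluation sketch: average log p(w) over the 20% held-out words of each test document.
import numpy as np

def heldout_per_word_loglik(test_docs, phi_bar, fit_pi, seed=0):
    """test_docs: list of 1-D arrays of word ids; phi_bar: (K, W) expected topics from training."""
    rng = np.random.default_rng(seed)
    total, n_words = 0.0, 0
    for words in test_docs:
        words = rng.permutation(words)
        n1 = int(0.8 * len(words))
        w1, w2 = words[:n1], words[n1:]
        pi_j = fit_pi(w1, phi_bar)             # learn document proportions pi_j on w_j1
        probs = pi_j @ phi_bar[:, w2]          # p(w) = sum_k pi_jk * phi_bar[k, w] for w in w_j2
        total += np.log(probs).sum()
        n_words += len(w2)
    return total / n_words
```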

78 Bars: For the bars data, we initialize eight oHDP-SM runs with K ∈ {2, 5, 10, 20, 40, 50, 80, 100}, eight runs of oHDP with K = 20, and eight runs with K = 50. [sent-173, score-0.306]

79 Note that the data-driven split-merge procedure allows splitting and merging of topics to mostly cease once the inference has converged (Figure 2(d)). [sent-176, score-0.452]

80 Since oHDP and bHDP use only a subset of topics under the truncation, setting K much higher yields numbers of topics comparable to oHDP-SM. [sent-181, score-0.606]

81 Even though the split-merge algorithms improve in part by adding topics, they are using their topics much more effectively (Figure 2(h)). [sent-186, score-0.303]

82 We speculate that for the NIPS corpus especially, the reason that models achieve better predictive likelihoods with more topics is due to the bursty properties of text data [20]. [sent-187, score-0.366]

83 Figure 3 illustrates the topic refinement and specialization which occurs in successful split proposals. [sent-188, score-0.331]

84 New York Times: As batch variational methods and samplers are not feasible for such a large dataset, we compare two runs of oHDP with K = 300 and K = 500 to a run of oHDP-SM with K = 200 initial topics. [sent-190, score-0.457]

85 Figure 2(c) shows an inherent problem with oHDP for very large datasets: when truncated to K = 500, the algorithm uses all of its available topics and exhibits overfitting. [sent-192, score-0.332]

86 Discussion: We have developed a novel split-merge online variational algorithm for the hierarchical DP. [sent-194, score-0.487]

87 This approach leads to more accurate models and better predictive performance, as well as a model that is able to adapt the number of topics more freely than conventional approximations based on fixed truncations. [sent-195, score-0.409]

88 Our moves are similar in spirit to split-merge samplers, but by evaluating their quality stochastically using streaming data, we can rapidly adapt model structure to large-scale datasets. [sent-196, score-0.123]

89 While many papers have tried to improve conventional mean field methods via higher-order variational expansions [21], local optima can make the resulting algorithms compare unfavorably to Monte Carlo methods [3]. [sent-197, score-0.437]

90 Here we pursue the complementary goal of more robust, scalable optimization of simple variational objectives. [sent-198, score-0.346]

91 We believe similar online learning methods will prove effective for the combinatorial structures of other Bayesian nonparametric models. [sent-200, score-0.172]

92–94 [Figure 2 plot residue: axis labels read "per-word log likelihood", "documents seen", and "# topics used"; legend entries oHDP-SM (K = 2–100), oHDP (K = 50), oHDP (K = 20); panel labels (a)–(i).]

95 Figure 2: Trace plots of heldout likelihood and number of topics used. [sent-262, score-1.28]

96 Bottom: A plot of per-word log likelihood against number of topics used. [sent-266, score-0.404]

97 Note particularly plot (h), where for every cardinality of used topics shown, there is a split-merge method outperforming a conventional method. [sent-267, score-0.362]

98 The left column shows the topic directly prior to the split. [sent-269, score-0.184]

99 After 240,000 more documents have been analyzed, subtle differences become apparent: the top topic covers terms relating to general neuronal behavior, while the bottom topic deals more specifically with neuron firing. [sent-270, score-0.561]

100 Bayesian model search for mixture models based on optimizing variational bounds. [sent-338, score-0.346]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ohdp', 0.454), ('variational', 0.346), ('topics', 0.303), ('jwk', 0.272), ('eq', 0.251), ('jk', 0.189), ('topic', 0.184), ('hdp', 0.158), ('split', 0.147), ('cgs', 0.145), ('zjn', 0.144), ('kw', 0.139), ('merge', 0.12), ('dirichlet', 0.111), ('bhdp', 0.109), ('jn', 0.109), ('neuronal', 0.107), ('online', 0.099), ('cortex', 0.093), ('truncation', 0.087), ('documents', 0.086), ('word', 0.075), ('nonparametric', 0.073), ('neurons', 0.065), ('activation', 0.063), ('batch', 0.063), ('inference', 0.063), ('log', 0.061), ('conventional', 0.059), ('minibatch', 0.059), ('peak', 0.058), ('document', 0.058), ('streaming', 0.057), ('dp', 0.056), ('pyramidal', 0.053), ('gj', 0.053), ('nw', 0.051), ('responses', 0.048), ('sm', 0.048), ('msec', 0.048), ('patterns', 0.048), ('runs', 0.048), ('bars', 0.047), ('inputs', 0.046), ('posteriors', 0.045), ('moves', 0.044), ('hierarchical', 0.042), ('initialize', 0.04), ('kurihara', 0.04), ('likelihood', 0.04), ('corpus', 0.038), ('dendritic', 0.038), ('proposals', 0.038), ('nested', 0.035), ('updates', 0.035), ('vocabulary', 0.035), ('restricted', 0.033), ('infeasible', 0.033), ('wjn', 0.032), ('type', 0.032), ('optima', 0.032), ('global', 0.031), ('categorical', 0.031), ('merging', 0.031), ('splitting', 0.031), ('acceptance', 0.03), ('update', 0.03), ('blei', 0.03), ('gem', 0.03), ('truncations', 0.03), ('dendrite', 0.03), ('ueda', 0.03), ('jw', 0.03), ('dynamically', 0.029), ('behavioral', 0.029), ('collapsed', 0.029), ('datasets', 0.029), ('drawn', 0.029), ('nj', 0.029), ('iteration', 0.028), ('heldout', 0.028), ('pattern', 0.028), ('dps', 0.026), ('postsynaptic', 0.026), ('carlo', 0.026), ('zj', 0.026), ('objective', 0.026), ('corpora', 0.025), ('monte', 0.025), ('eight', 0.025), ('predictive', 0.025), ('york', 0.024), ('converged', 0.024), ('weights', 0.023), ('sudderth', 0.023), ('preferred', 0.022), ('end', 0.022), ('welling', 0.022), ('tokens', 0.022), ('adapt', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

2 0.35180396 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

3 0.17633583 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. 1

4 0.16629681 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

5 0.15140951 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

6 0.14677496 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

7 0.13248257 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

8 0.12727492 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

9 0.12013762 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

10 0.12008685 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

11 0.1189958 12 nips-2012-A Neural Autoregressive Topic Model

12 0.11398416 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

13 0.11318778 298 nips-2012-Scalable Inference of Overlapping Communities

14 0.10984499 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

15 0.10852841 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

16 0.10657474 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

17 0.10628379 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

18 0.10170985 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

19 0.098027356 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

20 0.088107713 126 nips-2012-FastEx: Hash Clustering with Exponential Families


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.18), (1, 0.075), (2, -0.051), (3, 0.038), (4, -0.3), (5, -0.009), (6, -0.011), (7, -0.021), (8, 0.154), (9, -0.091), (10, 0.211), (11, 0.183), (12, 0.028), (13, 0.011), (14, 0.037), (15, 0.036), (16, 0.006), (17, 0.035), (18, -0.019), (19, 0.098), (20, -0.02), (21, -0.033), (22, 0.013), (23, 0.007), (24, -0.036), (25, -0.081), (26, -0.007), (27, 0.075), (28, -0.022), (29, 0.045), (30, -0.024), (31, -0.073), (32, -0.062), (33, -0.078), (34, 0.018), (35, 0.055), (36, 0.093), (37, -0.041), (38, -0.143), (39, 0.044), (40, 0.129), (41, 0.012), (42, -0.09), (43, 0.016), (44, 0.049), (45, -0.063), (46, 0.066), (47, 0.001), (48, -0.072), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97316647 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

2 0.86515617 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

Author: Chong Wang, David M. Blei

Abstract: We present a truncation-free stochastic variational inference algorithm for Bayesian nonparametric models. While traditional variational inference algorithms require truncations for the model or the variational distribution, our method adapts model complexity on the fly. We studied our method with Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large data sets. Our method performs better than previous stochastic variational inference algorithms. 1

3 0.78067756 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

4 0.69063395 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

5 0.65336907 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

Author: Jeff Beck, Alexandre Pouget, Katherine A. Heller

Abstract: Recent experiments have demonstrated that humans and animals typically reason probabilistically about their environment. This ability requires a neural code that represents probability distributions and neural circuits that are capable of implementing the operations of probabilistic inference. The proposed probabilistic population coding (PPC) framework provides a statistically efficient neural representation of probability distributions that is both broadly consistent with physiological measurements and capable of implementing some of the basic operations of probabilistic inference in a biologically plausible way. However, these experiments and the corresponding neural models have largely focused on simple (tractable) probabilistic computations such as cue combination, coordinate transformations, and decision making. As a result it remains unclear how to generalize this framework to more complex probabilistic computations. Here we address this short coming by showing that a very general approximate inference algorithm known as Variational Bayesian Expectation Maximization can be naturally implemented within the linear PPC framework. We apply this approach to a generic problem faced by any given layer of cortex, namely the identification of latent causes of complex mixtures of spikes. We identify a formal equivalent between this spike pattern demixing problem and topic models used for document classification, in particular Latent Dirichlet Allocation (LDA). We then construct a neural network implementation of variational inference and learning for LDA that utilizes a linear PPC. This network relies critically on two non-linear operations: divisive normalization and super-linear facilitation, both of which are ubiquitously observed in neural circuits. We also demonstrate how online learning can be achieved using a variation of Hebb’s rule and describe an extension of this work which allows us to deal with time varying and correlated latent causes. 1 Introduction to Probabilistic Inference in Cortex Probabilistic (Bayesian) reasoning provides a coherent and, in many ways, optimal framework for dealing with complex problems in an uncertain world. It is, therefore, somewhat reassuring that behavioural experiments reliably demonstrate that humans and animals behave in a manner consistent with optimal probabilistic reasoning when performing a wide variety of perceptual [1, 2, 3], motor [4, 5, 6], and cognitive tasks[7]. This remarkable ability requires a neural code that represents probability distribution functions of task relevant stimuli rather than just single values. While there 1 are many ways to represent functions, Bayes rule tells us that when it comes to probability distribution functions, there is only one statistically optimal way to do it. More precisely, Bayes Rule states that any pattern of activity, r, that efficiently represents a probability distribution over some task relevant quantity s, must satisfy the relationship p(s|r) ∝ p(r|s)p(s), where p(r|s) is the stimulus conditioned likelihood function that specifies the form of neural variability, p(s) gives the prior belief regarding the stimulus, and p(s|r) gives the posterior distribution over values of the stimulus, s given the representation r . Of course, it is unlikely that the nervous system consistently achieves this level of optimality. 
None-the-less, Bayes rule suggests the existence of a link between neural variability as characterized by the likelihood function p(r|s) and the state of belief of a mature statistical learning machine such as the brain. The so called Probabilistic Population Coding (or PPC) framework[8, 9, 10] takes this link seriously by proposing that the function encoded by a pattern of neural activity r is, in fact, the likelihood function p(r|s). When this is the case, the precise form of the neural variability informs the nature of the neural code. For example, the exponential family of statistical models with linear sufficient statistics has been shown to be flexible enough to model the first and second order statistics of in vivo recordings in awake behaving monkeys[9, 11, 12] and anesthetized cats[13]. When the likelihood function is modeled in this way, the log posterior probability over the stimulus is linearly encoded by neural activity, i.e. log p(s|r) = h(s) · r − log Z(r) (1) Here, the stimulus dependent kernel, h(s), is a vector of functions of s, the dot represents a standard dot product, and Z(r) is the partition function which serves to normalize the posterior. This log linear form for a posterior distribution is highly computationally convenient and allows for evidence integration to be implemented via linear operations on neural activity[14, 8]. Proponents of this kind of linear PPC have demonstrated how to build biologically plausible neural networks capable of implementing the operations of probabilistic inference that are needed to optimally perform the behavioural tasks listed above. This includes, linear PPC implementations of cue combination[8], evidence integration over time, maximum likelihood and maximum a posterior estimation[9], coordinate transformation/auditory localization[10], object tracking/Kalman filtering[10], explaining away[10], and visual search[15]. Moreover, each of these neural computations has required only a single recurrently connected layer of neurons that is capable of just two non-linear operations: coincidence detection and divisive normalization, both of which are widely observed in cortex[16, 17]. Unfortunately, this research program has been a piecemeal effort that has largely proceeded by building neural networks designed deal with particular problems. As a result, there have been no proposals for a general principle by which neural network implementations of linear PPCs might be generated and no suggestions regarding how to deal with complex (intractable) problems of probabilistic inference. In this work, we will partially address this short coming by showing that Variation Bayesian Expectation Maximization (VBEM) algorithm provides a general scheme for approximate inference and learning with linear PPCs. In section 2, we briefly review the VBEM algorithm and show how it naturally leads to a linear PPC representation of the posterior as well as constraints on the neural network dynamics which build that PPC representation. Because this section describes the VB-PPC approach rather abstractly, the remainder of the paper is dedicated to concrete applications. As a motivating example, we consider the problem of inferring the concentrations of odors in an olfactory scene from a complex pattern of spikes in a population of olfactory receptor neurons (ORNs). In section 3, we argue that this requires solving a spike pattern demixing problem which is indicative of the generic problem faced by many layers of cortex. 
We then show that this demixing problem is equivalent to the problem addressed by a class of models for text documents know as probabilistic topic models, in particular Latent Dirichlet Allocation or LDA[18]. In section 4, we apply the VB-PPC approach to build a neural network implementation of probabilistic inference and learning for LDA. This derivation shows that causal inference with linear PPC’s also critically relies on divisive normalization. This result suggests that this particular non-linearity may be involved in very general and fundamental probabilistic computation, rather than simply playing a role in gain modulation. In this section, we also show how this formulation allows for a probabilistic treatment of learning and show that a simple variation of Hebb’s rule can implement Bayesian learning in neural circuits. 2 We conclude this work by generalizing this approach to time varying inputs by introducing the Dynamic Document Model (DDM) which can infer short term fluctuations in the concentrations of individual topics/odors and can be used to model foraging and other tracking tasks. 2 Variational Bayesian Inference with linear Probabilistic Population Codes Variational Bayesian (VB) inference refers to a class of deterministic methods for approximating the intractable integrals which arise in the context of probabilistic reasoning. Properly implemented it can result a fast alternative to sampling based methods of inference such as MCMC[19] sampling. Generically, the goal of any Bayesian inference algorithm is to infer a posterior distribution over behaviourally relevant latent variables Z given observations X and a generative model which specifies the joint distribution p(X, Θ, Z). This task is confounded by the fact that the generative model includes latent parameters Θ which must be marginalized out, i.e. we wish to compute, p(Z|X) ∝ p(X, Θ, Z)dΘ (2) When the number of latent parameters is large this integral can be quite unwieldy. The VB algorithms simplify this marginalization by approximating the complex joint distribution over behaviourally relevant latents and parameters, p(Θ, Z|X), with a distribution q(Θ, Z) for which integrals of this form are easier to deal with in some sense. There is some art to choosing the particular form for the approximating distribution to make the above integral tractable, however, a factorized approximation is common, i.e. q(Θ, Z) = qΘ (Θ)qZ (Z). Regardless, for any given observation X, the approximate posterior is found by minimizing the Kullback-Leibler divergence between q(Θ, Z) and p(Θ, Z|X). When a factorized posterior is assumed, the Variational Bayesian Expectation Maximization (VBEM) algorithm finds a local minimum of the KL divergence by iteratively updating, qΘ (Θ) and qZ (Z) according to the scheme n log qΘ (Θ) ∼ log p(X, Θ, Z) n qZ (Z) and n+1 log qZ (Z) ∼ log p(X, Θ, Z) n qΘ (Θ) (3) Here the brackets indicate an expected value taken with respect to the subscripted probability distribution function and the tilde indicates equality up to a constant which is independent of Θ and Z. The key property to note here is that the approximate posterior which results from this procedure is in an exponential family form and is therefore representable by a linear PPC (Eq. 1). This feature allows for the straightforward construction of networks which implement the VBEM algorithm with linear PPC’s in the following way. 
If rn and rn are patterns of activity that use a linear PPC representation Θ Z of the relevant posteriors, then n log qΘ (Θ) ∼ hΘ (Θ) · rn Θ and n+1 log qZ (Z) ∼ hZ (Z) · rn+1 . Z (4) Here the stimulus dependent kernels hZ (Z) and hΘ (Θ) are chosen so that their outer product results in a basis that spans the function space on Z × Θ given by log p(X, Θ, Z) for every X. This choice guarantees that there exist functions fΘ (X, rn ) and fZ (X, rn ) such that Z Θ rn = fΘ (X, rn ) Θ Z and rn+1 = fZ (X, rn ) Θ Z (5) satisfy Eq. 3. When this is the case, simply iterating the discrete dynamical system described by Eq. 5 until convergence will find the VBEM approximation to the posterior. This is one way to build a neural network implementation of the VB algorithm. However, its not the only way. In general, any dynamical system which has stable fixed points in common with Eq. 5 can also be said to implement the VBEM algorithm. In the example below we will take advantage of this flexibility in order to build biologically plausible neural network implementations. 3 Response! to Mixture ! of Odors! Single Odor Response Cause Intensity Figure 1: (Left) Each cause (e.g. coffee) in isolation results in a pattern of neural activity (top). When multiple causes contribute to a scene this results in an overall pattern of neural activity which is a mixture of these patterns weighted by the intensities (bottom). (Right) The resulting pattern can be represented by a raster, where each spike is colored by its corresponding latent cause. 3 Probabilistic Topic Models for Spike Train Demixing Consider the problem of odor identification depicted in Fig. 1. A typical mammalian olfactory system consists of a few hundred different types of olfactory receptor neurons (ORNs), each of which responds to a wide range of volatile chemicals. This results in a highly distributed code for each odor. Since, a typical olfactory scene consists of many different odors at different concentrations, the pattern of ORN spike trains represents a complex mixture. Described in this way, it is easy to see that the problem faced by early olfactory cortex can be described as the task of demixing spike trains to infer latent causes (odor intensities). In many ways this olfactory problem is a generic problem faced by each cortical layer as it tries to make sense of the activity of the neurons in the layer below. The input patterns of activity consist of spikes (or spike counts) labeled by the axons which deliver them and summarized by a histogram which indicates how many spikes come from each input neuron. Of course, just because a spike came from a particular neuron does not mean that it had a particular cause, just as any particular ORN spike could have been caused by any one of a large number of volatile chemicals. Like olfactory codes, cortical codes are often distributed and multiple latent causes can be present at the same time. Regardless, this spike or histogram demixing problem is formally equivalent to a class of demixing problems which arise in the context of probabilistic topic models used for document modeling. A simple but successful example of this kind of topic model is called Latent Dirichlet Allocation (LDA) [18]. LDA assumes that word order in documents is irrelevant and, therefore, models documents as histograms of word counts. It also assumes that there are K topics and that each of these topics appears in different proportions in each document, e.g. 
80% of the words in a document might be concerned with coffee and 20% with strawberries. Words from a given topic are themselves drawn from a distribution over words associated with that topic, e.g. when talking about coffee you have a 5% chance of using the word ’bitter’. The goal of LDA is to infer both the distribution over topics discussed in each document and the distribution of words associated with each topic. We can map the generative model for LDA onto the task of spike demixing in cortex by letting topics become latent causes or odors, words become neurons, word occurrences become spikes, word distributions associated with each topic become patterns of neural activity associated with each cause, and different documents become the observed patterns of neural activity on different trials. This equivalence is made explicit in Fig. 2 which describes the standard generative model for LDA applied to documents on the left and mixtures of spikes on the right. 4 LDA Inference and Network Implementation In this section we will apply the VB-PPC formulation to build a biologically plausible network capable of approximating probabilistic inference for spike pattern demixing. For simplicity, we will use the equivalent Gamma-Poisson formulation of LDA which directly models word and topic counts 4 1. For each topic k = 1, . . . , K, (a) Distribution over words βk ∼ Dirichlet(η0 ) 2. For document d = 1, . . . , D, (a) Distribution over topics θd ∼ Dirichlet(α0 ) (b) For word m = 1, . . . , Ωd i. Topic assignment zd,m ∼ Multinomial(θd ) ii. Word assignment ωd,m ∼ Multinomial(βzm ) 1. For latent cause k = 1, . . . , K, (a) Pattern of neural activity βk ∼ Dirichlet(η0 ) 2. For scene d = 1, . . . , D, (a) Relative intensity of each cause θd ∼ Dirichlet(α0 ) (b) For spike m = 1, . . . , Ωd i. Cause assignment zd,m ∼ Multinomial(θd ) ii. Neuron assignment ωd,m ∼ Multinomial(βzm ) Figure 2: (Left) The LDA generative model in the context of document modeling. (Right) The corresponding LDA generative model mapped onto the problem of spike demixing. Text related attributes on the left, in red, have been replaced with neural attributes on the right, in green. rather than topic assignments. Specifically, we define, Rd,j to be the number of times neuron j fires during trial d. Similarly, we let Nd,j,k to be the number of times a spike in neuron j comes from cause k in trial d. These new variables play the roles of the cause and neuron assignment variables, zd,m and ωd,m by simply counting them up. If we let cd,k be an un-normalized intensity of cause j such that θd,k = cd,k / k cd,k then the generative model, Rd,j = k Nd,j,k Nd,j,k ∼ Poisson(βj,k cd,k ) 0 cd,k ∼ Gamma(αk , C −1 ). (6) is equivalent to the topic models described above. Here the parameter C is a scale parameter which sets the expected total number of spikes from the population on each trial. Note that, the problem of inferring the wj,k and cd,k is a non-negative matrix factorization problem similar to that considered by Lee and Seung[20]. The primary difference is that, here, we are attempting to infer a probability distribution over these quantities rather than maximum likelihood estimates. See supplement for details. Following the prescription laid out in section 2, we approximate the posterior over latent variables given a set of input patterns, Rd , d = 1, . . . , D, with a factorized distribution of the form, qN (N)qc (c)qβ (β). 
This results in marginal posterior distributions q (β:,k |η:,k ), q cd,k |αd,k , C −1 + 1 ), and q (Nd,j,: | log pd,j,: , Rd,i ) which are Dirichlet, Gamma, and Multinomial respectively. Here, the parameters η:,k , αd,k , and log pd,j,: are the natural parameters of these distributions. The VBEM update algorithm yields update rules for these parameters which are summarized in Fig. 3 Algorithm1. Algorithm 1: Batch VB updates 1: while ηj,k not converged do 2: for d = 1, · · · , D do 3: while pd,j,k , αd,k not converged do 4: αd,k → α0 + j Rd,j pd,j,k 5: pd,j,k → Algorithm 2: Online VB updates 1: for d = 1, · · · , D do 2: reinitialize pj,k , αk ∀j, k 3: while pj,k , αk not converged do 4: αk → α0 + j Rd,j pj,k 5: pj,k → exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αk ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αi ) exp (ψ(ηj,k )−ψ(¯k )) exp ψ(αd,k ) η η i exp (ψ(ηj,i )−ψ(¯i )) exp ψ(αd,i ) 6: end while 7: end for 8: ηj,k = η 0 + 9: end while end while ηj,k → (1 − dt)ηj,k + dt(η 0 + Rd,j pj,k ) 8: end for 6: 7: d Rd,j pd,j,k Figure 3: Here ηk = j ηj,k and ψ(x) is the digamma function so that exp ψ(x) is a smoothed ¯ threshold linear function. Before we move on to the neural network implementation, note that this standard formulation of variational inference for LDA utilizes a batch learning scheme that is not biologically plausible. Fortunately, an online version of this variational algorithm was recently proposed and shown to give 5 superior results when compared to the batch learning algorithm[21]. This algorithm replaces the sum over d in update equation for ηj,k with an incremental update based upon only the most recently observed pattern of spikes. See Fig. 3 Algorithm 2. 4.1 Neural Network Implementation Recall that the goal was to build a neural network that implements the VBEM algorithm for the underlying latent causes of a mixture of spikes using a neural code that represents the posterior distribution via a linear PPC. A linear PPC represents the natural parameters of a posterior distribution via a linear operation on neural activity. Since the primary quantity of interest here is the posterior distribution over odor concentrations, qc (c|α), this means that we need a pattern of activity rα which is linearly related to the αk ’s in the equations above. One way to accomplish this is to simply assume that the firing rates of output neurons are equal to the positive valued αk parameters. Fig. 4 depicts the overall network architecture. Input patterns of activity, R, are transmitted to the synapses of a population of output neurons which represent the αk ’s. The output activity is pooled to ¯ form an un-normalized prediction of the activity of each input neuron, Rj , given the output layer’s current state of belief about the latent causes of the Rj . The activity at each synapse targeted by input neuron j is then inhibited divisively by this prediction. This results in a dendrite that reports to the ¯ soma a quantity, Nj,k , which represents the fraction of unexplained spikes from input neuron j that could be explained by latent cause k. A continuous time dynamical system with this feature and the property that it shares its fixed points with the LDA algorithm is given by d ¯ Nj,k dt d αk dt ¯ ¯ = wj,k Rj − Rj Nj,k = (7) ¯ Nj,k exp (ψ (¯k )) (α0 − αk ) + exp (ψ (αk )) η (8) i ¯ where Rj = k wj,k exp (ψ (αk )), and wj,k = exp (ψ (ηj,k )). Note that, despite its form, it is Eq. 7 which implements the required divisive normalization operation since, in the steady state, ¯ ¯ Nj,k = wj,k Rj /Rj . 
Regardless, this network has a variety of interesting properties that align well with biology. It predicts that a balance of excitation and inhibition is maintained in the dendrites via divisive normalization and that the role of inhibitory neurons is to predict the input spikes which target individual dendrites. It also predicts superlinear facilitation. Specifically, the final term on the right of Eq. 8 indicates that more active cells will be more sensitive to their dendritic inputs. Alternatively, this could be implemented via recurrent excitation at the population level. In either case, this is the mechanism by which the network implements a sparse prior on topic concentrations and stands in stark contrast to the winner take all mechanisms which rely on competitive mutual inhibition mechanisms. Additionally, the ηj in Eq. 8 represents a cell wide ’leak’ parameter that indicates that the total leak should be ¯ roughly proportional to the sum total weight of the synapses which drive the neuron. This predicts that cells that are highly sensitive to input should also decay back to baseline more quickly. This implementation also predicts Hebbian learning of synaptic weights. To observe this fact, note that the online update rule for the ηj,k parameters can be implemented by simply correlating the activity at ¯ each synapse, Nj,k with activity at the soma αj via the equation: τL d ¯ wj,k = exp (ψ (¯k )) (η0 − 1/2 − wj,k ) + Nj,k exp ψ (αk ) η dt (9) where τL is a long time constant for learning and we have used the fact that exp (ψ (ηjk )) ≈ ηjk −1/2 for x > 1. For a detailed derivation see the supplementary material. 5 Dynamic Document Model LDA is a rather simple generative model that makes several unrealistic assumptions about mixtures of sensory and cortical spikes. In particular, it assumes both that there are no correlations between the 6 Targeted Divisive Normalization Targeted Divisive Normalization αj Ri Input Neurons Recurrent Connections ÷ ÷ -1 -1 Σ μj Nij Ri Synapses Output Neurons Figure 4: The LDA network model. Dendritically targeted inhibition is pooled from the activity of all neurons in the output layer and acts divisively. Σ jj' Nij Input Neurons Synapses Output Neurons Figure 5: DDM network model also includes recurrent connections which target the soma with both a linear excitatory signal and an inhibitory signal that also takes the form of a divisive normalization. intensities of latent causes and that there are no correlations between the intensities of latent causes in temporally adjacent trials or scenes. This makes LDA a rather poor computational model for a task like olfactory foraging which requires the animal to track the rise a fall of odor intensities as it navigates its environment. We can model this more complicated task by replacing the static cause or odor intensity parameters with dynamic odor intensity parameters whose behavior is governed by an exponentiated Ornstein-Uhlenbeck process with drift and diffusion matrices given by (Λ and ΣD ). We call this variant of LDA the Dynamic Document Model (DDM) as it could be used to model smooth changes in the distribution of topics over the course of a single document. 5.1 DDM Model Thus the generative model for the DDM is as follows: 1. For latent cause k = 1, . . . , K, (a) Cause distribution over spikes βk ∼ Dirichlet(η0 ) 2. For scene t = 1, . . . 
, T , (a) Log intensity of causes c(t) ∼ Normal(Λct−1 , ΣD ) (b) Number of spikes in neuron j resulting from cause k, Nj,k (t) ∼ Poisson(βj,k exp ck (t)) (c) Number of spikes in neuron j, Rj (t) = k Nj,k (t) This model bears many similarities to the Correlated and Dynamic topic models[22], but models dynamics over a short time scale, where the dynamic relationship (Λ, ΣD ) is important. 5.2 Network Implementation Once again the quantity of interest is the current distribution of latent causes, p(c(t)|R(τ ), τ = 0..T ). If no spikes occur then no evidence is presented and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, ΣD ). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that neural responses are linearly related to the inverse covariance matrix of the posterior as well as that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization[10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5. 7 Natural Parameters Natural Parameters (α) 0.4 200 450 180 0.3 Network Estimate Network Estimate 500 400 350 300 250 200 150 100 0.1 0 50 100 150 200 250 300 350 400 450 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 140 120 0.4 0.3 100 0.2 80 0.1 0 60 40 0.4 20 50 0 0 0.2 160 0 0 0.3 0.2 20 40 60 80 100 120 VBEM Estimate VBEM Estimate 140 160 180 200 0.1 0 Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α’s) as a function of the VBEM estimate of those same parameters for a variety of ’documents’. (Center) Same as left, but for the natural parameters of the DDM (i.e the entries of the matrix Σ−1 (t) and Σ−1 µ(t) of the distribution over log topic intensities. (Right) Three example traces for cause intensity in the DDM. Black shows true concentration, blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms. 6 Experimental Results We compared the PPC neural network implementations of the variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution due to the fact that we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm. As a result, it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 
This model bears many similarities to the Correlated and Dynamic topic models [22], but models dynamics over a short time scale, where the dynamic relationship (Λ, Σ_D) is important.

5.2 Network Implementation

Once again the quantity of interest is the current distribution of latent causes, p(c(t) | R(τ), τ = 0..T). If no spikes occur then no evidence is presented, and posterior inference over c(t) is simply given by an undriven Kalman filter with parameters (Λ, Σ_D). A recurrent neural network which uses a linear PPC to encode a posterior that evolves according to a Kalman filter has the property that the neural responses are linearly related to the inverse covariance matrix of the posterior as well as to that inverse covariance matrix times the posterior mean. In the absence of evidence, it is easy to show that these quantities must evolve according to recurrent dynamics which implement divisive normalization [10]. Thus, the patterns of neural activity which linearly encode them must do so as well. When a new spike arrives, optimal inference is no longer possible and a variational approximation must be utilized. As is shown in the supplement, this variational approximation is similar to the variational approximation used for LDA. As a result, a network which can divisively inhibit its synapses is able to implement approximate Bayesian inference. Curiously, this implies that the addition of spatial and temporal correlations to the latent causes adds very little complexity to the VB-PPC network implementation of probabilistic inference. All that is required is an additional inhibitory population which targets the somata in the output population. See Fig. 5.
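To illustrate the claim that, in the absence of spikes, the posterior over c(t) evolves according to an undriven Kalman filter on the natural parameters (the inverse covariance and the inverse covariance times the mean), the sketch below propagates those quantities through a few spike-free scenes. The drift and diffusion values are assumed for illustration, and this shows only the underlying computation, not the recurrent network dynamics posited to realize it.

    # Undriven (no-spike) Kalman prediction over log intensities c(t), written in
    # the natural/information form a linear PPC is taken to encode:
    # J_info = Sigma^{-1} and h = Sigma^{-1} mu. Lambda and Sigma_D are assumed values.
    import numpy as np

    def predict_information_form(J_info, h, Lam, Sigma_D):
        Sigma = np.linalg.inv(J_info)            # back to moment form
        mu = Sigma @ h
        mu_pred = Lam @ mu                       # standard Kalman predict step
        Sigma_pred = Lam @ Sigma @ Lam.T + Sigma_D
        J_pred = np.linalg.inv(Sigma_pred)       # back to information form
        return J_pred, J_pred @ mu_pred

    K = 3
    Lam = 0.9 * np.eye(K)
    Sigma_D = 0.05 * np.eye(K)
    J_info, h = np.eye(K), np.zeros(K)           # an arbitrary starting posterior
    for _ in range(5):                           # five scenes with no observed spikes
        J_info, h = predict_information_form(J_info, h, Lam, Sigma_D)

With no evidence the posterior simply relaxes toward the stationary distribution of the Ornstein-Uhlenbeck process; as noted above, such updates can be tracked by recurrent dynamics with divisive normalization [10].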
Figure 6: (Left) Neural network approximation to the natural parameters of the posterior distribution over topics (the α's) as a function of the VBEM estimate of those same parameters for a variety of 'documents'. (Center) Same as left, but for the natural parameters of the DDM (i.e., the entries of Σ^{-1}(t) and Σ^{-1}µ(t)) of the distribution over log topic intensities. (Right) Three example traces of cause intensity in the DDM. Black shows the true concentration; blue and red (indistinguishable) show MAP estimates for the network and VBEM algorithms.

6 Experimental Results

We compared the PPC neural network implementations of variational inference with the standard VBEM algorithm. This comparison is necessary because the two algorithms are not guaranteed to converge to the same solution: we only required that the neural network dynamics have the same fixed points as the standard VBEM algorithm, so it is possible for the two algorithms to converge to different local minima of the KL divergence. For the network implementation of LDA we find good agreement between the neural network and VBEM estimates of the natural parameters of the posterior. See Fig. 6 (left), which shows the two algorithms' estimates of the shape parameter of the posterior distribution over topic (odor) concentrations (a quantity which is proportional to the expected concentration). This agreement, however, is not perfect, especially when the predicted posterior concentrations are low. In part, this is because we are presenting the network with difficult inference problems for which the true posterior distribution over topics (odors) is highly correlated and multimodal; as a result, the objective function (KL divergence) is littered with local minima. Additionally, the discrete iterations of the VBEM algorithm can take very large steps in the space of natural parameters while the neural network implementation cannot. In contrast, the network implementation of the DDM is in much better agreement with the VBEM estimates. See Fig. 6 (right). This is because the smooth temporal dynamics of the topics eliminate the need for the VBEM algorithm to take large steps, so the smooth network dynamics are better able to accurately track the VBEM algorithm's output. For simulation details please see the supplement.

7 Discussion and Conclusion

In this work we presented a general framework for inference and learning with linear Probabilistic Population Codes. This framework takes advantage of the fact that the Variational Bayesian Expectation Maximization algorithm generates approximate posterior distributions which are in exponential family form. This is precisely the form needed in order to make probability distributions representable by a linear PPC. We then outlined a general means by which one can build a neural network implementation of the VB algorithm using this kind of neural code. We applied this VB-PPC framework to generate a biologically plausible neural network for spike train demixing. We chose this problem because it has many of the features of the canonical problem faced by nearly every layer of cortex, i.e., that of inferring the latent causes of complex mixtures of spike trains in the layer below. Curiously, this very complicated problem of probabilistic inference and learning ended up having a remarkably simple network solution, requiring only that neurons be capable of implementing divisive normalization via dendritically targeted inhibition and superlinear facilitation. Moreover, we showed that extending this approach to the more complex dynamic case, in which latent causes change in intensity over time, does not substantially increase the complexity of the neural circuit. Finally, we would like to note that, while we utilized a rate coding scheme for our linear PPC, the basic equations would still apply to any spike-based log probability code such as that considered by Boerlin and Deneve [23].

8 References

[1] Daniel Kersten, Pascal Mamassian, and Alan Yuille. Object perception as Bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[2] Marc O. Ernst and Martin S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, 2002.
[3] Yair Weiss, Eero P. Simoncelli, and Edward H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, 2002.
[4] P. N. Sabes. The planning and control of reaching movements. Current Opinion in Neurobiology, 10(6):740–6, 2000.
[5] Konrad P. Körding and Daniel M. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(6971):244–7, 2004.
[6] Emanuel Todorov. Optimality principles in sensorimotor control. Nature Neuroscience, 7(9):907–15, 2004.
[7] Ernő Téglás, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B. Tenenbaum, and Luca L. Bonatti. Pure reasoning in 12-month-old infants as probabilistic inference. Science, 332(6033):1054–9, 2011.
[8] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 2006.
[9] Jeffrey M. Beck, Wei Ji Ma, Roozbeh Kiani, Tim Hanks, Anne K. Churchland, Jamie Roitman, Michael N. Shadlen, Peter E. Latham, and Alexandre Pouget. Probabilistic population codes for Bayesian decision making. Neuron, 60(6):1142–52, 2008.
[10] J. M. Beck, P. E. Latham, and A. Pouget. Marginalization in neural circuits with divisive normalization. Journal of Neuroscience, 31(43):15310–15319, 2011.
[11] Tianming Yang and Michael N. Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–80, 2007.
[12] R. H. S. Carpenter and M. L. L. Williams. Neural computation of log likelihood in control of saccadic eye movements. Nature, 1995.
[13] Arnulf B. A. Graf, Adam Kohn, Mehrdad Jazayeri, and J. Anthony Movshon. Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2):239–45, 2011.
[14] H. B. Barlow. Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 1969.
[15] Wei Ji Ma, Vidhya Navalpakkam, Jeffrey M. Beck, Ronald van den Berg, and Alexandre Pouget. Behavior and neural basis of near-optimal visual search. Nature Neuroscience, 2011.
[16] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 1992.
[17] M. Carandini, D. J. Heeger, and J. A. Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21):8621–44, 1997.
[18] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[19] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003.
[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–91, 1999.
[21] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[22] D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006.
[23] M. Boerlin and S. Deneve. Spike-based population coding and working memory. PLOS Computational Biology, 2011.

6 0.62102258 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

7 0.60085553 12 nips-2012-A Neural Autoregressive Topic Model

8 0.58968741 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

9 0.58443171 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

10 0.54551589 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

11 0.5187251 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.50903028 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

13 0.49897039 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

14 0.49008453 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

15 0.48553187 298 nips-2012-Scalable Inference of Overlapping Communities

16 0.47371489 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

17 0.46546382 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

18 0.4598473 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

19 0.45961794 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

20 0.44346926 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.101), (17, 0.019), (21, 0.041), (35, 0.189), (38, 0.118), (39, 0.036), (42, 0.021), (53, 0.011), (54, 0.017), (55, 0.022), (63, 0.019), (74, 0.052), (76, 0.13), (80, 0.093), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82038194 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

2 0.80183744 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

Author: Jason Pacheco, Erik B. Sudderth

Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation.

3 0.80104935 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

Author: Nitish Srivastava, Ruslan Salakhutdinov

Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.

4 0.77044427 291 nips-2012-Reducing statistical time-series problems to binary classification

Author: Daniil Ryabko, Jeremie Mary

Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data.

5 0.75919229 192 nips-2012-Learning the Dependency Structure of Latent Factors

Author: Yunlong He, Yanjun Qi, Koray Kavukcuoglu, Haesun Park

Abstract: In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model SLFA is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lower-dimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic and two real-world data sets demonstrate that pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve the state-of-the-art classification performance.

6 0.75868148 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

7 0.74689829 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

8 0.74650836 233 nips-2012-Multiresolution Gaussian Processes

9 0.74278456 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

10 0.74231553 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

11 0.74141318 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

12 0.74040711 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

13 0.74003404 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

14 0.73771787 12 nips-2012-A Neural Autoregressive Topic Model

15 0.73715138 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

16 0.73707902 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

17 0.73461378 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

18 0.73429018 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

19 0.73353678 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

20 0.73117357 282 nips-2012-Proximal Newton-type methods for convex optimization