nips nips2013 nips2013-317 knowledge-graph by maker-knowledge-mining

317 nips-2013-Streaming Variational Bayes


Source: pdf

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. [sent-8, score-0.497]

2 We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. [sent-9, score-0.206]

3 We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. [sent-10, score-0.401]

4 (Section 1) Introduction: Large, streaming data sets are increasingly the norm in science and technology. [sent-11, score-0.223]

5 Simple descriptive statistics can often be readily computed with a constant number of operations for each data point in the streaming setting, without the need to revisit past data or have advance knowledge of future data. [sent-12, score-0.309]

6 An exception to this statement is provided by [3–5], who have shown that a class of approximation methods known as variational Bayes (VB) [6] can be usefully deployed for large-scale data sets. [sent-21, score-0.115]

7 They have applied their approach, referred to as stochastic variational inference (SVI), to the domain of topic modeling of document collections, an area with a major need for scalable inference algorithms. [sent-22, score-0.294]

8 VB traditionally uses the variational lower bound on the marginal likelihood as an objective function, and the idea of SVI is to apply a variant of stochastic gradient descent to this objective. [sent-23, score-0.145]

9 This objective depends on the total number of data points D (i.e., documents in the topic model setting), for a fixed value of D. [sent-26, score-0.144]

10 Although the stochastic gradient is computed for a single, small subset of data points (documents) at a time, the posterior being targeted is a posterior for D data points. [sent-27, score-0.443]

11 We view this lack of a link between the number of documents that have been processed thus far and the posterior that is being targeted as undesirable in many settings involving streaming data. [sent-30, score-0.539]

12 In this paper we aim at an approximate Bayesian inference algorithm that is scalable like SVI but is also truly a streaming procedure, in that it yields an approximate posterior for each processed collection of data points—and not just a pre-specified “final” number of data points D. [sent-31, score-0.57]

13 Although the empirical success of SVI is the main motivation for our work, we are also motivated by recent developments in computer architectures, which permit distributed and asynchronous computations in addition to streaming computations. [sent-37, score-0.36]

14 As we will show, a streaming VB algorithm naturally lends itself to distributed and asynchronous implementations. [sent-38, score-0.33]

15 (Section 2) Streaming, distributed, asynchronous Bayesian updating: Streaming Bayesian updating. [sent-39, score-0.107]

16 Then Bayes' theorem gives us the posterior distribution of Θ given a collection of S data points, C1 := (x1 , . . . , xS ). [sent-45, score-0.191]

17 Given the posterior after the first b − 1 minibatches, p(Θ | C1 , . . . , Cb−1 ), we can calculate the posterior after the bth minibatch: p(Θ | C1 , . . . , Cb ) ∝ p(Cb | Θ) p(Θ | C1 , . . . , Cb−1 ). (1) [sent-56, score-0.271]

18 That is, we treat the posterior after b − 1 minibatches as the new prior for the incoming data points. [sent-63, score-0.267]

19 If we can save the posterior from b − 1 minibatches and calculate the normalizing constant for the bth posterior, repeated application of Eq. (1) is streaming. [sent-64, score-0.352]

20 Eq. (1) automatically gives us the new posterior without needing to revisit old data points. [sent-65, score-0.233]

21 In complex models, it is often infeasible to calculate the posterior exactly, and an approximation must be used. [sent-66, score-0.263]

22 Suppose that, given a prior p(Θ) and data minibatch C, we have an approximation algorithm A that calculates an approximate posterior q: q(Θ) = A(C, p(Θ)). [sent-67, score-0.495]

23 Then, setting q0 (Θ) = p(Θ), one way to recursively calculate an approximation to the posterior is p(Θ | C1 , . . . , Cb ) ≈ qb (Θ) = A(Cb , qb−1 (Θ)). (2) [sent-68, score-0.263]

24 When A yields the posterior from Bayes' theorem, this calculation is exact. [sent-72, score-0.191]
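To make the recursion concrete, the following is a minimal Python sketch of the update in Eq. (2); `approx_posterior` stands in for an arbitrary user-supplied batch primitive A, and the names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of Eq. (2): q_b(Theta) = A(C_b, q_{b-1}(Theta)).
# `approx_posterior` is a placeholder for any batch primitive A (e.g., batch VB).

def stream_posterior(minibatches, prior, approx_posterior):
    """Yield an approximate posterior after each minibatch."""
    q = prior                          # q_0(Theta) = p(Theta)
    for C_b in minibatches:
        q = approx_posterior(C_b, q)   # the old posterior acts as the new prior
        yield q
```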

25 The updates in Eq. (2) handle streaming data in theory, but in practice, the A calculation might take longer than the time interval between minibatch arrivals or simply take longer than desired. [sent-81, score-0.449]

26 That is, per Eq. (3), we can calculate the individual minibatch posteriors p(Θ | Cb ), perhaps in parallel, and then combine them to find the full posterior p(Θ | C1 , . . . , CB ). [sent-88, score-0.517]

27 We suppose further that A always returns a distribution in the same exponential family; in particular, we suppose that there exists some parameter ξb such that qb (Θ) ∝ exp{ξb · T (Θ)} for qb (Θ) = A(Cb , p(Θ)). [sent-99, score-0.175]

28 p(Θ | C1 , . . . , CB ) ≈ q(Θ) ∝ exp{ [ξ0 + Σ_{b=1}^{B} (ξb − ξ0 )] · T (Θ) }, (6) where the normalizing constant is readily obtained from the exponential family form. [sent-104, score-0.107]

29 In what follows we use the shorthand ξ ← A(C, ξ0 ) to denote that A takes as input a minibatch C and a prior with exponential family parameter ξ0 and that it returns a distribution in the same exponential family with parameter ξ. [sent-105, score-0.455]

30 To approximate p(Θ | C1 , . . . , CB ), we first calculate ξb via the approximation primitive A for each minibatch Cb ; note that these calculations may be performed in parallel. [sent-109, score-0.358]

31 Then we sum together the quantities ξb − ξ0 across b, along with the initial ξ0 from the prior, to find the final exponential family parameter to the full posterior approximation q. [sent-110, score-0.287]
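A minimal sketch of this combination under the exponential-family assumption, treating each natural parameter ξ as a NumPy array; `A` stands for the user's batch primitive and these names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Sketch of the combination in Eq. (6): compute xi_b = A(C_b, xi_0) for each
# minibatch (these calls are independent and could run in parallel), then sum
# the differences xi_b - xi_0 onto the prior parameter xi_0.

def combine_minibatch_posteriors(A, xi_0, minibatches):
    xi_bs = [A(C_b, xi_0) for C_b in minibatches]        # independent, parallelizable
    return xi_0 + sum(xi_b - xi_0 for xi_b in xi_bs)     # Eq. (6)
```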

32 We previously saw that the general Bayes sequential update can be made streaming by iterating with the old posterior as the new prior (Eq. (1)). [sent-111, score-0.442]

33 Similarly, here we see that the full posterior approximation q is in the same exponential family as the prior, so one may iterate these parallel computations to arrive at a parallelized algorithm for streaming posterior computation. [sent-113, score-0.731]

34 We emphasize that while these updates are reminiscent of prior-posterior conjugacy, it is actually the approximate posteriors and single, original prior that we assume belong to the same exponential family. [sent-114, score-0.176]

35 It is not necessary to assume any conjugacy in the generative model itself nor that any true intermediate or final posterior take any particular limited form. [sent-115, score-0.191]

36 When a worker finishes, it reports its solution to a single master processor. [sent-121, score-0.196]

37 If the master gives the worker a new subproblem without waiting for the other workers to finish, it can decrease downtime in the system. [sent-122, score-0.244]

38 Our asynchronous algorithm is in the spirit of Hogwild! [sent-123, score-0.107]

39 To present the algorithm we first describe an asynchronous computation that we will not use in practice, but which will serve as a conceptual stepping stone. [sent-125, score-0.107]

40 Have each worker continuously iterate between three steps: (1) collect a new minibatch C, (2) compute the local approximate posterior ξ ← A(C, ξ0 ), and (3) return ∆ξ := ξ − ξ0 to the master. [sent-128, score-0.565]

41 The master, in turn, starts by assigning the posterior to equal the prior: ξ (post) ← ξ0 . [sent-129, score-0.191]

42 Each time the master receives a quantity ∆ξ from any worker, it updates the posterior synchronously: ξ (post) ← ξ (post) + ∆ξ. [sent-130, score-0.334]

43 If A returns the exponential family parameter of the true posterior (rather than an approximation), then the posterior at the master is exact by Eq. [sent-131, score-0.592]

44 The master initializes its posterior estimate to the prior: ξ (post) ← ξ0 . [sent-134, score-0.303]

45 Each worker continuously iterates between four steps: (1) collect a new minibatch C, (2) copy the master posterior value locally ξ (local) ← ξ (post) , (3) compute the local approximate posterior ξ ← A(C, ξ (local) ), and (4) return ∆ξ := ξ − ξ (local) to the master. [sent-135, score-0.868]

46 Each time the master receives a quantity ∆ξ from any worker, it updates the posterior synchronously: ξ (post) ← ξ (post) + ∆ξ. [sent-136, score-0.334]
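A minimal thread-based sketch of this asynchronous scheme; the lock, the minibatch queue, and all names here are implementation assumptions for illustration, not the authors' system.

```python
import threading

# Sketch of the asynchronous update described above. The master keeps xi_post and
# applies xi_post <- xi_post + delta_xi whenever a worker reports; each worker
# copies the current xi_post, runs the primitive A on a fresh minibatch, and
# reports the difference back to the master.

class Master:
    def __init__(self, xi_0):
        self.xi_post = xi_0.copy()
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.xi_post.copy()       # worker step (2): xi_local <- xi_post

    def update(self, delta_xi):
        with self._lock:
            self.xi_post += delta_xi         # xi_post <- xi_post + delta_xi


def worker(master, A, minibatch_queue):
    while True:
        C = minibatch_queue.get()            # step (1): collect a new minibatch
        if C is None:                        # sentinel to stop the worker
            break
        xi_local = master.read()             # step (2): copy the master posterior
        xi = A(C, xi_local)                  # step (3): local approximate posterior
        master.update(xi - xi_local)         # step (4): report delta_xi
```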

47 The key difference between the first and second frameworks proposed above is that, in the second, the latest posterior is used as a prior. [sent-137, score-0.191]

48 This latter framework is more in line with the streaming update of Eq. [sent-138, score-0.223]

49 Since ξ (post) might change at the master while the worker is computing ∆ξ, it is no longer the case that the posterior at the master is exact when A returns the exponential family parameter of the true posterior. [sent-140, score-0.597]

50 (Section 3) Case study: latent Dirichlet allocation. In what follows, we consider examples of the choices for the Θ prior and primitive A in the context of latent Dirichlet allocation (LDA) [13]. [sent-146, score-0.198]

51 Each word wdn belongs to a latent topic zdn chosen according to a document-specific distribution of topics θd = (θd1 , . . . , θdK ). [sent-153, score-0.229]

52 The posterior for LDA, p(β, θ, z | C, η, α), is equal to the following expression up to proportionality: ∝ Π_{k=1}^{K} Dirichlet(βk | ηk ) · Π_{d=1}^{D} Dirichlet(θd | α) · Π_{d=1}^{D} Π_{n=1}^{Nd} θ_{d, zdn} β_{zdn, wdn}. (7) [sent-159, score-0.191]

53 The posterior for just the global parameters p(β | C, η, α) can be obtained from p(β, θ, z | C, η, α) by integrating out the local, document-specific parameters θ, z. [sent-160, score-0.191]

54 Eq. (7) is intractable to compute, so the posterior must be approximated. [sent-162, score-0.191]

55 We consider two possibilities here: variational Bayes (VB) and expectation propagation (EP). [sent-166, score-0.116]

56 Both primitives take Dirichlet distributions as priors for β and both return Dirichlet distributions for the approximate posterior of the topic parameters β; thus the prior and approximate posterior are in the same exponential family. [sent-167, score-0.562]

57 [Algorithm listing: repeatedly collect a new data minibatch C; for each document indexed d in C, compute (γd , φd ) ← LocalVB(d, λ). The accompanying SVI listing takes hyperparameters η, α, D and step sizes (ρt ), t = 1, . . . , T , as input, outputs λ, initializes λ, and iterates over t.] [sent-201, score-0.31]
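As a sketch of what a LocalVB(d, λ)-style step does, the following uses the standard mean-field updates for LDA; the variable names (lam for λ, n_dv for word counts) and the fixed iteration count are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.special import digamma

# Sketch of a LocalVB(d, lambda)-style step: standard mean-field updates for LDA.
# lam is the K x V matrix of topic Dirichlet parameters; n_dv holds the length-V
# word counts of one document d.

def local_vb(n_dv, lam, alpha, n_iters=50):
    K, V = lam.shape
    gamma = np.ones(K)                                        # gamma_d: doc-topic Dirichlet
    Elog_beta = digamma(lam) - digamma(lam.sum(1, keepdims=True))
    for _ in range(n_iters):
        Elog_theta = digamma(gamma) - digamma(gamma.sum())
        log_phi = Elog_theta[:, None] + Elog_beta             # K x V, unnormalized
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                                # phi[:, v]: dist. over topics
        gamma = alpha + phi @ n_dv                            # gamma_dk = alpha + sum_v n_dv phi_dvk
    return gamma, phi
```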

58 Here, ndv represents the number of occurrences of word v in document d. [sent-206, score-0.164]

59 An EP [7] algorithm for approximating the LDA posterior appears in Alg. 6. [sent-208, score-0.191]

60 Alg. 6 differs from [14], which does not provide an approximate posterior for the topic parameters, and is instead our own derivation. [sent-213, score-0.276]

61 Stochastic variational inference (SVI) [3, 4] is exactly the application of a particular version of stochastic gradient descent to the same optimization problem. [sent-224, score-0.174]

62 While stochastic gradient descent can often be viewed as a streaming algorithm, the optimization problem itself here depends on D via pD , the posterior on D data points. [sent-225, score-0.466]

63 Nonetheless, while one may choose to visit D′ ≠ D data points or revisit data points when using SVI to estimate pD [3, 4], SVI can be made single-pass by visiting each of D data points exactly once and then has constant memory requirements. [sent-228, score-0.141]

64 An alternative single-pass (and indeed streaming) option would be to update the local parameters for each minibatch of documents as they arrive and then add the corresponding terms φdvk ndv to the current estimate of λ for each document d in the minibatch. [sent-234, score-0.477]
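A minimal sketch of this single-pass streaming option, reusing the illustrative `local_vb` routine above; again, these names are assumptions for illustration only.

```python
# Sketch of the single-pass streaming update described above: for each arriving
# minibatch, fit the local parameters per document and add the expected counts
# phi_dvk * n_dv onto the current estimate of lambda.

def streaming_lambda_update(lam, minibatch, alpha):
    for n_dv in minibatch:                      # n_dv: length-V counts for document d
        _, phi = local_vb(n_dv, lam, alpha)     # phi: K x V responsibilities
        lam = lam + phi * n_dv[None, :]         # lambda_kv += phi_dvk * n_dv
    return lam
```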

65 Table 1: A comparison of (1) log predictive probability of held-out data and (2) running time of four algorithms: SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, and SSU. [sent-256, score-0.118]

66 We calculate predictive probability by first setting aside held-out testing documents C (test) from the full corpus and then further setting aside a subset of held-out testing words Wd,test in each testing document d. [sent-259, score-0.287]

67 The remaining (training) documents C (train) are used to estimate the global parameter posterior q(β), and the remaining (training) words Wd,train within the dth testing document are used to estimate the document-specific parameter posterior q(θd ). [sent-260, score-0.527]

68 To calculate predictive probability, an approximation is necessary since we do not know the predictive distribution—just as we seek to learn the posterior distribution. [sent-261, score-0.447]
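One common plug-in approximation, shown here only as an illustrative sketch (the paper's exact estimator may differ), replaces the unknown predictive distribution with posterior means from q(β) and q(θd); all names below are assumptions.

```python
import numpy as np

# Plug-in sketch: approximate p(w_test | training data) by
# sum_k E_q[theta_dk] E_q[beta_kw], using the Dirichlet posterior means.
# lam (K x V) and gamma_d (length K) are the variational Dirichlet parameters
# for beta and theta_d; test_word_ids indexes the held-out words of document d.

def heldout_log_predictive(test_word_ids, gamma_d, lam):
    theta_hat = gamma_d / gamma_d.sum()              # E_q[theta_d]
    beta_hat = lam / lam.sum(axis=1, keepdims=True)  # E_q[beta_k], rows sum to 1
    p_words = theta_hat @ beta_hat                   # length-V predictive distribution
    return np.mean(np.log(p_words[test_word_ids]))   # average over held-out words
```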

69 We hold out 10,000 Wikipedia documents and 1,024 Nature documents (not included in the counts above) for testing. [sent-267, score-0.174]

70 Figs. 3(a) and 3(d) demonstrate that both SVI and SDA are sensitive to minibatch size when ηkv = 0. [sent-273, score-0.226]

71 Moreover, in the remaining experiments, we use a large minibatch size of 2^15 = 32,768. [sent-279, score-0.226]

72 One would expect that with additional streaming capabilities, SDA-Bayes should show a performance loss relative to SVI. [sent-283, score-0.223]

73–76 Figure 2: SDA-Bayes log predictive probability (two left plots) and run time (two right plots) as a function of number of threads; panels (a) Wikipedia and (b) Nature plot log predictive probability, panels (c) Wikipedia and (d) Nature plot run time (hours), each for synchronous (sync) and asynchronous (async) runs with 1–32 threads.

77 In Table 1, we also report performance of SDA-Bayes with 32 threads and the same minibatch size. [sent-297, score-0.334]

78 In the synchronous case, we consider minibatch size to equal the total number of data points processed per round; therefore, the minibatch size equals the number of data points sent to each thread per round times the total number of threads. [sent-298, score-0.665]

79 In the asynchronous case, we analogously report minibatch size as this product. [sent-299, score-0.333]
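For example, under this accounting a minibatch size of 2^15 = 32,768 with 32 threads corresponds to 32,768 / 32 = 1,024 data points sent to each thread per round.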

80 Fig. 2 shows the performance of SDA-Bayes when we run with {1, 2, 4, 8, 16, 32} threads while keeping the minibatch size constant. [sent-301, score-0.358]

81 Indeed, we see dramatic run time improvement as the number of threads grows and in fact some slight performance improvement as well. [sent-303, score-0.132]

82 We tried both a parallel version and a fully distributed, asynchronous version of the algorithm; Fig. 2 shows both (labeled sync and async). [sent-304, score-0.107]

83 In our experiments, the processing time at each thread seems to be approximately equal across threads and dominate any communication time at the master, so synchronous and asynchronous performance and running time are essentially identical. [sent-309, score-0.324]

84 The evaluations above are for a single posterior over D data points. [sent-312, score-0.191]

85 Of greater concern to us in this work is the evaluation of algorithms in the streaming setting. [sent-313, score-0.223]

86 We have seen that SVI is designed to find the posterior for a particular, prechosen number of data points D. [sent-314, score-0.224]

87 A practitioner in the streaming setting will typically not know D in advance, or multiple values of D may be of interest. [sent-321, score-0.249]

88 This cross-validation requires multiple runs over the data and thus is not suited to the streaming setting. [sent-328, score-0.223]

89 The authors of [3] have observed that the optimal (τ0 , κ) may interact with minibatch size, and we further observe that the optimal values may vary with D as well. [sent-332, score-0.226]

90–93 Figure 3: Sensitivity of SVI and SDA-Bayes to some respective parameters; panels: (a) sensitivity to minibatch size on Wikipedia, (b) SVI sensitivity to D on Wikipedia, (c) SVI sensitivity to stepsize parameters on Wikipedia, (d) sensitivity to minibatch size on Nature, (e) SVI sensitivity to D on Nature, (f) SVI sensitivity to stepsize parameters on Nature.

94 7 × 10^4 data points), log predictive probability had stabilized at around −7. [sent-388, score-0.148]

95 7 × 10^4 data points), log predictive probability had stabilized at around −8. [sent-390, score-0.148]

96 (Section 5) Discussion: We have introduced SDA-Bayes, a framework for streaming, distributed, asynchronous computation of an approximate Bayesian posterior. [sent-393, score-0.135]

97 Our framework makes streaming updates to the estimated posterior according to a user-specified approximation primitive. [sent-394, score-0.467]

98 We have demonstrated the usefulness of our framework, with variational Bayes as the primitive, by fitting the latent Dirichlet allocation topic model to the Wikipedia and Nature corpora. [sent-395, score-0.205]

99 We chose 8 threads since any fewer was too slow to get results and anything larger created too high a memory demand on our system. [sent-403, score-0.108]

100 A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. [sent-502, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svi', 0.607), ('kv', 0.25), ('minibatch', 0.226), ('streaming', 0.223), ('posterior', 0.191), ('cb', 0.183), ('vb', 0.171), ('qd', 0.168), ('dvk', 0.138), ('wikipedia', 0.136), ('lda', 0.133), ('master', 0.112), ('threads', 0.108), ('asynchronous', 0.107), ('ndv', 0.106), ('ep', 0.104), ('ssu', 0.103), ('variational', 0.093), ('predictive', 0.092), ('sda', 0.091), ('post', 0.089), ('documents', 0.087), ('worker', 0.084), ('dirichlet', 0.083), ('zdn', 0.07), ('localvb', 0.069), ('async', 0.061), ('primitive', 0.06), ('document', 0.058), ('topic', 0.057), ('qb', 0.056), ('sync', 0.056), ('thread', 0.056), ('dk', 0.055), ('bayes', 0.053), ('synchronous', 0.053), ('dwdn', 0.052), ('kwtest', 0.052), ('wtest', 0.052), ('calculate', 0.05), ('posteriors', 0.05), ('pd', 0.048), ('minibatches', 0.048), ('topics', 0.046), ('sensitivity', 0.044), ('advance', 0.044), ('nature', 0.044), ('revisit', 0.042), ('bayesian', 0.04), ('exponential', 0.039), ('processed', 0.038), ('collect', 0.036), ('hours', 0.035), ('family', 0.035), ('paisley', 0.035), ('ashia', 0.034), ('istributed', 0.034), ('treaming', 0.034), ('hogwild', 0.033), ('normalizing', 0.033), ('points', 0.033), ('eq', 0.031), ('updates', 0.031), ('parallelizing', 0.031), ('batch', 0.03), ('bth', 0.03), ('asynchrony', 0.03), ('wibisono', 0.03), ('synchronously', 0.03), ('stabilized', 0.03), ('computations', 0.03), ('latent', 0.03), ('shorthand', 0.029), ('berkeley', 0.029), ('inference', 0.029), ('blei', 0.029), ('prior', 0.028), ('stochastic', 0.028), ('approximations', 0.028), ('approximate', 0.028), ('pass', 0.028), ('foreach', 0.026), ('wdn', 0.026), ('practitioner', 0.026), ('log', 0.026), ('hyperparameters', 0.026), ('allocation', 0.025), ('hoffman', 0.025), ('workers', 0.025), ('broderick', 0.025), ('descent', 0.024), ('run', 0.024), ('returns', 0.024), ('propagation', 0.023), ('xs', 0.023), ('waiting', 0.023), ('stepsize', 0.023), ('stat', 0.023), ('kl', 0.023), ('approximation', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

2 0.24215403 143 nips-2013-Integrated Non-Factorized Variational Inference

Author: Shaobo Han, Xuejun Liao, Lawrence Carin

Abstract: We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the 1 norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-field variational Bayes (VB) method. 1

3 0.17347072 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing

Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1

4 0.12348704 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

5 0.11183513 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

Author: Sam Patterson, Yee Whye Teh

Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1

6 0.10467951 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

7 0.093499348 188 nips-2013-Memory Limited, Streaming PCA

8 0.092680812 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

9 0.089434318 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

10 0.079842433 60 nips-2013-Buy-in-Bulk Active Learning

11 0.079353236 174 nips-2013-Lexical and Hierarchical Topic Regression

12 0.078813441 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

13 0.07788039 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

14 0.074084565 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

15 0.073653437 98 nips-2013-Documents as multiple overlapping windows into grids of counts

16 0.068246752 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

17 0.068034396 301 nips-2013-Sparse Additive Text Models with Low Rank Background

18 0.067387074 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

19 0.065932751 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

20 0.061247129 75 nips-2013-Convex Two-Layer Modeling


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, 0.047), (2, -0.034), (3, -0.007), (4, 0.028), (5, 0.105), (6, 0.13), (7, 0.054), (8, 0.157), (9, -0.033), (10, -0.03), (11, 0.174), (12, -0.039), (13, 0.036), (14, -0.03), (15, -0.179), (16, 0.096), (17, -0.007), (18, -0.018), (19, -0.108), (20, -0.044), (21, -0.1), (22, -0.013), (23, 0.065), (24, -0.128), (25, -0.074), (26, 0.076), (27, -0.026), (28, 0.078), (29, -0.025), (30, 0.075), (31, -0.066), (32, -0.133), (33, 0.078), (34, -0.005), (35, 0.048), (36, -0.033), (37, 0.046), (38, -0.054), (39, -0.069), (40, -0.035), (41, -0.005), (42, -0.029), (43, 0.052), (44, -0.012), (45, 0.037), (46, -0.099), (47, -0.046), (48, -0.053), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93153965 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

2 0.78292102 143 nips-2013-Integrated Non-Factorized Variational Inference

Author: Shaobo Han, Xuejun Liao, Lawrence Carin

Abstract: We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the 1 norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-field variational Bayes (VB) method. 1

3 0.75125551 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek

Abstract: The Bayesian online change point detection (BOCPD) algorithm provides an efficient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model’s posterior predictives, which can only be computed online in O(1) time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efficient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution. Change point detection has been applied to many applications [5; 7]. In recent years there have been great improvements to the Bayesian approaches via the Bayesian online change point detection algorithm (BOCPD) [1; 23; 27]. Likewise, the radar tracking community has been improving in its use of feature-aided tracking [10]: methods that use auxiliary information from radar returns such as signal-to-noise ratio (SNR), which depend on radar cross sections (RCS) [21]. Older systems would often filter only noisy position (and perhaps Doppler) measurements while newer systems use more information to improve performance. We use BOCPD for modeling the RCS feature. Whereas BOCPD inference could be done exactly when finding change points in conjugate exponential family models the physics of RCS measurements often causes them to be distributed in non-exponential family ways, often following a Rice distribution. To do inference efficiently we call upon variational Bayes (VB) to find approximate posterior (predictive) distributions. Furthermore, the nature of both BOCPD and tracking require the use of online updating. We improve upon the existing and limited approaches to online VB [24; 13]. This paper produces contributions to, and builds upon background from, three independent areas: change point detection, variational Bayes, and radar tracking. Although the emphasis in machine learning is on filtering, a substantial part of tracking with radar data involves data association, illustrated in Figure 1. Observations of radar returns contain measurements from multiple objects (targets) in the sky. If we knew which radar return corresponded to which target we would be presented with NT ∈ N0 independent filtering problems; Kalman filters [14] (or their nonlinear extensions) are applied to “average out” the kinematic errors in the measurements (typically positions) using the measurements associated with each target. The data association problem is to determine which measurement goes to which track. In the classical setup, once a particular measurement is associated with a certain target, that measurement is plugged into the filter for that target as if we knew with certainty it was the correct assignment. The association algorithms, in effect, find the maximum a posteriori (MAP) estimate on the measurement-to-track association. However, approaches such as the joint probabilistic data association (JPDA) filter [2] and the probability hypothesis density (PHD) filter [16] have deviated from this. 1 To find the MAP estimate a log likelihood of the data under each possible assignment vector a must be computed. 
These are then used to construct cost matrices that reduce the assignment problem to a particular kind of optimization problem (the details of which are beyond the scope of this paper). The motivation behind feature-aided tracking is that additional features increase the probability that the MAP measurement-to-track assignment is correct. Based on physical arguments the RCS feature (SNR) is often Rice distributed [21, Ch. 3]; although, in certain situations RCS is exponential or gamma distributed [26]. The parameters of the RCS distribution are determined by factors such as the shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS characteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS features may help distinguish one aircraft from another if they cross paths or come near one another, for example. RCS also helps distinguish genuine aircraft returns from clutter: a flock of birds or random electrical noise, for example. However, the parameters of the RCS distributions may also change for the same aircraft due to a change in angle or ground conditions. These must be taken into account for accurate association. Providing good predictions in light of a possible sudden change in the parameters of a time series is “right up the alley” of BOCPD and change point methods. The original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family models for time series. In this paper, we expand the set of applications of BOCPD to radar SNR data which often has the same change point structure found in other applications, and requires online predictions. The BOCPD model is highly modular in that it looks for changes in the parameters of any underlying process model (UPM). The UPM merely needs to provide posterior predictive probabilities, the UPM can otherwise be a “black box.” The BOCPD queries the UPM for a prediction of the next data point under each possible run length, the number of points since the last change point. If (and only if by Hipp [12]) the UPM is exponential family (with a conjugate prior) the posterior is computed by accumulating the sufficient statistics since the last potential change point. This allows for O(1) UPM updates in both computation and memory as the run length increases. We motivate the use of VB for implementing UPMs when the data within a regime is believed to follow a distribution that is not exponential family. The methods presented in this paper can be used to find variational run length posteriors for general non-exponential family UPMs in addition to the Rice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are applicable in areas outside of change point detection. Likelihood clutter (birds) track 1 (747) track 2 (EMB 110) 0 5 10 15 20 SNR Figure 1: Illustrative example of a tracking scenario: The black lines (−) show the true tracks while the red stars (∗) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions on the states are shown as blue ellipses. The current (+) and previous (×) measurements are connected to their associated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (·). The distributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right. 
To our knowledge this paper is the first to demonstrate how to compute Bayesian posterior distributions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15], which computes a MAP estimate. Other novel factors of this paper include: demonstrating the usefulness (and advantages over existing techniques) of change point detection for RCS estimation and tracking; and applying variational inference for UPMs where analytic posterior predictives are not possible. This paper provides four main technical contributions: 1) VB inference for inferring the parameters of a Rice distribution. 2) General improvements to online VB (which is then applied to updating the UPM in BOCPD). 3) Derive a VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handle censored measurements (particularly for a Rice distribution) in VB. This is key for processing missed detections in data association. 2 1 Background In this section we briefly review the three areas of background: BOCPD, VB, and tracking. 1.1 Bayesian Online Change Point Detection We briefly summarize the model setup and notation for the BOCPD algorithm; see [27, Ch. 5] for a detailed description. We assume we have a time series with n observations so far y1 , . . . , yn ∈ Y. In effect, BOCPD performs message passing to do online inference on the run length rn ∈ 0:n − 1, the number of observations since the last change point. Given an underlying predictive model (UPM) and a hazard function h, we can compute an exact posterior over the run length rn . Conditional on a run length, the UPM produces a sequential prediction on the next data point using all the data since the last change point: p(yn |y(r) , Θm ) where (r) := (n − r):(n − 1). The UPM is a simpler model where the parameters θ change at every change point and are modeled as being sampled from a prior with hyper-parameters Θm . The canonical example of a UPM would be a Gaussian whose mean and variance change at every change point. The online updates are summarized as: P (rn |rn−1 ) p(yn |rn−1 , y(r) ) p(rn−1 , y1:n−1 ) . msgn := p(rn , y1:n ) = rn−1 hazard UPM (1) msgn−1 Unless rn = 0, the sum in (1) only contains one term since the only possibility is that rn−1 = rn −1. The indexing convention is such that if rn = 0 then yn+1 is the first observation sampled from the new parameters θ. The marginal posterior predictive on the next data point is easily calculated as: p(yn+1 |y1:n ) = p(yn+1 |y(r) )P (rn |y1:n ) . (2) rn Thus, the predictions from BOCPD fully integrate out any uncertainty in θ. The message updates (1) perform exact inference under a model where the number of change points is not known a priori. BOCPD RCS Model We show the Rice UPM as an example as it is required for our application. The data within a regime are assumed to be iid Rice observations, with a normal-gamma prior: yn ∼ Rice(ν, σ) , ν ∼ N (µ0 , σ 2 /λ0 ) , σ −2 =: τ ∼ Gamma(α0 , β0 ) (3) 2 =⇒ p(yn |ν, σ) = yn τ exp(−τ (yn + ν 2 )/2)I0 (yn ντ )I{yn ≥ 0} (4) where I0 (·) is a modified Bessel function of order zero, which is what excludes the Rice distribution from the exponential family. Although the normal-gamma is not conjugate to a Rice it will enable us to use the VB-EM algorithm. The UPM parameters are the Rice shape1 ν ∈ R and scale σ ∈ R+ , θ := {ν, σ}, and the hyper-parameters are the normal-gamma parameters Θm := {µ0 , λ0 , α0 , β0 }. Every change point results in a new value for ν and σ being sampled. 
A posterior on θ is maintained for each run length, i.e. every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efficient. Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes rn (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be defined for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efficient O(1) updating in exponential family models by using a conjugate prior and accumulating sufficient statistics is not possible. This motivates the use of VB methods for approximating the UPM predictions. 1.2 Variational Bayes We follow the framework of VB where when computation of the exact posterior distribution p(θ|y1:n ) is intractable it is often possible to create a variational approximation q(θ) that is locally optimal in terms of the Kullback-Leibler (KL) divergence KL(q p) while constraining q to be in a certain family of distributions Q. In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n ), using either gradient based methods or standard fixed point equations. 1 The shape ν is usually assumed to be positive (∈ R+ ); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior. 3 The VB-EM Algorithm In many cases, such as the Rice UPM, the derivation of the VB fixed point equations can be simplified by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n . A model is CE if: 1) The complete data likelihood p(x1:n , y1:n |θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n , y1:n |θ). We only have to constrain the posterior q(θ, x1:n ) = q(θ)q(x1:n ) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n |θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1). Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η := Eq(θ) [η] as correct: ¯ q(xi ) ← p(xi |yi , η = η ). 2) Find the posterior of the parameters using the expected sufficient statis¯ ¯ tics S := Eq(x1:n ) [S(x1:n , y1:n )] as if they were the sufficient statistics for the complete data set: ¯ q(θ) ← p(θ|S(x1:n , y1:n ) = S). The posterior will be of the same exponential family as the prior. 1.3 Tracking In this section we review data association, which along with filtering constitutes tracking. In data association we estimate the association vectors a which map measurements to tracks. At each time NZ (n) step, n ∈ N1 , we observe NZ (n) ∈ N0 measurements, Zn = {zi,n }i=1 , which includes returns from both real targets and clutter (spurious measurements). 
Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R3 , or R4 with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an (i) = j if measurement i is associated with track j > 0; an (i) = 0 if measurement i is clutter. The inverse mapping a−1 maps tracks to measurements: meaning a−1 (an (i)) = i if an (i) = 0; and n n a−1 (i) = 0 ⇔ an (j) = i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, n Nc = 2, and a−1 = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa. In N D data association we jointly find the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT (n) ∈ N0 total tracks as a known parameter: NT (n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc (n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc (n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n , an−1 , an |Z1:n−2 ) is: NT |Zi | n pi (za−1 (i),n , za−1 n n−1 i=1 (i),n−1 ) × λNc (i) exp(−λ)/|Zi |! i=n−1 p0 (zj,i )I{ai (j)=0} , (5) j=1 where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (filter); informally, pi (z) = pi (positions) × pi (RCS). If there is a missed detection, i.e. a−1 (i) = 0, we then use pi (za−1 (i),n ) = P (R < R0 ) under the RCS model for track i with no conn n tribution from positional (kinematic) component. Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form: ˆ (ˆn−1 , an ) = argmax log P (an−1 , an |Z1:n ) = argmax log p(Zn−1:n , an−1 , an |Z1:n−2 ) , (6) a (an−1 ,an ) (an−1 ,an ) which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efficiently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20]. If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0 . This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3. 4 2 Online Variational UPMs We cover the four technical challenges for implementing non-exponential family UPMs in an efficient and online manner. We drop the index of the data point i when it is clear from context. 2.1 Variational Posterior for a Rice Distribution The Rice distribution has the property that x ∼ N (ν, σ 2 ) , y ∼ N (0, σ 2 ) =⇒ R = x2 + y 2 ∼ Rice(ν, σ) . 
(7) For simplicity we perform inference using R2 , as opposed to R, and transform accordingly: x ∼ N (ν, σ 2 ) , 1 R2 − x2 ∼ Gamma( 2 , τ ) , 2 τ := 1/σ 2 ∈ R+ =⇒ p(R2 , x) = p(R2 |x)p(x) = Gamma(R2 − x2 | 1 , τ )N (x|ν, σ 2 ) . 2 2 (8) The complete likelihood (8) is the product of two exponential family models and is exponential family itself, parameterized with base measure h and partition factor g: η = [ντ, −τ /2] , S = [x, R2 ] , h(R2 , x) = (2π R2 − x2 )−1 , g(ν, τ ) = τ exp(−ν 2 τ /2) . By inspection we see that the natural parameters η and sufficient statistics S are the same as a Gaussian with unknown mean and variance. Therefore, we apply the normal-gamma prior on (ν, τ ) as it is the conjugate prior for the complete data likelihood. This allows us to apply the VB-EM 2 algorithm. We use yi := Ri as the VB observation, not Ri as in (3). In (5), z·,· (end) is the RCS R. VB M-Step We derive the posterior updates to the parameters given expected sufficient statistics: n λ0 µ0 + i E[xi ] , λn = λ0 + n , αn = α0 + n , λ0 + n i=1 n n 1 1 nλ0 1 βn = β0 + (E[xi ] − x)2 + ¯ (¯ − µ0 )2 + x R2 − E[xi ]2 . 2 i=1 2 λ0 + n 2 i=1 i x := ¯ E[xi ]/n , µn = (9) (10) This is the same as an observation from a Gaussian and a gamma that share a (inverse) scale τ . 2 2 ¯ VB E-Step We then must find both expected sufficient statistics S. The expectation E[Ri |Ri ] = 2 2 Ri trivially; leaving E[xi |Ri ]. Recall that the joint on (x, y ) is a bivariate normal; if we constrain the radius to R, the angle ω will be distributed by a von Mises (VM) distribution. Therefore, ω := arccos(x/R) ∼ VM(0, κ) , κ = R E[ντ ] =⇒ E[x] = R E[cos ω] = RI1 (κ)/I0 (κ) , (11) where computing κ constitutes the VB E-step and we have used the trigonometric moment on ω [18]. This completes the computations required to do the VB updates on the Rice posterior. Variational Lower Bound For completeness, and to assess convergence, we derive the VB lower bound L(q). Using the standard formula [4] for L(q) = Eq [log p(y1:n , x1:n , θ)] + H[q] we get: n 2 1 E[log τ /2] − 1 E[τ ]Ri + (E[ντ ] − κi /Ri )E[xi ] − 2 E[ν 2 τ ] + log I0 (κi ) − KL(q p) , 2 (12) i=1 where p in the KL is the prior on (ν, τ ) which is easy to compute as q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates. 2.2 Online Variational Inference In Section 2.1 we derived an efficient way to compute the variational posterior for a Rice distribution for a fixed data set. However, as is apparent from (1) we need online predictions from the UPM; we must be able to update the posterior one data point at a time. When the UPM is exponential family and we can compute the posterior exactly, we merely use the posterior from the previous step as the prior. However, since we are only computing a variational approximation to the posterior, using the previous posterior as the prior does not give the exact same answer as re-computing the posterior from batch. This gives two obvious options: 1) recompute the posterior from batch every update at O(n) cost or 2) use the previous posterior as the prior at O(1) cost and reduced accuracy. 5 The difference between the options is encapsulated by looking at the expected sufficient statistics: n ¯ S = i=1 Eq(xi |y1:n ) [S(xi , yi )]. Naive online updating uses old expected sufficient statistics whose n ¯ posterior effectively uses S = i=1 Eq(xi |y1:i ) [S(xi , yi )]. We get the best of both worlds if we adjust those estimates over time. 
We in fact can do this if we project the expected sufficient statistics into a “feature space” in terms of the expected natural parameters. For some function f , q(xi ) = p(xi |yi , η = η ) =⇒ Eq(xi |y1:n ) [S(xi , yi )] = f (yi , η ) . ¯ ¯ If f is piecewise continuous then we can represent it with an inner product [8, Sec. 2.1.6] n n ¯ f (yi , η ) = φ(¯) ψ(yi ) =⇒ S = ¯ η φ(¯) ψ(yi ) = φ(¯) η η ψ(yi ) , i=1 i=1 (13) (14) where an infinite dimensional φ and ψ may be required for exact representation, but can be approximated by a finite inner product. In the Rice distribution case we use (11) f (yi , η ) = E[xi ] = Ri I (Ri E[ντ ]) = Ri I ((Ri /µ0 ) µ0 E[ντ ]) , ¯ I (·) := I1 (·)/I0 (·) , (15) 2 Ri where recall that yi = and η1 = E[ντ ]. We can easily represent f with an inner product if we can ¯ represent I as an inner product: I (uv) = φ(u) ψ(v). We use unitless φi (u) = I (ci u) with c1:G as a log-linear grid from 10−2 to 103 and G = 50. We use a lookup table for ψ(v) that was trained to match I using non-negative least squares, which left us with a sparse lookup table. Online updating for VB posteriors was also developed in [24; 13]. These methods involved introducing forgetting factors to forget the contributions from old data points that might be detrimental to accuracy. Since the VB predictions are “embedded” in a change point method, they are automatically phased out if the posterior predictions become inaccurate making the forgetting factors unnecessary. 2.3 Censored Data As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely have to compute the expected sufficient statistics given the censored measurement: E[S|R < R0 ]. The expected sufficient statistic from (11) is now: R0 E[x|R < R0 ] = 0 ν ν E[x|R]p(R)dR RiceCDF (R0 |ν, τ ) = ν(1 − Q2 ( σ , R0 ))/(1 − Q1 ( σ , R0 )) , σ σ where QM is the Marcum Q function [17] of order M . Similar updates for E[S|R < R0 ] are possible for exponential or gamma UPMs, but are not shown as they are relatively easy to derive. 2.4 Variational Run Length Posteriors: Predictive Log Likelihoods Both updating the BOCPD run length posterior (1) and finding the marginal predictive log likelihood of the next point (2) require calculating the UPM’s posterior predictive log likelihood log p(yn+1 |rn , y(r) ). The marginal posterior predictive from (2) is used in data association (6) and benchmarking BOCPD against other methods. However, the exact posterior predictive distribution obtained by integrating the Rice likelihood against the VB posterior is difficult to compute. We can break the BOCPD update (1) into a time and measurement update. The measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(rn |y1:n ): p(rn |y1:n+1 ) ∝ p(yn+1 |rn , y(r) )p(rn |y1:n ) . (16) Using the BMC results in Bishop [4, Sec. 10.1.4] we find a variational posterior on the run length by using the variational lower bound for each run length Li (q) ≤ log p(yn+1 |rn = i, y(r) ), calculated using (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q is of the form: q(rn , θ, x) = qUPM (θ, x|rn )q(rn ) =⇒ q(rn = i) = exp(Li (q))p(rn = i|y1:n )/ exp(L(q)) , (17) where qUPM contains whatever constraints we used to compute Li (q). The normalizer on q(rn ) serves as a joint VB lower bound: L(q) = log i exp(Li (q))p(rn = i|y1:n ) ≤ log p(yn+1 |y1:n ). 
Note that the conditional factorization is different than the typical independence constraint on q. Furthermore, we derive the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment and constrain the assignment posterior to be a point mass. In the 2D assignment case, for example, ˆ q(an , X1:NT ) = q(X1:NT |an )q(an ) = q(X1:NT |an )I{an = an } , (18) 6 2 10 0 10 −1 10 −2 10 10 20 30 40 50 RCS RMSE (dBsm) RCS RMSE (dBsm) 10 KL (nats) 5 10 1 8 6 4 2 3 2 1 0 0 0 100 200 Sample Size (a) Online Updating 4 300 Time (b) Exponential RCS 400 0 100 200 300 400 Time (c) Rice RCS Figure 2: Left: KL from naive updating ( ), Sato’s method [24] ( ), and improved online VB (◦) to the batch VB posterior vs. sample size n; using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue) and τ = 1. Middle: The RMSE (dB scale) of the estimate on the mean RCS distribution E[Rn ] is plotted for an exponential RCS model. The curves are BOCPD (blue), IMM (black), identity (magenta), α-filter (green), and median filter (red). Right: Same as the middle but for the Rice RCS case. The dashed lines are 95% confidence intervals. where each track’s Xi represents all the latent variables used to compute the variational lower bound on log p(zj,n |an (j) = i). In the BOCPD case, Xi := {rn , x, θ}. The resulting VB fixed point ˆ equations find the posterior on the latent variables Xi by taking an as the true assignment and solving ˆ the VB problem of (17); the assignment an is found by using (6) and taking the joint BOCPD lower bound L(q) as a proxy for the BOCPD predictive log likelihood component of log pi in (5). 3 3.1 Results Improved Online Solution We first demonstrate the accuracy of the online VB approximation (Section 2.2) on a Rice estimation example; here, we only test the VB posterior as no change point detection is applied. Figure 2(a) compares naive online updating, Sato’s method [24], and our improved online updating in KL(online batch) of the posteriors for three different true parameters ν as sample size n increases. The performance curves are the KL divergence between these online approximations to the posterior and the batch VB solution (i.e. restarting VB from “scratch” every new data point) vs sample size. The error for our method stays around a modest 10−2 nats while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]. Sato’s method tends to settle in around a 1 nat approximation error. The recommended annealing schedule, i.e. forgetting factors, in [24] performed worse than naive updating. We did a grid search over annealing exponents and show the results for the best performing schedule of n−0.52 . By contrast, our method does not require the tuning of an annealing schedule. 3.2 RCS Estimation Benchmarking We now compare BOCPD with other methods for RCS estimation. We use the same experimental example as Slocumb and Klusman III [25], which uses an augmented interacting multiple model (IMM) based method for estimating the RCS; we also compare against the same α-filter and median filter used in [25]. As a reference point, we also consider the “identity filter” which is merely an unbiased filter that uses only yn to estimate the mean RCS E[Rn ] at time step n. We extend this example to look at Rice RCS in addition to the exponential RCS case. 
The bias correction constants in the IMM were adjusted for the Rice distribution case as per [25, Sec. 3.4]. The results for the exponential distributions used in [25] and for the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR in multiples of ±10 dB, which is exactly what is presented in the example (a sequence of 20, 10, 30, and 10 dB). In [25] the authors mention that the IMM reaches an RMSE “floor” at 2 dB, yet BOCPD continues to drop as low as 0.56 dB. The RMSE from BOCPD does not spike nearly as high as that of the other methods upon a change in E[R_n]. The α-filter and median filter appear worse than both the IMM and BOCPD. The RMSE and confidence intervals are calculated from 5000 runs of the experiment.

Figure 3: Left: Average relative improvements (%) for SIAP metrics: position accuracy (red), velocity accuracy (green), and spurious tracks (blue ◦) across difficulty levels. Right: LHR: true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). The standard tracker has spurious tracks over east London and near Ipswich. Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies).

3.3 Flightradar24 Tracking Problem

Finally, we used real flight trajectories from flightradar24 and plugged them into our 3D tracking algorithm. We compare tracking performance between our BOCPD model and a relatively standard setup with a constant probability of detection and no RCS feature [2, Sec. 3.5]. We use the single integrated air picture (SIAP) metrics [6] to demonstrate the improved performance of the tracking; the SIAP metrics are a standard set of metrics used to compare tracking systems. We broke the data into 30 regions during a one-hour period (in Sept. 2012), sampled every 5 s, each within a 200 km by 200 km area centered on one of the world’s 30 busiest airports [22]. Commercial airport traffic is typically very orderly and does not allow aircraft to fly close to one another or cross paths, whereas feature-aided tracking is most useful in scenarios with a more chaotic air situation. Therefore, we took random subsets of 10 flight paths and randomly shifted their start times to create scenarios of greater interest. The resulting SIAP metric improvements are shown in Figure 3(a), where we break performance down by a difficulty metric: the number of times in a scenario that any two aircraft come within ∼400 m of each other. The biggest improvements are seen for difficulties above three, where position accuracy increases by 30%. Significant improvements are also seen for velocity accuracy (11%) and the frequency of spurious tracks (6%), and performance gains are seen at all difficulty levels considered. The larger improvements at level three than at level five are possibly due to some level-five scenarios that are not resolvable simply through more sophisticated models. We demonstrate how our RCS methods prevent the creation of spurious tracks around London Heathrow in Figure 3(b).
4 Conclusions

We have demonstrated that sophisticated recent developments in machine learning, such as BOCPD, combined with modern inference methods, such as VB, can produce demonstrable improvements in the much more mature field of radar tracking. We first closed a “hole” in the literature in Section 2.1 by deriving variational inference for the parameters of a Rice distribution, with its inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that it is possible to use these variational UPMs for non-exponential family models in BOCPD without sacrificing its modular or online nature. The improvements in online VB extend to UPMs other than the Rice distribution and, more generally, beyond change point detection. We can use the variational lower bound from the UPM to obtain a principled variational approximation to the run length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, which stands in sharp contrast to the tracking literature. More algorithms from the tracking literature can likely be cast in machine learning frameworks, such as VB, and improved upon from there.

References

[1] Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK.
[2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing.
[3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464.
[4] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[5] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301–314.
[6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC.
[7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92(438):739–747.
[8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience.
[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
[10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In 9th International Conference on Information Fusion, pages 1–8. IEEE.
[11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605.
[12] Hipp, C. (1974). Sufficient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292.
[13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808.
[14] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.
[15] Lauwers, L., Barbé, K., Van Moer, W., and Pintelon, R. (2009). Estimating the parameters of a Rice distribution: A Bayesian approach. In Instrumentation and Measurement Technology Conference (I2MTC 2009), pages 114–117. IEEE.
[16] Mahler, R. (2003). Multi-target Bayes filtering via first-order multi-target moments. IEEE Transactions on Aerospace and Electronic Systems, 39(4):1152–1178.
[17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA.
[18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York.
[19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.
[20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics.
[21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Publishing.
[22] Rogers, S. (2012). The world’s top 100 airports: listed, ranked and mapped. The Guardian.
[23] Saatçi, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress.
[24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
[25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics.
[26] Swerling, P. (1954). Probability of detection for fluctuating targets. Technical Report RM-1217, Rand Corporation.
[27] Turner, R. (2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK.

4 0.67940515 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

Author: Sam Patterson, Yee Whye Teh

Abstract: In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied to large scale data. We apply this method to latent Dirichlet allocation in an online minibatch setting, and demonstrate that it achieves substantial performance improvements over the state of the art online variational Bayesian methods. 1

5 0.67681473 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

Author: Chong Wang, Xi Chen, Alex Smola, Eric Xing

Abstract: Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex—the MAP estimation for logistic regression, and the other is non-convex—stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach. 1

6 0.59943545 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

7 0.59363711 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

8 0.58113307 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

9 0.57172948 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

10 0.5362038 301 nips-2013-Sparse Additive Text Models with Low Rank Background

11 0.51023948 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

12 0.48299387 174 nips-2013-Lexical and Hierarchical Topic Regression

13 0.48111466 86 nips-2013-Demixing odors - fast inference in olfaction

14 0.47730663 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

15 0.47054288 98 nips-2013-Documents as multiple overlapping windows into grids of counts

16 0.45791936 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

17 0.43280205 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

18 0.42380837 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

19 0.4076221 88 nips-2013-Designed Measurements for Vector Count Data

20 0.40365151 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.047), (33, 0.104), (34, 0.142), (36, 0.334), (41, 0.027), (49, 0.043), (56, 0.091), (70, 0.022), (85, 0.039), (89, 0.017), (93, 0.044), (95, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76744431 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

2 0.69059134 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

3 0.6881423 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

4 0.67558509 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

Author: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1

5 0.62471408 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

6 0.560619 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

7 0.55844253 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

8 0.55452818 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

9 0.55026287 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.54465264 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

11 0.54165703 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

12 0.53655648 233 nips-2013-Online Robust PCA via Stochastic Optimization

13 0.53216386 249 nips-2013-Polar Operators for Structured Sparse Estimation

14 0.53195971 232 nips-2013-Online PCA for Contaminated Data

15 0.5318644 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

16 0.52957755 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

17 0.52577764 25 nips-2013-Adaptive Anonymity via $b$-Matching

18 0.52503067 324 nips-2013-The Fast Convergence of Incremental PCA

19 0.52453893 347 nips-2013-Variational Planning for Graph-based MDPs

20 0.52439106 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction