nips nips2013 nips2013-143 knowledge-graph by maker-knowledge-mining

143 nips-2013-Integrated Non-Factorized Variational Inference

Source: pdf

Author: Shaobo Han, Xuejun Liao, Lawrence Carin

Abstract: We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efﬁcient and possibly parallel computation. Our approach uniﬁes the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the 1 norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-ﬁeld variational Bayes (VB) method. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efﬁcient and possibly parallel computation. [sent-5, score-0.555]

2 Our approach uniﬁes the integrated nested Laplace approximation (INLA) under the variational framework. [sent-6, score-0.327]

3 We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. [sent-8, score-0.055]

4 Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-ﬁeld variational Bayes (VB) method. [sent-9, score-0.29]

5 1 Introduction Markov chain Monte Carlo (MCMC) methods [1] have been dominant tools for posterior analysis in Bayesian inference. [sent-10, score-0.141]

6 Although MCMC can provide numerical representations of the exact posterior, they usually require intensive runs and are therefore time consuming. [sent-11, score-0.058]

7 There have been many efforts dedicated to developing deterministic alternatives, including the Laplace approximation [3], variational methods [4], and expectation propagation (EP) [5]. [sent-13, score-0.269]

8 However, the Gaussian assumption for the latent process prevents INLA from being applied to more general models outside of the family of latent Gaussian models (LGMs). [sent-16, score-0.078]

9 In the machine learning community, variational inference has received signiﬁcant use as an efﬁcient alternative to MCMC. [sent-17, score-0.265]

10 An active area of research has been focused on developing more efﬁcient and accurate variational inference algorithms, for example, collapsed inference [8, 9], non-conjugate models [10, 11], multimodal posteriors [12], and fast convergent methods [13, 14]. [sent-19, score-0.419]

11 The goal of this paper is to develop a reliable and efﬁcient deterministic inference method, to both achieve the accuracy of MCMC and retain its inferential ﬂexibility. [sent-20, score-0.09]

12 We present a promising variational inference method without requiring the widely used factorized approximation to the posterior. [sent-21, score-0.301]

13 Inspired by INLA, we propose a hybrid continuous-discrete variational approximation, which enables us to preserve full posterior dependencies and is therefore more accurate than the mean-ﬁeld variational Bayes (VB) method [15]. [sent-22, score-0.665]

14 The continuous variational approximation is ﬂexible enough for various kinds of latent ﬁelds, which makes our method applicable to more general settings than assumed by INLA. [sent-23, score-0.288]

15 The discretization of the low-dimensional hyper-parameter space can overcome the potential non-conjugacy and multimodal posterior problems in variational inference. [sent-24, score-0.39]

16 1 2 Integrated Non-Factorized Variational Bayesian Inference Consider a general Bayesian hierarchical model with observation y, latent variables x, and hyperparameters θ. [sent-25, score-0.092]

17 The exact joint posterior p(x, θ|y) = p(y, x, θ)/p(y) can be difﬁcult to evaluate, since usually the normalization p(y) = p(y, x, θ)dxdθ is intractable and numerical integration of x is too expensive. [sent-26, score-0.187]

18 To address this problem, we ﬁnd a variational approximation to the exact posterior by minimizing the Kullback-Leibler (KL) divergence KL (q(x, θ|y)||p(x, θ|y)). [sent-27, score-0.467]

19 Applying Jensen’s inequality to the log-marginal data likelihood, one obtains ln p(y) = ln q(x, θ|y) p(y,x,θ) dxdθ ≥ q(x,θ|y) q(x, θ|y) ln p(y,x,θ) dxdθ := L, q(x,θ|y) (1) which holds for any proposed approximating distributions q(x, θ|y). [sent-28, score-0.388]

20 Therefore minimizing the Kullback-Leibler (KL) divergence is equivalent to maximizing the ELBO. [sent-31, score-0.093]

21 To make the variational problem tractable, the variational distribution q(x, θ|y) is commonly required to take a restricted form. [sent-32, score-0.426]

22 1 Hybrid Continuous and Discrete Variational Approximations We consider a non-factorized approximation to the posterior q(x, θ|y) = q(x|y, θ)q(θ|y), to preserve the posterior dependency structure. [sent-35, score-0.304]

23 Unfortunately, this generally leads to a nontrivial optimization problem, q (x, θ|y) = argmin{q(x,θ|y)} KL (q(x, θ|y)||p(x, θ|y)) , = argmin{q(x,θ|y)} q(x,θ|y) q(x, θ|y) ln p(x,θ|y) dxdθ, = argmin{q(x|y,θ), q(θ|y)} q(x|θ,y) q(x|θ, y) ln p(x,θ|y) dx + ln q(θ|y) dθ. [sent-36, score-0.416]

25 Clearly, qd (θ|y) is an approximation of q(θ|y) by discretizing the continuous (typically) low-dimensional parameter space of θ using a grid G with ﬁnite grid points1 . [sent-38, score-0.498]

26 To obtain a useful discretization at a manageable number of grid points, the dimension of θ cannot be too large; this is also the same assumption in INLA [7], but we remove here the Gaussian prior assumption of INLA on latent effects x. [sent-40, score-0.143]

27 The hybrid variational approximation is found by minimizing the KL divergence, i. [sent-41, score-0.294]

28 , KL (q(x, θ|y)||p(x, θ|y)) = q(x|y,θk ) q(x|θk , y) ln p(x,θk |y) dx + ln qd (θk |y) k qd (θk |y) which leads to the approximate marginal posterior, q(x|y) = k q(x|y, θk )qd (θk |y) As will be clearer shortly, the problem in (3) can be much easier to solve than that in (2). [sent-43, score-0.953]

29 (3) (4) We give the name integrated non-factorized variational Bayes (INF-VB) to the method of approximating p(x, θ|y) with q(x|y, θ)qd (θ|y) by solving the optimization problem in (3). [sent-44, score-0.281]

30 The use of qd (θ) is equivalent to numerical integration, which is a key idea of INLA [7], see Section 2. [sent-45, score-0.322]

31 Here we use this idea in variational inference to overcome the potential non-conjugacy and multimodal posterior problems in θ. [sent-48, score-0.418]

32 2 Variational Optimization The proposed INF-VB method consists of two algorithmic steps: 1 The grid points need not to be uniformly spaced, one may put more grid points to potentially high mass regions if credible prior information is available. [sent-50, score-0.183]

33 2 • Step 1: Solving multiple independent optimization problems, each for a grid point in G, to obtain the optimal q(x|y, θk ), ∀θk ∈ G, i. [sent-51, score-0.08]

34 In case it is not available, we may further constrain q(x|y, θk ) to a parametric form, examples including: (i) multivariate Gaussian [17], if the posterior asymptotic normality holds; (ii) skew-normal densities [6, 18]; or (iii) an inducing factorization assumption (see Ch. [sent-54, score-0.143]

35 5 in [19]), if the latent variables x are conditionally independent or their dependencies are negligible. [sent-57, score-0.064]

36 (x|y,θk (6) Note that qd (θ|y) is evaluated at a grid of points θk ∈ G, it needs to be known only up to a multiplicative constant, which can be identiﬁed from the normalization constraint k qd (θk |y) = 1. [sent-59, score-0.701]

37 By contrast, INF-VB with the Gaussian parametric constraint on q (x|y, θk ) provides a global variational Gaussian approximation qV G (x|y, θk ) in the sense that the conditions of the Laplace approximation hold on average [17]. [sent-65, score-0.303]

38 INLA computes the marginal posteriors of θ based on the Laplace’s method of integration [20], qLA (θ|y) = p(x,θ|y) q(x|y,θ) (7) x=x∗ (θ) The quality of this approximation depends on the accuracy of q(x|y, θ). [sent-68, score-0.159]

39 It has been shown in [7] that (7) is accurate enough for latent Gaussian models with qG (x|y, θ). [sent-70, score-0.065]

40 Alternatively, the variational optimal posterior qd (θ|y) by INF-VB (6) can be derived as a lower bound of the true posterior p(θ|y) by Jensen’s inequality. [sent-71, score-0.79]

42 2 also explain the often observed empirical successes of hyperparameter selection based on the ELBO of ln p(y|θ) [13], when the ﬁrst level of Bayesian inference is performed, i. [sent-73, score-0.202]

43 only the conditional posterior q(x|y, θ) with ﬁxed θ is of interest. [sent-75, score-0.125]

44 In contrast to INLA, INF-VB provides q(x|y, θ) and qd (θ|y), both are optimal in a sense of the minimum Kullback-Leibler divergence, within the proposed hybrid distribution family. [sent-83, score-0.347]

45 In this paper we focus on the full posterior inference of Bayesian Lasso [21] where the local Laplace approximation in INLA cannot be applied, as the non-differentiability of the 1 norm prevents one from computing the Hessian matrix. [sent-84, score-0.213]

46 , no data-augmentation), we are actually dealing with a non-conjugate variational inference problem in Bayesian Lasso. [sent-87, score-0.265]

47 Since in our approach variational Gaussian approximation is performed separately (see Section 3. [sent-90, score-0.249]

48 The upper bound of the KL divergence derived in Section 3. [sent-92, score-0.148]

49 2 provides an approximate closed-form solution, that is often accurate enough or requires a small number of gradient iterations to converge to optimality. [sent-93, score-0.061]

50 The tightness of the upper bound is analyzed using spectral-norm bounds (See Section 3. [sent-94, score-0.083]

51 (10) The postulated approximation, q(x|θ, y) = N (x; µ, D), is a multivariate Gaussian density (dropping dependencies of variational parameters (µ, D) on (θ, y) for brevity), whose parameters (µ, D) are found by minimizing the KL divergence to p(x|θ, y), Def. [sent-98, score-0.331]

52 Since g(µ, D) is convex in the parameter space (µ, C), a global optimal variational Gaussian approximation q (x|y, θ) is guaranteed, which achieves the minimum KL divergence to p(x|θ, y) within the family of multivariate Gaussian densities speciﬁed [13]4 . [sent-102, score-0.342]

53 [21] suggested using scaled double-exponential priors under which they showed that p(x, σ 2 |y, λ) is unimodal, further, the unimodality helps to accelerate convergence of the data-augmentation Gibbs sampler and makes the posterior mode more meaningful. [sent-104, score-0.17]

54 4 Code for variational Gaussian approximation is available at mloss. [sent-106, score-0.249]

55 Second, q (θ|y) can be evaluated analytically using either (6) or (7); both will yield a ﬁnite mixture of Gaussian distribution for the marginal posterior q(x|y) via numerical integration, which is highly efﬁcient since we only have two hyperparameters in Bayesian Lasso. [sent-108, score-0.297]

56 Finally, the evidence lower bound (ELBO) in (1) can also be evaluated analytically after simple algebra. [sent-109, score-0.09]

57 3 a comparison with the mean-ﬁeld variational Bayesian (VB) approach, derived based on a scale normal mixture representation [22] of the Laplace prior. [sent-111, score-0.248]

58 2 Upper Bounds of KL divergence ˆ ˆ We provide an approximate solution (µ, D) via minimizing an upper bound of KL divergence (11). [sent-113, score-0.26]

59 (Triangle Inequality) Eq x 1 ≤ Eq x − µ 1 + µ 1 , where Eq x − µ p 2/π j=1 dj , with the expectation taken with respect to q(x; µ, D). [sent-118, score-0.064]

60 For any {dj ≥ 0}p , it holds j=1 p j=1 p d2 ≤ j=1 dj ≤ j √ p 2 Lemma 3. [sent-121, score-0.064]

61 (Upper and Lower bound) For any A, D ∈ Sp , A = ++ √ p 1 √ tr(A) ≤ dj ≤ p tr(A). [sent-126, score-0.064]

62 However, the upper bound f (µ, D) decouples into two additive terms: f1 is a function of µ while f2 is a function of D, which greatly simpliﬁes the minimization. [sent-131, score-0.055]

63 Global optimal solutions ˆ for µ(θk ) on each grid point θk ∈ G can be recovered using the piece-wise linear property. [sent-136, score-0.08]

64 Note that the global ˆ optimum D(θk ) for each grid point θk ∈ G have the same eigenvectors as the Gram matrix Φ Φ and differ only in eigenvalues. [sent-138, score-0.08]

65 ˆ ˆ ˆ ˆ The solutions (µ, D) which minimize the KL upper bound f (µ, D) in (12) achieves its global ˆ ˆ optimum. [sent-146, score-0.055]

66 Meanwhile, it is also accurate in the sense of the KL divergence g(µ, D) in (11), as we will show next. [sent-147, score-0.119]

67 Tightness analysis of the upper bound is also provided, using trace norm bounds. [sent-148, score-0.055]

68 Then ˆ ≤ minµ,D f (µ, D) = f1 (µ) + f2 (D) + ln p(y|θ) ˆ ˆ ˆ g(µ, D) (14) ψ(σ 2 ,λ) −2 2 2 βj α 2 −1 λ ˆ + σ µ 1 , f2 (D) = j ln αj + j 2σj + j 2λ n (αj ) . [sent-156, score-0.238]

69 2 π ˆ ˆ Thus the KL divergence for (µ, D) is upper bounded by the minimum achievable 1 -penalized least ˆ ˆ square error 1 = f1 (µ) and terms in f2 (D) which are ultimately related to the eigenvalues {βj } (j = 1, . [sent-157, score-0.159]

70 ˆ where f1 (µ) = minµ y−Φµ 2σ 2 Let (µ∗ , D∗ ) be the minimizer of the original KL divergence g(µ, D), and g1 (µ|D) collect the terms of g(µ, D) that are related to µ. [sent-161, score-0.111]

71 Then the Bayesian posterior mean obtained via VG, i. [sent-162, score-0.125]

72 This connection indicates that in VG for Bayesian Lasso, the conditions of deterministic Lasso hold on average, with respect to the variational distribution q(x|y, θ), in the parameter space of µ. [sent-166, score-0.233]

73 We also compare INF-VB-1 and INF-VB-2 to VB, a mean-ﬁeld variational Bayes (VB) solution (See Supplementary Material for update equations). [sent-172, score-0.213]

74 1 Synthetic Dataset We compare the proposed INF-VB methods with VB and intensive MCMC runs, in terms of the joint posterior q(λ2 , σ 2 |y) , the marginal posteriors of hyper-parameters q(σ 2 |y) and q(λ2 |y), and the marginal posteriors of regression coefﬁcients q(xj |y) (see Figure 1). [sent-175, score-0.31]

75 Ground truth for latent variables and hyper-parameter are also compared to whenever possible. [sent-191, score-0.078]

76 The hyperparameters for Gamma distributions are set to a = b = r = s = 0. [sent-192, score-0.053]

77 If not mentioned, the grid size is 50 × 50, which is uniformly created around the ordinary least square (OLS) estimates of hyper-parameters. [sent-194, score-0.101]

78 While mean-ﬁeld VB approximates the posterior mode well, the posterior variance can be (sometimes severely) underestimated, see Figure 1(e), (f ). [sent-259, score-0.276]

79 Since we have analytically approximated p(x|y) by a ﬁnite mixture of normal distribution q(x|y, θ) with mixing weights q(θ|y), the posterior marginals for the latent variables: q(xj |y) are easily accessible from this analytical representation. [sent-260, score-0.264]

80 Perhaps surprisingly, both INF-VB and mean-ﬁeld VB provide quite accurate marginal distributions q(xj |y), see Figure 1(j)-(h) for examples. [sent-261, score-0.059]

81 The differences in the tails of q(θ|y) between INF-VB and mean-ﬁeld VB yield negligible differences in the marginal distributions q(xj |y), when θ is integrated out. [sent-262, score-0.086]

82 In Figure 2, we show accurate marginal posteriors of hyperparameters q(σ 2 |y) and q(λ2 |y) as well as marginals of coefﬁcients q(xj |y), j = 1, . [sent-266, score-0.195]

83 1 (l) Figure 2: Posterior marginals of hyperparameters: (a) q(σ 2 |y) and (b)q(λ2 |y); posterior marginals of coefﬁcients: (c)-(l) q(xj |y) (j = 1, . [sent-317, score-0.195]

84 3 Comparison: Accuracy and Speed We quantitatively measure the quality of the approximate joint probability q(x, θ|y) provided by our non-factorized variational methods, and compare them to VB under factorization assumptions. [sent-321, score-0.232]

85 The KL divergence KL(q(x, θ|y)|p(x, θ|y)) is not directly available; instead, we compare the negative evidence lower bound (1), which can be evaluated analytically in our case and differs from the KL divergence only up to a constant. [sent-322, score-0.276]

86 grid size; (a), (b) for the Diabetes dataset (n = 442, p = 10). [sent-330, score-0.08]

87 (c), (d) for the Prostate cancer dataset (n = 97, p = 8) The quality of variational methods depends on the ﬂexibility of variational distributions. [sent-331, score-0.441]

88 See from Figure 3, the accuracy of INF-VB method with a 1×1 grid is worse than mean-ﬁeld VB, which corresponds to the partial Bayesian learning of q(x|y, θ) with a ﬁxed θ. [sent-333, score-0.08]

89 As the grid size increases, the accuracies of INF-VB (even those without gradient steps) also increase and are in general of better quality than mean-ﬁeld VB, in the sense of negative ELBO (KL divergence up to a constant). [sent-334, score-0.189]

90 The computational complexities of INF-VB, mean-ﬁeld VB, and MCMC methods are proportional to the grid size, number of iterations toward local optimum, and the number of runs, respectively. [sent-335, score-0.08]

91 Since the computations on the grid are independent, INF-VB is highly parallelizable, which is an important feature as more multiprocessor computational power becomes available. [sent-336, score-0.08]

92 Besides, one may further reduce its computational load by choosing grid points more economically, which will be pursued in our next step. [sent-337, score-0.096]

93 5 Discussion We have provided a ﬂexible framework for approximate inference of the full posterior p(x, θ|y) based on a hybrid continuous-discrete variational distribution, which is optimal in the sense of the KL divergence. [sent-340, score-0.454]

94 While we have used Bayesian Lasso as an example, our inference method is generically applicable. [sent-342, score-0.052]

95 One can also approximate p(x|y, θ) using other methods, such as scalable variational methods [28], or improved EP [29]. [sent-343, score-0.232]

96 The posterior p(θ|y), which is analyzed based on a grid approximation, enables users to do both model averaging and model selection, depending on speciﬁc purposes. [sent-344, score-0.205]

97 The number of hyperparameters θ should be no more than 5 to 6, which is the same fundamental limitation of INLA. [sent-347, score-0.053]

98 Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. [sent-401, score-0.169]

99 Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. [sent-416, score-0.304]

100 Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. [sent-445, score-0.282]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vb', 0.693), ('qd', 0.302), ('inf', 0.257), ('inla', 0.257), ('variational', 0.213), ('mcmc', 0.158), ('kl', 0.141), ('ols', 0.134), ('posterior', 0.125), ('ln', 0.119), ('laplace', 0.113), ('lasso', 0.104), ('divergence', 0.093), ('bayesian', 0.082), ('grid', 0.08), ('elbo', 0.076), ('dj', 0.064), ('argmin', 0.061), ('tr', 0.061), ('dx', 0.059), ('dxd', 0.057), ('hyperparameters', 0.053), ('integrated', 0.053), ('inference', 0.052), ('posteriors', 0.048), ('hybrid', 0.045), ('qla', 0.043), ('integration', 0.042), ('diabetes', 0.04), ('eq', 0.039), ('truth', 0.039), ('latent', 0.039), ('elapsed', 0.038), ('qg', 0.038), ('eld', 0.037), ('approximation', 0.036), ('ground', 0.035), ('prostate', 0.035), ('marginals', 0.035), ('marginal', 0.033), ('gaussian', 0.032), ('xj', 0.032), ('durham', 0.031), ('hyperparameter', 0.031), ('analytically', 0.03), ('upper', 0.03), ('cct', 0.029), ('challis', 0.029), ('lgms', 0.029), ('bayes', 0.028), ('tightness', 0.028), ('multimodal', 0.028), ('mode', 0.026), ('accurate', 0.026), ('dependencies', 0.025), ('bound', 0.025), ('duke', 0.025), ('djj', 0.025), ('secaucus', 0.025), ('nested', 0.025), ('discretization', 0.024), ('credible', 0.023), ('springerverlag', 0.023), ('intensive', 0.023), ('hj', 0.023), ('jensen', 0.023), ('nc', 0.021), ('square', 0.021), ('deterministic', 0.02), ('vg', 0.02), ('numerical', 0.02), ('mixture', 0.019), ('approximate', 0.019), ('sampler', 0.019), ('evidence', 0.018), ('minimizer', 0.018), ('reliable', 0.018), ('johnstone', 0.018), ('carlo', 0.018), ('gamma', 0.018), ('preserve', 0.018), ('monte', 0.018), ('parametric', 0.018), ('gibbs', 0.017), ('evaluated', 0.017), ('approximations', 0.017), ('chain', 0.016), ('cients', 0.016), ('normal', 0.016), ('gradient', 0.016), ('sp', 0.016), ('texts', 0.016), ('load', 0.016), ('obtains', 0.016), ('gram', 0.015), ('cancer', 0.015), ('runs', 0.015), ('eigenvalues', 0.015), ('parallel', 0.015), ('approximating', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 143 nips-2013-Integrated Non-Factorized Variational Inference

Author: Shaobo Han, Xuejun Liao, Lawrence Carin

2 0.26226062 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi

Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difﬁculty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difﬁculty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efﬁcient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We ﬁrst prove a key property that the VBLRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efﬁciency, we also propose an efﬁcient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1

3 0.24921578 60 nips-2013-Buy-in-Bulk Active Learning

Author: Liu Yang, Jaime Carbonell

Abstract: In many practical applications of active learning, it is more cost-effective to request labels in large batches, rather than one-at-a-time. This is because the cost of labeling a large batch of examples at once is often sublinear in the number of examples in the batch. In this work, we study the label complexity of active learning algorithms that request labels in a given number of batches, as well as the tradeoff between the total number of queries and the number of rounds allowed. We additionally study the total cost sufﬁcient for learning, for an abstract notion of the cost of requesting the labels of a given number of examples at once. In particular, we ﬁnd that for sublinear cost functions, it is often desirable to request labels in large batches (i.e., buying in bulk); although this may increase the total number of labels requested, it reduces the total cost required for learning. 1

4 0.24215403 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-speciﬁed approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by ﬁtting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

5 0.2364879 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek

Abstract: The Bayesian online change point detection (BOCPD) algorithm provides an efﬁcient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model’s posterior predictives, which can only be computed online in O(1) time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efﬁcient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution. Change point detection has been applied to many applications [5; 7]. In recent years there have been great improvements to the Bayesian approaches via the Bayesian online change point detection algorithm (BOCPD) [1; 23; 27]. Likewise, the radar tracking community has been improving in its use of feature-aided tracking [10]: methods that use auxiliary information from radar returns such as signal-to-noise ratio (SNR), which depend on radar cross sections (RCS) [21]. Older systems would often ﬁlter only noisy position (and perhaps Doppler) measurements while newer systems use more information to improve performance. We use BOCPD for modeling the RCS feature. Whereas BOCPD inference could be done exactly when ﬁnding change points in conjugate exponential family models the physics of RCS measurements often causes them to be distributed in non-exponential family ways, often following a Rice distribution. To do inference efﬁciently we call upon variational Bayes (VB) to ﬁnd approximate posterior (predictive) distributions. Furthermore, the nature of both BOCPD and tracking require the use of online updating. We improve upon the existing and limited approaches to online VB [24; 13]. This paper produces contributions to, and builds upon background from, three independent areas: change point detection, variational Bayes, and radar tracking. Although the emphasis in machine learning is on ﬁltering, a substantial part of tracking with radar data involves data association, illustrated in Figure 1. Observations of radar returns contain measurements from multiple objects (targets) in the sky. If we knew which radar return corresponded to which target we would be presented with NT ∈ N0 independent ﬁltering problems; Kalman ﬁlters [14] (or their nonlinear extensions) are applied to “average out” the kinematic errors in the measurements (typically positions) using the measurements associated with each target. The data association problem is to determine which measurement goes to which track. In the classical setup, once a particular measurement is associated with a certain target, that measurement is plugged into the ﬁlter for that target as if we knew with certainty it was the correct assignment. The association algorithms, in effect, ﬁnd the maximum a posteriori (MAP) estimate on the measurement-to-track association. However, approaches such as the joint probabilistic data association (JPDA) ﬁlter [2] and the probability hypothesis density (PHD) ﬁlter [16] have deviated from this. 1 To ﬁnd the MAP estimate a log likelihood of the data under each possible assignment vector a must be computed. These are then used to construct cost matrices that reduce the assignment problem to a particular kind of optimization problem (the details of which are beyond the scope of this paper). The motivation behind feature-aided tracking is that additional features increase the probability that the MAP measurement-to-track assignment is correct. Based on physical arguments the RCS feature (SNR) is often Rice distributed [21, Ch. 3]; although, in certain situations RCS is exponential or gamma distributed [26]. The parameters of the RCS distribution are determined by factors such as the shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS characteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS features may help distinguish one aircraft from another if they cross paths or come near one another, for example. RCS also helps distinguish genuine aircraft returns from clutter: a ﬂock of birds or random electrical noise, for example. However, the parameters of the RCS distributions may also change for the same aircraft due to a change in angle or ground conditions. These must be taken into account for accurate association. Providing good predictions in light of a possible sudden change in the parameters of a time series is “right up the alley” of BOCPD and change point methods. The original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family models for time series. In this paper, we expand the set of applications of BOCPD to radar SNR data which often has the same change point structure found in other applications, and requires online predictions. The BOCPD model is highly modular in that it looks for changes in the parameters of any underlying process model (UPM). The UPM merely needs to provide posterior predictive probabilities, the UPM can otherwise be a “black box.” The BOCPD queries the UPM for a prediction of the next data point under each possible run length, the number of points since the last change point. If (and only if by Hipp [12]) the UPM is exponential family (with a conjugate prior) the posterior is computed by accumulating the sufﬁcient statistics since the last potential change point. This allows for O(1) UPM updates in both computation and memory as the run length increases. We motivate the use of VB for implementing UPMs when the data within a regime is believed to follow a distribution that is not exponential family. The methods presented in this paper can be used to ﬁnd variational run length posteriors for general non-exponential family UPMs in addition to the Rice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are applicable in areas outside of change point detection. Likelihood clutter (birds) track 1 (747) track 2 (EMB 110) 0 5 10 15 20 SNR Figure 1: Illustrative example of a tracking scenario: The black lines (−) show the true tracks while the red stars (∗) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions on the states are shown as blue ellipses. The current (+) and previous (×) measurements are connected to their associated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (·). The distributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right. To our knowledge this paper is the ﬁrst to demonstrate how to compute Bayesian posterior distributions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15], which computes a MAP estimate. Other novel factors of this paper include: demonstrating the usefulness (and advantages over existing techniques) of change point detection for RCS estimation and tracking; and applying variational inference for UPMs where analytic posterior predictives are not possible. This paper provides four main technical contributions: 1) VB inference for inferring the parameters of a Rice distribution. 2) General improvements to online VB (which is then applied to updating the UPM in BOCPD). 3) Derive a VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handle censored measurements (particularly for a Rice distribution) in VB. This is key for processing missed detections in data association. 2 1 Background In this section we brieﬂy review the three areas of background: BOCPD, VB, and tracking. 1.1 Bayesian Online Change Point Detection We brieﬂy summarize the model setup and notation for the BOCPD algorithm; see [27, Ch. 5] for a detailed description. We assume we have a time series with n observations so far y1 , . . . , yn ∈ Y. In effect, BOCPD performs message passing to do online inference on the run length rn ∈ 0:n − 1, the number of observations since the last change point. Given an underlying predictive model (UPM) and a hazard function h, we can compute an exact posterior over the run length rn . Conditional on a run length, the UPM produces a sequential prediction on the next data point using all the data since the last change point: p(yn |y(r) , Θm ) where (r) := (n − r):(n − 1). The UPM is a simpler model where the parameters θ change at every change point and are modeled as being sampled from a prior with hyper-parameters Θm . The canonical example of a UPM would be a Gaussian whose mean and variance change at every change point. The online updates are summarized as: P (rn |rn−1 ) p(yn |rn−1 , y(r) ) p(rn−1 , y1:n−1 ) . msgn := p(rn , y1:n ) = rn−1 hazard UPM (1) msgn−1 Unless rn = 0, the sum in (1) only contains one term since the only possibility is that rn−1 = rn −1. The indexing convention is such that if rn = 0 then yn+1 is the ﬁrst observation sampled from the new parameters θ. The marginal posterior predictive on the next data point is easily calculated as: p(yn+1 |y1:n ) = p(yn+1 |y(r) )P (rn |y1:n ) . (2) rn Thus, the predictions from BOCPD fully integrate out any uncertainty in θ. The message updates (1) perform exact inference under a model where the number of change points is not known a priori. BOCPD RCS Model We show the Rice UPM as an example as it is required for our application. The data within a regime are assumed to be iid Rice observations, with a normal-gamma prior: yn ∼ Rice(ν, σ) , ν ∼ N (µ0 , σ 2 /λ0 ) , σ −2 =: τ ∼ Gamma(α0 , β0 ) (3) 2 =⇒ p(yn |ν, σ) = yn τ exp(−τ (yn + ν 2 )/2)I0 (yn ντ )I{yn ≥ 0} (4) where I0 (·) is a modiﬁed Bessel function of order zero, which is what excludes the Rice distribution from the exponential family. Although the normal-gamma is not conjugate to a Rice it will enable us to use the VB-EM algorithm. The UPM parameters are the Rice shape1 ν ∈ R and scale σ ∈ R+ , θ := {ν, σ}, and the hyper-parameters are the normal-gamma parameters Θm := {µ0 , λ0 , α0 , β0 }. Every change point results in a new value for ν and σ being sampled. A posterior on θ is maintained for each run length, i.e. every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efﬁcient. Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes rn (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be deﬁned for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efﬁcient O(1) updating in exponential family models by using a conjugate prior and accumulating sufﬁcient statistics is not possible. This motivates the use of VB methods for approximating the UPM predictions. 1.2 Variational Bayes We follow the framework of VB where when computation of the exact posterior distribution p(θ|y1:n ) is intractable it is often possible to create a variational approximation q(θ) that is locally optimal in terms of the Kullback-Leibler (KL) divergence KL(q p) while constraining q to be in a certain family of distributions Q. In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n ), using either gradient based methods or standard ﬁxed point equations. 1 The shape ν is usually assumed to be positive (∈ R+ ); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior. 3 The VB-EM Algorithm In many cases, such as the Rice UPM, the derivation of the VB ﬁxed point equations can be simpliﬁed by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n . A model is CE if: 1) The complete data likelihood p(x1:n , y1:n |θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n , y1:n |θ). We only have to constrain the posterior q(θ, x1:n ) = q(θ)q(x1:n ) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n |θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1). Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η := Eq(θ) [η] as correct: ¯ q(xi ) ← p(xi |yi , η = η ). 2) Find the posterior of the parameters using the expected sufﬁcient statis¯ ¯ tics S := Eq(x1:n ) [S(x1:n , y1:n )] as if they were the sufﬁcient statistics for the complete data set: ¯ q(θ) ← p(θ|S(x1:n , y1:n ) = S). The posterior will be of the same exponential family as the prior. 1.3 Tracking In this section we review data association, which along with ﬁltering constitutes tracking. In data association we estimate the association vectors a which map measurements to tracks. At each time NZ (n) step, n ∈ N1 , we observe NZ (n) ∈ N0 measurements, Zn = {zi,n }i=1 , which includes returns from both real targets and clutter (spurious measurements). Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R3 , or R4 with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an (i) = j if measurement i is associated with track j > 0; an (i) = 0 if measurement i is clutter. The inverse mapping a−1 maps tracks to measurements: meaning a−1 (an (i)) = i if an (i) = 0; and n n a−1 (i) = 0 ⇔ an (j) = i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, n Nc = 2, and a−1 = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa. In N D data association we jointly ﬁnd the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT (n) ∈ N0 total tracks as a known parameter: NT (n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc (n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc (n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n , an−1 , an |Z1:n−2 ) is: NT |Zi | n pi (za−1 (i),n , za−1 n n−1 i=1 (i),n−1 ) × λNc (i) exp(−λ)/|Zi |! i=n−1 p0 (zj,i )I{ai (j)=0} , (5) j=1 where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (ﬁlter); informally, pi (z) = pi (positions) × pi (RCS). If there is a missed detection, i.e. a−1 (i) = 0, we then use pi (za−1 (i),n ) = P (R < R0 ) under the RCS model for track i with no conn n tribution from positional (kinematic) component. Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form: ˆ (ˆn−1 , an ) = argmax log P (an−1 , an |Z1:n ) = argmax log p(Zn−1:n , an−1 , an |Z1:n−2 ) , (6) a (an−1 ,an ) (an−1 ,an ) which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efﬁciently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20]. If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0 . This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3. 4 2 Online Variational UPMs We cover the four technical challenges for implementing non-exponential family UPMs in an efﬁcient and online manner. We drop the index of the data point i when it is clear from context. 2.1 Variational Posterior for a Rice Distribution The Rice distribution has the property that x ∼ N (ν, σ 2 ) , y ∼ N (0, σ 2 ) =⇒ R = x2 + y 2 ∼ Rice(ν, σ) . (7) For simplicity we perform inference using R2 , as opposed to R, and transform accordingly: x ∼ N (ν, σ 2 ) , 1 R2 − x2 ∼ Gamma( 2 , τ ) , 2 τ := 1/σ 2 ∈ R+ =⇒ p(R2 , x) = p(R2 |x)p(x) = Gamma(R2 − x2 | 1 , τ )N (x|ν, σ 2 ) . 2 2 (8) The complete likelihood (8) is the product of two exponential family models and is exponential family itself, parameterized with base measure h and partition factor g: η = [ντ, −τ /2] , S = [x, R2 ] , h(R2 , x) = (2π R2 − x2 )−1 , g(ν, τ ) = τ exp(−ν 2 τ /2) . By inspection we see that the natural parameters η and sufﬁcient statistics S are the same as a Gaussian with unknown mean and variance. Therefore, we apply the normal-gamma prior on (ν, τ ) as it is the conjugate prior for the complete data likelihood. This allows us to apply the VB-EM 2 algorithm. We use yi := Ri as the VB observation, not Ri as in (3). In (5), z·,· (end) is the RCS R. VB M-Step We derive the posterior updates to the parameters given expected sufﬁcient statistics: n λ0 µ0 + i E[xi ] , λn = λ0 + n , αn = α0 + n , λ0 + n i=1 n n 1 1 nλ0 1 βn = β0 + (E[xi ] − x)2 + ¯ (¯ − µ0 )2 + x R2 − E[xi ]2 . 2 i=1 2 λ0 + n 2 i=1 i x := ¯ E[xi ]/n , µn = (9) (10) This is the same as an observation from a Gaussian and a gamma that share a (inverse) scale τ . 2 2 ¯ VB E-Step We then must ﬁnd both expected sufﬁcient statistics S. The expectation E[Ri |Ri ] = 2 2 Ri trivially; leaving E[xi |Ri ]. Recall that the joint on (x, y ) is a bivariate normal; if we constrain the radius to R, the angle ω will be distributed by a von Mises (VM) distribution. Therefore, ω := arccos(x/R) ∼ VM(0, κ) , κ = R E[ντ ] =⇒ E[x] = R E[cos ω] = RI1 (κ)/I0 (κ) , (11) where computing κ constitutes the VB E-step and we have used the trigonometric moment on ω [18]. This completes the computations required to do the VB updates on the Rice posterior. Variational Lower Bound For completeness, and to assess convergence, we derive the VB lower bound L(q). Using the standard formula [4] for L(q) = Eq [log p(y1:n , x1:n , θ)] + H[q] we get: n 2 1 E[log τ /2] − 1 E[τ ]Ri + (E[ντ ] − κi /Ri )E[xi ] − 2 E[ν 2 τ ] + log I0 (κi ) − KL(q p) , 2 (12) i=1 where p in the KL is the prior on (ν, τ ) which is easy to compute as q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates. 2.2 Online Variational Inference In Section 2.1 we derived an efﬁcient way to compute the variational posterior for a Rice distribution for a ﬁxed data set. However, as is apparent from (1) we need online predictions from the UPM; we must be able to update the posterior one data point at a time. When the UPM is exponential family and we can compute the posterior exactly, we merely use the posterior from the previous step as the prior. However, since we are only computing a variational approximation to the posterior, using the previous posterior as the prior does not give the exact same answer as re-computing the posterior from batch. This gives two obvious options: 1) recompute the posterior from batch every update at O(n) cost or 2) use the previous posterior as the prior at O(1) cost and reduced accuracy. 5 The difference between the options is encapsulated by looking at the expected sufﬁcient statistics: n ¯ S = i=1 Eq(xi |y1:n ) [S(xi , yi )]. Naive online updating uses old expected sufﬁcient statistics whose n ¯ posterior effectively uses S = i=1 Eq(xi |y1:i ) [S(xi , yi )]. We get the best of both worlds if we adjust those estimates over time. We in fact can do this if we project the expected sufﬁcient statistics into a “feature space” in terms of the expected natural parameters. For some function f , q(xi ) = p(xi |yi , η = η ) =⇒ Eq(xi |y1:n ) [S(xi , yi )] = f (yi , η ) . ¯ ¯ If f is piecewise continuous then we can represent it with an inner product [8, Sec. 2.1.6] n n ¯ f (yi , η ) = φ(¯) ψ(yi ) =⇒ S = ¯ η φ(¯) ψ(yi ) = φ(¯) η η ψ(yi ) , i=1 i=1 (13) (14) where an inﬁnite dimensional φ and ψ may be required for exact representation, but can be approximated by a ﬁnite inner product. In the Rice distribution case we use (11) f (yi , η ) = E[xi ] = Ri I (Ri E[ντ ]) = Ri I ((Ri /µ0 ) µ0 E[ντ ]) , ¯ I (·) := I1 (·)/I0 (·) , (15) 2 Ri where recall that yi = and η1 = E[ντ ]. We can easily represent f with an inner product if we can ¯ represent I as an inner product: I (uv) = φ(u) ψ(v). We use unitless φi (u) = I (ci u) with c1:G as a log-linear grid from 10−2 to 103 and G = 50. We use a lookup table for ψ(v) that was trained to match I using non-negative least squares, which left us with a sparse lookup table. Online updating for VB posteriors was also developed in [24; 13]. These methods involved introducing forgetting factors to forget the contributions from old data points that might be detrimental to accuracy. Since the VB predictions are “embedded” in a change point method, they are automatically phased out if the posterior predictions become inaccurate making the forgetting factors unnecessary. 2.3 Censored Data As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely have to compute the expected sufﬁcient statistics given the censored measurement: E[S|R < R0 ]. The expected sufﬁcient statistic from (11) is now: R0 E[x|R < R0 ] = 0 ν ν E[x|R]p(R)dR RiceCDF (R0 |ν, τ ) = ν(1 − Q2 ( σ , R0 ))/(1 − Q1 ( σ , R0 )) , σ σ where QM is the Marcum Q function [17] of order M . Similar updates for E[S|R < R0 ] are possible for exponential or gamma UPMs, but are not shown as they are relatively easy to derive. 2.4 Variational Run Length Posteriors: Predictive Log Likelihoods Both updating the BOCPD run length posterior (1) and ﬁnding the marginal predictive log likelihood of the next point (2) require calculating the UPM’s posterior predictive log likelihood log p(yn+1 |rn , y(r) ). The marginal posterior predictive from (2) is used in data association (6) and benchmarking BOCPD against other methods. However, the exact posterior predictive distribution obtained by integrating the Rice likelihood against the VB posterior is difﬁcult to compute. We can break the BOCPD update (1) into a time and measurement update. The measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(rn |y1:n ): p(rn |y1:n+1 ) ∝ p(yn+1 |rn , y(r) )p(rn |y1:n ) . (16) Using the BMC results in Bishop [4, Sec. 10.1.4] we ﬁnd a variational posterior on the run length by using the variational lower bound for each run length Li (q) ≤ log p(yn+1 |rn = i, y(r) ), calculated using (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q is of the form: q(rn , θ, x) = qUPM (θ, x|rn )q(rn ) =⇒ q(rn = i) = exp(Li (q))p(rn = i|y1:n )/ exp(L(q)) , (17) where qUPM contains whatever constraints we used to compute Li (q). The normalizer on q(rn ) serves as a joint VB lower bound: L(q) = log i exp(Li (q))p(rn = i|y1:n ) ≤ log p(yn+1 |y1:n ). Note that the conditional factorization is different than the typical independence constraint on q. Furthermore, we derive the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment and constrain the assignment posterior to be a point mass. In the 2D assignment case, for example, ˆ q(an , X1:NT ) = q(X1:NT |an )q(an ) = q(X1:NT |an )I{an = an } , (18) 6 2 10 0 10 −1 10 −2 10 10 20 30 40 50 RCS RMSE (dBsm) RCS RMSE (dBsm) 10 KL (nats) 5 10 1 8 6 4 2 3 2 1 0 0 0 100 200 Sample Size (a) Online Updating 4 300 Time (b) Exponential RCS 400 0 100 200 300 400 Time (c) Rice RCS Figure 2: Left: KL from naive updating ( ), Sato’s method [24] ( ), and improved online VB (◦) to the batch VB posterior vs. sample size n; using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue) and τ = 1. Middle: The RMSE (dB scale) of the estimate on the mean RCS distribution E[Rn ] is plotted for an exponential RCS model. The curves are BOCPD (blue), IMM (black), identity (magenta), α-ﬁlter (green), and median ﬁlter (red). Right: Same as the middle but for the Rice RCS case. The dashed lines are 95% conﬁdence intervals. where each track’s Xi represents all the latent variables used to compute the variational lower bound on log p(zj,n |an (j) = i). In the BOCPD case, Xi := {rn , x, θ}. The resulting VB ﬁxed point ˆ equations ﬁnd the posterior on the latent variables Xi by taking an as the true assignment and solving ˆ the VB problem of (17); the assignment an is found by using (6) and taking the joint BOCPD lower bound L(q) as a proxy for the BOCPD predictive log likelihood component of log pi in (5). 3 3.1 Results Improved Online Solution We ﬁrst demonstrate the accuracy of the online VB approximation (Section 2.2) on a Rice estimation example; here, we only test the VB posterior as no change point detection is applied. Figure 2(a) compares naive online updating, Sato’s method [24], and our improved online updating in KL(online batch) of the posteriors for three different true parameters ν as sample size n increases. The performance curves are the KL divergence between these online approximations to the posterior and the batch VB solution (i.e. restarting VB from “scratch” every new data point) vs sample size. The error for our method stays around a modest 10−2 nats while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]. Sato’s method tends to settle in around a 1 nat approximation error. The recommended annealing schedule, i.e. forgetting factors, in [24] performed worse than naive updating. We did a grid search over annealing exponents and show the results for the best performing schedule of n−0.52 . By contrast, our method does not require the tuning of an annealing schedule. 3.2 RCS Estimation Benchmarking We now compare BOCPD with other methods for RCS estimation. We use the same experimental example as Slocumb and Klusman III [25], which uses an augmented interacting multiple model (IMM) based method for estimating the RCS; we also compare against the same α-ﬁlter and median ﬁlter used in [25]. As a reference point, we also consider the “identity ﬁlter” which is merely an unbiased ﬁlter that uses only yn to estimate the mean RCS E[Rn ] at time step n. We extend this example to look at Rice RCS in addition to the exponential RCS case. The bias correction constants in the IMM were adjusted for the Rice distribution case as per [25, Sec. 3.4]. The results on exponential distributions used in [25] and the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR of multiples of ±10 dB, which is exactly what is presented in the example (a sequence of 20, 10, 30, and 10 dB). In [25] the authors mention that the IMM reaches an RMSE “ﬂoor” at 2 dB, yet BOCPD continues to drop as low as 0.56 dB. The RMSE from BOCPD does not spike nearly as high as the other methods upon a change in E[Rn ]. The α-ﬁlter and median ﬁlter appear worse than both the IMM and BOCPD. The RMSE and conﬁdence intervals are calculated from 5000 runs of the experiment. 7 45 80 40 30 Northing (km) Improvement (%) 35 25 20 15 10 5 60 40 20 0 0 −5 1 2 3 4 −20 5 Difficulty 0 20 40 60 80 100 Easting (km) (a) SIAP Metrics (b) Heathrow (LHR) Figure 3: Left: Average relative improvements (%) for SIAP metrics: position accuracy (red ), velocity accuracy (green ), and spurious tracks (blue ◦) across difﬁculty levels. Right: LHR: true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). The standard tracker has spurious tracks over east London and near Ipswich. Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies) 3.3 Flightradar24 Tracking Problem Finally, we used real ﬂight trajectories from ﬂightradar24 and plugged them into our 3D tracking algorithm. We compare tracking performance between using our BOCPD model and the relatively standard constant probability of detection (no RCS) [2, Sec. 3.5] setup. We use the single integrated air picture (SIAP) metrics [6] to demonstrate the improved performance of the tracking. The SIAP metrics are a standard set of metrics used to compare tracking systems. We broke the data into 30 regions during a one hour period (in Sept. 2012) sampled every 5 s, each within a 200 km by 200 km area centered around the world’s 30 busiest airports [22]. Commercial airport trafﬁc is typically very orderly and does not allow aircraft to ﬂy close to one another or cross paths. Feature-aided tracking is most necessary in scenarios with a more chaotic air situation. Therefore, we took random subsets of 10 ﬂight paths and randomly shifted their start time to allow for scenarios of greater interest. The resulting SIAP metric improvements are shown in Figure 3(a) where we look at performance by a difﬁculty metric: the number of times in a scenario any two aircraft come within ∼400 m of each other. The biggest improvements are seen for difﬁculties above three where positional accuracy increases by 30%. Signiﬁcant improvements are also seen for velocity accuracy (11%) and the frequency of spurious tracks (6%). Signiﬁcant performance gains are seen at all difﬁculty levels considered. The larger improvements at level three over level ﬁve are possibly due to some level ﬁve scenarios that are not resolvable simply through more sophisticated models. We demonstrate how our RCS methods prevent the creation of spurious tracks around London Heathrow in Figure 3(b). 4 Conclusions We have demonstrated that it is possible to use sophisticated and recent developments in machine learning such as BOCPD, and use the modern inference method of VB, to produce demonstrable improvements in the much more mature ﬁeld of radar tracking. We ﬁrst closed a “hole” in the literature in Section 2.1 by deriving variational inference on the parameters of a Rice distribution, with its inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that it is possible to use these variational UPMs for non-exponential family models in BOCPD without sacriﬁcing its modular or online nature. The improvements in online VB are extendable to UPMs besides a Rice distribution and more generally beyond change point detection. We can use the variational lower bound from the UPM and obtain a principled variational approximation to the run length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, which is in large contrast to the tracking literature. More algorithms from the tracking literature can possibly be cast in various machine learning frameworks, such as VB, and improved upon from there. 8 References [1] Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK. [2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing. [3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464. [4] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer. [5] Braun, J. V., Braun, R., and M¨ ller, H.-G. (2000). Multiple changepoint ﬁtting via quasilikelihood, with u application to DNA sequence segmentation. Biometrika, 87(2):301–314. [6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC. [7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the Americal Statistical Association, 92(438):739–747. [8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience. [9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38. [10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In Information Fusion, 2006 9th International Conference on, pages 1–8. IEEE. [11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605. [12] Hipp, C. (1974). Sufﬁcient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292. [13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808. [14] Kalman, R. E. (1960). A new approach to linear ﬁltering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45. [15] Lauwers, L., Barb´ , K., Van Moer, W., and Pintelon, R. (2009). Estimating the parameters of a Rice e distribution: A Bayesian approach. In Instrumentation and Measurement Technology Conference, 2009. I2MTC’09. IEEE, pages 114–117. IEEE. [16] Mahler, R. (2003). Multi-target Bayes ﬁltering via ﬁrst-order multi-target moments. IEEE Trans. AES, 39(4):1152–1178. [17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA. [18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York. [19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby computational neuroscience unit, University College London, London, UK. [20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics. [21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Pub. [22] Rogers, S. (2012). The world’s top 100 airports: listed, ranked and mapped. The Guardian. [23] Saatci, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th ¸ International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress. [24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681. [25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics. [26] Swerling, P. (1954). Probability of detection for ﬂuctuating targets. Technical Report RM-1217, Rand Corporation. [27] Turner, R. (2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK. 9

6 0.12513146 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

7 0.1200863 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

8 0.10800646 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

9 0.097847797 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

10 0.091531008 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

11 0.084010735 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

12 0.078405343 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

13 0.071153797 161 nips-2013-Learning Stochastic Inverses

14 0.067981206 109 nips-2013-Estimating LASSO Risk and Noise Level

15 0.06680233 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

16 0.065284073 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

17 0.064770333 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

18 0.063937329 196 nips-2013-Modeling Overlapping Communities with Node Popularities

19 0.063701898 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

20 0.062490076 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.157), (1, 0.055), (2, 0.001), (3, 0.008), (4, 0.009), (5, 0.178), (6, 0.152), (7, 0.043), (8, 0.098), (9, -0.02), (10, 0.01), (11, 0.134), (12, -0.087), (13, 0.01), (14, -0.062), (15, -0.219), (16, -0.038), (17, -0.058), (18, 0.002), (19, -0.191), (20, -0.065), (21, -0.063), (22, -0.041), (23, 0.116), (24, -0.229), (25, 0.011), (26, -0.002), (27, 0.039), (28, 0.178), (29, 0.004), (30, 0.1), (31, -0.209), (32, -0.12), (33, 0.107), (34, -0.061), (35, 0.152), (36, -0.089), (37, 0.008), (38, -0.07), (39, 0.008), (40, -0.02), (41, 0.24), (42, 0.06), (43, 0.057), (44, -0.095), (45, 0.016), (46, -0.079), (47, -0.018), (48, -0.065), (49, 0.084)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9640882 143 nips-2013-Integrated Non-Factorized Variational Inference

Author: Shaobo Han, Xuejun Liao, Lawrence Carin

2 0.84046137 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek

3 0.78321159 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi

4 0.71126246 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

5 0.52199543 60 nips-2013-Buy-in-Bulk Active Learning

Author: Liu Yang, Jaime Carbonell

6 0.50997341 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

7 0.39251575 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

8 0.38665891 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

9 0.38248277 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

10 0.37922674 86 nips-2013-Demixing odors - fast inference in olfaction

11 0.36681482 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

12 0.36231065 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

13 0.3597222 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

14 0.35051134 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

15 0.34421906 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

16 0.31805542 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

17 0.31384644 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

18 0.31347176 161 nips-2013-Learning Stochastic Inverses

19 0.30836213 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

20 0.29756796 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.038), (33, 0.14), (34, 0.491), (41, 0.028), (49, 0.023), (56, 0.07), (70, 0.015), (85, 0.024), (89, 0.016), (93, 0.028), (95, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98935795 256 nips-2013-Probabilistic Principal Geodesic Analysis

Author: Miaomiao Zhang, P.T. Fletcher

Abstract: Principal geodesic analysis (PGA) is a generalization of principal component analysis (PCA) for dimensionality reduction of data on a Riemannian manifold. Currently PGA is deﬁned as a geometric ﬁt to the data, rather than as a probabilistic model. Inspired by probabilistic PCA, we present a latent variable model for PGA that provides a probabilistic framework for factor analysis on manifolds. To compute maximum likelihood estimates of the parameters in our model, we develop a Monte Carlo Expectation Maximization algorithm, where the expectation is approximated by Hamiltonian Monte Carlo sampling of the latent variables. We demonstrate the ability of our method to recover the ground truth parameters in simulated sphere data, as well as its effectiveness in analyzing shape variability of a corpus callosum data set from human brain images. 1

2 0.98585135 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

Author: Zhenwen Dai, Georgios Exarchakis, Jörg Lücke

Abstract: We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the ﬁrst time apply a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We ﬁrst investigated encodings learned by the model using artiﬁcial data with mutually occluding components. We ﬁnd that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive ﬁelds associated with the model’s hidden units. We ﬁnd many Gabor-like or globular receptive ﬁelds as well as ﬁelds sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efﬁciently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex. 1

3 0.97667849 210 nips-2013-Noise-Enhanced Associative Memories

Author: Amin Karbasi, Amir Hesam Salavati, Amin Shokrollahi, Lav R. Varshney

Abstract: Recent advances in associative memory design through structured pattern sets and graph-based inference algorithms allow reliable learning and recall of exponential numbers of patterns. Though these designs correct external errors in recall, they assume neurons compute noiselessly, in contrast to highly variable neurons in hippocampus and olfactory cortex. Here we consider associative memories with noisy internal computations and analytically characterize performance. As long as internal noise is less than a speciﬁed threshold, error probability in the recall phase can be made exceedingly small. More surprisingly, we show internal noise actually improves performance of the recall phase. Computational experiments lend additional support to our theoretical analysis. This work suggests a functional beneﬁt to noisy neurons in biological neuronal networks. 1

4 0.97621202 129 nips-2013-Generalized Random Utility Models with Multiple Types

Author: Hossein Azari Soufiani, Hansheng Diao, Zhenyu Lai, David C. Parkes

Abstract: We propose a model for demand estimation in multi-agent, differentiated product settings and present an estimation algorithm that uses reversible jump MCMC techniques to classify agents’ types. Our model extends the popular setup in Berry, Levinsohn and Pakes (1995) to allow for the data-driven classiﬁcation of agents’ types using agent-level data. We focus on applications involving data on agents’ ranking over alternatives, and present theoretical conditions that establish the identiﬁability of the model and uni-modality of the likelihood/posterior. Results on both real and simulated data provide support for the scalability of our approach. 1

5 0.97064811 202 nips-2013-Multiclass Total Variation Clustering

Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht

Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1

6 0.9667989 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

7 0.9568435 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

8 0.95186967 122 nips-2013-First-order Decomposition Trees

same-paper 9 0.94186336 143 nips-2013-Integrated Non-Factorized Variational Inference

10 0.94152713 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory

11 0.84625322 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

12 0.8319459 347 nips-2013-Variational Planning for Graph-based MDPs

13 0.83127689 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

14 0.80890393 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

15 0.80748135 86 nips-2013-Demixing odors - fast inference in olfaction

16 0.80015701 148 nips-2013-Latent Maximum Margin Clustering

17 0.79559696 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

18 0.79365724 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

19 0.79050106 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

20 0.78969926 317 nips-2013-Streaming Variational Bayes