nips nips2013 nips2013-115 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kohei Hayashi, Ryohei Fujimaki
Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). [sent-4, score-0.228]
2 FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). [sent-5, score-0.048]
3 Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. [sent-6, score-0.045]
4 FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. [sent-9, score-0.056]
5 1 Introduction Factorized asymptotic Bayesian (FAB) inference is a recently-developed approximate Bayesian inference method for model selection of latent variable models [5, 6]. [sent-10, score-0.228]
6 FAB inference maximizes a computationally tractable lower bound of a “factorized information criterion” (FIC), which converges to the marginal log-likelihood in the large-sample limit. [sent-11, score-0.086]
7 One of the interesting characteristics of FAB inference is that it estimates both models (e.g., the number of hidden states) and parameters. [sent-13, score-0.034]
8 With respect to the trade-off between controllability and automation, FAB inference places more importance on automation. [sent-18, score-0.05]
9 Although FAB inference is a promising model selection method, as yet it has only been applicable to models satisfying a specific condition: the Hessian matrix of the complete log-likelihood (i.e., of the log-likelihood over both observed and latent variables) must be block diagonal, with only a subset of the observed samples contributing to each individual sub-block. [sent-19, score-0.072] [sent-21, score-0.09]
11 Such models include basic latent variable models such as MMs [6]. [sent-22, score-0.074]
12 The application of FAB inference to more advanced models that do not satisfy the condition remains to be accomplished. [sent-23, score-0.048]
13 This paper extends an FAB framework to latent feature models (LFMs) [9, 17]. [sent-24, score-0.089]
14 Model selection for LFMs (i.e., determination of the dimensionality of latent features) has been addressed by NBP and VB methods [10, 3]. [sent-27, score-0.074]
15 Our asymptotic analysis of the Hessian matrix of the log-likelihood shows that FICs for LFMs have the same form as those for MMs, despite the fact that LFMs do not satisfy the condition explained above (see Lemma 1). [sent-29, score-0.075]
16 Consequently, like FAB/MMs, FAB/LFMs offer several desirable properties, such as FIC convergence to the marginal log-likelihood, automatic hidden states selection, and a monotonic increase in the FIC lower bound through iterative optimization. [sent-30, score-0.091]
17 Inspired by this analysis, 1) we propose a shrinkage acceleration method which drastically reduces computational cost in practice, and 2) we show that FAB/LFMs have parameter identifiability. [sent-32, score-0.145]
18 This analysis offers a natural guide to the merge post-processing of latent features. [sent-33, score-0.114]
19 Rigorous proofs and assumptions with respect to the main results are given in the supplementary materials. [sent-34, score-0.032]
20 1 Related Work FIC for MMs Suppose we have N × D observed data X and N × K latent variables Z. [sent-37, score-0.074]
21 FIC considers the following alternative representation of the marginal log-likelihood: $\log p(X|M) = \max_q \sum_Z q(Z) \log \frac{p(X, Z|M)}{q(Z)}, \quad p(X, Z|M) = \int p(X, Z|P)\, p(P|M)\, dP$, (1) where q(Z) is a variational distribution on Z; M and P are a model and its parameter, respectively. [sent-38, score-0.146]
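As a quick sanity check of Eq. (1) (an illustration added here, not code from the paper), the following Python sketch evaluates $\sum_Z q(Z) \log p(X,Z|M)/q(Z)$ for a toy discrete model with arbitrary numbers, and confirms that it equals $\log p(X|M)$ exactly when q(Z) is the true posterior and is smaller otherwise.

```python
import numpy as np

# Toy joint p(X, Z) over a binary latent Z for one fixed observation X.
# The two probabilities below are made up purely for illustration.
p_xz = np.array([0.3, 0.1])          # p(X, Z=0), p(X, Z=1)
log_px = np.log(p_xz.sum())          # marginal log-likelihood log p(X)

def lower_bound(q):
    """sum_Z q(Z) log p(X, Z)/q(Z); q is [q(Z=0), q(Z=1)]."""
    q = np.asarray(q, dtype=float)
    mask = q > 0
    return np.sum(q[mask] * (np.log(p_xz[mask]) - np.log(q[mask])))

posterior = p_xz / p_xz.sum()
print(log_px, lower_bound(posterior))        # equal: the bound is tight at the posterior
print(lower_bound([0.5, 0.5]) <= log_px)     # True: any other q gives a smaller value
```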
22 In the case of MMs, $\log p(X, Z|P)$ can be factorized into $\log p(Z)$ and $\log p(X|Z) = \sum_k \log p_k(X|z_{\cdot k})$, where $p_k$ is the k-th observation distribution (we here omit parameters for notational simplicity). [sent-39, score-0.326]
23 We can then approximate p(X, Z|M) by individually applying Laplace’s method [28] to $\log p(Z)$ and $\log p_k(X|z_{\cdot k})$: $p(X, Z|M) \approx p(X, Z|\hat{P}) \frac{(2\pi)^{D_Z/2}}{N^{D_Z/2} \det|F_Z|^{1/2}} \prod_{k=1}^{K} \frac{(2\pi)^{D_k/2}}{(\sum_n z_{nk})^{D_k/2} \det|F_k|^{1/2}}$, (2) where $\hat{P}$ is the maximum likelihood estimator (MLE) of p(X, Z|P). [sent-40, score-0.399]
24 $D_Z$ and $D_k$ are the parameter dimensionalities of $p(Z)$ and $p_k(X|z_{\cdot k})$, respectively. [sent-41, score-0.039]
25 $F_Z$ and $F_k$ are $-\nabla\nabla \log p(Z)|_{\hat{P}}/N$ and $-\nabla\nabla \log p_k(X|z_{\cdot k})|_{\hat{P}}/(\sum_n z_{nk})$, respectively. [sent-42, score-0.299]
26 Under conditions that allow asymptotic ignoring of $\log \det|F_Z|$ and $\log \det|F_k|$, substituting Eq. (2) into (1) gives the FIC for MMs as follows: [sent-43, score-0.256]
27 $\mathrm{FIC}_{\mathrm{MM}} \equiv \max_q \mathbb{E}_q\left[\log p(X, Z|\hat{P}) - \frac{D_Z}{2}\log N - \sum_k \frac{D_k}{2} \log \sum_n z_{nk}\right] + H(q)$, (3) where H(q) is the entropy of q(Z). [sent-44, score-0.307]
28 The most important term in $\mathrm{FIC}_{\mathrm{MM}}$ (3) is $\log(\sum_n z_{nk})$, which offers such theoretically desirable properties for FAB inference as automatic shrinkage of irrelevant latent variables and parameter identifiability [6]. [sent-45, score-0.42]
29 Direct optimization of $\mathrm{FIC}_{\mathrm{MM}}$ is difficult because: (i) evaluation of $\mathbb{E}_q[\log \sum_n z_{nk}]$ is computationally infeasible, and (ii) the MLE is not available in principle. [sent-46, score-0.166]
30 For (i), since $-\log \sum_n z_{nk}$ is a convex function, its linear approximation at $N\tilde{\pi}_k > 0$ yields the lower bound: $-\sum_k \frac{D_k}{2} \mathbb{E}_q\left[\log \sum_n z_{nk}\right] \ge -\sum_k \frac{D_k}{2} \left(\log N\tilde{\pi}_k + \frac{\sum_n \mathbb{E}_q[z_{nk}]/N - \tilde{\pi}_k}{\tilde{\pi}_k}\right)$, (4) where $0 < \tilde{\pi}_k \le 1$ is a linearization parameter. [sent-48, score-0.504]
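Eq. (4) is just the tangent-line inequality for the convex function $-\log$, applied at $N\tilde{\pi}_k$ and then averaged over q. The following sketch (an added illustration with arbitrary numbers, not from the paper) checks it by Monte Carlo for a single feature k:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_k = 1000, 3
mu = rng.uniform(0.05, 0.95, size=N)    # E_q[z_nk] under a factorized Bernoulli q
pi_tilde = 0.4                          # linearization parameter, 0 < pi_tilde <= 1

# Left-hand side for this k: -(D_k/2) E_q[log sum_n z_nk], estimated by sampling z ~ q.
z = rng.random((2000, N)) < mu
lhs = -(D_k / 2) * np.mean(np.log(z.sum(axis=1)))

# Right-hand side: -(D_k/2) [ log(N pi_tilde) + (sum_n E_q[z_nk]/N - pi_tilde) / pi_tilde ]
rhs = -(D_k / 2) * (np.log(N * pi_tilde) + (mu.sum() / N - pi_tilde) / pi_tilde)

print(lhs, rhs, lhs >= rhs)             # the linearized expression lower-bounds the true term
```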
31 For (ii), since, from the definition of the MLE, the inequality $\log p(X, Z|\hat{P}) \ge \log p(X, Z|P)$ holds for any P, we optimize P along with q. [sent-49, score-0.094]
32 Alternating maximization of the lower bound with respect to q, P, and $\tilde{\pi}$ guarantees a monotonic increase in the FIC lower bound [6]. [sent-50, score-0.093]
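The overall structure of the resulting procedure can be sketched as below. This is a schematic added here (not the authors' implementation); `update_params` and `update_q` stand for the model-specific M-step and E-step and are assumed to return the new parameters and the new FIC lower bound, respectively.

```python
import numpy as np

def fab_alternating_max(X, K, update_params, update_q, max_iter=100, tol=1e-4, seed=0):
    """Generic FAB loop: alternately maximize the FIC lower bound over the
    variational means mu (= E_q[z_nk]), the parameters P, and pi_tilde."""
    N = X.shape[0]
    mu = np.random.default_rng(seed).uniform(0.0, 1.0, size=(N, K))
    params, prev_bound = None, -np.inf
    for t in range(max_iter):
        pi_tilde = mu.sum(axis=0) / N                  # re-linearize at the current q
        params = update_params(X, mu, params)          # maximize over P
        mu, bound = update_q(X, mu, params, pi_tilde)  # maximize over q
        if (bound - prev_bound) / N < tol:             # stopping rule with tolerance tol
            break
        prev_bound = bound
    return mu, params
```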
33 The Indian buffet process (IBP) enables us to express an infinite number of latent features, making it possible to adjust model complexity on the basis of observations. [sent-52, score-0.074]
34 Since naive Gibbs sampling requires unrealistic computational cost, acceleration algorithms such as accelerated sampling [2] and VB [3] have been developed. [sent-58, score-0.078]
35 2 FIC and FAB Algorithm for LFMs LFMs assume underlying relationships for X with binary features $Z \in \{0, 1\}^{N \times K}$ and linear bases $W \in \mathbb{R}^{D \times K}$ such that, for n = 1, ..., N, $x_n = W z_n + \varepsilon_n$ with observation noise $\varepsilon_n$. [sent-64, score-0.029]
36 Z follows a Bernoulli prior distribution znk ∼ Bern(πk ) with a mean parameter πk . [sent-69, score-0.166]
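For concreteness, here is a small generator of synthetic LFM data (an added illustration; the isotropic Gaussian noise is an assumption of this sketch):

```python
import numpy as np

def sample_lfm(N=500, D=20, K=5, noise_std=0.5, seed=0):
    """Sample X = Z W^T + noise with binary features Z and linear bases W."""
    rng = np.random.default_rng(seed)
    pi = rng.uniform(0.2, 0.8, size=K)               # Bernoulli means pi_k
    Z = (rng.random((N, K)) < pi).astype(float)      # z_nk ~ Bern(pi_k)
    W = rng.normal(size=(D, K))                      # one basis column per feature
    X = Z @ W.T + noise_std * rng.normal(size=(N, D))
    return X, Z, W, pi

X, Z, W, pi = sample_lfm()
print(X.shape, np.round(Z.mean(axis=0), 2))          # activation rates approximate pi
```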
37 We assume the prior is asymptotically negligible, i.e., $\lim_{N \to \infty} \log p(P|M)/N = 0$. In the case of MMs, we implicitly use the fact that: A1) the parameters of $p_k(X|z_{\cdot k})$ are mutually independent for k = 1, ..., K (in other words, $\nabla\nabla \log p(X|Z)$ is block diagonal having K blocks), and A2) the number of observations which contribute to $\nabla\nabla \log p_k(X|z_{\cdot k})$ is $\sum_n z_{nk}$. [sent-74, score-0.086] [sent-77, score-0.315]
39 These conditions naturally yield the FAB regularization term $\log \sum_n z_{nk}$ by the Laplace approximation of MMs (2). [sent-78, score-0.246]
40 However, since $\theta_d$ (the parameters of the d-th observed dimension) is shared by all latent features in LFMs, A1 and A2 are not satisfied. [sent-79, score-0.103]
41 Under some mild assumptions (see the supplementary materials), the following equality holds: $\log \det|F^{(d)}| = \sum_k \log \frac{\sum_n z_{nk}}{N} + O_p(1)$. (6) [sent-87, score-0.326]
42 N k ∑ An important fact is that the log n znk term naturally appears in log det |F(d) | without A1 and A2. [sent-88, score-0.31]
43 Lemma 1 induces the following theorem, which states an asymptotic approximation of a marginal complete log-likelihood, log p(X, Z|M). [sent-89, score-0.133]
44 If Lemma 1 holds and the joint marginal log-likelihood is bounded for a sufficiently large N, it can be asymptotically approximated as: $\log p(X, Z|M) = J(Z, \hat{P}) + O_p(1)$, (7) with $J(Z, \hat{P}) \equiv \log p(X, Z|\hat{P}) - \frac{|P| - DK}{2} \log N - \frac{D}{2} \sum_k \log \sum_n z_{nk}$. (8) [sent-91, score-0.395]
45 It is worth noting that, if we instead evaluate the model complexity of $\theta_d$ ($\log \det|F^{(d)}|$) by N, Eq. (7) falls into the Bayesian Information Criterion [23]; this tells us that the model complexity relevant to $\theta_d$ increases not as $O(K \log N)$ but as $O(\sum_k \log \sum_n z_{nk})$. [sent-92, score-0.05] [sent-95, score-0.26]
47 This indicates the wide applicability of FICs and suggests that FIC representation of approximated marginal log-likelihoods is feasible not only for MMs but also for more general (discrete) latent variable models. [sent-99, score-0.098]
48 Since the $O_p(1)$ terms in Eq. (7) are not affected by the expectation over q(Z), the difference between the FIC and the marginal log-likelihood is asymptotically constant; in other words, the distance between $\log p(X|M)/N$ and $\mathrm{FIC}_{\mathrm{LFM}}/N$ is asymptotically small. [sent-101, score-0.105]
49 For tractability, we restrict the class of $q(z_n)$ to a factorized form: $q(z_n) = \prod_k \tilde{q}(z_{nk}|\mu_{nk})$, where $\tilde{q}(z|\mu)$ is a Bernoulli distribution with mean parameter $\mu = \mathbb{E}_q[z]$. [sent-108, score-0.06]
50 Although this approximation makes the FIC lower bound looser (the equality in (1) no longer holds), the variational distribution has a closed-form solution. [sent-109, score-0.056]
51 The VB-extension of IBP [3] also uses this factorized assumption. [sent-111, score-0.06]
52 The FIC lower bound to be maximized is $\mathcal{L}(q, P, \tilde{\pi}) \equiv \mathbb{E}_q[\log p(X|Z, \Theta) + \log p(Z|\pi) + \text{RHS of (4)}] - \frac{|P| - DK}{2} \log N + \sum_n H(q(z_n))$. (10) An FAB algorithm alternatingly maximizes $\mathcal{L}(q, P, \tilde{\pi})$ with respect to $\{\{\mu_n\}, P, \tilde{\pi}\}$. [sent-113, score-0.11]
53 This monotonic increase in L gives us a natural stopping condition with a tolerance δ: if (Lt − Lt−1 )/N < δ then stop the algorithm, where we denote the value of L at the t-th iteration by Lt . [sent-115, score-0.035]
54 FAB E-step In the FAB E-step, we update $\mu_n$ in a way similar to variational mean-field inference in a restricted Boltzmann machine [20]. [sent-116, score-0.083]
55 Update equation (11) is a form of coordinate descent, and every update is guaranteed to increase the lower bound [25]. [sent-118, score-0.049]
56 By iterating update (11) over k = 1, ..., K, we are able to obtain a local maximum with $\mathbb{E}_q[z_n] = \mu_n$ and $\mathbb{E}_q[z_n z_n^\top] = \mu_n \mu_n^\top + \mathrm{diag}(\mu_n - \mu_n^2)$. [sent-123, score-0.034]
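The second-moment formula can be verified directly by enumerating all binary vectors under the factorized Bernoulli q (a tiny check added here, not from the paper):

```python
import numpy as np
from itertools import product

mu = np.array([0.7, 0.2, 0.5])                       # mu_n for K = 3 features
K = len(mu)

second_moment = np.zeros((K, K))                     # brute-force E_q[z z^T]
for z_tuple in product([0, 1], repeat=K):
    z = np.array(z_tuple, dtype=float)
    prob = np.prod(mu**z * (1 - mu)**(1 - z))        # q(z) = prod_k Bern(z_k | mu_k)
    second_moment += prob * np.outer(z, z)

closed_form = np.outer(mu, mu) + np.diag(mu - mu**2)
print(np.allclose(second_moment, closed_form))       # True
```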
57 A key term in update (11) is $-D/(2N\tilde{\pi}_k)$, which originated in the $\log \sum_n z_{nk}$ term. [sent-125, score-0.213]
58 This results in the shrinking of irrelevant features, and therefore FAB/LFMs are capable of automatically selecting the feature dimensionality K. [sent-131, score-0.101]
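To see the shrinkage numerically, assume for illustration that the update takes the logistic mean-field form $\mu_{nk} = g(c_{nk} + \eta(\pi_k) - D/(2N\tilde{\pi}_k))$, as suggested by the CCCP form (14) below; this specific form and the numbers are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D = 50
c_nk = 0.5       # data-dependent term (illustrative value)
eta = 0.0        # prior-based regularization, set to zero to isolate the FAB effect

# The penalty -D / (2 N pi_tilde_k) grows as a feature becomes rarely used,
# pushing mu_nk toward zero.
for N_pi in [200.0, 50.0, 25.0, 10.0]:               # values of N * pi_tilde_k
    print(N_pi, round(sigmoid(c_nk + eta - D / (2 * N_pi)), 3))
```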
59 This regularization effect is induced independently of the prior (i.e., under asymptotic ignorance of the prior) and is known as “model induced regularization”, which is caused by Bayesian marginalization in singular models [18]. [sent-132, score-0.031] [sent-134, score-0.045]
61 Eq. (11) offers another shrinking effect, by means of $\eta(\pi_k)$, which is a prior-based regularization. [sent-136, score-0.052]
62 We empirically show that the latter shrinking effect is too weak to mitigate over-fitting and that the FAB algorithm achieves faster convergence, with respect to N, to the true model (see Section 4). [sent-137, score-0.063]
63 If we remove the former term (i.e., setting $D/2N\tilde{\pi}_k = 0$), then update equation (11) is equivalent to that of variational EM. [sent-140, score-0.049]
64 We eliminate shrunken features after the FAB E-step, on the basis that LFMs approximate X by $\sum_k \mu_{\cdot k} w_{\cdot k}^\top$; features whose expected activations shrink toward zero contribute nothing to this approximation. [sent-146, score-0.029]
Figure 1: Time evolution of K (top) and L/N (bottom) in FAB with and without shrinkage acceleration (D = 50 and K = 5). [sent-162, score-0.118]
66 This model shrinkage also works to avoid the ill-conditioning of the FIC; if there are latent features that are never activated ($\sum_n \mu_{nk}/N = 0$) or always activated ($\sum_n \mu_{nk}/N = 1$), the FIC will no longer be an approximation of the marginal log-likelihood. [sent-164, score-0.199]
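A sketch of this elimination step (the function name and the threshold eps are mine, not from the paper):

```python
import numpy as np

def eliminate_shrunken_features(mu, W, eps=1e-3):
    """Drop features that are (almost) never or always activated, which would
    otherwise make the FIC ill-conditioned; keep the matching columns of W."""
    rates = mu.mean(axis=0)                           # sum_n mu_nk / N for each k
    keep = (rates > eps) & (rates < 1 - eps)
    return mu[:, keep], W[:, keep], keep

mu = np.array([[0.9, 0.0, 1.0], [0.8, 0.0, 1.0], [0.1, 0.0, 1.0]])
W = np.arange(6, dtype=float).reshape(2, 3)
mu2, W2, keep = eliminate_shrunken_features(mu, W)
print(keep)                                           # only the first feature survives
```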
67 3 Analysis and Refinements CCCP Interpretation and Shrinkage Acceleration Here we interpret the alternating updates of $\mu$ and $\tilde{\pi}$ as a concave-convex procedure (CCCP) [29] and consider eliminating irrelevant features in early steps to reduce computational cost. [sent-167, score-0.112]
68 By substituting the optimality condition $\tilde{\pi}_k = \sum_n \mu_{nk}/N$ (12) into the lower bound, we obtain $\mathcal{L}(q) = -\frac{D}{2} \sum_k \log \sum_n \mu_{nk} + \sum_n (c_n + \eta)^\top \mu_n + H(q) + \mathrm{const.}$ (13) [sent-168, score-0.092]
69 The first and second terms are convex and concave with respect to $\mu_{nk}$, respectively. [sent-169, score-0.031]
70 By setting the derivative of the “linearized” objective with respect to $\mu_{nk}$ to zero, we obtain the CCCP update as follows: $\mu_{nk}^{t} = g\left(c_{nk} + \eta(\pi_k) - \frac{D}{2 \sum_n \mu_{nk}^{t-1}}\right)$, (14) where g is the logistic sigmoid. [sent-172, score-0.139]
71 By taking $N\tilde{\pi}_k = \sum_n \mu_{nk}^{t-1}$ into account, Eq. (14) corresponds to the FAB E-step update. [sent-173, score-0.156]
72 Such multiple CCCP steps in each FAB EM step are expected to accelerate the shrinkage effect discussed in the previous section, because the regularization term $-D/(2\sum_n \mu_{nk})$ causes that effect. [sent-178, score-0.099]
73 Eventually, it is expected to reduce the total computational cost since we may be able to remove irrelevant latent features in earlier iterations. [sent-179, score-0.171]
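A sketch of the accelerated shrinkage, running several CCCP passes of Eq. (14) inside one FAB iteration; treating $c_{nk}$ and $\eta$ as fixed inputs and taking g to be the logistic sigmoid are assumptions of this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def cccp_shrinkage_steps(mu, c, eta, D, n_steps=10):
    """Repeat the CCCP update (14): mu_nk <- g(c_nk + eta_k - D / (2 sum_n mu_nk)).
    Extra inner steps let barely-used features shrink within a single FAB iteration."""
    for _ in range(n_steps):
        col_mass = np.maximum(mu.sum(axis=0), 1e-12)   # sum_n mu_nk for each feature
        mu = sigmoid(c + eta - D / (2.0 * col_mass))
    return mu

rng = np.random.default_rng(0)
N, K, D = 200, 6, 50
c = rng.normal(scale=0.5, size=(N, K))
c[:, -2:] -= 2.0                                       # two features the data barely support
mu0 = rng.uniform(0.2, 0.8, size=(N, K))
mu = cccp_shrinkage_steps(mu0, c, np.zeros(K), D)
print(np.round(mu.sum(axis=0), 2))                     # mass of the unsupported features collapses
```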
74 Note that, in practice, we update $\pi$ along with $\tilde{\pi}$ for further acceleration of the shrinkage. [sent-181, score-0.085]
75 Further discussion of this update (an exponentiated gradient descent interpretation) can be found in the supplementary materials. [sent-183, score-0.054]
76 Identifiability and Merge Post-processing Parameter identifiability is an important theoretical aspect in learning algorithms for latent variable models. [sent-184, score-0.074]
77 The following theorem shows that FAB inference resolves such non-identifiability in LFMs. [sent-190, score-0.034]
78 For the ill-conditioned situation described above, the FAB algorithm has a unique solution that balances the sizes of latent features. [sent-203, score-0.215]
79 In addition, Theorem 4 gives us an important insight about post-processing of latent features. [sent-207, score-0.074]
80 If $w^*_{\cdot k} = w^*_{\cdot l}$, then $\mathbb{E}_q[\log p(X, Z|M^*)]$ is unchanged with respect to $\mu_{nk}$ and $\mu_{nl}$, while the model complexity is smaller if we have only one latent feature. [sent-208, score-0.088]
81 Therefore, if $w^*_{\cdot k} = w^*_{\cdot l}$, merging these two latent features increases $\mathcal{L}$. [sent-209, score-0.123]
82 In practice, we search for such overlapping features on the basis of the Euclidean distance matrix of the columns $w^*_{\cdot k}$ of $W^*$ for k = 1, ..., K, and merge them if the lower bound increases after the post-processing. [sent-212, score-0.029] [sent-215, score-0.069]
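A sketch of this merge post-processing (illustrative only; `lower_bound` stands in for the FIC lower bound L computed by the main algorithm, and the way activations are folded together is a heuristic of this sketch):

```python
import numpy as np

def merge_duplicate_features(mu, W, lower_bound, tol=1e-6):
    """Greedily merge pairs of features whose bases w_k, w_l (columns of W) nearly
    coincide, accepting a merge only if the lower bound does not decrease."""
    K = W.shape[1]
    best = lower_bound(mu, W)
    for k in range(K):
        for l in range(k + 1, K):
            if np.linalg.norm(W[:, k] - W[:, l]) > tol * max(1.0, np.linalg.norm(W[:, k])):
                continue
            mu_try = np.delete(mu, l, axis=1)
            W_try = np.delete(W, l, axis=1)
            mu_try[:, k] = np.clip(mu[:, k] + mu[:, l], 0.0, 1.0)   # fold feature l into k
            if lower_bound(mu_try, W_try) >= best:
                return merge_duplicate_features(mu_try, W_try, lower_bound, tol)
    return mu, W

# Toy usage with a dummy bound that simply rewards having fewer features.
dummy_bound = lambda mu, W: -W.shape[1]
mu = np.random.default_rng(0).uniform(size=(10, 3))
W = np.array([[1.0, 1.0, 0.0], [2.0, 2.0, 1.0]])       # columns 0 and 1 coincide
mu2, W2 = merge_duplicate_features(mu, W, dummy_bound)
print(W2.shape)                                         # (2, 2): the duplicate was merged away
```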
84 4 Experiments We have evaluated FAB/LFMs in terms of computational speed, model selection accuracy, and prediction performance with respect to missing values. [sent-218, score-0.054]
85 We compared FAB inference and the variational EM algorithm (see Section 2). [sent-219, score-0.062]
86 EM selects K using the shrinkage effect of η as we have explained in Section 2. [sent-226, score-0.085]
87 For FAB and EM, we set $\delta = 10^{-4}$ (this was not sensitive) and $T_{\mathrm{shrink}} = 100$ (FAB only); $\{\mu_n\}$ were initialized uniformly at random between 0 and 1; the initial number of latent features was set to min(N, D), as in MEIBP. [sent-229, score-0.103]
88 Figure 1 shows the results of a comparison between FAB with and without shrinkage acceleration. [sent-239, score-0.054]
89 Clearly, our shrinkage acceleration significantly reduced computational cost by eliminating irrelevant features in the early steps, while both algorithms achieved roughly the same objective value L and model selection performance at convergence. (We also investigated the effect of merge post-processing, but none was observed in this small example.) [sent-240, score-0.174] [sent-249, score-0.121]
90 Figure 2: Comparative evaluation on the artificial simulations in terms of N vs. elapsed time (sec) and estimated K. [sent-244, score-0.781]
92 While MEIBP was much faster than FAB in terms of elapsed computational time, FAB achieved the most accurate estimation of K, especially for large N . [sent-251, score-0.065]
92 Figure 3 compares the estimated features of each method in the early learning phase (at the 5th iteration) and after convergence (the result displayed is the example which has the median log-likelihood over 10 trials). [sent-261, score-0.029]
94 While EM and IBP retain irrelevant features, FAB successfully extracts the true patterns without irrelevant features. [sent-263, score-0.108]
95 Table 1 summarizes the results with respect to elapsed computational time (hours), selected K, PLL, and TLL. [sent-266, score-0.081]
95 For the large sample data sets (EEG, Piano, USPS), the PLLs of FAB and EM were competitive with one another. (We omitted VB entirely because of its long computational time.) [sent-277, score-0.03]
97 In addition, our proposed accelerating mechanism for shrinking models drastically reduces total computational time. [sent-463, score-0.059]
98 Experimental comparisons of FAB inference with existing methods, including state-of-the-art IBP methods, have demonstrated the superiority of FAB/LFM. [sent-464, score-0.034]
99 Infinite latent feature models and the Indian buffet process, 2005. [sent-525, score-0.188]
100 On Bayesian PCA: Automatic dimensionality selection and analytic solution. [sent-573, score-0.052]
wordName wordTfidf (topN-words)
[('fab', 0.766), ('lfms', 0.268), ('fic', 0.237), ('meibp', 0.23), ('ibp', 0.169), ('mms', 0.166), ('znk', 0.166), ('nk', 0.118), ('cccp', 0.074), ('latent', 0.074), ('vb', 0.07), ('eq', 0.069), ('acceleration', 0.064), ('ficlfm', 0.064), ('em', 0.064), ('factorized', 0.06), ('pll', 0.056), ('irrelevant', 0.054), ('shrinkage', 0.054), ('buffet', 0.052), ('elapsed', 0.051), ('accelerateshrinkage', 0.051), ('ficmm', 0.051), ('fics', 0.051), ('det', 0.05), ('indian', 0.047), ('log', 0.047), ('asymptotic', 0.045), ('fz', 0.045), ('piano', 0.045), ('yaleb', 0.042), ('dk', 0.041), ('merge', 0.041), ('usps', 0.04), ('pk', 0.039), ('cnk', 0.038), ('fujimaki', 0.038), ('tll', 0.038), ('inference', 0.034), ('zn', 0.034), ('lt', 0.032), ('shrinking', 0.032), ('hessian', 0.03), ('features', 0.029), ('fk', 0.028), ('bayesian', 0.028), ('variational', 0.028), ('ths', 0.028), ('mle', 0.027), ('eeg', 0.027), ('plls', 0.026), ('tshrink', 0.026), ('grif', 0.025), ('selection', 0.024), ('ml', 0.024), ('marginal', 0.024), ('dz', 0.023), ('nl', 0.023), ('lfm', 0.023), ('laplace', 0.022), ('update', 0.021), ('monotonic', 0.021), ('reed', 0.021), ('sec', 0.021), ('op', 0.02), ('offers', 0.02), ('merging', 0.02), ('ability', 0.019), ('automatic', 0.018), ('substituting', 0.017), ('exponentiated', 0.017), ('asymptotically', 0.017), ('approximation', 0.017), ('omitted', 0.016), ('explained', 0.016), ('block', 0.016), ('respect', 0.016), ('supplementary', 0.016), ('regularization', 0.016), ('hours', 0.016), ('concave', 0.015), ('effect', 0.015), ('feature', 0.015), ('activated', 0.015), ('simulations', 0.015), ('dim', 0.014), ('submodular', 0.014), ('smaller', 0.014), ('bound', 0.014), ('nonparametric', 0.014), ('lower', 0.014), ('identi', 0.014), ('accelerate', 0.014), ('algebraic', 0.014), ('condition', 0.014), ('computational', 0.014), ('lemma', 0.013), ('drastically', 0.013), ('link', 0.013), ('deviation', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
Author: Kohei Hayashi, Ryohei Fujimaki
Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1
2 0.14092654 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
3 0.078405343 143 nips-2013-Integrated Non-Factorized Variational Inference
Author: Shaobo Han, Xuejun Liao, Lawrence Carin
Abstract: We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the 1 norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-field variational Bayes (VB) method. 1
4 0.072850145 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
5 0.058489118 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.050953977 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
7 0.048521377 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
8 0.038973592 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
9 0.038378164 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
10 0.037742779 60 nips-2013-Buy-in-Bulk Active Learning
11 0.035741556 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
12 0.035492629 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
14 0.034508511 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
15 0.034419347 317 nips-2013-Streaming Variational Bayes
16 0.031593021 149 nips-2013-Latent Structured Active Learning
17 0.031492636 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
18 0.03142707 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
19 0.03059038 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
20 0.030236386 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
topicId topicWeight
[(0, 0.097), (1, 0.032), (2, -0.004), (3, 0.006), (4, 0.009), (5, 0.058), (6, 0.041), (7, 0.005), (8, 0.064), (9, -0.031), (10, -0.016), (11, 0.051), (12, -0.024), (13, 0.017), (14, -0.03), (15, -0.009), (16, -0.013), (17, -0.0), (18, 0.014), (19, -0.082), (20, -0.019), (21, -0.003), (22, -0.044), (23, 0.023), (24, -0.001), (25, -0.026), (26, 0.006), (27, -0.009), (28, -0.007), (29, -0.022), (30, -0.023), (31, -0.037), (32, -0.016), (33, 0.038), (34, -0.047), (35, 0.046), (36, -0.109), (37, -0.031), (38, 0.005), (39, 0.069), (40, -0.02), (41, 0.021), (42, -0.021), (43, -0.014), (44, 0.014), (45, -0.041), (46, 0.047), (47, -0.067), (48, -0.022), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.84116644 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
Author: Kohei Hayashi, Ryohei Fujimaki
Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1
Author: Ryan D. Turner, Steven Bottone, Clay J. Stanek
Abstract: The Bayesian online change point detection (BOCPD) algorithm provides an efficient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model’s posterior predictives, which can only be computed online in O(1) time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efficient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution.
These are then used to construct cost matrices that reduce the assignment problem to a particular kind of optimization problem (the details of which are beyond the scope of this paper). The motivation behind feature-aided tracking is that additional features increase the probability that the MAP measurement-to-track assignment is correct. Based on physical arguments the RCS feature (SNR) is often Rice distributed [21, Ch. 3]; although, in certain situations RCS is exponential or gamma distributed [26]. The parameters of the RCS distribution are determined by factors such as the shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS characteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS features may help distinguish one aircraft from another if they cross paths or come near one another, for example. RCS also helps distinguish genuine aircraft returns from clutter: a flock of birds or random electrical noise, for example. However, the parameters of the RCS distributions may also change for the same aircraft due to a change in angle or ground conditions. These must be taken into account for accurate association. Providing good predictions in light of a possible sudden change in the parameters of a time series is “right up the alley” of BOCPD and change point methods. The original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family models for time series. In this paper, we expand the set of applications of BOCPD to radar SNR data which often has the same change point structure found in other applications, and requires online predictions. The BOCPD model is highly modular in that it looks for changes in the parameters of any underlying process model (UPM). The UPM merely needs to provide posterior predictive probabilities, the UPM can otherwise be a “black box.” The BOCPD queries the UPM for a prediction of the next data point under each possible run length, the number of points since the last change point. If (and only if by Hipp [12]) the UPM is exponential family (with a conjugate prior) the posterior is computed by accumulating the sufficient statistics since the last potential change point. This allows for O(1) UPM updates in both computation and memory as the run length increases. We motivate the use of VB for implementing UPMs when the data within a regime is believed to follow a distribution that is not exponential family. The methods presented in this paper can be used to find variational run length posteriors for general non-exponential family UPMs in addition to the Rice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are applicable in areas outside of change point detection. Likelihood clutter (birds) track 1 (747) track 2 (EMB 110) 0 5 10 15 20 SNR Figure 1: Illustrative example of a tracking scenario: The black lines (−) show the true tracks while the red stars (∗) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions on the states are shown as blue ellipses. The current (+) and previous (×) measurements are connected to their associated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (·). The distributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right. 
To our knowledge this paper is the first to demonstrate how to compute Bayesian posterior distributions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15], which computes a MAP estimate. Other novel factors of this paper include: demonstrating the usefulness (and advantages over existing techniques) of change point detection for RCS estimation and tracking; and applying variational inference for UPMs where analytic posterior predictives are not possible. This paper provides four main technical contributions: 1) VB inference for inferring the parameters of a Rice distribution. 2) General improvements to online VB (which is then applied to updating the UPM in BOCPD). 3) Derive a VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handle censored measurements (particularly for a Rice distribution) in VB. This is key for processing missed detections in data association. 2 1 Background In this section we briefly review the three areas of background: BOCPD, VB, and tracking. 1.1 Bayesian Online Change Point Detection We briefly summarize the model setup and notation for the BOCPD algorithm; see [27, Ch. 5] for a detailed description. We assume we have a time series with n observations so far y1 , . . . , yn ∈ Y. In effect, BOCPD performs message passing to do online inference on the run length rn ∈ 0:n − 1, the number of observations since the last change point. Given an underlying predictive model (UPM) and a hazard function h, we can compute an exact posterior over the run length rn . Conditional on a run length, the UPM produces a sequential prediction on the next data point using all the data since the last change point: p(yn |y(r) , Θm ) where (r) := (n − r):(n − 1). The UPM is a simpler model where the parameters θ change at every change point and are modeled as being sampled from a prior with hyper-parameters Θm . The canonical example of a UPM would be a Gaussian whose mean and variance change at every change point. The online updates are summarized as: P (rn |rn−1 ) p(yn |rn−1 , y(r) ) p(rn−1 , y1:n−1 ) . msgn := p(rn , y1:n ) = rn−1 hazard UPM (1) msgn−1 Unless rn = 0, the sum in (1) only contains one term since the only possibility is that rn−1 = rn −1. The indexing convention is such that if rn = 0 then yn+1 is the first observation sampled from the new parameters θ. The marginal posterior predictive on the next data point is easily calculated as: p(yn+1 |y1:n ) = p(yn+1 |y(r) )P (rn |y1:n ) . (2) rn Thus, the predictions from BOCPD fully integrate out any uncertainty in θ. The message updates (1) perform exact inference under a model where the number of change points is not known a priori. BOCPD RCS Model We show the Rice UPM as an example as it is required for our application. The data within a regime are assumed to be iid Rice observations, with a normal-gamma prior: yn ∼ Rice(ν, σ) , ν ∼ N (µ0 , σ 2 /λ0 ) , σ −2 =: τ ∼ Gamma(α0 , β0 ) (3) 2 =⇒ p(yn |ν, σ) = yn τ exp(−τ (yn + ν 2 )/2)I0 (yn ντ )I{yn ≥ 0} (4) where I0 (·) is a modified Bessel function of order zero, which is what excludes the Rice distribution from the exponential family. Although the normal-gamma is not conjugate to a Rice it will enable us to use the VB-EM algorithm. The UPM parameters are the Rice shape1 ν ∈ R and scale σ ∈ R+ , θ := {ν, σ}, and the hyper-parameters are the normal-gamma parameters Θm := {µ0 , λ0 , α0 , β0 }. Every change point results in a new value for ν and σ being sampled. 
A posterior on θ is maintained for each run length, i.e. every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efficient. Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes rn (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be defined for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efficient O(1) updating in exponential family models by using a conjugate prior and accumulating sufficient statistics is not possible. This motivates the use of VB methods for approximating the UPM predictions. 1.2 Variational Bayes We follow the framework of VB where when computation of the exact posterior distribution p(θ|y1:n ) is intractable it is often possible to create a variational approximation q(θ) that is locally optimal in terms of the Kullback-Leibler (KL) divergence KL(q p) while constraining q to be in a certain family of distributions Q. In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n ), using either gradient based methods or standard fixed point equations. 1 The shape ν is usually assumed to be positive (∈ R+ ); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior. 3 The VB-EM Algorithm In many cases, such as the Rice UPM, the derivation of the VB fixed point equations can be simplified by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n . A model is CE if: 1) The complete data likelihood p(x1:n , y1:n |θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n , y1:n |θ). We only have to constrain the posterior q(θ, x1:n ) = q(θ)q(x1:n ) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n |θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1). Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η := Eq(θ) [η] as correct: ¯ q(xi ) ← p(xi |yi , η = η ). 2) Find the posterior of the parameters using the expected sufficient statis¯ ¯ tics S := Eq(x1:n ) [S(x1:n , y1:n )] as if they were the sufficient statistics for the complete data set: ¯ q(θ) ← p(θ|S(x1:n , y1:n ) = S). The posterior will be of the same exponential family as the prior. 1.3 Tracking In this section we review data association, which along with filtering constitutes tracking. In data association we estimate the association vectors a which map measurements to tracks. At each time NZ (n) step, n ∈ N1 , we observe NZ (n) ∈ N0 measurements, Zn = {zi,n }i=1 , which includes returns from both real targets and clutter (spurious measurements). 
Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R3 , or R4 with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an (i) = j if measurement i is associated with track j > 0; an (i) = 0 if measurement i is clutter. The inverse mapping a−1 maps tracks to measurements: meaning a−1 (an (i)) = i if an (i) = 0; and n n a−1 (i) = 0 ⇔ an (j) = i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, n Nc = 2, and a−1 = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa. In N D data association we jointly find the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT (n) ∈ N0 total tracks as a known parameter: NT (n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc (n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc (n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n , an−1 , an |Z1:n−2 ) is: NT |Zi | n pi (za−1 (i),n , za−1 n n−1 i=1 (i),n−1 ) × λNc (i) exp(−λ)/|Zi |! i=n−1 p0 (zj,i )I{ai (j)=0} , (5) j=1 where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (filter); informally, pi (z) = pi (positions) × pi (RCS). If there is a missed detection, i.e. a−1 (i) = 0, we then use pi (za−1 (i),n ) = P (R < R0 ) under the RCS model for track i with no conn n tribution from positional (kinematic) component. Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form: ˆ (ˆn−1 , an ) = argmax log P (an−1 , an |Z1:n ) = argmax log p(Zn−1:n , an−1 , an |Z1:n−2 ) , (6) a (an−1 ,an ) (an−1 ,an ) which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efficiently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20]. If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0 . This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3. 4 2 Online Variational UPMs We cover the four technical challenges for implementing non-exponential family UPMs in an efficient and online manner. We drop the index of the data point i when it is clear from context. 2.1 Variational Posterior for a Rice Distribution The Rice distribution has the property that x ∼ N (ν, σ 2 ) , y ∼ N (0, σ 2 ) =⇒ R = x2 + y 2 ∼ Rice(ν, σ) . 
(7) For simplicity we perform inference using R2 , as opposed to R, and transform accordingly: x ∼ N (ν, σ 2 ) , 1 R2 − x2 ∼ Gamma( 2 , τ ) , 2 τ := 1/σ 2 ∈ R+ =⇒ p(R2 , x) = p(R2 |x)p(x) = Gamma(R2 − x2 | 1 , τ )N (x|ν, σ 2 ) . 2 2 (8) The complete likelihood (8) is the product of two exponential family models and is exponential family itself, parameterized with base measure h and partition factor g: η = [ντ, −τ /2] , S = [x, R2 ] , h(R2 , x) = (2π R2 − x2 )−1 , g(ν, τ ) = τ exp(−ν 2 τ /2) . By inspection we see that the natural parameters η and sufficient statistics S are the same as a Gaussian with unknown mean and variance. Therefore, we apply the normal-gamma prior on (ν, τ ) as it is the conjugate prior for the complete data likelihood. This allows us to apply the VB-EM 2 algorithm. We use yi := Ri as the VB observation, not Ri as in (3). In (5), z·,· (end) is the RCS R. VB M-Step We derive the posterior updates to the parameters given expected sufficient statistics: n λ0 µ0 + i E[xi ] , λn = λ0 + n , αn = α0 + n , λ0 + n i=1 n n 1 1 nλ0 1 βn = β0 + (E[xi ] − x)2 + ¯ (¯ − µ0 )2 + x R2 − E[xi ]2 . 2 i=1 2 λ0 + n 2 i=1 i x := ¯ E[xi ]/n , µn = (9) (10) This is the same as an observation from a Gaussian and a gamma that share a (inverse) scale τ . 2 2 ¯ VB E-Step We then must find both expected sufficient statistics S. The expectation E[Ri |Ri ] = 2 2 Ri trivially; leaving E[xi |Ri ]. Recall that the joint on (x, y ) is a bivariate normal; if we constrain the radius to R, the angle ω will be distributed by a von Mises (VM) distribution. Therefore, ω := arccos(x/R) ∼ VM(0, κ) , κ = R E[ντ ] =⇒ E[x] = R E[cos ω] = RI1 (κ)/I0 (κ) , (11) where computing κ constitutes the VB E-step and we have used the trigonometric moment on ω [18]. This completes the computations required to do the VB updates on the Rice posterior. Variational Lower Bound For completeness, and to assess convergence, we derive the VB lower bound L(q). Using the standard formula [4] for L(q) = Eq [log p(y1:n , x1:n , θ)] + H[q] we get: n 2 1 E[log τ /2] − 1 E[τ ]Ri + (E[ντ ] − κi /Ri )E[xi ] − 2 E[ν 2 τ ] + log I0 (κi ) − KL(q p) , 2 (12) i=1 where p in the KL is the prior on (ν, τ ) which is easy to compute as q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates. 2.2 Online Variational Inference In Section 2.1 we derived an efficient way to compute the variational posterior for a Rice distribution for a fixed data set. However, as is apparent from (1) we need online predictions from the UPM; we must be able to update the posterior one data point at a time. When the UPM is exponential family and we can compute the posterior exactly, we merely use the posterior from the previous step as the prior. However, since we are only computing a variational approximation to the posterior, using the previous posterior as the prior does not give the exact same answer as re-computing the posterior from batch. This gives two obvious options: 1) recompute the posterior from batch every update at O(n) cost or 2) use the previous posterior as the prior at O(1) cost and reduced accuracy. 5 The difference between the options is encapsulated by looking at the expected sufficient statistics: n ¯ S = i=1 Eq(xi |y1:n ) [S(xi , yi )]. Naive online updating uses old expected sufficient statistics whose n ¯ posterior effectively uses S = i=1 Eq(xi |y1:i ) [S(xi , yi )]. We get the best of both worlds if we adjust those estimates over time. 
We in fact can do this if we project the expected sufficient statistics into a “feature space” in terms of the expected natural parameters. For some function f , q(xi ) = p(xi |yi , η = η ) =⇒ Eq(xi |y1:n ) [S(xi , yi )] = f (yi , η ) . ¯ ¯ If f is piecewise continuous then we can represent it with an inner product [8, Sec. 2.1.6] n n ¯ f (yi , η ) = φ(¯) ψ(yi ) =⇒ S = ¯ η φ(¯) ψ(yi ) = φ(¯) η η ψ(yi ) , i=1 i=1 (13) (14) where an infinite dimensional φ and ψ may be required for exact representation, but can be approximated by a finite inner product. In the Rice distribution case we use (11) f (yi , η ) = E[xi ] = Ri I (Ri E[ντ ]) = Ri I ((Ri /µ0 ) µ0 E[ντ ]) , ¯ I (·) := I1 (·)/I0 (·) , (15) 2 Ri where recall that yi = and η1 = E[ντ ]. We can easily represent f with an inner product if we can ¯ represent I as an inner product: I (uv) = φ(u) ψ(v). We use unitless φi (u) = I (ci u) with c1:G as a log-linear grid from 10−2 to 103 and G = 50. We use a lookup table for ψ(v) that was trained to match I using non-negative least squares, which left us with a sparse lookup table. Online updating for VB posteriors was also developed in [24; 13]. These methods involved introducing forgetting factors to forget the contributions from old data points that might be detrimental to accuracy. Since the VB predictions are “embedded” in a change point method, they are automatically phased out if the posterior predictions become inaccurate making the forgetting factors unnecessary. 2.3 Censored Data As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely have to compute the expected sufficient statistics given the censored measurement: E[S|R < R0 ]. The expected sufficient statistic from (11) is now: R0 E[x|R < R0 ] = 0 ν ν E[x|R]p(R)dR RiceCDF (R0 |ν, τ ) = ν(1 − Q2 ( σ , R0 ))/(1 − Q1 ( σ , R0 )) , σ σ where QM is the Marcum Q function [17] of order M . Similar updates for E[S|R < R0 ] are possible for exponential or gamma UPMs, but are not shown as they are relatively easy to derive. 2.4 Variational Run Length Posteriors: Predictive Log Likelihoods Both updating the BOCPD run length posterior (1) and finding the marginal predictive log likelihood of the next point (2) require calculating the UPM’s posterior predictive log likelihood log p(yn+1 |rn , y(r) ). The marginal posterior predictive from (2) is used in data association (6) and benchmarking BOCPD against other methods. However, the exact posterior predictive distribution obtained by integrating the Rice likelihood against the VB posterior is difficult to compute. We can break the BOCPD update (1) into a time and measurement update. The measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(rn |y1:n ): p(rn |y1:n+1 ) ∝ p(yn+1 |rn , y(r) )p(rn |y1:n ) . (16) Using the BMC results in Bishop [4, Sec. 10.1.4] we find a variational posterior on the run length by using the variational lower bound for each run length Li (q) ≤ log p(yn+1 |rn = i, y(r) ), calculated using (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q is of the form: q(rn , θ, x) = qUPM (θ, x|rn )q(rn ) =⇒ q(rn = i) = exp(Li (q))p(rn = i|y1:n )/ exp(L(q)) , (17) where qUPM contains whatever constraints we used to compute Li (q). The normalizer on q(rn ) serves as a joint VB lower bound: L(q) = log i exp(Li (q))p(rn = i|y1:n ) ≤ log p(yn+1 |y1:n ). 
Note that the conditional factorization is different than the typical independence constraint on q. Furthermore, we derive the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment and constrain the assignment posterior to be a point mass. In the 2D assignment case, for example, ˆ q(an , X1:NT ) = q(X1:NT |an )q(an ) = q(X1:NT |an )I{an = an } , (18) 6 2 10 0 10 −1 10 −2 10 10 20 30 40 50 RCS RMSE (dBsm) RCS RMSE (dBsm) 10 KL (nats) 5 10 1 8 6 4 2 3 2 1 0 0 0 100 200 Sample Size (a) Online Updating 4 300 Time (b) Exponential RCS 400 0 100 200 300 400 Time (c) Rice RCS Figure 2: Left: KL from naive updating ( ), Sato’s method [24] ( ), and improved online VB (◦) to the batch VB posterior vs. sample size n; using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue) and τ = 1. Middle: The RMSE (dB scale) of the estimate on the mean RCS distribution E[Rn ] is plotted for an exponential RCS model. The curves are BOCPD (blue), IMM (black), identity (magenta), α-filter (green), and median filter (red). Right: Same as the middle but for the Rice RCS case. The dashed lines are 95% confidence intervals. where each track’s Xi represents all the latent variables used to compute the variational lower bound on log p(zj,n |an (j) = i). In the BOCPD case, Xi := {rn , x, θ}. The resulting VB fixed point ˆ equations find the posterior on the latent variables Xi by taking an as the true assignment and solving ˆ the VB problem of (17); the assignment an is found by using (6) and taking the joint BOCPD lower bound L(q) as a proxy for the BOCPD predictive log likelihood component of log pi in (5). 3 3.1 Results Improved Online Solution We first demonstrate the accuracy of the online VB approximation (Section 2.2) on a Rice estimation example; here, we only test the VB posterior as no change point detection is applied. Figure 2(a) compares naive online updating, Sato’s method [24], and our improved online updating in KL(online batch) of the posteriors for three different true parameters ν as sample size n increases. The performance curves are the KL divergence between these online approximations to the posterior and the batch VB solution (i.e. restarting VB from “scratch” every new data point) vs sample size. The error for our method stays around a modest 10−2 nats while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]. Sato’s method tends to settle in around a 1 nat approximation error. The recommended annealing schedule, i.e. forgetting factors, in [24] performed worse than naive updating. We did a grid search over annealing exponents and show the results for the best performing schedule of n−0.52 . By contrast, our method does not require the tuning of an annealing schedule. 3.2 RCS Estimation Benchmarking We now compare BOCPD with other methods for RCS estimation. We use the same experimental example as Slocumb and Klusman III [25], which uses an augmented interacting multiple model (IMM) based method for estimating the RCS; we also compare against the same α-filter and median filter used in [25]. As a reference point, we also consider the “identity filter” which is merely an unbiased filter that uses only yn to estimate the mean RCS E[Rn ] at time step n. We extend this example to look at Rice RCS in addition to the exponential RCS case. 
The bias correction constants in the IMM were adjusted for the Rice distribution case as per [25, Sec. 3.4]. The results on exponential distributions used in [25] and the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR of multiples of ±10 dB, which is exactly what is presented in the example (a sequence of 20, 10, 30, and 10 dB). In [25] the authors mention that the IMM reaches an RMSE “floor” at 2 dB, yet BOCPD continues to drop as low as 0.56 dB. The RMSE from BOCPD does not spike nearly as high as the other methods upon a change in E[Rn ]. The α-filter and median filter appear worse than both the IMM and BOCPD. The RMSE and confidence intervals are calculated from 5000 runs of the experiment. 7 45 80 40 30 Northing (km) Improvement (%) 35 25 20 15 10 5 60 40 20 0 0 −5 1 2 3 4 −20 5 Difficulty 0 20 40 60 80 100 Easting (km) (a) SIAP Metrics (b) Heathrow (LHR) Figure 3: Left: Average relative improvements (%) for SIAP metrics: position accuracy (red ), velocity accuracy (green ), and spurious tracks (blue ◦) across difficulty levels. Right: LHR: true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). The standard tracker has spurious tracks over east London and near Ipswich. Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies) 3.3 Flightradar24 Tracking Problem Finally, we used real flight trajectories from flightradar24 and plugged them into our 3D tracking algorithm. We compare tracking performance between using our BOCPD model and the relatively standard constant probability of detection (no RCS) [2, Sec. 3.5] setup. We use the single integrated air picture (SIAP) metrics [6] to demonstrate the improved performance of the tracking. The SIAP metrics are a standard set of metrics used to compare tracking systems. We broke the data into 30 regions during a one hour period (in Sept. 2012) sampled every 5 s, each within a 200 km by 200 km area centered around the world’s 30 busiest airports [22]. Commercial airport traffic is typically very orderly and does not allow aircraft to fly close to one another or cross paths. Feature-aided tracking is most necessary in scenarios with a more chaotic air situation. Therefore, we took random subsets of 10 flight paths and randomly shifted their start time to allow for scenarios of greater interest. The resulting SIAP metric improvements are shown in Figure 3(a) where we look at performance by a difficulty metric: the number of times in a scenario any two aircraft come within ∼400 m of each other. The biggest improvements are seen for difficulties above three where positional accuracy increases by 30%. Significant improvements are also seen for velocity accuracy (11%) and the frequency of spurious tracks (6%). Significant performance gains are seen at all difficulty levels considered. The larger improvements at level three over level five are possibly due to some level five scenarios that are not resolvable simply through more sophisticated models. We demonstrate how our RCS methods prevent the creation of spurious tracks around London Heathrow in Figure 3(b). 
4 Conclusions
We have demonstrated that sophisticated recent developments in machine learning, such as BOCPD, combined with the modern inference method of VB, can produce demonstrable improvements in the much more mature field of radar tracking. We first closed a "hole" in the literature in Section 2.1 by deriving variational inference on the parameters of a Rice distribution, with its inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that it is possible to use these variational UPMs for non-exponential-family models in BOCPD without sacrificing its modular or online nature. The improvements in online VB extend to UPMs other than the Rice distribution and, more generally, beyond change point detection. We can use the variational lower bound from the UPM to obtain a principled variational approximation to the run-length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, which is in large contrast to the tracking literature. More algorithms from the tracking literature can possibly be cast in machine learning frameworks, such as VB, and improved upon from there.
References
[1] Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK.
[2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing.
[3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464.
[4] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[5] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301–314.
[6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC.
[7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92(438):739–747.
[8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience.
[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
[10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In Information Fusion, 2006 9th International Conference on, pages 1–8. IEEE.
[11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605.
[12] Hipp, C. (1974). Sufficient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292.
[13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808.
[14] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.
[15] Lauwers, L., Barbé, K., Van Moer, W., and Pintelon, R. (2009). Estimating the parameters of a Rice distribution: A Bayesian approach.
    In Instrumentation and Measurement Technology Conference, 2009 (I2MTC '09), pages 114–117. IEEE.
[16] Mahler, R. (2003). Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. AES, 39(4):1152–1178.
[17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA.
[18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York.
[19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.
[20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics.
[21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Pub.
[22] Rogers, S. (2012). The world's top 100 airports: listed, ranked and mapped. The Guardian.
[23] Saatçi, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress.
[24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
[25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics.
[26] Swerling, P. (1954). Probability of detection for fluctuating targets. Technical Report RM-1217, Rand Corporation.
[27] Turner, R. (2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK.
3 0.65104598 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
4 0.60060871 143 nips-2013-Integrated Non-Factorized Variational Inference
Author: Shaobo Han, Xuejun Liao, Lawrence Carin
Abstract: We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the ℓ1 norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of the mean-field variational Bayes (VB) method. 1
5 0.58425277 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi
Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VB-LRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1
6 0.5715543 317 nips-2013-Streaming Variational Bayes
7 0.55132788 294 nips-2013-Similarity Component Analysis
8 0.54970187 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
9 0.52672178 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
10 0.52594018 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
11 0.5042662 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
12 0.48123297 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
13 0.47973812 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
14 0.47483835 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
15 0.45773283 168 nips-2013-Learning to Pass Expectation Propagation Messages
16 0.44784895 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
17 0.43708885 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
18 0.43433708 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
19 0.4257752 158 nips-2013-Learning Multiple Models via Regularized Weighting
20 0.42505243 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
topicId topicWeight
[(3, 0.03), (16, 0.04), (33, 0.116), (34, 0.158), (35, 0.01), (41, 0.035), (49, 0.026), (56, 0.084), (70, 0.017), (76, 0.256), (85, 0.027), (89, 0.032), (93, 0.043), (95, 0.011)]
simIndex simValue paperId paperTitle
1 0.79739028 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
Author: Mahito Sugiyama, Karsten Borgwardt
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1
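For context, a minimal sketch of a sampling-based, distance-based outlier score is given below: each point is scored by its distance to the nearest member of a small random subsample. The sample size and the exact scoring rule are assumptions made for illustration and may differ from the scheme evaluated in the paper.

```python
import numpy as np

def sample_distance_outlier_scores(X, n_samples=20, rng=None):
    """Score each row of X by its distance to the nearest point in a small
    random subsample (higher score = more outlying). A sketch only; the
    published method may differ in its sampling and scoring details."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    S = X[idx]
    # distances from every point to every sampled point
    d = np.sqrt(((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1))
    # ignore a sampled point's zero distance to itself
    d[np.arange(len(X))[:, None] == idx[None, :]] = np.inf
    return d.min(axis=1)
```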
same-paper 2 0.78545815 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
Author: Kohei Hayashi, Ryohei Fujimaki
Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1
3 0.75254667 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
Author: Yifei Ma, Roman Garnett, Jeff Schneider
Abstract: A common classifier for unlabeled nodes on undirected graphs uses label propagation from the labeled nodes, equivalent to the harmonic predictor on Gaussian random fields (GRFs). For active learning on GRFs, the commonly used V-optimality criterion queries nodes that reduce the L2 (regression) loss. V-optimality satisfies a submodularity property showing that greedy reduction produces a (1 − 1/e) globally optimal solution. However, L2 loss may not characterise the true nature of 0/1 loss in classification problems and thus may not be the best choice for active learning. We consider a new criterion we call Σ-optimality, which queries the node that minimizes the sum of the elements in the predictive covariance. Σ-optimality directly optimizes the risk of the surveying problem, which is to determine the proportion of nodes belonging to one class. In this paper we extend submodularity guarantees from V-optimality to Σ-optimality using properties specific to GRFs. We further show that GRFs satisfy the suppressor-free condition in addition to the conditional independence inherited from Markov random fields. We test Σ-optimality on real-world graphs with both synthetic and real data and show that it outperforms V-optimality and other related methods on classification. 1
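As background for the harmonic predictor on GRFs mentioned in this abstract, the sketch below computes the standard label-propagation (harmonic) solution f_U = −L_UU^{-1} L_UL y_L on an undirected graph. The dense-matrix formulation and variable names are illustrative choices, not code from the paper.

```python
import numpy as np

def harmonic_predictor(W, labeled_idx, y_labeled):
    """Harmonic (label propagation) predictions on a Gaussian random field.

    W: (n, n) symmetric nonnegative weight matrix of an undirected graph.
    labeled_idx: indices of labeled nodes; y_labeled: their labels (e.g. 0/1).
    Returns the unlabeled node indices and their predicted values.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    # harmonic solution: f_U = -L_UU^{-1} L_UL y_L
    f_u = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_labeled, dtype=float))
    return unlabeled_idx, f_u
```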
4 0.74612367 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning
Author: Kamalika Chaudhuri, Staal A. Vinterbo
Abstract: Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and data-mining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, or, determining the parameter value, such as a bin size in a histogram, or a regularization parameter, that is suitable for a particular application. In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that applies when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure are independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine-learning – training a regularized linear classifier and building a histogram density estimator – that result in end-to-end differentially private solutions for these problems. 1
5 0.73146003 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
Author: Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, Andrea L. Thomaz
Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.
6 0.65390944 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
7 0.64944774 149 nips-2013-Latent Structured Active Learning
8 0.64441806 143 nips-2013-Integrated Non-Factorized Variational Inference
9 0.64326429 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
10 0.6399284 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
11 0.63989174 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
12 0.63965732 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
13 0.63807702 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
14 0.63742316 173 nips-2013-Least Informative Dimensions
15 0.63583434 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
16 0.63532257 86 nips-2013-Demixing odors - fast inference in olfaction
17 0.63529599 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
18 0.63470799 201 nips-2013-Multi-Task Bayesian Optimization
19 0.6346311 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
20 0.63449669 347 nips-2013-Variational Planning for Graph-based MDPs