nips nips2006 nips2006-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. [sent-4, score-0.526]
2 While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. [sent-7, score-0.143]
3 We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. [sent-8, score-0.258]
4 1 Introduction Magnetoencephalography (MEG) and electroencephalography (EEG) use an array of sensors to take EM field measurements from on or near the scalp surface with excellent temporal resolution. [sent-9, score-0.111]
5 In both cases, the observed field is generated by the same synchronous, compact current sources located within the brain. [sent-10, score-0.13]
6 Because the mapping from source activity configuration to sensor measurement is many to one, accurately determining the spatial locations of these unknown sources is extremely difficult. [sent-11, score-0.55]
7 The relevant localization problem can be posed as follows: The measured EM signal is B ∈ R^(db×n), where db equals the number of sensors and n is the number of time points at which measurements are made. [sent-12, score-0.413]
8 The unknown sources S ∈ R^(ds×n) are the (discretized) current values at ds candidate locations distributed throughout the cortical surface. [sent-13, score-0.454]
9 These candidate locations are obtained by segmenting a structural MR scan of a human subject and tessellating the gray matter surface with a set of vertices. [sent-14, score-0.11]
10 To obtain reasonable spatial resolution, the number of candidate source locations will necessarily be much larger than the number of sensors (ds ≫ db). [sent-18, score-0.509]
11 The salient inverse problem then becomes the ill-posed estimation of these activity or source regions, which are reflected by the nonzero rows of the source estimate matrix Ŝ. [sent-19, score-0.626]
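To make the setup concrete, the following minimal sketch simulates the linear forward model B = L S + noise implied by the text (the lead-field L appears explicitly in the covariance Σb = Σ + LΣsL^T below). All dimensions and values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the forward model B = L S + noise with a row-sparse source matrix S.
rng = np.random.default_rng(0)
db, ds, n = 32, 500, 50                    # sensors, candidate locations, time points
L = rng.standard_normal((db, ds))          # hypothetical lead-field matrix

S = np.zeros((ds, n))                      # row-sparse source matrix
active = rng.choice(ds, size=3, replace=False)
S[active] = rng.standard_normal((3, n))    # only a few locations carry current

B = L @ S + 0.01 * rng.standard_normal((db, n))   # measured sensor data
print("active rows (unknown to the inverse method):", sorted(active.tolist()))
```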
12 Because the inverse model is underdetermined, all efforts at source reconstruction are heavily dependent on prior assumptions, which in a Bayesian framework are embedded in the distribution p(S). [sent-20, score-0.33]
13 Examples include variational Bayesian methods [14, 15], hierarchical covariance component models [4, 8, 11], and automatic relevance determination (ARD) [7, 9, 12, 13, 17]. [sent-24, score-0.237]
14 We also analyze several theoretical properties of this framework related to computational/convergence issues, local minima, and localization bias. [sent-26, score-0.188]
15 2 A Generalized Bayesian Framework for Source Localization In this section, we present a general-purpose Bayesian framework for source localization. [sent-29, score-0.25]
16 While derived using different assumptions and methodology, they can be related via the notion of automatic relevance determination [9] and evidence maximization [7]. [sent-31, score-0.217]
17 While the unknown noise covariance can also be parameterized and estimated from the data, for simplicity we assume that Σ is known and fixed. [sent-33, score-0.153]
18 Next we adopt the following source prior for S: p(S; Σs) = N(0, Σs), Σs = Σ_{i=1}^{dγ} γi Ci, (2) where the distribution is understood to apply independently to each column of S. [sent-34, score-0.301]
19 Here γ = [γ1, . . . , γdγ]^T is a vector of dγ nonnegative hyperparameters that control the relative contribution of each covariance basis matrix Ci, all of which we assume are fixed and known. [sent-38, score-0.219]
20 The unknown hyperparameters can be estimated from the data by first integrating out the unknown sources S, giving p(B; Σb) = ∫ p(B|S) p(S; Σs) dS = N(0, Σb), (3) where Σb = Σ + LΣsL^T. [sent-39, score-0.36]
21 This expression is then maximized with respect to the unknown hyperparameters, a process referred to as type-II maximum likelihood or evidence maximization [7, 9] or restricted maximum likelihood [4]. [sent-41, score-0.119]
22 Thus the optimization problem shifts from finding the maximum a posteriori sources given a fixed prior to finding the optimal hyperparameters of a parameterized prior. [sent-42, score-0.325]
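As an illustration of type-II maximum likelihood, the sketch below evaluates a negative log marginal likelihood for a candidate γ under the Gaussian model (3). The per-time-point form log|Σb| + (1/n) tr(Σb^{-1} B B^T) is the standard Gaussian evidence and is assumed here; the paper's exact cost function (7) is not reproduced in this excerpt, and all names are illustrative.

```python
import numpy as np

def neg_log_evidence(gamma, L, C_list, B, Sigma_noise):
    """Negative log marginal likelihood (up to constants) under the Gaussian model (3),
    with Sigma_b = Sigma_noise + L (sum_i gamma_i C_i) L^T applied column-wise to B.
    The per-time-point form log|Sigma_b| + (1/n) tr(Sigma_b^{-1} B B^T) is the standard
    Gaussian evidence; it is assumed to correspond to the paper's cost up to scaling."""
    n = B.shape[1]
    Sigma_s = sum(g * C for g, C in zip(gamma, C_list))
    Sigma_b = Sigma_noise + L @ Sigma_s @ L.T
    chol = np.linalg.cholesky(Sigma_b)                 # stable log-determinant
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return logdet + np.trace(B.T @ np.linalg.solve(Sigma_b, B)) / n
```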
23 To the extent that the ‘learned’ prior p(S; Σ̂s) is realistic, this posterior quantifies regions of significant current density, and point estimates for the unknown sources can be obtained by evaluating the posterior mean Ŝ = E[S|B; Σ̂s] = Σ̂s L^T (Σ + L Σ̂s L^T)^{-1} B. (4) [sent-45, score-0.336]
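A direct implementation of the posterior mean (4) might look as follows; function and variable names are illustrative assumptions.

```python
import numpy as np

def posterior_mean(gamma, L, C_list, B, Sigma_noise):
    """Posterior mean source estimate from (4):
    S_hat = Sigma_s L^T (Sigma_noise + L Sigma_s L^T)^{-1} B,
    with Sigma_s = sum_i gamma_i C_i built from the learned hyperparameters."""
    Sigma_s = sum(g * C for g, C in zip(gamma, C_list))
    Sigma_b = Sigma_noise + L @ Sigma_s @ L.T
    return Sigma_s @ L.T @ np.linalg.solve(Sigma_b, B)
```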
24 It is this selection, rather than the adoption of a covariance component model per se, that primarily differentiates the many different empirical Bayesian approaches and points to novel algorithms for future study. [sent-47, score-0.108]
25 The optimization strategy adopted for computing γ̂, as well as the particular choice of hyperprior p(γ), if any, can also be distinguishing factors. [sent-48, score-0.106]
26 More interesting covariance component terms have been used to effect spatial smoothness, depth bias compensation, and candidate locations of likely activity [8, 11]. [sent-50, score-0.308]
27 With regard to the latter, it has been suggested that prior information about a source location can be codified by including a Ci term with all zeros except a patch of 1’s along the diagonal signifying a location of probable source activity, perhaps based on fMRI data [11]. [sent-51, score-0.659]
28 An associated hyperparameter γi is then estimated to determine the appropriate contribution of this component to the overall prior covariance. [sent-52, score-0.121]
29 This specification for the prior involves the counterintuitive addition of an unknown hyperparameter for every candidate source location which, on casual analysis, may seem prone to severe overfitting (in contrast to [11], which uses only one or two fixed location priors). [sent-59, score-0.526]
30 However, the process of marginalization, or the integrating out of the unknown sources S, provides an extremely powerful regularizing effect, driving most of the unknown γi to zero during the evidence maximization stage (more on this in Section 3). [sent-60, score-0.292]
31 This ameliorates the overfitting problem and effectively reduces the space of possible active source locations by choosing a small relevant subset of location priors that optimizes the Bayesian evidence (hence ARD). [sent-61, score-0.391]
32 With this ‘learned’ prior in place, a once ill-posed inverse problem is no longer untenable, with the posterior mean providing a good estimate of source activity. [sent-62, score-0.371]
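To illustrate how evidence maximization prunes irrelevant hyperparameters, here is a minimal EM-style update for the simple diagonal case Ci = ei ei^T (one hyperparameter per candidate location). This is the textbook sparse Bayesian learning update, shown only as a sketch; it is not claimed to be the paper's specific algorithm, and all names are assumptions.

```python
import numpy as np

def em_ard(gamma0, L, B, Sigma_noise, n_iter=100):
    """EM-style ARD for the diagonal case C_i = e_i e_i^T: evidence maximization drives
    irrelevant gamma_i toward zero, pruning the corresponding candidate locations."""
    gamma = np.asarray(gamma0, dtype=float).copy()
    for _ in range(n_iter):
        Sigma_s = np.diag(gamma)
        Sigma_b = Sigma_noise + L @ Sigma_s @ L.T
        K = Sigma_s @ L.T @ np.linalg.inv(Sigma_b)     # maps sensor data to sources
        M = K @ B                                      # posterior mean of S
        Sigma_post = Sigma_s - K @ L @ Sigma_s         # posterior covariance (per column)
        gamma = np.mean(M ** 2, axis=1) + np.diag(Sigma_post)
    return gamma
```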
33 In contrast, to model sources with some spatial extent, we can choose Ci = ψi ψi^T, where each ψi represents, for example, a ds × 1 geodesic neural basis vector that specifies an a priori weight location and activity extent [13]. [sent-64, score-0.436]
34 In this scenario, the number of hyperparameters satisfies dγ = vds , where v is the number of scales we wish to examine in a multi-resolution decomposition, and can be quite large (dγ ≈ 106 ). [sent-65, score-0.144]
35 As mentioned above, the ARD framework tests many priors corresponding to many hypotheses or beliefs regarding the locations and scales of the nonzero current activity within the brain, ultimately choosing the one with the highest evidence. [sent-66, score-0.138]
36 The net result of this formulation is a source prior composed of a mixture of Gaussian kernels of varying scales. [sent-67, score-0.301]
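As a sketch of how patch-like covariance components can encode spatial extent, the following builds a rank-one Ci = ψi ψi^T from an indicator basis vector grown over a generic adjacency graph. The geodesic construction of [13] is replaced here by a simple neighborhood expansion, which is an assumption made purely for illustration.

```python
import numpy as np

def patch_component(ds, center, radius, neighbors):
    """Build one rank-one covariance component C_i = psi_i psi_i^T encoding a spatial patch.
    'neighbors' is a hypothetical adjacency list over the ds cortical locations; the geodesic
    basis of [13] is replaced by a plain graph neighborhood expansion."""
    psi = np.zeros(ds)
    patch, frontier = {center}, {center}
    for _ in range(radius):                            # grow the patch outward
        frontier = {j for v in frontier for j in neighbors[v]} - patch
        patch |= frontier
    psi[sorted(patch)] = 1.0                           # indicator basis vector psi_i
    return np.outer(psi, psi)
```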
37 But the essential ingredient of ARD, that marginalization and subsequent evidence maximization leads to a pruning of unsupported hypotheses, remains unchanged. [sent-70, score-0.115]
38 In [15], a plausible hierarchical prior is adopted that, unfortunately, leads to intractable integrations when computing the desired source posterior. [sent-72, score-0.327]
39 This motivates the inclusion of a variational approximation that models the true posterior as a factored distribution over parameters at two levels of the prior hierarchy. [sent-73, score-0.146]
40 While seemingly quite different, drawing on results from [1], we can show that the resulting cost function is exactly equivalent to standard ARD assuming Σs is parameterized as Σs = Σ_{i=1}^{ds} γi ei ei^T + Σ_{j=1}^{ds} γ(ds+j) ψj ψj^T, (5) and so dγ = 2ds. [sent-74, score-0.447]
41 When fMRI data is available, it is incorporated into a particular inverse Gamma hyperprior on γ, as is also commonly done with ARD methods [1]. [sent-75, score-0.109]
42 In contrast, the variational model from [14] introduces an additional hierarchy to the ARD framework to explicitly model temporal correlations between sources which may be spatially separated. [sent-78, score-0.258]
43 Here it is assumed that S can be decomposed with respect to dz pre-sources, as defined in (6) below. (Footnote 1: Here we assume dipoles with orientations constrained to be orthogonal to the cortical surface; however, the method is easily extended to handle unconstrained dipoles.) [sent-79, score-0.121]
44 (Footnote 2: Although standard ARD does not explicitly model correlated sources that are spatially separated, it still works well in this situation (see Section 3) and can reflect such correlations via the inferred posterior mean.) [sent-80, score-0.289]
45 The pre-source decomposition is S = W Z, p(W; Σw) = N(0, Σw), p(Z) = N(0, I), (6) where Z ∈ R^(dz×n) represents the pre-source matrix and Σw is analogous to Σs. [sent-81, score-0.169]
46 As stated in [14], direct application of ARD would involve integration over W and Z to find the hyperparameters γ that maximize p(B; Σb ). [sent-82, score-0.144]
47 In fact, if we assume the appropriate hyperprior on γ, then this correlated source method is essentially the same as the procedure from [15] but with an additional level in the approximate posterior factorization for handling the decomposition (6). [sent-89, score-0.447]
48 However, the posterior mean of W, Ŵ, is used as an estimate of the source correlation matrix (using Ŵ Ŵ^T) to substantially improve beamforming results that were errantly based on uncorrelated source models. [sent-91, score-0.583]
49 While shown to be successful in estimating a handful of hyperparameters in [8, 11], this could potentially be problematic when very large numbers of hyperparameters are present. [sent-97, score-0.317]
50 For example, in several toy problems (with dγ large) we have found that a fraction of the hyperparameters obtained can be negative-valued, inconsistent with our initial premise. [sent-98, score-0.144]
51 A new decomposition of Σb is defined as Σb = Σ + L (Σ_{i=1}^{dγ} γi Ci) L^T = Σ + Σ_{i=1}^{dγ} γi Li Li^T, (8) where Li Li^T ≡ L Ci L^T with ri ≡ rank(Li Li^T) ≤ db. [sent-101, score-0.136]
52 Also, using commutative properties of the trace operator, L(γ) only depends on the data B through the db × db sample correlation matrix BB^T. [sent-102, score-0.146]
53 Therefore, to reduce the computational burden, we replace B with a matrix B̃ ∈ R^(db×rank(B)) such that B̃ B̃^T = BB^T. [sent-103, score-0.108]
54 By casting the update rules in this way and noting that off-diagonal elements of the second term need not be computed, the per-iteration cost is at most O(db^2 Σ_{i=1}^{dγ} ri) ≤ O(db^3 dγ). [sent-107, score-0.157]
55 In general, the linear dependence on dγ is one of the attractive aspects of this method, effectively allowing for extremely large numbers of hyperparameters and covariance components. [sent-112, score-0.219]
56 The only reported localization results using this type of EM algorithm are from [15], where a relatively low resolution lead-field matrix is used in conjunction with a simplifying heuristic that constrains some of the hyperparameter values. [sent-114, score-0.232]
57 However, to avoid these types of constraints, which can potentially degrade the quality of source estimates, a faster update rule is needed. [sent-115, score-0.324]
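As an example of a faster alternative to plain EM, the sketch below applies a MacKay-style fixed-point update for the diagonal case Ci = ei ei^T. Such multiplicative updates typically converge much faster than EM; this generic rule is shown only for illustration and is not claimed to be the paper's update (10).

```python
import numpy as np

def mackay_update(gamma, L, B, Sigma_noise):
    """One MacKay-style fixed-point update for the diagonal case C_i = e_i e_i^T;
    a generic fast alternative to EM, not the paper's specific rule (10)."""
    Sigma_s = np.diag(gamma)
    Sigma_b = Sigma_noise + L @ Sigma_s @ L.T
    K = Sigma_s @ L.T @ np.linalg.inv(Sigma_b)
    M = K @ B                                          # posterior mean
    Sigma_post = Sigma_s - K @ L @ Sigma_s             # posterior covariance
    q = 1.0 - np.diag(Sigma_post) / np.maximum(gamma, 1e-12)   # well-determinedness
    return np.mean(M ** 2, axis=1) / np.maximum(q, 1e-12)
```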
58 For example, with db = 275 sensors, n = 1000 observation vectors, and a pseudo lead-field with 120,000 unique columns and an equal number of hyperparameters, the method requires approximately 5-10 minutes. [sent-118, score-0.108]
59 Example localization results using (10) demonstrate the ability to recover very complex source configurations with variable spatial extent [13]. [sent-121, score-0.477]
60 Finally, the correlated source method from [14] can be incorporated into the general ARD framework as well using update rules related to the above; however, because all off-diagonal terms are required by this method, the iterations now scale as (Σ_i ri)^2 in the general case. [sent-127, score-0.399]
61 2 Relationship with Other Bayesian Methods As a point of comparison, we now describe how ARD can be related to alternative Bayesian-inspired approaches such as the sLORETA paradigm [10] and the iterative FOCUSS source localization algorithm [5]. [sent-130, score-0.41]
62 The j-th column of R (called a point-spread function) equals the source estimate obtained using (4) when the true source is a unit dipole at location j [16]. [sent-132, score-0.609]
63 Continuing, if we assume that initialization of ARD occurs with γ^(0) = 1 (as is customary), then the hyperparameters produced after a single iteration of ARD are equivalent to computing the sLORETA estimate for standardized current density power [10] (this assumes fixed orientation constraints). [sent-133, score-0.2]
64 In this context, the inclusion of R as a normalization factor helps to compensate for depth bias, which is the propensity for deep current sources within the brain to be underrepresented at the scalp surface [10, 12]. [sent-134, score-0.245]
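For comparison, a minimal sketch of the sLORETA standardized power computation under fixed orientations is given below. The regularized minimum-norm operator W = L^T (L L^T + Σnoise)^{-1} and the normalization by the diagonal of the resolution matrix R = W L are assumptions consistent with the description above, not formulas taken from the paper.

```python
import numpy as np

def sloreta_power(L, B, Sigma_noise):
    """sLORETA-style standardized current density power under fixed orientations: a
    minimum-norm estimate normalized by the diagonal of the resolution matrix R.
    The specific regularized minimum-norm form of W is an assumption."""
    W = L.T @ np.linalg.inv(L @ L.T + Sigma_noise)     # minimum-norm inverse operator
    R = W @ L                                          # resolution matrix (point-spread functions)
    S_mn = W @ B                                       # minimum-norm source estimate
    return np.mean(S_mn ** 2, axis=1) / np.diag(R)     # standardized power per location
```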
65 Therefore, ARD can be viewed in some sense as taking the recursive FOCUSS update rules and including the sLORETA normalization that, among other things, allows for depth bias compensation. [sent-138, score-0.147]
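For reference, a generic single-time-point FOCUSS iteration (iteratively reweighted minimum norm) is sketched below; it deliberately omits the sLORETA-style normalization discussed above, and the regularization parameter and weighting are illustrative assumptions rather than the exact recursion of [5].

```python
import numpy as np

def focuss(L, b, lam=1e-6, n_iter=20):
    """Generic FOCUSS iteration for one time point b:
    s_{k+1} = W_k^2 L^T (L W_k^2 L^T + lam I)^{-1} b, with W_k = diag(|s_k|).
    Shown in its simplest form, without the sLORETA-style normalization."""
    db, ds = L.shape
    s = np.ones(ds)
    for _ in range(n_iter):
        Wsq = np.diag(s ** 2)                          # W_k W_k^T
        G = L @ Wsq @ L.T + lam * np.eye(db)
        s = Wsq @ L.T @ np.linalg.solve(G, b)
    return s
```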
66 Recall that the evidence maximization procedure upon which ARD is based involves integrating out the unknown sources before optimizing the hyperparameters γ. [sent-141, score-0.393]
67 For example, when p(γ) is a noninformative Jeffreys prior, then g(x) = log x and (13) becomes a generalized form of the FOCUSS cost function (and reduces to the exact FOCUSS cost when Ai = ei for all i). [sent-143, score-0.182]
68 ) can be naturally handled and, if desired, the noise covariance Σ can be seamlessly estimated as well (see [3] for a special case of the latter in the context of kernel regression). [sent-148, score-0.11]
69 3 General Properties of ARD Methods ARD methods maintain several attributes that make them desirable candidates for source localization. [sent-153, score-0.25]
70 Previously, we have claimed that the ARD process naturally forces excessive/irrelevant hyperparameters to converge to zero, thereby reducing model complexity. [sent-157, score-0.144]
71 While this observation has been verified empirically by ourselves and others in various application settings, there has been relatively little corroborating theoretical evidence, largely because of the difficulty in analyzing the potentially multimodal, non-convex ARD cost function. [sent-158, score-0.109]
72 Result 1: Every local minimum of the generalized ARD cost function (7) is achieved at a solution with at most rank(B)·db ≤ db^2 nonzero hyperparameters. [sent-160, score-0.127]
73 Result 1 comprises a worst-case bound that is only tight in very nuanced situations; in practice, for any reasonable value of Σ , the number of nonzero hyperparameters is typically much smaller than db . [sent-162, score-0.3]
74 The bound holds for all Σ , including Σ = 0, indicating that some measure of hyperparameter pruning, and therefore covariance component pruning, is built into the ARD framework irrespective of the noise-based regularization. [sent-163, score-0.113]
75 Moreover, the number of nonzero hyperparameters decreases monotonically to zero as Σ is increased. [sent-164, score-0.192]
76 And so there is always some Σ sufficiently large such that all hyperparameters converge to exactly zero. [sent-165, score-0.144]
77 For example, with low noise and perfectly correlated sources, the estimation problem reduces to an equivalent problem with n = 1, so the local minima profile of the cost function does not improve with increasing n. [sent-171, score-0.203]
78 In contrast, geometric arguments can be made to show that uncorrelated sources with large n offer the best opportunity for local minima avoidance. [sent-173, score-0.21]
79 Further theoretical support for ARD is possible in the context of localization bias assuming simple source configurations. [sent-175, score-0.508]
80 For example, substantial effort has been devoted to quantifying localization bias when estimating a single dipolar source. [sent-176, score-0.326]
81 Recently it has been shown, both empirically [10] and theoretically [16], that sLORETA has zero location bias under this condition at high SNR. [sent-177, score-0.124]
82 We assume that the lead-field matrix L represents a sufficiently high sampling of the source space such that any active dipole aligns with some lead-field column. [sent-181, score-0.305]
83 Assume that Σs includes (among others) ds covariance components of the form Ci = ei ei^T. [sent-184, score-0.255]
84 Then in the absence of noise (high SNR), ARD has provably zero localization bias when estimating a single dipolar source, regardless of the value of n. [sent-185, score-0.361]
85 For example, multiple dipolar sources can be localized with zero bias if they are perfectly uncorrelated (orthogonal) across time and assuming some mild technical conditions [20]. [sent-187, score-0.372]
86 Let Σs be constructed as above and assume the noise covariance matrix Σ is known up to a scale factor. [sent-190, score-0.11]
87 Then given a single dipolar source, in the limit as n becomes large the ARD cost function is unimodal, and a source estimate with zero localization bias achieves the global minimum. [sent-191, score-0.628]
88 As for proofs, all the theoretical results pertaining to localization bias in this section follow from local minima properties of ML covariance component estimates. [sent-193, score-0.408]
89 While details have been deferred to [20], the basic idea is that if the outer product BB^T can be expressed as some non-negative linear combination of the available covariance components, then the ARD cost function is unimodal and Σb = n^{-1} BB^T at any minimizing solution. [sent-194, score-0.127]
90 This Σb in turn produces unbiased source estimates in a variety of situations. [sent-195, score-0.25]
91 For example, all of the MAP-based focal algorithms we are aware of, including FOCUSS and MCE methods, provably maintain a localization bias in the general setting, although in particular cases they may not exhibit one. [sent-197, score-0.23]
92 (Also, because of the additional complexity involved, it is still unclear whether the correlated source method of [14] satisfies a similar result. [sent-198, score-0.294]
93 ) When we move to more complex source configurations with possible correlations and noise, theoretical results are not available; however, empirical tests provide a useful means of comparison. [sent-199, score-0.359]
94 4 Discussion The efficacy of modern empirical Bayesian techniques and variational approximations makes them attractive candidates for source localization. [sent-201, score-0.337]
95 Rao, “Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm,” J. [sent-241, score-0.309]
96 Friston, “MEG source localization under multiple constraints: An extended Bayesian framework,” NeuroImage, vol. [sent-270, score-0.41]
97 Friston, “An empirical Bayesian solution to the source reconstruction problem in EEG,” NeuroImage, vol. [sent-287, score-0.283]
98 Makeig, “Neuroelectromagnetic source imaging using multiscale ı geodesic neural bases and sparse Bayesian learning,” 12th Conf. [sent-295, score-0.316]
99 Nagarajan, “Reconstructing MEG sources with unknown correlations,” Advances in Neural Information Processing Systems 16, MIT Press, 2004. [sent-300, score-0.173]
100 Nagarajan, “Localization bias and spatial resolution of adaptive and non-adaptive spatial filters for MEG source reconstruction,” NeuroImage, vol. [sent-314, score-0.428]
wordName wordTfidf (topN-words)
[('ard', 0.674), ('source', 0.25), ('focuss', 0.203), ('sloreta', 0.203), ('lt', 0.167), ('localization', 0.16), ('hyperparameters', 0.144), ('sources', 0.13), ('meg', 0.128), ('db', 0.108), ('ds', 0.102), ('dipolar', 0.096), ('bayesian', 0.082), ('hyperprior', 0.08), ('neuroimage', 0.079), ('ei', 0.078), ('covariance', 0.075), ('mce', 0.073), ('bias', 0.07), ('ci', 0.069), ('bb', 0.058), ('neuroelectromagnetic', 0.055), ('rez', 0.055), ('dipole', 0.055), ('ram', 0.055), ('variational', 0.054), ('location', 0.054), ('eld', 0.053), ('cost', 0.052), ('prior', 0.051), ('activity', 0.049), ('correlations', 0.048), ('nonzero', 0.048), ('phillips', 0.048), ('evidence', 0.046), ('update', 0.045), ('correlated', 0.044), ('dipoles', 0.044), ('unknown', 0.043), ('li', 0.043), ('relevance', 0.042), ('uncorrelated', 0.042), ('em', 0.041), ('brain', 0.041), ('scalp', 0.041), ('locations', 0.041), ('posterior', 0.041), ('rao', 0.041), ('determination', 0.039), ('pruning', 0.039), ('dz', 0.039), ('friston', 0.039), ('orientations', 0.038), ('hyperparameter', 0.038), ('trace', 0.038), ('minima', 0.038), ('sensors', 0.037), ('makeig', 0.037), ('mattout', 0.037), ('pertaining', 0.037), ('reml', 0.037), ('rugg', 0.037), ('spatial', 0.037), ('candidate', 0.036), ('seemingly', 0.035), ('noise', 0.035), ('perfectly', 0.034), ('resolution', 0.034), ('eeg', 0.034), ('geodesic', 0.034), ('surface', 0.033), ('exible', 0.033), ('empirical', 0.033), ('assumptions', 0.033), ('imaging', 0.032), ('nagarajan', 0.032), ('neuromagnetic', 0.032), ('jeffreys', 0.032), ('wipf', 0.032), ('rules', 0.032), ('appropriate', 0.032), ('extent', 0.03), ('maximization', 0.03), ('sahani', 0.029), ('standardized', 0.029), ('transparent', 0.029), ('potentially', 0.029), ('issues', 0.029), ('inverse', 0.029), ('scenario', 0.029), ('theoretical', 0.028), ('ri', 0.028), ('respects', 0.027), ('orientation', 0.027), ('minimum', 0.027), ('automatic', 0.027), ('ai', 0.027), ('adopted', 0.026), ('spatially', 0.026), ('rank', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
2 0.25972241 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
Author: Johanna M. Zumer, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan
Abstract: We have developed a novel algorithm for integrating source localization and noise suppression based on a probabilistic graphical model of stimulus-evoked MEG/EEG data. Our algorithm localizes multiple dipoles while suppressing noise sources with the computational complexity equivalent to a single dipole scan, and is therefore more efficient than traditional multidipole fitting procedures. In simulation, the algorithm can accurately localize and estimate the time course of several simultaneously-active dipoles, with rotating or fixed orientation, at noise levels typical for averaged MEG data. Furthermore, the algorithm is superior to beamforming techniques, which we show to be an approximation to our graphical model, in estimation of temporally correlated sources. Success of this algorithm for localizing auditory cortex in a tumor patient and for localizing an epileptic spike source are also demonstrated. 1
3 0.15970717 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
4 0.11290389 60 nips-2006-Convergence of Laplacian Eigenmaps
Author: Mikhail Belkin, Partha Niyogi
Abstract: Geometrically based methods for various tasks of machine learning have attracted considerable attention over the last few years. In this paper we show convergence of eigenvectors of the point cloud Laplacian to the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold, thus establishing the first convergence results for a spectral dimensionality reduction algorithm in the manifold setting. 1
5 0.10804654 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
Author: Michael I. Mandel, Daniel P. Ellis, Tony Jebara
Abstract: We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation. 1
6 0.096486218 33 nips-2006-Analysis of Representations for Domain Adaptation
7 0.09619946 46 nips-2006-Blind source separation for over-determined delayed mixtures
8 0.076179236 113 nips-2006-Learning Structural Equation Models for fMRI
9 0.069845065 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
10 0.067229338 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
11 0.066976197 22 nips-2006-Adaptive Spatial Filters with predefined Region of Interest for EEG based Brain-Computer-Interfaces
12 0.065166704 116 nips-2006-Learning from Multiple Sources
13 0.062001649 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
14 0.057797417 131 nips-2006-Mixture Regression for Covariate Shift
15 0.057592504 57 nips-2006-Conditional mean field
16 0.057235148 69 nips-2006-Distributed Inference in Dynamical Systems
17 0.057213772 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
18 0.056391202 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
19 0.05542158 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
20 0.054330267 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
topicId topicWeight
[(0, -0.206), (1, -0.003), (2, 0.051), (3, -0.061), (4, -0.068), (5, -0.02), (6, 0.218), (7, 0.038), (8, -0.066), (9, 0.068), (10, 0.051), (11, -0.025), (12, -0.05), (13, -0.052), (14, 0.179), (15, -0.127), (16, 0.169), (17, -0.175), (18, 0.063), (19, 0.003), (20, -0.089), (21, 0.026), (22, -0.094), (23, -0.051), (24, -0.101), (25, -0.019), (26, 0.011), (27, -0.039), (28, 0.061), (29, -0.089), (30, -0.055), (31, 0.097), (32, -0.073), (33, 0.123), (34, -0.016), (35, 0.161), (36, 0.181), (37, -0.067), (38, -0.018), (39, -0.114), (40, -0.038), (41, 0.09), (42, 0.039), (43, -0.005), (44, -0.049), (45, -0.039), (46, -0.126), (47, 0.022), (48, 0.077), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.9499808 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
2 0.85121596 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
Author: Johanna M. Zumer, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan
Abstract: We have developed a novel algorithm for integrating source localization and noise suppression based on a probabilistic graphical model of stimulus-evoked MEG/EEG data. Our algorithm localizes multiple dipoles while suppressing noise sources with the computational complexity equivalent to a single dipole scan, and is therefore more efficient than traditional multidipole fitting procedures. In simulation, the algorithm can accurately localize and estimate the time course of several simultaneously-active dipoles, with rotating or fixed orientation, at noise levels typical for averaged MEG data. Furthermore, the algorithm is superior to beamforming techniques, which we show to be an approximation to our graphical model, in estimation of temporally correlated sources. Success of this algorithm for localizing auditory cortex in a tumor patient and for localizing an epileptic spike source are also demonstrated. 1
3 0.52607095 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
Author: Michael I. Mandel, Daniel P. Ellis, Tony Jebara
Abstract: We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation. 1
4 0.47727054 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
5 0.44596139 46 nips-2006-Blind source separation for over-determined delayed mixtures
Author: Lars Omlor, Martin Giese
Abstract: Blind source separation, i.e. the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and specifically for the case of dimension reduction almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a timefrequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, and also a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms. 1
6 0.43287876 113 nips-2006-Learning Structural Equation Models for fMRI
7 0.40447611 33 nips-2006-Analysis of Representations for Domain Adaptation
8 0.39358929 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
9 0.38003334 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
10 0.37265059 60 nips-2006-Convergence of Laplacian Eigenmaps
11 0.33194965 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
12 0.31723267 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
13 0.31212518 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
14 0.31051046 194 nips-2006-Towards a general independent subspace analysis
15 0.30950156 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
16 0.30197108 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
17 0.2808052 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
18 0.28055677 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
19 0.27038714 41 nips-2006-Bayesian Ensemble Learning
20 0.26674372 69 nips-2006-Distributed Inference in Dynamical Systems
topicId topicWeight
[(1, 0.12), (3, 0.037), (7, 0.08), (9, 0.042), (20, 0.033), (22, 0.069), (34, 0.036), (39, 0.177), (44, 0.088), (57, 0.077), (65, 0.059), (69, 0.037), (71, 0.02), (90, 0.017)]
simIndex simValue paperId paperTitle
1 0.92384636 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
Author: David Barber, Silvia Chiappa
Abstract: Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be more easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent dynamical processes underlying noisy EEG signals. 1 Linear Gaussian State-Space Models Linear Gaussian State-Space Models (LGSSMs)1 are fundamental in time-series analysis [1, 2, 3]. In these models the observations v1:T 2 are generated from an underlying dynamical system on h1:T according to: v v vt = Bht + ηt , ηt ∼ N (0V , ΣV ), h h ht = Aht−1 + ηt , ηt ∼ N (0H , ΣH ) , where N (µ, Σ) denotes a Gaussian with mean µ and covariance Σ, and 0X denotes an Xdimensional zero vector. The observation vt has dimension V and the hidden state ht has dimension H. Probabilistically, the LGSSM is defined by: T p(v1:T , h1:T |Θ) = p(v1 |h1 )p(h1 ) p(vt |ht )p(ht |ht−1 ), t=2 with p(vt |ht ) = N (Bht , ΣV ), p(ht |ht−1 ) = N (Aht−1 , ΣH ), p(h1 ) = N (µ, Σ) and where Θ = {A, B, ΣH , ΣV , µ, Σ} denotes the model parameters. Because of the widespread use of these models, a Bayesian treatment of parameters is of considerable interest [4, 5, 6, 7, 8]. An exact implementation of the Bayesian LGSSM is formally intractable [8], and recently a Variational Bayesian (VB) approximation has been studied [4, 5, 6, 7, 9]. The most challenging part of implementing the VB method is performing inference over h1:T , and previous authors have developed their own specialized routines, based on Belief Propagation, since standard LGSSM inference routines appear, at first sight, not to be applicable. 1 2 Also called Kalman Filters/Smoothers, Linear Dynamical Systems. v1:T denotes v1 , . . . , vT . A key contribution of this paper is to show how the Variational Bayesian treatment of the LGSSM can be implemented using standard LGSSM inference routines. Based on the insight we provide, any standard inference method may be applied, including those specifically addressed to improve numerical stability [2, 10, 11]. In this article, we decided to describe the predictor-corrector and Rauch-Tung-Striebel recursions [2], and also suggest a small modification that reduces computational cost. The Bayesian LGSSM is particularly of interest when strong prior constraints are needed to find adequate solutions. One such case is in EEG signal analysis, whereby we wish to extract sources that evolve independently through time. Since EEG is particularly noisy [12], a prior that encourages sources to have preferential dynamics is advantageous. This application is discussed in Section 4, and demonstrates the ease of applying our VB framework. 2 Bayesian Linear Gaussian State-Space Models In the Bayesian treatment of the LGSSM, instead of considering the model parameters Θ as fixed, ˆ ˆ we define a prior distribution p(Θ|Θ), where Θ is a set of hyperparameters. 
Then: ˆ p(v1:T |Θ) = ˆ p(v1:T |Θ)p(Θ|Θ) . (1) Θ In a full Bayesian treatment we would define additional prior distributions over the hyperparameters ˆ Θ. Here we take instead the ML-II (‘evidence’) framework, in which the optimal set of hyperpaˆ ˆ rameters is found by maximizing p(v1:T |Θ) with respect to Θ [6, 7, 9]. For the parameter priors, here we define Gaussians on the columns of A and B 3 : H e− p(A|α, ΣH ) ∝ αj 2 ˆ ( A j −A j ) T ˆ Σ−1 (Aj −Aj ) H H , e− p(B|β, ΣV ) ∝ j=1 βj 2 T ˆ (Bj −Bj ) ˆ Σ−1 (Bj −Bj ) V , j=1 ˆ ˆ which has the effect of biasing the transition and emission matrices to desired forms A and B. The −1 −1 4 conjugate priors for general inverse covariances ΣH and ΣV are Wishart distributions [7] . In the simpler case assumed here of diagonal covariances these become Gamma distributions [5, 7]. The ˆ hyperparameters are then Θ = {α, β}5 . Variational Bayes ˆ Optimizing Eq. (1) with respect to Θ is difficult due to the intractability of the integrals. Instead, in VB, one considers the lower bound [6, 7, 9]6 : ˆ ˆ L = log p(v1:T |Θ) ≥ Hq (Θ, h1:T ) + log p(Θ|Θ) q(Θ) + E(h1:T , Θ) q(Θ,h1:T ) ≡ F, where E(h1:T , Θ) ≡ log p(v1:T , h1:T |Θ). Hd (x) signifies the entropy of the distribution d(x), and · d(x) denotes the expectation operator. The key approximation in VB is q(Θ, h1:T ) ≡ q(Θ)q(h1:T ), from which one may show that, for optimality of F, ˆ E(h1:T ,Θ) q(h1:T ) . q(h1:T ) ∝ e E(h1:T ,Θ) q(Θ) , q(Θ) ∝ p(Θ|Θ)e These coupled equations need to be iterated to convergence. The updates for the parameters q(Θ) are straightforward and are given in Appendices A and B. Once converged, the hyperparameters are ˆ updated by maximizing F with respect to Θ, which lead to simple update formulae [7]. Our main concern is with the update for q(h1:T ), for which this paper makes a departure from treatments previously presented. 3 More general Gaussian priors may be more suitable depending on the application. For expositional simplicity, we do not put priors on µ and Σ. 5 For simplicity, we keep the parameters of the Gamma priors fixed. 6 Strictly we should write throughout q(·|v1:T ). We omit the dependence on v1:T for notational convenience. 4 Unified Inference on q(h1:T ) 3 Optimally q(h1:T ) is Gaussian since, up to a constant, E(h1:T , Θ) − 1 2 q(Θ) is quadratic in h1:T 7 : T T (vt −Bht )T Σ−1 (vt −Bht ) V q(B,ΣV ) + (ht −Aht−1 ) Σ−1 (ht −Aht−1 ) H t=1 q(A,ΣH ) . (2) In addition, optimally, q(A|ΣH ) and q(B|ΣV ) are Gaussians (see Appendix A), so we can easily carry out the averages in Eq. (2). The further averages over q(ΣH ) and q(ΣV ) are also easy due to conjugacy. Whilst this defines the distribution q(h1:T ), quantities such as q(ht ), required for example for the parameter updates (see the Appendices), need to be inferred from this distribution. Clearly, in the non-Bayesian case, the averages over the parameters are not present, and the above simply represents the posterior distribution of an LGSSM whose visible variables have been clamped into their evidential states. In that case, inference can be performed using any standard LGSSM routine. Our aim, therefore, is to try to represent the averaged Eq. (2) directly as the posterior distribution q (h1:T |˜1:T ) of an LGSSM , for some suitable parameter settings. 
˜ v Mean + Fluctuation Decomposition A useful decomposition is to write (vt − Bht )T Σ−1 (vt − Bht ) V = (vt − B ht )T Σ−1 (vt − B ht ) + hT SB ht , t V q(B,ΣV ) f luctuation mean and similarly (ht −Aht−1 )T Σ−1 (ht −Aht−1 ) H = (ht − A ht−1 )T Σ−1 (ht − A ht−1 ) +hT SA ht−1 , t−1 H q(A,ΣH ) mean f luctuation T −1 where the parameter covariances are SB ≡ B T Σ−1 B − B Σ−1 B = V HB and SA ≡ V V T −1 −1 −1 AT ΣH A − A ΣH A = HHA (for HA and HB defined in Appendix A). The mean terms simply represent a clamped LGSSM with averaged parameters. However, the extra contributions from the fluctuations mean that Eq. (2) cannot be written as a clamped LGSSM with averaged parameters. In order to deal with these extra terms, our idea is to treat the fluctuations as arising from an augmented visible variable, for which Eq. (2) can then be considered as a clamped LGSSM. Inference Using an Augmented LGSSM To represent Eq. (2) as an LGSSM q (h1:T |˜1:T ), we may augment vt and B as8 : ˜ v vt = vert(vt , 0H , 0H ), ˜ ˜ B = vert( B , UA , UB ), T where UA is the Cholesky decomposition of SA , so that UA UA = SA . Similarly, UB is the Cholesky decomposition of SB . The equivalent LGSSM q (h1:T |˜1:T ) is then completed by specifying9 ˜ v ˜ A≡ A , ˜ ΣH ≡ Σ−1 H −1 , ˜ ΣV ≡ diag( Σ−1 V −1 , IH , IH ), µ ≡ µ, ˜ ˜ Σ ≡ Σ. The validity of this parameter assignment can be checked by showing that, up to negligible constants, the exponent of this augmented LGSSM has the same form as Eq. (2)10 . Now that this has been written as an LGSSM q (h1:T |˜1:T ), standard inference routines in the literature may be applied to ˜ v compute q(ht |v1:T ) = q (ht |˜1:T ) [1, 2, 11]11 . ˜ v 7 For simplicity of exposition, we ignore the first time-point here. The notation vert(x1 , . . . , xn ) stands for vertically concatenating the arguments x1 , . . . , xn . 9 ˜ ˜ ˜ Strictly, we need a time-dependent emission Bt = B, for t = 1, . . . , T − 1. For time T , BT has the Cholesky factor UA replaced by 0H,H . 10 There are several ways of achieving a similar augmentation. We chose this since, in the non-Bayesian limit UA = UB = 0H,H , no numerical instabilities would be introduced. 11 Note that, since the augmented LGSSM q (h1:T |˜1:T ) is designed to match the fully clamped distribution ˜ v q(h1:T |v1:T ), the filtered posterior q (ht |˜1:t ) does not correspond to q(ht |v1:t ). ˜ v 8 Algorithm 1 LGSSM: Forward and backward recursive updates. The smoothed posterior p(ht |v1:T ) ˆ is returned in the mean hT and covariance PtT . t procedure F ORWARD 1a: P ← Σ −1 T T 1b: P ← DΣ, where D ≡ I − ΣUAB I + UAB ΣUAB UAB ˆ0 ← µ 2a: h1 ˆ 2b: h0 ← Dµ 1 1 ˆ ˆ ˆ 3: K ← P B T (BP B T + ΣV )−1 , P1 ← (I − KB)P , h1 ← h0 + K(vt − B h0 ) 1 1 1 for t ← 2, T do t−1 4: Ptt−1 ← APt−1 AT + ΣH t−1 5a: P ← Pt −1 T T 5b: P ← Dt Ptt−1 , where Dt ≡ I − Ptt−1 UAB I + UAB Ptt−1 UAB UAB ˆ ˆ 6a: ht−1 ← Aht−1 t t−1 ˆ ˆ 6b: ht−1 ← Dt Aht−1 t t−1 T ˆ ˆ ˆ 7: K ← P B (BP B T + ΣV )−1 , Ptt ← (I − KB)P , ht ← ht−1 + K(vt − B ht−1 ) t t t end for end procedure procedure BACKWARD for t ← T − 1, 1 do ← − t At ← Ptt AT (Pt+1 )−1 ← T − ← − T t t Pt ← Pt + At (Pt+1 − Pt+1 )At T ← ˆT − ˆ ˆ ˆ hT ← ht + At (ht+1 − Aht ) t t t end for end procedure For completeness, we decided to describe the standard predictor-corrector form of the Kalman Filter, together with the Rauch-Tung-Striebel Smoother [2]. These are given in Algorithm 1, where q (ht |˜1:T ) is computed by calling the FORWARD and BACKWARD procedures. ˜ v We present two variants of the FORWARD pass. 
Either we may call procedure FORWARD in ˜ ˜ ˜ ˜ ˜ ˜ Algorithm 1 with parameters A, B, ΣH , ΣV , µ, Σ and the augmented visible variables vt in which ˜ we use steps 1a, 2a, 5a and 6a. This is exactly the predictor-corrector form of a Kalman Filter [2]. Otherwise, in order to reduce the computational cost, we may call procedure FORWARD with the −1 ˜ ˜ parameters A, B , ΣH , Σ−1 , µ, Σ and the original visible variable vt in which we use steps ˜ ˜ V T 1b (where UAB UAB ≡ SA +SB ), 2b, 5b and 6b. The two algorithms are mathematically equivalent. Computing q(ht |v1:T ) = q (ht |˜1:T ) is then completed by calling the common BACKWARD pass. ˜ v The important point here is that the reader may supply any standard Kalman Filtering/Smoothing routine, and simply call it with the appropriate parameters. In some parameter regimes, or in very long time-series, numerical stability may be a serious concern, for which several stabilized algorithms have been developed over the years, for example the square-root forms [2, 10, 11]. By converting the problem to a standard form, we have therefore unified and simplified inference, so that future applications may be more readily developed12 . 3.1 Relation to Previous Approaches An alternative approach to the one above, and taken in [5, 7], is to write the posterior as T log q(h1:T ) = φt (ht−1 , ht ) + const. t=2 for suitably defined quadratic forms φt (ht−1 , ht ). Here the potentials φt (ht−1 , ht ) encode the averaging over the parameters A, B, ΣH , ΣV . The approach taken in [7] is to recognize this as a 12 The computation of the log-likelihood bound does not require any augmentation. pairwise Markov chain, for which the Belief Propagation recursions may be applied. The approach in [5] is based on a Kullback-Leibler minimization of the posterior with a chain structure, which is algorithmically equivalent to Belief Propagation. Whilst mathematically valid procedures, the resulting algorithms do not correspond to any of the standard forms in the Kalman Filtering/Smoothing literature, whose properties have been well studied [14]. 4 An Application to Bayesian ICA A particular case for which the Bayesian LGSSM is of interest is in extracting independent source signals underlying a multivariate timeseries [5, 15]. This will demonstrate how the approach developed in Section 3 makes VB easily to apply. The sources si are modeled as independent in the following sense: p(si , sj ) = p(si )p(sj ), 1:T 1:T 1:T 1:T for i = j, i, j = 1, . . . , C. Independence implies block diagonal transition and state noise matrices A, ΣH and Σ, where each block c has dimension Hc . A one dimensional source sc for each independent dynamical subsystem is then t formed from sc = 1T hc , where 1c is a unit vector and hc is the state of t c t t dynamical system c. Combining the sources, we can write st = P ht , where P = diag(1T , . . . , 1T ), ht = vert(h1 , . . . , hC ). The resulting 1 C t t emission matrix is constrained to be of the form B = W P , where W is the V × C mixing matrix. This means that the observations v are formed from linearly mixing the sources, vt = W st + ηt . The Figure 1: The structure of graphical structure of this model is presented in Fig 1. To encourage redundant components to be removed, we place a zero mean Gaussian the LGSSM for ICA. prior on W . In this case, we do not define a prior for the parameters ΣH and ΣV which are instead considered as hyperparameters. More details of the model are given in [15]. 
The constraint B = W P requires a minor modification from Section 3, as we discuss below. Inference on q(h1:T ) A small modification of the mean + fluctuation decomposition for B occurs, namely: (vt − Bht )T Σ−1 (vt − Bht ) V q(W ) = (vt − B ht )T Σ−1 (vt − B ht ) + hT P T SW P ht , t V −1 where B ≡ W P and SW = V HW . The quantities W and HW are obtained as in Appendix A.1 with the replacement ht ← P ht . To represent the above as a LGSSM, we augment vt and B as vt = vert(vt , 0H , 0C ), ˜ ˜ B = vert( B , UA , UW P ), where UW is the Cholesky decomposition of SW . The equivalent LGSSM is then completed by ˜ ˜ ˜ ˜ specifying A ≡ A , ΣH ≡ ΣH , ΣV ≡ diag(ΣV , IH , IC ), µ ≡ µ, Σ ≡ Σ, and inference for ˜ q(h1:T ) performed using Algorithm 1. This demonstrates the elegance and unity of the approach in Section 3, since no new algorithm needs to be developed to perform inference, even in this special constrained parameter case. 4.1 Demonstration As a simple demonstration, we used an LGSSM to generate 3 sources sc with random 5×5 transition t matrices Ac , µ = 0H and Σ ≡ ΣH ≡ IH . The sources were mixed into three observations v vt = W st + ηt , for W chosen with elements from a zero mean unit variance Gaussian distribution, and ΣV = IV . We then trained a Bayesian LGSSM with 5 sources and 7 × 7 transition matrices Ac . ˆ To bias the model to find the simplest sources, we used Ac ≡ 0Hc ,Hc for all sources. In Fig2a and Fig 2b we see the original sources and the noisy observations respectively. In Fig2c we see the estimated sources from our method after convergence of the hyperparameter updates. Two of the 5 sources have been removed, and the remaining three are a reasonable estimation of the original sources. Another possible approach for introducing prior knowledge is to use a Maximum a Posteriori (MAP) 0 50 100 150 200 250 300 0 50 100 (a) 150 200 250 300 0 50 (b) 100 150 200 250 300 0 50 (c) 100 150 200 250 300 (d) Figure 2: (a) Original sources st . (b) Observations resulting from mixing the original sources, v v vt = W st + ηt , ηt ∼ N (0, I). (c) Recovered sources using the Bayesian LGSSM. (d) Sources found with MAP LGSSM. 0 1 2 (a) 3s 0 1 2 (b) 3s 0 1 2 (c) 3s 0 1 2 (d) 3s 0 1 2 3s (e) Figure 3: (a) Original raw EEG recordings from 4 channels. (b-e) 16 sources st estimated by the Bayesian LGSSM. procedure by adding a prior term to the original log-likelihood log p(v1:T |A, W, ΣH , ΣV , µ, Σ) + log p(A|α) + log p(W |β). However, it is not clear how to reliably find the hyperparameters α and β in this case. One solution is to estimate them by optimizing the new objective function jointly with respect to the parameters and hyperparameters (this is the so-called joint map estimation – see for example [16]). A typical result of using this joint MAP approach on the artificial data is presented in Fig 2d. The joint MAP does not estimate the hyperparameters well, and the incorrect number of sources is identified. 4.2 Application to EEG Analysis In Fig 3a we plot three seconds of EEG data recorded from 4 channels (located in the right hemisphere) while a person is performing imagined movement of the right hand. As is typical in EEG, each channel shows drift terms below 1 Hz which correspond to artifacts of the instrumentation, together with the presence of 50 Hz mains contamination and masks the rhythmical activity related to the mental task, mainly centered at 10 and 20 Hz [17]. 
We would therefore like a method which enables us to extract components in these information-rich 10 and 20 Hz frequency bands. Standard ICA methods such as FastICA do not find satisfactory sources based on raw ‘noisy’ data, and preprocessing with band-pass filters is usually required. Additionally, in EEG research, flexibility in the number of recovered sources is important since there may be many independent oscillators of interest underlying the observations and we would like some way to automatically determine their effective number. To preferentially find sources at particular frequencies, we specified a block ˆ diagonal matrix Ac for each source c, where each block is a 2 × 2 rotation matrix at the desired frequency. We defined the following 16 groups of frequencies: [0.5], [0.5], [0.5], [0.5]; [10,11], [10,11], [10,11], [10,11]; [20,21], [20,21], [20,21], [20,21]; [50], [50], [50], [50]. The temporal evolution of the sources obtained after training the Bayesian LGSSM is given in Fig3(b,c,d,e) (grouped by frequency range). The Bayes LGSSM removed 4 unnecessary sources from the mixing matrix W , that is one [10,11] Hz and three [20,21] Hz sources. The first 4 sources contain dominant low frequency drift, sources 5, 6 and 8 contain [10,11] Hz, while source 10 contains [20,21] Hz centered activity. Of the 4 sources initialized to 50 Hz, only 2 retained 50 Hz activity, while the Ac of the other two have changed to model other frequencies present in the EEG. This method demonstrates the usefulness and applicability of the VB method in a real-world situation. 5 Conclusion We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification to the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayes ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may therefore facilitate the further application to related models. A A.1 Parameter Updates for A and B Determining q(B|ΣV ) By examining F, the contribution of q(B|ΣV ) can be interpreted as the negative KL divergence between q(B|ΣV ) and a Gaussian. Hence, optimally, q(B|ΣV ) is a Gaussian. The covariance [ΣB ]ij,kl ≡ Bij − Bij Bkl − Bkl (averages wrt q(B|ΣV )) is given by: T hj hl t t −1 [ΣB ]ij,kl = [HB ]jl [ΣV ]ik , where [HB ]jl ≡ t=1 −1 The mean is given by B = NB HB , where [NB ]ij ≡ T t=1 hj t q(ht ) q(ht ) + βj δjl . i ˆ vt + βj Bij . 
5 Conclusion

We considered the application of Variational Bayesian learning to Linear Gaussian State-Space Models. This is an important class of models with widespread application, and finding a simple way to implement this approximate Bayesian procedure is of considerable interest. The most demanding part of the procedure is inference of the hidden states of the model. Previously, this has been achieved using Belief Propagation, which differs from inference in the Kalman Filtering/Smoothing literature, for which highly efficient and stabilized procedures exist. A central contribution of this paper is to show how inference can be written using the standard Kalman Filtering/Smoothing recursions by augmenting the original model. Additionally, a minor modification of the standard Kalman Filtering routine may be applied for computational efficiency. We demonstrated the elegance and unity of our approach by showing how to easily apply a Variational Bayes analysis of temporal ICA. Specifically, our Bayesian ICA approach successfully extracts independent processes underlying EEG signals, biased towards preferred frequency ranges. We hope that this simple and unifying interpretation of Variational Bayesian LGSSMs may facilitate further application to related models.

A.1 Parameter Updates for A and B

Determining q(B|Σ_V). By examining F, the contribution of q(B|Σ_V) can be interpreted as the negative KL divergence between q(B|Σ_V) and a Gaussian; hence, optimally, q(B|Σ_V) is a Gaussian. The covariance [Σ_B]_{ij,kl} ≡ ⟨(B_ij − ⟨B_ij⟩)(B_kl − ⟨B_kl⟩)⟩ (averages taken with respect to q(B|Σ_V)) is given by

[Σ_B]_{ij,kl} = [H_B^{−1}]_{jl} [Σ_V]_{ik},   where [H_B]_{jl} ≡ Σ_{t=1}^{T} ⟨h_t^j h_t^l⟩_{q(h_t)} + β_j δ_{jl}.

The mean is given by ⟨B⟩ = N_B H_B^{−1}, where [N_B]_{ij} ≡ Σ_{t=1}^{T} ⟨h_t^j⟩_{q(h_t)} v_t^i + β_j B̂_{ij}.

Determining q(A|Σ_H). Optimally, q(A|Σ_H) is a Gaussian with covariance

[Σ_A]_{ij,kl} = [H_A^{−1}]_{jl} [Σ_H]_{ik},   where [H_A]_{jl} ≡ Σ_{t=1}^{T−1} ⟨h_t^j h_t^l⟩_{q(h_t)} + α_j δ_{jl}.

The mean is given by ⟨A⟩ = N_A H_A^{−1}, where [N_A]_{ij} ≡ Σ_{t=2}^{T} ⟨h_{t−1}^j h_t^i⟩_{q(h_{t−1:t})} + α_j Â_{ij}.

B Covariance Updates

By specifying a Wishart prior for the inverse of the covariances, conjugate update formulae are possible. In practice, it is more common to specify diagonal inverse covariances, for which the corresponding priors are simply Gamma distributions [7, 5]. For this simple diagonal case, the explicit updates are given below.

Determining q(Σ_V). For the constraint Σ_V^{−1} = diag(ρ), where each diagonal element follows a Gamma prior Ga(b_1, b_2) [7], q(ρ) factorizes and the optimal updates are

q(ρ_i) = Ga( b_1 + T/2,  b_2 + ½ [ Σ_{t=1}^{T} (v_t^i)^2 − [G_B]_{ii} + Σ_j β_j B̂_{ij}^2 ] ),   where G_B ≡ N_B H_B^{−1} N_B^T.

Determining q(Σ_H). Analogously, for Σ_H^{−1} = diag(τ) with prior Ga(a_1, a_2) [5], the updates are

q(τ_i) = Ga( a_1 + (T−1)/2,  a_2 + ½ [ Σ_{t=2}^{T} ⟨(h_t^i)^2⟩ − [G_A]_{ii} + Σ_j α_j Â_{ij}^2 ] ),   where G_A ≡ N_A H_A^{−1} N_A^T.
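As a concrete reading of the q(Σ_V) update, the sketch below computes the posterior Gamma parameters for the diagonal observation precisions ρ_i from the statistics N_B and H_B defined in A.1. It is a minimal sketch under the diagonal-covariance assumption above; the function signature and variable names are ours, not from the paper.

```python
import numpy as np

def update_q_rho(V, N_B, H_B, beta, B_hat, b1, b2):
    # Posterior Gamma parameters for the diagonal observation precisions rho_i:
    #   shape = b1 + T/2
    #   rate  = b2 + 0.5 * ( sum_t (v_t^i)^2 - [G_B]_ii + sum_j beta_j * B_hat_ij^2 )
    # with G_B = N_B H_B^{-1} N_B^T, as in the update above.
    #   V     : (d_v, T) observations, one column per time step
    #   N_B   : (d_v, d_h) statistic from the q(B|Sigma_V) update
    #   H_B   : (d_h, d_h) statistic from the q(B|Sigma_V) update
    #   beta  : (d_h,) prior precisions on the columns of B
    #   B_hat : (d_v, d_h) prior mean of B
    T = V.shape[1]
    G_B = N_B @ np.linalg.solve(H_B, N_B.T)      # N_B H_B^{-1} N_B^T
    shape = b1 + 0.5 * T
    rate = b2 + 0.5 * ((V ** 2).sum(axis=1)      # sum_t (v_t^i)^2
                       - np.diag(G_B)
                       + (beta * B_hat ** 2).sum(axis=1))
    return shape, rate                           # one Gamma(shape, rate[i]) factor per precision rho_i
```

The q(Σ_H) update has the same structure, with N_A, H_A, α and Â in place of N_B, H_B, β and B̂, and with the sum running over t = 2, …, T.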
Acknowledgments

This work is supported by the European DIRAC Project FP6-0027787. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Y. Bar-Shalom and X.-R. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1998.
[2] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. John Wiley and Sons, Inc., 2001.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications. Springer, 2000.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21:349–356, 2005.
[5] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 13th European Signal Processing Conference, 2005.
[6] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14:2647–2692, 2002.
[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[8] M. Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003.
[9] D. J. C. MacKay. Ensemble learning and evidence maximisation. Unpublished manuscript: www.variational-bayes.org, 1995.
[10] M. Morf and T. Kailath. Square-root algorithms for least-squares estimation. IEEE Transactions on Automatic Control, 20:487–497, 1975.
[11] P. Park and T. Kailath. New square-root smoothing algorithms. IEEE Transactions on Automatic Control, 41:727–732, 1996.
[12] E. Niedermeyer and F. Lopes da Silva. Electroencephalography: Basic Principles, Clinical Applications and Related Fields. Lippincott Williams and Wilkins, 1999.
[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[14] M. Verhaegen and P. Van Dooren. Numerical aspects of different Kalman filter implementations. IEEE Transactions on Automatic Control, 31:907–917, 1986.
[15] S. Chiappa and D. Barber. Bayesian linear Gaussian state-space models for biosignal decomposition. Signal Processing Letters, 14, 2007.
[16] S. S. Saquib, C. A. Bouman, and K. Sauer. ML parameter estimation for Markov random fields with applications to Bayesian tomography. IEEE Transactions on Image Processing, 7:1029–1044, 1998.
[17] G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, pages 1842–1857, 1999.
same-paper 2 0.86235487 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
3 0.75419748 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
4 0.75017166 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
Author: Andrea Vedaldi, Stefano Soatto
Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1
5 0.74877143 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
6 0.74722415 20 nips-2006-Active learning for misspecified generalized linear models
7 0.74672896 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.74512374 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
9 0.74394339 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
10 0.74347264 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
11 0.74124265 117 nips-2006-Learning on Graph with Laplacian Regularization
12 0.73680556 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
13 0.73605835 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
14 0.73582935 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
15 0.73423022 169 nips-2006-Relational Learning with Gaussian Processes
16 0.73403448 79 nips-2006-Fast Iterative Kernel PCA
17 0.73380715 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
18 0.73262763 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
19 0.73239207 35 nips-2006-Approximate inference using planar graph decomposition
20 0.73237264 121 nips-2006-Learning to be Bayesian without Supervision