nips nips2012 nips2012-321 knowledge-graph by maker-knowledge-mining

321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data


Source: pdf

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Spectral learning of linear dynamics from generalised-linear observations with application to neural population data Lars Buesing∗ , Jakob H. [sent-1, score-0.098]

2 Abstract Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. [sent-5, score-0.214]

3 We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. [sent-7, score-0.352]

4 Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. [sent-9, score-0.239]

5 1 Introduction Latent linear dynamical system (LDS) models, also known as Kalman-filter models or linear-Gaussian state-space models, provide an important framework for modelling shared temporal structure in multivariate time series. [sent-11, score-0.142]

6 Inference of the dynamical state in such a model can be performed exactly by Kalman smoothing [1] and so the expectation-maximisation (EM) algorithm may be used to find a local maximum of the likelihood [2]. [sent-13, score-0.084]

7 This is a method-of-moments-based estimation process, which, like other spectral methods, provides estimators that are non-iterative, consistent and do not suffer from the problems of multiple optima that dog maximum-likelihood (ML) learning in practice. [sent-15, score-0.096]

8 Such methods may be computationally intensive, suffer from varying degrees of approximation error, and are subject to the same concerns about multiple likelihood optima as in the linear-Gaussian case. Thus, a consistent spectral method is likely to be of particular value for such models. [sent-26, score-0.096]

9 Even when data are few, the approach provides a valuable initialisation to approximate EM. [sent-29, score-0.193]

10 1 SSID for LDS models with linear-Gaussian observations Let q-dimensional observations yt , t ∈ {1, . [sent-33, score-0.172]

11 , T } depend on a p-dimensional latent state xt, described by a linear first-order auto-regressive process with Gaussian initial distribution and Gaussian innovations: x1 ∼ N(x0, Q0); xt+1 | xt ∼ N(A xt, Q); zt = C xt + d; yt | zt ∼ N(zt, R). (1) [sent-36, score-0.356]

12 Here, x0 and Q0 are the mean and covariance of the initial state and Q is the covariance of the innovations. [sent-37, score-0.094]

13 The variable zt of dimension q is defined as an affine function of the latent state xt , parametrised by the loading matrix C and the mean parameter d. [sent-39, score-0.234]

14 Given zt , observations are independently distributed around this value with covariance R. [sent-40, score-0.17]
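
To make the generative model in (1) concrete, here is a minimal sampling sketch in Python/NumPy. The parameter names (A, Q, C, d, R, x0, Q0) follow the text; the function itself and its conventions are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sample_lds(A, Q, C, d, R, x0, Q0, T, rng=None):
    """Draw one trial of length T from the linear-Gaussian LDS of Eq. (1):
    x1 ~ N(x0, Q0); x_{t+1} | x_t ~ N(A x_t, Q); z_t = C x_t + d; y_t | z_t ~ N(z_t, R)."""
    rng = np.random.default_rng() if rng is None else rng
    p, q = A.shape[0], C.shape[0]
    X = np.zeros((T, p))   # latent states
    Y = np.zeros((T, q))   # observations
    X[0] = rng.multivariate_normal(x0, Q0)
    for t in range(T):
        if t > 0:
            X[t] = rng.multivariate_normal(A @ X[t - 1], Q)
        z = C @ X[t] + d
        Y[t] = rng.multivariate_normal(z, R)
    return X, Y
```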

15 Furthermore let Π := limt→∞ Cov[xt] denote the covariance of the stationary marginal distribution if the system is stable (i.e., all eigenvalues of A lie strictly inside the unit circle). [sent-41, score-0.075]

16 This algorithm takes as input the empirical estimate of the so-called "future-past Hankel matrix" H, which is defined as the cross-covariance H = Cov[y+, y−] between the time-lagged vectors y+ (the "future") and y− (the "past") of the observed data. [sent-48, score-0.319]

17 The key to SSID is that H (which is independent of t as stationarity is assumed) has rank equal to the dimensionality p of the linear dynamical state. [sent-56, score-0.116]

18 The SSID algorithm first takes the singular value decomposition (SVD) of the empirical estimate Ĥ of H to recover a two-part factorisation as in (2), given a user-defined latent dimensionality p (a suitable p may be estimated by inspection of the singular value spectrum of Ĥ). [sent-64, score-0.17]
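
A minimal sketch of the Ho-Kalman step described here, assuming a single stationary time series Y (T × q): the Hankel matrix is estimated as the empirical cross-covariance of stacked future and past blocks, and C and A are read off the truncated SVD by the standard shift-invariance argument. Function and variable names are mine, and the weighting and balancing choices used in full SSID implementations are omitted.

```python
import numpy as np

def hokalman_ssid(Y, k, p):
    """Estimate the loading matrix C and dynamics matrix A from observations Y (T x q)
    via the future-past Hankel matrix (block size k) and an SVD truncated at rank p."""
    T, q = Y.shape
    Yc = Y - Y.mean(axis=0)                  # remove the mean (estimate of d)
    ts = range(k, T - k)
    # Stack future (y_t, ..., y_{t+k-1}) and past (y_{t-1}, ..., y_{t-k}) vectors.
    fut = np.stack([np.concatenate([Yc[t + i] for i in range(k)]) for t in ts])
    past = np.stack([np.concatenate([Yc[t - 1 - i] for i in range(k)]) for t in ts])
    H = fut.T @ past / len(ts)               # empirical future-past Hankel matrix, kq x kq
    U, s, Vt = np.linalg.svd(H)
    Obs = U[:, :p] * np.sqrt(s[:p])          # extended observability matrix, kq x p
    C = Obs[:q]                              # first block row gives the loading matrix
    A = np.linalg.pinv(Obs[:-q]) @ Obs[q:]   # shift invariance: Obs[q:] = Obs[:-q] @ A
    return C, A
```

Inspecting the singular values s is the spectrum-based check the text describes for choosing the latent dimensionality p.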

19 Although SSID techniques have been developed for observations that are Gaussian-distributed around a mean that is a nonlinear function of the latent state [5], we are unaware of SSID methods that address arbitrary observation models. [sent-69, score-0.101]

20 2 SSID for gl-LDS models by moment conversion Consider now the gl-LDS in which the Gaussian observation process of model (1) is replaced by the following more general observation model. [sent-71, score-0.176]

21 Further, let yt,i | zt be arbitrarily distributed around a (known) monotonic element-wise nonlinear mapping f (·) such that E[yt |zt ] = f (zt ). [sent-75, score-0.085]

22 Consider the covariance matrix Cov[y± ] of the combined 2kq-dimensional future-past vector y± which is defined by stacking y+ and y− (here and henceforth we drop the subscripts t as unnecessary given the assumed stationarity of the process). [sent-84, score-0.113]

23 Denote the mean and covariance matrix of the normal distribution of z± (defined analogously to y± ) by µ and Σ. [sent-85, score-0.081]

24 The upper-right block of the covariance matrix Σ then provides an estimate of the future-past Hankel matrix Cov[z+ , z− ] which can be decomposed as in standard Ho-Kalman SSID. [sent-95, score-0.115]

25 3 SSID for Poisson dynamical systems (PLDSID) We now consider in greater detail a special case of the gl-LDS model, which is of particular interest in neuroscience applications. [sent-97, score-0.084]

26 The observations in this model are (when conditioned on the latent state) Poisson-distributed with a mean that is exponential in the output of the dynamical system, yt,i | zt,i ∼ Poisson[exp(zt,i )]. [sent-98, score-0.156]
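
For the Poisson case only the emission step of the sampler changes: the log-rates are exponentiated and counts are drawn from a Poisson distribution. A sketch, again with illustrative names rather than the authors' code:

```python
import numpy as np

def sample_plds(A, Q, C, d, x0, Q0, T, rng=None):
    """Draw spike counts from a PLDS: y_{t,i} | z_{t,i} ~ Poisson(exp(z_{t,i}))."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.zeros((T, A.shape[0]))
    X[0] = rng.multivariate_normal(x0, Q0)
    for t in range(1, T):
        X[t] = rng.multivariate_normal(A @ X[t - 1], Q)
    Z = X @ C.T + d              # log-rates
    Y = rng.poisson(np.exp(Z))   # Poisson-distributed spike counts
    return X, Y
```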

27 PLDS and close variants have recently been applied for modelling multi-electrode recordings [12, 13, 14, 15]. [sent-100, score-0.084]

29 A) Time-lagged covariance matrix Cov[yt+1 , yt ] and the singular value (SV) spectrum of the full Hankel matrix H = Cov[y+ , y− ] computed from the observed count data (artificial data set I). [sent-129, score-0.378]

30 D) Summed absolute difference of the eigenvalue spectra of the ground truth dynamics matrix A and the one identified by PLDSID. [sent-134, score-0.199]

31 E) Same as C) but for the angle between the subspaces spanned by the loading matrix of the ground truth and estimated models. [sent-136, score-0.158]

32 F) SV spectrum of the Hankel matrix of multi-electrode data before (left) and after (right) moment conversion. [sent-137, score-0.175]

33 This procedure ensures that there exists a unique solution (µ, Σ) to the moment conversion (6)-(8). [sent-141, score-0.118]

34 In such cases, the moment conversion asymptotically yields the unique correct moments µ and Σ of the Gaussian log-rates z. [sent-144, score-0.178]
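
The closed-form conversion that equations (6)-(8) refer to is not reproduced in this summary. The sketch below shows the standard log-normal/Poisson moment inversion that such a conversion rests on under the exponential-nonlinearity assumption; it is my reconstruction under that assumption rather than the paper's exact equations, and it requires the corrected covariances to stay in the admissible range (the procedure mentioned in sentence 33 ensures a unique valid solution).

```python
import numpy as np

def poisson_moment_conversion(m, S):
    """Recover the Gaussian moments (mu, Sigma) of the log-rates z from the count moments,
    assuming y_i | z ~ Poisson(exp(z_i)) with z ~ N(mu, Sigma), which implies
      E[y_i]        = exp(mu_i + Sigma_ii / 2),
      Cov[y_i, y_j] = E[y_i] E[y_j] (exp(Sigma_ij) - 1)   for i != j,
      Var[y_i]      = E[y_i] + E[y_i]^2 (exp(Sigma_ii) - 1).
    m: empirical mean of the counts; S: empirical covariance of the counts."""
    m = np.asarray(m, dtype=float)
    S = np.asarray(S, dtype=float)
    Sigma = np.log1p(S / np.outer(m, m))                          # off-diagonal inversion
    np.fill_diagonal(Sigma, np.log1p((np.diag(S) - m) / m**2))    # remove Poisson noise on the diagonal
    mu = np.log(m) - 0.5 * np.diag(Sigma)
    return mu, Sigma
```

Applied to the stacked future-past vector y±, the upper-right block of the recovered Σ plays the role of the Hankel matrix Cov[z+, z−] that the Ho-Kalman step above is then run on.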

35 Assuming stationarity, the Ho-Kalman SSID yields consistent estimates of A, C, Q, d given the true µ and Σ. [sent-145, score-0.075]

36 Hence, the proposed two-stage method yields consistent estimates of the parameters A, C, Q, d of a stationary PLDS. [sent-146, score-0.075]

37 Fortunately, provided that the external inputs are Gaussian-distributed and perturb the dynamics linearly, PLDSID can be extended to identify the parameters of this augmented model. [sent-150, score-0.099]

38 Let ut denote the r-dimensional observed external input at time t, and assume that u1 , . [sent-151, score-0.13]

39 In this case, the N4SID algorithm [3] can perform subspace identification based on the joint covariance of u± and z± . [sent-155, score-0.083]

40 Although this covariance is not observed directly in the gl-LDS case, our assumptions make u± and z± jointly normal and so we can use moment transformation again to estimate the required covariance from the observed covariance of u± and y± . [sent-156, score-0.266]

41 For the Poisson model with exponential nonlinearity, this transformation remains closed-form, and in combination with N4SID yields consistent estimates of the PLDS parameters and the input-coupling matrix B. [sent-157, score-0.109]

42 3 Results We investigated the properties of the proposed PLDSID algorithm in numerical experiments, using both artificial data and multi-electrode recordings of neural activity. [sent-159, score-0.109]

43 Time-series were generated by sampling from a stationary ground truth PLDS with p = 10 latent and q = 25 observed dimensions. [sent-162, score-0.136]

44 The loading matrices C were generated from a matrix with orthonormal columns and by a subsequent scaling with 12. [sent-167, score-0.087]

45 This resulted in instantaneous correlations that were comparable to (average absolute correlation coefficient, data set I: c̄ = 2 · 10−2) or smaller than (data sets II, III: c̄ = 3.5 · 10−3) those observed in the cortical multi-electrode recordings used below (c̄ = 2.…).

47 Additionally, we generated a data set for identifying PLDS models with external input by driving the ground truth PLDS of data set II with a 3-dimensional Gaussian AR(1) process ut; the coupling matrix B was generated such that B ut had the same covariance as the innovations Q. [sent-173, score-0.283]

48 We first illustrate the moment conversion defined by equations (6)-(8) on artificial data set I. [sent-175, score-0.146]

49 Fig. 1A shows the time-lagged cross-covariance Cov[yt+1, yt] as well as the singular value (SV) spectrum of the full future-past Hankel matrix H = Cov[y+, y−] (normalised such that the largest SV is 1), both estimated from 200 trials, with a Hankel size of k = 10. [sent-177, score-0.237]

50 The raw spectrum gradually decays towards small values but does not show a clear low-rank structure of the future-past Hankel matrix H. [sent-178, score-0.112]

51 Fig. 1B shows the output of the moment transformation yielding an approximation of the cross-covariance Cov[zt+1, zt] of the underlying inputs. [sent-180, score-0.148]

52 Further the SV spectrum of the full, transformed future-past Hankel matrix Cov[z+ , z− ] is shown. [sent-181, score-0.112]

53 The latter is dominated by only a few SVs, whose number matches the dimension of the ground truth system p = 10, clearly indicating a low-rank structure. [sent-182, score-0.099]

54 Fig. 1B as well as its SV spectrum are close to the ones computed from the underlying inputs shown in Fig. [sent-185, score-0.078]

55 (i.e., the summed absolute differences between sorted eigenvalues) of the identified and the ground truth dynamics matrix A. [sent-191, score-0.138]

56 The spectrum of A is an important characteristic of the model, as it determines the time-constants of the underlying dynamics. [sent-192, score-0.078]

57 Fig. 1E shows the subspace-angle between the true loading matrix C and the one estimated. (Footnote 4: Again, simply applying SSID to the log of the observed counts does not work as most counts are 0.) [sent-196, score-0.118]

58 Figure 2: PLDSID is a good initialiser for EM. [sent-201, score-0.124]

59 Cosmoothing performance on the training set as a function of the number of EM iterations for different initialisers on various data sets. [sent-202, score-0.078]

60 A) Artificial data set consisting of 200 trials and 25 observed dimensions. [sent-203, score-0.124]

61 EM initialised by PLDSID converges faster and achieves higher training performance than EM initialised with FA, Gaussian SSID or random parameter values. [sent-204, score-0.17]

62 B) Same as A) but for data with lower instantaneous correlations and longer auto-correlation. [sent-205, score-0.094]

63 C) Same as A) but for data with negligible temporal correlations and low instantaneous correlations. [sent-207, score-0.094]

64 D) 100 trials of multi-electrode recordings with 86 observed dimensions (spike-sorted units). [sent-209, score-0.178]

65 E) Same as D) but of data set size 500 trials, and only using the 40 most active units. F) Same as D) but for 863 trials with all 86 units. [sent-210, score-0.093]

66 As for the dynamics spectrum, the identified loading matrix approaches the true one for increasing training set size. [sent-212, score-0.12]

67 We compared it to 3 different methods, namely initialisation with random parameters (with 20-50 restarts), factor analysis (FA) and Gaussian SSID. [sent-214, score-0.193]

68 The quality of these initialisers was assessed by monitoring performance of the identified parameters as a function of EM iterations after initialisation. [sent-215, score-0.078]

69 Good initial parameter values yield fast convergence of EM in few iterations to high performance values, whereas poor initialisations are characterised by slow convergence and termination of EM in poor local maxima (or, potentially, shallow regions of the likelihood). [sent-216, score-0.184]

70 On the first artificial data set (Fig. 2A), which was designed to have short auto-correlation time constants but pronounced instantaneous correlations between the observed dimensions, PLDSID initialisation leads to superior performance compared to competing methods. [sent-224, score-0.318]

71 Furthermore, the PLDSID+EM parameters converge to a better local optimum than those initialised by the other methods. [sent-226, score-0.085]

72 Hence, on this data set, our initialisation yields both faster computation time and better final results. [sent-227, score-0.219]

73 The second artificial data set featured smaller instantaneous correlations between dimensions but longer auto-correlation time constants. [sent-228, score-0.094]

74 As shown in Fig. 2B, the PLDSID initialisation here yields parameters which are not further improved by EM iterations, whereas EM with other initialisations becomes stuck in poor local solutions. [sent-230, score-0.366]

75 The third data set (Fig. 2C) featured negligible temporal correlations and only very small instantaneous correlations across neurons (average instantaneous absolute-correlation c̄ = 3.…). [sent-232, score-0.147]

76 In general, we observed that PLDSID compares favourably to the other initialisation methods on all data sets we investigated, as long as the data exhibit shared variability across dimensions and time, and it was observed to work particularly well when correlations were substantial. [sent-235, score-0.324]

77 2 Expectation Maximisation initialised by PLDSID identifies better models on neural data We move now to examine the value of PLDSID in providing initial parameter values for subsequent EM iterations on multi-electrode recordings of neural activity. [sent-241, score-0.239]

78 Such data sets are challenging for statistical modelling as they are high-dimensional (on the order of 10² observed dimensions), sparse (on the order of 10 Hz of spiking activity) and show shared variability across time and dimensions. [sent-242, score-0.099]

79 Briefly, spiking activity was acquired from a 96-channel silicon electrode array (Blackrock, Salt Lake City, UT) implanted in motor areas of the cortex of a rhesus macaque performing a delayed center-out reach task. [sent-244, score-0.094]

80 First, we investigated the SV spectrum of the future-past Hankel matrix computed either from the count-observations of the data, or from the inferred underlying inputs (using Hankel size k = 30 and all trials, see Fig. 1F). [sent-249, score-0.14]

81 While we did not observe a marked difference between the two spectra, both spectra indicate that the data can be well described using a small number of singular values. [sent-251, score-0.09]

82 Next, we compared PLDSID to FA and Gaussian SSID initialisations for EM on two different subsets as well as the whole multi-electrode recording data set. [sent-253, score-0.101]

83 PLDSID provides the most appropriate initialisation for EM, allowing it to converge rapidly to better parameter values than are found starting from either the FA or SSID estimates. [sent-256, score-0.193]

84 We also applied all of the methods to the complete data set consisting of 863 trials with all 86 observed neurons (Hankel size k = 30). [sent-259, score-0.124]

85 The results in Fig. 2F indicate that again PLDSID provided the most useful initialisation for EM. [sent-261, score-0.193]

86 However, random initialisation leads to slow convergence and thus requires substantial computation, as described below. [sent-263, score-0.193]

87 Gaussian SSID yielded poor values for parameters on all data sets, leading EM to terminate in poor local optima after only a few iterations. [sent-264, score-0.153]

88 Thus, an ideal parameter initialisation method will not only improve the robustness of the identified parameters, but also reduce the computational time needed for EM convergence. [sent-270, score-0.193]

89 Using the variant of PLDSID which also identifies the coupling matrix B yields the best parameters. [sent-275, score-0.086]

90 In contrast, using the PLDSID variant which does not estimate B (B is initialised at 0) yields parameters which are of the same quality as alternative methods. [sent-276, score-0.111]

91 For all of the data sets used above, one single EM iteration in our implementation was substantially more costly than parameter initialisation by PLDSID (Fig. [sent-279, score-0.193]

92 In addition, EM started with random initialisation still yielded worse performance than with PLDSID initialisation even after 50 iterations (see Figure 2). [sent-286, score-0.465]

93 Thus, even with a conservative estimate, PLDSID initialisation reduces computational cost by at least a factor of 50 compared to random initialisation. [sent-287, score-0.193]

94 Both PLDSID and EM have a computational time complexity proportional to the size N T of the data set (where N is the number of trials and T is the trial length). [sent-288, score-0.093]

95 This simple covariance calculation was much cheaper in our experiments than the moment conversion with cost O(pq²) or the SVD with cost O(p³q³), both of which are independent of the data set size N T. [sent-290, score-0.165]

96 4 Discussion We investigated parameter estimation for linear-Gaussian state-space models with generalised-linear observations and presented a method for parameter identification in such models which builds on the extensive subspace-identification literature for fully Gaussian state-space models. [sent-294, score-0.098]

97 We showed that PLDSID yields consistent estimates of the model parameters without requiring iterative computation. [sent-296, score-0.075]

98 Even when this was not the case, EM initialised with the results of PLDSID converged in fewer iterations, and to a better parameter estimate than when it was initialised randomly, or by other methods—an effect seen with multiple artificial and multi-electrode recording data sets. [sent-298, score-0.207]

99 As the practical computational difficulties of parameter estimation in this model (slow convergence and shallow optima of EM) are substantial, our algorithm facilitates the use of linear state-space models with non-Gaussian observations in practice. [sent-299, score-0.084]

100 Of particular interest for neural data might be a dynamical system model which precisely reproduced the marginal distribution of integer observations for each observed dimension (by using a ‘Discretised Gaussian’ [20] as the observation model). [sent-301, score-0.237]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pldsid', 0.601), ('ssid', 0.395), ('hankel', 0.265), ('em', 0.259), ('plds', 0.221), ('initialisation', 0.193), ('sv', 0.146), ('yt', 0.096), ('trials', 0.093), ('cov', 0.09), ('zt', 0.085), ('initialised', 0.085), ('dynamical', 0.084), ('spectrum', 0.078), ('identi', 0.077), ('comput', 0.066), ('external', 0.066), ('initialisations', 0.064), ('moment', 0.063), ('spectra', 0.061), ('fa', 0.058), ('lds', 0.056), ('conversion', 0.055), ('recordings', 0.054), ('loading', 0.053), ('instantaneous', 0.053), ('covariance', 0.047), ('kq', 0.047), ('ez', 0.047), ('iterations', 0.046), ('optima', 0.046), ('neurosci', 0.045), ('macke', 0.042), ('correlations', 0.041), ('ii', 0.041), ('normed', 0.039), ('observations', 0.038), ('spiking', 0.038), ('arti', 0.037), ('recording', 0.037), ('poor', 0.037), ('shenoy', 0.036), ('subspace', 0.036), ('truth', 0.036), ('poisson', 0.036), ('ground', 0.035), ('moments', 0.034), ('matrix', 0.034), ('latent', 0.034), ('cial', 0.034), ('maximisation', 0.033), ('activity', 0.033), ('dynamics', 0.033), ('yielded', 0.033), ('ut', 0.033), ('mi', 0.032), ('stationarity', 0.032), ('innovations', 0.032), ('pq', 0.032), ('cosmoothing', 0.032), ('evs', 0.032), ('generalisedlinear', 0.032), ('initialiser', 0.032), ('initialisers', 0.032), ('jakob', 0.032), ('observed', 0.031), ('modelling', 0.03), ('ey', 0.03), ('observation', 0.029), ('count', 0.029), ('singular', 0.029), ('gaussian', 0.029), ('xt', 0.028), ('equations', 0.028), ('system', 0.028), ('investigated', 0.028), ('spike', 0.028), ('vidne', 0.028), ('neural', 0.027), ('ms', 0.027), ('spectral', 0.027), ('estimates', 0.026), ('ahmadian', 0.026), ('cunningham', 0.026), ('kulkarni', 0.026), ('ryu', 0.026), ('yields', 0.026), ('hz', 0.025), ('approximations', 0.024), ('lars', 0.024), ('sahani', 0.024), ('churchland', 0.024), ('sii', 0.024), ('maneesh', 0.024), ('santhanam', 0.024), ('consistent', 0.023), ('fano', 0.023), ('binning', 0.023), ('axt', 0.023), ('motor', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

2 0.15443379 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

3 0.14977729 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

4 0.11299695 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

5 0.08363001 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

6 0.078574017 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.076778136 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

8 0.074620605 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

9 0.071479589 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

10 0.070917636 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

11 0.070557699 180 nips-2012-Learning Mixtures of Tree Graphical Models

12 0.069846332 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

13 0.062464558 190 nips-2012-Learning optimal spike-based representations

14 0.062363561 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

15 0.057149973 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

16 0.056488011 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

17 0.056236196 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

18 0.055477973 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

19 0.055242624 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

20 0.054526813 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.155), (1, 0.027), (2, 0.029), (3, 0.139), (4, -0.043), (5, 0.024), (6, 0.002), (7, -0.02), (8, -0.012), (9, 0.009), (10, 0.008), (11, -0.049), (12, 0.023), (13, -0.005), (14, 0.039), (15, 0.035), (16, 0.06), (17, 0.023), (18, 0.046), (19, -0.063), (20, -0.038), (21, 0.024), (22, -0.024), (23, 0.047), (24, 0.018), (25, 0.045), (26, 0.009), (27, -0.104), (28, -0.035), (29, 0.089), (30, 0.009), (31, 0.093), (32, 0.001), (33, 0.073), (34, -0.063), (35, 0.002), (36, -0.018), (37, -0.027), (38, 0.05), (39, 0.124), (40, 0.053), (41, -0.074), (42, -0.077), (43, -0.011), (44, -0.017), (45, -0.008), (46, -0.006), (47, -0.026), (48, 0.0), (49, -0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89516854 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

2 0.64789444 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

Author: Zhitang Chen, Kun Zhang, Laiwan Chan

Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.

3 0.62790537 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

4 0.61317325 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

5 0.59951508 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

6 0.55827343 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

7 0.53091401 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

8 0.5197466 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

9 0.51683533 72 nips-2012-Cocktail Party Processing via Structured Prediction

10 0.50108248 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

11 0.48682788 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

12 0.47823247 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

13 0.47716635 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

14 0.47034132 283 nips-2012-Putting Bayes to sleep

15 0.46853539 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

16 0.46289393 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

17 0.44596452 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

18 0.43599486 34 nips-2012-Active Learning of Multi-Index Function Models

19 0.43421316 224 nips-2012-Multi-scale Hyper-time Hardware Emulation of Human Motor Nervous System Based on Spiking Neurons using FPGA

20 0.43246853 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.064), (1, 0.04), (2, 0.169), (17, 0.011), (21, 0.031), (38, 0.085), (42, 0.04), (54, 0.026), (55, 0.019), (61, 0.014), (74, 0.035), (76, 0.144), (80, 0.169), (92, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84774303 196 nips-2012-Learning with Partially Absorbing Random Walks

Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang

Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.

2 0.84532422 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating ‘variational’ density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables the fitting for example of skew variational densities to the target density. 1

same-paper 3 0.84044099 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.

4 0.83399796 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a stateof-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. 1 Train labels Train images Test image Appearance model Joint Model Shape model Parsing Figure 1: Overview. Using annotated images separate models of shape and appearance are trained. 
Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5. 1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd }, d = 1...n of such images X, each consisting of P pixels {xi }, i = 1...P , we wish to infer a segmentation S for the image. S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli ), P l = 0...L, sli 2 {0, 1} and l sli = 1. Note that the background is also treated as a ‘part’ (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1 , h2 }), and defines a P Boltzmann distribution over segmentations p(S) = h1 ,h2 exp{ E(S, h1 , h2 |✓s )}/Z(✓s ) where X X X X X 1 2 E(S, h1 , h2 |✓s ) = bli sli + wlij sli h1 + c 1 h1 + wjk h1 h2 + c2 h2 , (1) j j j j k k k i,l j i,j,l j,k k where j and k range over the first and second layer hidden variables, and ✓s = {W 1 , W 2 , b, c1 , c2 } are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2 . 
These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. 2 h2 1 1 h S S (a) Mean h S (b) MRF h2 h2 h1 S S (c) DBM b (d) SBM (e) 2D SBM Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1]. k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 ⇡ l=0 l=1 l=2 Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color. Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means {µw } and covariances {⌃w } of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when ‘coloring in’ the part. The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable al ). Given al , the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means {µw } and covariances {⌃w } and mixing proportions { lkw }. The prior on A = {al } specifies the probability ⇡lk of appearance class k being chosen for part l. Therefore appearance parameters ✓a = {⇡lk , lkw } (see Fig. 3) and: a p(xi |A, si , ✓ ) = p(A|✓a ) = Y l Y l a sli p(xi |al , ✓ ) p(al |✓a ) = = Y Y X YY l l k w lkw N (xi |µw , ⌃w ) !alk !sli (⇡lk )alk . , (2) (3) k Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model’s active parameters ✓ include shape parameters ✓s and appearance parameters ✓a , so that p(X, A, S, H|✓) = Y 1 p(A|✓a )p(S, H|✓s ) p(xi |A, si , ✓a ) , Z( ) i (4) where the parameter adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of ✓ that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S|Xtest , ✓). Note that Z( ) is constant throughout the execution of the algorithms. We set via trial and error in our experiments. 3 n H ✓a si al H xi L+1 ✓s S X A P Figure 4: A model of shape and appearance. Left: The joint model. Pixels xi are modeled via appearance variables al . The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables si assign each pixel to a layer. Right: Schematic for an image X. 2 Inference and learning Inference: We approximate p(A, S, H|X, ✓) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S|X, ✓) can then be obtained by considering only the samples for S (see Algorithm 1). 
In order to sample p(A|S, H, X, ✓) we consider the conditional distribution of appearance class k being chosen for part l which is given by: Q P ·s ⇡lk i ( w lkw N (xi |µw , ⌃w )) li h Q P i. p(alk = 1|S, X, ✓) = P (5) K ·sli r=1 ⇡lr i( w lrw N (xi |µw , ⌃w )) Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are: X X 1 2 p(h1 = 1|s, h2 , ✓) = ( wlij sli + wjk h2 + c1 ), (6) j k j i,l p(h2 k 1 = 1|h , ✓) = ( X k 2 wjk h1 j + c2 ), j (7) j where (y) = 1/(1 + exp( y)) is the sigmoid function. To sample from p(H|S, X, ✓) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h1 and h2 . Finally, we draw samples for the pixels in p(S|A, H, X, ✓) independently: P 1 exp( j wlij h1 + bli ) p(xi |A, sli = 1, ✓) j p(sli = 1|A, H, X, ✓) = PL . (8) P 1 1 m=1 exp( j wmij hj + bmi ) p(xi |A, smi = 1, ✓) Seeding: Since the latent-space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S(1) to a different value. The ‘best’ inference is retained and the others are discarded. The computation of the likelihood p(X|✓) of image X is intractable, so we approximate the quality of each inference using a scoring function: 1X Score(X|✓) = p(X, A(t) , S(t) , H(t) |✓), (9) T t where {A(t) , S(t) , H(t) }, t = 1...T are the samples obtained from the posterior p(A, S, H|X, ✓). If the samples were drawn from the prior p(A, S, H|✓) the scoring function would be an unbiased estimator of p(X|✓), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue). Learning: Learning of the model involves maximizing the log likelihood log p(D|✓a , ✓s ) of the training dataset D with respect to the model parameters ✓a and ✓s . Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep 4 Algorithm 1 MCMC inference algorithm. 1: procedure I NFER(X, ✓) 2: Initialize S(1) , H(1) 3: for t 2 : chain length do 4: A(t) ⇠ p(A|S(t 1) , H(t 1) , X, ✓) 5: S(t) ⇠ p(S|A(t) , H(t 1) , X, ✓) 6: H(t) ⇠ p(H|S(t) , ✓) 7: return {S(t) }t=burnin:chain length learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as ‘persistent CD’; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to finetune the parameters in an EM-like scheme as described in [13]. 
3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models. 4 Experiments Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 ⇥ 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs. 5 (c) Completion (a) Sampling (b) Diffs ! ! ! Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink). 
To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object's details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, e.g., the length of the person's trousers on one leg affects the model's predictions for the other, demonstrating the model's knowledge about long-range dependencies. An interactive MATLAB GUI for sampling from this MSBM has been included in the supplementary material.

The Penn-Fudan dataset (at 200 × 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components (see footnote 1). We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground truth 95% of the time.

During inference, the shape and appearance models (which are defined on images of different sizes) were combined at 200 × 100 pixels via MATLAB's imresize function, and we set δ = 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized MATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest-scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy that would have been obtained had the top-scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrian-specific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed + SP). Qualitative results can be seen in Fig. 6(c).
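The seeding, chain scoring and final pixel labelling described above can be put together roughly as follows. This Python/NumPy sketch assumes a `run_chain(X, S0)` routine (e.g. the `infer()` sketch earlier), a placeholder `joint_prob(X, A, S, H)` for the model's joint density used in Eq. 9, segmentations stored as (L, num_pixels) one-hot arrays, and a precomputed integer superpixel label per pixel; all names are ours, not the authors' implementation.

```python
import numpy as np

def score_chain(X, samples, joint_prob):
    # Eq. 9: Score(X | theta) = (1/T) sum_t p(X, A_t, S_t, H_t | theta)
    return np.mean([joint_prob(X, A, S, H) for (A, S, H) in samples])

def segment(X, seeds, run_chain, joint_prob):
    # Run one seeded chain per exemplar segmentation, keep the highest-scoring
    # chain, and label each pixel with its most probable part under the
    # retained samples of that chain.
    best_samples, best_score = None, -np.inf
    for S0 in seeds:
        samples = run_chain(X, S0)
        s = score_chain(X, samples, joint_prob)
        if s > best_score:
            best_samples, best_score = samples, s
    prob = np.stack([S for (_, S, _) in best_samples]).mean(axis=0)  # (L, I)
    return prob.argmax(axis=0)            # (I,) part label per pixel

def superpixel_vote(labels, superpixels):
    # Optional smoothing: every pixel in a superpixel gets the most common
    # label inside that superpixel (superpixels from, e.g., gPb-OWT-UCM).
    out = labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        out[mask] = np.bincount(labels[mask]).argmax()
    return out
```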
Footnote 1: We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact.

Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]).

                      FG       BG       Upper Body   Lower Body   Head     Average
Bo and Fowlkes [21]   73.3%    81.1%    73.6%        71.6%        51.8%    69.5%
MSBM                  70.7%    72.8%    68.6%        66.7%        53.0%    65.3%
MSBM + SP             71.6%    73.8%    69.9%        68.5%        54.1%    66.6%
Top seed              59.0%    61.8%    56.8%        49.8%        45.5%    53.5%
Top seed + SP         61.6%    67.3%    60.8%        54.1%        43.5%    56.4%

Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label.

           BG       Body     Wheel    Window   Bumper   License   Light    Average
ISM [22]   93.2%    72.2%    63.6%    80.5%    73.8%    56.2%     34.8%    86.8%
MSBM       94.6%    72.7%    36.8%    74.4%    64.9%    17.9%     19.9%    86.0%
Top seed   92.2%    68.4%    28.3%    63.8%    45.4%    11.2%     15.1%    81.8%

ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semi-profile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 × 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components, and we set δ = 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images). Qualitative and quantitative results showing the performance of the model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the 'license' and 'light' labels to mainly be due to ISM's use of interest points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease.

5 Conclusions and future work

In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours with a discriminative, bottom-up segmentation algorithm could be highly effective.
We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model.

Acknowledgments

Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886).

Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. Part labels: Background, Hair, Face, Upper, Shoes, Legs, Lower, Arms.

Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. Part labels: Background, Body, Wheel, Window, Bumper, License, Headlight.

References

[1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012.
[2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010.
[3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982.
[5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006.
[7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006.
[8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004.
[9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005.
[10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044–2051, 2009.
[11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Parts-based Object Understanding. In BMVC, pages 18.1–18.12, 2011.
[12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009.
[14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.
[15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004.
[16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004.
[17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. In IEEE CVPR, pages 1–7, 2007.
[18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010.
[19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. In ICANN, 2011.
[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011.
[21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011.
[22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot's Attention. IJRR, 28(8):976–998, 2009.
[23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008.
[24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010.
[25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009.

5 0.80818027 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

Author: Marc Deisenroth, Shakir Mohamed

Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making.

6 0.798388 197 nips-2012-Learning with Recursive Perceptual Representations

7 0.79681063 200 nips-2012-Local Supervised Learning through Space Partitioning

8 0.79593688 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

9 0.79231411 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

10 0.79011446 279 nips-2012-Projection Retrieval for Classification

11 0.78950042 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

12 0.78808177 100 nips-2012-Discriminative Learning of Sum-Product Networks

13 0.7826004 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

14 0.77933902 188 nips-2012-Learning from Distributions via Support Measure Machines

15 0.77735788 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

16 0.77733755 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

17 0.77599919 168 nips-2012-Kernel Latent SVM for Visual Recognition

18 0.77525407 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

19 0.77412415 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

20 0.77411336 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models