nips nips2013 nips2013-308 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. [sent-7, score-1.693]
2 Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. [sent-8, score-1.006]
3 Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. [sent-9, score-1.113]
4 Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. [sent-10, score-0.258]
5 We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. [sent-12, score-0.951]
6 The entropy rate is of interest as a measure of uncertainty per unit time, an upper bound on the rate of information transmission, and an intermediate step in computing mutual information rate between stimulus and neural response. [sent-16, score-0.92]
7 Unfortunately, accurate entropy rate estimation is difficult, and estimates from limited data are often severely biased. [sent-17, score-0.704]
8 We present a Bayesian method for estimating entropy rates from binary data that uses hierarchical Dirichlet process (HDP) priors to reduce this bias. [sent-18, score-0.677]
9 Our method proceeds by modeling the source of the data as a Markov chain, and then using the fact that the entropy rate of a Markov chain is a deterministic function of its transition probabilities. [sent-19, score-0.945]
10 Fitting the model yields parameters relevant to both questions (1) and (2) above: we obtain both an approximation of the underlying stochastic process as a Markov chain, and an estimate of the entropy rate of the process. [sent-20, score-0.701]
11 For binary data, the HDP reduces to a hierarchy of beta priors, where the prior probability over g, the probability of the next symbol given a long history, is a beta distribution centered on the probability of that symbol given a truncated, one-symbol-shorter, history. [sent-21, score-0.587]
12 The posterior over symbols given a certain history is thus “smoothed” by the probability over symbols given a shorter history. [sent-22, score-0.415]
13 In Section 2, we present definitions and challenges involved in entropy rate estimation, and discuss existing estimators. [sent-25, score-0.642]
14 In Section 3, we discuss Markov models and their relationship to entropy rate. [sent-26, score-0.503]
15 2 Entropy Rate Estimation In information theory, the entropy of a random variable is a measure of the variable’s average unpredictability. [sent-30, score-0.503]
16 The entropy of a discrete random variable X with possible values {x_1, ..., x_n} is H(X) = −Σ_{i=1}^{n} p(x_i) log_2 p(x_i). [sent-31, score-0.503]
17 While entropy is a property of a random variable, entropy rate is a property of a stochastic process, such as a time series, and quantifies the amount of uncertainty per symbol. [sent-36, score-1.145]
18 The neural and simulated data considered here will be binary sequences representing the spike train of a neuron, where each symbol represents either the presence of a spike in a bin (1) or the absence of a spike (0). [sent-37, score-0.611]
19 To evaluate the average uncertainty of each new symbol (0 or 1) given the previous symbols - or the amount of new information per symbol - we would like to compute the entropy rate of the process. [sent-39, score-1.018]
20 For a stochastic process {X_i}_{i=1}^∞, the entropy of the random vector (X_1, ..., X_k) grows with k. [sent-40, score-0.53]
21 If we define the block entropy H_k to be the entropy of the distribution of length-k sequences of symbols, H_k = H(X_{i+1}, ..., X_{i+k}), [sent-44, score-1.186]
22 then the entropy rate of a stochastic process {X_i}_{i=1}^∞ is defined by h = lim_{k→∞} (1/k) H_k (2) when the limit exists (which, for stationary stochastic processes, it must). [sent-47, score-0.669]
23 There are two other definitions for entropy rate, which are equivalent to the first for stationary processes: h = lim_{k→∞} (H_{k+1} − H_k) (3) and [sent-48, score-0.503]
24 h = lim_{k→∞} H(X_{i+1} | X_i, X_{i−1}, ..., X_{i−k}) (4). We now briefly review existing entropy rate estimators, to which we will compare our results. [sent-51, score-0.642]
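As a quick sanity check on these three definitions (an illustrative example of our own, not taken from the paper): for an i.i.d. Bernoulli(p) binary source, H_k = k H(p), so all three limits coincide at the entropy of a single symbol,

```latex
h = \lim_{k\to\infty}\tfrac{1}{k}H_k
  = \lim_{k\to\infty}\left(H_{k+1}-H_k\right)
  = H(X_{i+1}\mid X_i, X_{i-1}, \dots)
  = -p\log_2 p - (1-p)\log_2(1-p),
```

which equals 1 bit per symbol for a fair coin (p = 1/2).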
25 2.1 Block Entropy Rate Estimators Since much work has been done to accurately estimate entropy from data, Equations (2) and (3) suggest a simple entropy rate estimator, which consists of choosing first a block size k and then a suitable entropy estimator with which to estimate H_k. [sent-53, score-1.931]
26 A simple such estimator is the “plugin” entropy estimator, which approximates the probability of each length-k block (x_1, ..., x_k) by its empirical frequency in the data. [sent-54, score-0.722]
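For concreteness, here is a minimal Python sketch of a plugin block entropy rate estimator of this kind; the function name and interface are our own, not code from the paper.

```python
import numpy as np
from collections import Counter

def plugin_block_entropy_rate(x, k):
    """Plugin estimate of the entropy rate (bits/symbol) of a binary
    sequence x, based on empirical frequencies of length-k blocks."""
    x = np.asarray(x)
    # Empirical distribution over the N - k + 1 overlapping length-k blocks.
    blocks = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
    counts = np.array(list(blocks.values()), dtype=float)
    p = counts / counts.sum()
    H_k = -np.sum(p * np.log2(p))   # plugin block entropy, in bits
    return H_k / k                  # h_plugin = H_k / k, as in Equation (2)

# Example: i.i.d. fair coin flips have entropy rate 1 bit/symbol.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)
print(plugin_block_entropy_rate(x, k=5))   # close to 1.0
```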
27 We would expect that using better block entropy estimators would yield better entropy rate estimators, and so we also consider two other block based entropy rate estimators. [sent-63, score-2.136]
28 The first uses the Bayesian entropy estimator H_NSB from Nemenman, Shafee and Bialek [3], which gives a Bayesian least squares estimate for entropy given a mixture-of-Dirichlet prior. [sent-64, score-1.14]
29 For a given k, we obtain entropy rate estimators h_NSB,k = H_NSB/k and h_MM,k = H_MM/k by applying the entropy estimators from [3] and [4] respectively to the empirical distribution of the length-k blocks. [sent-66, score-1.375]
30 While we can improve the accuracy of these block entropy rate estimates by choosing a better entropy estimator, choosing the block size k remains a challenge. [sent-67, score-1.415]
31 On the other hand, as we consider larger k, limited data leads to underestimates of the entropy rate. [sent-69, score-0.554]
32 See the plots of h_plugin, h_NSB, and h_MM in Figure 2d for an instance of this effect of block size on entropy rate estimates. [sent-70, score-0.835]
33 Unfortunately, the block entropy rate at this plateau may still be biased, and for data sequences that are short with respect to their time dependencies, there may be no discernible plateau at all ([1], Figure 1). [sent-72, score-0.979]
34 2.2 Other Entropy Rate Estimators Not all existing techniques for entropy rate estimation involve an explicit choice of block length. [sent-74, score-0.759]
35 The estimator from [2], for example, parses the full string of symbols in the data by starting from the first symbol, and sequentially removing and counting as a “phrase” the shortest substring that has not yet appeared. [sent-75, score-0.312]
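To make this parsing concrete, the Python sketch below counts phrases in this sequential way and converts the phrase count into a Lempel–Ziv-style entropy rate estimate; the function names and the exact normalization (c log2 c / N) are our assumptions and may differ in detail from the estimator of [2].

```python
import numpy as np

def lz_phrase_count(x):
    """Parse x into phrases: repeatedly remove, from the front of the remaining
    string, the shortest prefix that has not yet appeared as a phrase."""
    phrases = set()
    i, n, count = 0, len(x), 0
    while i < n:
        j = i + 1
        while j <= n and tuple(x[i:j]) in phrases:
            j += 1
        phrases.add(tuple(x[i:j]))
        count += 1
        i = j
    return count

def lz_entropy_rate(x):
    """Entropy rate estimate (bits/symbol) from the phrase count c, using the
    asymptotic LZ78 coding cost of roughly c * log2(c) bits for N symbols."""
    c = lz_phrase_count(x)
    return c * np.log2(c) / len(x)
```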
36 A fixed block-length model like the ones described in the previous section uses the entropy of the distribution of all the blocks of a single fixed length. [sent-77, score-0.814]
37 In the context tree weighting (CTW) framework of [1], the authors instead use a minimum descriptive length criterion to weight different tree topologies, so that terminal nodes within the same tree can correspond to blocks of different lengths. [sent-80, score-0.376]
38 They use this weighting to generate Monte Carlo samples and approximate the integral ∫∫ h(θ) p(θ|T, data) p(T|data) dθ dT, in which T represents the tree topology, and θ represents transition probabilities associated with the terminal nodes of the tree. [sent-81, score-0.442]
39 In our approach, the HDP prior combined with a Markov model of our data will be a key tool in overcoming some of the difficulties of choosing a block-length appropriately for entropy rate estimation. [sent-82, score-0.712]
40 It will allow us to choose a block length that is large enough to capture possibly important long time dependencies, while easing the difficulty of estimating the properties of these long time dependencies from short data. [sent-83, score-0.357]
41 (So that, for example, 110 can transition to state 101 or state 100, but not to any other state). [sent-92, score-0.258]
42 Because only two transitions are possible from each state, the transition probability distribution from each s is completely specified by only one parameter, which we denote g_s, the probability of observing a 1 given the context s. [sent-93, score-0.841]
43 The entropy rate of an ergodic Markov chain with finite state set A is given by h = Σ_{s∈A} p(s) H(x|s) (7), where p(s) is the stationary probability associated with state s, and H(x|s) is the entropy of the distribution of possible transitions from state s. [sent-94, score-1.345]
44 The vector of stationary state probabilities p(s) for all s is computed as a left eigenvector of the transition matrix T: p T = p, with Σ_s p(s) = 1 (8). Since each row of the transition matrix T contains only two non-zero entries, g_s and 1 − g_s, p(s) can be calculated relatively quickly. [sent-95, score-1.686]
45 With Equations 7 and 8, h can be calculated analytically from the vector of all 2^k transition probabilities {g_s}. [sent-96, score-0.391]
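As an illustration of Equations 7 and 8, the Python sketch below (our own code and naming, not the authors') computes h from the 2^k transition probabilities g_s of a binary Markov chain of order k, with states encoded as integers.

```python
import numpy as np

def markov_entropy_rate(g, k):
    """Entropy rate (bits/symbol) of a binary Markov chain of order k.
    g[s] = P(next symbol = 1 | state s), where the integer s encodes the last
    k symbols with the most recent symbol in the least significant bit."""
    g = np.asarray(g, dtype=float)
    n_states = 2 ** k
    T = np.zeros((n_states, n_states))
    for s in range(n_states):
        # Appending symbol x to state s drops the oldest symbol:
        # new state = ((s << 1) | x) mod 2^k  (e.g. 110 -> 101 or 100).
        T[s, ((s << 1) | 1) % n_states] = g[s]
        T[s, ((s << 1) | 0) % n_states] = 1.0 - g[s]
    # Stationary distribution: left eigenvector of T with eigenvalue 1 (Eq. 8).
    evals, evecs = np.linalg.eig(T.T)
    p = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    p = p / p.sum()
    # h = sum_s p(s) H(x|s), with H(x|s) the binary entropy of g[s] (Eq. 7).
    with np.errstate(divide="ignore", invalid="ignore"):
        Hcond = -(g * np.log2(g) + (1 - g) * np.log2(1 - g))
    return float(np.dot(p, np.nan_to_num(Hcond)))   # 0 log 0 := 0
```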
46 A Bayesian estimator of the entropy rate based on a Markov model of order k is given by ĥ_Bayes = ∫ h(g) p(g|data) dg (9), where g = {g_s : |s| = k}, h is the deterministic function of g given by Equations 7 and 8, and p(g|data) ∝ p(data|g) p(g) given some appropriate prior over g. [sent-97, score-0.788]
47 4 Hierarchical Dirichlet Process priors We describe a hierarchical beta prior, a special case of the hierarchical Dirichlet process (HDP), which was presented in [5] and applied to problems of natural language processing in [6] and [7]. [sent-100, score-0.305]
48 The true entropy rate h = lim_{k→∞} H_k/k captures time dependencies of infinite depth. [sent-101, score-0.716]
49 However, it is difficult to estimate transition probabilities for long blocks with short data sequences, so choosing large k may lead to inaccurate posterior estimates for the transition probabilities g. [sent-103, score-0.933]
50 In particular, shorter data sequences may not even have observations of all possible symbol sequences of a given length. [sent-104, score-0.306]
51 That is, the probability of observing a 1 after the context sequence 0011 should be similar to that of seeing a 1 after 011, since it might be reasonable to assume that context symbols from the more distant past matter less. [sent-108, score-0.299]
52 Thus we choose for our prior: g_s | g_{s′} ∼ Beta(α_{|s|} g_{s′}, α_{|s|}(1 − g_{s′})) (10), where s′ denotes the context s with the earliest symbol removed. [sent-109, score-1.768]
53 This choice gives the prior distribution of g_s mean g_{s′}, as desired. [sent-110, score-1.128]
54 We continue constructing the prior with g_{s′} | g_{s′′} ∼ Beta(α_{|s′|} g_{s′′}, α_{|s′|}(1 − g_{s′′})), and so on, until g_[] ∼ Beta(α_0 p_∅, α_0(1 − p_∅)), where g_[] is the probability of a spike given no context information and p_∅ is a hyperparameter reflecting our prior belief about the probability of a spike. [sent-111, score-1.876]
55 A priori, the distribution of each transition probability is centered around the transition probability from a one-symbol-shorter block of symbols. [sent-113, score-0.533]
56 As long as the assumption that more distant contextual symbols matter less actually holds (at least to some degree), this structure allows the sharing of statistical information across different contextual depths. [sent-114, score-0.309]
57 We can obtain reasonable estimates for the transition probabilities from long blocks of symbols, even from data that is so short that we may have few (or no) observations of each of these long blocks of symbols. [sent-115, score-0.6]
58 We could use any number of distributions with mean g_{s′} to center the prior distribution of g_s at g_{s′}; we use Beta distributions because they are conjugate to the likelihood. [sent-116, score-1.67]
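To make the construction of Equation 10 concrete, the sketch below (our own code; the per-level concentrations and p_∅ shown in the example are assumed values, not the paper's settings) forward-samples one draw of all the {g_s} from the hierarchical beta prior, working down the context tree from g_[] to the depth-k leaves.

```python
import itertools
import numpy as np

def sample_hierarchical_beta_prior(k, alphas, p_empty=0.5, rng=None):
    """Draw one sample of {g_s} from the hierarchical beta prior (Eq. 10).
    Contexts are strings of '0'/'1' whose leftmost character is the earliest
    (most distant) symbol; alphas[d] is the concentration used at depth d."""
    rng = np.random.default_rng() if rng is None else rng
    g = {"": rng.beta(alphas[0] * p_empty, alphas[0] * (1.0 - p_empty))}
    for depth in range(1, k + 1):
        for bits in itertools.product("01", repeat=depth):
            s = "".join(bits)
            parent = s[1:]   # the context with its earliest symbol removed
            g[s] = rng.beta(alphas[depth] * g[parent],
                            alphas[depth] * (1.0 - g[parent]))
    return g

# Example (assumed hyperparameters): a depth-5 tree, concentration 10 per level.
gs = sample_hierarchical_beta_prior(k=5, alphas=[10.0] * 6)
```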
59 Thus for a partition into only two sets, the Dirichlet process reduces to a beta distribution, which is why when we specialize the HDP to binary data, we obtain a hierarchical beta distribution. [sent-131, score-0.354]
60 We can calculate the posterior predictive distribution ĝ_pr, which is specified by the 2^k values {ĝ_s = p(1|s) : |s| = k}, by using counts c from the data and performing the above recursive calculation to estimate g_s for each of the 2^k states s. [sent-135, score-0.757]
61 Given the estimated Markov transition probabilities ĝ_pr, we then have an empirical Bayesian entropy rate estimate via Equations 7 and 8. [sent-136, score-1.129]
62 Note that while ĝ_pr is the posterior mean of the transition probabilities, the entropy rate estimator ĥ_empHDP is no longer a fully Bayesian estimate, and is not equivalent to the ĥ_Bayes of Equation 9. [sent-138, score-1.143]
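Since Equation 11 itself is not reproduced in this excerpt, the Python sketch below shows only an assumed, simplified form of such a recursive posterior-predictive smoothing, in which each estimate combines the counts observed after a context with the estimate for the one-symbol-shorter context; it illustrates the idea rather than the paper's exact recursion.

```python
import itertools

def empirical_bayes_transition_probs(x, k, alphas, p_empty=0.5):
    """Assumed sketch of a recursive smoothing estimate of {g_s : |s| = k}:
    g_s mixes the counts observed after context s with the estimate for s[1:]."""
    x = "".join(str(int(b)) for b in x)
    counts = {"".join(s): [0, 0]                  # counts[s] = [#0s, #1s] after s
              for d in range(k + 1)
              for s in itertools.product("01", repeat=d)}
    for i in range(k, len(x)):
        for d in range(k + 1):
            counts[x[i - d:i]][int(x[i])] += 1    # context of length d ending at i
    g = {}
    c0, c1 = counts[""]
    g[""] = (c1 + alphas[0] * p_empty) / (c0 + c1 + alphas[0])
    for d in range(1, k + 1):
        for s_bits in itertools.product("01", repeat=d):
            s = "".join(s_bits)
            c0, c1 = counts[s]
            g[s] = (c1 + alphas[d] * g[s[1:]]) / (c0 + c1 + alphas[d])
    return {s: v for s, v in g.items() if len(s) == k}   # the 2**k leaf values
```

Plugging the resulting leaf values into the Markov-chain entropy rate computation sketched earlier (a leaf context string s maps to the integer state int(s, 2)) gives an empirical Bayes estimate in the spirit of ĥ_empHDP.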
63 Because the parameters of the model have the structure of a tree, each g_s for |s| < k is conditionally independent from all but its immediate ancestor in the tree, g_{s′}, and its two descendants, g_{0s} and g_{1s}. [sent-145, score-1.084]
64 Specifically, p(g_s | g_{s′}, g_{0s}, g_{1s}, α_{|s|}, α_{|s|+1}) ∝ Beta(g_s; α_{|s|} g_{s′}, α_{|s|}(1 − g_{s′})) Beta(g_{0s}; α_{|s|+1} g_s, α_{|s|+1}(1 − g_s)) Beta(g_{1s}; α_{|s|+1} g_s, α_{|s|+1}(1 − g_s)) (13), and we can compute these probabilities on a discrete grid, since they are each one-dimensional, and then sample the posterior of g_s via this grid. [sent-147, score-3.969]
65 For the transition probabilities from the bottom level of the tree, {g_s : |s| = k}, the conjugacy of the beta distributions with the binomial likelihood gives the posterior conditional of g_s a recognizable form: p(g_s | g_{s′}, data) = Beta(α_k g_{s′} + c_{s1}, α_k(1 − g_{s′}) + c_{s0}). [sent-149, score-2.173]
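A minimal sketch of one Gibbs sweep with the α held fixed is given below (grid approximation for the internal nodes as in Equation 13, conjugate beta draws for the leaves); the grid size, function name, and data layout are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def gibbs_sweep(g, counts, k, alphas, p_empty=0.5, rng=None, n_grid=200):
    """One Gibbs sweep over the context tree with fixed alphas.
    g: dict of g_s for every context string s of length 0..k (leftmost
    character = earliest symbol); counts[s] = [#0s, #1s] after each depth-k s."""
    rng = np.random.default_rng() if rng is None else rng
    grid = (np.arange(n_grid) + 0.5) / n_grid
    # Internal nodes (|s| < k): the conditional of Eq. 13 is one-dimensional,
    # so evaluate it on a grid and sample from the normalized grid weights.
    for d in range(k):
        for s in [c for c in g if len(c) == d]:
            m = p_empty if d == 0 else g[s[1:]]        # prior mean from the parent
            logp = beta_dist.logpdf(grid, alphas[d] * m, alphas[d] * (1 - m))
            for child in ("0" + s, "1" + s):           # children prepend an earlier symbol
                logp += beta_dist.logpdf(g[child],
                                         alphas[d + 1] * grid,
                                         alphas[d + 1] * (1 - grid))
            w = np.exp(logp - logp.max())
            g[s] = rng.choice(grid, p=w / w.sum())
    # Leaves (|s| = k): conjugate beta update with the binomial likelihood.
    for s in [c for c in g if len(c) == k]:
        c0, c1 = counts[s]
        m = g[s[1:]]
        g[s] = rng.beta(alphas[k] * m + c1, alphas[k] * (1 - m) + c0)
    return g
```

Averaging h(g) (computed via Equations 7 and 8) over many such sweeps gives a fully Bayesian estimate in the spirit of ĥ_HDP.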
66 This upper bound on α is rather arbitrary, but we verified that increasing the range for α had little effect on the entropy rate estimate, at least for the ranges and block sizes considered. [sent-152, score-0.759]
67 In some applications, the Markov transition probabilities g, and not just the entropy rate, may be of interest as a description of the time dependencies present in the data. [sent-153, score-0.916]
68 The Gibbs sampler above yields samples from the distribution g|data, and averaging these N_MC samples yields a Bayes least squares estimator of the transition probabilities, ĝ_gibbsHDP. [sent-154, score-0.31]
69 Note that this estimate is closely related to the estimate ĝ_pr from the previous section; with more MC samples, ĝ_gibbsHDP converges to the posterior mean ĝ_pr (when the α are fixed rather than sampled, to match the fixed α per level used in Equation 11). [sent-155, score-0.29]
70 7 Results We applied the model to both simulated data with a known entropy rate and to neural data, where the entropy rate is unknown. [sent-156, score-1.354]
71 We examine the accuracy of the fully Bayesian and empirical Bayesian entropy rate estimators ĥ_HDP and ĥ_empHDP, and compare them to the entropy rate estimators ĥ_plugin, ĥ_NSB, ĥ_MM, ĥ_LZ [2], and ĥ_CTW [1], which are described in Section 2. [sent-157, score-1.618]
72 We also consider estimates of the Markov transition probabilities g produced by both inference methods. [sent-158, score-0.375]
73 Figure 2: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for data simulated from a Markov model of depth 5. [sent-175, score-1.008]
74 In (b,c,d) results were averaged over 5 data sequences, and (c) plots the average absolute value of the difference between true and estimated entropy rates. [sent-178, score-0.554]
75 We considered data simulated from a Markov model with transition probabilities set so that transition probabilities from states with similar suffixes are similar [sent-180, score-0.540]
76 (i.e. the property that more distant context symbols matter less than more recent ones in determining transitions). [sent-183, score-0.298]
77 We used a depth-5 Markov model, whose true transition probabilities are shown in black in Figure 2a. [sent-184, score-0.339]
78 In Figure 2a we compare HDP estimates of transition probabilities for this simulated data to the plugin estimator of transition probabilities, ĝ_s = c_{s1}/c_s, calculated from a 500-symbol sequence. [sent-186, score-1.556]
79 (The other estimators do not include calculating transition probabilities as an intermediate step, and so cannot be included here.) [sent-187, score-0.454]
80 With a series of 500 symbols, we do not expect enough observations of each possible transition to adequately estimate the 2^k transition probabilities, even for rather modest depths such as k = 5. [sent-188, score-0.296]
81 And indeed, the “plugin” estimates of transition probabilities do not match the true transition probabilities well. [sent-189, score-0.714]
82 On the other hand, the transition probabilities estimated using the HDP prior show the kind of “smoothing” the prior was meant to encourage, where states corresponding to contexts with same suffixes have similar estimated transition probabilities. [sent-190, score-0.733]
83 Lastly, we plot the convergence of the entropy rate estimators with increased length of the data sequence and the associated error in Figures 2b,c. [sent-191, score-0.854]
84 We see in Figure 2c that the HDP-based entropy rate estimates converge quickly with increasing data, relative to other models. [sent-193, score-0.678]
85 The motivation of the hierarchical prior was to allow observations of transitions from shorter contexts to inform estimates of transitions from longer contexts. [sent-194, score-0.402]
86 This, it was hoped, would mitigate the drop-off with larger block-size seen in block-entropy based entropy rate estimators. [sent-195, score-0.642]
87 Figure 2d indicates that for simulated data that is indeed the case, although we do see some bias in the fully Bayesian entropy rate estimator for large block lengths. [sent-196, score-0.959]
88 The empirical Bayes and fully Bayesian entropy rate estimators with HDP priors produce estimates that are close to the true entropy rate across a wider range of block sizes. [sent-197, score-1.493]
89 7.2 Neural Data We applied the same analysis to neural spike train data collected from primate retinal ganglion cells stimulated with binary full-field movies refreshed at 100 Hz [8]. [sent-199, score-0.268]
90 In this case, the true transition probabilities are unknown (and indeed the process may not be exactly Markovian). [sent-200, score-0.366]
91 Thus the estimated transition probabilities reflect the idea that more distant context cues matter less, and the smoothing of the HDP prior appears to be appropriate for this neural data. [sent-202, score-0.51]
92 Figure 3: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for neural data. [sent-220, score-0.875]
93 The true entropy rate is also unknown, but again we estimate it using the plugin estimator on a large data set. [sent-225, score-0.900]
94 Note the long plateau of the estimators in Figure 3d, indicating the relative stability of the HDP entropy [sent-227, score-0.726]
95 rate estimators with respect to choice of model depth. [sent-228, score-0.254]
96 We have presented estimators of the entropy rate of a spike train or arbitrary binary sequence. [sent-232, score-0.803]
97 The true entropy rate of a stochastic process involves consideration of infinitely long time dependencies. [sent-234, score-0.699]
98 To make entropy rate estimation tractable, one can try to fix a maximum depth of time dependence. [sent-236, score-0.705]
99 We do this by modeling the data as a Markov chain and estimating transition probabilities using a hierarchical prior that links transition probabilities from longer contexts to transition probabilities from shorter contexts. [sent-242, score-1.318]
100 Both methods of entropy rate estimation also yield estimates of the transition probabilities when the data is modeled as a Markov chain, parameters which may be of interest in their own right as descriptive of the statistical structure and time dependencies in a spike train. [sent-245, score-1.245]
wordName wordTfidf (topN-words)
[('gs', 0.542), ('entropy', 0.503), ('hdp', 0.219), ('transition', 0.208), ('symbols', 0.162), ('rate', 0.139), ('probabilities', 0.131), ('hemphdp', 0.121), ('beta', 0.118), ('block', 0.117), ('estimators', 0.115), ('symbol', 0.107), ('spike', 0.105), ('estimator', 0.102), ('plugin', 0.098), ('gpr', 0.091), ('hhdp', 0.086), ('markov', 0.084), ('hplugin', 0.076), ('dependencies', 0.074), ('length', 0.071), ('chain', 0.069), ('hbayes', 0.069), ('hlz', 0.069), ('hierarchical', 0.065), ('bayesian', 0.064), ('sequences', 0.063), ('depth', 0.063), ('dirichlet', 0.062), ('hk', 0.057), ('transitions', 0.056), ('blocks', 0.052), ('ctw', 0.052), ('hm', 0.049), ('hn', 0.049), ('contexts', 0.048), ('plateau', 0.048), ('shorter', 0.047), ('tree', 0.046), ('nsb', 0.046), ('posterior', 0.044), ('prior', 0.044), ('simulated', 0.044), ('shafee', 0.042), ('nemenman', 0.04), ('sb', 0.039), ('distant', 0.039), ('hierarchy', 0.037), ('estimates', 0.036), ('dp', 0.036), ('context', 0.035), ('short', 0.035), ('terminal', 0.035), ('retinal', 0.035), ('emphdp', 0.034), ('ggibbshdp', 0.034), ('hctw', 0.034), ('nm', 0.034), ('ci', 0.033), ('whye', 0.032), ('yee', 0.032), ('estimate', 0.032), ('lz', 0.03), ('madow', 0.03), ('uzzell', 0.03), ('converged', 0.03), ('calculated', 0.03), ('priors', 0.03), ('long', 0.03), ('train', 0.03), ('bayes', 0.029), ('air', 0.029), ('longer', 0.028), ('matter', 0.028), ('fully', 0.028), ('process', 0.027), ('trains', 0.026), ('miller', 0.026), ('linguistics', 0.026), ('data', 0.026), ('binary', 0.026), ('state', 0.025), ('contextual', 0.025), ('underestimates', 0.025), ('bialek', 0.025), ('cs', 0.025), ('xi', 0.025), ('estimated', 0.025), ('alphabet', 0.024), ('primate', 0.023), ('ganglion', 0.023), ('descriptive', 0.023), ('calculate', 0.022), ('mc', 0.022), ('hyperparameter', 0.022), ('weighting', 0.022), ('analytically', 0.022), ('inform', 0.022), ('hmm', 0.022), ('string', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
2 0.26617533 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow
Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneouslyrecorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.
3 0.17504108 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
Author: Isik B. Fidaner, Taylan Cemgil
Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.
4 0.11492986 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis
Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that— particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1
5 0.10058603 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
Author: Paul Valiant, Gregory Valiant
Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1
6 0.097255372 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
7 0.091771483 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
8 0.090463474 201 nips-2013-Multi-Task Bayesian Optimization
9 0.08697056 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
10 0.084662579 173 nips-2013-Least Informative Dimensions
11 0.080178872 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
12 0.079883307 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
13 0.077420495 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
14 0.077198289 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
15 0.073632762 66 nips-2013-Computing the Stationary Distribution Locally
16 0.073475495 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
17 0.073199004 299 nips-2013-Solving inverse problem of Markov chain with partial observations
18 0.071283557 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
19 0.07088729 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
20 0.070628256 149 nips-2013-Latent Structured Active Learning
topicId topicWeight
[(0, 0.169), (1, 0.025), (2, -0.057), (3, -0.013), (4, -0.099), (5, 0.082), (6, 0.115), (7, -0.025), (8, 0.058), (9, 0.028), (10, -0.017), (11, -0.009), (12, -0.022), (13, 0.003), (14, 0.047), (15, -0.04), (16, 0.084), (17, 0.043), (18, -0.034), (19, 0.079), (20, -0.056), (21, 0.074), (22, -0.133), (23, -0.181), (24, 0.049), (25, 0.105), (26, -0.043), (27, -0.1), (28, 0.004), (29, 0.044), (30, 0.046), (31, 0.251), (32, -0.041), (33, -0.088), (34, 0.082), (35, 0.097), (36, -0.1), (37, 0.133), (38, 0.001), (39, 0.035), (40, 0.115), (41, 0.061), (42, 0.048), (43, -0.067), (44, -0.002), (45, 0.012), (46, -0.106), (47, 0.068), (48, 0.022), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.97246617 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
2 0.83769971 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow
Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneouslyrecorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.
3 0.76307309 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
Author: Isik B. Fidaner, Taylan Cemgil
Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.
4 0.63544881 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
Author: Il M. Park, Evan W. Archer, Kenneth Latimer, Jonathan W. Pillow
Abstract: Probabilistic models for binary spike patterns provide a powerful tool for understanding the statistical dependencies in large-scale neural recordings. Maximum entropy (or “maxent”) models, which seek to explain dependencies in terms of low-order interactions between neurons, have enjoyed remarkable success in modeling such patterns, particularly for small groups of neurons. However, these models are computationally intractable for large populations, and low-order maxent models have been shown to be inadequate for some datasets. To overcome these limitations, we propose a family of “universal” models for binary spike patterns, where universality refers to the ability to model arbitrary distributions over all 2m binary patterns. We construct universal models using a Dirichlet process centered on a well-behaved parametric base measure, which naturally combines the flexibility of a histogram and the parsimony of a parametric model. We derive computationally efficient inference methods using Bernoulli and cascaded logistic base measures, which scale tractably to large populations. We also establish a condition for equivalence between the cascaded logistic and the 2nd-order maxent or “Ising” model, making cascaded logistic a reasonable choice for base measure in a universal model. We illustrate the performance of these models using neural data. 1
5 0.56716514 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin
Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1
6 0.56440061 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
7 0.5182718 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
8 0.47153842 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
9 0.46816668 299 nips-2013-Solving inverse problem of Markov chain with partial observations
10 0.44895813 173 nips-2013-Least Informative Dimensions
11 0.4210555 246 nips-2013-Perfect Associative Learning with Spike-Timing-Dependent Plasticity
12 0.40326864 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
13 0.39109236 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
14 0.38870713 46 nips-2013-Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
15 0.38826889 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
16 0.38281661 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
17 0.3808732 36 nips-2013-Annealing between distributions by averaging moments
18 0.3668009 201 nips-2013-Multi-Task Bayesian Optimization
19 0.36157432 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
20 0.35013929 252 nips-2013-Predictive PAC Learning and Process Decompositions
topicId topicWeight
[(2, 0.052), (16, 0.044), (33, 0.143), (34, 0.117), (36, 0.253), (41, 0.014), (49, 0.028), (56, 0.099), (70, 0.018), (85, 0.029), (89, 0.063), (93, 0.049)]
simIndex simValue paperId paperTitle
1 0.84395701 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
same-paper 2 0.82058585 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
3 0.79387558 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
4 0.76852 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
5 0.74989897 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
6 0.6873219 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
7 0.67952549 233 nips-2013-Online Robust PCA via Stochastic Optimization
8 0.67232329 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
9 0.66893721 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
10 0.66806024 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
11 0.66630244 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
12 0.66343814 232 nips-2013-Online PCA for Contaminated Data
13 0.6632328 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
14 0.66289991 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
15 0.66189641 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
16 0.65979278 173 nips-2013-Least Informative Dimensions
17 0.6596911 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
18 0.65903187 324 nips-2013-The Fast Convergence of Incremental PCA
19 0.65870315 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
20 0.65866441 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations