nips nips2006 nips2006-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak
Abstract: We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. [sent-13, score-0.555]
2 This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. [sent-14, score-0.232]
3 Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. [sent-15, score-0.51]
4 However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. [sent-16, score-0.214]
5 Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. [sent-18, score-0.226]
6 We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. [sent-19, score-0.419]
7 Inferring the structure of signalling networks from experimental data precedes any such analysis and is thus a basic and fundamental task. [sent-21, score-0.125]
8 Measurements which directly reveal network structure are often beyond experimental capabilities or are excessively expensive. [sent-22, score-0.23]
9 This paper addresses the problem of inferring the structure of a network from co-occurrence data: observations which indicate nodes that are activated in each of a set of signaling pathways but do not directly reveal the order of nodes within each pathway. [sent-23, score-0.685]
10 Co-occurrence observations arise naturally in a number of interesting contexts, including biological and communication networks, and networks of neuronal colonies. [sent-24, score-0.172]
11 Biological signal transduction networks describe fundamental cell functions and responses to environmental stress [1]. [sent-25, score-0.245]
12 Highthroughput measurement techniques such as microarrays have successfully been used to identify the components of different signal transduction pathways [2]. [sent-27, score-0.224]
13 Developing computational techniques for inferring pathway orders is a largely unexplored research area [3]. [sent-29, score-0.129]
14 In this context, each path corresponds to a transmission between an origin and destination. [sent-31, score-0.19]
15 The origin and destination are observed, in addition to the activated switches/routers carrying the transmission through the network. [sent-32, score-0.143]
16 Treating distinct brain regions as nodes in a functional brain network that co-activate when a subject performs different tasks may lead to a similar network inference problem. [sent-36, score-0.441]
17 Given a collection of co-occurrences, a feasible network (consistent with the observations) is easily obtained by assigning an order to the elements of each co-occurrence, thereby specifying a path through the hypothesized network. [sent-37, score-0.464]
18 Since any arbitrary order of each co-occurrence leads to a feasible network, the number of feasible solutions is proportional to the number of permutations of all the co-occurrence observations. [sent-38, score-0.457]
19 Consequently we are faced with combinatorial explosion of the feasible set, and without additional assumptions or side information there is no reason to prefer one particular feasible network over the others. [sent-39, score-0.435]
20 Despite the apparent intractability of the problem, physical principles governing most networks suggest that not all feasible solutions are equally plausible. [sent-41, score-0.269]
21 This intuition has been used as a stepping stone by recent approaches proposed in the context of telecommunications [4], and in learning networks of collaborators [8]. [sent-43, score-0.091]
22 The random permutation accounts for lack of observed order. [sent-46, score-0.125]
23 In this framework, network inference amounts to maximum likelihood estimation of the parameters governing the random walk (initial state distribution and transition matrix). [sent-48, score-0.292]
24 Direct maximization is intractable due to the highly non-convex log-likelihood function and exponential feasible set arising from simultaneously considering all permutations of all co-occurrences. [sent-49, score-0.414]
25 Instead, we derive a computationally efficient EM algorithm, treating the random permutations as hidden variables. [sent-50, score-0.26]
26 In this framework the likelihood factorizes with respect to each pathway/observation, so that the computational complexity of the EM algorithm is determined by the E-step which is only exponential in the longest path. [sent-51, score-0.116]
27 In order to handle networks with long paths, we propose a Monte Carlo E-step based on a simple, linear complexity importance sampling scheme. [sent-52, score-0.263]
28 Whereas the exact E-step has computational complexity which is exponential in path length, we prove that a polynomial number of importance samples suffices to retain desirable convergence properties of the EM algorithm with high probability. [sent-53, score-0.527]
29 It is worth noting that the approach described here differs considerably from that of learning the structure of a directed graphical model or Bayesian network [9, 10]. [sent-55, score-0.195]
30 The aim of graphical modelling is to find a graph corresponding to a factorization of a high-dimensional distribution which predicts the observations well. [sent-56, score-0.112]
31 These probabilistic models do not directly reflect physical structures, and applying such an approach to co-occurrences would ignore physical constraints inherent to the observations: co-occurring vertices must lie along a path in the network. [sent-57, score-0.306]
32 1 Model Formulation and EM Algorithm The Shuffled Markov Model We model a network as a directed graph G = (V, E), where V = {1, . [sent-59, score-0.196]
33 An observation, y ⊂ V , is a subset of vertices co-activated when a particular stimulus is applied to the network (e. [sent-63, score-0.214]
34 , collection of signaling proteins activated in response to an environmental stress). [sent-65, score-0.225]
35 , yNm }, we say that a graph (V, E) is feasible w. [sent-72, score-0.154]
36 Y if for each y(m) ∈ Y there is an ordered path (m) (m) (m) (m) (m) (m) z(m) = (z1 , . [sent-75, score-0.147]
37 , τNm ) such that zt = y (m) , and τt (zt−1 , zt ) ∈ E, for t = 2, . [sent-81, score-0.126]
38 Each observation y(m) results from shuffling the elements of z(m) via an unobserved permutation τ (m) , (m) (m) drawn uniformly from SNm (the set of all permutations of Nm objects); i. [sent-93, score-0.38]
39 Under this model, the log-likelihood of the set of observations Y is T log P [Y|A, π] = P [y(m) |τ , A, π] − log(Nm ! [sent-100, score-0.128]
40 log (1) τ ∈SNm m=1 N where P [y|τ , A, π] = πyτ1 t=2 Ayτt−1 ,yτt , and network inference consists in computing the maximum likelihood (ML) estimates (AML , π ML ) = arg maxA,π log P [Y|A, π]. [sent-102, score-0.255]
41 With the ML estimates in hand, we may determine the most likely permutation for each y(m) and obtain a feasible reconstruction from the ordered paths. [sent-103, score-0.287]
42 Next, we derive an EM algorithm for this purpose, by treating the permutations as missing data. [sent-105, score-0.26]
43 , r(T ) } be the collection of permutation matrices corresponding to (m) (m) = t ). [sent-130, score-0.162]
44 The EM algorithm proceeds by (the E-step) computing Q A, π; Ak , π k = E log P [X , R|A, π] X , Ak , π k , the expected value of log P [X , R|A, π] w. [sent-137, score-0.102]
45 (m) Since the permutations are (a priori) equiprobable, we have P [r(m) ] = (Nm ! [sent-150, score-0.219]
46 A and π, under the normalization constraints, leads to the M-step: Ak+1 = i,j (m) (m) (m) T Nm ¯ m=1 t ,t =1 αt ,t xt ,i xt ,j (m) (m) (m) |S| T Nm ¯ j=1 m=1 t ,t =1 αt ,t xt ,i xt ,j and k+1 πi = (m) (m) T Nm ¯ m=1 t =1 r1,t xt ,i . [sent-166, score-0.44]
47 (m) (m) |S| T Nm ¯ i=1 m=1 t =1 r1,t xt ,i (5) Standard convergence results for the EM algorithm due to Boyles and Wu [11,12] guarantee that the sequence {(Ak , π k )} converges monotonically to a local maximum of the likelihood. [sent-167, score-0.135]
48 3 Handling Known Endpoints In some applications, (one or both of) the endpoints of each path are known and only the internal nodes are shuffled. [sent-169, score-0.278]
49 For example, in telecommunications problems, the origin and destination of each transmission are known, but not the network connectivity. [sent-170, score-0.282]
50 In estimating biological signal transduction pathways, a physical stimulus (e. [sent-171, score-0.243]
51 Knowledge of the endpoints of each path imposes the constraints (m) (m) r1,1 = 1 and rNm ,Nm = 1. [sent-176, score-0.231]
52 In this setup, the E-step has a similar form as (4) but with sums over r replaced by sums over permutation matrices satisfying r1,1 = 1 and rN,N = 1. [sent-179, score-0.125]
53 3 Large Scale Inference via Importance Sampling For long paths, the combinatorial nature of the exact E-step – summing over all permutations of each sequence in (3) and (4) – may render exact computation intractable. [sent-181, score-0.325]
54 This section presents a Monte Carlo importance sampling (see, e. [sent-182, score-0.179]
55 , [13]) version of the E-step, along with finite sample bounds guaranteeing that a polynomial complexity Monte Carlo EM algorithm retains desirable convergence properties of the EM algorithm; i. [sent-184, score-0.164]
56 Moreover, since the statistics αt ,t and r1,t depend only ¯ ¯ on the mth co-activation observation, y(m) , we focus on a particular length-N path observation y = (y1 , y2 , . [sent-189, score-0.25]
57 A na¨ve Monte Carlo approximation would be based on random permutations sampled from the ı uniform distribution on SN . [sent-193, score-0.219]
58 Instead, we propose the following sequential scheme for sampling a permutation using the current parameter estimates, (A, π). [sent-197, score-0.187]
59 (6) Our sampling scheme proceeds as follows: Step 1: Initialize f so that fi = 1 if yt = i for some t = 1, . [sent-211, score-0.15]
60 Set fv = 0 to prevent yt from being sampled again (ensure τ is a permutation). [sent-218, score-0.101]
61 Repeating this sampling procedure L times yields a collection of iid permutations τ 1 , τ 2 , . [sent-226, score-0.318]
62 , τ L , where the superscript now identifies the sample number; the corresponding permutation matrices are r1 , r2 , . [sent-229, score-0.125]
63 (8) t A detailed derivation of the exact form of the induced distribution, R, and the correction factor, u , based on the sequential nature of the sampling scheme, along with further discussion and comparison with alternative sampling schemes can be found in the supplementary document [7]. [sent-235, score-0.267]
64 2 Monotonicity and Convergence Standard EM convergence results directly apply when the exact E-step is used [11, 12]. [sent-238, score-0.1]
65 Together with the fact that the marginal log-likelihood (1) is continuous in θ and bounded above, the monotonicity property guarantees that the exact EM iterates converge monotonically to a local maximum of log P [Y|θ]. [sent-241, score-0.181]
66 To assure the Monte Carlo EM algorithm (MCEM) converges, the number of importance samples, L, must be chosen carefully so that Q approximates Q well enough; otherwise the MCEM may be swamped with error. [sent-243, score-0.117]
67 Because Q(θ ; θ ) involves terms k k log Ak and log πi , in practice we bound Ak and πi away from zero to ensure that Q does not i,j i,j k blow up. [sent-248, score-0.102]
68 There exist finite constants bm > 0, independent of Nm , so that if Lm = 4 2b2 T 2 Nm | log θmin |2 m 2 log 2 2Nm 1 − (1 − δ)1/T importance samples are used for the mth observation, then Q(θ probability greater than 1 − δ. [sent-251, score-0.331]
69 First, we derive finite sample concentration-style bounds for (m) (m) the importance sample estimates showing, e. [sent-253, score-0.16]
70 , that αt ,t converges to αt ,t at a rate which is ¯ exponential in the number of importance samples used. [sent-255, score-0.201]
71 These bounds are based on rather novel concentration inequalities for importance sampling estimators, which may be of interest in their own right (see the supplementary document [7] for details). [sent-256, score-0.269]
72 , (0, δ)-PAM, not approximately monotonic) if Lm importance samples are used for the mth observation, where Lm also depends polynomially on Nm and T . [sent-260, score-0.229]
73 The bound above stipulates that the number of importance samples required for a 2 4 PAM update is on the order of Nm log Nm . [sent-264, score-0.213]
74 Generating one importance sample using the sequential procedure described above requires Nm operations. [sent-265, score-0.117]
75 4 Experimental Results The performance of our algorithm for network inference from co-occurrences (NICO, pronounced “nee-koh”) has been evaluated on both simulated data and on a biological data set. [sent-267, score-0.207]
76 In these experiments, network structure is inferred by first executing the EM algorithm to infer the parameters (A, π) of a Markov chain. [sent-268, score-0.262]
77 Then, inserting edges in the inferred graph based on the most likely order of each path according to (A, π) ensures the resulting graph is feasible with respect to the observations. [sent-269, score-0.44]
78 To gauge the performance of our algorithm we use the edge symmetric difference error: the total number of false positives (edges in the inferred network which do not exist in the true network) plus the number of false negatives (edges in the true network not appearing in the inferred network). [sent-271, score-0.504]
79 We keep the number of origins fixed at 5 and vary the number of destinations between 5 and 40 to see how the number of observations effects performance. [sent-275, score-0.161]
80 Figure 1 plots the edge error for synthetic data generated using (a) shortest path routing, and (b) random routing. [sent-277, score-0.248]
81 Each curve is the average performance over 100 different network and path real- 7 5 4 3 2 1 0 5 Freq. [sent-278, score-0.308]
82 Destinations 30 (a) Shortest path routes 35 40 0 5 10 15 20 25 Num. [sent-283, score-0.185]
83 Destinations 30 35 40 (b) Random routes Figure 1: Edge symmetric differences between inferred networks and the network one would obtain using co-occurrence measurements arranged in the correct order. [sent-284, score-0.315]
84 We then choose the NICO solution yielding the largest likelihood, and compare with both the sparsest (fewest edges) and clairvoyant best (lowest error) FM solution. [sent-287, score-0.126]
85 Exact E-step calculation is used for observations with Nm ≤ 12, and importance sampling is used for longer paths. [sent-290, score-0.256]
86 The FM uses simple pairwise frequencies of co-occurrence to assign an order independently to each path observation. [sent-292, score-0.147]
87 We plot FM performance for two schemes; one based on choosing the sparsest FM solution (the one with the fewest edges), and one based on clairvoyantly choosing the FM solution with lowest error. [sent-295, score-0.122]
88 Our method has also been applied to infer the stress-activated protein kinease (SAPK)/Jun N terminal kinase (JNK) and NFκB signal transduction pathways1 (biological networks). [sent-297, score-0.144]
89 The clustering procedure described in [2] is applied to microarray data in order to identify 18 co-occurrences arising from different environmental stresses or growth factors (path source) and terminating in the production of SAPK/JNK or NFκB proteins. [sent-298, score-0.219]
90 The reconstructed network (combined SAPK/JNK and NFκB signal transduction pathways) is depicted in Figure 2. [sent-299, score-0.305]
91 This structure agrees with the signalling pathways identified using traditional experimental techniques which test individually for each possible edge (e. [sent-300, score-0.204]
92 5 Conclusion This paper describes a probabilistic model and statistical inference procedure for inferring network structure from incomplete “co-occurrence” measurements. [sent-305, score-0.263]
93 We describe exact and Monte Carlo EM algorithms for calculating maximum likelihood estimates of the Markov chain parameters (initial state distribution and transition matrix), treating the random permutations as hidden variables. [sent-307, score-0.405]
94 Although our exact EM algorithm has exponential computational complexity, we provide finite-sample bounds guaranteeing convergence of the Monte Carlo EM variation to a local maximum with high probability and with only polynomial complexity. [sent-309, score-0.221]
95 The NFκB pathway is a collection of paths activated by various environmental stresses and growth factors, and terminating in the production of NFκB. [sent-312, score-0.361]
96 Co-occurrences are obtained from gene expression data via the clustering algorithm described in [2], and then network is inferred using NICO. [sent-314, score-0.228]
97 Hero for providing the data and collaborating on the biological network experiment reported in Section 4. [sent-318, score-0.207]
98 A computational approach for ordering signal transduction pathway components from genomics and proteomics data. [sent-340, score-0.239]
99 Understanding the topology of a telephone network via internally-sensed network tomography. [sent-351, score-0.322]
100 Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks. [sent-395, score-0.195]
wordName wordTfidf (topN-words)
[('nm', 0.529), ('permutations', 0.219), ('ak', 0.219), ('em', 0.202), ('nico', 0.194), ('network', 0.161), ('path', 0.147), ('fm', 0.138), ('nf', 0.134), ('carlo', 0.131), ('monte', 0.131), ('permutation', 0.125), ('mcem', 0.121), ('feasible', 0.119), ('importance', 0.117), ('transduction', 0.103), ('rabbat', 0.097), ('xt', 0.088), ('shuf', 0.084), ('destinations', 0.084), ('endpoints', 0.084), ('sparsest', 0.084), ('pathways', 0.08), ('observations', 0.077), ('monotonicity', 0.077), ('pam', 0.073), ('signaling', 0.072), ('inferring', 0.068), ('mth', 0.067), ('inferred', 0.067), ('monotonic', 0.064), ('activated', 0.064), ('ml', 0.063), ('rt', 0.063), ('zt', 0.063), ('sampling', 0.062), ('pathway', 0.061), ('paths', 0.061), ('av', 0.058), ('exact', 0.053), ('physical', 0.053), ('shortest', 0.053), ('yt', 0.053), ('vertices', 0.053), ('environmental', 0.052), ('supplementary', 0.052), ('log', 0.051), ('transition', 0.049), ('networks', 0.049), ('aml', 0.048), ('caffo', 0.048), ('fv', 0.048), ('governing', 0.048), ('jnk', 0.048), ('snm', 0.048), ('sporns', 0.048), ('stresses', 0.048), ('edge', 0.048), ('lm', 0.048), ('convergence', 0.047), ('nodes', 0.047), ('biological', 0.046), ('samples', 0.045), ('polynomial', 0.044), ('microarray', 0.044), ('estimates', 0.043), ('transmission', 0.043), ('longest', 0.042), ('instituto', 0.042), ('telecommunications', 0.042), ('networked', 0.042), ('clairvoyant', 0.042), ('signalling', 0.042), ('treating', 0.041), ('signal', 0.041), ('exponential', 0.039), ('markov', 0.039), ('initializations', 0.038), ('terminating', 0.038), ('routes', 0.038), ('fewest', 0.038), ('guaranteeing', 0.038), ('nowak', 0.038), ('hero', 0.038), ('document', 0.038), ('edges', 0.037), ('collection', 0.037), ('sn', 0.037), ('arising', 0.037), ('brain', 0.036), ('observation', 0.036), ('destination', 0.036), ('explosion', 0.036), ('graph', 0.035), ('complexity', 0.035), ('fi', 0.035), ('reveal', 0.035), ('structure', 0.034), ('ordering', 0.034), ('walk', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 98 nips-2006-Inferring Network Structure from Co-Occurrences
Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak
Abstract: We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. 1
2 0.11587863 57 nips-2006-Conditional mean field
Author: Peter Carbonetto, Nando D. Freitas
Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1
3 0.11356084 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
Author: Jeremy Lewi, Robert Butera, Liam Paninski
Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).
4 0.11340296 21 nips-2006-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1
5 0.097973228 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
Author: Su-in Lee, Varun Ganapathi, Daphne Koller
Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
6 0.092788316 14 nips-2006-A Small World Threshold for Economic Network Formation
7 0.090406895 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
8 0.078911796 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
9 0.075747289 69 nips-2006-Distributed Inference in Dynamical Systems
10 0.075123101 203 nips-2006-implicit Online Learning with Kernels
11 0.070497133 113 nips-2006-Learning Structural Equation Models for fMRI
12 0.069583587 77 nips-2006-Fast Computation of Graph Kernels
13 0.06950061 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
14 0.066211537 194 nips-2006-Towards a general independent subspace analysis
15 0.064297237 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
16 0.061284967 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
17 0.059783082 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
18 0.059619941 158 nips-2006-PG-means: learning the number of clusters in data
19 0.058840085 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
20 0.05881593 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
topicId topicWeight
[(0, -0.221), (1, -0.004), (2, -0.076), (3, -0.012), (4, 0.031), (5, -0.028), (6, 0.127), (7, -0.034), (8, -0.098), (9, -0.069), (10, 0.017), (11, -0.035), (12, 0.041), (13, 0.056), (14, -0.119), (15, -0.033), (16, 0.02), (17, -0.026), (18, 0.06), (19, 0.071), (20, 0.027), (21, 0.081), (22, 0.093), (23, -0.021), (24, 0.03), (25, -0.047), (26, 0.044), (27, -0.016), (28, -0.047), (29, 0.135), (30, -0.041), (31, -0.117), (32, 0.022), (33, 0.09), (34, 0.084), (35, -0.078), (36, 0.036), (37, 0.036), (38, 0.056), (39, -0.037), (40, -0.165), (41, 0.108), (42, 0.069), (43, -0.078), (44, -0.012), (45, -0.028), (46, -0.104), (47, 0.008), (48, -0.033), (49, -0.128)]
simIndex simValue paperId paperTitle
same-paper 1 0.94770777 98 nips-2006-Inferring Network Structure from Co-Occurrences
Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak
Abstract: We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. 1
2 0.58443952 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
Author: Joseph M. Kimmel, Richard M. Salter, Peter J. Thomas
Abstract: Chemical reaction networks by which individual cells gather and process information about their chemical environments have been dubbed “signal transduction” networks. Despite this suggestive terminology, there have been few attempts to analyze chemical signaling systems with the quantitative tools of information theory. Gradient sensing in the social amoeba Dictyostelium discoideum is a well characterized signal transduction system in which a cell estimates the direction of a source of diffusing chemoattractant molecules based on the spatiotemporal sequence of ligand-receptor binding events at the cell membrane. Using Monte Carlo techniques (MCell) we construct a simulation in which a collection of individual ligand particles undergoing Brownian diffusion in a three-dimensional volume interact with receptors on the surface of a static amoeboid cell. Adapting a method for estimation of spike train entropies described by Victor (originally due to Kozachenko and Leonenko), we estimate lower bounds on the mutual information between the transmitted signal (direction of ligand source) and the received signal (spatiotemporal pattern of receptor binding/unbinding events). Hence we provide a quantitative framework for addressing the question: how much could the cell know, and when could it know it? We show that the time course of the mutual information between the cell’s surface receptors and the (unknown) gradient direction is consistent with experimentally measured cellular response times. We find that the acquisition of directional information depends strongly on the time constant at which the intracellular response is filtered. 1 Introduction: gradient sensing in eukaryotes Biochemical signal transduction networks provide the computational machinery by which neurons, amoebae or other single cells sense and react to their chemical environments. The precision of this chemical sensing is limited by fluctuations inherent in reaction and diffusion processes involving a ∗ Current address: Computational Neuroscience Graduate Program, The University of Chicago. Oberlin Center for Computation and Modeling, http://occam.oberlin.edu/. ‡ To whom correspondence should be addressed. http://www.case.edu/artsci/math/thomas/thomas.html; Oberlin College Research Associate. † finite quantity of molecules [1, 2]. The theory of communication provides a framework that makes explicit the noise dependence of chemical signaling. For example, in any reaction A + B → C, we may view the time varying reactant concentrations A(t) and B(t) as input signals to a noisy channel, and the product concentration C(t) as an output signal carrying information about A(t) and B(t). In the present study we show that the mutual information between the (known) state of the cell’s surface receptors and the (unknown) gradient direction follows a time course consistent with experimentally measured cellular response times, reinforcing earlier claims that information theory can play a role in understanding biochemical cellular communication [3, 4]. Dictyostelium is a soil dwelling amoeba that aggregates into a multicellular form in order to survive conditions of drought or starvation. During aggregation individual amoebae perform chemotaxis, or chemically guided movement, towards sources of the signaling molecule cAMP, secreted by nearby amoebae. Quantitive studies have shown that Dictyostelium amoebae can sense shallow, static gradients of cAMP over long time scales (∼30 minutes), and that gradient steepness plays a crucial role in guiding cells [5]. The chemotactic efficiency (CE), the population average of the cosine between the cell displacement directions and the true gradient direction, peaks at a cAMP concentration of 25 nanoMolar, similar to the equilibrium constant for the cAMP receptor (the Keq is the concentration of cAMP at which the receptor has a 50% chance of being bound or unbound, respectively). For smaller or larger concentrations the CE dropped rapidly. Nevertheless over long times cells were able (on average) to detect gradients as small as 2% change in [cAMP] per cell length. At an early stage of development when the pattern of chemotactic centers and spirals is still forming, individual amoebae presumably experience an inchoate barrage of weak, noisy and conflicting directional signals. When cAMP binds receptors on a cell’s surface, second messengers trigger a chain of subsequent intracellular events including a rapid spatial reorganization of proteins involved in cell motility. Advances in fluorescence microscopy have revealed that the oriented subcellular response to cAMP stimulation is already well underway within two seconds [6, 7]. In order to understand the fundamental limits to communication in this cell signaling process we abstract the problem faced by a cell to that of rapidly identifying the direction of origin of a stimulus gradient superimposed on an existing mean background concentration. We model gradient sensing as an information channel in which an input signal – the direction of a chemical source – is noisily transmitted via a gradient of diffusing signaling molecules; and the “received signal” is the spatiotemporal pattern of binding events between cAMP and the cAMP receptors [8]. We neglect downstream intracellular events, which cannot increase the mutual information between the state of the cell and the direction of the imposed extracellular gradient [9]. The analysis of any signal transmission system depends on precise representation of the noise corrupting transmitted signals. We develop a Monte Carlo simulation (MCell, [10, 11]) in which a simulated cell is exposed to a cAMP distribution that evolves from a uniform background to a gradient at low (1 nMol) average concentration. The noise inherent in the communication of a diffusionmediated signal is accurately represented by this method. Our approach bridges both the transient and the steady state regimes and allows us to estimate the amount of stimulus-related information that is in principle available to the cell through its receptors as a function of time after stimulus initiation. Other efforts to address aspects of cell signaling using the conceptual tools of information theory have considered neurotransmitter release [3] and sensing temporal signals [4], but not gradient sensing in eukaryotic cells. A typical natural habitat for social amoebae such as Dictyostelium is the complex anisotropic threedimensional matrix of the forest floor. Under experimental conditions cells typically aggregate on a flat two-dimensional surface. We approach the problem of gradient sensing on a sphere, which is both harder and more natural for the ameoba, while still simple enough for us analytically and numerically. Directional data is naturally described using unit vectors in spherical coordinates, but the ameobae receive signals as binding events involving intramembrane protein complexes, so we have developed a method for projecting the ensemble of receptor bindings onto coordinates in R3 . In loose analogy with the chemotactic efficiency [5], we compare the projected directional estimate with the true gradient direction represented as a unit vector on S2 . Consistent with observed timing of the cell’s response to cAMP stimulation, we find that the directional signal converges quickly enough for the cell to make a decision about which direction to move within the first two seconds following stimulus onset. 2 Methods 2.1 Monte Carlo simulations Using MCell and DReAMM [10, 11] we construct a spherical cell (radius R = 7.5µm, [12]) centered in a cubic volume (side length L = 30µm). N = 980 triangular tiles partition the surface (mesh generated by DOME1 ); each contained one cell surface receptor for cAMP with binding rate k+ = 4.4 × 107 sec−1 M−1 , first-order cAMP unbinding rate k− = 1.1 sec−1 [12] and Keq = k− /k+ = 25nMol cAMP. We established a baseline concentration of approximately 1nMol by releasing a cAMP bolus at time 0 inside the cube with zero-flux boundary conditions imposed on each wall. At t = 2 seconds we introduced a steady flux at the x = −L/2 wall of 1 molecule of cAMP per square micron per msec, adding signaling molecules from the left. Simultaneously, the x = +L/2 wall of the cube assumes absorbing boundary conditions. The new boundary conditions lead (at equilibrium) to a linear gradient of 2 nMol/30µm, ranging from ≈ 2.0 nMol at the flux source wall to ≈ 0 nMol at the absorbing wall (see Figure 1); the concentration profile approaches this new steady state with time constant of approximately 1.25 msec. Sampling boxes centered along the planes x = ±13.5µm measured the local concentration, allowing us to validate the expected model behavior. Figure 1: Gradient sensing simulations performed with MCell (a Monte Carlo simulator of cellular microphysiology, http://www.mcell.cnl.salk.edu/) and rendered with DReAMM (Design, Render, and Animate MCell Models, http://www.mcell.psc.edu/). The model cell comprised a sphere triangulated with 980 tiles with one cAMP receptor per tile. Cell radius R = 7.5µm; cube side L = 30µm. Left: Initial equilibrium condition, before imposition of gradient. [cAMP] ≈ 1nMol (c. 15,000 molecules in the volume outside the sphere). Right: Gradient condition after transient (c. 15,000 molecules; see Methods for details). 2.2 Analysis 2.2.1 Assumptions We make the following assumptions to simplify the analysis of the distribution of receptor activities at equilibrium, whether pre- or post-stimulus onset: 1. Independence. At equilibrium, the state of each receptor (bound vs unbound) is independent of the states of the other receptors. 2. Linear Gradient. At equilibrium under the imposed gradient condition, the concentration of ligand molecule varies linearly with position along the gradient axis. 3. Symmetry. 1 http://nwg.phy.bnl.gov/∼bviren/uno/other/ (a) Rotational equivariance of receptor activities. In the absence of an applied gradient signal, the probability distribution describing the receptor states is equivariant with respect to arbitrary rotations of the sphere. (b) Rotational invariance of gradient direction. The imposed gradient seen by a model cell is equally likely to be coming from any direction; therefore the gradient direction vector is uniformly distributed over S2 . (c) Axial equivariance about the gradient direction. Once a gradient direction is imposed, the probability distribution describing receptor states is rotationally equivariant with respect to rotations about the axis parallel with the gradient. Berg and Purcell [1] calculate the inaccuracy in concentration estimates due to nonindependence of adjacent receptors; for our parameters (effective receptor radius = 5nm, receptor spacing ∼ 1µm) the fractional error in estimating concentration differences due to receptor nonindependence is negligible ( 10−11 ) [1, 2]. Because we fix receptors to be in 1:1 correspondence with surface tiles, spherical symmetry and uniform distribution of the receptors are only approximate. The gradient signal communicated via diffusion does not involve sharp spatial changes on the scale of the distance between nearby receptors, therefore spherical symmetry and uniform identical receptor distribution are good analytic approximations of the model configuration. By rotational equivariance we mean that combining any rotation of the sphere with a corresponding rotation of the indices labeling the N receptors, {j = 1, · · · , N }, leads to a statistically indistinguishable distribution of receptor activities. This same spherical symmetry is reflected in the a priori distribution of gradient directions, which is uniform over the sphere (with density 1/4π). Spherical symmetry is broken by the gradient signal, which fixes a preferred direction in space. About this axis however, we assume the system retains the rotational symmetry of the cylinder. 2.2.2 Mutual information of the receptors In order to quantify the directional information available to the cell from its surface receptors we construct an explicit model for the receptor states and the cell’s estimated direction. We model the receptor states via a collection of random variables {Bj } and develop an expression for the entropy of {Bj }. Then in section 2.2.3 we present a method for projecting a temporally filtered estimated direction, g , into three (rather than N ) dimensions. ˆ N Let the random variables {Bj }j=1 represent the states of the N cAMP receptors on the cell surface; Bj = 1 if the receptor is bound to a molecule of cAMP, otherwise Bj = 0. Let xj ∈ S2 represent the direction from the center of the center of the cell to the j th receptor. Invoking assumption 2 above, we take the equilibrium concentration of cAMP at x to be c(x|g) = a + b(x · g) where g ∈ S2 is a unit vector in the direction of the gradient. The parameter a is the mean concentration over the cell surface, and b = R| c| is half the drop in concentration from one extreme on the cell surface to the other. Before the stimulus begins, the gradient direction is undefined. It can be shown (see Supplemental Materials) that the entropy of receptor states given a fixed gradient direction g, H[{Bj }|g], is given by an integral over the sphere: π 2π a + b cos(θ) sin(θ) dφ dθ (as N → ∞). (1) a + b cos(θ) + Keq 4π θ=0 φ=0 On the other hand, if the gradient direction remains unspecified, the entropy of receptor states is given by H[{Bj }|g] ∼ N Φ π 2π θ=0 φ=0 H[{Bj }] ∼ N Φ a + b cos(θ) a + b cos(θ) + Keq sin(θ) dφ dθ (as N → ∞), 4π − (p log2 (p) + (1 − p) log2 (1 − p)) , 0 < p < 1 0, p = 0 or 1 binary random variable with state probabilities p and (1 − p). where Φ[p] = (2) denotes the entropy for a In both equations (1) and (2), the argument of Φ is a probability taking values 0 ≤ p ≤ 1. In (1) the values of Φ are averaged over the sphere; in (2) Φ is evaluated after averaging probabilities. Because Φ[p] is convex for 0 ≤ p ≤ 1, the integral in equation 1 cannot exceed that in equation 2. Therefore the mutual information upon receiving the signal is nonnegative (as expected): ∆ M I[{Bj }; g] = H[{Bj }] − H[{Bj }|g] ≥ 0. The analytic solution for equation (1) involves the polylogarithm function. For the parameters shown in the simulation (a = 1.078 nMol, b = .512 nMol, Keq = 25 nMol), the mutual information with 980 receptors is 2.16 bits. As one would expect, the mutual information peaks when the mean concentration is close to the Keq of the receptor, exceeding 16 bits when a = 25, b = 12.5 and Keq = 25 (nMol). 2.2.3 Dimension reduction The estimate obtained above does not give tell us how quickly the directional information available to the cell evolves over time. Direct estimate of the mutual information from stochastic simulations is impractical because the aggregate random variables occupy a 980 dimensional space that a limited number of simulation runs cannot sample adequately. Instead, we construct a deterministic function from the set of 980 time courses of the receptors, {Bj (t)}, to an aggregate directional estimate in R3 . Because of the cylindrical symmetry inherent in the system, our directional estimator g is ˆ an unbiased estimator of the true gradient direction g. The estimator g (t) may be thought of as ˆ representing a downstream chemical process that accumulates directional information and decays with some time constant τ . Let {xj }N be the spatial locations of the N receptors on the cell’s j=1 surface. Each vector is associated with a weight wj . Whenever the j th receptor binds a cAMP molecule, wj is incremented by one; otherwise wj decays with time constant τ . We construct an instantaneous estimate of the gradient direction from the linear combination of receptor positions, N gτ (t) = j=1 wj (t)xj . This procedure reflects the accumulation and reabsorption of intracellular ˆ second messengers released from the cell membrane upon receptor binding. Before the stimulus is applied, the weighted directional estimates gτ are small in absolute magniˆ tude, with direction uniformly distributed on S2 . In order to determine the information gained as the estimate vector evolves after stimulus application, we wish to determine the change in entropy in an ensemble of such estimates. As the cell gains information about the direction of the gradient signal from its receptors, the entropy of the estimate should decrease, leading to a rise in mutual information. By repeating multiple runs (M = 600) of the simulation we obtain samples from the ensemble of direction estimates, given a particular stimulus direction, g. In the method of Kozachenko and Leonenko [13], adapted for the analysis of neural spike train data by Victor [14] (“KLV method”), the cumulative distribution function is approximated directly from the observed samples, and the entropy is estimated via a change of variables transformation (see below). This method may be formulated in vector spaces Rd for d > 1 ([13]), but it is not guaranteed to be unbiased in the multivariate case [15] and has not been extended to curved manifolds such as the sphere. In the present case, however, we may exploit the symmetries inherent in the model (Assumptions 3a-3c) to reduce the empirical entropy estimation problem to one dimension. Adapting the argument in [14] to the case of spherical data from a distribution with rotational symmetry about a given axis, we obtain an estimate of the entropy based on a series of observations of the angles {θ1 , · · · , θM } between the estimates gτ and the true gradient direction g (for details, see ˆ Supplemental Materials): 1 H∼ M M log2 (λk ) + log2 (2(M − 1)) + k=1 γ + log2 (2π) + log2 (sin(θk )) loge (2) (3) ∆ (as M → ∞) where after sorting the θk in monotonic order, λk = min(|θk − θk±1 |) is the distance between each angle and its nearest neighbor in the sample, and γ is the Euler-Mascheroni constant. As shown in Figure 2, this approximation agrees with the analytic result for the uniform distribution, Hunif = log2 (4π) ≈ 3.651. 3 Results Figure 3 shows the results of M = 600 simulation runs. Panel A shows the concentration averaged across a set of 1µm3 sample boxes, four in the x = −13.5µm plane and four in the x = +13.5µm Figure 2: Monte Carlo simulation results and information analysis. A: Average concentration profiles along two planes perpendicular to the gradient, at x = ±13.5µm. B: Estimated direction vector (x, y, and z components; x = dark blue trace) gτ , τ = 500 msec. C: Entropy of the ensemble of diˆ rectional vector estimates for different values of the intracellular filtering time constant τ . Given the directions of the estimates θk , φk on each of M runs, we calculate the entropy of the ensemble using equation (3). All time constants yield uniformly distributed directional estimates in the pre-stimulus period, 0 ≤ t ≤ 2 (sec). After stimulus onset, directional estimates obtained with shorter time constants respond more quickly but achieve smaller gains in mutual information (smaller reductions in entropy). Filtering time constants τ range from lightest to darkest colors: 20, 50, 100, 200, 500, 1000, 2000 msec. plane. The initial bolus of cAMP released into the volume at t = 0 sec is not uniformly distributed, but spreads out evenly within 0.25 sec. At t = 2.0 sec the boundary conditions are changed, causing a gradient to emerge along a realistic time course. Consistent with the analytic solution for the mean concentration (not shown), the concentration approaches equilibrium more rapidly near the absorbing wall (descending trace) than at the imposed flux wall (ascending trace). Panel B shows the evolution of a directional estimate vector gτ for a single run, with τ = 500 ˆ msec. During uniform conditions all vectors fluctuate near the origin. After gradient onset the variance increases and the x component (dark trace) becomes biased towards the gradient source (g = [−1, 0, 0]) while the y and z components still have a mean of zero. Across all 600 runs the mean of the y and z components remains close to zero, while the mean of the x component systematically departs from zero shortly after stimulus onset (not shown). Hence the directional estimator is unbiased (as required by symmetry). See Supplemental Materials for the population average of g . ˆ Panel C shows the time course of the entropy of the ensemble of normalized directional estimate vectors gτ /|ˆτ | over M = 600 simulations, for intracellular filtering time constants ranging from 20 ˆ g msec to 2000 msec (light to dark shading), calculated using equation (3). Following stimulus onset, entropy decreases steadily, showing an increase in information available to the amoeba about the direction of the stimulus; the mutual information at a given point in time is the difference between the entropy at that time and before stimulus onset. For a cell with roughly 1000 receptors the mutual information has increased at most by ∼ 2 bits of information by one second (for τ = 500 msec), and at most by ∼ 3 bits of information by two seconds (for τ =1000 or 2000 msec), under our stimulation protocol. A one bit reduction in uncertainty is equivalent to identifying the correct value of the x component (positive versus negative) when the stimulus direction is aligned along the x-axis. Alternatively, note that a one bit reduction results in going from the uniform distribution on the sphere to the uniform distribution on one hemisphere. For τ ≤ 100 msec, the weighted average with decay time τ never gains more than one bit of information about the stimulus direction, even at long times. This observation suggestions that signaling must involve some chemical components with lifetimes longer than 100 msec. The τ = 200 msec filter saturates after about one second, at ∼ 1 bit of information gain. Longer lived second messengers would respond more slowly to changes from the background stimulus distribution, but would provide better more informative estimates over time. The τ = 500 msec estimate gains roughly two bits of information within 1.5 seconds, but not much more over time. Heuristically, we may think of a two bit gain in information as corresponding to the change from a uniform distribution to one covering uniformly covering one quarter of S2 , i.e. all points within π/3 of the true direction. Within two seconds the τ = 1000 msec and τ = 2000 msec weighted averages have each gained approximately three bits of information, equivalent to a uniform distribution covering all points with 0.23π or 41o of the true direction. 4 Discussion & conclusions Clearly there is an opportunity for more precise control of experimental conditions to deepen our understanding of spatio-temporal information processing at the membranes of gradient-sensitive cells. Efforts in this direction are now using microfluidic technology to create carefully regulated spatial profiles for probing cellular responses [16]. Our results suggest that molecular processes relevant to these responses must have lasting effects ≥ 100 msec. We use a static, immobile cell. Could cell motion relative to the medium increase sensitivity to changes in the gradient? No: the Dictyostelium velocity required to affect concentration perception is on order 1cm sec−1 [1], whereas reported velocities are on the order µm sec−1 [5]. The chemotactic response mechanism is known to begin modifying the cell membrane on the edge facing up the gradient within two seconds after stimulus initiation [7, 6], suggesting that the cell strikes a balance between gathering data and deciding quickly. Indeed, our results show that the reported activation of the G-protein signaling system on the leading edge of a chemotactically responsive cell [7] rises at roughly the same rate as the available chemotactic information. Results such as these ([7, 6]) are obtained by introducing a pipette into the medium near the amoeba; the magnitude and time course of cAMP release are not precisely known, and when estimated the cAMP concentration at the cell surface is over 25 nMol by a full order of magnitude. Thomson and Kristan [17] show that for discrete probability distributions and for continuous distributions over linear spaces, stimulus discriminability may be better quantified using ideal observer analysis (mean squared error, for continuous variables) than information theory. The machinery of mean squared error (variance, expectation) do not carry over to the case of directional data without fundamental modifications [18]; in particular the notion of mean squared error is best represented by the mean resultant length 0 ≤ ρ ≤ 1, the expected length of the vector average of a collection of unit vectors representing samples from directional data. A resultant with length ρ ≈ 1 corresponds to a highly focused probability density function on the sphere. In addition to measuring the mutual information between the gradient direction and an intracellular estimate of direction, we also calculated the time evolution of ρ (see Supplemental Materials.) We find that ρ rapidly approaches 1 and can exceed 0.9, depending on τ . We found that in this case at least the behavior of the mean resultant length and the mutual information are very similar; there is no evidence of discrepancies of the sort described in [17]. We have shown that the mutual information between an arbitrarily oriented stimulus and the directional signal available at the cell’s receptors evolves with a time course consistent with observed reaction times of Dictyostelium amoeba. Our results reinforce earlier claims that information theory can play a role in understanding biochemical cellular communication. Acknowledgments MCell simulations were run on the Oberlin College Beowulf Cluster, supported by NSF grant CHE0420717. References [1] Howard C. Berg and Edward M. Purcell. Physics of chemoreception. Biophysical Journal, 20:193, 1977. [2] William Bialek and Sima Setayeshgar. Physical limits to biochemical signaling. PNAS, 102(29):10040– 10045, July 19 2005. [3] S. Qazi, A. Beltukov, and B.A. Trimmer. Simulation modeling of ligand receptor interactions at nonequilibrium conditions: processing of noisy inputs by ionotropic receptors. Math Biosci., 187(1):93–110, Jan 2004. [4] D. J. Spencer, S. K. Hampton, P. Park, J. P. Zurkus, and P. J. Thomas. The diffusion-limited biochemical signal-relay channel. In S. Thrun, L. Saul, and B. Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16. MIT Press, Cambridge, MA, 2004. [5] P.R. Fisher, R. Merkl, and G. Gerisch. Quantitative analysis of cell motility and chemotaxis in Dictyostelium discoideum by using an image processing system and a novel chemotaxis chamber providing stationary chemical gradients. J. Cell Biology, 108:973–984, March 1989. [6] Carole A. Parent, Brenda J. Blacklock, Wendy M. Froehlich, Douglas B. Murphy, and Peter N. Devreotes. G protein signaling events are activated at the leading edge of chemotactic cells. Cell, 95:81–91, 2 October 1998. [7] Xuehua Xu, Martin Meier-Schellersheim, Xuanmao Jiao, Lauren E. Nelson, and Tian Jin. Quantitative imaging of single live cells reveals spatiotemporal dynamics of multistep signaling events of chemoattractant gradient sensing in dictyostelium. Molecular Biology of the Cell, 16:676–688, February 2005. [8] Jan Wouter-Rappel, Peter. J Thomas, Herbert Levine, and William F. Loomis. Establishing direction during chemotaxis in eukaryotic cells. Biophys. J., 83:1361–1367, 2002. [9] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, New York, 1990. [10] J. R. Stiles, D. Van Helden, T. M. Bartol, E.E. Salpeter, and M. M. Salpeter. Miniature endplate current rise times less than 100 microseconds from improved dual recordings can be modeled with passive acetylcholine diffusion from a synaptic vesicle. Proc. Natl. Acad. Sci. U.S.A., 93(12):5747–52, Jun 11 1996. [11] J. R. Stiles and T. M. Bartol. Computational Neuroscience: Realistic Modeling for Experimentalists, chapter Monte Carlo methods for realistic simulation of synaptic microphysiology using MCell, pages 87–127. CRC Press, Boca Raton, FL, 2001. [12] M. Ueda, Y. Sako, T. Tanaka, P. Devreotes, and T. Yanagida. Single-molecule analysis of chemotactic signaling in Dictyostelium cells. Science, 294:864–867, October 2001. [13] L.F. Kozachenko and N.N. Leonenko. Probl. Peredachi Inf. [Probl. Inf. Transm.], 23(9):95, 1987. [14] Jonathan D. Victor. Binless strategies for estimation of information from neural data. Physical Review E, 66:051903, Nov 11 2002. [15] Marc M. Van Hulle. Edgeworth approximation of multivariate differential entropy. Neural Computation, 17:1903–1910, 2005. [16] Loling Song, Sharvari M. Nadkarnia, Hendrik U. B¨ dekera, Carsten Beta, Albert Bae, Carl Franck, o Wouter-Jan Rappel, William F. Loomis, and Eberhard Bodenschatz. Dictyostelium discoideum chemotaxis: Threshold for directed motion. Euro. J. Cell Bio, 85(9-10):981–9, 2006. [17] Eric E. Thomson and William B. Kristan. Quantifying stimulus discriminability: A comparison of information theory and ideal observer analysis. Neural Computation, 17:741–778, 2005. [18] Kanti V. Mardia and Peter E. Jupp. Directional Statistics. John Wiley & Sons, West Sussex, England, 2000.
3 0.49688441 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
Author: Jeremy Lewi, Robert Butera, Liam Paninski
Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).
4 0.49603698 14 nips-2006-A Small World Threshold for Economic Network Formation
Author: Eyal Even-dar, Michael Kearns
Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1
5 0.49263734 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton
Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD. 1
6 0.45501631 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
7 0.45316198 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
8 0.45289463 113 nips-2006-Learning Structural Equation Models for fMRI
9 0.45281622 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
10 0.45235151 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
11 0.43829498 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields
12 0.430655 57 nips-2006-Conditional mean field
13 0.42274266 69 nips-2006-Distributed Inference in Dynamical Systems
14 0.42208201 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
15 0.4210023 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
16 0.4145667 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
17 0.39201725 77 nips-2006-Fast Computation of Graph Kernels
18 0.38835275 21 nips-2006-AdaBoost is Consistent
19 0.3789562 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time
20 0.37677467 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
topicId topicWeight
[(1, 0.081), (3, 0.018), (7, 0.086), (9, 0.052), (20, 0.019), (22, 0.083), (44, 0.108), (57, 0.069), (64, 0.013), (65, 0.042), (69, 0.032), (71, 0.025), (90, 0.01), (93, 0.281)]
simIndex simValue paperId paperTitle
1 0.83758807 59 nips-2006-Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons
Author: Elisabetta Chicca, Giacomo Indiveri, Rodney J. Douglas
Abstract: Cooperative competitive networks are believed to play a central role in cortical processing and have been shown to exhibit a wide set of useful computational properties. We propose a VLSI implementation of a spiking cooperative competitive network and show how it can perform context dependent computation both in the mean firing rate domain and in spike timing correlation space. In the mean rate case the network amplifies the activity of neurons belonging to the selected stimulus and suppresses the activity of neurons receiving weaker stimuli. In the event correlation case, the recurrent network amplifies with a higher gain the correlation between neurons which receive highly correlated inputs while leaving the mean firing rate unaltered. We describe the network architecture and present experimental data demonstrating its context dependent computation capabilities. 1
same-paper 2 0.81047636 98 nips-2006-Inferring Network Structure from Co-Occurrences
Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak
Abstract: We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
4 0.56517255 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
5 0.56111073 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
6 0.56007326 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
7 0.55919808 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.55881965 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
9 0.55872715 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.55785394 21 nips-2006-AdaBoost is Consistent
11 0.55783278 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
12 0.55774838 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
13 0.55766034 139 nips-2006-Multi-dynamic Bayesian Networks
14 0.55736434 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
15 0.55612022 5 nips-2006-A Kernel Method for the Two-Sample-Problem
16 0.5557434 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
17 0.55472142 117 nips-2006-Learning on Graph with Laplacian Regularization
18 0.55295271 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
19 0.55257672 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
20 0.55245727 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments