nips nips2006 nips2006-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. [sent-3, score-0.896]
2 Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. [sent-4, score-0.663]
3 In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. [sent-6, score-0.622]
4 This reduces the number of samples needed to obtain accurate gradient estimates. [sent-7, score-0.216]
5 Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. [sent-8, score-0.49]
6 1 Introduction Policy Gradient (PG) methods are Reinforcement Learning (RL) algorithms that maintain a parameterized action-selection policy and update the policy parameters by moving them in the direction of an estimate of the gradient of a performance measure. [sent-9, score-1.131]
7 However, both the theoretical results and empirical evaluations have highlighted a major shortcoming of these algorithms, namely, the high variance of the gradient estimates. [sent-14, score-0.21]
8 One solution is to use a small (i.e., smaller than 1) discount factor in these algorithms [2, 3]; however, this creates another problem by introducing bias into the gradient estimates. [sent-18, score-0.188]
9 Another solution, which does not involve biasing the gradient estimate, is to subtract a reinforcement baseline from the average reward estimate in the updates of PG algorithms. [sent-19, score-0.401]
10 Another approach for speeding-up policy gradient algorithms was recently proposed in [5] and extended in [6, 7]. [sent-22, score-0.622]
11 The idea is to replace the policy-gradient estimate with an estimate of the so-called natural policy-gradient. [sent-23, score-0.082]
12 This is motivated by the requirement that a change in the way the policy is parametrized should not influence the result of the policy update. [sent-24, score-0.868]
13 In terms of the policy update rule, the move to a natural-gradient rule amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy. [sent-25, score-0.659]
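As a concrete illustration of this transformation, the sketch below (generic Python, not code from the paper) obtains the natural-gradient direction by solving a linear system with the Fisher information matrix G rather than forming its inverse explicitly:

```python
import numpy as np

def natural_gradient(grad, fisher):
    """Natural-gradient direction: G^{-1} grad, computed by solving G x = grad."""
    # Solving the linear system is cheaper and more stable than forming G^{-1} explicitly.
    return np.linalg.solve(fisher, grad)
```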
14 However, both conventional and natural policy gradient methods rely on Monte-Carlo (MC) techniques to estimate the gradient of the performance measure. [sent-26, score-0.869]
15 Monte-Carlo estimation is a frequentist procedure, and as such violates the likelihood principle [8]. [sent-27, score-0.093]
16 Moreover, although MC estimates are unbiased, they tend to produce high variance estimates, or alternatively, require excessive sample sizes (see [9] for a discussion). [sent-28, score-0.143]
17 1 The likelihood principle states that in a parametric statistical model, all the information about a data sample that is required for inferring the model parameters is contained in the likelihood function of that sample. [sent-29, score-0.109]
18 This is done by treating the first term f in the integrand as a random function, the randomness of which reflects our subjective uncertainty concerning its true identity. [sent-32, score-0.133]
19 Observing samples of f at a set of points (x_1, x_2, . . . , x_M) allows us to employ Bayes' rule to compute a posterior distribution of f, conditioned on these samples. [sent-37, score-0.082]
20 In this paper, we propose a Bayesian framework for policy gradient, by modeling the gradient as a GP. [sent-39, score-0.622]
21 This reduces the number of samples needed to obtain accurate gradient estimates. [sent-40, score-0.216]
22 Moreover, estimates of the natural gradient and the gradient covariance are provided at little extra cost. [sent-41, score-0.456]
23 A stationary policy µ(·|x) ∈ P(A) is a probability distribution over actions, conditioned on the current state. [sent-49, score-0.454]
24 The MDP controlled by the policy µ induces a Markov chain over state-action pairs. [sent-50, score-0.434]
25 We denote by R(ξ) = Σ_{t=0}^{T} γ^t r(x_t, a_t) the (possibly discounted, γ ∈ [0, 1]) cumulative return of the path ξ. [sent-56, score-0.184]
26 R(ξ) is a random variable both because the path ξ is a random variable, and because even for a given path, each of the rewards sampled in it may be stochastic. [sent-57, score-0.092]
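As a small worked example of this definition (the plain list-of-rewards path representation is an assumption of the sketch, not the paper's data structure):

```python
import numpy as np

def discounted_return(rewards, gamma=1.0):
    """R(xi) = sum_{t=0}^{T} gamma^t * r(x_t, a_t) for one sampled path."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(gamma ** np.arange(len(rewards)), rewards))
```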
27 Gradient-based approaches to policy search in RL have recently received much attention as a means to sidestep problems of partial observability and of policy oscillations and even divergence encountered in value-function based methods (see [11]). [sent-60, score-0.868]
28 The gradient of the expected return η(θ) = η(µ(·|·; θ)) is given by ∇η(θ) = ∫ R̄(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ, (3) where Pr(ξ; θ) = Pr(ξ|µ(·|·; θ)). [sent-67, score-0.347]
29 Since the initial state distribution P_0 and the transition distribution P are independent of the policy parameters θ, we can write the score of a path ξ as ∇ log Pr(ξ; θ) = Σ_{t=0}^{T−1} ∇ log µ(a_t|x_t; θ). (4) [sent-69, score-0.517]
30 Previous work on policy gradient methods used classical Monte-Carlo to estimate the gradient in Eq. 3. [sent-75, score-0.851]
31 Sample paths ξ_1, . . . , ξ_M are drawn according to Pr(ξ; θ), and the gradient ∇η(θ) is estimated using the MC estimator ∇η_MC(θ) = (1/M) Σ_{i=1}^M R(ξ_i) ∇ log Pr(ξ_i; θ) = (1/M) Σ_{i=1}^M R(ξ_i) Σ_{t=0}^{T_i−1} ∇ log µ(a_{t,i}|x_{t,i}; θ). (5) [sent-83, score-0.229]
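A minimal sketch of this MC estimator is given below; the Gaussian policy a ~ N(theta^T x, sigma^2) and the sample_path interface (returning the states, actions, and rewards of one trajectory, with states as feature vectors) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def grad_log_gaussian_policy(x, a, theta, sigma):
    """Score nabla_theta log mu(a|x; theta) for the assumed policy a ~ N(theta^T x, sigma^2)."""
    return (a - theta @ x) / sigma**2 * x

def mc_policy_gradient(sample_path, theta, sigma, M=100):
    """Eq. 5: average of R(xi_i) * sum_t nabla log mu(a_t|x_t; theta) over M sampled paths."""
    grad = np.zeros_like(theta, dtype=float)
    for _ in range(M):
        states, actions, rewards = sample_path(theta, sigma)
        R = float(np.sum(rewards))
        score = sum(grad_log_gaussian_policy(np.asarray(x), a, theta, sigma)
                    for x, a in zip(states, actions))
        grad += R * score
    return grad / M
```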
32 3 Bayesian Quadrature Bayesian quadrature (BQ) [10] is a Bayesian method for evaluating an integral using samples of its integrand. [sent-84, score-0.106]
33 ρ̂_MC is an unbiased estimate of ρ, with variance that diminishes to zero as M → ∞. [sent-91, score-0.083]
34 The choice of kernel function k allows us to incorporate prior knowledge on the smoothness properties of the integrand into the estimation procedure. [sent-99, score-0.12]
35 When we are provided with a set of samples D_M = {(x_i, y_i)}_{i=1}^M, where y_i is a (possibly noisy) sample of f(x_i), we apply Bayes' rule to condition the prior on these sampled values. [sent-100, score-0.139]
36 If the measurement noise is normally distributed, the result is a Normal posterior distribution of f |DM . [sent-101, score-0.088]
37 The expressions for the posterior mean and covariance are standard: E(f(x)|D_M) = f_0(x) + k_M(x)^T C_M (y_M − f_0), Cov(f(x), f(x')|D_M) = k(x, x') − k_M(x)^T C_M k_M(x'). (7) [sent-102, score-0.087]
38 Here, k_M(x) = (k(x_1, x), . . . , k(x_M, x))^T, y_M = (y_1, . . . , y_M)^T, [K_M]_{i,j} = k(x_i, x_j), C_M = (K_M + Σ_M)^{−1}, and [Σ_M]_{i,j} is the measurement noise covariance between the ith and jth samples. [sent-112, score-0.069]
39 The posterior distribution of the integral ρ in Eq. 6 is also Gaussian, and the posterior moments are given by E(ρ|D_M) = ∫ E(f(x)|D_M) p(x) dx, Var(ρ|D_M) = ∫∫ Cov(f(x), f(x')|D_M) p(x) p(x') dx dx'. [sent-118, score-0.069]
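These two moments can be evaluated in closed form whenever the integrals of the kernel against p(x) are analytic. The sketch below assumes a zero prior mean f_0 = 0, a one-dimensional squared-exponential kernel, and a Gaussian weighting p(x) = N(0, s^2), for which those integrals are standard; these modelling choices are illustrative and are not the paper's Fisher-kernel models.

```python
import numpy as np

def bq_moments(x, y, ell=1.0, s=1.0, noise=1e-6):
    """Posterior mean and variance of rho = int f(x) p(x) dx under a GP prior on f.

    Assumes zero prior mean, kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)) and
    weighting p(x) = N(0, s^2), so the required kernel integrals are analytic.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    C = np.linalg.inv(K + noise * np.eye(len(x)))
    # z_i = int k(x', x_i) p(x') dx'  and  z0 = int int k(x', x'') p(x') p(x'') dx' dx''
    z = np.sqrt(ell**2 / (ell**2 + s**2)) * np.exp(-0.5 * x**2 / (ell**2 + s**2))
    z0 = np.sqrt(ell**2 / (ell**2 + 2 * s**2))
    mean = z @ C @ y
    var = z0 - z @ C @ z
    return mean, var

# Example: estimate E[f(x)] for f(x) = sin(x) under p = N(0, 1) (true value is 0)
rng = np.random.default_rng(0)
xs = rng.normal(size=20)
m, v = bq_moments(xs, np.sin(xs))
```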
40 This affords us flexibility in the choice of sample points, allowing us, for instance, to actively design the samples (x_1, x_2, . . . , x_M). [sent-131, score-0.083]
41 4 Bayesian Policy Gradient In this section, we use Bayesian quadrature to estimate the gradient of the expected return with respect to the policy parameters, and propose Bayesian policy gradient (BPG) algorithms. [sent-135, score-1.48]
42 In the frequentist approach to policy gradient, our performance measure was η(θ) from Eq. 2. [sent-136, score-0.653]
43 This is the result of averaging the cumulative return R(ξ) over all possible paths ξ and all possible returns accumulated in each path. [sent-137, score-0.265]
44 In the Bayesian approach we have an additional source of randomness, which is our subjective Bayesian uncertainty concerning the process generating the cumulative returns. [sent-138, score-0.077]
45 Since we are interested in optimizing performance rather than evaluating it, we evaluate the posterior distribution of the gradient of ηB (θ). [sent-142, score-0.253]
46 (12) Consequently, in BPG we cast the problem of estimating the gradient of the expected return in the form of Eq. [sent-144, score-0.347]
47 We will then proceed by calculating the posterior moments of the gradient ∇η_B(θ) conditioned on the observed data. [sent-149, score-0.277]
48 Our particular choices of linear and quadratic Fisher kernels were guided by the requirement that the posterior moments of the gradient be analytically tractable. [sent-154, score-0.257]
49 Finally, n is the ` number of policy parameters, and G = E u(ξ)u(ξ) is the Fisher information matrix. [sent-165, score-0.434]
50 We can now use Models 1 and 2 to define algorithms for evaluating the gradient of the expected return with respect to the policy parameters. [sent-166, score-0.803]
51 The generic algorithm (for either model) takes a set of policy parameters θ and a sample size M as input, and returns an estimate of the posterior moments of the gradient of the expected return. [sent-169, score-0.891]
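A schematic of this generic routine (Alg. 1) is sketched below. The posterior-moment computation is hidden behind a hypothetical bq_posterior_moments helper standing in for the Model 1 / Model 2 formulas, which are not reproduced in this text, and sample_path, path_score, and path_return are assumed environment/policy interfaces.

```python
import numpy as np

def evaluate_gradient_posterior(theta, M, sample_path, path_score, path_return,
                                bq_posterior_moments):
    """Alg. 1 (schematic): sample M paths, then return BQ posterior moments of grad eta_B(theta)."""
    paths = [sample_path(theta) for _ in range(M)]
    scores = np.array([path_score(path, theta) for path in paths])    # u(xi_i), one row per path
    returns = np.array([path_return(path) for path in paths])         # R(xi_i)
    # The Fisher-kernel Gram matrix and the posterior mean/covariance are delegated to the BQ model.
    return bq_posterior_moments(scores, returns)
```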
52 Consequently, every time we update the policy parameters we need to recompute G. [sent-171, score-0.45]
53 MC Estimation: At each step j, our BPG algorithm generates M sample paths using the current policy parameters θ_j in order to estimate the gradient ∇η_B(θ_j). [sent-175, score-0.839]
54 We can use these generated sample paths to estimate the Fisher information matrix G(θ_j) by replacing the expectation in G with empirical averaging, as Ĝ_MC(θ_j) = (1/Σ_{i=1}^M T_i) Σ_{i=1}^M Σ_{t=0}^{T_i−1} ∇log µ(a_t|x_t; θ_j) ∇log µ(a_t|x_t; θ_j)^T. [sent-176, score-0.219]
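Under the same assumed trajectory interface as in the earlier sketches, this empirical Fisher estimate can be written as:

```python
import numpy as np

def empirical_fisher(paths, grad_log_policy, theta):
    """G_hat = average over all visited (x_t, a_t) of g g^T, with g = grad log mu(a_t|x_t; theta)."""
    n = len(theta)
    G = np.zeros((n, n))
    total_steps = 0
    for states, actions, _ in paths:                  # each path: (states, actions, rewards)
        for x, a in zip(states, actions):
            g = grad_log_policy(x, a, theta)
            G += np.outer(g, g)
            total_steps += 1
    return G / total_steps
```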
55 We can model the MDP dynamics using some parameterized model, and estimate the model parameters using maximum likelihood or Bayesian methods. [sent-181, score-0.094]
56 This would be a model-based approach to policy gradient, which would allow us to transfer information between different policies. [sent-182, score-0.434]
57 Here we use an on-line sparsification method from [15] to selectively add a new observed path to a set of dictionary paths D M , which are used as a basis for approximating the full solution. [sent-186, score-0.154]
58 The Bayesian policy gradient (BPG) algorithm is described in Alg. [sent-188, score-0.622]
59 This algorithm starts with an initial vector of policy parameters θ 0 and updates the parameters in the direction of the posterior mean of the gradient of the expected return, computed by Alg. [sent-190, score-0.833]
60 This is repeated N times, or alternatively, until the gradient estimate is sufficiently close to zero. [sent-192, score-0.229]
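Putting the pieces together, the outer loop of Alg. 2 amounts to repeated ascent along the posterior mean of the gradient; a minimal sketch, with the gradient-evaluation routine passed in as a callable that already closes over the environment, is:

```python
import numpy as np

def bayesian_policy_gradient(theta0, grad_posterior, N=100, M=20, alpha0=0.05):
    """Alg. 2 (schematic): repeatedly move theta along the posterior mean of grad eta_B(theta)."""
    theta = np.array(theta0, dtype=float)
    for j in range(N):
        grad_mean, _ = grad_posterior(theta, M)   # posterior mean and covariance of the gradient
        alpha = alpha0 * 20.0 / (20.0 + j)        # schedule used in the experiments; alpha0 is illustrative
        theta = theta + alpha * grad_mean
    return theta
```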
61 We evaluate the BPG algorithm (Alg. 2) on the LQR problem, and compare it with a standard MC-based policy gradient (MCPG) algorithm. [sent-195, score-0.622]
62 5.1 A Bandit Problem In this simple example, we compare the BQ and MC estimates of the gradient (for a fixed set of policy parameters) using the same samples. [sent-197, score-0.661]
63 The policy, and therefore also the distribution over paths, is given by a ∼ N(θ_1 = 0, θ_2² = 1). [sent-200, score-0.105]
64 The score function of the path ξ = a and the Fisher information matrix are given by u(ξ) = [a, a² − 1]^T and G = diag(1, 2), respectively. [sent-201, score-0.067]
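Both quantities follow directly from the Gaussian likelihood; a quick Monte-Carlo check (an illustrative snippet, averaging the outer product of the score over sampled actions) recovers G = diag(1, 2):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2 = 0.0, 1.0
a = rng.normal(theta1, theta2, size=200_000)

# Score of log N(a; theta1, theta2^2) with respect to (theta1, theta2), evaluated at (0, 1):
u = np.stack([(a - theta1) / theta2**2,
              (a - theta1)**2 / theta2**3 - 1.0 / theta2], axis=1)   # u(xi) = [a, a^2 - 1]
G_hat = u.T @ u / len(a)                                             # approaches diag(1, 2)
```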
65 Table 2 shows the exact gradient of the expected return and its MC and BQ estimates (using 10 and 100 samples) for two versions of the simple bandit problem corresponding to two different deterministic reward functions r(a) = a and r(a) = a². [sent-202, score-0.464]
66 The true gradient is analytically tractable and is reported as “Exact” in Table 2 for reference. [sent-204, score-0.188]
67 Table 2: The true gradient of the expected return and its MC and BQ estimates for two bandit problems. [sent-237, score-0.437]
68 As shown in Table 2, the BQ estimate has much lower variance than the MC estimate for both small and large sample sizes. [sent-238, score-0.159]
69 The BQ estimate also has a lower bias than the MC estimate for the large sample size (M = 100), and almost the same bias for the small sample size (M = 10). [sent-239, score-0.238]
70 5.2 A Linear Quadratic Regulator In this section, we consider the following linear system in which the goal is to minimize the expected return over 20 steps. [sent-241, score-0.159]
71 Thus, it is an episodic problem with paths of length 20. [sent-242, score-0.105]
72 Rewards: r_t = x_t² + 0.1 a_t²; Transitions: x_{t+1} = x_t + a_t + n_x, n_x ∼ N(0, 0.01). [sent-246, score-0.131]
73 Policy: Actions: a_t ∼ µ(·|x_t; θ) = N(λx_t, σ²); Parameters: θ = (λ, σ). We first compare the BQ and MC estimates of the gradient of the expected return for the policy induced by the parameters λ = −0.2. [sent-247, score-0.836]
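A standalone sketch of this system and policy is given below; it returns the (states, actions, rewards) triple assumed by the earlier sketches. The start-state distribution is not reproduced in this text, so the draw used here is purely illustrative, and the reward line follows the quadratic cost described above.

```python
import numpy as np

def sample_lqr_path(theta, T=20, rng=None):
    """One length-20 path of the LQR problem above, with a_t ~ N(lambda * x_t, sigma^2).

    theta = (lam, sigma). The start-state draw is an illustrative assumption.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam, sigma = theta
    x = rng.normal(0.0, 1.0)                        # assumed start state, not from the text
    states, actions, rewards = [], [], []
    for _ in range(T):
        a = rng.normal(lam * x, sigma)              # policy action
        r = x**2 + 0.1 * a**2                       # quadratic cost described above
        states.append(x); actions.append(a); rewards.append(r)
        x = x + a + rng.normal(0.0, np.sqrt(0.01))  # transition x_{t+1} = x_t + a_t + n_x
    return states, actions, rewards
```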
74 We use several different sample sizes (number of paths used for gradient estimation) M = 5j, j = 1, . . . [sent-249, score-0.375]
75 For each sample size, we compute both the MC and BQ estimates 10^4 times, using the same samples. [sent-253, score-0.094]
76 The true gradient is estimated using MC with 10^7 sample paths for comparison purposes. [sent-254, score-0.348]
77 Figure 1 shows the mean squared error (MSE) (first column), and the mean absolute angular error (second column) of the MC and BQ estimates of the gradient for several different sample sizes. [sent-255, score-0.389]
78 The absolute angular error is the absolute value of the angle between the true gradient and the estimated gradient. [sent-256, score-0.28]
79 In this figure, the BQ gradient estimate was calculated using Model 1 without sparsification. [sent-257, score-0.229]
80 Therefore, we deal with a kernel matrix of size 6 with sparsification versus a kernel matrix of size M = 5j , j = 1, . [sent-262, score-0.13]
81 In Model 2, we can model this by the measurement noise covariance matrix Σ = T σ_r² I, where T = 20 is the path length. [sent-272, score-0.136]
82 Since each reward r_t is a Gaussian random variable with variance σ_r², the return R(ξ) = Σ_{t=0}^{T−1} r_t will also be a Gaussian random variable with variance T σ_r². [sent-273, score-0.226]
83 These experiments indicate that the BQ gradient estimate has lower variance than its MC counterpart. [sent-275, score-0.251]
84 In fact, whereas the performance of the MC estimate improves as 1/M, the performance of the BQ estimate improves at a higher rate. [sent-276, score-0.082]
85 For each algorithm, we show the MSE (left) and the mean absolute angular error (right), as functions of the number of sample paths M . [sent-282, score-0.247]
86 Next, we use BPG to optimize the policy parameters in the LQR problem. [sent-285, score-0.45]
87 Figure 2 shows the performance of the BPG algorithm with the regular (BPG) and the natural (BPNG) gradient estimates, versus a MC-based policy gradient (MCPG) algorithm, for the sample sizes (number of sample paths used for estimating the gradient of a policy) M = 5, 10, 20, and 40. [sent-286, score-1.256]
88 2 computes the Fisher information matrix for each set of policy parameters, an estimate of the natural gradient is provided at little extra cost at each step. [sent-290, score-0.698]
89 The returns obtained by these methods are averaged over 10^4 runs for sample sizes 5 and 10, and over 10^3 runs for sample sizes 20 and 40. [sent-291, score-0.189]
90 The policy parameters are initialized randomly at each run. [sent-292, score-0.45]
91 In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as λ = −1. [sent-293, score-0.466]
92 Figure 2 shows that MCPG performs better than the BPG algorithm for the smallest sample size (M = 5), whereas for larger samples BPG dominates MCPG. [sent-303, score-0.106]
93 For a fixed sample size, each method starts with an initial learning rate, and decreases it according to the schedule αj = α0 (20/(20 + j)). [sent-306, score-0.073]
94 In ML estimation, we assume that the transition probability function is P(x_{t+1}|x_t, a_t) = N(β_1 x_t + β_2 a_t + β_3, β_4²), and then estimate its parameters by observing state transitions. [sent-337, score-0.14]
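This ML step reduces to an ordinary least-squares fit of the next state on (x_t, a_t, 1), with the residual variance giving the ML estimate of β_4²; a minimal sketch is:

```python
import numpy as np

def fit_transition_model(states, actions, next_states):
    """ML fit of P(x'|x, a) = N(b1*x + b2*a + b3, b4^2) from observed transitions."""
    X = np.column_stack([states, actions, np.ones(len(states))])
    y = np.asarray(next_states, dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # (b1, b2, b3)
    b4_sq = float(np.mean((y - X @ coeffs) ** 2))     # ML estimate of the noise variance
    return coeffs, b4_sq
```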
95 Moreover, as we increase the sample size, its performance converges to the performance of the BPG algorithm in which the Fisher information matrix is known (BPG). [sent-339, score-0.073]
96 The average return of the MCPG algorithm is also provided for comparison. [sent-356, score-0.155]
97 6 Discussion In this paper we proposed an alternative approach to conventional frequentist policy gradient estimation procedures, which is based on the Bayesian view. [sent-357, score-0.696]
98 Our algorithms use GPs to define a prior distribution over the gradient of the expected return, and compute the posterior, conditioned on the observed data. [sent-358, score-0.268]
99 2) uses only the posterior mean of the gradient in its updates, we hope that more elaborate algorithms can be devised that would make judicious use of the covariance information provided by the gradient estimation algorithm (Alg. [sent-362, score-0.505]
100 Policy gradient methods for reinforcement learning with function approximation. [sent-394, score-0.256]
wordName wordTfidf (topN-words)
[('bpg', 0.557), ('policy', 0.434), ('mc', 0.378), ('bq', 0.232), ('dm', 0.212), ('gradient', 0.188), ('pr', 0.146), ('bpng', 0.125), ('mcpg', 0.125), ('return', 0.119), ('fisher', 0.116), ('paths', 0.105), ('sparsi', 0.087), ('xt', 0.083), ('cov', 0.079), ('reinforcement', 0.068), ('zm', 0.068), ('bayesian', 0.062), ('lqr', 0.061), ('km', 0.058), ('updates', 0.058), ('sample', 0.055), ('integrand', 0.051), ('bandit', 0.051), ('pg', 0.051), ('path', 0.049), ('mdp', 0.046), ('rewards', 0.043), ('posterior', 0.043), ('ml', 0.042), ('angular', 0.042), ('eval', 0.042), ('hagan', 0.042), ('estimate', 0.041), ('expected', 0.04), ('estimates', 0.039), ('quadrature', 0.036), ('gp', 0.034), ('xm', 0.031), ('frequentist', 0.031), ('alberta', 0.029), ('samples', 0.028), ('measurement', 0.028), ('sizes', 0.027), ('reward', 0.027), ('sutton', 0.026), ('moments', 0.026), ('returns', 0.025), ('estimation', 0.025), ('absolute', 0.025), ('agent', 0.024), ('dxdx', 0.024), ('nx', 0.024), ('kernel', 0.024), ('covariance', 0.024), ('concerning', 0.024), ('size', 0.023), ('dx', 0.023), ('variance', 0.022), ('deg', 0.022), ('humanoid', 0.022), ('evaluating', 0.022), ('rl', 0.022), ('gps', 0.021), ('randomness', 0.021), ('conditioned', 0.02), ('integral', 0.02), ('table', 0.02), ('prior', 0.02), ('unbiased', 0.02), ('mean', 0.02), ('zz', 0.019), ('uncertainty', 0.019), ('rule', 0.019), ('actions', 0.019), ('likelihood', 0.019), ('average', 0.019), ('initial', 0.018), ('subjective', 0.018), ('nr', 0.018), ('violates', 0.018), ('parameterized', 0.018), ('rt', 0.018), ('conventional', 0.018), ('nitions', 0.018), ('mse', 0.018), ('matrix', 0.018), ('integrals', 0.017), ('provided', 0.017), ('noise', 0.017), ('possibly', 0.017), ('thorough', 0.016), ('parameters', 0.016), ('cumulative', 0.016), ('monte', 0.016), ('regular', 0.016), ('ti', 0.016), ('var', 0.016), ('mdps', 0.016), ('uncertain', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
2 0.25606176 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
3 0.24088782 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
4 0.20319398 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
5 0.18952286 154 nips-2006-Optimal Change-Detection and Spiking Neurons
Author: Angela J. Yu
Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1
6 0.15273121 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
7 0.1234615 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
8 0.10442709 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
9 0.095927507 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
10 0.087554 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
11 0.077125371 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
12 0.074848905 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
13 0.067414947 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
14 0.066723578 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
15 0.05674291 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
16 0.05288361 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
17 0.050907623 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
18 0.049104534 159 nips-2006-Parameter Expanded Variational Bayesian Methods
19 0.047389738 98 nips-2006-Inferring Network Structure from Co-Occurrences
20 0.045885418 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
topicId topicWeight
[(0, -0.161), (1, -0.039), (2, -0.283), (3, -0.191), (4, 0.019), (5, -0.048), (6, 0.087), (7, -0.09), (8, 0.361), (9, 0.032), (10, 0.043), (11, -0.028), (12, 0.011), (13, -0.098), (14, -0.067), (15, 0.055), (16, -0.037), (17, -0.069), (18, 0.084), (19, 0.066), (20, -0.008), (21, -0.059), (22, -0.05), (23, -0.045), (24, 0.04), (25, 0.055), (26, 0.064), (27, -0.055), (28, -0.024), (29, -0.024), (30, 0.054), (31, 0.054), (32, -0.046), (33, 0.037), (34, -0.012), (35, 0.017), (36, -0.029), (37, -0.033), (38, -0.027), (39, 0.014), (40, -0.135), (41, 0.023), (42, 0.028), (43, -0.116), (44, 0.054), (45, 0.04), (46, -0.085), (47, -0.025), (48, -0.03), (49, -0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.94637281 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
2 0.71398991 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
3 0.70880991 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
4 0.67169142 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
5 0.62088674 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
Author: Silvia Richter, Douglas Aberdeen, Jin Yu
Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1
6 0.56427294 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
7 0.55084288 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
8 0.48880622 154 nips-2006-Optimal Change-Detection and Spiking Neurons
9 0.43073913 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
10 0.41362613 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
11 0.39361697 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data
12 0.34192047 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
13 0.3213464 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
14 0.31061363 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
15 0.27688316 98 nips-2006-Inferring Network Structure from Co-Occurrences
16 0.26053947 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
17 0.23585141 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
18 0.22966224 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
19 0.22429997 124 nips-2006-Linearly-solvable Markov decision problems
20 0.20265855 159 nips-2006-Parameter Expanded Variational Bayesian Methods
topicId topicWeight
[(1, 0.077), (2, 0.025), (3, 0.03), (7, 0.047), (9, 0.039), (20, 0.032), (22, 0.081), (25, 0.333), (44, 0.058), (57, 0.074), (65, 0.04), (69, 0.026), (71, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.74658012 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
2 0.70143729 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
Author: Jason V. Davis, Inderjit S. Dhillon
Abstract: Gaussian data is pervasive and many learning algorithms (e.g., k-means) model their inputs as a single sample drawn from a multivariate Gaussian. However, in many real-life settings, each input object is best described by multiple samples drawn from a multivariate Gaussian. Such data can arise, for example, in a movie review database where each movie is rated by several users, or in time-series domains such as sensor networks. Here, each input can be naturally described by both a mean vector and covariance matrix which parameterize the Gaussian distribution. In this paper, we consider the problem of clustering such input objects, each represented as a multivariate Gaussian. We formulate the problem using an information theoretic approach and draw several interesting theoretical connections to Bregman divergences and also Bregman matrix divergences. We evaluate our method across several domains, including synthetic data, sensor network data, and a statistical debugging application. 1
3 0.64779073 154 nips-2006-Optimal Change-Detection and Spiking Neurons
Author: Angela J. Yu
Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1
4 0.4362182 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
Author: Jeremy Lewi, Robert Butera, Liam Paninski
Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. 
Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). 
For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. 
Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 
1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. 
In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. 
To update the posterior and choose the optimal stimulus, we use the same procedure as described above.¹

¹ The one difference is that the covariance matrix of p(θ_{t+1}|x_{t+1}, r_{t+1}) is in general no longer just a rank-one modification of the covariance matrix of p(θ_t|x_t, r_t); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of C_t + Q are those of C_t and the eigenvalues are c_i + c, where c_i is the i-th eigenvalue of C_t; thus in this case our methods may be applied without modification.

Results

Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron's receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function, and used these synthetic data to compare how well θ could be estimated using information-maximizing stimuli versus random stimuli. The stimuli were 2-d images, rasterized in order to express x as a vector. The plots of the posterior means µ_t in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information-maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information-maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient.

The running time of each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual-core Intel 2.80 GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1-d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d^2), as predicted. When θ had fewer than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments.

In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine wave, and the spike-history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior over the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike-history coefficients. The information-maximizing strategy outperformed random sampling for estimating both the spike-history and the stimulus coefficients.

Our final set of results, Fig. 3, considers a neuron whose receptive field drifts nonsystematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context.
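For concreteness, the sketch below sets up a toy version of this last experiment. It is not the authors' code: the 1-d Gabor profile, the drift and diffusion variances, and the use of random (rather than information-maximizing) stimuli are our own illustrative choices, the exponential-Poisson form of the GLM response is assumed, and the posterior update itself is elided.

```python
import numpy as np

def gabor_rf(center, n=100, sigma=10.0, freq=0.1):
    """Hypothetical 1-d Gabor profile standing in for the true receptive field theta_t."""
    idx = np.arange(n)
    return np.exp(-(idx - center) ** 2 / (2.0 * sigma ** 2)) * np.cos(2.0 * np.pi * freq * (idx - center))

rng = np.random.default_rng(0)
n, n_trials, c_diff = 100, 800, 0.01    # dimension, number of trials, diffusion variance (Q = c_diff * I)
center = n / 2.0
mu, C = np.zeros(n), np.eye(n)          # Gaussian posterior over theta

for t in range(n_trials):
    theta_t = gabor_rf(center)                      # current (drifting) receptive field
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                          # stand-in for the chosen stimulus
    r = rng.poisson(np.exp(x @ theta_t))            # assumed exponential-Poisson GLM response

    # ... the Gaussian posterior update of (mu, C) given (x, r) would go here ...

    C = C + c_diff * np.eye(n)      # diffusion prediction: with Q = cI the eigenvectors of C are
                                    # unchanged and each eigenvalue grows by c (see footnote 1)
    center += rng.normal(0.0, 1.0)  # random walk of the Gabor center between trials
```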
Conclusion

We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on an information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time, even for high-dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations, the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron's parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods.

Acknowledgments

JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations.

References

[1] I. Nelken, et al., Hearing Research 72, 237 (1994).
[2] P. Foldiak, Neurocomputing 38–40, 1217 (2001).
[3] K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004).
[4] R. C. deCharms, et al., Science 280, 1439 (1998).
[5] C. Machens, et al., Neuron 47, 447 (2005).
[6] A. Watson, et al., Perception and Psychophysics 33, 113 (1983).
[7] L. Paninski, Neural Computation 17, 1480 (2005).
[8] P. McCullagh, et al., Generalized Linear Models (Chapman and Hall, London, 1989).
[9] L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).
[10] E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn.
[11] P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001).
[12] E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001).
[13] F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001).
[14] L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004).
[15] M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994).
[16] N. A. Lesica, et al., IEEE Trans. on Neural Systems and Rehabilitation Engineering 13, 194 (2005).
5 0.4336113 175 nips-2006-Simplifying Mixture Models through Function Approximation
Author: Kai Zhang, James T. Kwok
Abstract: The finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
6 0.43223029 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
7 0.43205348 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
8 0.43132675 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
9 0.4310458 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
10 0.42947298 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.4287425 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
12 0.42829791 79 nips-2006-Fast Iterative Kernel PCA
13 0.42783999 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.42737859 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
15 0.42695877 20 nips-2006-Active learning for misspecified generalized linear models
16 0.42567351 61 nips-2006-Convex Repeated Games and Fenchel Duality
17 0.42458189 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
18 0.42448187 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
19 0.42401415 194 nips-2006-Towards a general independent subspace analysis
20 0.4229801 158 nips-2006-PG-means: learning the number of clusters in data