nips nips2008 knowledge-graph by maker-knowledge-mining

NIPS 2008 knowledge graph


Similar papers computed by the TF-IDF model


Similar papers computed by the LSI model


Similar papers computed by the LDA model


Papers list:

1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an offpolicy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006). Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. a 1 of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. 
The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporaldifference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0). 2 Sub-sampling and i.i.d. formulations of temporal-difference learning In this section we formulate the off-policy policy-evaluation problem for one-step temporaldifference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function: ∞ V π (s) = Eπ γ t−1 rt |s1 = s , (1) t=1 where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn , with n |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn : V π (s) ≈ θ φ(s). 
(2) Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partialobservability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, s1 , a1 , r1 , s2 , a2 , r2 , . . ., which we can break into causally related 4-tuples, (s1 , a1 , r1 , s1 ), 2 (s2 , a2 , r2 , s2 ), . . . , where st = st+1 . For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and sk denote the associated reward and next state. The kth on-policy transition, denoted (sk , rk , sk ), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk ), rk , φ(sk )). The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ P r{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P , where pss = P r{sk = s |sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk |sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk ), rk , φ(sk )) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk , a corresponding sk is generated according to the on-policy state-transition matrix, P , and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk . The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk ), rk , φ(sk )), for k = 1, 2, . . . Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ ) drawn from µ. It remains to define the objective of learning. The TD error for the linear setting is δ = r + γθ φ − θ φ. (3) Given this, we define the one-step linear TD solution as any value of θ at which 0 = E[δφ] = −Aθ + b, (4) where A = E φ(φ − γφ ) and b = E[rφ]. This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed-point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity. 
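To make these quantities concrete, here is a minimal numpy sketch (illustrative names, not the paper's code) of the TD error (3) and of estimating A = E[φ(φ − γφ′)ᵀ] and b = E[rφ] from a batch of on-policy transitions in order to solve for the one-step linear TD solution (4). This direct solve is essentially what LSTD(0) computes, at O(n²) memory and per-sample cost, which is the cost GTD(0) is designed to avoid.

```python
import numpy as np

def td_error(theta, phi, r, phi_next, gamma):
    # delta = r + gamma * theta.phi' - theta.phi   (Eq. 3)
    return r + gamma * theta.dot(phi_next) - theta.dot(phi)

def lstd_solution(transitions, gamma, n):
    """Estimate A = E[phi (phi - gamma phi')^T] and b = E[r phi] from samples
    and return the TD solution theta satisfying A theta = b (Eq. 4).
    `transitions` is an iterable of (phi, r, phi_next) triples."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    k = 0
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
        k += 1
    A /= k
    b /= k
    return np.linalg.solve(A, b)   # assumes A is non-singular, as in the paper
```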
3 The GTD(0) algorithm We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector E[δφ] can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector: J(θ) = E[δφ] E[δφ] . (5) This objective function is quadratic and unimodal; it’s minimum value of 0 is achieved when E[δφ] = 0, which can always be achieved. The gradient of this objective function is θ J(θ) = 2( = 2E φ( θ E[δφ])E[δφ] θ δ) E[δφ] = −2E φ(φ − γφ ) E[δφ] . (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update 3 is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms. First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E φ(φ − γφ ) . This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ ) . Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let Ak be the estimate of A after observing the first k samples, (φ1 , r1 , φ1 ), . . . , (φk , rk , φk ). Then this algorithm is defined by k 1 Ak = φi (φi − γφi ) (7) k i=1 along with the gradient descent rule: θk+1 = θk + αk Ak δk φk , k ≥ 1, (8) where θ1 is arbitrary, δk = rk + γθk φk − θk φk , and αk > 0 is a series of step-size parameters, possibly decreasing over time. We call this algorithm A TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A . Although we find this algorithm interesting, we do not consider it further here because it requires O(n2 ) memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E φ(φ − γφ ) . Let uk denote the estimate of E[δφ] after observing the first k − 1 samples, with u1 = 0. The GTD(0) algorithm is defined by uk+1 = uk + βk (δk φk − uk ) (9) and θk+1 = θk + αk (φk − γφk )φk uk , (10) where θ1 is arbitrary, δk is as in (3) using θk , and αk > 0 and βk > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step. 4 Convergence The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). 
Consider the GTD(0) iteration (9,10) with step-size se∞ ∞ 2 quences αk and βk satisfying βk = ηαk , η > 0, αk , βk ∈ (0, 1], k=0 αk = ∞, k=0 αk < ∞. Further assume that (φk , rk , φk ) is an i.i.d. sequence with uniformly bounded second moments. Let A = E φk (φk − γφk ) and b = E[rk φk ] (note that A and b are well-defined because the distribution of (φk , rk , φk ) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with √ 2n components ρk = (vk , θk ), where vk = uk / η, and a new reward-related vector with 2n components gk+1 = (rk φk , 0 ): √ ρk+1 = ρk + αk η (Gk+1 ρk + gk+1 ) , where Gk+1 = √ − ηI (φk − γφk )φk 4 φk (γφk − φk ) 0 . Let G = E[Gk ] and g = E[gk ]. Note that G and g are well-defined as by the assumption the process {φk , rk , φk }k is i.i.d. In particular, √ − η I −A b G= , g= . 0 A 0 Further, note that (4) follows from Gρ + g = 0, (11) where ρ = (v , θ ). Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρk+1 = ρk + √ √ αk η(Gρk +g+(Gk+1 −G)ρk +(gk+1 −g)) = ρk +αk (h(ρk )+Mk+1 ), where αk = αk η, h(ρ) = g + Gρ and Mk+1 = (Gk+1 − G)ρk + gk+1 − g. Let Fk = σ(ρ1 , M1 , . . . , ρk−1 , Mk ). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h∞ (ρ) = limr→∞ h(rρ)/r is well-defined for every ρ ∈ R2n ; (ii-a) The sequence (Mk , Fk ) is a martingale difference sequence, and (ii-b) for some C0 > 0, E Mk+1 2 | Fk ≤ C0 (1 + ρk 2 ) holds for ∞ any initial parameter vector ρ1 ; (iii) The sequence αk satisfies 0 < αk ≤ 1, k=1 αk = ∞, ∞ 2 ˙ k=1 (αk ) < +∞; and (iv) The ODE ρ = h(ρ) has a globally asymptotically stable equilibrium. Clearly, h(ρ) is Lipschitz with coefficient G and h∞ (ρ) = Gρ. By construction, (Mk , Fk ) satisfies E[Mk+1 |Fk ] = 0 and Mk ∈ Fk , i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the the second moments of (φk , rk , φk ). Condition (iii) is satisfied by our conditions on the step-size sequences αk , βk . Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative. First, let us show that G is non-singular. Using the determinant rule for partitioned matrices1 we get det(G) = det(A A) = 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ = 0 be an eigenvalue of G with corresponding normalized eigenvector x ∈ C2n ; 2 that is, x = x∗ x = 1, where x∗ is the complex conjugate of x. Hence x∗ Gx = λ. Let √ 2 x = (x1 , x2 ), where x1 , x2 ∈ Cn . Using the definition of G, λ = x∗ Gx = − η x1 + x∗ Ax2 − x∗ A x1 . Because A is real, A∗ = A , and it follows that (x∗ Ax2 )∗ = x∗ A x1 . Thus, 1 2 1 2 √ 2 Re(λ) = Re(x∗ Gx) = − η x1 ≤ 0. We are now done if we show that x1 cannot be zero. If x1 = 0, then from λ = x∗ Gx we get that λ = 0, which contradicts with λ = 0. The next result concerns the convergence of GTD(0) when (φk , rk , φk ) is obtained by the off-policy sub-sampling process described originally in Section 2. 
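Before turning to the sub-sampled case, here is a minimal sketch of the GTD(0) iteration (9)–(10) whose convergence Theorem 4.1 establishes. It is not the authors' implementation; variable names and step sizes are illustrative. Forming the final product right-to-left, as noted in Section 3, keeps each update O(n).

```python
import numpy as np

def gtd0_step(theta, u, phi, r, phi_next, alpha, beta, gamma):
    """One GTD(0) update from a single transition (phi, r, phi_next)."""
    # TD error (Eq. 3)
    delta = r + gamma * theta.dot(phi_next) - theta.dot(phi)
    # Auxiliary vector: running estimate of E[delta * phi]  (Eq. 9)
    u = u + beta * (delta * phi - u)
    # Main parameter update, formed right-to-left so it stays O(n)  (Eq. 10)
    theta = theta + alpha * (phi - gamma * phi_next) * phi.dot(u)
    return theta, u
```

With β_k = η α_k and the usual Robbins–Monro conditions on the step sizes, this is exactly the setting covered by Theorem 4.1.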
We make the following assumption: Assumption A1 The behavior policy πb (generator of the actions at ) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process sk is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that sk is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process st ). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process.). Assume A1. Let the parameters θk , uk be updated by (9,10). Further assume that (φk , rk , φk ) is such that E φk 2 |sk−1 , 2 E rk |sk−1 , E φk 2 |sk−1 are uniformly bounded. Assume that the Markov chain (sk ) is aperiodic and irreducible, so that limk→∞ P(sk = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, and let s be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s ) be the reward incurred. Let A = E φ(s)(φ(s) − γφ(s )) and b = E[r(s, s )φ(s)]. Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4), provided that s1 ∼ µ. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[Gk+1 |Fk ] and g = E[gk+1 | Fk ]. 1 R According to this rule, if A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×n , D ∈ Rm×m then for F = [A B; C D] ∈ , det(F ) = det(A) det(D − CA−1 B). (n+m)×(n+m) 5 The condition that (sk ) is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of (sk ) follows from the same property of the original process (st ). For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerable more work the result can be extended to the case when s1 follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form ρk+1 + ρk (h(ρk ) + Mk+1 + ek+1 ), where ek+1 is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)). 5 Extensions to action values, stochastic target policies, and other sample weightings The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy πb is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: φ : S × A → Rn . The learning equations are unchanged, except that φt and φt are redefined as follows: φt = φ(st , at ), (12) φt = (13) π(st+1 , a)φ(st+1 , a). a (We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that st is a state sequence generated by following some stationary policy πb in a finite MDP. 
Let rt be the corresponding sequence of rewards and let φt , φt be given by the respective equations (12) and (13), and assume that E φt 2 |st−1 , 2 E rt |st−1 , E φt 2 |st−1 are uniformly bounded. Let the parameters θt , ut be updated by Equations (9) and (10). Assume that the Markov chain (st ) is aperiodic and irreducible, so that limt→∞ P(st = s |s0 = s) = µ(s ) exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by πb in s, let s be the next state obtained and let a = π(s ) be the action chosen by the target policy in state s . Further, let r(s, a, s ) be the reward incurred in this transition. Let A = E φ(s, a)(φ(s, a) − γφ(s , a )) and b = E[r(s, a, s )φ(s, a)]. Assume that A is non-singular. Then the parameter vector θt converges with probability one to a TD solution (4), provided that s1 is selected from the steady-state distribution µ. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight wt for time step t be equal to the probability π(st , at ) of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components (φk , rk , φk , wk ), with the update rules (9) and (10) replaced by uk+1 = uk + βk (δk φk − uk )wk , (14) θk+1 = θk + αk (φk − γφk )φk uk wk . (15) and Each sample is also weighted by wk in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ. The choice wt = π(st , at ) is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is wt = π(st , at )/πb (st , at ), where πb is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1). 6 6 Related work There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). 
However, it remains the case that for all of such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, E δ 2 ; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.2 Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001). The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-timestep computation that is O(n2 ). Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-complexity to O(n), but are still O(n2 ) in memory. Sparsification methods may reduce the complexity further, they do not help in the general case, and may apply to O(n) methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & M¨ ller 2007). In such cases, O(n2 ) methods are not feasible. u 7 Conclusion GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, 2 For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. 
In this case V (A) and V (B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same). 7 absolute abilities not previous available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that is breaks new ground and achieves important abilities that were previously unattainable. Acknowledgments The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann. Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99. Bertsekas, D. P., Tsitsiklis. J. (1996). Neuro-Dynamic Programming. Athena Scientific, 1996. Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control And Optimization , 38(2):447–469. Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233– 246. Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57. Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362. Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco. Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107-1149. Peters, J., Vijayakumar, S. and Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424. Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18. Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann. Schaeffer, J., Hlynka, M., Jussila, V. (2001). 
Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534. Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058. Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games. Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44. Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211. Sutton, R. S., Rafols, E. J., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.

2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

Author: Laurent E. Ghaoui, Assane Gueye

Abstract: We consider the problem of bounding from above the log-partition function corresponding to second-order Ising models for binary distributions. We introduce a new bound, the cardinality bound, which can be computed via convex optimization. The corresponding error on the log-partition function is bounded above by twice the distance, in model parameter space, to a class of “standard” Ising models, for which variable inter-dependence is described via a simple mean field term. In the context of maximum-likelihood, using the new bound instead of the exact log-partition function, while constraining the distance to the class of standard Ising models, leads not only to a good approximation to the log-partition function, but also to a model that is parsimonious and easily interpretable. We compare our bound with the log-determinant bound introduced by Wainwright and Jordan (2006), and show that when the l1-norm of the model parameter vector is small enough, the latter is outperformed by the new bound.
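For concreteness, the quantity being bounded can be evaluated exactly for very small models. The sketch below is illustrative only (not from the paper): it assumes a second-order Ising parameterization with field h, symmetric couplings W, and spins in {−1, +1}, and computes log Z by enumeration, which is precisely the computation that becomes intractable for large n and that a convex upper bound such as the cardinality bound is meant to replace.

```python
import itertools
import numpy as np

def log_partition_bruteforce(h, W):
    """Exact log Z = log sum_x exp(h.x + x.W.x / 2) over x in {-1,+1}^n.
    Feasible only for small n (2^n terms)."""
    n = len(h)
    energies = []
    for bits in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(bits)
        energies.append(h.dot(x) + 0.5 * x.dot(W).dot(x))
    energies = np.array(energies)
    m = energies.max()                     # log-sum-exp for numerical stability
    return m + np.log(np.exp(energies - m).sum())
```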

3 nips-2008-A Massively Parallel Digital Learning Processor

Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic

Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. 
Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve, and so far required application specific IC (ASIC). Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. With the goal of analyzing different architectures we designed a development and testing environment where the parallel computation is mapped onto FPGA’s. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to lowpower FPGA or chip designs are done in a later stage. 2 Al gori th ms - A ri th meti c - A rch i te ctu re For a substantial improvement over a general purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design, but algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. 2.1 Algorithms Characteristic for machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. Using MapReduce, speedups linear with the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms were described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly, and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient since data access patterns vary greatly between algorithms. 
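As a concrete instance of the vector–matrix pattern identified above, the dominant step in SVM training is evaluating one column of the kernel matrix against all stored training vectors. The sketch below is illustrative only (numpy, an RBF kernel and variable names assumed, not the paper's code); it shows that the bulk of the work is a single matrix–vector product of dimension n × d, which is exactly the operation the accelerator is built around.

```python
import numpy as np

def rbf_kernel_column(X, i, gamma):
    """Column i of the RBF kernel matrix, K[j] = exp(-gamma * ||x_j - x_i||^2).
    X has shape (n, d); the dominant cost is the matrix-vector product X @ x_i."""
    x_i = X[i]
    # ||x_j - x_i||^2 = ||x_j||^2 - 2 x_j.x_i + ||x_i||^2
    dots = X @ x_i                           # n dot products of dimension d
    sq_norms = np.einsum('ij,ij->i', X, X)   # can be precomputed once
    dists = sq_norms - 2.0 * dots + x_i.dot(x_i)
    return np.exp(-gamma * dists)
```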
We analyze this here in more detail for SVM and CNN. These algorithms were chosen, because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. The characteristics of the SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation, only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. Challenging is that for each kernel computation a new data vector has to be loaded 4 into the processor, leading to very high I/O requirements. We consider here dimensions of 10 - 10 5 7 and numbers of training data of 10 - 10 , resulting easily in Gigabytes that need to be transferred to the processors at each iteration. 1 2 3 4 5 6 Operation Initialize all αx, Gx Do Find working set αi, αj Update αi, αj Get 2 columns of kernel matrix Update gradients Gx While not converged Computation 2n IO 2n Unit CPU I I I I I * 2n I*2 I * (2d+2dn) I*n CPU CPU FPGA CPU * * * * 2n 10 2nd n Table 1: Compute- and IO-requirements of each step for SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels). Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiple of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant 2 difference to between the SVM and CNN is the Compute/IO ratio: SVM: ~ 1; CNN: ~ L*k > 100. Therefore the requirements for these two algorithms are very different, and handling both cases efficiently is quite a challenge for an architecture design. Operation Load L kernels For all input pixels Shift in new pixel Multiply kernels Shift out result 1 2 3 4 Computation IO 2 L* k n* m 2 n*m*L*k n*m Unit FPGA FPGA FPGA FPGA FPGA Table 2: Compute- and IO-requirements for CNN computation (forward pass), where l kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative for implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects not shown. 2.2 Arithmetic Hardware can be built much more compactly and runs with lower power dissipation, if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms. 
Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important, not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed with a high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at a low resolution. Key is that the hardware is flexible enough and can take advantage of reduced resolution while handling high resolution where necessary. Problem Adult Forest MNIST NORB Kernel: Float Obj. f. # SV 31,930.77 11,486 653,170.7 49,333 4,960.13 6,172 1,243.71 3,077 F-score 77.58 98.29 99.12 93.34 Kernel: 16 bit fixed point Obj. f. # SV F-score 31,930.1 11,490 77.63 652,758 49,299 98.28 4,959.64 6,166 99.11 1,244.76 3,154 93.26 F-sc. (4b in) NA NA 99.11 92.78 Table 3: Comparison of the results of SVM training when the kernels are represented with floating point numbers (32 or 64 bits) (left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd–even); NORB: n=48,560, d=5,184. We developed a simulator that allows running the training algorithms with various resolutions in each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to 16 bit fixed point representations does not affect the accuracy for any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done in low resolutions (4–16 bit in the factors), but the accumulator needs sufficient resolution to avoid over/under flow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolutions (<16 bit), but the weights must be maintained at a high resolution (float) (see e.g. [12]). In our own evaluations 24 bits in the weights tend to be sufficient. Once the network is trained, for the classification low resolutions can be used for the weights as well (<16 bit). 2.3 A rc h i t e c t u re Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA 128 VPE are configured into four SIMD groups; L-S: Load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU. 
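To give a flavour of the resolution experiments summarized in Table 3, the following sketch mimics a reduced-precision kernel dot product: low-resolution factors, a wide accumulator, and a 16-bit result. It is an illustrative stand-in, not the authors' simulator, and the scale factors and bit widths shown are assumptions.

```python
import numpy as np

def quantize(x, bits, scale):
    """Round x onto a signed fixed-point grid with the given number of bits."""
    q = np.round(np.asarray(x) * scale).astype(np.int64)
    lim = 2 ** (bits - 1) - 1
    return np.clip(q, -lim - 1, lim)

def fixed_point_dot(a, b, in_bits=4, in_scale=7.0, out_bits=16, out_scale=2 ** 10):
    """Dot product with low-resolution factors and a wide accumulator
    (64-bit here, 48-bit in the hardware), result reduced to out_bits."""
    qa = quantize(a, in_bits, in_scale)
    qb = quantize(b, in_bits, in_scale)
    acc = np.sum(qa * qb)                    # wide integer accumulator
    result = acc / (in_scale * in_scale)     # back to real units
    return quantize(result, out_bits, out_scale) / out_scale
```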
Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited for vector processing, and simple vector processing elements (VPE) with fixed-point arithmetic can handle the operations. Since typically large blocks of data are processed with the same operation, groups of VPE can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the highvolume, low precision parts onto the vector accelerators and parts requiring high precision arithmetic onto the CPU. The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports for each group of 32 VPE. By connecting multiple of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPE. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local between a group of VPE and its own memory block. Avoiding movements of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data and the vector dimensionality (see below). A few hundred VPE per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections. 3 I mp l e men tati on of th e P 3 A rch i t ectu re This architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processors (DSP) units and over 1,000 IO pins for data transfers. The boards used here contain each one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM with a total of 1GB, and 2 independent 4MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP with a maximum speed of 550MHz, which corresponds to a theoretical compute-performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230MHz or less for our designs). Nevertheless, we obtain high performances because we can use a large number of DSP units for executing the main computation. The main architecture features are: • Parallel processing (on one chip): 128 VPE (hardware DSP) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set. • Custom Precision: Data are represented with 1 to 16 bit resolution. 
Higher resolutions are possible by operating multiple DSP as one processor. • Overlapping Computation and Communication: CPU-FPGA communication is overlapped with the FPGA computation. • Overlap Memory Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. • High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12GB/s per chip). • • Streaming Data Flow, Simple Access Patterns: Load/store units are tailored for streaming input and output data, and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip. Load/store with (de)compression: For an increase of effective IO bandwidth the load/store units provide compression and decompression in hardware. Figure 2 shows the configuration of the VPEs for vector dot product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst-mode, achieving a throughput of close to one memory word per cycle. But the speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, then the speed becomes compute-bound. Figure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6, alternatively receive results or stream them back to the host. The right side shows how a group of VPE is pipelined to improve clock speed. The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, where operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks, for more efficiency and lower power dissipation. The VPE are daisy-chained and operate as systolic array. In this way we can take advantage of the high computation to IO ratio (Table 2) to reduce the data transfers from memory. 4 E val u ati on s We evaluated SVM training and classification with the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative for applications in image and video analysis. The computation is split between CPU and FPGA as indicated by Table 1. The DDR2 memory banks are clocked at 230MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth. On the FPGA they are decompressed first and distributed to the VPE. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPE faster than 115MHz does not improve performance. A VPE executes a multiplication plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute-rate is lower, about 9.4 GMACS, due to overhead (see Table 4). 
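The peak rates quoted above follow from simple arithmetic on the stated VPE count and clock rates; the short check below is illustrative (the classification peak is implied rather than stated in the text, and the sustained rates of 9.4 and 43.0 GMACS are lower because of overhead).

```python
vpe = 128                      # vector processing elements per FPGA
train_clock = 0.115e9          # Hz; VPE clock during training (IO-bound)
clf_clock = 0.230e9            # Hz; VPE clock during classification
macs_per_cycle_train = 1       # one multiply-accumulate per VPE per cycle
macs_per_cycle_clf = 2         # two 4-bit multiply-accumulates per DSP per cycle

peak_train = vpe * macs_per_cycle_train * train_clock   # ~14.7e9 MAC/s
peak_clf = vpe * macs_per_cycle_clf * clf_clock         # ~58.9e9 MAC/s (implied)
print(peak_train / 1e9, peak_clf / 1e9)
```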
The computation on the host CPU overlaps with that on the FPGA, and has no effect on the speed in the experiments shown here. For the classification the VPE can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in speed that is more than four times higher and a sustained 43.0 GMACS limited by the number and speed of the VPE. Adding a second FPGA card doubles the speed, showing little saturation effects yet, but for more FPGA per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical. # 60k 2M Iterations 8,000 266,900 CPU time 754s -- speed 0.5 -- CPU+MMX time speed 240 s 1.57 531,534 s 1.58 CPU+FPGA time speed 40 s 9.42 88,589 s 9.48 CPU+2 FPGA time speed 21 s 17.9 48,723 s 17.2 Table 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2GHz; CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (just kernel computations). Training algorithm: SMO with second order working set selection. Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10] different bounds for stopping were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because based on the number of iterations we can compute the average GMACS for the kernel computations. As can be seen in Table 5 a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPE. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPE, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped on the FPGA as well and will increase the power dissipation only minimally. Number Clock Operand Power Average of cores speed type dissipation compute speed CPU (Opteron) 1 2.2 GHz float 40 W 0.5 GMACS GPU (from [10]) 128 1.35 GHz float 80 W 7.4 GMACS Cluster (from [11]) 384 1.6 GHz byte > 1 kW 54 GMACS FPGA 128 0.12 GHz 4 bit nibble 9W 9.4 GMACS Table 5: Comparison of performances for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual core CPU (Athlon), 384 MMX units. The GPU was training with 60k samples ([10], table 2, second order), the cluster trained with 2 million samples. Processor Figure 3: Acceleration of SVM training as a function of the number of VPE. MNIST n: 2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPE are experimental, the higher ones are simulations. Curves MNIST, NORB: Multiple FPGA are attached to one CPU. Curve MNIST C: Each FPGA is attached to a separate host CPU. Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because then the CPU and FPGA computation times become comparable. 
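The saturation just described can be reproduced with a simple overlap model, stated here as an assumption rather than as the authors' simulator: the FPGA portion of an iteration shrinks with the number of boards while the host-CPU portion stays fixed, so acceleration flattens once the two become comparable.

```python
def acceleration(n_fpga, t_cpu, t_fpga_one):
    """Speedup relative to one FPGA when the FPGA work parallelizes
    perfectly across boards and overlaps with a fixed CPU part."""
    t_one = max(t_cpu, t_fpga_one)            # CPU and FPGA work overlap
    t_n = max(t_cpu, t_fpga_one / n_fpga)
    return t_one / t_n

# Example: if the FPGA part is 8x the CPU part per iteration,
# two boards still double throughput, but gains saturate beyond ~8 boards.
for n in (1, 2, 4, 8, 16):
    print(n, round(acceleration(n, t_cpu=1.0, t_fpga_one=8.0), 2))
```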
For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable for more than 4 FPGAs. Alternatively, a system can be scaled by grouping multiple CPUs, each with one attached FPGA accelerator. Then the scaling follows a linear or even super-linear acceleration (MNIST C) to several thousand VPEs. If the CPUs are working in a cluster arrangement, the scaling is similar to the one described in [11].
For convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPEs to operate as a systolic array. In this way convolutions can be implemented with minimal data movements. In addition to the convolution, the sub-sampling and non-linear functions, plus the logistics to handle multiple layers with arbitrary numbers of kernels in each layer, are also done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPEs. Clocked at 115 MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead, the sustained speed is about 10 GMACS.
5 Conclusions
By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications.
References
[1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781.
[2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation. Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida.
[3] Neil, P. (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88.
[4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture. In Proc. 7th KES 2003, Oxford, pp. 169-176.
[5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027-3030.
[6] Pedersen, R., Schoeberl, M. (2006) An Embedded Support Vector Machine. WISE 2006.
[7] Dey, S., Kedia, M., Agarwal, N., Basu, A. Embedded Support Vector Machine: Architectural Enhancements and Evaluation. In Proc. 20th Int. Conf. VLSI Design.
[8] Anguita, D., Boni, A., Ridella, S. (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation. IEEE Trans. Neural Networks, 14/5, pp. 993-1009.
[9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems 19, MIT Press.
[10] Catanzaro, B., Sundaram, N. & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors. Proc. 25th Int. Conf. Machine Learning, pp. 104-111.
[11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press.
[12] Simard, P. & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. Alspector (eds.), Neural Information Processing Systems 6, pp. 232-239, Morgan Kaufmann.
[13] Savich, A., Moussa, M., Areibi, S. (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study. IEEE Trans. Neural Networks, 18/1, pp. 240-252.

4 nips-2008-A Scalable Hierarchical Distributed Language Model

Author: Andriy Mnih, Geoffrey E. Hinton

Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
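The speed-up that tree-structured models of this kind obtain can be illustrated with a small sketch (ours, not the authors' code; the tree, node indices, and signs below are hypothetical): the probability of a word is a product of binary sigmoid decisions along the root-to-leaf path in the word tree, so scoring one word costs O(log V) rather than O(V).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_word_prob(context_vec, path_nodes, path_signs, node_vectors):
    """P(word | context) under a binary word tree (generic hierarchical-softmax sketch).
    path_nodes: internal-node indices on the root-to-leaf path for this word.
    path_signs: +1/-1 indicating which child is taken at each node.
    node_vectors: one learned vector per internal node."""
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * node_vectors[node] @ context_vec)
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_internal = 50, 7                      # a toy tree over 8 words has 7 internal nodes
    node_vectors = rng.normal(size=(n_internal, d))
    context = rng.normal(size=d)
    # hypothetical path for one word: three binary decisions from root to leaf
    print(hierarchical_word_prob(context, path_nodes=[0, 1, 4], path_signs=[+1, -1, +1],
                                 node_vectors=node_vectors))
```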

5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

Author: Massih Amini, Nicolas Usunier, François Laviolette

Abstract: We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associated Gibbs classifier. The bound is tight when the Gibbs bound is tight and when the errors of the majority-vote classifier are concentrated in a zone of low margin. In semi-supervised learning, treating the margin as an indicator of confidence constitutes the working hypothesis of algorithms that search for the decision boundary in low-density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the set of unlabeled training examples that have their margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually. 1
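The general shape of such a margin-thresholded self-learning loop can be sketched as follows (a generic illustration of ours, assuming scikit-learn and a fixed threshold; the paper instead derives the threshold from its transductive bound and uses vote margins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def margin_self_training(X_lab, y_lab, X_unlab, threshold=0.8, rounds=10):
    """Generic self-training with a margin/confidence threshold (illustrative only)."""
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        margin = np.abs(proba[:, 1] - proba[:, 0])      # confidence proxy for the margin
        confident = margin >= threshold
        if not confident.any():
            break
        # move confidently pseudo-labeled examples into the labeled set
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.predict(X_u[confident])])
        X_u = X_u[~confident]
    return clf.fit(X_l, y_l)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
    model = margin_self_training(X[:20], y[:20], X[20:])
    print(model.score(X, y))
```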

6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context

Author: Abhinav Gupta, Jianbo Shi, Larry S. Davis

Abstract: We present an approach that combines bag-of-words and spatial models to perform semantic and syntactic analysis for recognition of an object based on its internal appearance and its context. We argue that while object recognition requires modeling the relative spatial locations of image features within the object, a bag-of-words model is sufficient for representing context. Learning such a model from weakly labeled data involves labeling features into two classes: foreground (object) or “informative” background (context). We present a “shape-aware” model which utilizes contour information for efficient and accurate labeling of features in the image. Our approach iterates between an MCMC-based labeling and a contour-based labeling of features to integrate co-occurrence of features and shape similarity. 1

7 nips-2008-A computational model of hippocampal function in trace conditioning

Author: Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

Abstract: We introduce a new reinforcement-learning model for the role of the hippocampus in classical conditioning, focusing on the differences between trace and delay conditioning. In the model, all stimuli are represented both as unindividuated wholes and as a series of temporal elements with varying delays. These two stimulus representations interact, producing different patterns of learning in trace and delay conditioning. The model proposes that hippocampal lesions eliminate long-latency temporal elements, but preserve short-latency temporal elements. For trace conditioning, with no contiguity between cue and reward, these long-latency temporal elements are necessary for learning adaptively timed responses. For delay conditioning, the continued presence of the cue supports conditioned responding, and the short-latency elements suppress responding early in the cue. In accord with the empirical data, simulated hippocampal damage impairs trace conditioning, but not delay conditioning, at medium-length intervals. With longer intervals, learning is impaired in both procedures, and, with shorter intervals, in neither. In addition, the model makes novel predictions about the response topography with extended cues or post-training lesions. These results demonstrate how temporal contiguity, as in delay conditioning, changes the timing problem faced by animals, rendering it both easier and less susceptible to disruption by hippocampal lesions.
The hippocampus is an important structure in many types of learning and memory, with prominent involvement in spatial navigation, episodic and working memories, stimulus configuration, and contextual conditioning. One empirical phenomenon that has eluded many theories of the hippocampus is the dependence of aversive trace conditioning on an intact hippocampus (but see Rodriguez & Levy, 2001; Schmajuk & DiCarlo, 1992; Yamazaki & Tanaka, 2005). For example, trace eyeblink conditioning disappears following hippocampal lesions (Solomon et al., 1986; Moyer, Jr. et al., 1990), induces hippocampal neurogenesis (Gould et al., 1999), and produces unique activity patterns in hippocampal neurons (McEchron & Disterhoft, 1997). In this paper, we present a new abstract computational model of hippocampal function during trace conditioning. We build on a recent extension of the temporal-difference (TD) model of conditioning (Ludvig, Sutton & Kehoe, 2008; Sutton & Barto, 1990) to demonstrate how the details of stimulus representation can qualitatively alter learning during trace and delay conditioning. By gently tweaking this stimulus representation and reducing long-latency temporal elements, trace conditioning is severely impaired, whereas delay conditioning is hardly affected. In the model, the hippocampus is responsible for maintaining these long-latency elements, thus explaining the selective importance of this brain structure in trace conditioning. The difference between trace and delay conditioning is one of the most basic operational distinctions in classical conditioning (e.g., Pavlov, 1927). Figure 1 is a schematic of the two training procedures. In trace conditioning, a conditioned stimulus (CS) is followed some time later by a reward or unconditioned stimulus (US); the two stimuli are separated by a stimulus-free gap.
Figure 1: Event timelines in trace and delay conditioning. Time flows from left to right in the diagram. A vertical bar represents a punctate (short) event, and the extended box is a continuously available stimulus. In delay conditioning, the stimulus and reward overlap, whereas, in trace conditioning, there is a stimulus-free gap between the two punctate events.
In contrast, in delay conditioning, the CS remains on until presentation of the US. Trace conditioning is learned more slowly than delay conditioning, with poorer performance often observed even at asymptote. In both eyeblink conditioning (Moyer, Jr. et al., 1990; Solomon et al., 1986; Tseng et al., 2004) and fear conditioning (e.g., McEchron et al., 1998), hippocampal damage severely impairs the acquisition of conditioned responding during trace conditioning, but not delay conditioning. These selective hippocampal deficits with trace conditioning are modulated by the inter-stimulus interval (ISI) between CS onset and US onset. With very short ISIs (∼300 ms in eyeblink conditioning in rabbits), there is little deficit in the acquisition of responding during trace conditioning (Moyer, Jr. et al., 1990). Furthermore, with very long ISIs (>1000 ms), delay conditioning is also impaired by hippocampal lesions (Beylin et al., 2001). These interactions between ISI and the hippocampal dependency of conditioning are the primary data that motivate the new model.
1 TD Model of Conditioning
Our full model of conditioning consists of three separate modules: the stimulus representation, learning algorithm, and response rule. The explanation of hippocampal function relies mostly on the details of the stimulus representation. To illustrate the implications of these representational issues, we have chosen the temporal-difference (TD) learning algorithm from reinforcement learning (Sutton & Barto, 1990, 1998) that has become the sine qua non for modeling reward learning in dopamine neurons (e.g., Ludvig et al., 2008; Schultz, Dayan, & Montague, 1997), and a simple, leaky-integrator response rule described below. We use these for simplicity and consistency with prior work; other learning algorithms and response rules might also yield similar conclusions.
1.1 Stimulus Representation
In the model, stimuli are not coherent wholes, but are represented as a series of elements or internal microstimuli. There are two types of elements in the stimulus representation: the first is the presence microstimulus, which is exactly equivalent to the external stimulus (Sutton & Barto, 1990). This microstimulus is available whenever the corresponding stimulus is on (see Fig. 3). The second type of elements are the temporal microstimuli or spectral traces, which are a series of successively later and gradually broadening elements (see Grossberg & Schmajuk, 1989; Machado, 1997; Ludvig et al., 2008). Below, we show how the interaction between these two types of representational elements produces different styles of learning in delay and trace conditioning, resulting in differential sensitivity of these procedures to hippocampal manipulation. The temporal microstimuli are created in the model through coarse coding of a decaying memory trace triggered by stimulus onset. Figure 2 illustrates how this memory trace (left panel) is encoded by a series of basis functions evenly spaced across the height of the trace (middle panel). Each basis function effectively acts as a receptive field for trace height: as the memory trace fades, different basis functions become more or less active, each with a particular temporal profile (right panel).
These activity profiles for the temporal microstimuli are then used to generate predictions of the US. For the basis functions, we chose simple Gaussians:

$f(y, \mu, \sigma) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right).$  (1)

Figure 2: Creating microstimuli. The memory traces for a stimulus (left) are coarsely coded by a series of temporal basis functions (middle). The resultant time courses (right) of the temporal microstimuli are used to predict future occurrence of the US. A single basis function (middle) and approximately corresponding microstimulus (right) have been darkened. The inset in the right panel shows the levels of several microstimuli at the time indicated by the dashed line.

Given these basis functions, the microstimulus levels $x_t(i)$ at time $t$ are determined by the corresponding memory trace height:

$x_t(i) = f(y_t, i/m, \sigma)\, y_t,$  (2)

where $f$ is the basis function defined above and $m$ is the number of temporal microstimuli per stimulus. The trace level $y_t$ was set to 1 at stimulus onset and decreased exponentially, controlled by a single decay parameter, which was allowed to vary to simulate the effects of hippocampal lesions. Every stimulus, including the US, was represented by a single memory trace and resultant microstimuli.
1.2 Hippocampal Damage
We propose that hippocampal damage results in the selective loss of the long-latency temporal elements of the stimulus representation. This idea is implemented in the model through a decrease in the memory decay constant from .985 to .97, approximately doubling the decay rate of the memory trace that determines the microstimuli. In effect, we assume that hippocampal damage results in a memory trace that decays more quickly, or, equivalently, is more susceptible to interference. Figure 3 shows the effects of this parameter manipulation on the time course of the elements in the stimulus representation. The presence microstimulus is not affected by this manipulation, but the temporal microstimuli are compressed for both the CS and the US. Each microstimulus has a briefer time course, and, as a group, they cover a shorter time span. Other means for eliminating or reducing the long-latency temporal microstimuli are certainly possible and would likely be compatible with our theory. For example, if one assumes that the stimulus representation contains multiple memory traces with different time constants, each with a separate set of microstimuli, then eliminating the slower memory traces would also remove the long-latency elements, and many of the results below hold (simulations not shown). The key point is that hippocampal damage reduces the number and magnitude of long-latency microstimuli.

Figure 3: Hippocampal effects on the stimulus representation. The left panel presents the stimulus representation in delay conditioning with the normal parameter settings, and the right panel presents the altered stimulus representation following simulated hippocampal damage. In the hippocampal representation, the temporal microstimuli for both CS (red, solid lines) and US (green, dashed lines) are all briefer and shallower. The presence microstimuli (blue square wave and black spike) are not affected by the hippocampal manipulation.
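To make the stimulus representation of Equations 1-2 concrete, here is a minimal sketch (ours, not the authors' code; function names and the particular normalizer are our reading of Equation 1) that builds the presence microstimulus and the temporal microstimuli from an exponentially decaying memory trace, with the decay constant dropped from .985 to .97 to mimic the simulated hippocampal damage:

```python
import numpy as np

def gaussian_basis(y, mu, sigma):
    # Equation 1: a Gaussian receptive field over memory-trace height.
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi)

def microstimuli(n_steps, onset, m=50, sigma=0.08, decay=0.985, stim_off=None):
    """Presence microstimulus plus m temporal microstimuli for one stimulus.
    decay=.985 is the normal model; decay=.97 mimics hippocampal damage."""
    presence = np.zeros(n_steps)
    temporal = np.zeros((n_steps, m))
    centers = np.arange(1, m + 1) / m        # basis-function centers i/m, i = 1..m
    y = 0.0
    for t in range(n_steps):
        if t == onset:
            y = 1.0                          # memory trace reset to 1 at stimulus onset
        elif t > onset:
            y *= decay                       # exponential decay of the memory trace
        if stim_off is not None and onset <= t < stim_off:
            presence[t] = 1.0                # presence microstimulus tracks the stimulus itself
        temporal[t] = gaussian_basis(y, centers, sigma) * y   # Equation 2
    return presence, temporal

if __name__ == "__main__":
    pres, temp = microstimuli(n_steps=150, onset=10, stim_off=15, decay=0.985)
    pres_hpc, temp_hpc = microstimuli(n_steps=150, onset=10, stim_off=15, decay=0.97)
    print(temp.shape, round(float(temp.max()), 3), round(float(temp_hpc.max()), 3))
```

With the faster decay the late, broad microstimuli shrink, which is the "compressed" representation described above.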
1.3 Learning and Responding
The model approaches conditioning as a reinforcement-learning prediction problem, wherein the agent tries to predict the upcoming rewards or USs. The model learns through linear TD(λ) (Ludvig et al., 2008; Schultz et al., 1997; Sutton, 1988; Sutton & Barto, 1990, 1998). At each time step, the US prediction ($V_t$) is determined by:

$V_t(\mathbf{x}) = \big[\mathbf{w}_t^{\top} \mathbf{x}\big]_0 = \Big[\textstyle\sum_{i=1}^{n} w_t(i)\, x(i)\Big]_0,$  (3)

where $\mathbf{x}$ is a vector of the activation levels $x(i)$ for the various microstimuli, $\mathbf{w}_t$ is a corresponding vector of adjustable weights $w_t(i)$ at time step $t$, and $n$ is the total number of all microstimuli. The US prediction is constrained to be non-negative, with negative values rectified to 0. As is standard in TD models, this US prediction is compared to the reward received and the previous US prediction to generate a TD error ($\delta_t$):

$\delta_t = r_t + \gamma V_t(\mathbf{x}_t) - V_t(\mathbf{x}_{t-1}),$  (4)

where $\gamma$ is a discount factor that determines the temporal horizon of the US prediction. This TD error is then used to update the weight vector based on the following update rule:

$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{e}_t,$  (5)

where $\alpha$ is a step-size parameter and $\mathbf{e}_t$ is a vector of eligibility trace levels (see Sutton & Barto, 1998), which together help determine the speed of learning. Each microstimulus has its own corresponding eligibility trace, which continuously decays but accumulates whenever that microstimulus is present:

$\mathbf{e}_{t+1} = \gamma \lambda \mathbf{e}_t + \mathbf{x}_t,$  (6)

where $\gamma$ is the discount factor as above and $\lambda$ is a decay parameter that determines the plasticity window. These US predictions are translated into responses through a simple, thresholded leaky-integrator response rule:

$a_{t+1} = \upsilon a_t + \big[V_{t+1}(\mathbf{x}_t)\big]_{\theta},$  (7)

where $\upsilon$ is a decay constant, and $\theta$ is a threshold on the value function $V$. Our model is defined by Equations 1-7 and 7 additional parameters, which were fixed at the following values for the simulations below: λ = .95, α = .005, γ = .97, n = 50, σ = .08, υ = .93, θ = .25. In the simulated experiments, one time step was interpreted as 10 ms.
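A compact simulation loop for Equations 3-7 might look like the following (again a sketch of ours, not the authors' code; it assumes a matrix X whose row t holds all microstimulus levels at time step t, for which the representation sketch above could be used, and it adopts one reading of the thresholding operator in Equation 7):

```python
import numpy as np

def run_trial(X, r, w, alpha=0.005, gamma=0.97, lam=0.95, upsilon=0.93, theta=0.25):
    """One trial of linear TD(lambda) with eligibility traces (Eqs. 3-6) and a
    thresholded leaky-integrator response (Eq. 7). X: (T, n) microstimulus levels;
    r: (T,) US/reward signal. Returns updated weights, predictions, and responses."""
    T, n = X.shape
    e = np.zeros(n)
    V = np.zeros(T)
    a = np.zeros(T)
    x_prev = np.zeros(n)
    for t in range(T):
        x = X[t]
        v_now = max(0.0, w @ x)                  # Eq. 3: rectified US prediction V_t(x_t)
        v_prev = max(0.0, w @ x_prev)            # V_t(x_{t-1}) under the current weights
        delta = r[t] + gamma * v_now - v_prev    # Eq. 4: TD error
        w = w + alpha * delta * e                # Eq. 5: weight update (uses e_t)
        e = gamma * lam * e + x                  # Eq. 6: eligibility traces -> e_{t+1}
        gated = v_now if v_now >= theta else 0.0 # one reading of the [.]_theta operator
        a[t] = upsilon * (a[t - 1] if t > 0 else 0.0) + gated   # Eq. 7: response
        V[t] = v_now
        x_prev = x
    return w, V, a

if __name__ == "__main__":
    # Stand-in stimulus representation: a one-hot "complete serial compound",
    # not the paper's microstimuli (swap in the representation sketch above).
    T = 100
    X = np.eye(T)
    r = np.zeros(T)
    r[60] = 1.0                                  # US at step 60 (one step = 10 ms)
    w = np.zeros(T)
    for _ in range(500):                         # 500 acquisition trials
        # step size raised above the paper's .005 purely so this tiny demo converges
        w, V, a = run_trial(X, r, w, alpha=0.05)
    print(np.round(V[55:62], 2))                 # prediction ramps up toward the US time
```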
Figure 4: Learning in the model for trace and delay conditioning with and without hippocampal (HPC) damage. The three panels present training with different interstimulus intervals (ISI).

2 Results
We simulated 12 total conditions with the model: trace and delay conditioning, both with and without hippocampal damage, for short (250 ms), medium (500 ms), and long (1000 ms) ISIs. Each simulated experiment was run for 500 trials, with every 5th trial an unreinforced probe trial, during which no US was presented. For delay conditioning, the CS lasted the same duration as the ISI and terminated with US presentation. For trace conditioning, the CS was present for 5 time steps (50 ms). The US always lasted for a single time step, and an inter-trial interval of 5000 ms separated all trials (onset to onset). Conditioned responding (CR magnitude) was measured as the maximum height of the response curve on a given trial.
Figure 4 summarizes our results. The figure depicts how the CR magnitude changed across the 500 trials of acquisition training. In general, trace conditioning produced lower levels of responding than delay conditioning, but this effect was most pronounced with the longest ISI. The effects of simulated hippocampal damage varied with the ISI. With the shortest ISI (250 ms; left panel), there was little effect on responding in either trace or delay conditioning. There was a small deficit early in training with trace conditioning, but this difference disappeared quickly with further training. With the longest ISI (1000 ms; right panel), there was a profound effect on responding in both trace and delay conditioning, with trace conditioning completely eliminated. The intermediate ISI (500 ms; middle panel) produced the most complex and interesting results. With this interval, there was only a minor deficit in delay conditioning, but a substantial drop in trace conditioning, especially early in training. This pattern of results roughly matches the empirical data, capturing the selective deficit in trace conditioning caused by hippocampal lesions (Solomon et al., 1986) as well as the modulation of this deficit by ISI (Beylin et al., 2001; Moyer, Jr. et al., 1990).

Figure 5: Time course of US prediction and CR magnitude for both trace (red, dashed line) and delay conditioning (blue, solid line) with a 500-ms ISI.

These differences in sensitivity to simulated hippocampal damage arose despite similar model performance during normal trace and delay conditioning. Figure 5 shows the time course of the US prediction (left panel) and CR magnitude (right panel) after trace and delay conditioning on a probe trial with a 500-ms ISI. In both instances, the US prediction grew throughout the trial as the usual time of the US became imminent. Note the sharp drop-off in US prediction for delay conditioning exactly as the CS terminates. This change reflects the disappearance of the presence microstimulus, which supports much of the responding in delay conditioning (see Fig. 6). In both procedures, even after the usual time of the US (and CS termination in the case of delay conditioning), there was still some residual US prediction. These US predictions were caused by the long-latency microstimuli, which did not disappear exactly at CS offset, and were ordinarily (on non-probe trials) countered by negative weights on the US microstimuli. The CR magnitude tracked the US prediction curve quite closely, peaking around the time the US would have occurred for both trace and delay conditioning. There was little difference in either curve between trace and delay conditioning, yet altering the stimulus representation (see Fig. 3) had a more pronounced effect on trace conditioning. An examination of the weight distribution for trace and delay conditioning explains why hippocampal damage had a more pronounced effect on trace than delay conditioning. Figure 6 depicts some representative microstimuli (left column) as well as their corresponding weights (right columns) following trace or delay conditioning with or without simulated hippocampal damage. For clarity in the figure, we have grouped the weights into four categories: positive (+), large positive (+++), negative (-), and large negative (--). The left column also depicts how the model poses the computational problem faced by an animal during conditioning; the goal is to sum together weighted versions of the available microstimuli to produce the ideal US prediction curve in the bottom row.
In normal delay conditioning, the model placed a high positive weight on the presence microstimulus, but balanced that with large negative weights on the early CS microstimuli, producing a prediction topography that roughly matched the ideal prediction (see Fig. 5, left panel). In normal trace conditioning, the model only placed a small positive weight on the presence microstimulus, but supplemented that with large positive weights on both the early and late CS microstimuli, also producing a prediction topography that roughly matched the ideal prediction.

Figure 6: Schematic of the weights (right columns) on various microstimuli following trace and delay conditioning. The left column illustrates four representative microstimuli: the presence microstimulus, an early CS microstimulus, a late CS microstimulus, and a US microstimulus. The ideal prediction is the expectation of the sum of future discounted rewards.

Weights                   Normal             HPC Lesion
                          Delay    Trace     Delay    Trace
CS presence stimulus      +++      +         +++      +
CS early microstimuli     --       +         --       +
CS late microstimuli      +        +++       N/A      N/A
US early microstimuli     -        --        -        -

Following hippocampal lesions, the late CS microstimuli were no longer available (N/A), and the system could only use the other microstimuli to generate the best possible prediction profile. In delay conditioning, the loss of these long-latency microstimuli had a small effect, notable only with the longest ISI (1000 ms) with these parameter settings. With trace conditioning, the loss of the long-latency microstimuli was catastrophic, as these microstimuli were usually the major basis for the prediction of the upcoming US. As a result, trace conditioning became much more difficult (or impossible in the case of the 1000-ms ISI), even though delay conditioning was less affected. The most notable (and defining) difference between trace and delay conditioning is that the CS and US overlap in delay conditioning, but not trace conditioning. In our model, this overlap is necessary, but not sufficient, for the unique interaction between the presence microstimulus and temporal microstimuli in delay conditioning. For example, if the CS were extended to stay on beyond the time of US occurrence, this contiguity would be maintained, but negative weights on the early CS microstimuli would not suffice to suppress responding throughout this extended CS. In this case, the best solution to predicting the US for the model might be to put high weights on the long-latency temporal microstimuli (as in trace conditioning; see Fig. 6), which would not persist as long as the now extended presence microstimulus. Indeed, with a CS that was three times as long as the ISI, we found that the US prediction, CR magnitude, and underlying weights were completely indistinguishable from trace conditioning (simulations not shown). Thus, the model predicts that this extended delay conditioning should be as sensitive to hippocampal damage as trace conditioning for the same ISIs. This empirical prediction is a fundamental test of the representational assumptions underlying the model. The particular mechanism that we chose for simulating the loss of the long-latency microstimuli (increasing the decay rate of the memory trace) also leads to a testable model prediction.
If one were to pre-train an animal with trace conditioning and then perform hippocampal lesions, there should be some loss of responding, but, more importantly, those CRs that do occur should appear earlier in the interval because the temporal microstimuli now follow a shorter time course (see Fig. 3). There is some evidence for additional short-latency CRs during trace conditioning in lesioned animals (e.g., Port et al., 1986; Solomon et al., 1986), but, to our knowledge, this precise model prediction has not been rigorously evaluated.
3 Discussion and Conclusion
We evaluated a novel computational model for the role of the hippocampus in trace conditioning, based on a reinforcement-learning framework. We extended the microstimulus TD model presented by Ludvig et al. (2008) by suggesting a role for the hippocampus in maintaining long-latency elements of the temporal stimulus representation. The current model also introduced an additional element to the stimulus representation (the presence microstimulus) and a simple response rule for translating prediction into actions; we showed how these subtle innovations yield interesting interactions when comparing trace and delay conditioning. In addition, we adduced a pair of testable model predictions about the effects of extended stimuli and post-training lesions. There are several existing theories for the role of the hippocampus in trace conditioning, including the modulation of timing (Solomon et al., 1986), establishment of contiguity (e.g., Wallenstein et al., 1998), and overcoming of task difficulty (Beylin et al., 2001). Our new model provides a computational mechanism that links these three proposed explanations. In our model, for similar ISIs, delay conditioning requires learning to suppress responding early in the CS, whereas trace conditioning requires learning to create responding later in the trial, near the time of the US (see Fig. 6). As a result, for the same ISI, delay conditioning requires changing weights associated with earlier microstimuli than trace conditioning, though in opposite directions. These early microstimuli reach higher activation levels (see Fig. 2), producing higher eligibility traces, and are therefore learned about more quickly. This differential speed of learning for short-latency temporal microstimuli corresponds with much behavioural data showing that shorter ISIs tend to improve both the speed and asymptote of learning in eyeblink conditioning (e.g., Schneiderman & Gormezano, 1964). Thus, the contiguity between the CS and US in delay conditioning alters the timing problem that the animal faces, effectively making the time interval to be learned shorter, and rendering the task easier for most ISIs. In future work, it will be important to characterize the exact mathematical properties that constrain the temporal microstimuli. Our simple Gaussian basis function approach suffices for the datasets examined here (cf. Ludvig et al., 2008), but other related mathematical functions are certainly possible. For example, replacing the temporal microstimuli in our model with the spectral traces of Grossberg & Schmajuk (1989) produces results that are similar to ours, but using sequences of Gamma-shaped functions tends to fail, with longer intervals learned too slowly relative to shorter intervals. One important characteristic of the microstimulus series seems to be that the heights of individual elements should not decay too quickly.
Another key challenge for future modeling is reconciling this abstract account of hippocampal function in trace conditioning with approaches that consider greater physiological detail (e.g., Rodriguez & Levy, 2001; Yamazaki & Tanaka, 2005). The current model also contributes to our understanding of the TD models of dopamine (e.g., Schultz et al., 1997) and classical conditioning (Sutton & Barto, 1990). These models have often given short shrift to issues of stimulus representation, focusing more closely on the properties of the learning algorithm (but see Ludvig et al., 2008). Here, we reveal how the interaction of various stimulus representations in conjunction with the TD learning rule produces a viable model of some of the differences between trace and delay conditioning.
References
Beylin, A. V., Gandhi, C. C., Wood, G. E., Talk, A. C., Matzel, L. D., & Shors, T. J. (2001). The role of the hippocampus in trace conditioning: Temporal discontinuity or task difficulty? Neurobiology of Learning & Memory, 76, 447-61.
Gould, E., Beylin, A., Tanapat, P., Reeves, A., & Shors, T. J. (1999). Learning enhances adult neurogenesis in the hippocampal formation. Nature Neuroscience, 2, 260-5.
Grossberg, S., & Schmajuk, N. A. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79-102.
Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20, 3034-54.
Machado, A. (1997). Learning the temporal dynamics of behavior. Psychological Review, 104, 241-265.
McEchron, M. D., Bouwmeester, H., Tseng, W., Weiss, C., & Disterhoft, J. F. (1998). Hippocampectomy disrupts auditory trace fear conditioning and contextual fear conditioning in the rat. Hippocampus, 8, 638-46.
McEchron, M. D., & Disterhoft, J. F. (1997). Sequence of single neuron changes in CA1 hippocampus of rabbits during acquisition of trace eyeblink conditioned responses. Journal of Neurophysiology, 78, 1030-44.
Moyer, J. R., Jr., Deyo, R. A., & Disterhoft, J. F. (1990). Hippocampectomy disrupts trace eye-blink conditioning in rabbits. Behavioral Neuroscience, 104, 243-52.
Pavlov, I. P. (1927). Conditioned Reflexes. London: Oxford University Press.
Port, R. L., Romano, A. G., Steinmetz, J. E., Mikhail, A. A., & Patterson, M. M. (1986). Retention and acquisition of classical trace conditioned responses by rabbits with hippocampal lesions. Behavioral Neuroscience, 100, 745-752.
Rodriguez, P., & Levy, W. B. (2001). A model of hippocampal activity in trace conditioning: Where's the trace? Behavioral Neuroscience, 115, 1224-1238.
Schmajuk, N. A., & DiCarlo, J. J. (1992). Stimulus configuration, classical conditioning, and hippocampal function. Psychological Review, 99, 268-305.
Schneiderman, N., & Gormezano, I. (1964). Conditioning of the nictitating membrane of the rabbit as a function of CS-US interval. Journal of Comparative and Physiological Psychology, 57, 188-195.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-9.
Solomon, P. R., Vander Schaaf, E. R., Thompson, R. F., & Weisz, D. J. (1986). Hippocampus and trace conditioning of the rabbit's classically conditioned nictitating membrane response. Behavioral Neuroscience, 100, 729-744.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 497-537). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Tseng, W., Guan, R., Disterhoft, J. F., & Weiss, C. (2004). Trace eyeblink conditioning is hippocampally dependent in mice. Hippocampus, 14, 58-65.
Wallenstein, G., Eichenbaum, H., & Hasselmo, M. (1998). The hippocampus as an associator of discontiguous events. Trends in Neuroscience, 21, 317-323.
Yamazaki, T., & Tanaka, S. (2005). A neural network model for trace conditioning. International Journal of Neural Systems, 15, 23-30.

8 nips-2008-A general framework for investigating how far the decoding process in the brain can be simplified

Author: Masafumi Oizumi, Toshiyuki Ishii, Kazuya Ishibashi, Toshihiko Hosoya, Masato Okada

Abstract: “How is information decoded in the brain?” is one of the most difficult and important questions in neuroscience. Whether neural correlation is important or not in decoding neural activities is of special interest. We have developed a general framework for investigating how far the decoding process in the brain can be simplified. First, we hierarchically construct simplified probabilistic models of neural responses that ignore more than Kth-order correlations by using a maximum entropy principle. Then, we compute how much information is lost when information is decoded using the simplified models, i.e., “mismatched decoders”. We introduce an information theoretically correct quantity for evaluating the information obtained by mismatched decoders. We applied our proposed framework to spike data for vertebrate retina. We used 100-ms natural movies as stimuli and computed the information contained in neural activities about these movies. We found that the information loss is negligibly small in population activities of ganglion cells even if all orders of correlation are ignored in decoding. We also found that if we assume stationarity for long durations in the information analysis of dynamically changing stimuli like natural movies, pseudo correlations seem to carry a large portion of the information. 1

9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets

Author: Gerald Quon, Yee W. Teh, Esther Chan, Timothy Hughes, Michael Brudno, Quaid D. Morris

Abstract: We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present Brownian Factor Phylogenetic Analysis, a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset profiling diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects. 1

10 nips-2008-A rational model of preference learning and choice prediction by children

Author: Christopher G. Lucas, Thomas L. Griffiths, Fei Xu, Christine Fawcett

Abstract: Young children demonstrate the ability to make inferences about the preferences of other agents based on their choices. However, there exists no overarching account of what children are doing when they learn about preferences or how they use that knowledge. We use a rational model of preference learning, drawing on ideas from economics and computer science, to explain the behavior of children in several recent experiments. Specifically, we show how a simple econometric model can be extended to capture two- to four-year-olds’ use of statistical information in inferring preferences, and their generalization of these preferences. 1
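As a generic illustration of the kind of econometric choice model referred to here (our sketch; the paper's actual model and its extensions differ), a multinomial-logit learner can recover an agent's utilities from repeated observed choices by maximum likelihood:

```python
import numpy as np

def infer_utilities(choices, option_sets, n_items, lr=0.1, steps=500):
    """Multinomial logit: P(choose i | set S) = exp(u_i) / sum_{j in S} exp(u_j).
    Gradient ascent on the log-likelihood of observed choices (illustrative sketch)."""
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for chosen, options in zip(choices, option_sets):
            p = np.exp(u[options])
            p /= p.sum()
            grad[chosen] += 1.0           # observed choice
            grad[options] -= p            # expected choice under the current utilities
        u += lr * grad / len(choices)
    return u - u.mean()                   # utilities identified only up to a constant

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_u = np.array([2.0, 0.0, -1.0])   # hypothetical: item 0 strongly preferred
    option_sets = [np.array([0, 1, 2])] * 200
    choices = [rng.choice(s, p=np.exp(true_u[s]) / np.exp(true_u[s]).sum()) for s in option_sets]
    print(np.round(infer_utilities(choices, option_sets, 3), 2))
```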

11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response

Author: Alexander Braunstein, Zhi Wei, Shane T. Jensen, Jon D. Mcauliffe

Abstract: Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method. 1

12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes

Author: Ben Calderhead, Mark Girolami, Neil D. Lawrence

Abstract: Identification and comparison of nonlinear dynamical system models using noisy and sparse experimental data is a vital task in many fields, however current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods. 1

13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

Author: Sanmay Das, Malik Magdon-Ismail

Abstract: We study the profit-maximization problem of a monopolistic market-maker who sets two-sided prices in an asset market. The sequential decision problem is hard to solve because the state space is a function. We demonstrate that the belief state is well approximated by a Gaussian distribution. We prove a key monotonicity property of the Gaussian state update which makes the problem tractable, yielding the first optimal sequential market-making algorithm in an established model. The algorithm leads to a surprising insight: an optimal monopolist can provide more liquidity than perfectly competitive market-makers in periods of extreme uncertainty, because a monopolist is willing to absorb initial losses in order to learn a new valuation rapidly so she can extract higher profits later. 1

14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Author: Tong Zhang

Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1
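The flavor of such an adaptive forward-backward procedure can be sketched as follows (a generic illustration under our own simplifying choices, not the paper's exact algorithm, thresholds, or stopping rules): greedily add the basis function that most reduces the squared error, then delete a previously selected one whenever doing so increases the error by only a fraction of the most recent gain.

```python
import numpy as np

def fit_err(X, y, S):
    """Squared error of the least-squares fit restricted to the columns in S."""
    if not S:
        return float(y @ y), None
    cols = sorted(S)
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ w
    return float(resid @ resid), w

def forward_backward_greedy(X, y, k_max=10, nu=0.5, max_iters=100):
    """Illustrative adaptive forward-backward greedy feature selection."""
    S = set()
    err, _ = fit_err(X, y, S)
    for _ in range(max_iters):
        if len(S) >= k_max:
            break
        # forward step: add the feature that most reduces the squared error
        best_j, best_err = None, err
        for j in range(X.shape[1]):
            if j not in S:
                e, _ = fit_err(X, y, S | {j})
                if e < best_err:
                    best_j, best_err = j, e
        if best_j is None or err - best_err < 1e-10:
            break
        gain = err - best_err
        S.add(best_j)
        err = best_err
        # adaptive backward steps: drop features whose removal costs at most nu * gain
        while len(S) > 1:
            err_del, j_del = min((fit_err(X, y, S - {j})[0], j) for j in S)
            if err_del - err > nu * gain:
                break
            S.remove(j_del)
            err = err_del
    return sorted(S), fit_err(X, y, S)[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 30))
    w_true = np.zeros(30)
    w_true[[2, 7, 19]] = [1.5, -2.0, 0.8]
    y = X @ w_true + 0.05 * rng.normal(size=100)
    support, w = forward_backward_greedy(X, y)
    print(support)
```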

15 nips-2008-Adaptive Martingale Boosting

Author: Phil Long, Rocco Servedio

Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e., it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step sizes depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidence-rated weak hypotheses that output real values rather than Boolean predictions.

16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF

Author: Jonathan L. Roux, Alain D. Cheveigné, Lucas C. Parra

Abstract: How does one extract unknown but stereotypical events that are linearly superimposed within a signal with variable latencies and variable amplitudes? One could think of using template matching or matching pursuit to find the arbitrarily shifted linear components. However, traditional matching approaches require that the templates be known a priori. To overcome this restriction we use instead semi non-negative matrix factorization (semi-NMF), which we extend to allow for time shifts when matching the templates to the signal. The algorithm estimates templates directly from the data along with their non-negative amplitudes. The resulting method can be thought of as an adaptive template matching procedure. We demonstrate the procedure on the task of extracting spikes from single-channel extracellular recordings. On these data the algorithm essentially performs spike detection and unsupervised spike clustering. Results on simulated data and extracellular recordings indicate that the method performs well for signal-to-noise ratios of 6 dB or higher and that spike templates are recovered accurately provided they are sufficiently different. 1

17 nips-2008-Algorithms for Infinitely Many-Armed Bandits

Author: Yizao Wang, Jean-yves Audibert, Rémi Munos

Abstract: We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases. 1
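One simple concrete instance of this strategy (our illustration, not the paper's algorithms or bounds) is to sample a fixed number of arms at random from the infinite pool and then run a standard UCB rule on that restricted set:

```python
import numpy as np

def ucb_on_random_subset(draw_new_arm_mean, n_arms=50, horizon=5000, c=2.0, seed=0):
    """Many-armed bandit heuristic: restrict play to n_arms randomly drawn arms, then UCB.
    draw_new_arm_mean(rng) samples the (hidden) mean reward of a freshly drawn arm."""
    rng = np.random.default_rng(seed)
    means = np.array([draw_new_arm_mean(rng) for _ in range(n_arms)])  # hidden from the player
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                                        # play each selected arm once first
        else:
            ucb = sums / counts + np.sqrt(c * np.log(t) / counts)
            a = int(np.argmax(ucb))
        reward = float(rng.random() < means[a])              # Bernoulli reward
        counts[a] += 1
        sums[a] += reward
        total_reward += reward
    return total_reward, means.max()

if __name__ == "__main__":
    # assumption: near-optimal arms are rare, e.g. a fresh arm's mean ~ Beta(1, 9)
    total, best_mean = ucb_on_random_subset(lambda rng: rng.beta(1, 9))
    print(round(total), round(best_mean, 3))
```

The interesting question the paper addresses is how many arms to draw as a function of the horizon and of how likely a fresh arm is to be near-optimal; the constant n_arms above is only a placeholder for that choice.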

18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering

Author: Dilan Gorur, Yee W. Teh

Abstract: We propose an efficient sequential Monte Carlo inference scheme for the recently proposed coalescent clustering model [1]. Our algorithm has a quadratic runtime while those in [1] are cubic. In experiments, we were surprised to find that in addition to being more efficient, it is also a better sequential Monte Carlo sampler than the best in [1], when measured in terms of variance of estimated likelihood and effective sample size. 1

19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

Author: Gabriele Schweikert, Gunnar Rätsch, Christian Widmer, Bernhard Schölkopf

Abstract: We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.

20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning

Author: Zenglin Xu, Rong Jin, Irwin King, Michael Lyu

Abstract: We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method. 1

21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

23 nips-2008-An ideal observer model of infant object perception

24 nips-2008-An improved estimator of Variance Explained in the presence of noise

25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference

27 nips-2008-Artificial Olfactory Brain for Mixture Identification

28 nips-2008-Asynchronous Distributed Learning of Topic Models

29 nips-2008-Automatic online tuning for fast Gaussian summation

30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

31 nips-2008-Bayesian Exponential Family PCA

32 nips-2008-Bayesian Kernel Shaping for Learning Control

33 nips-2008-Bayesian Model of Behaviour in Economic Games

34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

35 nips-2008-Bayesian Synchronous Grammar Induction

36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree

37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

38 nips-2008-Bio-inspired Real Time Sensory Map Realignment in a Robotic Barn Owl

39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms

40 nips-2008-Bounds on marginal probability distributions

41 nips-2008-Breaking Audio CAPTCHAs

42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding

43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons

44 nips-2008-Characteristic Kernels on Groups and Semigroups

45 nips-2008-Characterizing neural dependencies with copula models

46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues

47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

48 nips-2008-Clustering via LP-based Stabilities

49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

50 nips-2008-Continuously-adaptive discretization for message-passing algorithms

51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

52 nips-2008-Correlated Bigram LSA for Unsupervised Language Model Adaptation

53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform

55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph

56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

57 nips-2008-Deflation Methods for Sparse PCA

58 nips-2008-Dependence of Orientation Tuning on Recurrent Excitation and Inhibition in a Network Model of V1

59 nips-2008-Dependent Dirichlet Process Spike Sorting

60 nips-2008-Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds

61 nips-2008-Diffeomorphic Dimensionality Reduction

62 nips-2008-Differentiable Sparse Coding

63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

65 nips-2008-Domain Adaptation with Multiple Sources

66 nips-2008-Dynamic visual attention: searching for coding length increments

67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance

68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

69 nips-2008-Efficient Exact Inference in Planar Ising Models

70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables

72 nips-2008-Empirical performance maximization for linear rank statistics

73 nips-2008-Estimating Robust Query Models with Convex Optimization

74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG

75 nips-2008-Estimating vector fields using sparse basis field expansions

76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables

77 nips-2008-Evaluating probabilities under high-dimensional latent variable models

78 nips-2008-Exact Convex Confidence-Weighted Learning

79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning

81 nips-2008-Extracting State Transition Dynamics from Multiple Spike Trains with Correlated Poisson HMM

82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models

83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method

84 nips-2008-Fast Prediction on a Tree

85 nips-2008-Fast Rates for Regularized Objectives

86 nips-2008-Finding Latent Causes in Causal Networks: an Efficient Approach Based on Markov Blankets

87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

89 nips-2008-Gates

90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images

93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework

95 nips-2008-Grouping Contours Via a Related Image

96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data

99 nips-2008-High-dimensional support union recovery in multivariate regression

100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction

101 nips-2008-Human Active Learning

102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy

103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines

104 nips-2008-Improved Moves for Truncated Convex Models

105 nips-2008-Improving on Expectation Propagation

106 nips-2008-Inferring rankings under constrained sensing

107 nips-2008-Influence of graph construction on graph-based clustering measures

108 nips-2008-Integrating Locally Learned Causal Structures with Overlapping Variables

109 nips-2008-Interpreting the neural code with Formal Concept Analysis

110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control

111 nips-2008-Kernel Change-point Analysis

112 nips-2008-Kernel Measures of Independence for non-iid Data

113 nips-2008-Kernelized Sorting

114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

115 nips-2008-Learning Bounded Treewidth Bayesian Networks

116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

117 nips-2008-Learning Taxonomies by Dependence Maximization

118 nips-2008-Learning Transformational Invariants from Natural Movies

119 nips-2008-Learning a discriminative hidden part model for human action recognition

120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement

122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

124 nips-2008-Load and Attentional Bayes

125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning

126 nips-2008-Localized Sliced Inverse Regression

127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction

128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification

129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

131 nips-2008-MDPs with Non-Deterministic Policies

132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering

133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

134 nips-2008-Mixed Membership Stochastic Blockmodels

135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of $\ell_1$-regularized MLE

136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns

137 nips-2008-Modeling Short-term Noise Dependence of Spike Counts in Macaque Prefrontal Cortex

138 nips-2008-Modeling human function learning with Gaussian processes

139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters

140 nips-2008-Mortal Multi-Armed Bandits

141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs

142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

143 nips-2008-Multi-label Multiple Kernel Learning

144 nips-2008-Multi-resolution Exploration in Continuous Spaces

145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

146 nips-2008-Multi-task Gaussian Process Learning of Robot Inverse Dynamics

147 nips-2008-Multiscale Random Fields with Application to Contour Grouping

148 nips-2008-Natural Image Denoising with Convolutional Networks

149 nips-2008-Near-minimax recursive density estimation on the binary hypercube

150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

151 nips-2008-Non-parametric Regression Between Manifolds

152 nips-2008-Non-stationary dynamic Bayesian networks

153 nips-2008-Nonlinear causal discovery with additive noise models

154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

155 nips-2008-Nonparametric regression and classification with joint sparsity constraints

156 nips-2008-Nonparametric sparse hierarchical models describe V1 fMRI responses to natural images

157 nips-2008-Nonrigid Structure from Motion in Trajectory Space

158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

159 nips-2008-On Bootstrapping the ROC Curve

160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing

161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime

166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling

168 nips-2008-Online Metric Learning and Fast Similarity Search

169 nips-2008-Online Models for Content Optimization

170 nips-2008-Online Optimization in X-Armed Bandits

171 nips-2008-Online Prediction on Large Diameter Graphs

172 nips-2008-Optimal Response Initiation: Why Recent Experience Matters

173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach

174 nips-2008-Overlaying classifiers: a practical approach for optimal ranking

175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

178 nips-2008-Performance analysis for L2 kernel classification

179 nips-2008-Phase transitions for high-dimensional joint support recovery

180 nips-2008-Playing Pinball with non-invasive BCI

181 nips-2008-Policy Search for Motor Primitives in Robotics

182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

184 nips-2008-Predictive Indexing for Fast Search

185 nips-2008-Privacy-preserving logistic regression

186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring

187 nips-2008-Psychiatry: Insights into depression through normative decision-making models

188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees

189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes

190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking

191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing

192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

193 nips-2008-Regularized Co-Clustering with Dual Supervision

194 nips-2008-Regularized Learning with Networks of Features

195 nips-2008-Regularized Policy Iteration

196 nips-2008-Relative Margin Machines

197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation

198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions

199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

200 nips-2008-Robust Kernel Principal Component Analysis

201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

202 nips-2008-Robust Regression and Lasso

203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching

204 nips-2008-Self-organization using synaptic plasticity

205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

206 nips-2008-Sequential effects: Superstition or rational behavior?

207 nips-2008-Shape-Based Object Localization for Descriptive Classification

208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes

209 nips-2008-Short-Term Depression in VLSI Stochastic Synapse

210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms

211 nips-2008-Simple Local Models for Complex Dynamical Systems

212 nips-2008-Skill Characterization Based on Betweenness

213 nips-2008-Sparse Convolved Gaussian Processes for Multi-output Regression

214 nips-2008-Sparse Online Learning via Truncated Gradient

215 nips-2008-Sparse Signal Recovery Using Markov Random Fields

216 nips-2008-Sparse probabilistic projections

217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss

218 nips-2008-Spectral Clustering with Perturbed Data

219 nips-2008-Spectral Hashing

220 nips-2008-Spike Feature Extraction Using Informative Samples

221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC

222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

223 nips-2008-Structure Learning in Human Sequential Decision-Making

224 nips-2008-Structured ranking learning using cumulative distribution networks

225 nips-2008-Supervised Bipartite Graph Inference

226 nips-2008-Supervised Dictionary Learning

227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

228 nips-2008-Support Vector Machines with a Reject Option

229 nips-2008-Syntactic Topic Models

230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

231 nips-2008-Temporal Dynamics of Cognitive Control

232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction

233 nips-2008-The Gaussian Process Density Sampler

234 nips-2008-The Infinite Factorial Hidden Markov Model

235 nips-2008-The Infinite Hierarchical Factor Regression Model

236 nips-2008-The Mondrian Process

237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine

238 nips-2008-Theory of matching pursuit

239 nips-2008-Tighter Bounds for Structured Estimation

240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising

242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

243 nips-2008-Understanding Brain Connectivity Patterns during Motor Imagery for Brain-Computer Interfacing

244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation

245 nips-2008-Unlabeled data: Now it helps, now it doesn't

246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words

247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries

248 nips-2008-Using matrices to model symbolic relationship

249 nips-2008-Variational Mixture of Gaussian Process Experts

250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning