nips nips2002 nips2002-130 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. [sent-6, score-0.376]
2 The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. [sent-8, score-0.181]
3 This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. [sent-9, score-0.157]
4 Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. [sent-10, score-0.192]
5 The approach is demonstrated with an example that shows the efficiency gains over naive enumeration. [sent-11, score-0.052]
6 1 Introduction The Markov game framework has received increased attention as a rigorous model for defining and determining optimal behavior in multiagent systems. [sent-12, score-0.195]
7 Littman [7] demonstrated that reinforcement learning could be applied to Markov games, albeit at the expense of solving one linear program for each state visited during learning. [sent-14, score-0.304]
8 This computational (and conceptual) burden is probably one factor behind the relative dearth of ambitious Markov game applications using reinforcement learning. [sent-15, score-0.201]
9 We applied the LSPI reinforcement learning algorithm [5] with function approximation to a two-player soccer game and a router/server flow control problem and derived very good results. [sent-17, score-0.234]
10 While the theoretical results [6] are general and apply to any reinforcement learning algorithm, we preferred to use LSPI because LSPI’s efficient use of data meant that we solved fewer linear programs during learning. [sent-18, score-0.08]
11 1 The term Markov game in this paper refers to the zero-sum case unless stated otherwise. [sent-19, score-0.157]
12 Since soccer, routing, and many other natural applications of the Markov game framework tend to involve multiple participants, it would be very useful to generalize recent advances in multiagent cooperative MDPs [2, 4] to Markov games. [sent-20, score-0.195]
13 These methods use a factored value function architecture and determine the optimal action using a cost network [1] and a communication structure which is derived directly from the structure of the value function. [sent-21, score-0.263]
14 LSPI has been successfully combined with such methods; in empirical experiments, the number of state visits required to achieve good performance scaled linearly with the number of agents despite the exponential growth in the joint action space [4]. [sent-22, score-0.27]
15 In this paper, we integrate these ideas and we present an algorithm for learning good strategies for a team of agents that plays against an opponent team. [sent-23, score-0.358]
16 In such games, players within one team collaborate, whereas players in different teams compete. [sent-24, score-0.318]
17 The key component of this work is a method for computing efficiently the best strategy for a team, given an approximate factored value function which is a linear combination of features defined over the state space and subsets of the joint action space for both sides. [sent-25, score-0.307]
18 2 Markov Games A two-player zero-sum Markov game is defined as a 6-tuple (S, A, O, P, R, γ), where: S = {s1, s2, ..., sn} is a finite set of game states; A = {a1, a2, ..., am} and O = {o1, o2, ..., ol} are finite sets of actions for the first and second player, respectively; P is the transition model, with P(s, a, o, s') the probability of reaching state s' when the players take actions a and o in state s; R(s, a, o) is the reward received by the first player; and γ ∈ [0, 1) is the discount factor. [sent-27, score-0.134]
20 We will refer to the first player as the maximizer and the second player as the minimizer. [sent-37, score-0.340]
21 Note that if either player is permitted only a single action, the Markov game becomes an MDP for the other player. [sent-38, score-0.22]
22 A policy π for a player in a Markov game is a mapping, π : S → Ω(A), which yields probability distributions over the maximizer’s actions for each state in S. [sent-39, score-0.696]
23 Unlike MDPs, the optimal policy for a Markov game may be stochastic, i.e., it may require randomization over actions. [sent-40, score-0.353]
24 By convention, for any policy π, π(s) denotes the probability distribution over actions in state s and π(s, a) denotes the probability of action a in state s. [sent-43, score-0.608]
25 The maximizer is interested in maximizing its expected, discounted return in the minimax sense, that is, assuming the worst case of an optimal minimizer. [sent-44, score-0.269]
26 Since the underlying rewards are zero-sum, it is sufficient to view the minimizer as acting to minimize the maximizer’s return. [sent-45, score-0.065]
27 For any policy π, we can define Qπ (s, a, o) as the expected total discounted reward of the maximizer when following policy π after the players take actions a and o for the first step. [sent-46, score-0.932]
28 For a fixed policy π, Qπ satisfies Qπ(s, a, o) = R(s, a, o) + γ Σ_{s'∈S} P(s, a, o, s') min_{o'∈O} Σ_{a'∈A} π(s', a') Qπ(s', a', o'). Given any Q function, the maximizer can choose actions so as to maximize its value: V(s) = max_{π'(s)∈Ω(A)} min_{o∈O} Σ_{a∈A} Q(s, a, o) π'(s, a). (1) [sent-48, score-0.532]
29 We will refer to the policy π' chosen by Eq. (1) as the minimax policy with respect to Q. [sent-49, score-0.219]
30 This policy can be determined in any state s by solving the following linear program: Maximize V(s), subject to: ∀a ∈ A, π'(s, a) ≥ 0; Σ_{a∈A} π'(s, a) = 1; and ∀o ∈ O, V(s) ≤ Σ_{a∈A} Q(s, a, o) π'(s, a). [sent-52, score-0.327]
31 If Q = Qπ, the minimax policy is an improved policy compared to π. [sent-53, score-0.520]
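As an illustration, the per-state minimax linear program above can be solved with an off-the-shelf LP solver. The sketch below is our own helper (the function name, variable names, and the use of SciPy's linprog are not from the paper): given the matrix Q(s, ·, ·) for one state, it returns the stochastic maximizer policy and the game value.

```python
# Minimax LP for one state: variables are [pi(a_1), ..., pi(a_|A|), V(s)].
import numpy as np
from scipy.optimize import linprog

def minimax_policy(Q_s):
    """Q_s: (|A|, |O|) array with Q_s[a, o] = Q(s, a, o). Returns (pi, V)."""
    n_a, n_o = Q_s.shape
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                       # linprog minimizes, so minimize -V(s)
    # For every opponent action o: V(s) - sum_a Q(s, a, o) pi(a) <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to one; V(s) is unconstrained in sign.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_a], res.x[-1]

# Example: matching pennies has value 0 and a uniform minimax policy.
pi, V = minimax_policy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```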
32 A policy iteration algorithm can be implemented for Markov games in a manner analogous to policy iteration for MDPs by fixing a policy πi , solving for Qπi , choosing πi+1 as the minimax policy with respect to Qπi and iterating. [sent-54, score-1.231]
33 This algorithm converges to the optimal minimax policy π ∗ . [sent-55, score-0.301]
34 We consider the standard approach of approximating the Q function as the linear combination of k basis functions φj with weights wj , that is Q(s, a, o) = φ(s, a, o) w. [sent-57, score-0.186]
35 Least-Squares Policy Iteration (LSPI) [5] is an approximate policy iteration algorithm that learns policies using a corpus of stored samples. [sent-60, score-0.3]
36 LSPI also applies, with minor modifications, to Markov games [6]. [sent-61, score-0.140]
37 In particular, at each iteration, LSPI evaluates the current policy using the stored samples, and the learned weights implicitly represent the improved minimax policy for the next iteration, obtained by solving the linear program above. [sent-62, score-0.753]
38 The modified update equations account for the minimizer’s action and the distribution over next maximizer actions since the minimax policy is, in general, stochastic. [sent-63, score-0.772]
39 The policy π'(s') for state s' is computed using the linear program above. [sent-65, score-0.394]
40 The action o' is the minimizing opponent action in computing π'(s') and can be identified by the tight constraint on V(s'). [sent-66, score-0.386]
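The modified update equations themselves are not reproduced in this extract, so the following is a hedged sketch of the policy-evaluation step in the standard LSTDQ form, with the expectation taken over the stochastic minimax policy at the next state and the opponent playing its minimizing action o'. All function and variable names are ours.

```python
# Sample-based policy evaluation for a Markov game: samples are (s, a, o, r, s2),
# phi(s, a, o) returns a length-k feature vector, and minimax_at(s2) returns
# (pi2, o2): the minimax policy at s2 (dict a -> prob) and the minimizing o2.
import numpy as np

def lstdq_markov_game(samples, phi, k, gamma, minimax_at):
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (s, a, o, r, s2) in samples:
        pi2, o2 = minimax_at(s2)
        # Expected next feature vector under the stochastic minimax policy.
        phi_next = sum(p * phi(s2, a2, o2) for a2, p in pi2.items())
        phi_now = phi(s, a, o)
        A += np.outer(phi_now, phi_now - gamma * phi_next)
        b += phi_now * r
    w = np.linalg.solve(A, b)   # use a pseudo-inverse if A happens to be singular
    return w                    # Q(s, a, o) ~ phi(s, a, o) . w
```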
41 The key step in generalizing LSPI to team Markov games is finding efficient means to perform these operations despite the exponentially large joint action space. [sent-68, score-0.461]
42 4 Least Squares Policy Iteration for Team Markov Games A team Markov game is a Markov game where a team of N maximizers is playing against a team of M minimizers. [sent-69, score-0.89]
43 Maximizer i chooses actions from Ai, so the maximizer team chooses joint actions ā = (a1, a2, ..., aN) ∈ Ā = A1 × A2 × ... × AN. [sent-70, score-0.684]
44 Minimizer i chooses actions from Oi, so the minimizer team chooses joint actions ō = (o1, o2, ..., oM) ∈ Ō = O1 × O2 × ... × OM. [sent-77, score-0.749]
45 The minimax policy π for the maximizer team in any given state s can be computed (naively) by solving the following linear program: Maximize V(s), subject to: ∀ā ∈ Ā, π(s, ā) ≥ 0; Σ_{ā∈Ā} π(s, ā) = 1; and ∀ō ∈ Ō, V(s) ≤ Σ_{ā∈Ā} π(s, ā) Q(s, ā, ō). [sent-85, score-0.761]
46 Since |Ā| is exponential in N and |Ō| is exponential in M, the linear program above has an exponential number of variables and constraints and would be intractable to solve, unless we make certain assumptions about Q. [sent-86, score-0.403]
47 We assume a factored approximation [2] of the Q function, given as a linear combination of k localized basis functions. [sent-87, score-0.167]
48 Each basis function can be thought of as an individual player’s perception of the environment, so each φj need not depend upon every feature of the state or the actions taken by every player in the game. [sent-88, score-0.367]
49 For example, if φ4 depends only on the actions of maximizers {4, 5, 8} and the actions of minimizers {3, 2, 7}, then ā4 ∈ A4 × A5 × A8 and ō4 ∈ O3 × O2 × O7. [sent-90, score-0.529]
50 Under this locality assumption, the approximate (factored) value function is Q(s, ā, ō) = Σ_{j=1}^{k} φj(s, āj, ōj) wj, where the assignments to the āj's and ōj's are consistent with ā and ō. [sent-91, score-1.481]
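A small sketch of evaluating such a factored Q function (data layout and names are ours): each basis function carries the indices of the maximizers and minimizers it looks at, and only those components of the joint action are passed to it.

```python
# Factored Q evaluation: bases is a list of (phi_j, a_scope_j, o_scope_j);
# a_joint / o_joint map agent index -> chosen action; w is the weight vector.
def factored_q(s, a_joint, o_joint, bases, w):
    total = 0.0
    for (phi_j, a_scope, o_scope), w_j in zip(bases, w):
        a_local = tuple(a_joint[i] for i in a_scope)   # e.g. scope (4, 5, 8)
        o_local = tuple(o_joint[i] for i in o_scope)   # e.g. scope (3, 2, 7)
        total += w_j * phi_j(s, a_local, o_local)
    return total
```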
51 Given this form of the value function, the linear program can be simplified significantly. [sent-92, score-0.153]
52 From the last expression, it is clear that we can use the πj(s, āj) as the variables of the linear program. [sent-94, score-0.575]
53 The number of these variables will typically be much smaller than the number of variables π(s, ā), depending on the size of the Āj's. [sent-95, score-0.054]
54 However, we must add constraints to ensure that the local probability distributions πj(s) are consistent with a global distribution over the entire joint action space Ā. [sent-96, score-0.183]
55 The local constraints are: ∀ j = 1, ..., k, ∀ āj ∈ Āj: πj(s, āj) ≥ 0, and ∀ j = 1, ..., k: Σ_{āj∈Āj} πj(s, āj) = 1. [sent-100, score-0.520]
56 For consistency, we must ensure that all marginals over common variables are identical: ∀ 1 ≤ j < h ≤ k, ∀ ā ∈ Āj ∩ Āh: Σ_{āj∈Āj\Āh} πj(s, āj) = Σ_{āh∈Āh\Āj} πh(s, āh). The value constraints become: ∀ ō ∈ Ō, V(s) ≤ Σ_{j=1}^{k} wj Σ_{āj∈Āj} φj(s, āj, ōj) πj(s, āj). [sent-104, score-1.863]
57 At this point we have eliminated the exponential dependency from the number of variables and partially from the number of constraints. [sent-105, score-0.297]
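The marginal-consistency constraints above can be enumerated mechanically. The helper below is our own sketch (not code from the paper) and uses an assumed data layout: for each pair of overlapping scopes and each assignment to the shared agents, it returns the two sets of local entries whose probabilities must sum to the same value.

```python
from itertools import product

def consistency_constraints(scope_j, scope_h, actions):
    """scope_j, scope_h: tuples of maximizer indices; actions: dict agent -> list.
    Returns (shared_assignment, rows_j, rows_h) triples: the probability mass of
    rows_j under pi_j must equal that of rows_h under pi_h."""
    shared = [i for i in scope_j if i in scope_h]
    out = []
    for shared_assign in product(*(actions[i] for i in shared)):
        fixed = dict(zip(shared, shared_assign))
        rows_j = [a for a in product(*(actions[i] for i in scope_j))
                  if all(a[scope_j.index(i)] == fixed[i] for i in shared)]
        rows_h = [a for a in product(*(actions[i] for i in scope_h))
                  if all(a[scope_h.index(i)] == fixed[i] for i in shared)]
        out.append((fixed, rows_j, rows_h))
    return out

# Example: pi_1 over agents (1, 2) and pi_2 over agents (1, 3) must agree on
# their marginal over agent 1.
acts = {1: ['a', 'b'], 2: ['a', 'b'], 3: ['a', 'b']}
cons = consistency_constraints((1, 2), (1, 3), acts)
```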
58 The last set of (exponentially many) constraints can be replaced by a single non-linear constraint: V(s) ≤ min_{ō∈Ō} Σ_{j=1}^{k} wj Σ_{āj∈Āj} φj(s, āj, ōj) πj(s, āj). [sent-106, score-1.354]
59 We now show how this non-linear constraint can be turned into a number of linear constraints which is not exponential in M in general. [sent-107, score-0.287]
60 The main idea is to embed a cost network inside the linear program [2]. [sent-108, score-0.132]
61 In particular, we define an elimination order for the oi's in ō and, for each oi in turn, we push the min operator for just oi as far inside the summation as possible, keeping only terms that have some dependency on oi or no dependency on any of the opponent team actions. [sent-109, score-1.417]
62 We replace this smaller min expression over oi with a new function fi (represented by a set of new variables in the linear program) that depends on the other opponent actions that appear in this min expression. [sent-110, score-0.674]
63 Finally, we introduce a set of linear constraints for the value of fi that express the fact that fi is the minimum of the eliminated expression in all cases. [sent-111, score-0.423]
64 We repeat this elimination process until all oi's and therefore all min operators are eliminated. [sent-112, score-0.202]
65 More formally, at step i of the elimination, let Bi be the set of basis functions that have not been eliminated up to that point and Fi be the set of the new functions that have not been eliminated yet. [sent-113, score-0.098]
66 For simplicity, we assume that the elimination order is o1, o2, ..., oM (in practice the elimination order needs to be chosen carefully in advance, since a poor elimination ordering could have serious adverse effects on efficiency). [sent-117, score-0.290]
68 At the very beginning of the elimination process, B1 = {φ1, φ2, ..., φk} and F1 = ∅. [sent-118, score-0.145]
69 When eliminating oi at step i, define Ei ⊆ Bi ∪ Fi to be those functions that contain oi in their domain or have no dependency on any opponent action. [sent-122, score-0.973]
70 We generate a new function fi(ōi) that depends on all the opponent actions ōi that appear in Ei excluding oi: fi(ōi) = min_{oi∈Oi} ( Σ_{φj∈Ei} wj Σ_{āj∈Āj} φj(s, āj, ōj) πj(s, āj) + Σ_{fk∈Ei} fk(ōk) ). [sent-123, score-2.219]
71 We introduce a new variable in the linear program for each possible setting of the domain ōi of the new function fi(ōi). [sent-124, score-1.359]
72 We also introduce a set of constraints for these variables: ∀ oi ∈ Oi, ∀ ōi: fi(ōi) ≤ Σ_{φj∈Ei} wj Σ_{āj∈Āj} φj(s, āj, ōj) πj(s, āj) + Σ_{fk∈Ei} fk(ōk). These constraints ensure that the new function is the minimum over the possible choices for oi. [sent-125, score-2.997]
73 Now, we define Bi+1 = Bi − Ei and Fi+1 = Fi − Ei + {fi } and we continue with the elimination of action oi+1 . [sent-126, score-0.234]
74 Notice that oi does not appear anywhere in Bi+1 or Fi+1 . [sent-127, score-0.429]
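The bookkeeping of this elimination can be traced with a few lines of code. The sketch below is our own helper (not code from the paper): it only tracks which opponent variables each intermediate function fi depends on, which is what determines the size of the reduced linear program. The scopes in the example call are the opponent scopes of the factored value function given in the example near the end of the paper, with an elimination order o1, o2, o3, o4 that we assume for illustration.

```python
def trace_elimination(basis_o_scopes, order):
    """basis_o_scopes: opponent scope (set of minimizer indices) per basis fn."""
    remaining = [set(s) for s in basis_o_scopes]   # scopes still in B_i
    new_fns = []                                   # (name, scope) pairs in F_i
    for i in order:
        def in_ei(scope):                          # member of E_i at this step?
            return (i in scope) or (len(scope) == 0)
        e_scopes = ([s for s in remaining if in_ei(s)] +
                    [s for _, s in new_fns if in_ei(s)])
        remaining = [s for s in remaining if not in_ei(s)]
        new_fns = [(n, s) for n, s in new_fns if not in_ei(s)]
        scope = (set().union(*e_scopes) - {i}) if e_scopes else set()
        new_fns.append((f"f_{i}", scope))
        print(f"eliminating o_{i}: f_{i} depends on opponents {sorted(scope)}")
    return new_fns

# Opponent scopes of phi_1..phi_5 in the example: {1,2}, {1,3}, {3}, {4}, {3,4}.
trace_elimination([{1, 2}, {1, 3}, {3}, {4}, {3, 4}], order=[1, 2, 3, 4])
```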
75 Summarizing everything, the reduced linear program is: Maximize fM, subject to the local constraints on the πj(s, āj)'s (non-negativity and normalization, ∀ j = 1, ..., k), the marginal-consistency constraints, and the constraints defining the fi(ōi)'s introduced during the elimination. [sent-129, score-0.132]
76 The total number of variables and/or constraints is now exponentially dependent only on the number of players that appear together as a group in any of the basis functions or the intermediate functions and distributions. [sent-136, score-0.209]
77 It should be emphasized that this reduced linear program solves the same problem as the naive linear program and yields the same solution (albeit in a factored form). [sent-137, score-0.392]
78 The action ō' is the minimizing opponent's joint action in computing π(s'). [sent-140, score-0.205]
79 Unfortunately, the number of terms in the summation within the first update equation is exponential in N. [sent-141, score-0.075]
80 However, the vector φ(s, ā, ō) − γ Σ_{ā'∈Ā} π(s', ā') φ(s', ā', ō') can be computed on a component-by-component basis, avoiding this exponential blowup. [sent-142, score-0.082]
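A sketch of that component-by-component trick (names are ours): the j-th entry of the expected next-feature vector only requires the local distribution πj(s', ·) over the j-th basis function's own action scope, never the full joint policy.

```python
def expected_next_feature_j(s2, o2_local, pi_j_local, phi_j):
    """pi_j_local: dict mapping a local joint action (tuple) to its probability
    at state s2; o2_local: the minimizing opponent assignment restricted to
    phi_j's opponent scope. Returns one component of the expected feature."""
    return sum(p * phi_j(s2, a_j, o2_local) for a_j, p in pi_j_local.items())
```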
81 A related question is how to find ō', the minimizing opponent's joint action in computing π(s'). [sent-144, score-0.144]
82 This can be done after the linear program is solved, by going through the fi's in reverse order (compared to the elimination order) and finding the choice for oi that imposes a tight constraint on fi(ōi), conditioned on the minimizing choices for the oi's that have been found so far. [sent-145, score-0.913]
83 The only complication is that the linear program has no incentive to maximize fi(ōi) unless it contributes to maximizing the final value. [sent-146, score-0.191]
84 Thus, a constraint that appears to be tight may not correspond to the actual minimizing choice. [sent-147, score-0.086]
85 The solution to this is to do a forward pass first (according to the elimination order), marking the fi(ōi)'s that really come from tight constraints. [sent-148, score-0.202]
86 Then, the backward pass described above will find the true minimizing choices by using only the marked fi(ōi)'s. [sent-149, score-0.190]
87 The last question is how to sample a joint action ā from the global distribution defined by the smaller distributions. [sent-150, score-0.089]
88 We begin with all actions uninstantiated and we go through all the πj(s)'s. [sent-151, score-0.214]
89 For each j, we marginalize out the instantiated actions (if any) from πj(s) to generate the conditional probability, and then we sample jointly the actions that remain in the distribution. [sent-152, score-0.428]
90 We repeat with the next j until all actions are instantiated. [sent-153, score-0.214]
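A sketch of this sequential sampling step, under an assumed data layout of our own: each local policy maps a tuple of (agent, action) pairs to a probability, and conditioning on already-instantiated agents is done by renormalizing the consistent entries.

```python
import random

def sample_joint_action(local_policies):
    """local_policies: list of dicts, one per pi_j(s, .), mapping a tuple of
    (agent, action) pairs to a probability. Returns agent -> action."""
    assignment = {}
    for pi_j in local_policies:
        # Keep only entries consistent with the actions already instantiated.
        consistent = {a: p for a, p in pi_j.items()
                      if all(assignment.get(i, act) == act for i, act in a)}
        total = sum(consistent.values())
        if total == 0:        # should not occur if the marginals are consistent
            continue
        r, acc = random.random() * total, 0.0
        for a, p in consistent.items():
            acc += p
            if r <= acc:
                assignment.update(dict(a))   # sample the remaining actions jointly
                break
    return assignment
```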
91 Notice that this operation can be performed in a distributed manner, that is, at execution time only agents whose actions appear in the same πj (s) need to communicate to sample actions jointly. [sent-154, score-0.504]
92 This communication structure is directly derived from the structure of the basis functions. [sent-155, score-0.058]
93 Since experimental results are still in progress, we demonstrate the efficiency gained over exponential enumeration with an example. [sent-157, score-0.058]
94 Consider a problem with N = 5 maximizers and M = 4 minimizers. [sent-158, score-0.07]
95 Assume also that each maximizer or minimizer has 5 actions to choose from. [sent-159, score-0.447]
96 The naive solution would require solving a linear program with 3126 variables and 3751 constraints for any representation of the value function. [sent-160, score-0.294]
97 Consider now the following factored value function: Q(s, ā, ō) = φ1(s, a1, a2, o1, o2)w1 + φ2(s, a1, a3, o1, o3)w2 + φ3(s, a2, a4, o3)w3 + φ4(s, a3, a5, o4)w4 + φ5(s, a1, o3, o4)w5. [sent-161, score-0.119]
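A quick check of the naive counts quoted above, together with the sizes of the local distributions implied by this factorization (the local tally is our own arithmetic, not a figure reported in this extract):

```python
n_actions = 5
joint_A = n_actions ** 5            # 3125 joint maximizer actions (N = 5)
joint_O = n_actions ** 4            # 625 joint minimizer actions (M = 4)
naive_vars = joint_A + 1            # one pi(s, a) per joint action, plus V(s)
naive_cons = joint_A + 1 + joint_O  # nonnegativity + sum-to-one + one per o
print(naive_vars, naive_cons)       # 3126 3751

# Maximizer scopes of phi_1..phi_5: {a1,a2}, {a1,a3}, {a2,a4}, {a3,a5}, {a1}
local_scope_sizes = [2, 2, 2, 2, 1]
local_pi_vars = sum(n_actions ** k for k in local_scope_sizes)
print(local_pi_vars)                # 105 local pi_j(s, a_j) variables
```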
98 Our approach permits a tradeoff between simple architectures with limited representational capability and sparse communication and complex architectures with rich representations and more complex coordination structure. [sent-164, score-0.141]
99 It is our belief that the algorithm presented in this paper can be used successfully in real-world, large-scale domains where the available knowledge about the underlying structure can be exploited to derive powerful and sufficient factored representations. [sent-165, score-0.098]
100 Markov games as a framework for multi-agent reinforcement learning. [sent-190, score-0.192]
wordName wordTfidf (topN-words)
[('aj', 0.52), ('oi', 0.405), ('lspi', 0.229), ('policy', 0.219), ('actions', 0.214), ('oj', 0.21), ('team', 0.184), ('maximizer', 0.168), ('elimination', 0.145), ('fi', 0.145), ('games', 0.14), ('game', 0.134), ('wj', 0.134), ('opponent', 0.122), ('program', 0.104), ('factored', 0.098), ('markov', 0.094), ('ei', 0.094), ('action', 0.089), ('player', 0.086), ('minimax', 0.082), ('ronald', 0.078), ('carlos', 0.07), ('maximizers', 0.07), ('michail', 0.07), ('fk', 0.07), ('players', 0.067), ('minimizer', 0.065), ('lagoudakis', 0.061), ('multiagent', 0.061), ('exponential', 0.058), ('min', 0.057), ('guestrin', 0.056), ('reinforcement', 0.052), ('agents', 0.052), ('mdps', 0.049), ('iteration', 0.048), ('constraints', 0.047), ('ah', 0.047), ('ok', 0.046), ('state', 0.043), ('dependency', 0.041), ('tight', 0.039), ('om', 0.037), ('solving', 0.037), ('eliminated', 0.037), ('maximize', 0.036), ('chooses', 0.036), ('bi', 0.036), ('duke', 0.035), ('durham', 0.035), ('palyers', 0.035), ('fm', 0.035), ('communication', 0.034), ('parr', 0.031), ('minimizers', 0.031), ('soccer', 0.031), ('naive', 0.03), ('coordination', 0.028), ('daphne', 0.028), ('joint', 0.028), ('linear', 0.028), ('variables', 0.027), ('minimizing', 0.027), ('nc', 0.026), ('canada', 0.026), ('reward', 0.026), ('notice', 0.025), ('basis', 0.024), ('subject', 0.024), ('appear', 0.024), ('permits', 0.023), ('unless', 0.023), ('vancouver', 0.022), ('gains', 0.022), ('albeit', 0.021), ('value', 0.021), ('squares', 0.021), ('ef', 0.02), ('exponentially', 0.02), ('ciency', 0.02), ('constraint', 0.02), ('koller', 0.02), ('tradeoff', 0.02), ('ai', 0.019), ('discounted', 0.019), ('december', 0.019), ('expense', 0.019), ('ensure', 0.019), ('pass', 0.018), ('planning', 0.018), ('architectures', 0.018), ('approximation', 0.017), ('summation', 0.017), ('policies', 0.017), ('ow', 0.017), ('stored', 0.016), ('consistency', 0.016), ('naively', 0.015), ('dearth', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
2 0.20852098 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.19946125 78 nips-2002-Efficient Learning Equilibrium
Author: Ronen I. Brafman, Moshe Tennenholtz
Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1
4 0.19167641 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm’s parameters are easy to set to meet the convergence conditions.
5 0.14964163 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
Author: Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, Alexander Russell
Abstract: We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.
6 0.14630994 134 nips-2002-Learning to Take Concurrent Actions
7 0.13819054 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
8 0.12222175 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
9 0.10883138 111 nips-2002-Independent Components Analysis through Product Density Estimation
10 0.10322866 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
11 0.097924419 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
12 0.094681427 152 nips-2002-Nash Propagation for Loopy Graphical Games
13 0.089003153 20 nips-2002-Adaptive Caching by Refetching
14 0.083607189 205 nips-2002-Value-Directed Compression of POMDPs
15 0.083113819 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
16 0.071922444 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
17 0.063848682 185 nips-2002-Speeding up the Parti-Game Algorithm
18 0.043110061 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
19 0.040950876 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
20 0.03868793 199 nips-2002-Timing and Partial Observability in the Dopamine System
topicId topicWeight
[(0, -0.126), (1, -0.03), (2, -0.346), (3, -0.098), (4, 0.012), (5, -0.116), (6, 0.086), (7, -0.115), (8, -0.028), (9, -0.123), (10, 0.098), (11, 0.106), (12, 0.065), (13, -0.043), (14, 0.008), (15, 0.057), (16, -0.012), (17, -0.017), (18, 0.058), (19, 0.06), (20, -0.018), (21, -0.075), (22, -0.059), (23, -0.056), (24, -0.077), (25, -0.072), (26, 0.004), (27, 0.025), (28, 0.032), (29, 0.008), (30, 0.023), (31, -0.037), (32, 0.012), (33, 0.059), (34, -0.005), (35, 0.027), (36, -0.024), (37, 0.044), (38, -0.015), (39, -0.078), (40, 0.062), (41, 0.011), (42, 0.153), (43, -0.003), (44, 0.016), (45, 0.064), (46, 0.101), (47, -0.065), (48, -0.075), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.96331882 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
2 0.6769287 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.67620385 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm’s parameters are easy to set to meet the convergence conditions.
4 0.67261851 134 nips-2002-Learning to Take Concurrent Actions
Author: Khashayar Rohanimanesh, Sridhar Mahadevan
Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1
5 0.63417536 78 nips-2002-Efficient Learning Equilibrium
Author: Ronen I. Brafman, Moshe Tennenholtz
Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1
6 0.59239668 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
7 0.57285023 20 nips-2002-Adaptive Caching by Refetching
8 0.52584738 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
9 0.4829585 185 nips-2002-Speeding up the Parti-Game Algorithm
10 0.44288322 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
11 0.44123808 152 nips-2002-Nash Propagation for Loopy Graphical Games
12 0.41362989 111 nips-2002-Independent Components Analysis through Product Density Estimation
13 0.38821504 205 nips-2002-Value-Directed Compression of POMDPs
14 0.36192593 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
15 0.30413148 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
16 0.24685864 42 nips-2002-Bias-Optimal Incremental Problem Solving
17 0.24448605 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
18 0.2267233 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
19 0.22669899 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions
20 0.22437507 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
topicId topicWeight
[(11, 0.014), (23, 0.039), (41, 0.015), (42, 0.046), (54, 0.118), (55, 0.032), (57, 0.012), (68, 0.022), (73, 0.024), (74, 0.137), (83, 0.34), (92, 0.027), (98, 0.066)]
simIndex simValue paperId paperTitle
1 0.95788687 200 nips-2002-Topographic Map Formation by Silicon Growth Cones
Author: Brian Taba, Kwabena A. Boahen
Abstract: We describe a self-configuring neuromorphic chip that uses a model of activity-dependent axon remodeling to automatically wire topographic maps based solely on input correlations. Axons are guided by growth cones, which are modeled in analog VLSI for the first time. Growth cones migrate up neurotropin gradients, which are represented by charge diffusing in transistor channels. Virtual axons move by rerouting address-events. We refined an initially gross topographic projection by simulating retinal wave input.
2 0.93588942 91 nips-2002-Field-Programmable Learning Arrays
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machinelearning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
3 0.82788938 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters
Author: Rubén Morales-menéndez, Nando D. Freitas, David Poole
Abstract: This paper discusses the application of particle filtering algorithms to fault diagnosis in complex industrial processes. We consider two ubiquitous processes: an industrial dryer and a level tank. For these applications, we compared three particle filtering variants: standard particle filtering, Rao-Blackwellised particle filtering and a version of RaoBlackwellised particle filtering that does one-step look-ahead to select good sampling regions. We show that the overhead of the extra processing per particle of the more sophisticated methods is more than compensated by the decrease in error and variance.
same-paper 4 0.80247676 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
5 0.64332646 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians.
6 0.53837568 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
7 0.52856213 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
8 0.5263536 154 nips-2002-Neuromorphic Bisable VLSI Synapses with Spike-Timing-Dependent Plasticity
9 0.52288115 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
10 0.51333761 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
11 0.50427473 74 nips-2002-Dynamic Structure Super-Resolution
12 0.49842349 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
13 0.49512935 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
14 0.49467862 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity
15 0.49382135 39 nips-2002-Bayesian Image Super-Resolution
16 0.49181789 4 nips-2002-A Differential Semantics for Jointree Algorithms
17 0.49132392 152 nips-2002-Nash Propagation for Loopy Graphical Games
18 0.4890849 2 nips-2002-A Bilinear Model for Sparse Coding
19 0.48890829 89 nips-2002-Feature Selection by Maximum Marginal Diversity
20 0.48677325 52 nips-2002-Cluster Kernels for Semi-Supervised Learning