nips nips2003 nips2003-55 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ciamac C. Moallemi, Benjamin V. Roy
Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. [sent-4, score-1.357]
2 This protocol requires only local communication and simple computations which are distributed among devices. [sent-5, score-0.541]
3 As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. [sent-7, score-0.717]
4 Our approach builds on policy gradient methods for optimization of Markov decision processes. [sent-8, score-0.445]
5 The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. [sent-9, score-1.07]
6 We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. [sent-10, score-0.565]
7 1 Introduction This paper is motivated by the potential of policy gradient methods as a general approach to designing simple scalable distributed optimization protocols for networks of electronic devices. [sent-11, score-0.697]
8 We offer a general framework for such protocols that builds on ideas from the policy gradient literature. [sent-12, score-0.465]
9 We also explore a specific example involving a network of sensors that aggregates data. [sent-13, score-0.276]
10 In this context, we propose a distributed optimization protocol that minimizes power consumption, delay, and buffer overflow. [sent-14, score-0.676]
11 The proposed approach for designing protocols based on policy gradient methods comprises one contribution of this paper. [sent-15, score-0.43]
12 In addition, this paper offers fundamental contributions to the policy gradient literature. [sent-16, score-0.343]
13 In particular, the kind of protocol we propose can be viewed as extending policy gradient methods to a context involving a team of agents optimizing system behavior through asynchronous distributed computation and parsimonious local communication. [sent-17, score-0.98]
14 Our main theoretical contribution is to show that the dynamics of our protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. [sent-18, score-0.53]
15 Associated with this network is a discrete-time dynamical system with a finite state space W. [sent-23, score-0.173]
16 Each ith action process only transitions when the state w(k) transitions to an element of Wi . [sent-45, score-0.18]
17 At the time of such a transition, the probability that a_i(k) becomes any a_i ∈ A_i is given by π_{θ_i}(a_i | w(k)). [sent-46, score-0.402]
18 While the system is in state w ∈ W and action a ∈ A is applied, each component i receives a reward ri (w, a). [sent-57, score-0.363]
19 The average reward received by the network is r(w, a) = (1/n) Σ_{i=1}^{n} r_i(w, a). [sent-58, score-0.349]
20 (1) θ_i(k + 1) = θ_i(k) + ε χ_i(k), where ε > 0 is a constant step size and χ_i(k) is a noisy estimate of the gradient ∇_{θ_i} λ(θ(k)) computed at component i based on the component’s historically observed states, actions, and rewards, in addition to communication with other components. [sent-63, score-0.285]
21 Our goal is to develop an estimator χi (k) that can be used in an adaptive, asynchronous, and decentralized context, and to establish the convergence of the resulting stochastic approximation scheme. [sent-64, score-0.28]
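As a point of reference, the following minimal sketch (Python; the names update_parameters and chi_i and the step-size value are illustrative, not taken from the paper) shows the kind of local stochastic-approximation update that (1) describes: each component nudges its own parameters by a small step along its noisy gradient estimate.

    import numpy as np

    def update_parameters(theta_i, chi_i, step_size=1e-3):
        # One gradient-ascent step for component i: theta_i(k+1) = theta_i(k) + eps * chi_i(k).
        return theta_i + step_size * chi_i

    # Toy usage: a 2-dimensional parameter and a noisy local gradient estimate.
    theta_i = np.zeros(2)
    chi_i = np.array([0.4, -0.1])
    theta_i = update_parameters(theta_i, chi_i)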
22 Our approach builds on policy gradient algorithms that have been proposed in recent years ([5, 7, 8, 3, 4, 2]). [sent-65, score-0.378]
23 As a starting point, consider a gradient estimation method that is a decentralized variation of the OLPOMDP algorithm of [3, 4, 1]. [sent-66, score-0.268]
24 In this algorithm, each component i maintains and updates an eligibility vector z_i^β(t) ∈ R^{N_i}, defined by (2) z_i^β(k) = Σ_{ℓ=0}^{k} β^{k−ℓ} [∇_{θ_i} π_{θ_i(ℓ)}(a_i(ℓ) | w(ℓ)) / π_{θ_i(ℓ)}(a_i(ℓ) | w(ℓ))] 1{w(ℓ) ∈ W_i}, for some β ∈ (0, 1). [sent-67, score-0.414]
25 Note that while the credit vector z_i(t) can be computed using only local information, the gradient estimate χ_i(t) cannot be computed without knowledge of the global average reward r̄(x(t), a(t)) at each time. [sent-69, score-0.47]
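For concreteness, here is a minimal sketch of the discounted eligibility trace in (2) and of the centralized estimate it would feed if the global average reward were available at every step. This is an illustration under stated assumptions, not the authors' code; grad_log_pi stands for a hypothetical quantity equal to ∇_{θ_i} log π_{θ_i}(a_i | w).

    import numpy as np

    def update_eligibility(z_i, grad_log_pi, state_in_Wi, beta=0.9):
        # Discounted trace of (2): z <- beta * z, plus grad(pi)/pi = grad(log pi) when w(k) is in W_i.
        z_i = beta * z_i
        if state_in_Wi:
            z_i = z_i + grad_log_pi
        return z_i

    def centralized_gradient_estimate(z_i, global_average_reward):
        # Centralized OLPOMDP-style contribution: requires knowledge of the global reward at each time.
        return z_i * global_average_reward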
26 In this paper, we present a simple scalable distributed protocol through which rewards occurring locally at each node are communicated over time across the network and gradient estimates are generated at each node based on local information. [sent-71, score-0.978]
27 Another feature of the protocol is that it is completely decentralized – there is no central processor that aggregates and disseminates rewards. [sent-75, score-0.577]
28 As such, the protocol is robust to isolated changes or failures in the network. [sent-76, score-0.356]
29 In addition to design of the protocol, a significant contribution is in the protocol’s analysis, which we believe requires new ideas beyond what has been employed in the prior policy gradient literature. [sent-77, score-0.343]
30 3 A General Framework for Protocols We will make the following assumption regarding the policies, which is common in the policy gradient literature ([7, 8, 3, 4, 2]). [sent-78, score-0.386]
31 For all i and every w ∈ Wi , ai ∈ Ai , πθi (ai |w) is a continuously differentiable function of θi . [sent-80, score-0.179]
32 Further, for every i, there exists a bounded function L_i(w, a_i, θ) such that for all w ∈ W_i, a_i ∈ A_i, ∇_{θ_i} π_{θ_i}(a_i | w) = π_{θ_i}(a_i | w) L_i(w, a_i, θ). [sent-81, score-0.537]
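In other words, assuming π_{θ_i}(a_i | w) > 0 so that the ratio below is well defined, the function in this assumption is just the score function of the local policy (the same likelihood ratio that appears in the eligibility vector (2)):

    L_i(w, a_i, \theta) = \frac{\nabla_{\theta_i} \pi_{\theta_i}(a_i \mid w)}{\pi_{\theta_i}(a_i \mid w)} = \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid w).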
33 Consider the following gradient estimator: (3) χ_i(k) = z_i^β(k) · (1/n) Σ_{j=1}^{n} Σ_{ℓ=0}^{k} d^α_{ij}(ℓ, k) r_j(ℓ), where we use the shorthand r_j(ℓ) = r_j(w(ℓ), a(ℓ)). [sent-83, score-0.372]
34 Indeed, d^α_{ij}(ℓ, k) is the fraction of the reward r_j(ℓ) at component j that is learned by component i at time k ≥ ℓ. [sent-85, score-0.347]
35 This type of convergence is central to the convergence of the stochastic approximation iteration (1). [sent-102, score-0.269]
36 4 Example: A Sensor Network In this section, we present a model of a wireless network of sensors that gathers and communicates data to a central base station. [sent-107, score-0.429]
37 Our example is motivated by issues arising in the development of sensor network technology being carried out by commercial producers of electronic devices. [sent-108, score-0.643]
38 However, we will not take into account the many complexities associated with real sensor networks. [sent-109, score-0.451]
39 Rather, our objective is to pose a simplified model that motivates and provides a context for discussion of our distributed optimization protocol. [sent-110, score-0.208]
40 1 System Description Consider a network of n sensors and a central base station. [sent-112, score-0.301]
41 Each sensor gathers packets of data through observation of its environment, and these packets of data are relayed through the network to the base station via multi-hop wireless communication. [sent-113, score-1.071]
42 Each sensor retains a queue of packets, each obtained either through sensing or via transmission from another sensor. [sent-114, score-0.678]
43 Packets in a queue are indistinguishable – each is of equal size and must be transferred to the central base station. [sent-115, score-0.299]
44 We take the state of a sensor to be the number of packets in the queue and denote the state of the ith sensor at time k by xi (k). [sent-116, score-1.393]
45 The number of packets in a queue cannot exceed a finite buffer size, which we denote by x. [sent-117, score-0.438]
46 These include (1) packetizing of an observation, (2) reception of a packet from another sensor, (3) transmission of a packet to another sensor, (4) awakening from a period of sleep, (5) termination of a period of attempted reception, and (6) termination of a period of attempted transmission. [sent-119, score-0.627]
47 At the time of a triggering event, the sensor must decide on its next action. [sent-120, score-0.603]
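The triggering events just listed map naturally onto a small data type; the following sketch (illustrative names only, not from the paper) shows one way a simulation of this model might represent them.

    from enum import Enum, auto

    class TriggerEvent(Enum):
        # Events at which a sensor must choose its next action, cf. items (1)-(6) above.
        OBSERVATION_PACKETIZED = auto()
        PACKET_RECEIVED = auto()
        PACKET_TRANSMITTED = auto()
        WOKE_FROM_SLEEP = auto()
        RECEPTION_ATTEMPT_ENDED = auto()
        TRANSMISSION_ATTEMPT_ENDED = auto()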
48 The action taken by the ith sensor at time k is denoted by ai (k). [sent-124, score-0.815]
49 The base station will be thought of as a sensor that has an infinite buffer and perpetually attempts reception. [sent-125, score-0.753]
50 For each i, there is a set N(i) of entities with which the ith sensor can directly communicate. [sent-126, score-0.54]
51 If the ith sensor is attempting transmission of a packet and there is at least one element of N(i) that is simultaneously attempting reception and is closer to the base station than component i, the packet is transferred to the queue of that element. [sent-127, score-1.561]
52 Note that if all the elements of N(i) that are attempting reception are farther from the base station than component i, no packet is transmitted. [sent-129, score-0.415]
53 Observations are made and packetized by each sensor at random times. [sent-130, score-0.501]
54 2 Control Policies and Objective Every sensor employs a control policy that selects an action based on its queue length each time a triggering event occurs. [sent-134, score-1.12]
55 Each ith sensor’s control policy is parameterized by a vector θi ∈ R2 . [sent-136, score-0.35]
56 Given θi , at an event time, if the ith sensor has a non-empty queue, it chooses to transmit with probability θi1 . [sent-137, score-0.683]
57 If the ith sensor does not transmit and its queue is not full, it chooses to receive with probability θi2 . [sent-138, score-0.775]
58 If the sensor does not transmit or receive, then it sleeps. [sent-139, score-0.538]
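A minimal sketch of this two-parameter randomized policy, written in Python for illustration (the function name and the string action labels are assumptions, not the paper's notation):

    import random

    def choose_action(theta_i, queue_length, buffer_size):
        # Transmit w.p. theta_i1 if the queue is non-empty; otherwise receive w.p. theta_i2
        # if the queue is not full; otherwise sleep.
        theta_i1, theta_i2 = theta_i
        if queue_length > 0 and random.random() < theta_i1:
            return "transmit"
        if queue_length < buffer_size and random.random() < theta_i2:
            return "receive"
        return "sleep"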
59 Assume that each sensor has a finite power supply. [sent-141, score-0.451]
60 In order to guarantee a minimum lifespan for the network, we will require that each sensor sleeps at least a fraction fs of the time. [sent-142, score-0.451]
61 If, at any given time, a sensor has not slept for a total fraction of at least f_s of the preceding time T_s, it is forced to sleep and hence is not allowed to transmit or receive. [sent-144, score-0.689]
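The sleep constraint can be enforced with a simple sliding-window check. The sketch below is one plausible way to do it (the representation of the history as a list of booleans is an assumption for illustration; the paper does not specify an implementation):

    def must_sleep(sleep_history, f_s):
        # Force sleep if the fraction of slept steps over the last T_s steps is below f_s.
        # sleep_history: booleans for the preceding T_s time steps (True = slept).
        if not sleep_history:
            return False
        slept_fraction = sum(sleep_history) / len(sleep_history)
        return slept_fraction < f_s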
62 The objective is to minimize a weighted sum of the average delay and average number of dropped packets per unit of time. [sent-145, score-0.266]
63 Delay can be thought of as the amount of time a packet spends in the network before arriving at the base station. [sent-146, score-0.47]
64 Formally, the objective is to choose θ_1, …, θ_n to minimize lim_{K→∞} (1/K) Σ_{k=0}^{K−1} (1/n) Σ_{i=1}^{n} (x_i(k) + ξ D_i(k)), where D_i(k) is the number of packets dropped by sensor i at time k, and ξ is a weight reflecting the relative importance of delay and dropped packets. [sent-150, score-0.834]
65 5 Distributed Optimization Protocol We now describe a simple protocol by which components a the network can communicate rewards, in a fashion that satisfies the requirements of Theorem 1 and hence will produce good gradient estimates. [sent-151, score-0.714]
66 This protocol communicates the rewards across the network over time using a distributed averaging procedure. [sent-152, score-0.771]
67 Imagine each component i in the network is given a real value Ri . [sent-154, score-0.191]
68 Our goal is to design an asynchronous distributed protocol through which each node will obtain the average R = (1/n) Σ_{i=1}^{n} R_i. [sent-155, score-0.517]
69 In the context of our distributed optimization protocol, we will assume that each component i maintains a scalar value Yi (k) at time k representing an estimate of the global reward. [sent-164, score-0.429]
70 Here, (r_1(k), …, r_n(k)) is a vector of rewards occurring at time k. [sent-178, score-0.208]
71 Since the process (w(k), a(k)) is aperiodic and irreducible (Assumption 1), this assumption guarantees that every edge on a connected subset of edges is sampled infinitely often. [sent-184, score-0.213]
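To illustrate the underlying averaging idea, the following sketch performs repeated pairwise averaging over the edges of a connected graph; it is far simpler than the asynchronous, reward-tracking protocol analyzed here, but it shows why repeated pairwise exchanges drive all nodes toward the network average.

    import random

    def pairwise_average(values, edges, num_rounds=1000):
        # Each round, sample an edge (i, j) and replace both endpoint values by their midpoint.
        # The sum is preserved, so on a connected graph all values converge to the average.
        values = list(values)
        for _ in range(num_rounds):
            i, j = random.choice(edges)
            midpoint = (values[i] + values[j]) / 2.0
            values[i] = midpoint
            values[j] = midpoint
        return values

    # Example: a 4-node line graph; every entry ends up near the average 4.0.
    print(pairwise_average([1.0, 5.0, 2.0, 8.0], edges=[(0, 1), (1, 2), (2, 3)]))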
72 Policy parameters are updated at each component according to the rule: (4) θ_i(k + 1) = θ_i(k) + ε z_i^β(k)(1 − α)Y_i(k). [sent-185, score-0.209]
73 The following theorem, which relies on a general stochastic approximation result from [6] together with custom analysis available in our appendix [9], establishes the convergence of the distributed stochastic iteration method defined by (4). [sent-187, score-0.353]
74 We expect our protocol to be used in an online fashion, where it is ideal to be adaptive to long-term changes in network topology or dynamics of the environment. [sent-203, score-0.49]
75 In practical numerical implementations, choices of the policy parameters θ_i would be constrained to bounded sets H_i ⊂ R^{N_i}. [sent-206, score-0.237]
76 1 Relation to the Example In the example of Section 4, one approach to implementing our distributed optimization protocol involves passing messages associated with the optimization protocol alongside normal network traffic, as we will now explain. [sent-211, score-1.075]
77 Each sensor i should maintain and update two vectors: a parameter vector θ_i(k) ∈ R^2 and an eligibility vector z_i^β(k). [sent-212, score-0.733]
78 If a triggering event occurs at sensor i at time k, the eligibility vector is updated according to z_i^β(k) = β z_i^β(k − 1) + ∇_{θ_i} π_{θ_i(k)}(a_i(k) | w(k)) / π_{θ_i(k)}(a_i(k) | w(k)). [sent-213, score-0.886]
79 Furthermore, each sensor maintains an estimate Yi (k) of the global reward. [sent-215, score-0.53]
80 At each time k, each ith sensor observes a reward (negative cost) of ri (k) = −xi (k) − ξDi (k). [sent-216, score-0.799]
81 If two neighboring sensors are both awake at a time k, they communicate their global reward estimates from the previous time. [sent-217, score-0.311]
82 If the ith sensor is not involved in a reward communication event at that time, its global reward estimate is updated according to Yi (k) = αYi (k − 1) + ri (k). [sent-218, score-1.099]
83 On the other hand, at any time k at which there is a communication event, its global reward estimate is updated according to Y_i(k) = r_i(k) + α(Y_i(k − 1) + Y_j(k − 1))/2, where j is the index of the sensor with which communication occurs. [sent-219, score-0.957]
84 If communication occurs with multiple neighbors, the corresponding global reward estimates are averaged pairwise in an arbitrary order. [sent-220, score-0.254]
85 In this context, the graph Ê contains an edge for each pair of neighbors in the sensor network, where the neighborhood relations are captured by N, as introduced in Section 4. [sent-222, score-0.546]
86 To optimize performance over time, each ith sensor would update its parameter values according to our stochastic approximation iteration (4). [sent-223, score-0.638]
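Putting the pieces together, one illustrative per-sensor step might look like the sketch below (a sketch under the assumptions stated in the text, not the authors' code; the constants ε, α, β, ξ are as above, and grad_log_pi is the likelihood-ratio term of the eligibility update, passed in only when a triggering event occurs):

    def sensor_step(theta_i, z_i, Y_prev_i, x_i, D_i, neighbor_Y_prev,
                    grad_log_pi=None, eps=1e-4, alpha=0.99, beta=0.9, xi=1.0):
        # theta_i, z_i, grad_log_pi: parameter/eligibility vectors (e.g. numpy arrays of length 2).
        r_i = -x_i - xi * D_i                                   # local reward (negative cost)
        if neighbor_Y_prev is None:                             # no communication event
            Y_i = alpha * Y_prev_i + r_i
        else:                                                   # pairwise averaging with an awake neighbor j
            Y_i = r_i + alpha * (Y_prev_i + neighbor_Y_prev) / 2.0
        if grad_log_pi is not None:                             # triggering event: update eligibility
            z_i = beta * z_i + grad_log_pi
        theta_i = theta_i + eps * z_i * (1 - alpha) * Y_i       # stochastic approximation step (4)
        return theta_i, z_i, Y_i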
87 To highlight the simplicity of this protocol, note that each sensor need only maintain and update a few numerical values. [sent-224, score-0.483]
88 Furthermore, the only communication required by the optimization protocol is that an extra scalar numerical value be transmitted and an extra scalar numerical value be received during the reception or transmission of any packet. [sent-225, score-0.868]
89 Here, at every time step, an observation arrives at a sensor with a 0.02 probability, and each sensor maintains a queue of up to 20 observations. [sent-227, score-0.495] [sent-228, score-0.645]
91 Policy parameters θi1 and θi2 for each sensor i are constrained to lie in the interval [0. [sent-229, score-0.451]
92 (Note that for this set of parameters, the chance of a buffer overflow is very small, and hence did not occur in our simulations.) [sent-232, score-0.192]
93 A baseline policy is defined by having leaf nodes transmit with maximum probability, and interior nodes splitting their time roughly evenly between transmission and reception, when not forced to sleep by the power constraint. [sent-233, score-0.488]
94 Applying our decentralized optimization method to this example, it is clear from Figure 2 that the performance of the network is quickly and dramatically improved. [sent-234, score-0.331]
95 Further, the algorithm achieves qualitatively similar performance to gradient optimization using the centralized OLPOMDP method of [3, 4, 1]; hence, decentralization comes at no cost. [sent-236, score-0.239]
96 6 Remarks and Further Issues We are encouraged by the simplicity and scalability of the distributed optimization protocol we have presented. [sent-237, score-0.518]
97 We believe that this protocol represents both an interesting direction for practical applications involving networks of electronic devices and a significant step in the policy gradient literature. [sent-238, score-0.798]
98 However, there is an important outstanding issue that needs to be addressed to assess the potential of this approach: whether or not parameters can be adapted fast enough for this protocol to be useful in applications. [sent-239, score-0.356]
99 [Figure: performance plotted against iteration; x-axis labeled "Iteration", in units of 10^6 iterations.] Two factors are relevant to this issue: (1) the variance of gradient estimates and (2) the convergence rate of the underlying ODE. [sent-248, score-0.203]
100 Acknowledgements The authors thank Abbas El Gamal, Abtin Keshavarzian, Balaji Prabhakar, and Elif Uysal for stimulating conversations on sensor network models and applications. [sent-251, score-0.585]
wordName wordTfidf (topN-words)
[('sensor', 0.451), ('protocol', 0.356), ('policy', 0.205), ('ai', 0.179), ('packet', 0.174), ('buffer', 0.158), ('qs', 0.149), ('queue', 0.148), ('gradient', 0.138), ('network', 0.134), ('packets', 0.132), ('reward', 0.131), ('decentralized', 0.13), ('reception', 0.13), ('zi', 0.118), ('ws', 0.11), ('triggering', 0.108), ('distributed', 0.095), ('rewards', 0.092), ('communication', 0.09), ('ith', 0.089), ('transmit', 0.087), ('protocols', 0.087), ('ri', 0.084), ('wi', 0.081), ('transmission', 0.079), ('ji', 0.075), ('base', 0.075), ('eligibility', 0.075), ('olpomdp', 0.075), ('dropped', 0.073), ('sleep', 0.073), ('station', 0.069), ('yi', 0.068), ('optimization', 0.067), ('asynchronous', 0.066), ('convergence', 0.065), ('delay', 0.061), ('electronic', 0.058), ('rj', 0.058), ('component', 0.057), ('control', 0.056), ('stanford', 0.056), ('event', 0.056), ('action', 0.052), ('actions', 0.052), ('communicate', 0.052), ('sensors', 0.051), ('edge', 0.05), ('aggregates', 0.05), ('ciamac', 0.05), ('communicates', 0.05), ('limk', 0.05), ('marbach', 0.05), ('moallemi', 0.05), ('packetized', 0.05), ('rni', 0.05), ('stochastic', 0.05), ('lim', 0.049), ('iteration', 0.048), ('scalable', 0.047), ('context', 0.046), ('maintains', 0.046), ('appendix', 0.045), ('neighborhood', 0.045), ('time', 0.044), ('aperiodic', 0.043), ('baxter', 0.043), ('gathers', 0.043), ('ode', 0.043), ('spends', 0.043), ('assumption', 0.043), ('central', 0.041), ('involving', 0.041), ('scalar', 0.041), ('attempting', 0.04), ('communicated', 0.039), ('irreducible', 0.039), ('state', 0.039), ('rn', 0.039), ('nitely', 0.038), ('edges', 0.038), ('ordinary', 0.036), ('bartlett', 0.036), ('builds', 0.035), ('termination', 0.035), ('consumption', 0.035), ('transferred', 0.035), ('delays', 0.035), ('wireless', 0.035), ('policies', 0.035), ('establish', 0.035), ('hence', 0.034), ('updated', 0.034), ('theorem', 0.034), ('global', 0.033), ('occurring', 0.033), ('team', 0.033), ('events', 0.032), ('numerical', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 55 nips-2003-Distributed Optimization in Adaptive Networks
Author: Ciamac C. Moallemi, Benjamin V. Roy
Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1
2 0.17008182 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.16962156 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
4 0.15252694 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
5 0.1426466 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
Author: Cynthia Archer, Todd K. Leen, António M. Baptista
Abstract: As part of an environmental observation and forecasting system, sensors deployed in the Columbia RIver Estuary (CORIE) gather information on physical dynamics and changes in estuary habitat. Of these, salinity sensors are particularly susceptible to biofouling, which gradually degrades sensor response and corrupts critical data. Automatic fault detectors have the capability to identify bio-fouling early and minimize data loss. Complicating the development of discriminatory classifiers is the scarcity of bio-fouling onset examples and the variability of the bio-fouling signature. To solve these problems, we take a novelty detection approach that incorporates a parameterized bio-fouling model. These detectors identify the occurrence of bio-fouling, and its onset time as reliably as human experts. Real-time detectors installed during the summer of 2001 produced no false alarms, yet detected all episodes of sensor degradation before the field staff scheduled these sensors for cleaning. From this initial deployment through February 2003, our bio-fouling detectors have essentially doubled the amount of useful data coming from the CORIE sensors. 1
6 0.11762726 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
7 0.11496966 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
8 0.10966564 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
9 0.10089295 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
10 0.092686452 42 nips-2003-Bounded Finite State Controllers
11 0.091140643 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
12 0.088984743 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
13 0.081592493 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
14 0.078407943 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
15 0.077661462 62 nips-2003-Envelope-based Planning in Relational MDPs
16 0.075834796 68 nips-2003-Eye Movements for Reward Maximization
17 0.073811263 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron
18 0.073299617 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
19 0.072660439 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
20 0.071561351 155 nips-2003-Perspectives on Sparse Bayesian Learning
topicId topicWeight
[(0, -0.233), (1, 0.221), (2, -0.083), (3, 0.003), (4, 0.0), (5, 0.041), (6, -0.024), (7, -0.106), (8, -0.049), (9, 0.007), (10, -0.029), (11, 0.016), (12, 0.07), (13, 0.022), (14, -0.103), (15, -0.002), (16, -0.093), (17, -0.058), (18, -0.02), (19, 0.044), (20, 0.077), (21, 0.046), (22, 0.047), (23, 0.143), (24, -0.143), (25, -0.129), (26, 0.039), (27, 0.058), (28, -0.092), (29, -0.033), (30, -0.099), (31, 0.069), (32, -0.065), (33, 0.198), (34, -0.082), (35, 0.09), (36, 0.046), (37, 0.144), (38, -0.008), (39, 0.052), (40, -0.056), (41, -0.005), (42, 0.004), (43, 0.105), (44, 0.032), (45, -0.081), (46, -0.045), (47, 0.027), (48, 0.029), (49, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.96950209 55 nips-2003-Distributed Optimization in Adaptive Networks
Author: Ciamac C. Moallemi, Benjamin V. Roy
Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1
2 0.68410772 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
3 0.67028326 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
Author: Cynthia Archer, Todd K. Leen, António M. Baptista
Abstract: As part of an environmental observation and forecasting system, sensors deployed in the Columbia RIver Estuary (CORIE) gather information on physical dynamics and changes in estuary habitat. Of these, salinity sensors are particularly susceptible to biofouling, which gradually degrades sensor response and corrupts critical data. Automatic fault detectors have the capability to identify bio-fouling early and minimize data loss. Complicating the development of discriminatory classifiers is the scarcity of bio-fouling onset examples and the variability of the bio-fouling signature. To solve these problems, we take a novelty detection approach that incorporates a parameterized bio-fouling model. These detectors identify the occurrence of bio-fouling, and its onset time as reliably as human experts. Real-time detectors installed during the summer of 2001 produced no false alarms, yet detected all episodes of sensor degradation before the field staff scheduled these sensors for cleaning. From this initial deployment through February 2003, our bio-fouling detectors have essentially doubled the amount of useful data coming from the CORIE sensors. 1
4 0.57732022 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
5 0.5452984 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
6 0.53449559 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
7 0.50420982 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.48336184 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
9 0.46441728 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
10 0.44511947 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.43347478 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron
12 0.41265446 42 nips-2003-Bounded Finite State Controllers
13 0.3919321 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
14 0.39009646 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
15 0.38594323 187 nips-2003-Training a Quantum Neural Network
16 0.3840889 68 nips-2003-Eye Movements for Reward Maximization
17 0.35742185 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
18 0.35394865 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
19 0.34606677 102 nips-2003-Large Scale Online Learning
20 0.33628219 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
topicId topicWeight
[(0, 0.047), (2, 0.266), (11, 0.037), (29, 0.012), (30, 0.016), (35, 0.055), (49, 0.011), (53, 0.11), (69, 0.02), (71, 0.053), (76, 0.049), (85, 0.072), (91, 0.139), (99, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.84496754 55 nips-2003-Distributed Optimization in Adaptive Networks
Author: Ciamac C. Moallemi, Benjamin V. Roy
Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1
2 0.62335432 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.62233377 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
4 0.61598599 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
Author: Maneesh Sahani
Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1
5 0.61475444 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
6 0.61385816 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
7 0.61253947 113 nips-2003-Learning with Local and Global Consistency
8 0.61233974 30 nips-2003-Approximability of Probability Distributions
9 0.60994786 81 nips-2003-Geometric Analysis of Constrained Curves
10 0.60945612 68 nips-2003-Eye Movements for Reward Maximization
11 0.60871142 73 nips-2003-Feature Selection in Clustering Problems
12 0.60686564 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
13 0.60602915 158 nips-2003-Policy Search by Dynamic Programming
14 0.60546076 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
15 0.60493505 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
16 0.60456741 126 nips-2003-Measure Based Regularization
17 0.6041109 143 nips-2003-On the Dynamics of Boosting
18 0.60343504 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
19 0.60332519 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
20 0.60100818 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms