nips nips2003 nips2003-55 knowledge-graph by maker-knowledge-mining

55 nips-2003-Distributed Optimization in Adaptive Networks


Source: pdf

Author: Ciamac C. Moallemi, Benjamin V. Roy

Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. [sent-4, score-1.357]

2 This protocol requires only local communication and simple computations which are distributed among devices. [sent-5, score-0.541]

3 As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. [sent-7, score-0.717]

4 Our approach builds on policy gradient methods for optimization of Markov decision processes. [sent-8, score-0.445]

5 The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. [sent-9, score-1.07]

6 We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. [sent-10, score-0.565]

7 1 Introduction This paper is motivated by the potential of policy gradient methods as a general approach to designing simple scalable distributed optimization protocols for networks of electronic devices. [sent-11, score-0.697]

8 We offer a general framework for such protocols that builds on ideas from the policy gradient literature. [sent-12, score-0.465]

9 We also explore a specific example involving a network of sensors that aggregates data. [sent-13, score-0.276]

10 In this context, we propose a distributed optimization protocol that minimizes power consumption, delay, and buffer overflow. [sent-14, score-0.676]

11 The proposed approach for designing protocols based on policy gradient methods comprises one contribution of this paper. [sent-15, score-0.43]

12 In addition, this paper offers fundamental contributions to the policy gradient literature. [sent-16, score-0.343]

13 In particular, the kind of protocol we propose can be viewed as extending policy gradient methods to a context involving a team of agents optimizing system behavior through asynchronous distributed computation and parsimonious local communication. [sent-17, score-0.98]

14 Our main theoretical contribution is to show that the dynamics of our protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. [sent-18, score-0.53]

15 Associated with this network is a discrete-time dynamical system with a finite state space W. [sent-23, score-0.173]

16 Each ith action process only transitions when the state w(k) transitions to an element of Wi . [sent-45, score-0.18]

17 At the time of a transition, the probability that $a_i(k)$ becomes any $a_i \in A_i$ is given by $\pi^i_{\theta_i}(a_i \mid w(k))$. [sent-46, score-0.402]

18 While the system is in state w ∈ W and action a ∈ A is applied, each component i receives a reward ri (w, a). [sent-57, score-0.363]

19 The average reward received by the network is $r(w, a) = \frac{1}{n} \sum_{i=1}^{n} r_i(w, a)$. [sent-58, score-0.349]

20 $\theta_i(k+1) = \theta_i(k) + \epsilon\, \chi_i(k)$. (1) Here, $\epsilon > 0$ is a constant step size and $\chi_i(k)$ is a noisy estimate of the gradient $\nabla_{\theta_i} \lambda(\theta(k))$ computed at component i based on the component’s historically observed states, actions, and rewards, in addition to communication with other components. [sent-63, score-0.285]

21 Our goal is to develop an estimator χi (k) that can be used in an adaptive, asynchronous, and decentralized context, and to establish the convergence of the resulting stochastic approximation scheme. [sent-64, score-0.28]

22 Our approach builds on policy gradient algorithms that have been proposed in recent years ([5, 7, 8, 3, 4, 2]). [sent-65, score-0.378]

23 As a starting point, consider a gradient estimation method that is a decentralized variation of the OLPOMDP algorithm of [3, 4, 1]. [sent-66, score-0.268]

24 In this algorithm, each component i maintains and updates an eligibility vector $z_i^\beta(k) \in \mathbb{R}^{N_i}$, defined by
$$z_i^\beta(k) \;=\; \sum_{\ell=0}^{k} \beta^{\,k-\ell}\, \frac{\nabla_{\theta_i} \pi^i_{\theta_i(\ell)}\bigl(a_i(\ell)\mid w(\ell)\bigr)}{\pi^i_{\theta_i(\ell)}\bigl(a_i(\ell)\mid w(\ell)\bigr)}\, 1_{\{w(\ell)\in W_i\}}, \qquad (2)$$
for some $\beta \in (0,1)$. [sent-67, score-0.414]
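
To make the recursion concrete, the following minimal Python sketch computes the discounted eligibility trace of (2) in its equivalent recursive form (the same recursion reappears in Section 5.1). The function and argument names are illustrative and not taken from the paper.

import numpy as np

def update_eligibility(z_prev, grad_log_pi, triggered, beta=0.9):
    # One step of the discounted eligibility trace of (2), written recursively:
    #   z_i(k) = beta * z_i(k-1) + [grad_{theta_i} pi^i / pi^i](a_i(k) | w(k)) * 1{w(k) in W_i}
    # grad_log_pi: the score vector nabla_{theta_i} log pi^i at the current state-action pair.
    z = beta * np.asarray(z_prev, dtype=float)
    if triggered:  # accumulate only when w(k) lies in W_i
        z = z + np.asarray(grad_log_pi, dtype=float)
    return z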

25 Note that while the credit vector $z_i(t)$ can be computed using only local information, the gradient estimate $\chi_i(t)$ cannot be computed without knowledge of the global reward $\bar{r}(x(t), a(t))$ at each time. [sent-69, score-0.47]

26 In this paper, we present a simple scalable distributed protocol through which rewards occurring locally at each node are communicated over time across the network and gradient estimates are generated at each node based on local information. [sent-71, score-0.978]

27 Another feature of the protocol is that it is completely decentralized – there is no central processor that aggregates and disseminates rewards. [sent-75, score-0.577]

28 As such, the protocol is robust to isolated changes or failures in the network. [sent-76, score-0.356]

29 In addition to design of the protocol, a significant contribution is in the protocol’s analysis, which we believe requires new ideas beyond what has been employed in the prior policy gradient literature. [sent-77, score-0.343]

30 3 A General Framework for Protocols We will make the following assumption regarding the policies, which is common in the policy gradient literature ([7, 8, 3, 4, 2]). [sent-78, score-0.386]

31 For all i and every w ∈ Wi , ai ∈ Ai , πθi (ai |w) is a continuously differentiable function of θi . [sent-80, score-0.179]

32 Further, for every i, there exists a bounded function $L_i(w, a_i, \theta)$ such that for all $w \in W_i$, $a_i \in A_i$, $\nabla_{\theta_i} \pi^i_{\theta_i}(a_i \mid w) = \pi^i_{\theta_i}(a_i \mid w)\, L_i(w, a_i, \theta)$. [sent-81, score-0.537]

33 Consider the following gradient estimator:
$$\chi_i(k) \;=\; z_i^\beta(k)\, \frac{1}{n} \sum_{j=1}^{n} \sum_{\ell=0}^{k} d^\alpha_{ij}(\ell, k)\, r_j(\ell), \qquad (3)$$
where we use the shorthand $r_j(\ell) = r_j(w(\ell), a(\ell))$. [sent-83, score-0.372]

34 Indeed, $d^\alpha_{ij}(\ell, k)$ is the fraction of the reward $r_j(\ell)$ at component j that is learned by component i at time $k \ge \ell$. [sent-85, score-0.347]
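
As a rough illustration (not code from the paper), the estimator (3) could be evaluated directly if the fractions $d^\alpha_{ij}(\ell,k)$ and the full reward history were available in one place; in the actual protocol these quantities are never assembled centrally, which is exactly what the distributed averaging scheme of Section 5 avoids. All names below are assumptions made for the sketch.

import numpy as np

def gradient_estimate(z_k, d, reward_hist, k):
    # chi_i(k) = z_i^beta(k) * (1/n) * sum_j sum_{l=0..k} d^alpha_{ij}(l, k) * r_j(l)
    # z_k          : eligibility trace of component i at time k (numpy vector)
    # d[j][l]      : fraction of reward r_j(l) learned by component i by time k
    # reward_hist  : reward_hist[j][l] = r_j(w(l), a(l))
    n = len(reward_hist)
    aggregate = sum(d[j][l] * reward_hist[j][l]
                    for j in range(n) for l in range(k + 1)) / n
    return np.asarray(z_k, dtype=float) * aggregate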

35 This type of convergence is central to the convergence of the stochastic approximation iteration (1). [sent-102, score-0.269]

36 4 Example: A Sensor Network In this section, we present a model of a wireless network of sensors that gathers and communicates data to a central base station. [sent-107, score-0.429]

37 Our example is motivated by issues arising in the development of sensor network technology being carried out by commercial producers of electronic devices. [sent-108, score-0.643]

38 However, we will not take into account the many complexities associated with real sensor networks. [sent-109, score-0.451]

39 Rather, our objective is to pose a simplified model that motivates and provides a context for discussion of our distributed optimization protocol. [sent-110, score-0.208]

40 4.1 System Description Consider a network of n sensors and a central base station. [sent-112, score-0.301]

41 Each sensor gathers packets of data through observation of its environment, and these packets of data are relayed through the network to the base station via multi-hop wireless communication. [sent-113, score-1.071]

42 Each sensor retains a queue of packets, each obtained either through sensing or via transmission from another sensor. [sent-114, score-0.678]

43 Packets in a queue are indistinguishable – each is of equal size and must be transferred to the central base station. [sent-115, score-0.299]

44 We take the state of a sensor to be the number of packets in the queue and denote the state of the ith sensor at time k by xi (k). [sent-116, score-1.393]

45 The number of packets in a queue cannot exceed a finite buffer size, which we denote by x. [sent-117, score-0.438]

46 These include (1) packetizing of an observation, (2) reception of a packet from another sensor, (3) transmission of a packet to another sensor, (4) awakening from a period of sleep, (5) termination of a period of attempted reception, and (6) termination of a period of attempted transmission. [sent-119, score-0.627]

47 At the time of a triggering event, the sensor must decide on its next action. [sent-120, score-0.603]

48 The action taken by the ith sensor at time k is denoted by ai (k). [sent-124, score-0.815]

49 The base station will be thought of as a sensor that has an infinite buffer and perpetually attempts reception. [sent-125, score-0.753]

50 For each i, there is a set N(i) of entities with which the ith sensor can directly communicate. [sent-126, score-0.54]

51 If the ith sensor is attempting transmission of a packet and there is at least one element of N(i) that is simultaneously attempting reception and is closer to the base station than component i, the packet is transferred to the queue of that element. [sent-127, score-1.561]

52 Note that if among the elements of N(i) that are attempting reception, all are further away from the base station than component i, no packet is transmitted. [sent-129, score-0.415]
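
The transfer rule described in the two preceding sentences can be sketched as follows, assuming per-sensor Python lists as queues, a map of distances to the base station, and the neighbor sets N(i); these data structures and the tie-breaking choice among eligible receivers are illustrative assumptions, not details from the paper.

def attempt_transfer(i, queues, receiving, dist_to_base, neighbors):
    # Sensor i attempts transmission: a packet moves to some neighbor j that is
    # simultaneously attempting reception and is closer to the base station than i.
    # If no such neighbor exists, no packet is transmitted.
    if not queues[i]:
        return None
    candidates = [j for j in neighbors[i]
                  if receiving[j] and dist_to_base[j] < dist_to_base[i]]
    if not candidates:
        return None
    j = candidates[0]  # arbitrary choice among eligible receivers
    queues[j].append(queues[i].pop(0))
    return j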

53 Observations are made and packetized by each sensor at random times. [sent-130, score-0.501]

54 4.2 Control Policies and Objective Every sensor employs a control policy that selects an action based on its queue length each time a triggering event occurs. [sent-134, score-1.12]

55 Each ith sensor’s control policy is parameterized by a vector θi ∈ R2 . [sent-136, score-0.35]

56 Given θi , at an event time, if the ith sensor has a non-empty queue, it chooses to transmit with probability θi1 . [sent-137, score-0.683]

57 If the ith sensor does not transmit and its queue is not full, it chooses to receive with probability θi2 . [sent-138, score-0.775]

58 If the sensor does not transmit or receive, then it sleeps. [sent-139, score-0.538]

59 Assume that each sensor has a finite power supply. [sent-141, score-0.451]

60 In order to guarantee a minimum lifespan for the network, we will require that each sensor sleeps at least a fraction fs of the time. [sent-142, score-0.451]

61 If, at any given time, a sensor has not slept for a total fraction of at least $f_s$ of the preceding time $T_s$, it is forced to sleep and hence not allowed to transmit or receive. [sent-144, score-0.689]
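
A minimal sketch of how a sensor might draw its action under this two-parameter policy, with the power constraint exposed as a boolean flag; the function name, the flag, and the use of Python's random module are assumptions made for illustration.

import random

def choose_action(theta_i, queue_len, buffer_size, forced_to_sleep):
    # theta_i = (theta_i1, theta_i2): transmit and receive probabilities.
    if forced_to_sleep:  # slept less than the fraction f_s over the preceding time T_s
        return "sleep"
    if queue_len > 0 and random.random() < theta_i[0]:
        return "transmit"
    if queue_len < buffer_size and random.random() < theta_i[1]:
        return "receive"
    return "sleep"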

62 The objective is to minimize a weighted sum of the average delay and average number of dropped packets per unit of time. [sent-145, score-0.266]

63 Delay can be thought of as the amount of time a packet spends in the network before arriving at the base station. [sent-146, score-0.47]

64 The objective is
$$\min_{\theta_1, \ldots, \theta_n}\; \lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} \frac{1}{n} \sum_{i=1}^{n} \bigl(x_i(k) + \xi D_i(k)\bigr),$$
where $D_i(k)$ is the number of packets dropped by sensor i at time k, and $\xi$ is a weight reflecting the relative importance of delay and dropped packets. [sent-150, score-0.834]
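
An empirical, finite-horizon version of this objective can be computed from a simulated trajectory as in the sketch below; the variable names are illustrative assumptions.

def empirical_cost(queue_hist, drop_hist, xi):
    # queue_hist[k][i] = x_i(k), drop_hist[k][i] = D_i(k)
    # Returns (1/K) * sum_k (1/n) * sum_i (x_i(k) + xi * D_i(k)).
    K = len(queue_hist)
    n = len(queue_hist[0])
    total = sum(queue_hist[k][i] + xi * drop_hist[k][i]
                for k in range(K) for i in range(n))
    return total / (K * n)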

65 5 Distributed Optimization Protocol We now describe a simple protocol by which components of the network can communicate rewards, in a fashion that satisfies the requirements of Theorem 1 and hence will produce good gradient estimates. [sent-151, score-0.714]

66 This protocol communicates the rewards across the network over time using a distributed averaging procedure. [sent-152, score-0.771]

67 Imagine each component i in the network is given a real value Ri . [sent-154, score-0.191]

68 Our goal is to design an asynchronous distributed protocol through which each node will obtain the average $R = \frac{1}{n} \sum_{i=1}^{n} R_i$. [sent-155, score-0.517]
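
The underlying idea is pairwise averaging over the edges of a connected graph: repeating the step below over edges that are sampled infinitely often drives every node's value toward the common average. This toy sketch handles static values R_i only, whereas the protocol of this section also folds new rewards in at every step; the function name is an assumption.

def pairwise_average(values, i, j):
    # One asynchronous averaging event between neighbors i and j: both replace
    # their values with the mean of the two.  The network-wide sum, and hence
    # the eventual consensus value sum(values)/n, is preserved.
    m = (values[i] + values[j]) / 2.0
    values[i] = m
    values[j] = m
    return values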

69 In the context of our distributed optimization protocol, we will assume that each component i maintains a scalar value Yi (k) at time k representing an estimate of the global reward. [sent-164, score-0.429]

70 Here $(r_1(k), \ldots, r_n(k))$ is a vector of rewards occurring at time k. [sent-178, score-0.208]

71 Since the process (w(k), a(k)) is aperiodic and irreducible (Assumption 1), this assumption guarantees that every edge on a connected subset of edges is sampled infinitely often. [sent-184, score-0.213]

72 Policy parameters are updated at each component according to the rule
$$\theta_i(k+1) \;=\; \theta_i(k) + \epsilon\, z_i^\beta(k)\,(1-\alpha)\, Y_i(k). \qquad (4)$$
[sent-185, score-0.209]
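
In code, the update (4) is a single vector operation per component. This is a minimal sketch assuming the same constant step size as in iteration (1); the argument names are illustrative.

import numpy as np

def update_parameters(theta_i, z_i, Y_i, alpha, step_size):
    # Stochastic-approximation step (4):
    #   theta_i(k+1) = theta_i(k) + step_size * z_i^beta(k) * (1 - alpha) * Y_i(k)
    return np.asarray(theta_i, dtype=float) + step_size * (1.0 - alpha) * Y_i * np.asarray(z_i, dtype=float)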

73 The following theorem, which relies on a general stochastic approximation result from [6] together with custom analysis available in our appendix [9], establishes the convergence of the distributed stochastic iteration method defined by (4). [sent-187, score-0.353]

74 We expect our protocol to be used in an online fashion, where it is ideal to be adaptive to long-term changes in network topology or dynamics of the environment. [sent-203, score-0.49]

75 In practical numerical implementations, choices of the policy parameters $\theta_i$ would be constrained to bounded sets $H_i \subset \mathbb{R}^{N_i}$. [sent-206, score-0.237]

76 5.1 Relation to the Example In the example of Section 4, one approach to implementing our distributed optimization protocol involves passing messages associated with the optimization protocol alongside normal network traffic, as we will now explain. [sent-211, score-1.075]

77 Each ith sensor should maintain and update two vectors: a parameter vector $\theta_i(k) \in \mathbb{R}^2$ and an eligibility vector $z_i^\beta(k)$. [sent-212, score-0.733]

78 If a triggering event occurs at sensor i at time k, the eligibility vector is updated according to
$$z_i^\beta(k) \;=\; \beta\, z_i^\beta(k-1) \;+\; \frac{\nabla_{\theta_i} \pi^i_{\theta_i(k)}\bigl(a_i(k)\mid w(k)\bigr)}{\pi^i_{\theta_i(k)}\bigl(a_i(k)\mid w(k)\bigr)}.$$
[sent-213, score-0.886]

79 Furthermore, each sensor maintains an estimate Yi (k) of the global reward. [sent-215, score-0.53]

80 At each time k, each ith sensor observes a reward (negative cost) of ri (k) = −xi (k) − ξDi (k). [sent-216, score-0.799]

81 If two neighboring sensors are both not asleep at a time k, they communicate their global reward estimates from the previous time. [sent-217, score-0.311]

82 If the ith sensor is not involved in a reward communication event at that time, its global reward estimate is updated according to Yi (k) = αYi (k − 1) + ri (k). [sent-218, score-1.099]

83 On the other hand, at any time k that there is a communication event, its global reward estimate is updated according to $Y_i(k) = r_i(k) + \alpha\,\bigl(Y_i(k-1) + Y_j(k-1)\bigr)/2$, where j is the index of the sensor with which communication occurs. [sent-219, score-0.957]

84 If communication occurs with multiple neighbors, the corresponding global reward estimates are averaged pairwise in an arbitrary order. [sent-220, score-0.254]
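
The two update rules above can be folded into one helper, sketched here under the assumption (stated earlier in this subsection) that the estimates exchanged during a communication event are those from the previous time step; the function and argument names are illustrative.

def update_reward_estimate(Y_prev, r_i, alpha, neighbor_Y_prev=None):
    # No communication event:        Y_i(k) = alpha * Y_i(k-1) + r_i(k)
    # Communication with neighbor j: Y_i(k) = r_i(k) + alpha * (Y_i(k-1) + Y_j(k-1)) / 2
    # Multiple neighbors: apply the pairwise rule repeatedly in an arbitrary order.
    if neighbor_Y_prev is None:
        return alpha * Y_prev + r_i
    return r_i + alpha * (Y_prev + neighbor_Y_prev) / 2.0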

85 In this context, the graph $\hat{E}$ contains an edge for each pair of neighbors in the sensor network, where the neighborhood relations are captured by N, as introduced in Section 4. [sent-222, score-0.546]

86 To optimize performance over time, each ith sensor would update its parameter values according to our stochastic approximation iteration (4). [sent-223, score-0.638]

87 To highlight the simplicity of this protocol, note that each sensor need only maintain and update a few numerical values. [sent-224, score-0.483]

88 Furthermore, the only communication required by the optimization protocol is that an extra scalar numerical value be transmitted and an extra scalar numerical value be received during the reception or transmission of any packet. [sent-225, score-0.868]

89 Here, at every time step, an observation arrives at a sensor with a 0.02 probability. [sent-227, score-0.495]

90 Each sensor maintains a queue of up to 20 observations. [sent-228, score-0.645]

91 Policy parameters θi1 and θi2 for each sensor i are constrained to lie in the interval [0. [sent-229, score-0.451]

92 (Note that for this set of parameters, the chance of a buffer overflow is very small, and hence did not occur in our simulations.) [sent-232, score-0.192]

93 A baseline policy is defined by having leaf nodes transmit with maximum probability, and interior nodes splitting their time roughly evenly between transmission and reception, when not forced to sleep by the power constraint. [sent-233, score-0.488]

94 Applying our decentralized optimization method to this example, it is clear in Figure 2 that the performance of the network is quickly and dramatically improved. [sent-234, score-0.331]

95 Further, the algorithm achieves qualitatively similar performance to gradient optimization using the centralized OLPOMDP method of [3, 4, 1], hence decentralization comes at no cost. [sent-236, score-0.239]

96 6 Remarks and Further Issues We are encouraged by the simplicity and scalability of the distributed optimization protocol we have presented. [sent-237, score-0.518]

97 We believe that this protocol represents both an interesting direction for practical applications involving networks of electronic devices and a significant step in the policy gradient literature. [sent-238, score-0.798]

98 However, there is an important outstanding issue that needs to be addressed to assess the potential of this approach: whether or not parameters can be adapted fast enough for this protocol to be useful in applications. [sent-239, score-0.356]

99 [Figure 2 residue: performance plotted against iteration, 0 to 10 ×10^6 iterations.] Two considerations are relevant to this issue: (1) variance of gradient estimates and (2) convergence rate of the underlying ODE. [sent-248, score-0.203]

100 Acknowledgements The authors thank Abbas El Gamal, Abtin Keshavarzian, Balaji Prabhakar, and Elif Uysal for stimulating conversations on sensor network models and applications. [sent-251, score-0.585]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sensor', 0.451), ('protocol', 0.356), ('policy', 0.205), ('ai', 0.179), ('packet', 0.174), ('buffer', 0.158), ('qs', 0.149), ('queue', 0.148), ('gradient', 0.138), ('network', 0.134), ('packets', 0.132), ('reward', 0.131), ('decentralized', 0.13), ('reception', 0.13), ('zi', 0.118), ('ws', 0.11), ('triggering', 0.108), ('distributed', 0.095), ('rewards', 0.092), ('communication', 0.09), ('ith', 0.089), ('transmit', 0.087), ('protocols', 0.087), ('ri', 0.084), ('wi', 0.081), ('transmission', 0.079), ('ji', 0.075), ('base', 0.075), ('eligibility', 0.075), ('olpomdp', 0.075), ('dropped', 0.073), ('sleep', 0.073), ('station', 0.069), ('yi', 0.068), ('optimization', 0.067), ('asynchronous', 0.066), ('convergence', 0.065), ('delay', 0.061), ('electronic', 0.058), ('rj', 0.058), ('component', 0.057), ('control', 0.056), ('stanford', 0.056), ('event', 0.056), ('action', 0.052), ('actions', 0.052), ('communicate', 0.052), ('sensors', 0.051), ('edge', 0.05), ('aggregates', 0.05), ('ciamac', 0.05), ('communicates', 0.05), ('limk', 0.05), ('marbach', 0.05), ('moallemi', 0.05), ('packetized', 0.05), ('rni', 0.05), ('stochastic', 0.05), ('lim', 0.049), ('iteration', 0.048), ('scalable', 0.047), ('context', 0.046), ('maintains', 0.046), ('appendix', 0.045), ('neighborhood', 0.045), ('time', 0.044), ('aperiodic', 0.043), ('baxter', 0.043), ('gathers', 0.043), ('ode', 0.043), ('spends', 0.043), ('assumption', 0.043), ('central', 0.041), ('involving', 0.041), ('scalar', 0.041), ('attempting', 0.04), ('communicated', 0.039), ('irreducible', 0.039), ('state', 0.039), ('rn', 0.039), ('nitely', 0.038), ('edges', 0.038), ('ordinary', 0.036), ('bartlett', 0.036), ('builds', 0.035), ('termination', 0.035), ('consumption', 0.035), ('transferred', 0.035), ('delays', 0.035), ('wireless', 0.035), ('policies', 0.035), ('establish', 0.035), ('hence', 0.034), ('updated', 0.034), ('theorem', 0.034), ('global', 0.033), ('occurring', 0.033), ('team', 0.033), ('events', 0.032), ('numerical', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 55 nips-2003-Distributed Optimization in Adaptive Networks

Author: Ciamac C. Moallemi, Benjamin V. Roy

Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1

2 0.17008182 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

3 0.16962156 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

4 0.15252694 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

5 0.1426466 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

Author: Cynthia Archer, Todd K. Leen, António M. Baptista

Abstract: As part of an environmental observation and forecasting system, sensors deployed in the Columbia RIver Estuary (CORIE) gather information on physical dynamics and changes in estuary habitat. Of these, salinity sensors are particularly susceptible to biofouling, which gradually degrades sensor response and corrupts critical data. Automatic fault detectors have the capability to identify bio-fouling early and minimize data loss. Complicating the development of discriminatory classifiers is the scarcity of bio-fouling onset examples and the variability of the bio-fouling signature. To solve these problems, we take a novelty detection approach that incorporates a parameterized bio-fouling model. These detectors identify the occurrence of bio-fouling, and its onset time as reliably as human experts. Real-time detectors installed during the summer of 2001 produced no false alarms, yet detected all episodes of sensor degradation before the field staff scheduled these sensors for cleaning. From this initial deployment through February 2003, our bio-fouling detectors have essentially doubled the amount of useful data coming from the CORIE sensors. 1

6 0.11762726 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

7 0.11496966 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

8 0.10966564 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

9 0.10089295 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

10 0.092686452 42 nips-2003-Bounded Finite State Controllers

11 0.091140643 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

12 0.088984743 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

13 0.081592493 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

14 0.078407943 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

15 0.077661462 62 nips-2003-Envelope-based Planning in Relational MDPs

16 0.075834796 68 nips-2003-Eye Movements for Reward Maximization

17 0.073811263 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron

18 0.073299617 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

19 0.072660439 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

20 0.071561351 155 nips-2003-Perspectives on Sparse Bayesian Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.233), (1, 0.221), (2, -0.083), (3, 0.003), (4, 0.0), (5, 0.041), (6, -0.024), (7, -0.106), (8, -0.049), (9, 0.007), (10, -0.029), (11, 0.016), (12, 0.07), (13, 0.022), (14, -0.103), (15, -0.002), (16, -0.093), (17, -0.058), (18, -0.02), (19, 0.044), (20, 0.077), (21, 0.046), (22, 0.047), (23, 0.143), (24, -0.143), (25, -0.129), (26, 0.039), (27, 0.058), (28, -0.092), (29, -0.033), (30, -0.099), (31, 0.069), (32, -0.065), (33, 0.198), (34, -0.082), (35, 0.09), (36, 0.046), (37, 0.144), (38, -0.008), (39, 0.052), (40, -0.056), (41, -0.005), (42, 0.004), (43, 0.105), (44, 0.032), (45, -0.081), (46, -0.045), (47, 0.027), (48, 0.029), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96950209 55 nips-2003-Distributed Optimization in Adaptive Networks

Author: Ciamac C. Moallemi, Benjamin V. Roy

Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1

2 0.68410772 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng

Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

3 0.67028326 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

Author: Cynthia Archer, Todd K. Leen, António M. Baptista

Abstract: As part of an environmental observation and forecasting system, sensors deployed in the Columbia RIver Estuary (CORIE) gather information on physical dynamics and changes in estuary habitat. Of these, salinity sensors are particularly susceptible to biofouling, which gradually degrades sensor response and corrupts critical data. Automatic fault detectors have the capability to identify bio-fouling early and minimize data loss. Complicating the development of discriminatory classifiers is the scarcity of bio-fouling onset examples and the variability of the bio-fouling signature. To solve these problems, we take a novelty detection approach that incorporates a parameterized bio-fouling model. These detectors identify the occurrence of bio-fouling, and its onset time as reliably as human experts. Real-time detectors installed during the summer of 2001 produced no false alarms, yet detected all episodes of sensor degradation before the field staff scheduled these sensors for cleaning. From this initial deployment through February 2003, our bio-fouling detectors have essentially doubled the amount of useful data coming from the CORIE sensors. 1

4 0.57732022 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

5 0.5452984 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

6 0.53449559 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

7 0.50420982 78 nips-2003-Gaussian Processes in Reinforcement Learning

8 0.48336184 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

9 0.46441728 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

10 0.44511947 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

11 0.43347478 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron

12 0.41265446 42 nips-2003-Bounded Finite State Controllers

13 0.3919321 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

14 0.39009646 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

15 0.38594323 187 nips-2003-Training a Quantum Neural Network

16 0.3840889 68 nips-2003-Eye Movements for Reward Maximization

17 0.35742185 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

18 0.35394865 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

19 0.34606677 102 nips-2003-Large Scale Online Learning

20 0.33628219 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.047), (2, 0.266), (11, 0.037), (29, 0.012), (30, 0.016), (35, 0.055), (49, 0.011), (53, 0.11), (69, 0.02), (71, 0.053), (76, 0.049), (85, 0.072), (91, 0.139), (99, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84496754 55 nips-2003-Distributed Optimization in Adaptive Networks

Author: Ciamac C. Moallemi, Benjamin V. Roy

Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective. 1

2 0.62335432 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

3 0.62233377 107 nips-2003-Learning Spectral Clustering

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1

4 0.61598599 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

Author: Maneesh Sahani

Abstract: Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a significant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1

5 0.61475444 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

6 0.61385816 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

7 0.61253947 113 nips-2003-Learning with Local and Global Consistency

8 0.61233974 30 nips-2003-Approximability of Probability Distributions

9 0.60994786 81 nips-2003-Geometric Analysis of Constrained Curves

10 0.60945612 68 nips-2003-Eye Movements for Reward Maximization

11 0.60871142 73 nips-2003-Feature Selection in Clustering Problems

12 0.60686564 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

13 0.60602915 158 nips-2003-Policy Search by Dynamic Programming

14 0.60546076 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

15 0.60493505 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

16 0.60456741 126 nips-2003-Measure Based Regularization

17 0.6041109 143 nips-2003-On the Dynamics of Boosting

18 0.60343504 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

19 0.60332519 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

20 0.60100818 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms