nips nips2001 nips2001-177 knowledge-graph by maker-knowledge-mining

177 nips-2001-Switch Packet Arbitration via Queue-Learning


Source: pdf

Author: Timothy X. Brown

Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In packet switches, packets queue at switch inputs and contend for outputs. [sent-2, score-1.519]

2 The best policy depends on the current state of the switch and current traffic patterns. [sent-4, score-0.301]

3 This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. [sent-5, score-0.114]

4 We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection. [sent-6, score-0.147]

5 This paper focuses on packet arbitration for data packet switches. [sent-13, score-1.457]

6 Packet switches are unlike telephone circuit switches in that packet transmissions are uncoordinated and clusters of traffic can simultaneously contend for switch resources. [sent-14, score-0.977]

7 A packet arbitrator decides the order packets are sent through the switch in order to minimize packet queueing delays and the switch resources needed. [sent-15, score-2.456]

8 Switch performance depends on the arbitration policy and the pattern of traffic entering the switch. [sent-16, score-0.205]

9 A number of packet arbitration strategies have been developed for switches. [sent-17, score-0.803]

10 Many have fixed policies for sending packets that do not depend on the actual patterns of traffic in the network [10]. [sent-18, score-0.67]

11 Theoretical work has shown that considering future packet arrivals can have a significant impact on switch performance, but doing so is computationally intractable (NP-Hard) [4]. [sent-20, score-0.99]

12 As we will show, a dynamic arbitration policy is difficult since the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. [sent-21, score-0.306]

13 In this paper, we consider the problem of finding an arbitration policy that dynamically and efficiently adapts to traffic conditions. [sent-22, score-0.231]

14 Figure 1: The packet arbitration model. [sent-31, score-0.803]

15 (a) In each time slot, packet sources generate packets at input i destined for output j at a given average rate. [sent-32, score-1.347]

16 (b) Packets arrive at an input-queued switch and are stored in queues. [sent-33, score-0.204]

17 The number label on each packet indicates to which output the packet is destined. [sent-34, score-1.356]

18 (c) The corresponding queue states, where each entry indicates the number of packets waiting at an input destined for a given output. [sent-35, score-0.903]

19 2 Problem Description The problem is comprised of traffic sources generating traffic at each of the inputs to a packet data switch, as shown in Figure 1. [sent-38, score-0.886]

20 Time is divided into discrete time slots and in each time slot each source generates 0 or 1 packets. [sent-39, score-0.25]

21 Each packet that arrives at an input is labeled with the output for which it is headed. [sent-40, score-1.378]

22 In every time slot, the switch takes packets from inputs and delivers them to their intended outputs. [sent-41, score-0.812]

23 We describe the specific models for each used in this paper and then state the packet arbitration problem. [sent-42, score-0.84]

24 At input i, a traffic source generates a packet destined for output j with a given probability at the beginning of each time slot. [sent-46, score-0.869]

25 If the row sum is the load on input i and the column sum is the load on output j, then for stability we require each such load to be less than one. [sent-47, score-0.285]

26 The load matrix only represents long term average loads between inputs and outputs. [sent-48, score-0.15]

27 We treat the case where packet arrivals are uncorrelated over time and between sources, so that in each time slot a packet arrives at input i with some probability and, given that we have an arrival, it is destined for output j with some conditional probability. [sent-49, score-1.719]
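
As a concrete illustration of this arrival model, the sketch below simulates one time slot of Bernoulli arrivals; the function and parameter names (sample_arrivals, p_in, p_out_given_in) are hypothetical, not from the paper.

```python
import numpy as np

def sample_arrivals(p_in, p_out_given_in, rng):
    """One slot of the assumed traffic model: input i receives a packet with
    probability p_in[i]; given an arrival, its output j is drawn from the
    conditional distribution p_out_given_in[i, :]."""
    n_in, n_out = p_out_given_in.shape
    arrivals = np.zeros((n_in, n_out), dtype=int)
    for i in range(n_in):
        if rng.random() < p_in[i]:
            j = rng.choice(n_out, p=p_out_given_in[i])
            arrivals[i, j] += 1
    return arrivals

# Example: a 4x4 switch with load 0.8 per input, uniform over outputs.
rng = np.random.default_rng(0)
arrivals = sample_arrivals(np.full(4, 0.8), np.full((4, 4), 0.25), rng)
```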

28 2 The Switch The switch alternates between accepting newly arriving packets and sending packets in every time slot. [sent-52, score-1.42]

29 At the start of the time slot the switch sends packets waiting in the input queues and delivers them to the correct output where they are sent on. [sent-53, score-1.332]

30 Let a 0-1 matrix represent the set of packets sent, with an entry equal to one if a packet is sent from the corresponding input to the corresponding output and zero otherwise. [sent-54, score-1.531]

31 The packets it can send are limited by the input and output constraints: the switch can send at most one packet per input and can deliver at most one packet to each output. [sent-55, score-2.446]

32 After sending packets, the new arrivals are added at the input and the switch moves to the next time slot. [sent-56, score-0.48]

33 3 The Input Queues Because the traffic sources are uncoordinated, it is possible for multiple packets to arrive in one time slot at different inputs but destined for the same output. [sent-59, score-0.871]

34 Because of the output constraint, only one such packet may be sent and the others must be buffered in queues, one queue per input. [sent-60, score-0.926]

35 Thus packet queueing is unavoidable and the goal is to limit the delays due to queueing. [sent-61, score-0.726]

36 The queues are random access which means packets can be sent in any order from a queue. [sent-62, score-0.748]

37 For the purposes of this paper, all packets waiting at an input and destined for the same output are considered equivalent. [sent-63, score-0.813]

38 Let the queue-state matrix be defined so that each entry is the number of packets waiting at input i for output j, as shown in Figure 1c. [sent-64, score-0.721]

39 4 Packet Arbitration The packet arbitration problem is: given the state of the input queues, choose a set of packets to send so that at most one packet is sent from each input and at most one packet is delivered to each output. [sent-66, score-2.917]

40 We want a packet arbitration policy that minimizes the expected packet wait time. [sent-67, score-1.645]

41 When the selected packets are sent, the remaining packets must wait at least one more time slot before they can be sent. [sent-68, score-1.559]

42 Let one quantity be the total number of packets in all the input queues, another the number of new arrivals, and a third the number of packets sent. [sent-69, score-0.548]

43 By Little’s theorem, the expected wait time is proportional to the expected number of packets waiting in each time slot [10]; each slot, the total occupancy is increased by the number of packets that remain. [sent-71, score-1.513]
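
A minimal sketch of this per-slot bookkeeping, tracking only the total occupancy; the name step_queue_total is an assumption for illustration.

```python
def step_queue_total(q_total, n_arrivals, n_sent):
    """Update the total queue occupancy for one slot (q <- q + a - s).
    By Little's theorem, the time-average of this occupancy is proportional
    to the mean packet wait, so the packets left waiting serve as the cost."""
    q_total = q_total + n_arrivals - n_sent
    return q_total, q_total  # (new occupancy, per-slot cost)
```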

44 The input and output constraints are met if the send matrix is a subset of a permutation matrix (zeros everywhere except that every row has at most one one and every column has at most one one). [sent-75, score-0.174]
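
A sketch of this feasibility check on a candidate 0-1 send matrix, given the queue-state matrix; the names are illustrative, not from the paper.

```python
import numpy as np

def is_valid_send(send, queues):
    """A send matrix is feasible if it is 0-1, has at most one 1 per row
    (input constraint), at most one 1 per column (output constraint), and
    only sends from non-empty subqueues."""
    send = np.asarray(send)
    return bool(
        ((send == 0) | (send == 1)).all()
        and (send.sum(axis=1) <= 1).all()
        and (send.sum(axis=0) <= 1).all()
        and (send <= (np.asarray(queues) > 0)).all()
    )
```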

45 In each time slot at each input, a packet can arrive for one of the outputs or not arrive at all. [sent-77, score-0.862]

46 Traditionally, switching solves these problems by not considering the possible next arrivals and using a search algorithm, with time-complexity polynomial in the switch size, that considers only the current state. [sent-83, score-0.12]

47 For instance, the problem can be formulated as a so-called matching problem, and polynomial algorithms exist that will send the largest possible number of packets [2, 6, 8]. [sent-84, score-0.233]
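
For illustration only, such a maximum-cardinality send can be computed with an off-the-shelf assignment solver; this is a stand-in for the matching algorithms cited in [2, 6, 8], not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_send(queues):
    """Send as many packets as possible this slot: treat each non-empty
    subqueue as weight 1 and solve a max-weight assignment, which here is
    a maximum bipartite matching between inputs and outputs."""
    weight = (np.asarray(queues) > 0).astype(float)
    rows, cols = linear_sum_assignment(weight, maximize=True)
    send = np.zeros_like(weight, dtype=int)
    for i, j in zip(rows, cols):
        if weight[i, j] > 0:      # drop matched pairs with nothing to send
            send[i, j] = 1
    return send
```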

48 While maximizing the number of packets sent in every time slot may seem like a solution, the problem is more interesting than this. [sent-85, score-0.873]

49 In general, many possible send sets will maximize the number of packets that are sent. [sent-86, score-0.56]

50 Which one can we send now so that we will be in the best possible state for future time slots? [sent-87, score-0.217]

51 Further, it can be shown that to minimize the total wait it may be necessary to send less than the maximum number of packets in the current time slot [4]. [sent-89, score-0.988]

52 So, we look to a solution that efficiently finds policies that minimize the total wait by adapting to the current traffic pattern. [sent-90, score-0.155]

53 (1) Packet rates are fast, up to millions of packets per second so that many training examples are available. [sent-92, score-0.564]

54 (2) Suboptimal decisions only increase packet delays somewhat, and so it is possible to learn freely in an online system. [sent-94, score-0.723]

55 New packets arrive and the packet arbitrator can choose to send any valid set of packets. [sent-97, score-0.849]

56 The task of the learner is to determine a packet arbitration policy that minimizes the total average cost. [sent-99, score-0.885]

57 We use the Tauberian approximation, that is, we assume the discount factor is close enough to 1 so that the discounted reward policy is equivalent to the average reward policy [5]. [sent-100, score-0.156]

58 Since minimizing the expected value of this cost is equivalent to minimizing the expected wait time, this formulation provides an exact match between RL and the problem task. [sent-101, score-0.174]

59 A single time slot consists of two stages: new arrivals are added to the queues and then packets are sent (see Figure 2). [sent-110, score-1.053]

60 We compute the value function after packets are sent, since we can use the notion of afterstates to choose the action. [sent-112, score-0.726]

61 Since the packet sending process is deterministic, we know the state following the send action. [sent-113, score-0.911]
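
A hedged sketch of the resulting learning step: a generic TD(0) update on a linear afterstate value function. The feature and parameter names are assumptions; the paper's exact update, and its decomposition across subqueues, differs in detail.

```python
import numpy as np

def td_afterstate_update(theta, phi_after, phi_next_after, cost, gamma, alpha):
    """One TD(0) step on V(x) = theta . phi(x), where x is the afterstate
    (the queue state just after packets are sent).  The next afterstate is
    reached by adding the random arrivals and applying the chosen send."""
    td_error = cost + gamma * np.dot(theta, phi_next_after) - np.dot(theta, phi_after)
    return theta + alpha * td_error * phi_after
```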

62 With afterstates, the action (which set of packets to send) depends on both the current state and the event. [sent-118, score-0.64]

63 The best action is the one that results in the lowest value function in the next state (which is known deterministically given the current state, the arrivals, and the action). [sent-119, score-0.097]

64 2 Choosing the Action We compare every action with the action of not sending any packets. [sent-122, score-0.196]

65 The best action is the set of packets meeting the input and output constraints that will reduce the value function the most compared to not sending any packets. [sent-123, score-0.752]

66 Packets in a subqueue contend with other packets at the same input and with other packets destined for the same output. [sent-125, score-1.311]

67 If we send a packet from a given subqueue, then no other packet at the same input or output will be sent. [sent-126, score-1.528]

68 Packets in a given subqueue interact primarily with packets in the same row and column. [sent-128, score-0.579]

69 Packets in other rows and columns only have an indirect effect on the value of sending a packet from that subqueue. [sent-129, score-0.765]

70 Let the subqueue counts be the number of packets in every subqueue after arrivals in the current state and before the decision. [sent-131, score-0.768]

71 For each subqueue, let its weight be the reduction in the value function if one packet is sent from that subqueue (zero if the subqueue is empty). [sent-132, score-0.858]

72 3 Decomposing the Value Function The interaction between queues in the same row or the same column is captured primarily by the input and output constraints. [sent-136, score-0.22]

73 Every substate estimates its associated value function based on the packets at its input and the packets destined for its output. [sent-140, score-1.299]

74 Let the row total be the total number of packets waiting at an input. [sent-143, score-0.688]

75 Let the column total be the total number of packets waiting for an output. [sent-144, score-0.691]

76 With these variables we define a linear approximation, with learned parameters, for the value of sending a packet (compared to not sending a packet) from each subqueue (Equation 1). [sent-162, score-0.858]
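
An illustrative version of such a linear per-subqueue value, using the subqueue occupancy and its row and column totals as features; the exact feature set and symbols of Equation (1) are not reproduced here, so treat this as an assumption-laden sketch.

```python
import numpy as np

def subqueue_features(queues, i, j):
    """Illustrative feature vector for subqueue (i, j): a bias term, its own
    occupancy, the total at its input (row), and the total destined for its
    output (column)."""
    row_total = queues[i, :].sum()
    col_total = queues[:, j].sum()
    return np.array([1.0, queues[i, j], row_total, col_total])

def value_of_sending(theta, queues, i, j):
    """Linear estimate of how much the value function drops if one packet
    is sent from subqueue (i, j); zero if the subqueue is empty."""
    if queues[i, j] == 0:
        return 0.0
    return float(np.dot(theta, subqueue_features(queues, i, j)))
```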

77 No explicit exploration is used since, from the perspective of each subqueue, enough stochasticity already exists in the packet arrival and send processes. [sent-167, score-0.806]

78 To assist the switch early in the learning, the switch sends the packets from a maximum matching in each time slot (instead of the packets selected by queue learning). [sent-168, score-1.82]

79 This initial assist period during the training was found to bring the switch into a good operating regime from which it could learn a better policy. [sent-169, score-0.252]

80 Each substate computes the value of sending a packet versus not sending a packet, and a polynomial algorithm computes the action that maximizes the total value across substates subject to the input and output constraints. [sent-171, score-1.049]
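
A sketch of that polynomial-time action selection: given the per-subqueue value reductions (zero for empty subqueues), a max-weight assignment respects the input/output constraints while maximizing the total reduction. The use of scipy's assignment solver is an illustrative choice, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def choose_action(weights):
    """Pick the send set that maximizes the total value reduction subject
    to the one-per-input / one-per-output constraints."""
    weights = np.asarray(weights, dtype=float)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    send = np.zeros_like(weights, dtype=int)
    for i, j in zip(rows, cols):
        if weights[i, j] > 0:   # sending must strictly help; otherwise hold
            send[i, j] = 1
    return send
```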

81 For learning, the number of floating point operations per time slot is approximately proportional to the number of parameters in the linear approximation; the packet rate is 366k time slots/s, or less than 30 sec for the training period. [sent-174, score-0.221]

82 At the above packet rate, for such a switch, this translates into 650 MFLOPS, which is within existing high-end microprocessor capacity. [sent-175, score-0.654]

83 For computation of the packets to send, the cost is dominated by computing the weights. [sent-176, score-0.573]

84 To compute the packets to send, the decomposition has a natural parallel implementation that can divide it among processors. [sent-180, score-0.574]

85 In each experiment, the queue-learning was trained for an initial period, and then the mean wait time was measured over a test period. [sent-184, score-0.11]

86 One alternative sends the largest number of packets in every time slot. [sent-186, score-0.649]

87 We simulate this arbitrator and measure its mean packet wait time. [sent-188, score-0.81]

88 The best possible switch is a so-called output-queued switch [10]. [sent-189, score-0.376]

89 Such a switch is difficult to build at high speeds, but we can compute its mean packet wait time via simulation. [sent-190, score-0.96]

90 Thus if our queue-learning solution is no better than a max send arbitrator, the gain will be 0 and if we achieve the performance of the output-queued switch, the gain will be 1. [sent-192, score-0.149]
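
In other words, the reported gain is the normalized position of queue-learning between the two baselines; a one-line sketch with hypothetical argument names:

```python
def gain(w_ql, w_max_send, w_oq):
    """0 when queue-learning merely matches the max-send arbitrator,
    1 when it matches the output-queued switch's mean wait time."""
    return (w_max_send - w_ql) / (w_max_send - w_oq)
```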

91 The first traffic pattern is a uniform load of packets per input per time slot, with each packet uniformly destined for one of the outputs. [sent-195, score-1.658]

92 The uniform load is a common baseline scenario for evaluating switches. [sent-197, score-0.112]

93 The other traffic patterns are random load matrices where the sums of loads per row and per column are 0.99. [sent-198, score-0.105]

94 [Simulation parameter table: Discount, Learn Rate, Assist Period, Train Period, Test Period (the periods in time slots).] Table 2: Simulation Results. [sent-203, score-0.225]

95 The random load is more realistic in that loads tend to vary among the different input/output pairs. [sent-213, score-0.153]

96 6 Conclusion This paper showed that queue learning is able to learn a policy that significantly reduces the wait times of packets in a high-speed switch. [sent-220, score-0.819]

97 This is able to gain 10% to 84% of the possible reductions in wait times. [sent-222, score-0.133]

98 The gains are also largest when the switch load is least uniform, which is what is most likely to be encountered in practice. [sent-224, score-0.322]

99 The queue-learning approach uses its estimates of the future impact of its packet send decisions in a consistent framework that bridges the majority of the gap between current input queueing and optimal output queueing. [sent-226, score-0.967]

100 , “Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning,” in Advances NIPS 11, ed. [sent-250, score-0.101]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('packet', 0.654), ('packets', 0.548), ('switch', 0.182), ('arbitration', 0.149), ('slot', 0.145), ('traf', 0.136), ('arrivals', 0.13), ('send', 0.127), ('sent', 0.118), ('wait', 0.11), ('load', 0.096), ('sending', 0.093), ('destined', 0.092), ('queue', 0.09), ('queues', 0.082), ('waiting', 0.08), ('loads', 0.057), ('policy', 0.056), ('output', 0.048), ('switches', 0.047), ('rl', 0.046), ('afterstates', 0.046), ('arbitrator', 0.046), ('slots', 0.045), ('input', 0.045), ('action', 0.042), ('queueing', 0.04), ('reinforcement', 0.039), ('state', 0.037), ('sends', 0.036), ('switching', 0.035), ('admission', 0.034), ('atm', 0.034), ('subqueue', 0.034), ('delays', 0.032), ('time', 0.03), ('assist', 0.03), ('contend', 0.03), ('routing', 0.03), ('matching', 0.029), ('actions', 0.029), ('period', 0.025), ('arrival', 0.025), ('brown', 0.024), ('polynomial', 0.023), ('arrive', 0.022), ('allocation', 0.022), ('sources', 0.022), ('transitions', 0.019), ('every', 0.019), ('row', 0.018), ('discount', 0.018), ('delivers', 0.018), ('optical', 0.018), ('value', 0.018), ('telephone', 0.017), ('decomposes', 0.017), ('service', 0.017), ('policies', 0.017), ('uniform', 0.016), ('decisions', 0.016), ('largest', 0.016), ('per', 0.016), ('ef', 0.016), ('learn', 0.015), ('inputs', 0.015), ('total', 0.015), ('telecommunications', 0.015), ('arrives', 0.014), ('decomposing', 0.014), ('td', 0.014), ('column', 0.014), ('compute', 0.014), ('problem', 0.013), ('current', 0.013), ('reward', 0.013), ('primarily', 0.013), ('singh', 0.013), ('impact', 0.013), ('dynamically', 0.013), ('decomposition', 0.012), ('gains', 0.012), ('possible', 0.012), ('network', 0.012), ('exponentially', 0.012), ('channel', 0.012), ('nips', 0.012), ('grow', 0.011), ('gain', 0.011), ('future', 0.011), ('outputs', 0.011), ('aspect', 0.011), ('permutation', 0.011), ('expected', 0.011), ('networks', 0.011), ('states', 0.011), ('cost', 0.011), ('empty', 0.011), ('minimizes', 0.011), ('online', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 177 nips-2001-Switch Packet Arbitration via Queue-Learning

Author: Timothy X. Brown

Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.

2 0.056633621 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

Author: Michael Zibulevsky, Pavel Kisilev, Yehoshua Y. Zeevi, Barak A. Pearlmutter

Abstract: We consider a problem of blind source separation from a set of instantaneous linear mixtures, where the mixing matrix is unknown. It was discovered recently, that exploiting the sparsity of sources in an appropriate representation according to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multi scale transforms, such as wavelet or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property for selecting the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified on noise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation quality over previously reported results. 1

3 0.053075731 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state-action value function. Theory is presented showing that linear function approximation representations of the value function can degrade the rate of convergence of performance gradient estimates, relative to when no function approximation is used, by a factor that depends on the number of possible actions and the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state-action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by a factor related to the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

4 0.051905654 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.051715732 40 nips-2001-Batch Value Function Approximation via Support Vectors

Author: Thomas G. Dietterich, Xin Wang

Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1

6 0.051313862 121 nips-2001-Model-Free Least-Squares Policy Iteration

7 0.047676895 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

8 0.046953779 59 nips-2001-Direct value-approximation for factored MDPs

9 0.046467502 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

10 0.045602467 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.043416474 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

12 0.039896347 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

13 0.035823159 36 nips-2001-Approximate Dynamic Programming via Linear Programming

14 0.035498247 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

15 0.035263568 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

16 0.033496276 126 nips-2001-Motivated Reinforcement Learning

17 0.031148268 93 nips-2001-Incremental A*

18 0.027846495 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

19 0.027427984 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

20 0.026906919 148 nips-2001-Predictive Representations of State


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.081), (1, -0.051), (2, 0.096), (3, -0.006), (4, 0.003), (5, 0.036), (6, 0.002), (7, -0.008), (8, 0.005), (9, -0.048), (10, -0.008), (11, -0.041), (12, 0.017), (13, 0.013), (14, -0.003), (15, 0.004), (16, 0.027), (17, -0.017), (18, 0.035), (19, -0.001), (20, -0.001), (21, 0.021), (22, -0.019), (23, 0.003), (24, -0.033), (25, 0.01), (26, 0.021), (27, -0.004), (28, 0.003), (29, 0.011), (30, 0.056), (31, -0.023), (32, -0.037), (33, -0.04), (34, -0.031), (35, 0.029), (36, -0.02), (37, 0.096), (38, -0.083), (39, -0.002), (40, 0.03), (41, 0.034), (42, 0.013), (43, -0.002), (44, -0.049), (45, -0.131), (46, -0.073), (47, 0.062), (48, -0.018), (49, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88222861 177 nips-2001-Switch Packet Arbitration via Queue-Learning

Author: Timothy X. Brown

Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.

2 0.49585336 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

Author: Bram Bakker

Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1

3 0.46643141 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning. 1

4 0.44780207 128 nips-2001-Multiagent Planning with Factored MDPs

Author: Carlos Guestrin, Daphne Koller, Ronald Parr

Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.

5 0.44226077 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

6 0.44151536 40 nips-2001-Batch Value Function Approximation via Support Vectors

7 0.43507826 59 nips-2001-Direct value-approximation for factored MDPs

8 0.40075758 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

9 0.39991903 36 nips-2001-Approximate Dynamic Programming via Linear Programming

10 0.39403352 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

11 0.38196817 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

12 0.37972769 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

13 0.37636802 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

14 0.36111608 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

15 0.35538879 93 nips-2001-Incremental A*

16 0.34638748 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network

17 0.34172377 126 nips-2001-Motivated Reinforcement Learning

18 0.33280861 13 nips-2001-A Natural Policy Gradient

19 0.33108786 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

20 0.32474867 108 nips-2001-Learning Body Pose via Specialized Maps


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.025), (17, 0.03), (19, 0.028), (27, 0.058), (30, 0.055), (38, 0.029), (59, 0.02), (61, 0.411), (72, 0.042), (79, 0.039), (83, 0.027), (91, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77940786 177 nips-2001-Switch Packet Arbitration via Queue-Learning

Author: Timothy X. Brown

Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.

2 0.73355484 145 nips-2001-Perceptual Metamers in Stereoscopic Vision

Author: B. T. Backus

Abstract: Theories of cue combination suggest the possibility of constructing visual stimuli that evoke different patterns of neural activity in sensory areas of the brain, but that cannot be distinguished by any behavioral measure of perception. Such stimuli, if they exist, would be interesting for two reasons. First, one could know that none of the differences between the stimuli survive past the computations used to build the percepts. Second, it can be difficult to distinguish stimulus-driven components of measured neural activity from top-down components (such as those due to the interestingness of the stimuli). Changing the stimulus without changing the percept could be exploited to measure the stimulusdriven activity. Here we describe stimuli in which vertical and horizontal disparities trade during the construction of percepts of slanted surfaces, yielding stimulus equivalence classes. Equivalence class membership changed after a change of vergence eye posture alone, without changes to the retinal images. A formal correspondence can be drawn between these “perceptual metamers” and more familiar “sensory metamers” such as color metamers. 1

3 0.63209999 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

Author: Igor V. Cadez, Padhraic Smyth

Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1

4 0.33599213 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

Author: Bram Bakker

Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1

5 0.3358224 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

6 0.33270973 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.3319751 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

8 0.33158606 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

9 0.33085129 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

10 0.33075082 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

11 0.33038628 183 nips-2001-The Infinite Hidden Markov Model

12 0.32956707 169 nips-2001-Small-World Phenomena and the Dynamics of Information

13 0.32956445 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

14 0.32951635 182 nips-2001-The Fidelity of Local Ordinal Encoding

15 0.32945055 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

16 0.32935995 56 nips-2001-Convolution Kernels for Natural Language

17 0.32931274 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

18 0.32893986 121 nips-2001-Model-Free Least-Squares Policy Iteration

19 0.3288894 89 nips-2001-Grouping with Bias

20 0.32853776 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes