nips nips2011 nips2011-300 knowledge-graph by maker-knowledge-mining

300 nips-2011-Variance Reduction in Monte-Carlo Tree Search


Source: pdf

Author: Joel Veness, Marc Lanctot, Michael Bowling

Abstract: Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can’t Stop and Dominion. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. [sent-10, score-0.234]

2 In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. [sent-11, score-0.965]

3 These value estimates are used to direct the growth of the search tree and to estimate the value under the optimal policy from each internal node. [sent-15, score-0.252]

4 Somewhat surprisingly however, the application of classical variance reduction techniques to MCTS has remained unexplored. [sent-19, score-0.234]

5 In this paper we survey some common variance reduction ideas and show how they can be used to improve the efficiency of MCTS. [sent-20, score-0.256]

6 We found that substantial increases in performance can be obtained by using the appropriate combination of variance reduction techniques. [sent-22, score-0.197]

7 To the best of our knowledge, our work constitutes the first investigation of classical variance reduction techniques in the context of MCTS. [sent-23, score-0.234]

8 The transition probability kernel gives rise to the state transition kernel P(s, a, s′) := P0({s′} × R | s, a), which gives the probability of transitioning from state s to state s′ if action a is taken in s. [sent-32, score-0.283]

9 , A_{n−1}, S_n, R_n describing the execution of the system up to time n; from a state s_t, the return from s_t is defined as X_{s_t} := Σ_{i=t+1}^{n} R_i. [sent-40, score-0.373]

10 An optimal policy, denoted by π ∗ , is a policy that maximizes the expected return E [Xst ] for all states st ∈ S. [sent-42, score-0.279]

11 2 Online Monte-Carlo Planning in MDPs If the state space is small, an optimal action can be computed offline for each state using techniques such as exhaustive Expectimax Search [18] or Q-Learning [23]. [sent-45, score-0.296]

12 This effectively amortizes the planning effort across multiple time steps, and implicitly focuses the approximation effort on the relevant parts of the state space. [sent-49, score-0.284]

13 A popular way to construct an online planning algorithm is to use a depth-limited version of an exhaustive search technique (such as Expectimax Search) in conjunction with iterative deepening [18]. [sent-50, score-0.321]

14 Depth-limited exhaustive search is generally outperformed by Monte-Carlo planning techniques in these situations. [sent-54, score-0.275]

15 A canonical example of online Monte-Carlo planning is 1-ply rollout-based planning [3]. [sent-55, score-0.325]

16 At each time t < n, given a starting state s_t, for each a_t ∈ A and with t < i < n, E[X_{s_t,a_t} | A_i ∼ π(· | S_i)] is estimated by generating trajectories S_{t+1}, R_{t+1}, … [sent-57, score-0.248]
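
A minimal Python sketch of this 1-ply rollout-based estimate, assuming a generic simulator interface (the names sim.step, sim.actions, sim.is_terminal and policy are illustrative, not from the paper):

    def rollout_value(sim, state, action, policy, horizon, num_trajectories=100):
        """Estimate E[X_{s_t,a_t} | A_i ~ pi] by averaging returns of sampled trajectories."""
        total = 0.0
        for _ in range(num_trajectories):
            s, r = sim.step(state, action)        # sample S_{t+1}, R_{t+1}
            ret, steps = r, 1
            while steps < horizon and not sim.is_terminal(s):
                s, r = sim.step(s, policy(s))     # follow the default policy pi(. | s)
                ret += r
                steps += 1
            total += ret
        return total / num_trajectories

    def one_ply_rollout_plan(sim, state, policy, horizon, num_trajectories=100):
        """1-ply rollout-based planning: pick the action with the highest estimated return."""
        return max(sim.actions(state),
                   key=lambda a: rollout_value(sim, state, a, policy, horizon, num_trajectories))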

17 One of the main advantages of rollout based planning compared with exhaustive depth-limited search is that a much larger search horizon can be used. [sent-65, score-0.371]

18 The disadvantage however is that if π is suboptimal, then E [Xst ,a | Ai ∼ π(· | Si )] < E [Xst ,a | Ai ∼ π ∗ (· | Si )] for at least one state-action pair (st , a) ∈ S × A, which implies that at least some value estimates constructed by 1-ply rollout-based planning are biased. [sent-66, score-0.215]

19 Monte-Carlo Tree Search algorithms improve on this procedure, by providing a means to construct asymptotically consistent estimates of the return under the optimal policy from simulation trajectories. [sent-69, score-0.25]

20 Like rollout-based planning, it uses a default policy to generate trajectories of agent-system interaction. [sent-71, score-0.209]

21 Initially, the search tree consists of a single node, which represents the current state s_t at time t. [sent-73, score-0.31]

22 Associated with each state-action pair (s, a) ∈ S × A is an estimate X̄^m_{s,a} of the return under the optimal policy and a count T^m_{s,a} ∈ N representing the number of times this state-action pair has been visited after m simulations, with T^0_{s,a} := 0 and X̄^0_{s,a} := 0. [sent-76, score-0.224]

23 Selection involves traversing a path from the root node to a leaf node in the following manner: for each non-leaf, internal node representing some state s on this path, the UCB [1] criterion is applied to select an action until a leaf node corresponding to state s_l is reached. [sent-78, score-0.413]

24 Provided s_l is non-terminal, the expansion phase is then executed, by selecting an action A_l ∼ π(· | s_l), observing a successor state S_{l+1} = s_{l+1}, and then adding a node to the search tree so that T_{m+1} = T_m ∪ {s_{l+1}}. [sent-82, score-0.445]

25 Notice that for all (s, a) ∈ S × A, the value estimate X̄^m_{s,a} corresponds to the average return of the realized simulation trajectories passing through state-action pair (s, a). [sent-90, score-0.265]

26 After the desired number of simulations k has been performed in state s_t, the action with the highest expected return a_t := argmax_{a∈A} X̄^k_{s_t,a} is selected. [sent-91, score-0.415]
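
The phases just described (UCB selection, expansion, rollout with the default policy, and backup of running means and counts) can be condensed into a simplified Python sketch; the node fields, the simulator interface and the exploration constant c are illustrative assumptions, and states are assumed hashable:

    import math
    import random

    class Node:
        def __init__(self):
            self.T = {}         # T[a]: visit count T^m_{s,a}
            self.X = {}         # X[a]: mean return estimate Xbar^m_{s,a}
            self.children = {}  # children[(a, s_next)] -> Node

    def ucb_action(node, actions, c=1.0):
        untried = [a for a in actions if node.T.get(a, 0) == 0]
        if untried:
            return random.choice(untried)
        total = sum(node.T.values())
        return max(actions,
                   key=lambda a: node.X[a] + c * math.sqrt(math.log(total) / node.T[a]))

    def rollout(sim, state, policy, depth):
        ret = 0.0
        while depth > 0 and not sim.is_terminal(state):
            state, r = sim.step(state, policy(state))
            ret += r
            depth -= 1
        return ret

    def simulate(sim, node, state, policy, depth):
        if depth == 0 or sim.is_terminal(state):
            return 0.0
        a = ucb_action(node, sim.actions(state))            # selection
        next_state, r = sim.step(state, a)
        key = (a, next_state)
        if key not in node.children:
            node.children[key] = Node()                     # expansion
            ret = r + rollout(sim, next_state, policy, depth - 1)
        else:
            ret = r + simulate(sim, node.children[key], next_state, policy, depth - 1)
        node.T[a] = node.T.get(a, 0) + 1                    # backup: count and running mean
        node.X[a] = node.X.get(a, 0.0) + (ret - node.X.get(a, 0.0)) / node.T[a]
        return ret

    def uct_plan(sim, root_state, policy, depth, num_simulations):
        root = Node()
        for _ in range(num_simulations):
            simulate(sim, root, root_state, policy, depth)
        return max(root.X, key=root.X.get)                  # a_t := argmax_a Xbar^k_{s_t,a}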

27 The next section describes how the accuracy of UCT’s value estimates can be improved by adapting classical variance reduction techniques to MCTS. [sent-95, score-0.27]

28 3 Variance Reduction in MCTS This section describes how three variance reduction techniques — control variates, common random numbers and antithetic variates — can be applied to the UCT algorithm. [sent-96, score-0.998]

29 Each subsection begins with a short overview of each variance reduction technique, followed by a description of how UCT can be modified to efficiently incorporate it. [sent-97, score-0.197]

30 Whilst we restrict our attention to planning in MDPs using the UCT algorithm, the ideas and techniques we present are quite general. [sent-98, score-0.212]

31 For example, similar modifications could be made to the Sparse Sampling [14] or AMS [5] algorithms for planning in MDPs, or to the POMCP algorithm [22] for planning in POMDPs. [sent-99, score-0.29]

32 , (X_n, Y_n) and setting c = c∗, the control variate enhanced estimator X̄_cv := (1/n) Σ_{i=1}^{n} [X_i + c∗(Y_i − µ_Y)]  (2)  is obtained, with variance (1/n) (Var[X] − Cov[X, Y]²/Var[Y]). [sent-113, score-0.262]

33 Thus the total variance reduction is dependent on the strength of correlation between X and Y. [sent-114, score-0.197]

34 For the optimal value of c, the variance reduction obtained by using Z in place of X is 100 × Corr[X, Y]² percent. [sent-115, score-0.197]
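
A small synthetic Python illustration of the estimator in Equation 2, plugging in a sample estimate of the optimal coefficient c∗ = −Cov[X, Y]/Var[Y]; the data are artificial and only meant to show the mechanics:

    import numpy as np

    def control_variate_estimate(x, y, mu_y):
        """Xbar_cv = (1/n) * sum_i [x_i + c*(y_i - mu_y)], with c* estimated from the samples."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        c_star = -np.cov(x, y)[0, 1] / np.var(y, ddof=1)
        return float(np.mean(x + c_star * (y - mu_y)))

    # Synthetic example: Y is correlated with X and has known mean 0; true E[X] = 2.
    rng = np.random.default_rng(0)
    y = rng.normal(0.0, 1.0, 10_000)
    x = 2.0 + 0.8 * y + rng.normal(0.0, 0.5, 10_000)
    print(np.mean(x), control_variate_estimate(x, y, mu_y=0.0))
    # The control variate estimate sits much closer to 2, reflecting the
    # 100 x Corr[X, Y]^2 percent variance reduction mentioned above.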

35 Control variates can be applied recursively, by redefining the return X_{s,a} for every state-action pair (s, a) ∈ S × A to Z_{s,a} := X_{s,a} + c_{s,a} (Y_{s,a} − E[Y_{s,a}]),  (3)  provided E[Y_{s,a}] exists and is known for all (s, a) ∈ S × A, and Y_{s,a} is a function of the random variables A_t, S_{t+1}, R_{t+1}, … [sent-123, score-0.436]

36 Furthermore, as E[Z_{s_t,a_t} | A_i ∼ π(· | S_i)] = E[X_{s_t,a_t} | A_i ∼ π(· | S_i)] for all policies π, for all (s_t, a_t) ∈ S × A and for all t < i < n, the inductive argument [15] used to establish the asymptotic consistency of UCT still applies when control variates are introduced in this fashion. [sent-128, score-0.4]
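
One hedged way to apply Equation 3 inside the UCT backup is to keep running moments for each state-action pair and use the current estimate of c_{s,a} = −Cov[X_{s,a}, Y_{s,a}]/Var[Y_{s,a}]; this bookkeeping is an illustrative sketch, not the paper's exact implementation:

    class CVStats:
        """Running moments used to estimate c_{s,a} for one state-action pair."""
        def __init__(self):
            self.n = 0
            self.sx = self.sy = self.sxy = self.syy = 0.0

        def update(self, x, y):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxy += x * y
            self.syy += y * y

        def c(self):
            if self.n < 2:
                return 0.0
            mx, my = self.sx / self.n, self.sy / self.n
            cov = self.sxy / self.n - mx * my
            var = self.syy / self.n - my * my
            return -cov / var if var > 0.0 else 0.0

    def cv_return(x, y, stats, mu_y=0.0):
        """Z_{s,a} = X_{s,a} + c_{s,a} * (Y_{s,a} - E[Y_{s,a}]), used in place of the raw return."""
        stats.update(x, y)
        return x + stats.c() * (y - mu_y)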

37 Finding appropriate control variates whose expectations are known in advance can prove difficult. [sent-129, score-0.4]

38 This situation is further complicated in UCT where we seek a set of control variates {Ys,a } for all (s, a) ∈ S × A. [sent-130, score-0.4]

39 Drawing inspiration from advantage sum estimators [25], we now provide a general class of control variates designed for application in UCT. [sent-131, score-0.4]

40 Given a realization of a random simulation trajectory S_t = s_t, A_t = a_t, S_{t+1} = s_{t+1}, A_{t+1} = a_{t+1}, … [sent-132, score-0.25]

41 , S_n = s_n, consider control variates of the form Y_{s_t,a_t} := Σ_{i=t}^{n−1} ( I[b(S_{i+1})] − P[b(S_{i+1}) | S_i = s_i, A_i = a_i] ),  (4)  where b : S → {true, false} denotes a boolean function of state and I denotes the binary indicator function. [sent-135, score-0.547]

42 Thus, using control variates of this form simplifies the task to specifying a state property that is strongly correlated with the return, such that P[b(S_{i+1}) | S_i = s_i, A_i = a_i] is known for all (s_i, a_i) ∈ S × A, for all t ≤ i < n. [sent-137, score-0.741]

43 This considerably reduces the effort required to find an appropriate set of control variates for UCT. [sent-138, score-0.439]
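
A short sketch of evaluating a control variate of the Equation 4 form from one simulated trajectory; b and prob_b stand for the chosen boolean state property and its (assumed known) conditional probability:

    def boolean_control_variate(transitions, b, prob_b):
        """Y_{s_t,a_t} = sum_{i=t}^{n-1} ( I[b(s_{i+1})] - P[b(S_{i+1}) | s_i, a_i] ).

        transitions: list of (s_i, a_i, s_{i+1}) tuples from step t onwards.
        b:           state -> bool, the chosen property.
        prob_b:      (state, action) -> probability that b holds in the successor state.
        """
        y = 0.0
        for s_i, a_i, s_next in transitions:
            y += float(b(s_next)) - prob_b(s_i, a_i)
        return y

By construction each summand has zero conditional expectation, so E[Y_{s_t,a_t}] = 0 and only b and prob_b need to be supplied.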

44 Rather than directly reducing the variance of the individual return estimates, common random numbers can instead be applied to reduce the variance of the estimated differences in return X̄^m_{s,a} − X̄^m_{s,a′}, for each pair of distinct actions a, a′ ∈ A in a state s. [sent-148, score-0.561]

45 This has the benefit of reducing the effect of variance in both determining the action a_t := argmax_{a∈A} X̄^m_{s,a} selected by UCT in state s_t and the actions argmax_{a∈A} X̄^m_{s,a} + c √(log(T^m_s)/T^m_{s,a}) selected by UCB as the search tree is constructed. [sent-149, score-0.573]

46 Our approach is to use the same chance outcomes to determine the trajectories originating from state-action pairs (s, a) and (s, a′) if T^i_{s,a} = T^j_{s,a′}, for any a, a′ ∈ A and i, j ∈ N. [sent-151, score-0.266]

47 This idea can be applied recursively, provided that the shared chance events from the current state do not conflict with those defined at any possible ancestor state. [sent-154, score-0.239]
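
A minimal sketch of one way to realize this sharing: pre-sample a batch of chance outcomes per visit index, so that trajectories with matching visit indices under different actions are driven by identical events (the class and example below are illustrative):

    import random

    class SharedChance:
        """Reuse the same batch of chance outcomes for matching visit indices of sibling actions."""
        def __init__(self, sample_batch, seed=0):
            self.sample_batch = sample_batch      # function: rng -> list of chance outcomes
            self.rng = random.Random(seed)
            self.batches = []                     # batches[i] drives the i-th trajectory of each action

        def outcomes(self, visit_index):
            while len(self.batches) <= visit_index:
                self.batches.append(self.sample_batch(self.rng))
            return self.batches[visit_index]

    # Dice-domain example: one batch is a list of pre-rolled dice pairs long enough for a trajectory.
    shared = SharedChance(lambda rng: [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(200)])
    rolls_for_roll_action = shared.outcomes(visit_index=0)
    rolls_for_stop_action = shared.outcomes(visit_index=0)   # identical batch for the sibling action
    assert rolls_for_roll_action == rolls_for_stop_action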

48 The method of antithetic variates exploits this identity, by deliberately introducing a negative correlation between h1(X) and h2(Y). [sent-164, score-0.638]

49 Like the technique of common random numbers, antithetic variates can be applied to UCT by modifying the way simulation trajectories are sampled. [sent-168, score-0.837]

50 Whenever a node representing (s_i, a_i) ∈ S × A is visited during the backup phase of UCT, the realized trajectory s_{i+1}, r_{i+1}, a_{i+1}, … [sent-169, score-0.279]

51 The next time this node is visited during the selection phase, the previous trajectory is used to predetermine one or more antithetic events that will (partially) drive subsequent state transitions for the current simulation trajectory. [sent-173, score-0.605]

52 This technique can be applied to all state-action pairs inside the tree, provided that the antithetic events determined by any state-action pair do not overlap with the antithetic events defined by any possible ancestor. [sent-175, score-0.852]
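
A small self-contained illustration of the underlying antithetic-variates effect, using the textbook pairing of U with 1 − U rather than the paper's game-specific antithetic events:

    import numpy as np

    def antithetic_mean(h, n_pairs, rng):
        """Average h(U) and h(1-U) over n_pairs antithetic pairs."""
        u = rng.uniform(size=n_pairs)
        return 0.5 * (h(u) + h(1.0 - u)).mean()

    rng = np.random.default_rng(1)
    h = np.exp                                   # monotone h -> h(U) and h(1-U) negatively correlated
    plain = np.array([h(rng.uniform(size=2000)).mean() for _ in range(500)])
    anti = np.array([antithetic_mean(h, 1000, rng) for _ in range(500)])   # same total sample budget
    print(plain.var(), anti.var())               # the antithetic estimator has much lower variance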

53 4 Empirical Results This section begins with a description of our test domains, and how our various variance reduction ideas can be applied to them. [sent-176, score-0.227]

54 1 Test Domains Pig is a turn-based jeopardy dice game that can be played with one or more players [20]. [sent-179, score-0.343]

55 Players roll two dice each turn and keep a turn total. [sent-180, score-0.507]

56 At each decision point, they have two actions, roll and stop. [sent-181, score-0.3]

57 Normally, dice rolls add to the player's turn total, with the following exceptions: if a single 1 is rolled, the turn total is reset and the turn ends; if two 1s are rolled, the player's turn ends and their total score is reset to 0. [sent-183, score-0.527]

58 Can’t Stop is a dice game where the goal is to obtain three complete columns by reaching the highest level in each of the 2-12 columns [19]. [sent-185, score-0.267]

59 This is done by repeatedly rolling 4 dice and playing zero or more pairing combinations. [sent-186, score-0.286]

60 If the dice are rolled and no legal pairing combination can be made, the player loses all of the progress made towards completing columns on this turn. [sent-189, score-0.362]

61 A key component of the game involves correctly assessing the risk associated with not being able to make a legal dice pairing given the current board configuration. [sent-191, score-0.38]

62 It involves acquiring cards by spending the money cards in your current deck. [sent-193, score-0.279]

63 The control variates used for all domains were of the form specified by Equation 4 in Section 3. [sent-204, score-0.44]

64 In Pig, we used a boolean function that returned true if we had just performed the roll action and obtained at least one 1. [sent-206, score-0.402]

65 This control variate has an intuitive interpretation, since we would expect the return from a single trajectory to be an underestimate if it contained more rolls showing a 1 than expected, and an overestimate if it contained fewer rolls showing a 1 than expected. [sent-207, score-0.343]
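
As a hedged sketch, this Pig property can be written as a boolean function plus its known conditional probability: for two fair dice the chance that a roll shows at least one 1 is 1 − (5/6)² = 11/36, which is exactly what a variate of the Equation 4 form needs. The state fields last_action and last_roll are illustrative, not the paper's representation:

    P_AT_LEAST_ONE_ONE = 1.0 - (5.0 / 6.0) ** 2   # = 11/36, independent of the current state

    def b_pig(state):
        """True iff the last action was 'roll' and the resulting two-dice roll contained a 1."""
        return state.last_action == "roll" and 1 in state.last_roll

    def prob_b_pig(state, action):
        """P[b(S_{i+1}) | S_i = state, A_i = action] for the Pig control variate."""
        return P_AT_LEAST_ONE_ONE if action == "roll" else 0.0

These two callables plug directly into a control variate of the Equation 4 form sketched earlier.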

66 In Can’t Stop, we used a similarly inspired boolean function that returned true if we could not make a legal pairing from our most recent roll of the 4 dice. [sent-208, score-0.415]

67 In Dominion, we used a boolean function that returned whether we had just played an action that let us randomly draw a hand with 8 or more money to spend. [sent-209, score-0.219]

68 2, we need to specify the future chance events to be shared across all of the trajectories originating from each state. [sent-222, score-0.286]

69 Since a player’s final score in Pig is strongly dependent on their dice rolls, it is natural to consider sharing one or more future dice roll outcomes. [sent-223, score-0.649]

70 By exploiting the property in Pig that each roll event is independent of the current state, our implementation shares a batch of roll outcomes large enough to drive a complete simulation trajectory. [sent-224, score-0.653]

71 So that these chance events don’t conflict, we limited the sharing of roll events to just the root node. [sent-225, score-0.56]

72 We found this scheme to be superior to sharing a smaller number of future roll outcomes and applying the ideas in Section 3. [sent-227, score-0.392]

73 Here we implemented common random numbers by recursively sharing preshuffled deck configurations across the actions at each state. [sent-230, score-0.205]

74 The motivation for this kind of sharing is that it should reduce the chance of one action appearing better than another simply because of “luckier” shuffles. [sent-231, score-0.228]

75 3, we need to describe how the antithetic events are constructed from previous simulation trajectories. [sent-234, score-0.443]

76 In Pig, a negative correlation between the returns of pairs of simulation trajectories can be induced by forcing the roll outcomes in the second trajectory to oppose those occurring in the first trajectory. [sent-235, score-0.52]

77 Exploiting the property that the relative worth of each pair of dice outcomes is independent of state, a list of antithetic roll outcomes can be constructed by mapping each individual roll outcome in the first trajectory to its antithetic partner. [sent-236, score-1.54]

78 For example, a lucky roll outcome was paired with a correspondingly unlucky one. [sent-237, score-0.532]

79 Chance events that are favorable in the current state are assigned low indexes, while unfavorable events are assigned high index values. [sent-243, score-0.227]

80 Later when the antithetic trajectory needs to be simulated, the previously recorded rank indexes are used to compute the relevant antithetic event for the current state. [sent-245, score-0.673]
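
One illustrative reading of this rank-indexing scheme (the favourability ordering below is a toy ordering by two-dice sum, not the orderings used in the paper):

    def record_ranks(realized_events, ordered_events):
        """Store, for each realized chance event, its rank in the favourability ordering."""
        rank_of = {e: i for i, e in enumerate(ordered_events)}   # rank 0 = most favourable
        return [rank_of[e] for e in realized_events]

    def antithetic_events(recorded_ranks, ordered_events_now):
        """Replay each recorded rank as the mirrored rank under the current state's ordering."""
        n = len(ordered_events_now)
        return [ordered_events_now[n - 1 - r] for r in recorded_ranks]

    # Toy example with two-dice sums ordered from best (12) to worst (2) for the current player:
    ordering = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2]
    ranks = record_ranks([12, 7, 3], ordering)      # lucky 12 -> rank 0, unlucky 3 -> rank 9
    print(antithetic_events(ranks, ordering))       # -> [2, 7, 11]: the lucky 12 becomes an unlucky 2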

81 For Dominion, a number of antithetic mappings were tried, but none provided any substantial reduction in variance. [sent-249, score-0.39]

82 The complexity of how cards can be played to draw more cards from one’s deck makes a good or bad shuffle intricately dependent on the exact composition of cards in one’s deck, of which there are intractably many possibilities with no obvious symmetries. [sent-250, score-0.43]

83 3 Experimental Setup Each variance reduction technique is evaluated in combination with the UCT algorithm, with varying levels of search effort. [sent-252, score-0.301]

84 In Pig, the default (rollout) policy plays the roll and stop actions with probability 0. [sent-253, score-0.601]

85 In Dominion, the default policy incorporates some simple domain knowledge that favors obtaining higher cost cards and avoiding redundant actions. [sent-258, score-0.259]

86 The next set of results is used to assess the overall performance of UCT when augmented with our variance reduction techniques. [sent-265, score-0.197]

87 Therefore, when assessing the potential impact of variance reduction, it is important to know just how much of the estimation error is caused by variance as opposed to bias. [sent-268, score-0.218]

88 This allows us to compute the expected return E[Xs1 | π ∗ ] of the optimal action (roll) at the starting state s1 . [sent-271, score-0.227]

89 We use this value to compute both the bias-squared and variance component of the MSE for the estimated return of the roll action at s1 when using UCT without variance reduction. [sent-272, score-0.65]
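
A minimal sketch of that decomposition over repeated independent runs, using the identity MSE = bias² + variance and assuming the true optimal return is available (as it is for Pig via exact computation):

    import numpy as np

    def mse_decomposition(estimates, true_value):
        """Split the mean squared error of repeated UCT estimates into bias^2 and variance."""
        estimates = np.asarray(estimates, dtype=float)
        bias = estimates.mean() - true_value
        variance = estimates.var()
        mse = np.mean((estimates - true_value) ** 2)
        return mse, bias ** 2, variance            # mse == bias**2 + variance (up to rounding)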

90 As Pig has just two actions, we can also compute the MSE of the estimated difference in return between rolling and stopping using UCT without variance reduction. [sent-277, score-0.217]

91 Here we see that variance is the dominating component (the bias is within ±2) when the number of simulations is less than 1024. [sent-281, score-0.219]

92 The role of bias and variance will of course vary from domain to domain, but this result suggests that variance reduction may play an important role when trying to determine the best action. [sent-282, score-0.351]

93 The results also show a clear benefit to using variance reduction techniques in the challenging game of Dominion. [sent-290, score-0.326]

94 Here the best combination of variance reduction techniques leads to an improvement roughly equivalent to using 25-40% more simulations. [sent-291, score-0.234]

95 The use of antithetic variates in both Pig and Can’t Stop gave a measurable increase in performance; however, the technique was less effective than either control variates or common random numbers. [sent-292, score-1.115]

96 Control variates were particularly helpful across all domains, and even more effective when combined with common random numbers. [sent-293, score-0.365]

97 Common random numbers and antithetic variates increase the space complexity of UCT by a multiplicative constant. [sent-295, score-0.671]

98 Control variates typically increase the time complexity of each value backup by a constant. [sent-296, score-0.384]

99 Note that surprising results are possible; for example, if generating the underlying chance events is expensive, using common random numbers or antithetic variates can even reduce the computational cost of each simulation. [sent-298, score-0.878]

100 6 Conclusion This paper describes how control variates, common random numbers and antithetic variates can be used to improve the performance of Monte-Carlo Tree Search by reducing variance. [sent-302, score-0.764]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('variates', 0.336), ('uct', 0.326), ('antithetic', 0.302), ('pig', 0.294), ('roll', 0.266), ('dominion', 0.191), ('mcts', 0.191), ('dice', 0.175), ('planning', 0.145), ('stop', 0.136), ('st', 0.123), ('var', 0.117), ('xst', 0.116), ('cards', 0.114), ('variance', 0.109), ('si', 0.106), ('action', 0.1), ('chance', 0.095), ('cov', 0.095), ('game', 0.092), ('policy', 0.09), ('reduction', 0.088), ('games', 0.088), ('ai', 0.087), ('events', 0.083), ('mse', 0.08), ('rollout', 0.077), ('tree', 0.07), ('pairing', 0.069), ('trajectory', 0.069), ('return', 0.066), ('simulations', 0.065), ('control', 0.064), ('trajectories', 0.064), ('xcv', 0.064), ('sl', 0.063), ('outcomes', 0.063), ('state', 0.061), ('simulation', 0.058), ('deck', 0.056), ('search', 0.056), ('default', 0.055), ('actions', 0.054), ('money', 0.051), ('sn', 0.05), ('technique', 0.048), ('rolls', 0.048), ('variate', 0.048), ('backup', 0.048), ('chaslot', 0.048), ('crn', 0.048), ('cvcrn', 0.048), ('expectimax', 0.048), ('ine', 0.045), ('bias', 0.045), ('rt', 0.044), ('bs', 0.044), ('originating', 0.044), ('players', 0.044), ('legal', 0.044), ('realized', 0.043), ('tm', 0.042), ('rolled', 0.042), ('rolling', 0.042), ('estimator', 0.041), ('domains', 0.04), ('effort', 0.039), ('xsk', 0.039), ('gelly', 0.039), ('exhaustive', 0.037), ('ucb', 0.037), ('techniques', 0.037), ('estimates', 0.036), ('shuf', 0.036), ('guillaume', 0.036), ('boolean', 0.036), ('mdps', 0.035), ('online', 0.035), ('pair', 0.034), ('decision', 0.034), ('sharing', 0.033), ('turn', 0.033), ('alberta', 0.033), ('numbers', 0.033), ('node', 0.032), ('played', 0.032), ('istvan', 0.032), ('lanctot', 0.032), ('tsk', 0.032), ('veness', 0.032), ('winands', 0.032), ('yizao', 0.032), ('yst', 0.032), ('argmaxa', 0.032), ('player', 0.032), ('stochastic', 0.031), ('cn', 0.031), ('cv', 0.03), ('ideas', 0.03), ('common', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

Author: Joel Veness, Marc Lanctot, Michael Bowling

Abstract: Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can’t Stop and Dominion. 1

2 0.12205539 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

Author: Zhan Lim, Lee Sun, Daniel J. Hsu

Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1

3 0.12138724 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

4 0.09700314 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

5 0.094394431 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

Author: Dylan A. Simon, Nathaniel D. Daw

Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1

6 0.089489445 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

7 0.088371493 41 nips-2011-Autonomous Learning of Action Models for Planning

8 0.086932689 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

9 0.083431445 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

10 0.082539529 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

11 0.080717616 145 nips-2011-Learning Eigenvectors for Free

12 0.076242447 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

13 0.071789369 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

14 0.068869166 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

15 0.068118908 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

16 0.067752898 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

17 0.067506313 218 nips-2011-Predicting Dynamic Difficulty

18 0.064574242 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.062752999 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

20 0.060048949 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, -0.132), (2, 0.049), (3, 0.118), (4, -0.091), (5, -0.003), (6, -0.038), (7, -0.045), (8, -0.01), (9, -0.05), (10, 0.032), (11, -0.008), (12, -0.057), (13, 0.023), (14, 0.012), (15, -0.044), (16, 0.038), (17, 0.032), (18, 0.023), (19, -0.055), (20, 0.057), (21, -0.067), (22, -0.028), (23, 0.0), (24, -0.014), (25, 0.065), (26, -0.07), (27, 0.049), (28, 0.002), (29, -0.063), (30, -0.043), (31, -0.003), (32, -0.011), (33, 0.018), (34, -0.02), (35, -0.012), (36, 0.143), (37, 0.002), (38, -0.055), (39, 0.054), (40, -0.053), (41, 0.094), (42, 0.072), (43, -0.007), (44, 0.03), (45, 0.022), (46, -0.083), (47, -0.001), (48, -0.128), (49, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94420761 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

Author: Joel Veness, Marc Lanctot, Michael Bowling

Abstract: Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can’t Stop and Dominion. 1

2 0.66053277 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

Author: Zhan Lim, Lee Sun, Daniel J. Hsu

Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1

3 0.64697802 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

4 0.60429043 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1

5 0.5922401 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

Author: Andre S. Barreto, Doina Precup, Joelle Pineau

Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1

6 0.59073806 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

7 0.57395703 215 nips-2011-Policy Gradient Coagent Networks

8 0.55055892 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

9 0.54299682 256 nips-2011-Solving Decision Problems with Limited Information

10 0.53728813 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

11 0.52761871 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games

12 0.52312225 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

13 0.50459516 218 nips-2011-Predicting Dynamic Difficulty

14 0.50184727 41 nips-2011-Autonomous Learning of Action Models for Planning

15 0.48721808 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

16 0.48133627 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning

17 0.4613356 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

18 0.45588112 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

19 0.4502286 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

20 0.43920788 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.037), (4, 0.038), (20, 0.028), (26, 0.034), (31, 0.129), (33, 0.025), (43, 0.045), (45, 0.078), (57, 0.047), (71, 0.3), (74, 0.061), (79, 0.012), (83, 0.032), (99, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75478339 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

Author: Joel Veness, Marc Lanctot, Michael Bowling

Abstract: Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can’t Stop and Dominion. 1

2 0.613662 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

Author: Jun-ichiro Hirayama, Aapo Hyvärinen

Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energycorrelations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new twostage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. 1

3 0.59882265 291 nips-2011-Transfer from Multiple MDPs

Author: Alessandro Lazaric, Marcello Restelli

Abstract: Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.

4 0.57949686 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

Author: Christoph H. Lampert

Abstract: We study multi-label prediction for structured output sets, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multilabel classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label set, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds. 1

5 0.5298087 229 nips-2011-Query-Aware MCMC

Author: Michael L. Wick, Andrew McCallum

Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1

6 0.52487761 241 nips-2011-Scalable Training of Mixture Models via Coresets

7 0.524602 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

8 0.52049583 66 nips-2011-Crowdclustering

9 0.52036023 75 nips-2011-Dynamical segmentation of single trials from population neural data

10 0.51903605 249 nips-2011-Sequence learning with hidden units in spiking neural networks

11 0.51835406 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

12 0.51741105 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

13 0.51669872 221 nips-2011-Priors over Recurrent Continuous Time Processes

14 0.51608193 180 nips-2011-Multiple Instance Filtering

15 0.51602823 301 nips-2011-Variational Gaussian Process Dynamical Systems

16 0.51522684 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

17 0.51415807 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

18 0.51356107 197 nips-2011-On Tracking The Partition Function

19 0.51299226 102 nips-2011-Generalised Coupled Tensor Factorisation

20 0.51246822 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines