nips nips2009 nips2009-48 knowledge-graph by maker-knowledge-mining

48 nips-2009-Bootstrapping from Game Tree Search


Source: pdf

Author: Joel Veness, David Silver, William Uther, Alan Blair

Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. [sent-12, score-0.579]

2 First, we update all nodes in the search tree, rather than a single node. [sent-14, score-0.304]

3 We implemented our algorithm in a chess program Meep, using a linear heuristic function. [sent-16, score-0.548]

4 When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play. [sent-18, score-0.697]

5 The idea of search bootstrapping is to adjust the parameters of a heuristic evaluation function towards the value of a deep search. [sent-19, score-0.699]

6 The motivation for this approach comes from the recursive nature of tree search: if the heuristic can be adjusted to match the value of a deep search of depth D, then a search of depth k with the new heuristic would be equivalent to a search of depth k + D with the old heuristic. [sent-20, score-1.584]
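
Using the V-notation defined below (sentences 23–26), this argument can be written as the following sketch; the formalization is ours, the paper states it informally:

```latex
H_{\theta'}(s) = V^{D}_{s}(s)\ \ \forall s
\quad\Longrightarrow\quad
V^{k}_{s_0}(s_0)\big|_{H_{\theta'}}
\;=\;
V^{k+D}_{s_0}(s_0)\big|_{H_{\theta}} .
```

That is, a depth-k search under the new heuristic backs up the same value as a depth-(k + D) search under the old one.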

7 Deterministic, two-player games such as chess provide an ideal test-bed for search bootstrapping. [sent-21, score-0.616]

8 The intricate tactics require a significant level of search to provide an accurate position evaluation; learning without search has produced little success in these domains. [sent-22, score-0.494]

9 Much of the prior work in learning from search has been performed in chess or similar two-player games, allowing for clear comparisons with existing methods. [sent-23, score-0.508]

10 Samuel (1959) first introduced the idea of search bootstrapping in his seminal checkers player. [sent-24, score-0.397]

11 In Samuel’s work the heuristic function was updated towards the value of a minimax search in a subsequent position, after black and white had each played one move. [sent-25, score-0.956]

12 In the TD-Leaf algorithm (Baxter et al., 1998), the heuristic function is adjusted so that the leaf node of the principal variation produced by an alpha-beta search is moved towards the value of an alpha-beta search at a subsequent time step. [sent-28, score-0.929]

13 First, they only update one node after each search, which discards most of the information contained in the search tree. [sent-30, score-0.301]

14 Many of the positions in large search trees come from sequences of unnatural moves that deviate significantly from sensible play. [sent-34, score-0.232]

15 Third, the target search is performed at a subsequent time-step, after a real move and response have been played. [sent-35, score-0.34]

16 We introduce a new framework for bootstrapping from game tree search that differs from prior work in two key respects. [sent-39, score-0.502]

17 First, all nodes in the search tree are updated towards the recursive minimax values computed by a single depth-limited search from the root position. [sent-40, score-1.157]

18 This makes full use of the information contained in the search tree. [sent-41, score-0.232]

19 Second, as the learning target is based on hypothetical minimax play, rather than positions that occur at subsequent time steps, our methods are less sensitive to the opponent’s playing strength. [sent-43, score-0.466]

20 We applied our algorithms to learn a heuristic function for the game of chess, starting from random initial weights and training entirely from self-play. [sent-44, score-0.351]

21 When applied to an alpha-beta search, our chess program learnt to play at a master level against human opposition. [sent-45, score-0.469]

22 The minimax search algorithm exhaustively computes the minimax value to some depth D, using a heuristic function Hθ(s) to evaluate non-terminal states at depth D, based on a parameter vector θ. [sent-46, score-1.327]
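
A minimal Python sketch of this background procedure, i.e. depth-limited minimax with heuristic evaluation at the frontier. The state interface (is_terminal, terminal_value, legal_moves, apply) and the feature map phi are assumptions for illustration, not the paper's Meep/Bodo implementation:

```python
import numpy as np

def heuristic(theta, s, phi):
    # Linear heuristic H_theta(s) = phi(s)^T theta (the form the paper adopts later).
    return float(np.dot(phi(s), theta))

def minimax_value(s, depth, theta, phi, maximizing=True):
    # Depth-limited minimax: exact backups down to the cutoff, then H_theta
    # evaluates the non-terminal frontier states.
    if s.is_terminal():
        return s.terminal_value()          # assumed interface, e.g. +1 / 0 / -1
    if depth == 0:
        return heuristic(theta, s, phi)
    children = (minimax_value(s.apply(m), depth - 1, theta, phi, not maximizing)
                for m in s.legal_moves())
    return max(children) if maximizing else min(children)
```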

23 We use the notation V^D_{s0}(s) to denote the value of state s in a depth D minimax search from root state s0. [sent-47, score-0.717]

24 We define T^D_{s0} to be the set of states in the depth D search tree from root state s0. [sent-48, score-0.460]

25 We define the principal leaf, l^D(s), to be the leaf state of the depth D principal variation from state s. [sent-49, score-0.235]

26 We use the notation ←θ to indicate a backup that updates the heuristic function towards some target value. [sent-50, score-0.373]

27 Without search or prior domain knowledge, the target value is noisy and improvements to the value function are hard to distinguish. [sent-53, score-0.26]

28 In the game of chess, using a naive heuristic and no search, it is hard to find checkmate sequences, meaning that most games are drawn. [sent-54, score-0.405]

29 The quality of the target value can be significantly improved by using a minimax backup to update the heuristic towards the value of a minimax search. [sent-55, score-1.082]

30 Samuel’s checkers player (Samuel, 1959) introduced this idea, using an early form of bootstrapping from search that we call TD-Root. [sent-56, score-0.453]

31 The parameters of the heuristic function, θ, were adjusted towards the minimax search value at the next complete time-step (see Figure 1): Hθ(st) ←θ V^D_{st+1}(st+1). [sent-57, score-0.868]

32 The TD-Leaf algorithm (Baxter et al., 1998) updates the value of a minimax search at one time-step towards the value of a minimax search at the subsequent time-step (see Figure 1). [sent-61, score-1.236]

33 The parameters of the heuristic function are updated by gradient descent, using an update of the form V^D_{st}(st) ←θ V^D_{st+1}(st+1). [sent-62, score-0.274]

34 The root value of minimax search is not differentiable in the parameters, as a small change in the heuristic value can result in the principal variation switching to a completely different path through the tree. [sent-63, score-0.854]

35 This is equivalent to updating the heuristic function of the principal leaf: Hθ(l^D(st)) ←θ V^D_{st+1}(st+1). [sent-65, score-0.267]
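
For concreteness, a sketch of the two classical updates as gradient steps on a linear heuristic, reusing the minimax_value sketch above. The step size alpha and the principal_leaf helper are assumptions of this sketch, not code from the paper or from Knightcap:

```python
import numpy as np

def td_root_step(theta, s_t, s_next, D, alpha, phi, minimax_value):
    # TD-Root (Samuel, 1959): Hθ(st) <- V^D_{st+1}(st+1).
    target = minimax_value(s_next, D, theta, phi)
    error = target - float(np.dot(phi(s_t), theta))
    return theta + alpha * error * phi(s_t)

def td_leaf_step(theta, s_t, s_next, D, alpha, phi, minimax_value, principal_leaf):
    # TD-Leaf (Baxter et al., 1998): update the principal leaf l^D(st)
    # towards the search value at the subsequent time-step.
    leaf = principal_leaf(s_t, D, theta)   # assumed helper returning l^D(st)
    target = minimax_value(s_next, D, theta, phi)
    error = target - float(np.dot(phi(leaf), theta))
    return theta + alpha * error * phi(leaf)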

36 The chess program Knightcap achieved master-level play when trained using TD-Leaf against a series of evenly matched human opposition, whose strength improved at a similar rate to Knightcap’s. [sent-66, score-0.422]

37 A similar algorithm was introduced contemporaneously by Beal and Smith (1997), and was used to learn the material values of chess pieces. [sent-67, score-0.276]

38 The world champion checkers program Chinook used TD-Leaf to learn an evaluation function that compared favorably to its hand-tuned heuristic function (Schaeffer et al. [sent-68, score-0.387]

39 Both TD-Root and TD-Leaf are hybrid algorithms that combine a sample backup with a minimax backup, updating the current value towards the search value at a subsequent time-step. [sent-70, score-0.766]

40 In their experiments with the chess program Knightcap (Baxter et al., 1998). [sent-73, score-0.339]

41 In addition, both Samuel’s approach and TD-Leaf only update one node of the search tree. [sent-78, score-0.301]

42 This does not make efficient use of the large tree of data, typically containing millions of values, that is constructed by memory-enhanced minimax search variants. [sent-79, score-0.649]

43 Furthermore, the distribution of root positions that are used to train the heuristic is very different from the distribution of positions that are evaluated during search. [sent-80, score-0.379]

44 This can lead to inaccurate evaluation of positions that occur infrequently during real games but frequently within a large search tree; these anomalous values have a tendency to propagate up through the search tree, ultimately affecting the choice of best move at the root. [sent-81, score-0.731]

45 Our first algorithm, RootStrap(minimax), performs a minimax search from the current position st, at every time-step t. [sent-83, score-0.807]

46 The parameters are updated so as to move the heuristic value of the root node towards the minimax search value: Hθ(st) ←θ V^D_{st}(st). [sent-84, score-0.988]

47 We update the parameters by stochastic gradient descent on the squared error between the heuristic value and the minimax search value. [sent-85, score-0.813]

48 We treat the minimax search value as a constant, to ensure that we move the heuristic towards the search value, and not the other way around. [sent-86, score-1.114]

49 For the remainder of this paper we consider heuristic functions that are computed by a linear combination Hθ(s) = φ(s)^T θ, where φ(s) is a vector of features of position s, and θ is a parameter vector specifying the weight of each feature in the linear combination. [sent-89, score-0.239]
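
A sketch of one RootStrap(minimax) step under this linear form; treating the search value as a constant means no gradient flows through it, so the update reduces to the delta rule (alpha is an assumed step size):

```python
import numpy as np

def rootstrap_step(theta, s_t, D, alpha, phi, minimax_value):
    # Gradient descent on (Hθ(st) - V^D_{st}(st))^2 with the search value v
    # held constant: theta <- theta - alpha * (h - v) * phi(st).
    v = minimax_value(s_t, D, theta, phi)
    h = float(np.dot(phi(s_t), theta))
    return theta - alpha * (h - v) * phi(s_t)
```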

50 Our second algorithm, TreeStrap(minimax), also performs a minimax search from the current position st . [sent-95, score-0.807]

51 However, TreeStrap(minimax) updates all interior nodes within the search tree. [sent-96, score-0.269]

52 The parameters are updated, for each position s in the tree, towards the minimax search value of s: Hθ(s) ←θ V^D_{st}(s), ∀s ∈ T^D_{st}. [sent-97, score-0.660]
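
A sketch of the corresponding TreeStrap(minimax) step. It assumes the search records the backed-up value of every node, modeled here as a dict from (hashable) state to V^D_{st}(s):

```python
import numpy as np

def treestrap_minimax_step(theta, tree_values, alpha, phi):
    # Apply the RootStrap-style delta rule to every state in T^D_{st},
    # not just the root; tree_values[s] = V^D_{st}(s) (assumed recorded).
    delta = np.zeros_like(theta)
    for s, v in tree_values.items():
        delta += (v - float(np.dot(phi(s), theta))) * phi(s)
    return theta + alpha * delta
```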

53 The concept of minimax search bootstrapping can be extended to αβ-search. [sent-99, score-0.671]

54 Unlike minimax search, alpha-beta does not compute an exact value for the majority of nodes in the search tree. [sent-100, score-0.606]

55 Instead, the search is cut off when the value of the node is sufficiently high or low that it can no longer contribute to the principal variation. [sent-101, score-0.3]

56 We consider a depth D alpha-beta search from root position s0, and denote the upper and lower bounds computed for node s by a^D_{s0}(s) and b^D_{s0}(s) respectively, so that b^D_{s0}(s) ≤ V^D_{s0}(s) ≤ a^D_{s0}(s). [sent-102, score-0.444]

57 The bounded values computed by alpha-beta can be exploited by search bootstrapping, by using a one-sided loss function. [sent-107, score-0.232]

58 If the value from the heuristic evaluation is larger than the a-bound of the deep search value, then it is reduced towards the a-bound: Hθ(s) ←θ a^D_{st}(s). [sent-108, score-0.597]

59 Similarly, if the value from the heuristic evaluation is smaller than the b-bound of the deep search value, then it is increased towards the b-bound: Hθ(s) ←θ b^D_{st}(s). [sent-109, score-0.805]
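
A sketch of this one-sided TreeStrap(αβ) update. The (b, a) bound pairs per node are assumed to be recorded during the alpha-beta search; states whose heuristic value already lies inside its bounds are left untouched:

```python
import numpy as np

def treestrap_alphabeta_step(theta, tree_bounds, alpha, phi):
    # tree_bounds[s] = (b, a) with b^D_{st}(s) <= V^D_{st}(s) <= a^D_{st}(s).
    delta = np.zeros_like(theta)
    for s, (b, a) in tree_bounds.items():
        h = float(np.dot(phi(s), theta))
        if h > a:              # heuristic exceeds the upper bound: reduce towards a
            delta += (a - h) * phi(s)
        elif h < b:            # heuristic below the lower bound: increase towards b
            delta += (b - h) * phi(s)
    return theta + alpha * delta
```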

60 High-performance αβ-search routines rely on transposition tables for move ordering, reducing the size of the search space, and for caching previous search results (Schaeffer, 1989). [sent-113, score-0.712]

61 DeltaFromTransTbl requires a standard transposition table implementation providing the following routines: • probe(s), which returns the transposition table entry associated with state s. [sent-116, score-0.344]

62 • depth(t), which returns the amount of search depth used to determine the bound estimates stored in transposition table entry t. [sent-117, score-0.51]

63 In addition, DeltaFromTransTbl requires a parameter d ≥ 1 that limits updates to Δθ to transposition table entries whose stored search depth is at least d. [sent-120, score-0.510]
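
A sketch of how DeltaFromTransTbl might traverse the table using the probe and depth routines named above. The entry fields (lower, upper) and the recursion over legal moves are assumptions filled in for illustration; a real implementation would bound the traversal to the tree just searched:

```python
import numpy as np

def delta_from_trans_tbl(s, theta, d, phi, probe, depth, legal_moves, apply_move):
    # Accumulate the one-sided TreeStrap(alpha-beta) gradient Δθ from
    # transposition table entries, skipping any entry searched to depth < d.
    delta = np.zeros_like(theta)
    t = probe(s)
    if t is None or depth(t) < d:
        return delta
    h = float(np.dot(phi(s), theta))
    if h > t.upper:                    # assumed field: stored a-bound
        delta += (t.upper - h) * phi(s)
    elif h < t.lower:                  # assumed field: stored b-bound
        delta += (t.lower - h) * phi(s)
    for m in legal_moves(s):           # recurse through the searched subtree
        delta += delta_from_trans_tbl(apply_move(s, m), theta, d, phi,
                                      probe, depth, legal_moves, apply_move)
    return delta
```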

64 We implemented our learning algorithms in Meep, a modified version of the tournament chess engine Bodo. [sent-126, score-0.403]

65 Five well-known, chess-specific feature construction concepts (material, piece-square tables, pawn structure, mobility and king safety) were used to generate the 1812 distinct features. [sent-131, score-0.302]

66 In the search tree, mate scores were adjusted inward slightly so that shorter paths to mate were preferred when giving mate, and vice-versa. [sent-141, score-0.319]

67 When applying the heuristic evaluation function in the search, the heuristic estimates were truncated to the interval [−9900.0, 9900.0]. [sent-142, score-0.470]

68 When in tournament mode, Meep uses an enhanced alpha-beta-based search algorithm. [sent-146, score-0.332]

69 In training mode, however, one of two different game tree search algorithms is used. [sent-148, score-0.456]

70 The first is a minimax search that stores the entire game tree in memory. [sent-149, score-0.737]

71 The second is a generic alpha-beta search implementation that uses only three well-known alpha-beta search enhancements: transposition tables, killer move tables and the history heuristic (Schaeffer, 1989). [sent-151, score-0.921]

72 This simplified search routine was used by the TreeStrap(αβ) and RootStrap(αβ) algorithms. [sent-152, score-0.273]

73 During training, the transposition table was cleared before the search routine was invoked. [sent-154, score-0.445]

74 Simplified search algorithms were used during training to avoid complicated interactions with the more advanced heuristic search techniques (such as null move pruning) useful in tournament play. [sent-155, score-0.845]

75 It must be stressed that during training, no heuristic or move ordering techniques dependent on knowing properties of the evaluation weights were used by the search algorithms. [sent-156, score-0.561]

76 Furthermore, a quiescence search (Beal, 1990) that examined all captures and check evasions was applied to leaf nodes. [sent-157, score-0.365]

77 Again, no knowledge-based pruning was performed inside the quiescence search tree, which meant that the quiescence routine was considerably slower than in Bodo. [sent-159, score-0.417]

78 This is the standard way of quantifying the strength of a chess player within a pool of players. [sent-162, score-0.332]

79 The TreeStrap variants used a minimum search depth parameter of d = 1. [sent-177, score-0.338]

80 We ran a competition between many different versions of Meep in tournament mode, each using a heuristic function learned by one of our algorithms. [sent-180, score-0.309]

81 All Elo values are calculated relative to the reference player, and should not be compared with Elo ratings of human chess players (including the results of online play, described in the next section). [sent-194, score-0.364]

82 The results demonstrate that learning from many nodes in the search tree is significantly more efficient than learning from a single root node. [sent-196, score-0.391]

83 In addition, learning from alpha-beta search is more effective than learning from minimax search. [sent-198, score-0.569]

84 Alpha-beta search significantly boosts the search depth, by safely pruning away subtrees that cannot affect the minimax value at the root. [sent-199, score-0.801]

85 Although the majority of nodes now contain one-sided bounds rather than exact values, it appears that the improvements to the search depth outweigh the loss of bound information. [sent-200, score-0.375]

86 We also evaluated the performance of the heuristic function learned by TreeStrap(αβ), by using it in Meep to play against predominantly human opposition at the Internet Chess Club. [sent-206, score-0.321]

87 We evaluated two heuristic functions, the first using weights trained by self-play, and the second using weights trained against Shredder, a grandmaster-strength commercial chess program. [sent-207, score-0.535]

88 Approximately 1000 games were played online, using 3m 3s Fischer time controls, for each heuristic function. [sent-214, score-0.367]

89 Although the heuristic function was fixed, the online rating fluctuated significantly over time. [sent-215, score-0.308]

90 The online rating of the heuristic learned by self-play corresponds to weak master level play. [sent-217, score-0.355]

91 The heuristic learned from games against Shredder was roughly 150 Elo stronger, corresponding to master level performance. [sent-218, score-0.364]

92 Furthermore, we used a generic alpha-beta search algorithm for learning. [sent-223, score-0.232]

93 An interesting follow-up would be to explore the interaction between our learning algorithms and the more exotic alpha-beta search enhancements. [sent-224, score-0.232]

94 Bootstrapping from search could, in principle, be applied to many other search algorithms. [sent-228, score-0.464]

95 Simulation-based search algorithms, such as UCT, have outperformed traditional search algorithms in a number of domains. [sent-229, score-0.464]

96 The TreeStrap algorithm could be applied, for example, to the heuristic function that is used to initialise nodes in a UCT search tree with prior knowledge (Gelly & Silver, 2007). [sent-230, score-0.604]

97 Alternatively, in stochastic domains the evaluation function could be updated towards the value of an expectimax search, or towards the one-sided bounds computed by a *-minimax search (Hauk et al. [sent-231, score-0.436]

98 Knightcap: a chess program that learns by combining TD(λ) with game-tree search. [sent-239, score-0.339]

99 The history heuristic and alpha-beta search enhancements in practice. [sent-288, score-0.466]

100 Effective use of transposition tables in stochastic game tree search. [sent-307, score-0.373]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('treestrap', 0.501), ('minimax', 0.337), ('chess', 0.276), ('search', 0.232), ('vsd', 0.229), ('heuristic', 0.209), ('st', 0.208), ('rootstrap', 0.172), ('transposition', 0.172), ('elo', 0.158), ('meep', 0.115), ('games', 0.108), ('depth', 0.106), ('bootstrapping', 0.102), ('tournament', 0.1), ('samuel', 0.097), ('schaeffer', 0.088), ('game', 0.088), ('nsw', 0.086), ('bd', 0.08), ('tree', 0.08), ('backup', 0.075), ('rating', 0.072), ('quiescence', 0.072), ('positions', 0.064), ('checkers', 0.063), ('program', 0.063), ('towards', 0.061), ('leaf', 0.061), ('td', 0.059), ('play', 0.058), ('deltafromtranstbl', 0.057), ('knightcap', 0.057), ('player', 0.056), ('baxter', 0.054), ('evaluation', 0.052), ('opening', 0.051), ('opponent', 0.051), ('played', 0.05), ('self', 0.049), ('ad', 0.048), ('master', 0.047), ('blair', 0.046), ('initialise', 0.046), ('initialised', 0.046), ('silver', 0.046), ('deep', 0.043), ('amateur', 0.043), ('bodo', 0.043), ('buro', 0.043), ('shredder', 0.043), ('veness', 0.043), ('move', 0.043), ('root', 0.042), ('routine', 0.041), ('nicta', 0.039), ('tt', 0.039), ('lowerbound', 0.038), ('sydney', 0.038), ('subsequent', 0.037), ('nodes', 0.037), ('players', 0.036), ('update', 0.035), ('node', 0.034), ('principal', 0.034), ('tables', 0.033), ('ld', 0.031), ('upperbound', 0.031), ('updated', 0.03), ('beal', 0.03), ('position', 0.03), ('adjusted', 0.029), ('mate', 0.029), ('training', 0.029), ('backgammon', 0.029), ('blundered', 0.029), ('chinook', 0.029), ('deltaf', 0.029), ('gelly', 0.029), ('hauk', 0.029), ('opposition', 0.029), ('ranst', 0.029), ('romt', 0.029), ('tst', 0.029), ('uct', 0.029), ('untrained', 0.029), ('campbell', 0.028), ('target', 0.028), ('internet', 0.028), ('mode', 0.027), ('engine', 0.027), ('online', 0.027), ('piece', 0.026), ('human', 0.025), ('weights', 0.025), ('club', 0.025), ('enhancements', 0.025), ('tesauro', 0.025), ('updating', 0.024), ('book', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 48 nips-2009-Bootstrapping from Game Tree Search

Author: Joel Veness, David Silver, William Uther, Alan Blair

Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.

2 0.11831281 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, ..., X_p) = Σ_{j∈S} h*_j(X_j), where each component function h*_j lies in some class H of “smooth” functions, and S ⊂ {1, ..., p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max(s log(p/s)/n, s ε_n^2(H)). The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ε_n^2(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an m-th order Sobolev space), then the s-dimensional estimation term takes the form s ε_n^2(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

3 0.11307919 161 nips-2009-Nash Equilibria of Static Prediction Games

Author: Michael Brückner, Tobias Scheffer

Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically.

4 0.10254815 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings.

5 0.086444736 221 nips-2009-Solving Stochastic Games

Author: Liam M. Dermed, Charles L. Isbell

Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within ε absolute error of the optimal game-theoretic solution, in time polynomial in 1/ε. Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.

6 0.082333654 232 nips-2009-Strategy Grafting in Extensive Games

7 0.074421853 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games

8 0.066421278 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

9 0.064330228 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

10 0.064030878 11 nips-2009-A General Projection Property for Distribution Families

11 0.061602872 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

12 0.052196637 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

13 0.050809309 74 nips-2009-Efficient Bregman Range Search

14 0.046670828 181 nips-2009-Online Learning of Assignments

15 0.044862002 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

16 0.043216508 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

17 0.042026207 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

18 0.040458556 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

19 0.040226769 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

20 0.038782127 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.116), (1, 0.063), (2, 0.052), (3, -0.084), (4, -0.039), (5, 0.044), (6, -0.051), (7, 0.023), (8, -0.01), (9, -0.02), (10, -0.159), (11, -0.158), (12, -0.026), (13, 0.028), (14, 0.048), (15, -0.041), (16, 0.029), (17, 0.011), (18, 0.02), (19, 0.032), (20, -0.034), (21, -0.009), (22, 0.019), (23, 0.013), (24, -0.016), (25, -0.048), (26, -0.039), (27, 0.04), (28, -0.03), (29, -0.028), (30, -0.056), (31, 0.002), (32, -0.141), (33, -0.037), (34, -0.021), (35, -0.059), (36, -0.116), (37, 0.083), (38, 0.08), (39, 0.02), (40, 0.001), (41, 0.025), (42, 0.007), (43, 0.094), (44, -0.054), (45, 0.051), (46, -0.153), (47, -0.019), (48, 0.078), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94226462 48 nips-2009-Bootstrapping from Game Tree Search

Author: Joel Veness, David Silver, William Uther, Alan Blair

Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.

2 0.57601893 161 nips-2009-Nash Equilibria of Static Prediction Games

Author: Michael Brückner, Tobias Scheffer

Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically.

3 0.53569496 221 nips-2009-Solving Stochastic Games

Author: Liam M. Dermed, Charles L. Isbell

Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within ε absolute error of the optimal game-theoretic solution, in time polynomial in 1/ε. Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.

4 0.5251565 232 nips-2009-Strategy Grafting in Extensive Games

Author: Kevin Waugh, Nolan Bard, Michael Bowling

Abstract: Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is played in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement.

5 0.48888054 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, ..., X_p) = Σ_{j∈S} h*_j(X_j), where each component function h*_j lies in some class H of “smooth” functions, and S ⊂ {1, ..., p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max(s log(p/s)/n, s ε_n^2(H)). The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ε_n^2(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an m-th order Sobolev space), then the s-dimensional estimation term takes the form s ε_n^2(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

6 0.47126409 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games

7 0.44499463 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

8 0.43732333 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

9 0.42123055 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

10 0.41165465 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

11 0.39074433 11 nips-2009-A General Projection Property for Distribution Families

12 0.38618731 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

13 0.37112546 234 nips-2009-Streaming k-means approximation

14 0.36212394 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

15 0.34748286 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

16 0.34339088 129 nips-2009-Learning a Small Mixture of Trees

17 0.34334961 45 nips-2009-Beyond Convexity: Online Submodular Minimization

18 0.3357065 239 nips-2009-Submodularity Cuts and Applications

19 0.33040094 181 nips-2009-Online Learning of Assignments

20 0.32735431 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.088), (25, 0.046), (35, 0.049), (36, 0.085), (39, 0.047), (46, 0.364), (58, 0.037), (61, 0.058), (71, 0.043), (86, 0.059), (91, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74517506 48 nips-2009-Bootstrapping from Game Tree Search

Author: Joel Veness, David Silver, William Uther, Alan Blair

Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.

2 0.58888674 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Author: Chong Wang, David M. Blei

Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models.

3 0.57326245 96 nips-2009-Filtering Abstract Senses From Image Search Results

Author: Kate Saenko, Trevor Darrell

Abstract: We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to find images on the web that are associated with the object name and train a visual classifier from the search result. As words are generally polysemous, this approach can lead to relatively noisy models if many examples due to outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classifier to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difficult to distinguish whether an image cluster relates to a desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model does not require any human supervision, and takes as input only the name of an object category. We show results of retrieving concrete-sense images in two available multimodal, multi-sense databases, as well as experiment with object classifiers trained on concrete-sense images returned by our method for a set of ten common office objects.

4 0.41203323 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya

Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. There are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the state-action joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.

5 0.40703121 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima: the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed; we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain.

6 0.4052003 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

7 0.40497908 239 nips-2009-Submodularity Cuts and Applications

8 0.40143859 33 nips-2009-Analysis of SVM with Indefinite Kernels

9 0.39973021 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

10 0.39822346 45 nips-2009-Beyond Convexity: Online Submodular Minimization

11 0.39712495 71 nips-2009-Distribution-Calibrated Hierarchical Classification

12 0.3968218 72 nips-2009-Distribution Matching for Transduction

13 0.39523393 220 nips-2009-Slow Learners are Fast

14 0.39484242 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

15 0.39465126 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

16 0.39404765 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

17 0.39383602 3 nips-2009-AUC optimization and the two-sample problem

18 0.39311859 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

19 0.39235842 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

20 0.39204529 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models