nips nips2003 nips2003-84 knowledge-graph by maker-knowledge-mining

84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?


Source: pdf

Author: Daniela Pucci de Farias, Nimrod Megiddo

Abstract: The so-called “experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A new experts algorithm is presented and analyzed in the context of repeated games. It is shown that asymptotically, under certain conditions, it performs as well as the best available expert. This algorithm is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. [sent-5, score-0.405]

2 The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. [sent-6, score-0.834]

3 A new experts algorithm is presented and analyzed in the context of repeated games. [sent-8, score-0.413]

4 This algorithm is quite different from previously proposed experts algorithms. [sent-10, score-0.33]

5 It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. [sent-11, score-0.383]

6 The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. [sent-12, score-0.439]

7 A well-known class of methods in machine learning is that of the so-called experts algorithms. [sent-14, score-0.31]

8 The goal of these methods is to learn from experience how to combine advice from multiple experts in order to make sequential decisions in an online environment. [sent-15, score-0.331]

9 The reward in each stage is a function of the chosen action and the choices of Nature or the environment (also referred to as the “adversary” or the “opponent”). [sent-18, score-0.5]

10 Each expert suggests a choice of an action based on the history of the process and the expert’s own choice algorithm. [sent-25, score-0.593]

11 An experts algorithm directs the agent with regard to which expert to follow in the next stage, based on the past history of actions and rewards. [sent-27, score-1.079]

12 An expert selection rule is said to minimize regret if it yields an average reward as large as that of any single expert, against any fixed sequence of actions chosen by the opponent. [sent-31, score-0.997]

13 Indeed, certain experts algorithms, which at each stage choose an expert from a probability distribution that is related to the reward accumulated by the expert prior to that stage, have been shown to minimize regret [1, 2]. [sent-32, score-1.835]
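As a rough illustration of this family of rules, the sketch below selects an expert with probability proportional to an exponential function of its accumulated reward. The exponential weighting and the learning rate eta are illustrative assumptions; they stand in for, but are not identical to, the specific algorithms of [1, 2].

```python
import math
import random

def select_expert(cumulative_rewards, eta=0.1):
    """Pick an expert with probability proportional to exp(eta * accumulated reward),
    a standard reward-weighted (regret-minimizing) selection rule."""
    weights = [math.exp(eta * r) for r in cumulative_rewards]
    total = sum(weights)
    return random.choices(range(len(weights)), weights=[w / total for w in weights], k=1)[0]

# After some stages, the expert with the largest accumulated reward (index 2 here)
# is the most likely, though not the only possible, choice.
print(select_expert([3.0, 5.5, 8.0]))
```

Note that the selection depends only on each expert's own accumulated reward, not on how the opponent reacted while that expert was being followed.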

14 It is crucial to note, though, that since the experts are compared on a sequence-by-sequence basis, the MR criterion ignores the possibility that different experts may induce different sequences of choices by the opponent. [sent-33, score-0.715]

15 In repeated games, the assumption that the opponent’s choices are independent of the agent’s choices is not justified, because the opponent is likely to base his choices of actions on the past history of the game. [sent-37, score-0.852]

16 But even in the case of zero-sum games, the possibility that an opponent has bounded rationality may lead a player to look for patterns to be exploited in the opponent’s past actions. [sent-40, score-0.696]

17 In the single-stage Prisoner’s Dilemma (PD) game, each player can either cooperate (C) or defect (D). [sent-43, score-0.432]

18 Defecting is better than cooperating regardless of what the opponent does, but it is better for both players if both cooperate than if both defect. [sent-44, score-0.4]

19 Suppose the row player consults with a set of experts, including the “defecting expert,” who recommends defection all the time. [sent-46, score-0.446]

20 Let the strategy of the column player in the repeated game be fixed. [sent-47, score-0.681]

21 In particular, the column player may be very patient and cooperative, willing to wait for the row player to become cooperative, but eventually becoming non-cooperative if the row player does not seem to cooperate. [sent-48, score-1.121]

22 Since defection is a dominant strategy in the stage game, the defecting expert achieves in each step a reward as high as any other expert against any sequence of choices of the column player, so the row player learns with the experts algorithm to defect all the time. [sent-49, score-2.421]

23 Obviously, constant defection is not the best response in the repeated game against many possible strategies of the column player. [sent-51, score-0.408]

24 For instance, the row player would regret very much using the experts algorithm if he were told later that the column player had been playing a strategy such as Tit-for-Tat. [sent-52, score-1.32]

25 In this paper, we propose and analyze a new experts algorithm, which follows experts judiciously, attempting to maximize the long-term average reward. [sent-53, score-0.664]

26 First, each time an expert is selected, it is followed for multiple stages of the game rather than a single one. [sent-55, score-0.851]

27 Second, our algorithm takes into account only the rewards that were actually achieved by an expert in the stages it was followed, rather than the reward that could have been obtained in any stage. [sent-56, score-0.791]

28 (The Tit-for-Tat strategy is to play C in the first stage, and later play in every stage whatever the opponent played in the preceding stage.) [sent-57, score-0.89]
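To make the introductory example concrete, the sketch below simulates the repeated Prisoner's Dilemma against Tit-for-Tat and compares the long-run average reward of constant defection with that of constant cooperation. The payoff numbers (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection) are conventional values assumed for illustration, not taken from the paper.

```python
# Row player's payoffs in the one-shot Prisoner's Dilemma (assumed standard values).
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tit_for_tat(history):
    """Column player: cooperate first, then repeat the row player's previous move."""
    return 'C' if not history else history[-1][0]  # history holds (row_move, col_move) pairs

def average_reward(row_move, stages=1000):
    history, total = [], 0
    for _ in range(stages):
        col_move = tit_for_tat(history)
        total += PAYOFF[(row_move, col_move)]
        history.append((row_move, col_move))
    return total / stages

print('always defect   :', average_reward('D'))  # close to 1 (mutual defection sets in)
print('always cooperate:', average_reward('C'))  # exactly 3 (mutual cooperation)
```

Defection is better in every single stage for any fixed opponent move, yet cooperation earns the higher long-run average here; this gap between regret minimization and long-term optimization is what the paper targets.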

29 A “worst-case” guarantee: in any play of the game, our algorithm achieves an average reward that is asymptotically as large as that of the expert that did best in the rounds of the game when it was played. [sent-60, score-1.127]

30 Under certain conditions, our algorithm achieves an average reward that is asymptotically as large as the average reward that could have been achieved by the best expert, had it been followed exclusively. [sent-63, score-0.694]

31 A bound based on actual expert performance is presented in section 3. [sent-68, score-0.505]

32 2 The algorithm. We consider an “experts strategy” for the row player in a repeated two-person game in normal form. [sent-71, score-0.653]

33 At each stage of the game, the row and column players choose actions i ∈ I and j ∈ J, respectively. [sent-72, score-0.679]

34 The row player has a reward matrix R, with entries 0 ≤ Rij ≤ u. [sent-73, score-0.568]

35 The row player may consult at each stage with a set of experts {1, . . . , r}. [sent-74, score-0.842]

36 We denote by σe the strategy proposed by expert e, i.e., [sent-78, score-0.564]

37 σe = σe(hs) is the proposed probability distribution over actions in stage s, given the history hs. [sent-80, score-0.421]

38 We refer to the row player as the agent and to the column player as the opponent. [sent-81, score-0.781]

39 Usually, the form of experts algorithms found in the literature is as follows. [sent-82, score-0.31]

40 Denote by Me(s − 1) the average reward achieved by expert e prior to stage s of the game. [sent-83, score-0.931]

41 Then, a reasonable rule is to follow expert e in stage s with a probability that is proportional to some monotone function of Me (s − 1). [sent-84, score-0.646]

42 Letting j1, j2, . . . , js denote the observed actions of the opponent up to stage s, and letting σX denote the strategy induced by the experts algorithm, we have (1/s) Σ_{s'=1}^{s} E[R(i, j_{s'}) : i ∼ σX(h_{s'})] ≥ sup_e (1/s) Σ_{s'=1}^{s} E[R(i, j_{s'}) : i ∼ σe(h_{s'})] − o(1). [sent-89, score-1.145]

43 This subtlety is also missing in the experts algorithm we described above. [sent-91, score-0.33]

44 At each stage of the game, the selection of an expert is based solely on how well various experts have, or could have, done so far. [sent-93, score-0.956]

45 For instance, in the repeated PD game described in the introduction, assuming that the opponent is playing Tit-for-Tat, the algorithm is unable to establish the connection between the opponent’s cooperative moves and his own. [sent-95, score-0.673]

46 Based on the previous observations, we propose a new experts algorithm, which takes into account how the opponent reacts to each of the experts. [sent-96, score-0.617]

47 The idea is simple: instead of choosing a (potentially different) expert at each stage of the game, the number of stages an expert is followed, each time it is selected, increases gradually. [sent-97, score-1.274]

48 The number of phases during which expert e has been followed is denoted by Ne . [sent-101, score-0.607]

49 The average payoff from phases in which expert e has been followed is denoted by Me . [sent-102, score-0.731]

50 With probability 1/i perform an exploration phase, namely, choose an expert e from the uniform distribution over {1, . . . , r}. [sent-110, score-0.521]

51 Otherwise, perform an exploitation phase, namely, choose an expert e from the uniform distribution over the set of experts e with maximum Me. [sent-113, score-0.794]

52 Follow expert e’s instructions for the next Ne stages. [sent-116, score-0.484]

53 Thus, Me (i) and Ne (i) are, respectively, the average payoff accumulated by expert e and the total number of phases this expert was followed on or before phase i. [sent-137, score-1.326]

54 We will also let M (s) and M (i) denote, without confusion, the average payoff accumulated by the algorithm in the first s stages or first i phases of the game. [sent-138, score-0.406]
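The sketch below reconstructs the phase-based selection rule from the summary sentences above. The exact bookkeeping is an assumption: here Ne is incremented when expert e is selected and the expert is then followed for Ne stages, Me is the running average reward per stage over those stages, and ties among maximizers are broken uniformly at random; the paper's own definitions may differ in these details.

```python
import random

def strategic_experts_algorithm(num_experts, play_stage, num_phases=10_000):
    """Minimal sketch of the phase-based rule: explore with probability 1/i in
    phase i, otherwise exploit an expert with maximal Me, and follow the chosen
    expert for a gradually growing number of stages.

    `play_stage(e)` is assumed to play one stage of the game following expert e
    and return the reward obtained by the agent."""
    N = [0] * num_experts      # phases in which each expert has been followed
    M = [0.0] * num_experts    # average reward per stage while followed
    S = [0] * num_experts      # stages in which each expert has been followed
    for i in range(1, num_phases + 1):
        if random.random() < 1.0 / i:                       # exploration phase
            e = random.randrange(num_experts)
        else:                                               # exploitation phase
            best = max(M)
            e = random.choice([k for k in range(num_experts) if M[k] == best])
        N[e] += 1
        for _ in range(N[e]):                               # follow e for Ne stages
            reward = play_stage(e)
            S[e] += 1
            M[e] += (reward - M[e]) / S[e]                  # update running average
    return M, N
```

Unlike the reward-weighted rule sketched earlier, rewards obtained while following other experts never enter Me, and each expert, once selected, is given progressively longer uninterrupted runs, so the opponent's reaction to that expert has time to show up in Me.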

55 3 A bound based on actual expert performance. When the SEA is employed, the average reward Me(i) that was actually achieved by each available expert e is being tracked. [sent-139, score-1.297]

56 It is therefore interesting to compare the average reward M (s) achieved by the SEA, with the averages achieved by the various experts. [sent-140, score-0.328]

57 The following theorem states that, in the long run, the SEA obtains almost surely at least as much as the actual average reward obtained by any available expert during the same play. [sent-141, score-0.858]

58 Note that the bound (2) is merely a statement about the average reward of the SEA in comparison to the average reward achieved by each expert, but nothing is claimed about the limits themselves. [sent-147, score-0.548]

59 A later section proposes an application of this bound in a case when an additional assumption about the experts’ and opponent’s strategies allows us to analyze convergence of the average reward for each expert. [sent-149, score-0.323]

60 Another interesting case occurs when one of the experts plays a maximin strategy; in this case, bound (2) ensures that the SEA achieves at least the maximin value of the game. [sent-150, score-0.502]

61 The same holds if one of the experts is a regret-minimizing experts algorithm, which is known to achieve at least the maximin value of the game. [sent-151, score-0.714]

62 Sketch of proof: Denote by V the random variable max_e lim inf_{i→∞} Me(i), and denote by Ē the expert that achieves that maximum (if there is more than one, let Ē be the one with the least index). [sent-154, score-0.696]

63 The proof of Theorem 3.1 relies on establishing that, for all ε > 0 and any expert e, Pr( lim_{i→∞} [Ne(i) · δ(Me(i) ≤ V − ε)] / i = 0 ) = 1. (3) [sent-157, score-0.551]

64 In words, if the average reward of an expert falls below V by a non-negligible amount, it must have been followed only a small fraction of the total number of phases. [sent-158, score-0.769]

65 There are three possible situations for any expert e: (a) When lim inf_{i→∞} Me(i) > V − ε, the inequality is satisfied trivially. [sent-159, score-0.617]

66 (b) When lim sup_{i→∞} Me(i) < V, there is a phase I such that for all i ≥ I, Me(i) < M_Ē(i), so that expert e is played only on exploration phases, and a large-deviations argument establishes that (3) holds. [sent-160, score-0.736]

67 Then, between phases Ik^0 and Ik, expert e is selected at least Ne(Ik^0)(ε − δ)/(6u) times. [sent-165, score-0.613]

68 Denoting by Ik^1, . . . , Ik^{Pk} the phases when expert e is selected between Ik^0 and Ik, we have Me(Ik^j) ≥ Me(Ik^{j−1}) (Ne(Ik^0) + j − 1)(Ne(Ik^0) + j) / ((Ne(Ik^0) + j)(Ne(Ik^0) + j + 1)). [sent-169, score-0.564]

69 A simple induction argument shows that, in order to have Me(Ik) ≤ V − ε/2 ≤ Me(Ik^0) − (ε − δ)/2, expert e must be selected a number of times Pk ≥ Ne(Ik^0)(ε − δ)/(6u). [sent-170, score-0.537]

70 For all large enough k, the phases Ik when expert e is selected are exclusively exploration phases. [sent-172, score-0.629]

71 It is easy to show that the average reward M(s) in stages s in the middle of phase i becomes arbitrarily close to the average reward M(i) at the end of that phase, as i goes to infinity, and the theorem follows. [sent-179, score-0.812]

72 4 The flexible opponent. In general, it is impossible for an experts algorithm to guarantee, against an unknown opponent, a reward close to what the best available expert would have achieved if it had been the only expert. [sent-180, score-1.385]

73 In the Matching Pennies (MP) game, the player and the adversary each have to choose either H (“Heads”) or T (“Tails”). [sent-183, score-0.37]

74 If the choices match, the player loses 1; otherwise, he wins 1. [sent-184, score-0.399]

75 A possible strategy for the adversary in the repeated MP game is: Adversary: Fix a positive integer s and a string σ^s ∈ {H, T}^s. [sent-185, score-0.404]

76 After the first s stages, if the sequence of choices of the player during the first s stages coincided with the string σ^s, then play T; otherwise, play the 50:50 mixed strategy. [sent-190, score-0.775]

77 Suppose each available expert e corresponds to a strategy of the form: Expert: Fix a string σe ∈ {H, T}^s. [sent-191, score-0.603]

78 Suppose an expert e∗ with σe∗ = σ^s is available. [sent-197, score-0.484]

79 Then, in order for an experts algorithm to achieve at least the reward of e∗, it needs to follow the string σ^s precisely during the first s stages. [sent-198, score-0.587]
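A sketch of the adversarial construction above. The summary does not say how the adversary plays during the first s stages, so the 50:50 play there is an assumption.

```python
import random

def mp_adversary(secret, player_history):
    """Repeated Matching Pennies adversary with a secret prefix `secret`
    (a string over 'H'/'T' of length s).  After the first s stages it plays T
    forever iff the player's first s choices matched the secret exactly;
    otherwise, and (by assumption) during the first s stages, it plays 50:50."""
    s = len(secret)
    if len(player_history) < s:
        return random.choice('HT')
    return 'T' if ''.join(player_history[:s]) == secret else random.choice('HT')
```

With 2^s candidate experts, only the one whose string equals the secret ever reaches the exploitable deterministic regime, so an experts algorithm cannot match e∗ without guessing the entire prefix in one shot; this is the inflexibility discussed next.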

80 In view of the repeated MP example, some assumption about the opponent must be made in order for the player to be able to learn how to play against that opponent. [sent-200, score-0.838]

81 The essence of the difficulty with the above strategy of the opponent is that it is not flexible — the player has only one chance to guess who the best expert is and thus cannot recover from a mistake. [sent-201, score-1.174]

82 Under the assumption of flexibility, the SEA achieves an average reward that is asymptotically as high as what the best expert could be expected to achieve. [sent-203, score-0.856]

83 (i) An opponent playing strategy π(s) is said to be flexible with respect to expert e (e = 1, . . . , r). [sent-206, score-0.892]

84 In words, the expected average reward during the s stages between stage s0 and stage s0 +s converges (as s tends to infinity) to a limit that does not depend on the history of the play prior to stage s0 . [sent-211, score-1.037]
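A small numerical illustration of this definition, using the Tit-for-Tat opponent and the Prisoner's Dilemma payoffs assumed in the earlier sketch: for the "always cooperate" expert the long-window average converges to the same limit (3) no matter what state the prior history left the opponent in.

```python
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}  # assumed PD values

def window_average_vs_tit_for_tat(expert_move, opp_first_move, window=10_000):
    """Average reward of a fixed-action expert over a long window against
    Tit-for-Tat, where `opp_first_move` encodes the state left by the prior
    history (the opponent's pending reply).  Flexibility says the limit does
    not depend on that prior history."""
    opp_move, total = opp_first_move, 0
    for _ in range(window):
        total += PAYOFF[(expert_move, opp_move)]
        opp_move = expert_move            # Tit-for-Tat echoes the expert's move
    return total / window

# Both values are essentially 3: the history before stage s0 is forgotten.
print(window_average_vs_tit_for_tat('C', 'C'), window_average_vs_tit_for_tat('C', 'D'))
```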

85 Thus, at any stage of the game, the automaton picks an action from a probability distribution associated with its current state and transitions into a new state, according to a probability distribution which depends on the outcome of the stage game. [sent-222, score-0.405]

86 If both the opponent and an expert play PASs, then a Markov chain is induced over the set of pairs of the respective internal states. [sent-223, score-0.888]
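The summary does not spell out the definition of a probabilistic automaton strategy (PAS), so the sketch below is one natural reading: finitely many internal states, a state-dependent action distribution, and a stochastic transition driven by the stage outcome. Tit-for-Tat is then the special two-state, deterministic instance used in the examples.

```python
import random

class ProbabilisticAutomatonStrategy:
    """One reading of a PAS: an action distribution per internal state, and a
    (possibly stochastic) state transition on the observed stage outcome."""
    def __init__(self, action_dist, transition, start_state):
        self.action_dist = action_dist   # state -> {action: probability}
        self.transition = transition     # (state, outcome) -> {next_state: probability}
        self.state = start_state

    def act(self):
        dist = self.action_dist[self.state]
        return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

    def observe(self, outcome):          # outcome = (own_move, other_player_move)
        dist = self.transition[(self.state, outcome)]
        self.state = random.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Tit-for-Tat as a two-state automaton: the state stores the other player's last move.
tit_for_tat = ProbabilisticAutomatonStrategy(
    action_dist={'C': {'C': 1.0}, 'D': {'D': 1.0}},
    transition={(s, (own, other)): {other: 1.0}
                for s in 'CD' for own in 'CD' for other in 'CD'},
    start_state='C',
)
```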

87 For instance, if the frequency of play of action j by the opponent is relevant, the player might condition his choice at stage s + 1 on the quantities τj = Σ_{s'=1}^{s} β^{s−s'} δ_{j, j_{s'}}, where β < 1 and δ is the Kronecker delta. [sent-231, score-0.932]

88 In this case, only actions j_{s'} at stages s' that are relatively recent have a significant impact on τj. [sent-232, score-0.332]
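A direct transcription of the discounted frequency above, assuming the opponent's past actions are given as a list and actions are compared by equality.

```python
def discounted_frequency(j, opponent_actions, beta=0.9):
    """tau_j = sum over s' = 1..s of beta**(s - s') * [j_{s'} == j],
    where s = len(opponent_actions) is the current stage."""
    s = len(opponent_actions)
    return sum(beta ** (s - t) * (a == j)
               for t, a in enumerate(opponent_actions, start=1))

# Recent plays of 'L' dominate the score when beta < 1.
print(discounted_frequency('L', ['R', 'R', 'L', 'L', 'L']))
```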

89 5 A bound based on expected expert performance. In this section we show that if the opponent is “flexible” with respect to the available experts, then the SEA achieves almost surely an average payoff that is asymptotically as large as what the best expert could achieve against the same opponent. [sent-234, score-1.577]

90 If an opponent π is flexible with respect to the experts 1, . . . , r, then the average payoff up to stage s, M(s), satisfies [sent-237, score-0.617]

91 Pr( lim inf_{s→∞} M(s) ≥ max_e µe ) = 1. [sent-240, score-0.397]

92 Flexibility comes into play as a way of ensuring that the value of following any given expert is well-defined, and can eventually be estimated as long as the SEA follows that expert sufficiently many times. [sent-245, score-1.086]

93 In other words, flexibility ensures that there is a best expert to be learned, and that learning can effectively occur because actions taken by other experts, which could affect the behavior of the opponent, are eventually forgotten by the latter. [sent-246, score-0.617]

94 The proof relies on the following lemma, which shows that, under the flexibility assumption, the average reward achieved by each expert is asymptotically almost surely the same as the reward that would have been achieved by the same expert, had he been the only available expert. [sent-248, score-1.109]

95 If the opponent is flexible with respect to expert e, then with probability one, limi→∞ Me (i) = µe . [sent-251, score-0.791]

96 It follows that if the opponent is flexible with respect to expert e, then for some ν > 0, as j tends to infinity, E[(Me(Ij) − µe)^4] = O(j^{−1−ν}), which suffices for (6). [sent-261, score-0.791]

97 Consider playing the repeated PD game against an opponent who plays Tit-for-Tat, and suppose there are only two experts: “Always defect” (AD) and “Always cooperate” (AC). [sent-264, score-0.613]

98 Thus, AC induces cooperation in every stage and yields a payoff higher than AD, which induces defection in every stage of the game except the first one. [sent-265, score-0.732]

99 It is easy to verify that Tit-for-Tat is flexible with respect to the experts AC and AD. [sent-266, score-0.31]

100 By contrast, as mentioned in the introduction, in order to minimize regret, the standard experts algorithm must play D in almost every stage of the game, and therefore achieves a lower payoff. [sent-269, score-0.647]
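Finally, a self-contained check of this claim under the same assumptions as the earlier sketches (standard PD payoffs, phase lengths equal to the number of times an expert has been selected): against Tit-for-Tat, the phase-based rule ends up following AC, whereas any stage-by-stage reward comparison would favor AD.

```python
import random

PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}  # assumed PD values
EXPERTS = {'AC': 'C', 'AD': 'D'}          # "always cooperate" and "always defect"

def sea_vs_tit_for_tat(num_phases=2000, seed=0):
    rng = random.Random(seed)
    names = list(EXPERTS)
    N = {e: 0 for e in names}             # phases followed
    M = {e: 0.0 for e in names}           # average reward per stage while followed
    S = {e: 0 for e in names}             # stages followed
    opp_move = 'C'                        # Tit-for-Tat cooperates first
    for i in range(1, num_phases + 1):
        if rng.random() < 1.0 / i:        # exploration phase
            e = rng.choice(names)
        else:                             # exploitation phase: a best expert so far
            best = max(M.values())
            e = rng.choice([k for k in names if M[k] == best])
        N[e] += 1
        for _ in range(N[e]):             # follow e for N[e] stages
            reward = PAYOFF[(EXPERTS[e], opp_move)]
            opp_move = EXPERTS[e]         # Tit-for-Tat echoes the agent's move
            S[e] += 1
            M[e] += (reward - M[e]) / S[e]
    return M

print(sea_vs_tit_for_tat())               # M['AC'] is near 3, M['AD'] near 1
```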


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('expert', 0.484), ('player', 0.325), ('experts', 0.31), ('opponent', 0.307), ('ik', 0.284), ('reward', 0.198), ('game', 0.18), ('stage', 0.162), ('regret', 0.159), ('sea', 0.145), ('stages', 0.144), ('actions', 0.112), ('play', 0.097), ('repeated', 0.083), ('phases', 0.08), ('payoff', 0.08), ('hs', 0.079), ('defection', 0.076), ('js', 0.076), ('prisoner', 0.076), ('choices', 0.074), ('phase', 0.073), ('history', 0.068), ('lim', 0.067), ('mr', 0.066), ('cooperate', 0.061), ('dilemma', 0.061), ('exible', 0.059), ('strategy', 0.058), ('achieves', 0.058), ('exibility', 0.056), ('agent', 0.051), ('played', 0.05), ('pk', 0.048), ('asymptotically', 0.046), ('defect', 0.046), ('defecting', 0.046), ('flexibility', 0.046), ('maximin', 0.046), ('adversary', 0.045), ('row', 0.045), ('inf', 0.044), ('average', 0.044), ('followed', 0.043), ('playing', 0.043), ('achieved', 0.043), ('pd', 0.042), ('action', 0.041), ('games', 0.04), ('cooperative', 0.04), ('automaton', 0.04), ('accumulated', 0.038), ('string', 0.038), ('theorem', 0.038), ('exploration', 0.037), ('ij', 0.037), ('ac', 0.036), ('ne', 0.036), ('column', 0.035), ('strategies', 0.034), ('past', 0.034), ('mp', 0.034), ('players', 0.032), ('almaden', 0.03), ('jose', 0.03), ('limi', 0.03), ('megiddo', 0.03), ('pucci', 0.03), ('rationality', 0.03), ('supi', 0.03), ('surely', 0.03), ('pr', 0.03), ('selected', 0.028), ('holds', 0.027), ('cooperation', 0.026), ('assumption', 0.026), ('lemma', 0.026), ('environment', 0.025), ('argument', 0.025), ('sketch', 0.024), ('strategic', 0.024), ('ae', 0.024), ('registers', 0.024), ('nitely', 0.023), ('available', 0.023), ('nity', 0.023), ('induces', 0.023), ('denote', 0.022), ('inequality', 0.022), ('bound', 0.021), ('nr', 0.021), ('advice', 0.021), ('eventually', 0.021), ('rewards', 0.021), ('induce', 0.021), ('least', 0.021), ('conclude', 0.021), ('states', 0.02), ('algorithm', 0.02), ('economic', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo


2 0.30754977 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

3 0.27138132 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

4 0.22008073 19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unkown-abstract

5 0.1557364 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

6 0.12352327 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

7 0.11980277 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

8 0.10366175 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

9 0.068651877 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

10 0.065498404 68 nips-2003-Eye Movements for Reward Maximization

11 0.060913701 55 nips-2003-Distributed Optimization in Adaptive Networks

12 0.059290163 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

13 0.059205994 51 nips-2003-Design of Experiments via Information Theory

14 0.053828351 78 nips-2003-Gaussian Processes in Reinforcement Learning

15 0.05180927 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation

16 0.048314393 119 nips-2003-Local Phase Coherence and the Perception of Blur

17 0.045024835 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

18 0.042097777 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

19 0.039926797 87 nips-2003-Identifying Structure across Pre-partitioned Data

20 0.038210157 62 nips-2003-Envelope-based Planning in Relational MDPs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.147), (1, 0.22), (2, -0.096), (3, -0.073), (4, 0.054), (5, 0.047), (6, -0.034), (7, 0.098), (8, -0.325), (9, -0.1), (10, 0.061), (11, 0.21), (12, -0.161), (13, 0.05), (14, -0.037), (15, -0.158), (16, -0.049), (17, 0.146), (18, 0.176), (19, -0.034), (20, 0.022), (21, -0.115), (22, -0.036), (23, -0.17), (24, 0.019), (25, -0.172), (26, -0.131), (27, -0.139), (28, -0.086), (29, -0.025), (30, -0.107), (31, -0.045), (32, 0.039), (33, -0.1), (34, 0.023), (35, 0.02), (36, 0.065), (37, 0.088), (38, -0.133), (39, -0.004), (40, 0.082), (41, -0.019), (42, -0.055), (43, 0.086), (44, 0.089), (45, -0.004), (46, -0.016), (47, 0.005), (48, -0.011), (49, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98406982 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo


2 0.81965506 19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unkown-abstract

3 0.70933813 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

4 0.64772701 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

5 0.4569428 44 nips-2003-Can We Learn to Beat the Best Stock

Author: Allan Borodin, Ran El-Yaniv, Vincent Gogan

Abstract: A novel algorithm for actively trading stocks is presented. While traditional universal algorithms (and technical trading heuristics) attempt to predict winners or trends, our approach relies on predictable statistical relations between all pairs of stocks in the market. Our empirical results on historical markets provide strong evidence that this type of technical trading can “beat the market” and moreover, can beat the best stock in the market. In doing so we utilize a new idea for smoothing critical parameters in the context of expert learning. 1 Introduction: The Portfolio Selection Problem The portfolio selection (PS) problem is a challenging problem for machine learning, online algorithms and, of course, computational finance. As is well known (e.g. see Lugosi [1]) sequence prediction under the log loss measure can be viewed as a special case of portfolio selection, and perhaps more surprisingly, from a certain worst case minimax criterion, portfolio selection is not essentially any harder (than prediction) as shown in [2] (see also [1], Thm. 20 & 21). But there seems to be a qualitative difference between the practical utility of “universal” sequence prediction and universal portfolio selection. Simply stated, universal sequence prediction algorithms under various probabilistic and worst-case models work very well in practice whereas the known universal portfolio selection algorithms do not seem to provide any substantial benefit over a naive investment strategy (see Sec. 4). A major pragmatic question is whether or not a computer program can consistently outperform the market. A closer inspection of the interesting ideas developed in information theory and online learning suggests that a promising approach is to exploit the natural volatility in the market and in particular to benefit from simple and rather persistent statistical relations between stocks rather than to try to predict stock prices or “winners”. We present a non-universal portfolio selection algorithm1 , which does not try to predict winners. The motivation behind our algorithm is the rationale behind constant rebalancing algorithms and the worst case study of universal trading introduced by Cover [3]. Not only does our proposed algorithm substantially “beat the market” on historical markets, it also beats the best stock. So why are we presenting this algorithm and not just simply making money? There are, of course some caveats and obstacles to utilizing the algorithm. But for large investors the possibility of a goose laying silver (if not golden) eggs is not impossible. 1 Any PS algorithm can be modified to be universal by investing any fixed fraction of the initial wealth in a universal algorithm. Assume a market with m stocks. Let vt = (vt (1), . . . , vt (m)) be the closing prices of the m stocks for the tth day, where vt (j) is the price of the jth stock. It is convenient to work with relative prices xt (j) = vt (j)/vt−1 (j) so that an investment of $d in the jth stock just before the tth period yields dxt (j) dollars. We let xt = (xt (1), . . . , xt (m)) denote the market vector of relative prices corresponding to the tth day. A portfolio b is an allocation of wealth in the stocks, specified by the proportions b = (b(1), . . . , b(m)) of current dollar wealth invested in each of the stocks, where b(j) ≥ 0 and j b(j) = 1. The daily return of a portfolio b w.r.t. a market vector x is b · x = j b(j)x(j) and the (compound) total return, retX (b1 , . . . , bn ), of a sequence of portfolios b1 , . . . 
, bn w.r.t. a market sequence n X = x1 , . . . , xn is t=1 bt · xt . A portfolio selection algorithm is any deterministic or randomized rule for specifying a sequence of portfolios. The simplest strategy is to “buy-and-hold” stocks using some portfolio b. We denote this strategy by BAHb and let U-BAH denote the uniform buy-and-hold when b = (1/m, . . . , 1/m). We say that a portfolio selection algorithm “beats the market” when it outperforms U-BAH on a given market sequence although in practice “the market” can be represented by some non-uniform BAH (e.g. DJIA). Buy-and-hold strategies rely on the tendency of successful markets to grow. Much of modern portfolio theory focuses on how to choose a good b for the buy-and-hold strategy. The seminal ideas of Markowitz in [4] yield an algorithmic procedure for choosing the weights of the portfolio b so as to minimize the variance for any feasible expected return. This variance minimization is possible by placing appropriate larger weights on subsets of anti-correlated stocks, an idea which we shall also utilize. We denote the optimal in hindsight buy-and-hold strategy (i.e. invest only in the best stock) by BAH∗ . An alternative approach to the static buy-and-hold is to dynamically change the portfolio during the trading period. This approach is often called “active trading”. One example of active trading is constant rebalancing; namely, fix a portfolio b and (re)invest your dollars each day according to b. We denote this constant rebalancing strategy by CBALb and let CBAL∗ denote the optimal (in hindsight) CBAL. A constant rebalancing strategy can often take advantage of market fluctuations to achieve a return significantly greater than that of BAH∗ . CBAL∗ is always at least as good as the best stock BAH∗ and in some real market sequences a constant rebalancing strategy will take advantage of market fluctuations and significantly outperform the best stock (see Table 1). For now, consider Cover and Gluss’ [5] classic (but contrived) example of a market consisting of cash and one stock and 1 1 the market sequence of price relatives 1/2 , 1 , 1/2 , 1 , . . . Now consider the CBALb 2 2 3 1 1 1 11 with b = ( 2 , 2 ). On each odd day the daily return of CBALb is 2 1 + 2 2 = 4 and on each even day, it is 3/2. The total return over n days is therefore (9/8)n/2 , illustrating how a constant rebalancing strategy can yield exponential returns in a “no-growth market”. Under the assumption that the daily market vectors are observations of identically and independently distributed (i.i.d) random variables, it is shown in [6] that CBAL∗ performs at least as good (in the sense of expected total return) as the best online portfolio selection algorithm. However, many studies (see e.g. [7]) argue that stock price sequences do have long term memory and are not i.i.d. A non-traditional objective (in computational finance) is to develop online trading strategies that are in some sense always guaranteed to perform well. Within a line of research pioneered by Cover [5, 3, 2] one attempts to design portfolio selection algorithms that can provably do well (in terms of their total return) with respect to some online or offline benchmark algorithms. Two natural online benchmark algorithms are the uniform buy and hold U-BAH, and the uniform constant rebalancing strategy U-CBAL, which is CBALb with 1 1 b = ( m , . . . , m ). A natural offline benchmark is BAH∗ and a more challenging offline benchmark is CBAL∗ . 
Cover and Ordentlich’s Universal Portfolios algorithm [3, 2], denoted here by UNIVERSAL, was proven to be universal against CBAL∗ , in the sense that for every market sequence X of m stocks over n days, it guarantees a sub-exponential (indeed polynomial) ratio in n, retX (CBAL∗ )/retX (UNIVERSAL) ≤ O n m−1 2 (1) From a theoretical perspective this is surprising as the ratio is a polynomial in n (for fixed m) whereas CBAL∗ is capable of exponential returns. From a practical perspective, while the m−1 ratio n 2 is not very useful, the motivation that underlies the potential of CBAL algorithms is useful! We follow this motivation and develop a new algorithm which we call ANTICOR. By attempting to systematically follow the constant rebalancing philosophy, ANTICOR is capable of some extraordinary performance in the absence of transaction costs, or even with very small transaction costs. 2 Trying to Learn the Winners The most direct approach to expert learning and portfolio selection is a “(reward based) weighted average prediction” algorithm which adaptively computes a weighted average of experts by gradually increasing (by some multiplicative or additive update rule) the relative weights of the more successful experts. For example, in the context of the PS problem consider the “exponentiated gradient” EG(η) algorithm proposed by Helmbold et al. [8]. The EG(η) algorithm computes the next portfolio to be bt+1 (j) = bt (j) exp {ηxt (j)/(bt · xt )} m j=1 bt (j) exp {ηxt (j)/(bt · xt )} where η is a “learning rate” parameter. EG was designed to greedily choose the best portfolio for yesterday’s market xt while at the same time paying a penalty from moving far from yesterday’s portfolio. For a universal bound on EG, Helmbold et al. set η = 2xmin 2(log m)/n where xmin is a lower bound on any price relative.2 It is easy to see that as n increases, η decreases to 0 so that we can think of η as being very small in order to achieve universality. When η = 0, the algorithm EG(η) degenerates to the uniform CBAL which is not a universal algorithm. It is also the case that if each day the price relatives for all stocks were identical, then EG (as well as other PS algorithms) will converge to the uniform CBAL. Combining a small learning rate with a “reasonably balanced” market we expect the performance of EG to be similar to that of the uniform CBAL and this is confirmed by our experiments (see Table1).3 Cover’s universal algorithms adaptively learn each day’s portfolio by increasing the weights of successful CBALs. The update rule for these universal algorithms is bt+1 = b · rett (CBALb )dµ(b) , rett (CBALb )dµ(b) where µ(·) is some prior distribution over portfolios. Thus, the weight of a possible portfolio is proportional to its total return rett (b) thus far times its prior. The particular universal algorithm we consider in our experiments uses the Dirichlet prior (with parameters 1 1 ( 2 , . . . , 2 )) [2]. Within a constant factor, this algorithm attains the optimal ratio (1) with respect to CBAL∗ .4 The algorithm is equivalent to a particular static distribution over the 2 Helmbold et al. show how to eliminate the need to know xmin and n. While EG can be made universal, its performance ratio is only sub-exponential (and not polynomial) in n. 3 Following Helmbold et al. we fix η = 0.01 in our experiments. 4 Experimentally (on our datasets) there is a negligible difference between the uniform universal algorithm in [3] and the above Dirichlet universal algorithm. class of all CBALs. 
This equivalence helps to demystify the universality result and also shows that the algorithm can never outperform CBAL∗ . A different type of “winner learning” algorithm can be obtained from any sequence prediction strategy. For each stock, a (soft) sequence prediction algorithm provides a probability p(j) that the next symbol will be j ∈ {1, . . . , m}. We view this as a prediction that stock j will have the best price relative for the next day and set bt+1 (j) = pj . We consider predictions made using the prediction component of the well-known Lempel-Ziv (LZ) lossless compression algorithm [9]. This prediction component is nicely described in Langdon [10] and in Feder [11]. As a prediction algorithm, LZ is provably powerful in various senses. First it can be shown that it is asymptotically optimal with respect to any stationary and ergodic finite order Markov source (Rissanen [12]). Moreover, Feder shows that LZ is also universal in a worst case sense with respect to the (offline) benchmark class of all finite state prediction machines. To summarize, the common approach to devising PS algorithms has been to attempt and learn winners using winner learning schemes. 3 The Anticor Algorithm We propose a different approach, motivated by the CBAL “philosophy”. How can we interpret the success of the uniform CBAL on the Cover and Gluss example of Sec. 1? Clearly, the uniform CBAL here is taking advantage of price fluctuation by constantly transferring wealth from the high performing stock to the anti-correlated low performing stock. Even in a less contrived market, we should be able to take advantage when a stock is currently outperforming other stocks especially if this strong performance is anti-correlated with the performance of these other stocks. Our ANTICORw algorithm considers a short market history (consisting of two consecutive “windows”, each of w trading days) so as to model statistical relations between each pair of stocks. Let LX1 = log(xt−2w+1 ), . . . , log(xt−w )T and LX2 = log(xt−w+1 ), . . . , log(xt )T , where log(xk ) denotes (log(xk (1)), . . . , log(xk (m))). Thus, LX1 and LX2 are the two vector sequences (equivalently, two w × m matrices) constructed by taking the logarithm over the market subsequences corresponding to the time windows [t − 2w + 1, t − w] and [t − w + 1, t], respectively. We denote the jth column of LXk by LXk (j). Let µk = (µk (1), . . . , µk (m)), be the vectors of averages of columns of LXk (that is, µk (j) = E{LXk (j)}). Similarly, let σk , be the vector of standard deviations of columns of LXk . The cross-correlation matrix (and its normalization) between column vectors in LX1 and LX2 are defined as: Mcov (i, j) = (LX1 (i) − µ1 (i))T (LX2 (j) − µ2 (j)); Mcov (i,j) σ1 (i)σ2 (j) σ1 (i), σ2 (j) = 0; 0 otherwise. Mcor (i, j) ∈ [−1, 1] measures the correlation between log-relative prices of stock i over the first window and stock j over the second window. For each pair of stocks i and j we compute claimi→j , the extent to which we want to shift our investment from stock i to stock j. Namely, there is such a claim iff µ2 (i) > µ2 (j) and Mcor (i, j) > 0 in which case claimi→j = Mcor (i, j) + A(i) + A(j) where A(h) = |Mcor (h, h)| if Mcor (h, h) < 0, else 0. Following our interpretation for the success of a CBAL, Mcor (i, j) > 0 is used to predict that stocks i and j will be correlated in consecutive windows (i.e. 
the current window and the next window based on the evidence for the last two windows) and Mcor (h, h) < 0 predicts that stock h will be anti-correlated with itself over consec˜ utive windows. Finally, bt+1 (i) = bt (i) + j=i [transferj→i − transferi→j ] where ˜ t (i) · claimi→j / ˜ transferi→j = b j claimi→j and bt is the resulting portfolio just after market closing (on day t). Mcor (i, j) SP500: Anticor vs. window size NYSE: Anticor vs. window size w BAH(Anticor ) w Anticor 12 8 w Best Stock Market Return 10 Total Return Total Return (log−scale) 10 Anticorw 5 10 BAH(Anticorw) Anticorw Best Stock Market Best Stock 8 Anticorw 6 4 2 10 2 1 10 Best Stock 1 0 10 2 5 10 15 20 25 5 30 10 15 20 25 30 Window Size (w) Window Size (w) Figure 1: ANTICORw ’s total return (per $1 investment) vs. window size 2 ≤ w ≤ 30 for NYSE (left) and SP500 (right). Our ANTICORw algorithm has one critical parameter, the window size w. In Figure 1 we depict the total return of ANTICORw on two historical datasets as a function of the window size w = 2, . . . , 30. As we might expect, the performance of ANTICORw depends significantly on the window size. However, for all w, ANTICORw beats the uniform market and, moreover, it beats the best stock using most window sizes. Of course, in online trading we cannot choose w in hindsight. Viewing the ANTICORw algorithms as experts, we can try to learn the best expert. But the windows, like individual stocks, induce a rather volatile set of experts and standard expert combination algorithms [13] tend to fail. Alternatively, we can adaptively learn and invest in some weighted average of all ANTICORw algorithms with w less than some maximum W . The simplest case is a uniform investment on all the windows; that is, a uniform buy-and-hold investment on the algorithms ANTICORw , w ∈ [2, W ], denoted by BAHW (ANTICOR). Figure 2 (left) graphs the total return of BAHW (ANTICOR) as a function of W for all values of 2 ≤ W ≤ 50 with respect to the NYSE dataset (see details below). Similar graphs for the other datasets we consider appear qualitatively the same and the choice W = 30 is clearly not optimal. However, for all W ≥ 3, BAHW (ANTICOR) beats the best stock in all our experiments. DJIA: Dec 14, 2002 − Jan 14, 2003 NYSE: Total Return vs. Max Window 10 1.1 BAHW(Anticor) 10 Best Stock MArket 4 10 3 10 Best Stock 2.8 Anticor2 2.2 2.6 1 BAHW(Anticor) 5 Total Return Total Return (log−scale) 10 6 Anticor1 Stocks 7 2 0.9 2.4 1.8 0.8 2.2 1.6 0.7 2 1.4 0.6 2 10 1.8 1.2 0.5 1 10 1.6 1 0.4 0 10 5 10 15 20 25 30 35 Maximal Window size (W) 40 45 50 5 10 15 20 25 Days 5 10 15 20 25 Days 5 10 15 20 25 Days Figure 2: Left: BAHW (ANTICOR)’s total return (per $1 investment) as a function of the maximal window W . Right: Cumulative returns for last month of the DJIA dataset: stocks (left panel); ANTICORw algorithms trading the stocks (denoted ANTICOR1 , middle panel); ANTICORw algorithms trading the ANTICOR algorithms (right panel). Since we now consider the various algorithms as stocks (whose prices are determined by the cumulative returns of the algorithms), we are back to our original portfolio selection problem and if the ANTICOR algorithm performs well on stocks it may also perform well on algorithms. We thus consider active investment in the various ANTICORw algorithms using ANTICOR. We again consider all windows w ≤ W . Of course, we can continue to compound the algorithm any number of times. Here we compound twice and then use a buy-and-hold investment. 
The resulting algorithm is denoted BAHW (ANTICOR(ANTICOR)). One impact of this compounding, depicted in Figure 2 (right), is to smooth out the anti-correlations exhibited in the stocks. It is evident that after compounding twice the returns become almost completely correlated thus diminishing the possibility that additional compounding will substantially help.5 This idea for eliminating critical parameters may be applicable in other learning applications. The challenge is to understand the conditions and applications in which the process of compounding algorithms will have this smoothing effect! 4 Experimental Results We present an experimental study of the the ANTICOR algorithm and the three online learning algorithms described in Sec. 2. We focus on BAH30 (ANTICOR), abbreviated by ANTI1 and BAH30 (ANTICOR(ANTICOR)), abbreviated by ANTI2 . Four historical datasets are used. The first NYSE dataset, is the one used in [3, 2, 8, 14]. This dataset contains 5651 daily prices for 36 stocks in the New York Stock Exchange (NYSE) for the twenty two year period July 3rd , 1962 to Dec 31st , 1984. The second TSE dataset consists of 88 stocks from the Toronto Stock Exchange (TSE), for the five year period Jan 4th , 1994 to Dec 31st , 1998. The third dataset consists of the 25 stocks from SP500 which (as of Apr. 2003) had the largest market capitalization. This set spans 1276 trading days for the period Jan 2nd , 1998 to Jan 31st , 2003. The fourth dataset consists of the thirty stocks composing the Dow Jones Industrial Average (DJIA) for the two year period (507 days) from Jan 14th , 2001 to Jan 14th , 2003.6 These four datasets are quite different in nature (the market returns for these datasets appear in the first row of Table 1). While every stock in the NYSE increased in value, 32 of the 88 stocks in the TSE lost money, 7 of the 25 stocks in the SP500 lost money and 25 of the 30 stocks in the “negative market” DJIA lost money. All these sets include only highly liquid stocks with huge market capitalizations. In order to maximize the utility of these datasets and yet present rather different markets, we also ran each market in reverse. This is simply done by reversing the order and inverting the relative prices. The reverse datasets are denoted by a ‘-1’ superscript. Some of the reverse markets are particularly challenging. For example, all of the NYSE−1 stocks are going down. Note that the forward and reverse markets (i.e. U-BAH) for the TSE are both increasing but that the TSE−1 is also a challenging market since so many stocks (56 of 88) are declining. Table 1 reports on the total returns of the various algorithms for all eight datasets. We see that prediction algorithms such as LZ can do quite well but the more aggressive ANTI1 and 2 ANTI have excellent and sometimes fantastic returns. Note that these active strategies beat the best stock and even CBAL∗ in all markets with the exception of the TSE−1 in which they still significantly outperform the market. The reader may well be distrustful of what appears to be such unbelievable returns for ANTI1 and ANTI2 especially when applied to the NYSE dataset. 
However, recall that the NYSE dataset consists of n = 5651 trading days and the y such that y n = the total NYSE return is approximately 1.0029511 for ANTI1 (respectively, 1.0074539 for ANTI2 ); that is, the average daily increase is less than .3% 5 This smoothing effect also allows for the use of simple prediction algorithms such as “expert advice” algorithms [13], which can now better predict a good window size. We have not explored this direction. 6 The four datasets, including their sources and individual stock compositions can be downloaded from http://www.cs.technion.ac.il/∼rani/portfolios. (respectively, .75%). Thus a transaction cost of 1% can present a significant challenge to such active trading strategies (see also Sec. 5). We observe that UNIVERSAL and EG have no substantial advantage over U-CBAL. Some previous expositions of these algorithms highlighted particular combinations of stocks where the returns significantly outperformed UNIVERSAL and the best stock. But the same can be said for U-CBAL. Algorithm M ARKET (U-BAH) B EST S TOCK CBAL∗ U-CBAL ANTI1 ANTI2 LZ EG UNIVERSAL NYSE 14.49 54.14 250.59 27.07 17,059,811.56 238,820,058.10 79.78 27.08 26.99 TSE 1.61 6.27 6.77 1.59 26.77 39.07 1.32 1.59 1.59 SP500 1.34 3.77 4.06 1.64 5.56 5.88 1.67 1.64 1.62 DJIA 0.76 1.18 1.23 0.81 1.59 2.28 0.89 0.81 0.80 NYSE−1 0.11 0.32 2.86 0.22 246.22 1383.78 5.41 0.22 0.22 TSE−1 1.67 37.64 58.61 1.18 7.12 7.27 4.80 1.19 1.19 SP500−1 0.87 1.65 1.91 1.09 6.61 9.69 1.20 1.09 1.07 DJIA−1 1.43 2.77 2.97 1.53 3.67 4.60 1.83 1.53 1.53 Table 1: Monetary returns in dollars (per $1 investment) of various algorithms for four different datasets and their reversed versions. The winner and runner-up for each market appear in boldface. All figures are truncated to two decimals. 5 Concluding Remarks When handling a portfolio of m stocks our algorithm may perform up to m transactions per day. A major concern is therefore the commissions it will incur. Within the proportional commission model (see e.g. [14] and [15], Sec. 14.5.4) there exists a fraction γ ∈ (0, 1) such that an investor pays at a rate of γ/2 for each buy and for each sell. Therefore, the return of a sequence b1 , . . . , bn of portfolios with re˜ spect to a market sequence x1 , . . . , xn is t bt · xt (1 − j γ |bt (j) − bt (j)|) , where 2 1 ˜ (bt (1)xt (1), . . . , bt (m)xt (m)). Our investment algorithm in its simplest form bt = bt ·xt can tolerate very small proportional commission rates and still beat the best stock.7 We note that Blum and Kalai [14] showed that the performance guarantee of UNIVERSAL still holds (and gracefully degrades) in the case of proportional commissions. Many current online brokers only charge a small per share commission rate. A related problem that one must face when actually trading is the difference between bid and ask prices. These bid-ask spreads (and the availability of stocks for both buying and selling) are typically functions of stock liquidity and are typically smaller for large market capitalization stocks. We consider here only very large market cap stocks. As a final caveat, we note that we assume that any one portfolio selection algorithm has no impact on the market! But just like any goose laying golden eggs, widespread use will soon lead to the end of the goose; that is, the market will quickly react. Any report of abnormal returns using historical markets should be suspected of “data snooping”. 
In particular, when a dataset is excessively mined by testing many strategies there is a substantial chance that one of the strategies will be successful by simple overfitting. Another data snooping hazard is stock selection. For example, the 36 stocks selected for the NYSE dataset were all known to have survived for 22 years. Our ANTICOR algorithms were fully developed using only the NYSE and TSE datasets. The DJIA and SP500 sets were obtained (from public domain sources) after the algorithms were fixed. Finally, our algorithm has one parameter (the maximal window size W ). Our experiments indicate that the algorithm’s performance is robust with respect to W (see Figure 2). 7 For example, with γ = 0.1% we can typically beat the best stock. These results will be presented in the full paper. A number of well-respected works report on statistically robust “abnormal” returns for simple “technical analysis” heuristics, which slightly beat the market. For example, the landmark study of Brock et al. [16] apply 26 simple trading heuristics to the DJIA index from 1897 to 1986 and provide strong support for technical analysis heuristics. While consistently beating the market is considered a great (if not impossible) challenge, our approach to portfolio selection indicates that beating the best stock is an achievable goal. What is missing at this point of time is an analytical model which better explains why our active trading strategies are so successful. In this regard, we are investigating various “statistical adversary” models along the lines suggested by [17, 18]. Namely, we would like to show that an algorithm performs well (relative to some benchmark) for any market sequence that satisfies certain constraints on its empirical statistics. References [1] G. Lugosi. Lectures on prediction URL:http://www.econ.upf.es/∼lugosi/ihp.ps, 2001. of individual sequences. [2] T.M. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348–363, 1996. [3] T.M. Cover. Universal portfolios. Mathematical Finance, 1:1–29, 1991. [4] H. Markowitz. Portfolio Selection: Efficient Diversification of Investments. John Wiley and Sons, 1959. [5] T.M. Cover and D.H. Gluss. Empirical bayes stock market portfolios. Advances in Applied Mathematics, 7:170–181, 1986. [6] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991. [7] A. Lo and C. MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999. [8] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. Portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998. [9] J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, 24:530–536, 1978. [10] G.G. Langdon. A note on the lempel-ziv model for compressing individual sequences. IEEE Transactions on Information Theory, 29:284–287, 1983. [11] M. Feder. Gambling using a finite state machine. IEEE Transactions on Information Theory, 37:1459–1465, 1991. [12] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29:656–664, 1983. [13] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997. [14] A. Blum and A. Kalai. Universal portfolios with and without transaction costs. Machine Learning, 30(1):23–30, 1998. [15] A. Borodin and R. El-Yaniv. 
[16] L. Brock, J. Lakonishok, and B. LeBaron. Simple technical trading rules and the stochastic properties of stock returns. Journal of Finance, 47:1731–1764, 1992.
[17] P. Raghavan. A statistical adversary for on-line algorithms. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 7:79–83, 1992.
[18] A. Chou, J.R. Cooperstock, R. El-Yaniv, M. Klugerman, and T. Leighton. The statistical adversary allows optimal money-making trading strategies. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995.

6 0.42598161 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

7 0.29862374 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

8 0.29798311 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

9 0.25427219 68 nips-2003-Eye Movements for Reward Maximization

10 0.22724089 64 nips-2003-Estimating Internal Variables and Parameters of a Learning Agent by a Particle Filter

11 0.2150075 55 nips-2003-Distributed Optimization in Adaptive Networks

12 0.20028409 75 nips-2003-From Algorithmic to Subjective Randomness

13 0.1964256 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

14 0.17158568 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

15 0.17093624 123 nips-2003-Markov Models for Automated ECG Interval Analysis

16 0.16969982 119 nips-2003-Local Phase Coherence and the Perception of Blur

17 0.16844717 14 nips-2003-A Nonlinear Predictive State Representation

18 0.15960959 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

19 0.15578647 32 nips-2003-Approximate Expectation Maximization

20 0.15425527 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (11, 0.018), (29, 0.013), (30, 0.025), (35, 0.043), (53, 0.055), (69, 0.023), (71, 0.054), (74, 0.353), (76, 0.033), (85, 0.062), (91, 0.16), (99, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84160846 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

2 0.82770622 14 nips-2003-A Nonlinear Predictive State Representation

Author: Matthew R. Rudary, Satinder P. Singh

Abstract: Predictive state representations (PSRs) use predictions of a set of tests to represent the state of controlled dynamical systems. One reason why this representation is exciting as an alternative to partially observable Markov decision processes (POMDPs) is that PSR models of dynamical systems may be much more compact than POMDP models. Empirical work on PSRs to date has focused on linear PSRs, which have not allowed for compression relative to POMDPs. We introduce a new notion of tests which allows us to define a new type of PSR that is nonlinear in general and allows for exponential compression in some deterministic dynamical systems. These new tests, called e-tests, are related to the tests used by Rivest and Schapire [1] in their work with the diversity representation, but our PSR avoids some of the pitfalls of their representation—in particular, its potential to be exponentially larger than the equivalent POMDP. 1

3 0.78225338 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons

Author: Hsin Chen, Patrice Fleury, Alan F. Murray

Abstract: This paper presents VLSI circuits with continuous-valued probabilistic behaviour realized by injecting noise into each computing unit (neuron). Interconnecting the noisy neurons forms a Continuous Restricted Boltzmann Machine (CRBM), which has shown promising performance in modelling and classifying noisy biomedical data. The Minimising-Contrastive-Divergence learning algorithm for CRBM is also implemented in mixed-mode VLSI, to adapt the noisy neurons’ parameters on-chip. 1

4 0.62769192 81 nips-2003-Geometric Analysis of Constrained Curves

Author: Anuj Srivastava, Washington Mio, Xiuwen Liu, Eric Klassen

Abstract: We present a geometric approach to statistical shape analysis of closed curves in images. The basic idea is to specify a space of closed curves satisfying given constraints, and exploit the differential geometry of this space to solve optimization and inference problems. We demonstrate this approach by: (i) defining and computing statistics of observed shapes, (ii) defining and learning a parametric probability model on shape space, and (iii) designing a binary hypothesis test on this space. 1

5 0.47142741 46 nips-2003-Clustering with the Connectivity Kernel

Author: Bernd Fischer, Volker Roth, Joachim M. Buhmann

Abstract: Clustering aims at extracting hidden structure in a dataset. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two-step procedure: first the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters in contrast to methods like spectral clustering or mean-shift clustering. 1

6 0.46951273 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

7 0.46774396 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

8 0.46726319 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

9 0.46301985 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

10 0.45960507 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

11 0.45461598 87 nips-2003-Identifying Structure across Pre-partitioned Data

12 0.4532522 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

13 0.45231247 146 nips-2003-Online Learning of Non-stationary Sequences

14 0.45196673 68 nips-2003-Eye Movements for Reward Maximization

15 0.45066869 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors

16 0.44581878 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

17 0.44330946 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

18 0.44321266 55 nips-2003-Distributed Optimization in Adaptive Networks

19 0.44266811 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

20 0.44047657 107 nips-2003-Learning Spectral Clustering