nips nips2004 nips2004-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. [sent-11, score-0.874]
2 Similar to the experts setting, we address the question of how well an agent can do compared to the reward achieved under the best stationary policy over time. [sent-12, score-1.327]
3 We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. [sent-13, score-0.352]
4 Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. [sent-14, score-0.181]
5 We also show that in the case that the dynamics change over time, the problem becomes computationally hard. [sent-15, score-0.125]
6 1 Introduction There is an inherent tension between the objectives in an expert setting and those in a reinforcement learning setting. [sent-16, score-0.188]
7 In the experts problem, during every round a learner chooses one of n decision making experts and incurs the loss of the chosen expert. [sent-17, score-0.874]
8 The setting is typically an adversarial one, where Nature provides the examples to a learner. [sent-18, score-0.16]
9 The standard objective here is a myopic, backwards looking one — in retrospect, we desire that our performance is not much worse than had we chosen any single expert on the sequence of examples provided by Nature. [sent-19, score-0.156]
10 In contrast, a reinforcement learning setting typically makes the much stronger assumption of a fixed environment, typically a Markov decision process (MDP), and the forward looking objective is to maximize some measure of the future reward with respect to this fixed environment. [sent-20, score-0.765]
11 The motivation of this work is to understand how to efficiently incorporate the benefits of existing experts algorithms into a more adversarial reinforcement learning setting, where certain aspects of the environment could change over time. [sent-21, score-0.606]
12 A naive way to implement an experts algorithm is to simply associate an expert with each fixed policy. [sent-22, score-0.455]
13 The running time of such algorithms is polynomial in the number of experts and the regret (the difference from the optimal reward) is logarithmic in the number of experts. [sent-23, score-0.557]
14 For our setting the number of policies is huge, namely #actions^{#states}, which renders the naive experts approach computationally infeasible. [sent-24, score-0.572]
15 We might hope for a more effective regret bound which has no dependence on the size of the state space (which is typically large). [sent-28, score-0.354]
16 The setting we consider is one in which the dynamics of the environment are known to the learner, but the reward function can change over time. [sent-29, score-0.8]
17 We assume that after each time step the learner has complete knowledge of the previous reward functions (over the entire environment), but does not know the future reward functions. [sent-30, score-1.218]
18 Note that at each time step we select one road segment, suffer a certain delay, and need to plan ahead with respect to our current position. [sent-35, score-0.121]
19 In fact, Kalai and Vempala [2003] address the computational difficulty of handling a large number of experts under certain linear assumptions on the reward functions. [sent-37, score-0.916]
20 McMahan et al. [2003] also considered a similar setting; they also assume that the reward function is chosen by an adversary and that the dynamics are fixed. [sent-40, score-0.789]
21 In this work, we provide efficient ways to incorporate existing best experts algorithms into the MDP setting. [sent-42, score-0.347]
22 Furthermore, our loss bounds (compared to the best constant policy) have no dependence on the number of states and depend only on a certain horizon time of the environment and log(#actions). [sent-43, score-0.335]
23 The first is where we allow Nature to change the dynamics of the environment over time. [sent-45, score-0.178]
24 Here, we show that it becomes NP-hard to develop a low regret algorithm even for an oblivious adversary. [sent-46, score-0.187]
25 The second extension is to consider one in which the agent only observes the rewards for the states it actually visits (a generalization of the multi-armed bandit problem). [sent-47, score-0.163]
26 2 The Setting We consider an MDP with state space S; initial state distribution d1 over S; action space A; state transition probabilities {Psa (·)} (here, Psa is the next-state distribution on taking action a in state s); and a sequence of reward functions r1 , r2 , . . . , rT , where rt is the (bounded) reward function at time step t mapping S × A into [0, 1]. [sent-49, score-1.664] [sent-52, score-0.894]
28 We assume the agent has complete knowledge of the transition model P , but at time t, the agent only knows the past reward functions r1 , r2 , . . . , rt−1 . [sent-54, score-0.891]
29 Hence, an algorithm A is a mapping from S and the previous reward functions r1 , . . . , rt−1 to a probability distribution over actions, where A(a|s, r1 , . . . , rt−1 ) is the probability of taking action a at time t. [sent-58, score-0.597] [sent-64, score-0.201]
31 The value of an algorithm A is defined as Vr1 ,...,rT (A) = E[ (1/T) Σ_{t=1}^{T} rt (st , at ) | d1 , A ], where at ∼ A(a|st , r1 , . . . , rt−1 ) and st is the random variable which represents the state at time t, starting from the initial state s1 ∼ d1 and following actions a1 , a2 , . . . . [sent-68, score-0.305] [sent-71, score-0.564]
33 Note that we keep track of the expectation and not of a specific trajectory (and our algorithm specifies a distribution over actions at every state and at every time step t). [sent-75, score-0.429]
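To make the value definition above concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes the T-step average reward of an algorithm exactly by propagating the state distribution, mirroring the remark that we track the expectation rather than a specific trajectory; the array layout and the function name are assumptions of this sketch.

```python
import numpy as np

def value_of_algorithm(P, rewards, d1, policies):
    """T-step average reward E[(1/T) sum_t r_t(s_t, a_t) | d1, A].

    P        : (S, A, S) array, P[s, a, s'] = Pr[s' | s, a]
    rewards  : (T, S, A) array, rewards[t] is the reward function at step t+1
    d1       : (S,) initial state distribution
    policies : (T, S, A) array, policies[t][s] is A's action distribution at step t+1
    """
    T = len(rewards)
    d = np.array(d1, dtype=float)
    total = 0.0
    for t in range(T):
        pi = policies[t]
        # expected reward at time t under state distribution d and policy pi
        total += float(np.sum(d[:, None] * pi * rewards[t]))
        # push the state distribution forward through the induced chain P^pi
        P_pi = np.einsum('sa,sap->sp', pi, P)
        d = d @ P_pi
    return total / T
```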
34 Ideally, we would like to find an A which achieves a large reward Vr1 ,...,rT (A) regardless of how the adversary chooses the reward functions. [sent-76, score-0.569] [sent-79, score-0.679]
36 In general, this of course is not possible, and, as in the standard experts setting, we desire that our algorithm competes favorably against the best fixed stationary policy π(a|s) in hindsight. [sent-80, score-0.743]
37 For every stationary policy π(a|s), we define P π to be the transition matrix induced by π, where the component [P π ]s,s′ is the transition probability from s to s′ under π. [sent-83, score-0.466]
38 Also, define dπ,t to be the state distribution at time t when following π, ie dπ,t = d1 (P π )t where we are treating d1 as a row vector here. [sent-84, score-0.255]
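A short sketch (same numpy conventions as above; ours, not the paper's) of the induced transition matrix P^π and of dπ,t = d1 (P^π)^t:

```python
import numpy as np

def induced_transition(P, pi):
    """[P^pi]_{s,s'} = sum_a pi(a|s) P_{sa}(s')."""
    return np.einsum('sa,sap->sp', pi, P)

def d_pi_t(P, pi, d1, t):
    """State distribution at time t when following pi: d1 (P^pi)^t, d1 a row vector."""
    return np.asarray(d1, dtype=float) @ np.linalg.matrix_power(induced_transition(P, pi), t)
```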
39 Assumption 1 (Mixing) We assume the transition model over states, as determined by π, has a well defined stationary distribution, which we call dπ . [sent-85, score-0.16]
40 More formally, for every initial state s, dπ,t converges to dπ as t tends to infinity and dπ P π = dπ . [sent-86, score-0.216]
41 Furthermore, this implies there exists some τ such that for all policies π, and distributions d and d′ , ‖dP π − d′ P π ‖1 ≤ e−1/τ ‖d − d′ ‖1 , where ‖x‖1 denotes the ℓ1 norm of a vector x. [sent-87, score-0.184]
42 The parameter τ provides a bound on the planning horizon timescale, since it implies that every policy achieves close to its average reward in O(τ ) steps 1 . [sent-89, score-1.071]
43 This parameter also governs how long it effectively takes to switch from one policy to another (after O(τ ) steps there is little information in the state distribution about the previous policy). [sent-90, score-0.484]
44 These values satisfy the well known recurrence equation: Qπ,r (s, a) = r(s, a) − ηr (π) + Es′ ∼Psa [Qπ,r (s′ , π)] (1) where Qπ,r (s′ , π) is the next-state value (without deviation). [sent-93, score-0.228]
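To ground the recurrence (1), here is a minimal policy-evaluation sketch (our own code, assuming the mixing condition of Assumption 1 above so that the stationary distribution and the relative values are well defined); it returns ηr (π) and a solution of (1) pinned at a reference state, i.e. correct up to the usual additive constant.

```python
import numpy as np

def eta_and_q(P, r, pi, iters=100000, tol=1e-10):
    """Average reward eta_r(pi) and a solution Q of recurrence (1), pinned at state 0.

    P : (S, A, S) transitions, r : (S, A) reward, pi : (S, A) policy.
    Assumes the induced chain mixes (Assumption 1), so these quantities are well defined.
    """
    S, A = r.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)           # induced chain P^pi
    d = np.full(S, 1.0 / S)
    for _ in range(iters):                          # power iteration for d_pi
        d_next = d @ P_pi
        if np.abs(d_next - d).sum() < tol:
            break
        d = d_next
    eta = float(np.sum(d[:, None] * pi * r))        # eta_r(pi) = E_{s~d_pi, a~pi}[r(s, a)]
    Q = np.zeros((S, A))
    for _ in range(iters):                          # relative Q-iteration on (1)
        V = np.sum(pi * Q, axis=1)                  # Q_{pi,r}(s, pi)
        Q_next = r - eta + np.einsum('sap,p->sa', P, V)
        Q_next -= np.sum(pi[0] * Q_next[0])         # pin Q(s_0, pi) = 0 (values are relative)
        if np.abs(Q_next - Q).max() < tol:
            break
        Q = Q_next
    return eta, Q
```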
45 1 If this timescale is unreasonably large for some specific MDP, then one could artificially impose some horizon time and attempt to compete with those policies which mix in this horizon time, as done in Kearns and Singh [1998]. [sent-94, score-0.407]
46 If π ∗ is an optimal policy (with respect to ηr ), then, as usual, we define Q∗r (s, a) to be the value of the optimal policy, ie Q∗r (s, a) = Qπ∗ ,r (s, a). [sent-95, score-0.254]
47 It is straightforward to see that the previous assumption implies a rate of convergence to the stationary distribution that is O(τ ), for all policies. [sent-97, score-0.182]
48 Lemma 2 For all policies π, ‖dπ,t − dπ ‖1 ≤ 2e−t/τ . [sent-99, score-0.147]
49 Lemma 3 For all reward functions r, Qπ,r (s, a) ≤ 3τ . [sent-104, score-0.572]
50 For all t, including t = 1, let dπ,s,t be the state distribution at time t starting from state s and following π. [sent-107, score-0.399]
51 2 The Algorithm Now we provide our main result showing how to use any generic experts algorithm in our setting. [sent-110, score-0.372]
52 We associate each state with an experts algorithm, and the expert for each state is responsible for choosing the actions at that state. [sent-111, score-0.888]
53 We now assume that our experts algorithm achieves a performance comparable to the best constant action. [sent-114, score-0.397]
54 Assumption 4 (Black Box Experts) We assume access to an optimized best expert algorithm which guarantees that for any sequence of loss functions c1 , c2 , . . . , cT over actions A, the algorithm selects a distribution qt over A (using only the previous loss functions c1 , c2 , . . . , ct−1 ) such that Σ_{t=1}^{T} Ea∼qt [ct (a)] ≤ min_{a∈A} Σ_{t=1}^{T} ct (a) + M √(T log |A|), where ct (a) ≤ M . [sent-115, score-0.177] [sent-118, score-0.287] [sent-121, score-0.186]
57 Furthermore, we also assume that the decision distributions do not change quickly: ‖qt − qt+1 ‖1 ≤ √(log |A|/t). These assumptions are satisfied by the multiplicative weights algorithms. [sent-122, score-0.233]
58 For instance, the algorithm in Freund and Schapire [1999] is such that, for each decision a, log qt (a) changes by at most O(√(log |A|/t)), ie | log qt (a) − log qt+1 (a)| = O(√(log |A|/t)), which implies the weaker ℓ1 condition above. [sent-123, score-0.383]
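For concreteness, the following exponentially weighted forecaster is one standard way to instantiate the black-box assumption; this is our own sketch, and the anytime learning-rate schedule below is a common choice, not necessarily the exact algorithm or constants analyzed by Freund and Schapire [1999].

```python
import numpy as np

class MWExpert:
    """Multiplicative-weights forecaster over a finite action set.

    Plays a distribution q_t, then observes a full loss vector c_t with entries in [0, M];
    its regret against the best fixed action is O(M sqrt(T log |A|)), and consecutive
    distributions change slowly, in the spirit of the assumption above.
    """
    def __init__(self, n_actions, M=1.0):
        self.n = n_actions
        self.M = float(M)
        self.cum_loss = np.zeros(n_actions)
        self.t = 0

    def distribution(self):
        t = self.t + 1
        eta = np.sqrt(np.log(self.n) / t) / self.M      # anytime step size (one common choice)
        w = np.exp(-eta * (self.cum_loss - self.cum_loss.min()))
        return w / w.sum()

    def update(self, losses):
        self.cum_loss += np.asarray(losses, dtype=float)
        self.t += 1
```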
59 In our setting, we have an experts algorithm associated with every state s, which is fed the loss function Qπt ,rt (s, ·) at time t. [sent-124, score-0.672]
60 Intuitively, our experts algorithms will be using a similar policy for significantly long periods of time. [sent-126, score-0.568]
61 Hence ‖πt (·|s) − πt+1 (·|s)‖1 ≤ √(log |A|/t). Also note that since the experts algorithms are associated with each state and each of the N experts chooses decisions out of |A| actions, the algorithm is efficient (polynomial in N and |A|, assuming that the black box uses a reasonable experts algorithm). [sent-127, score-1.325]
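Putting the pieces together, here is a sketch of the overall loop described above: one expert per state, each fed the values Qπt ,rt (s, ·) once rt is revealed. It reuses the MWExpert class and the eta_and_q routine from the earlier sketches (so those definitions are assumed to be in scope), and it negates the Q-values before the update because our forecaster minimizes losses while the experts here should prefer actions with large Q.

```python
import numpy as np

def mdp_experts(P, reward_sequence, loss_scale=1.0):
    """One experts algorithm per state; assumes MWExpert and eta_and_q are in scope.

    P               : (S, A, S) transition probabilities (known to the learner)
    reward_sequence : iterable of (S, A) reward arrays r_1, r_2, ..., revealed one per step
    Returns the sequence of policies pi_1, pi_2, ... that the algorithm plays.
    """
    S, A, _ = P.shape
    experts = [MWExpert(A, M=loss_scale) for _ in range(S)]   # Lemma 3 suggests losses of order 3*tau
    policies = []
    for r_t in reward_sequence:
        # current policy: each state's expert proposes its own action distribution
        pi_t = np.stack([e.distribution() for e in experts])  # (S, A)
        policies.append(pi_t)
        # once r_t is known, evaluate pi_t and feed Q_{pi_t, r_t}(s, .) to state s's expert
        _, Q = eta_and_q(P, np.asarray(r_t, dtype=float), pi_t)
        for s in range(S):
            experts[s].update(-Q[s])                          # maximize Q = minimize -Q
        # (the action actually taken at time t is drawn from pi_t at the current state)
    return policies
```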
62 Theorem 5 Then for all reward functions r1 , . . . , rT and for all stationary policies π, Vr1 ,...,rT (A) ≥ Vr1 ,...,rT (π) − 4τ √(log |A|/T) − 3τ √(log |A|/T) − O(1/T). As expected, the regret goes to 0 at the rate O(1/√T), as is the case with experts algorithms. [sent-133, score-1.437]
63 Importantly, note that the bound does not depend on the size of the state space. [sent-134, score-0.205]
64 First, we analyze the performance of the algorithm in an idealized setting, where the algorithm instantaneously obtains the average reward of its current policy at each step. [sent-143, score-0.965]
65 Then we take into account the slow change of the policies to show that the actual performance is similar to the instantaneous performance. [sent-144, score-0.195]
66 An Idealized Setting: Let us examine the case in which at each time t, when the algorithm uses πt , it immediately obtains reward ηrt (πt ). [sent-145, score-0.685]
67 The following theorem compares the performance of our algorithms to that of a fixed constant policy in this setting. [sent-146, score-0.266]
68 Theorem 6 For all reward functions r1 , . . . , rT , the MDP experts algorithm has the following performance bound. [sent-150, score-0.372]
69 where π1 , π2 , . . . , πT is the sequence of policies generated by A in response to r1 , r2 , . . . , rT . [sent-154, score-0.173]
70 Next we provide a technical lemma, which is a variant of a result in Kakade [2003]. Lemma 7 For all policies π and π ′ , ηr (π ′ ) − ηr (π) = Es∼dπ′ [Qπ,r (s, π ′ ) − Qπ,r (s, π)]. Proof. [sent-158, score-0.147]
71 Note that by definition of stationarity, if the state distribution is at dπ′ , then the next state distribution is also dπ′ if π ′ is followed. [sent-159, score-0.354]
72 The lemma shows why our choice to feed each experts algorithm Qπt ,rt was appropriate. [sent-162, score-0.518]
73 Taking Mixing Into Account: This subsection relates the values V to the sums of average reward used in the idealized setting. [sent-166, score-0.623]
74 where π1 , π2 , . . . , πT is the sequence of policies generated by A in response to r1 , r2 , . . . , rT . [sent-176, score-0.173]
75 Since the above holds for all A (including those A which are the constant policy π), combining this with Theorem 6 (once with A and once with π) completes the proof of Theorem 5. [sent-180, score-0.263]
76 It shows how close the next-state distributions are when following πt rather than πt+1 . [sent-183, score-0.177]
77 Analogous to the definition of dπ,t , we define dA,t (s) = Pr[st = s | d1 , A], which is the probability that the state at time t is s given that A has been followed. [sent-186, score-0.399]
78 Let π1 , π2 , . . . , πT be the sequence of policies generated by A in response to r1 , r2 , . . . , rT . [sent-190, score-0.173]
79 Using our experts assumption, it is straightforward to see that the change in the policy between times k and t is ‖πk (·|s) − πt (·|s)‖1 ≤ (t − k)√(log |A|/t). [sent-196, score-0.725]
80 Recursing on the above equation leads to: ‖dA,t − dπt ‖1 ≤ 2√(log |A|/t) Σ_{k=1}^{t} (t − k)e−(t−k)/τ + e−t/τ ‖d1 − dπt ‖1 ≤ 2√(log |A|/t) Σ_{k=1}^{∞} k e−k/τ + 2e−t/τ . The sum is bounded by an integral from 0 to ∞, which evaluates to τ². [sent-198, score-0.162]
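A quick numerical check (ours, with an arbitrary value of τ) of the final step, where Σ_{k≥1} k e−k/τ is compared against the integral ∫_0^∞ x e−x/τ dx = τ²:

```python
import numpy as np

tau = 5.0
k = np.arange(1, 10000)
print((k * np.exp(-k / tau)).sum(), tau ** 2)   # ~24.92 vs 25.0
```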
81 Here, in each timestep t, an oblivious adversary is allowed to choose both the reward function rt and the transition model Pt — the model that determines the transitions to be used at timestep t. [sent-207, score-1.25]
82 After each timestep, the agent receives complete knowledge of both rt and Pt . [sent-208, score-0.432]
83 We let Rt (M ) be the optimal average reward obtained by a stationary policy for times [1, t]. [sent-211, score-0.904]
84 Theorem 11 In the changing dynamics model, if there exists a polynomial time online algorithm (polynomial in the problem parameters) that, for any MDP, has an expected average reward larger than (0.875 + ε)R∗ (M ) for some ε > 0, then P = NP. [sent-212, score-0.796]
85 The following lemma is useful in the proof and uses the fact that it is hard to approximate MAX3SAT within any factor better than 0.875. [sent-214, score-0.154]
86 Lemma 12 Computing a stationary policy in the changing dynamics model with average reward larger than (0.875 + ε) times the optimal average reward is NP-hard, for any ε > 0. [sent-216, score-1.017]
87 Given a 3-CNF formula φ with n variables and m clauses, we construct an MDP with states s1 , . . . , sn , sn+1 , two actions in each state, 0 and 1, and fixed dynamics for 3m steps which will be described later. [sent-228, score-0.239]
88 We prove that a policy with average reward p/3 translates to an assignment that satisfies p fraction of φ and vice versa. [sent-229, score-0.897]
89 The initial state is s1 ; the reward for action 0 is 0 and the agent moves to state s2 , while for action 1 the reward is 1 and it moves to state sn+1 . [sent-232, score-2.156]
90 In the second timestep the reward in sn+1 is 0 for every action and the agent stays in it; in state s2 , if the agent performs action 0 then it obtains reward 1 and moves to state sn+1 , otherwise it obtains reward 0 and moves to state s7 . [sent-233, score-3.007]
91 In the next timestep the reward in sn+1 is 0 for every action and the agent moves to x4 ; the reward in s7 is 1 for action 1 and zero for action 0, and the agent moves to s4 for both actions. [sent-234, score-1.857]
92 Note that time interval [3(ℓ − 1) + 1, 3ℓ] corresponds to Cℓ and that the reward obtained in this interval is at most 1. [sent-236, score-0.641]
93 An assignment y1 , . . . , yn , where yi ∈ {0, 1}, satisfies p fraction of φ if and only if the policy π which takes action yi in si has average reward p/3. [sent-240, score-0.75]
94 We prove it by looking at each interval separately and noting that if a reward of 1 is obtained, then there is an action a that we take in one of the states which has reward 1, and this action corresponds to a satisfying assignment for this clause. [sent-241, score-1.577]
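As a quick sanity check of the accounting used here (each clause Cℓ owns a 3-step interval that pays at most 1), the following sketch, which is ours and abstracts away the gadget details of the construction, computes the fraction p of clauses satisfied by an assignment and the corresponding average reward p/3:

```python
def satisfied_fraction(clauses, assignment):
    """Fraction of 3-CNF clauses satisfied by a 0/1 assignment.

    clauses    : list of 3-tuples of non-zero ints; +i stands for x_i, -i for NOT x_i
    assignment : dict mapping variable index i to 0 or 1
    """
    sat = sum(
        1 for clause in clauses
        if any((assignment[abs(lit)] == 1) == (lit > 0) for lit in clause)
    )
    return sat / len(clauses)

# Example: phi = (x1 v ~x2 v x3) AND (~x1 v x2 v x3), with x1=1, x2=0, x3=1
phi = [(1, -2, 3), (-1, 2, 3)]
y = {1: 1, 2: 0, 3: 1}
p = satisfied_fraction(phi, y)
print(p, p / 3)   # 1.0 0.333...  (both clauses satisfied; average reward p/3 over the 3m steps)
```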
95 After n steps the adversary chooses uniformly at random the next clause. [sent-245, score-0.176]
96 The strategy defined by the algorithm at the kth iteration is the probability assigned to actions 0/1 at state sℓ just before arriving at sℓ . [sent-247, score-0.432]
97 Note that the strategy at each iteration is actually a stationary policy for M . [sent-248, score-0.409]
98 Thus the strategy in each iteration defines an assignment for the formula. [sent-249, score-0.117]
99 We also note that before an iteration the expected reward of the optimal stationary policy in the iteration is k/(nm), where k is the maximal number of satisfiable clauses and there are m clauses, and we have E[R∗ (M )] = k/(nm). [sent-250, score-1.059]
100 If we choose an iteration at random, then the strategy defined in that iteration has an expected reward which is larger than (0.875 + ε)E[R∗ (M )]. [sent-251, score-0.618]
wordName wordTfidf (topN-words)
[('reward', 0.544), ('experts', 0.347), ('rt', 0.305), ('policy', 0.221), ('state', 0.177), ('mdp', 0.173), ('action', 0.156), ('policies', 0.147), ('psa', 0.129), ('regret', 0.121), ('stationary', 0.114), ('timestep', 0.112), ('lemma', 0.112), ('actions', 0.104), ('agent', 0.101), ('sn', 0.094), ('qt', 0.091), ('clauses', 0.09), ('adversary', 0.09), ('horizon', 0.085), ('adversarial', 0.082), ('setting', 0.078), ('dynamics', 0.077), ('es', 0.075), ('obtains', 0.071), ('log', 0.068), ('clause', 0.067), ('kalai', 0.067), ('moves', 0.062), ('st', 0.061), ('ct', 0.059), ('expert', 0.059), ('mixing', 0.058), ('idealized', 0.054), ('environment', 0.053), ('satis', 0.053), ('recurrence', 0.051), ('road', 0.051), ('reinforcement', 0.051), ('change', 0.048), ('transition', 0.046), ('theorem', 0.045), ('chooses', 0.045), ('mcmahan', 0.045), ('timescale', 0.045), ('iteration', 0.045), ('time', 0.045), ('polynomial', 0.044), ('assignment', 0.043), ('proof', 0.042), ('ea', 0.041), ('oblivious', 0.041), ('steps', 0.041), ('prove', 0.039), ('every', 0.039), ('loss', 0.039), ('ready', 0.038), ('kakade', 0.038), ('vempala', 0.038), ('implies', 0.037), ('dp', 0.037), ('decisions', 0.037), ('pt', 0.036), ('changing', 0.036), ('desire', 0.036), ('looking', 0.035), ('states', 0.034), ('feed', 0.034), ('immediate', 0.034), ('ie', 0.033), ('mansour', 0.031), ('assumption', 0.031), ('learner', 0.031), ('strategy', 0.029), ('functions', 0.028), ('rewards', 0.028), ('dependence', 0.028), ('bound', 0.028), ('kearns', 0.027), ('sk', 0.027), ('furthermore', 0.027), ('complete', 0.026), ('interval', 0.026), ('nm', 0.026), ('planning', 0.026), ('agents', 0.026), ('bounds', 0.026), ('decision', 0.026), ('sequence', 0.026), ('bounded', 0.026), ('stay', 0.025), ('move', 0.025), ('certain', 0.025), ('average', 0.025), ('unless', 0.025), ('algorithm', 0.025), ('fraction', 0.025), ('achieves', 0.025), ('freund', 0.024), ('associate', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
2 0.30861008 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
3 0.2688095 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
Author: Daniela D. Farias, Nimrod Megiddo
Abstract: A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven. 1
4 0.25475791 48 nips-2004-Convergence and No-Regret in Multiagent Learning
Author: Michael Bowling
Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1
5 0.21621557 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
6 0.21436593 39 nips-2004-Coarticulation in Markov Decision Processes
7 0.17692205 102 nips-2004-Learning first-order Markov models for control
8 0.174992 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
9 0.174777 148 nips-2004-Probabilistic Computation in Spiking Populations
10 0.14941598 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
11 0.13598911 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
12 0.13053417 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
13 0.12664242 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
14 0.12401216 33 nips-2004-Brain Inspired Reinforcement Learning
15 0.1224085 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
16 0.12016704 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
17 0.11509651 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
18 0.091176666 155 nips-2004-Responding to Modalities with Different Latencies
19 0.083945714 138 nips-2004-Online Bounds for Bayesian Algorithms
20 0.081168868 183 nips-2004-Temporal-Difference Networks
topicId topicWeight
[(0, -0.2), (1, -0.09), (2, 0.521), (3, -0.067), (4, -0.239), (5, 0.183), (6, -0.094), (7, -0.112), (8, -0.078), (9, 0.086), (10, -0.031), (11, 0.046), (12, 0.053), (13, 0.001), (14, -0.021), (15, 0.04), (16, 0.033), (17, 0.106), (18, 0.009), (19, 0.001), (20, -0.021), (21, -0.003), (22, 0.006), (23, -0.061), (24, 0.031), (25, -0.006), (26, -0.112), (27, 0.071), (28, -0.053), (29, 0.002), (30, -0.009), (31, -0.027), (32, 0.002), (33, -0.114), (34, -0.029), (35, -0.078), (36, 0.03), (37, -0.06), (38, 0.088), (39, 0.053), (40, -0.081), (41, 0.092), (42, 0.031), (43, -0.005), (44, 0.037), (45, -0.051), (46, -0.007), (47, -0.102), (48, -0.012), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97540611 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
2 0.78195798 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
3 0.76995814 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
Author: Daniela D. Farias, Nimrod Megiddo
Abstract: A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven. 1
4 0.66433203 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
5 0.57166892 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
Author: Guy Shani, Ronen I. Brafman
Abstract: Agents learning to act in a partially observable domain may need to overcome the problem of perceptual aliasing – i.e., different states that appear similar but require different responses. This problem is exacerbated when the agent's sensors are noisy, i.e., sensors may produce different observations in the same state. We show that many well-known reinforcement learning methods designed to deal with perceptual aliasing, such as Utile Suffix Memory, finite size history windows, eligibility traces, and memory bits, do not handle noisy sensors well. We suggest a new algorithm, Noisy Utile Suffix Memory (NUSM), based on USM, that uses a weighted classification of observed trajectories. We compare NUSM to the above methods and show it to be more robust to noise.
6 0.54616797 24 nips-2004-Approximately Efficient Online Mechanism Design
7 0.54526734 48 nips-2004-Convergence and No-Regret in Multiagent Learning
8 0.52629286 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
9 0.5260849 102 nips-2004-Learning first-order Markov models for control
10 0.51569635 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
11 0.49534544 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
12 0.46971047 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
13 0.434726 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
14 0.42289549 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
15 0.40335739 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
16 0.36530578 171 nips-2004-Solitaire: Man Versus Machine
17 0.34224755 33 nips-2004-Brain Inspired Reinforcement Learning
18 0.3369619 138 nips-2004-Online Bounds for Bayesian Algorithms
19 0.33166623 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
20 0.33145532 148 nips-2004-Probabilistic Computation in Spiking Populations
topicId topicWeight
[(13, 0.129), (15, 0.114), (20, 0.193), (26, 0.059), (31, 0.029), (33, 0.277), (35, 0.017), (39, 0.012), (50, 0.031), (71, 0.024)]
simIndex simValue paperId paperTitle
1 0.94696206 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits
Author: Wolfgang Maass, Robert A. Legenstein, Nils Bertschinger
Abstract: What makes a neural microcircuit computationally powerful? Or more precisely, which measurable quantities could explain why one microcircuit C is better suited for a particular family of computational tasks than another microcircuit C ? We propose in this article quantitative measures for evaluating the computational power and generalization capability of a neural microcircuit, and apply them to generic neural microcircuit models drawn from different distributions. We validate the proposed measures by comparing their prediction with direct evaluations of the computational performance of these microcircuit models. This procedure is applied first to microcircuit models that differ with regard to the spatial range of synaptic connections and with regard to the scale of synaptic efficacies in the circuit, and then to microcircuit models that differ with regard to the level of background input currents and the level of noise on the membrane potential of neurons. In this case the proposed method allows us to quantify differences in the computational power and generalization capability of circuits in different dynamic regimes (UP- and DOWN-states) that have been demonstrated through intracellular recordings in vivo. 1
same-paper 2 0.93258524 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
3 0.85941362 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
4 0.85362768 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
5 0.8480674 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
Author: David P. Wipf, Bhaskar D. Rao
Abstract: Finding the sparsest, or minimum ℓ0 -norm, representation of a signal given an overcomplete dictionary of basis vectors is an important problem in many application domains. Unfortunately, the required optimization problem is often intractable because there is a combinatorial increase in the number of local minima as the number of candidate basis vectors increases. This deficiency has prompted most researchers to instead minimize surrogate measures, such as the ℓ1 -norm, that lead to more tractable computational methods. The downside of this procedure is that we have now introduced a mismatch between our ultimate goal and our objective function. In this paper, we demonstrate a sparse Bayesian learning-based method of minimizing the ℓ0 -norm while reducing the number of troublesome local minima. Moreover, we derive necessary conditions for local minima to occur via this approach and empirically demonstrate that there are typically many fewer for general problems of interest. 1
6 0.84675843 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
7 0.84667277 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
8 0.8445242 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
9 0.84350002 44 nips-2004-Conditional Random Fields for Object Recognition
10 0.84339416 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
11 0.84335399 124 nips-2004-Multiple Alignment of Continuous Time Series
12 0.8408407 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
13 0.83935237 77 nips-2004-Hierarchical Clustering of a Mixture Model
14 0.83893287 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
15 0.83856517 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
16 0.83770901 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
17 0.83763903 163 nips-2004-Semi-parametric Exponential Family PCA
18 0.83745408 116 nips-2004-Message Errors in Belief Propagation
19 0.83677298 99 nips-2004-Learning Hyper-Features for Visual Identification
20 0.83672059 127 nips-2004-Neighbourhood Components Analysis