nips nips2006 nips2006-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. [sent-5, score-0.259]
2 The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. [sent-6, score-0.218]
3 In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. [sent-7, score-0.793]
4 Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. [sent-8, score-0.525]
5 In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. [sent-9, score-0.56]
6 1 Introduction In many decision problems the parameters of the problem are inherently uncertain. [sent-10, score-0.167]
7 The standard approach in decision making to circumvent the adverse effect of the parameter uncertainty is to find a solution that performs best under the worst possible parameters. [sent-12, score-0.401]
8 This approach, termed the “robust” approach, has been used in both single-stage ([1]) and multi-stage decision problems. [sent-13, score-0.233]
9 By requiring the solution to be feasible for all possible parameters within the uncertainty set, Soyster ([1]) solved the column-wise independent uncertainty case, and Ben-Tal and Nemirovski ([3]) solved the row-wise independent case. [sent-17, score-0.405]
10 In robust MDP problems, there may be two different types of parameter uncertainty, namely, the reward uncertainty and the transition probability uncertainty. [sent-18, score-0.589]
11 Under the assumption that the uncertainty is state-wise independent (an assumption made by all papers to date, to the best of our knowledge), the optimality principle holds and the problem can be decomposed into a series of stage-wise mini-max problems solved by backward induction ([2, 4, 5]). [sent-19, score-0.178]
12 This implies that the vector of nominal parameters (the parameters used as an approximation of the true one regardless of the uncertainty) is not treated in a special way and is just an element of the set of feasible parameters. [sent-21, score-0.44]
13 Second, the desirability of the solution highly depends on the precise modeling of the uncertainty set which is often based on some ad-hoc criterion. [sent-27, score-0.155]
14 Third, it may happen that the nominal parameters are close to the real parameters, so that the performance of the solution under nominal parameters may provide important information for predicting the performance under the true parameters. [sent-28, score-0.829]
15 Finally, there is a certain tradeoff relationship between the worst-case performance and the nominal performance, that is, if the decision maker insists on maximizing one criterion, the other criterion may decrease dramatically. [sent-29, score-1.108]
16 On the other hand, relaxing both criteria may lead to a well balanced solution with both satisfactory nominal performance and also reasonable robustness to parameter uncertainty. [sent-30, score-0.657]
17 In this paper we capture the Robustness-Performance (RP) tradeoff explicitly. [sent-31, score-0.294]
18 We use the worst-case behavior of a solution as the function representing its robustness, and formulate the decision problem as an optimization of both the robustness criterion and the performance under nominal parameters simultaneously. [sent-32, score-0.852]
19 Here, “simultaneously” is achieved by optimizing the weighted sum of the performance criterion and the robustness criterion. [sent-33, score-0.258]
20 Instead of optimizing the weighted sum of the robustness and performance for some specific weights, we show how to efficiently find the solutions for all possible weights. [sent-35, score-0.279]
21 We prove that the set of these solutions is in fact equivalent to the set of all Pareto efficient solutions in the robustness-performance space. [sent-36, score-0.136]
22 Therefore, we solve the tradeoff problem without choosing a specific tradeoff parameter, and leave the subjective decision of determining the exact tradeoff to the decision maker. [sent-37, score-1.216]
23 Instead of arbitrarily claiming that a certain solution is a good tradeoff, our algorithm computes the whole tradeoff relationship so that the decision maker can choose the most desirable solution according to her preference, which is usually complicated and not available in explicit form. [sent-38, score-0.813]
24 Our approach thus avoids the tuning of tradeoff parameters, for which generally no good a-priori method exists. [sent-39, score-0.294]
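To make the idea of handing the decision maker the whole tradeoff curve concrete, the following sketch (not from the paper; the names and the toy numbers are our own) takes a list of candidate policies with their nominal and worst-case values and keeps only the Pareto efficient ones, which is exactly the set the proposed parametric-LP method characterizes.

```python
def pareto_front(points):
    """Keep the Pareto efficient (nominal, worst-case) pairs, both to be maximized.

    points: list of (nominal_value, worst_case_value, policy_label) tuples.
    A point is dominated if another point is at least as good in both
    coordinates and strictly better in one.
    """
    front = []
    for i, (n_i, w_i, p_i) in enumerate(points):
        dominated = any(
            (n_j >= n_i and w_j >= w_i) and (n_j > n_i or w_j > w_i)
            for j, (n_j, w_j, _) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((n_i, w_i, p_i))
    return front

# Toy usage: four hypothetical policies; "dominated" is removed from the front.
candidates = [(10.0, 2.0, "nominal-optimal"), (7.5, 6.0, "balanced"),
              (6.0, 6.5, "robust"), (7.0, 5.0, "dominated")]
print(pareto_front(candidates))
```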
25 This is opposed to certain relaxations of the worst-case robust optimization approach like [6] (for single stage only) where some explicit tradeoff parameters have to be chosen. [sent-40, score-0.429]
26 Unlike risk-sensitive learning approaches [7, 8, 9], which aim to tune a strategy online, our approach computes a robust strategy offline without trial and error. [sent-41, score-0.252]
27 Section 2 is devoted to the RP tradeoff for Linear Programming. [sent-43, score-0.294]
28 In Section 3 and Section 4 we discuss the RP tradeoff for MDPs with uncertain rewards and with uncertain transition probabilities, respectively. [sent-44, score-0.694]
29 2 Parametric linear programming and RP tradeoffs in optimization In this section, we briefly recall Parametric Linear Programming (PLP) [10, 11, 12], and show how it can be used to find the whole set of Pareto efficient solutions for RP tradeoffs in Linear Programming. [sent-47, score-0.419]
30 This serves as the base for the discussion of RP tradeoffs in MDPs. [sent-48, score-0.128]
31 An outline of the PLP algorithm is described in Algorithm 1, which is essentially a tableau simplex algorithm in which the entering variable is determined in a specific way. [sent-55, score-0.125]
32 Find a basic feasible optimal solution for λ = 0. [sent-59, score-0.185]
33 Compute the reduced cost coefficients (i.e., the zero row of the simplex tableau) of the first objective, denoted $c^{(1)}_j$. [sent-65, score-0.121]
34 This algorithm is based on the observation that for any λ, there exists an optimal basic feasible solution. [sent-71, score-0.174]
35 It is also known that the optimal value for PLP is a continuous piecewise linear function of λ. [sent-75, score-0.12]
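As an illustration of this piecewise-linear behavior, the sketch below simply re-solves the weighted LP on a grid of λ values with scipy; the paper's Algorithm 1 instead pivots exactly from breakpoint to breakpoint, so this grid sweep is only a rough stand-in for the parametric simplex, and the problem data here are made up.

```python
import numpy as np
from scipy.optimize import linprog

# Two competing linear objectives over the same feasible polyhedron {x : A x <= b, x >= 0}.
c1 = np.array([1.0, 2.0])      # first objective (e.g., nominal cost)
c2 = np.array([3.0, 0.5])      # second objective (e.g., non-robustness)
A = np.array([[-1.0, -1.0],    # x1 + x2 >= 1
              [-2.0, -1.0]])   # 2 x1 + x2 >= 1.5
b = np.array([-1.0, -1.5])

values = []
for lam in np.linspace(0.0, 1.0, 21):
    res = linprog(lam * c1 + (1.0 - lam) * c2, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)
    assert res.success
    values.append((lam, res.fun))

# The optimal value traces out a continuous piecewise-linear function of lambda;
# breakpoints show up where the optimal basis (vertex) changes.
for lam, v in values:
    print(f"lambda = {lam:.2f}   optimal value = {v:.4f}")
```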
36 2 RP tradeoffs in Linear Programming Consider the following LP: NOMINAL PROBLEM: Minimize c⊤x Subject to Ax ≤ b. (2) Here A ∈ R^{n×m}, x ∈ R^m, b ∈ R^n, c ∈ R^m. [sent-80, score-0.128]
37 Suppose that the constraint matrix A is only a guess of the unknown true parameter A^r, which is known to belong to a set $\mathcal{A}$ (we call $\mathcal{A}$ the uncertainty set). [sent-81, score-0.134]
38 To quantify how a solution x behaves with respect to the parameter uncertainty, we define the following criterion to be minimized as its robustness measure (more accurately, non-robustness measure). [sent-84, score-0.249]
39 $p(x) \triangleq \sup_{\tilde A \in \mathcal{A}} \big\| [\tilde A x - b]^{+} \big\|_1 = \sup_{\tilde A \in \mathcal{A}} \sum_{i=1}^{n} \max\big(\tilde a(i)^\top x - b_i,\ 0\big) = \sum_{i=1}^{n} \max\Big( \sup_{\tilde a(i):\, T(i)\tilde a(i) \le v(i)} \tilde a(i)^\top x - b_i,\ 0 \Big).$ (3) [sent-85, score-0.256]
40 Here $[\cdot]^{+}$ stands for the positive part of a vector, $\tilde a(i)$ is the ith row of the matrix $\tilde A$, and $b_i$ is the ith element of $b$. [sent-86, score-0.154]
41 Using the weighted sum of the performance objective and the robustness objective as the objective to be minimized, we formulate the explicit tradeoff between robustness and performance as: GENERAL PROBLEM: for λ ∈ [0, 1], Minimize λc⊤x + (1 − λ)p(x) Subject to Ax ≤ b. [sent-88, score-0.724]
42 By the duality theorem, for a given x, $\sup_{\tilde a(i):\, T(i)\tilde a(i) \le v(i)} \tilde a(i)^\top x$ equals the optimal value of the following LP in y(i): Minimize $v(i)^\top y(i)$ Subject to $T(i)^\top y(i) = x$, $y(i) \ge 0$. [sent-90, score-0.147]
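Putting this duality step to work, one possible encoding of the GENERAL PROBLEM as a single LP is sketched below. This is our own modeling sketch, not the paper's implementation: it assumes each row's inner sup is finite so strong LP duality applies, uses an epigraph variable for the hinge term, and the cvxpy formulation and made-up instance are ours.

```python
import numpy as np
import cvxpy as cp

def rp_tradeoff_lp(c, A, b, T, v, lam):
    """Minimize lam * c'x + (1 - lam) * p(x) subject to A x <= b.

    Row-wise uncertainty: the i-th row a~(i) ranges over {a : T[i] a <= v[i]}.
    The inner sup_a a'x is replaced by its dual min v[i]'y(i) s.t. T[i]'y(i) = x,
    y(i) >= 0, and max(. , 0) is handled with an epigraph variable z[i].
    """
    n, m = A.shape
    x = cp.Variable(m)
    z = cp.Variable(n, nonneg=True)                 # z[i] >= [sup_a a'x - b[i]]^+
    ys = [cp.Variable(T[i].shape[0], nonneg=True) for i in range(n)]
    cons = [A @ x <= b]
    for i in range(n):
        cons += [T[i].T @ ys[i] == x,               # dual feasibility ties y(i) to x
                 z[i] >= v[i] @ ys[i] - b[i]]       # epigraph of the hinge term
    obj = lam * (c @ x) + (1 - lam) * cp.sum(z)
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve()
    return x.value, prob.value

# Tiny made-up instance: 2 constraints in R^2, box uncertainty around each row of A.
A = np.array([[1.0, 1.0], [2.0, -1.0]])
b = np.array([2.0, 1.0])
c = np.array([-1.0, -1.0])
delta = 0.2
T = [np.vstack([np.eye(2), -np.eye(2)]) for _ in range(2)]
v = [np.concatenate([A[i] + delta, -(A[i] - delta)]) for i in range(2)]
print(rp_tradeoff_lp(c, A, b, T, v, lam=0.7))
```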
43 We use r to denote the vector of rewards for all state-action pairs and r_s to denote the vector of rewards for state s. [sent-94, score-0.853]
44 More specifically, we have a nominal parameter $\bar r(s, a)$ which is believed to be a reasonably good guess of the true reward. [sent-99, score-0.388]
45 The reward r is known to belong to a bounded set R. [sent-100, score-0.322]
46 We further assume that the uncertainty set R is state-wise independent and a polytope for each state. [sent-101, score-0.168]
47 That is, $R = \prod_{s \in S} R_s$, and for each s ∈ S there exist a matrix $C_s$ and a vector $d_s$ such that $R_s = \{r_s \,|\, C_s r_s \ge d_s\}$. [sent-102, score-0.344]
48 We assume that for different visits of one state, the realization of the reward need not be identical and may take different values within the uncertainty set. [sent-103, score-0.508]
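For instance, a per-state interval (box) uncertainty set around a nominal reward vector can be written in the required polytope form R_s = {r_s | C_s r_s ≥ d_s}; the small sketch below (our own construction, with made-up numbers) builds C_s and d_s for that case.

```python
import numpy as np

def box_uncertainty(r_nominal, delta):
    """Encode {r : r_nominal - delta <= r <= r_nominal + delta} as {r : C r >= d}."""
    k = len(r_nominal)
    C = np.vstack([np.eye(k), -np.eye(k)])          #  I r >= r_nominal - delta
    d = np.concatenate([r_nominal - delta,          # -I r >= -(r_nominal + delta)
                        -(r_nominal + delta)])
    return C, d

# Example: a state with two actions, nominal rewards (1.0, 0.5), uncertainty radius 0.2.
C_s, d_s = box_uncertainty(np.array([1.0, 0.5]), np.array([0.2, 0.2]))
print(C_s, d_s, sep="\n")
```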
49 The set of admissible control policies for the decision maker is the set of randomized history dependent policies, which we denote by ΠHR . [sent-104, score-0.507]
50 In the following three subsections we discuss different standard reward criteria: cumulative reward with a finite horizon, discounted reward with infinite horizon, and limiting average reward with infinite horizon under a unichain assumption. [sent-105, score-1.7]
51 1 Finite horizon case In the finite horizon case (T = {1, · · · , N }), we assume without loss of generality that each state belongs to only one stage, which is equivalent to the assumption of non-stationary reward realization, and use Si to denote the set of states at the ith stage. [sent-107, score-0.798]
52 We also assume that the first stage consists of only one state s1 , and that there are no terminal rewards. [sent-108, score-0.133]
53 We define the following two functions as the performance measure and the robustness measure of a policy π ∈ ΠHR: $P(\pi) \triangleq E^{\pi}\{\sum_{i=1}^{N-1} \bar r(s_i, a_i)\}$ and $R(\pi) \triangleq \min_{r \in R} E^{\pi}\{\sum_{i=1}^{N-1} r(s_i, a_i)\}$. (6) [sent-109, score-0.352]
54 The minimum is attainable, since R is compact and the total expected reward is a continuous function of r. [sent-110, score-0.322]
55 Here $c^{\lambda}(s_1) = \max_{\pi \in \Pi^{HR}} \{ \lambda P(\pi) + (1-\lambda) R(\pi) \}$ (7); we set $P_N \equiv R_N \equiv c^{\lambda}_N \equiv 0$, and note that $c^{\lambda}(s_1)$ is the optimal RP tradeoff with weight λ. [sent-120, score-0.346]
56 The proof is omitted since it closely follows the standard backward induction argument for finite horizon robust decision problems. [sent-122, score-0.458]
57 For $s \in S_t$, $t < N$, let $\Delta_s$ be the probability simplex on $A_s$; then $c^{\lambda}_{t}(s) = \max_{q \in \Delta_s} \min_{r_s \in R_s} \Big\{ \lambda \sum_{a \in A_s} \bar r(s,a)\, q(a) + (1-\lambda) \sum_{a \in A_s} r(s,a)\, q(a) + \sum_{s' \in S_{t+1}} \sum_{a \in A_s} p(s'|s,a)\, q(a)\, c^{\lambda}_{t+1}(s') \Big\}.$ [sent-124, score-0.231]
58 t+1 We now consider the maximin problem in each state and show how to find the solutions for all λ in one pass. [sent-125, score-0.189]
59 That is, there exist constants $l^i_j$ and $m^i_j$ such that $c^{\lambda}_{t+1}(s_j) = l^i_j \lambda + m^i_j$ for $\lambda \in [\lambda_{i-1}, \lambda_i]$. [sent-130, score-0.118]
60 By the duality theorem, $c^{\lambda}_{t}(s)$ equals the optimal value of the following LP in y and q. [sent-131, score-0.147]
61 Substituting $c^{\lambda}_{t+1}(s_j)$ and rearranging, it follows that for $\lambda \in [\lambda_{i-1}, \lambda_i]$ the objective function equals $(1-\lambda)\Big[\sum_{a \in A_s}\sum_{j=1}^{k} p(s_j|s,a)\, m^i_j\, q(a) + d_s^\top y\Big] + \lambda \sum_{a \in A_s}\Big[\bar r(s,a) + \sum_{j=1}^{k} p(s_j|s,a)\,(l^i_j + m^i_j)\Big] q(a)$. [sent-134, score-0.216]
62 Furthermore, we need not re-initialize for each interval, since the optimal solution at the end of the ith interval is also the optimal solution at the beginning of the next interval. [sent-136, score-0.291]
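For a fixed weight λ, one possible way to carry out a single backward-induction step is to dualize the inner minimization over R_s and solve one maximization LP in (q, y) per state, as the recursion above suggests. The sketch below does this with cvxpy for one state; the function and variable names are ours, strong duality is assumed (R_s a nonempty bounded polytope), and, unlike the paper's parametric-LP algorithm, it handles only one λ at a time.

```python
import numpy as np
import cvxpy as cp

def stage_value_fixed_lambda(r_bar, C_s, d_s, P, c_next, lam):
    """One backward-induction step c_t(s) = max_q min_{r in R_s} {...} for a fixed lam.

    r_bar   : (A,)  nominal rewards of the actions available at this state
    C_s,d_s : polytope R_s = {r : C_s r >= d_s} for the uncertain reward vector
    P       : (A, S') transition probabilities to the next-stage states
    c_next  : (S',)  values c^lam_{t+1} already computed by the recursion
    The inner min over r is replaced by its LP dual max_{y>=0, C_s'y=(1-lam)q} d_s'y.
    """
    n_actions = r_bar.shape[0]
    q = cp.Variable(n_actions, nonneg=True)       # randomized decision rule at this state
    y = cp.Variable(C_s.shape[0], nonneg=True)    # dual variables of the inner min
    continuation = P @ c_next                     # expected next-stage value per action
    objective = lam * (r_bar @ q) + continuation @ q + d_s @ y
    constraints = [cp.sum(q) == 1, C_s.T @ y == (1 - lam) * q]
    problem = cp.Problem(cp.Maximize(objective), constraints)
    problem.solve()
    return problem.value, q.value

# Toy usage with the box set from above: two actions, two next-stage states.
r_bar = np.array([1.0, 0.5])
C_s = np.vstack([np.eye(2), -np.eye(2)])
d_s = np.concatenate([r_bar - 0.2, -(r_bar + 0.2)])
P = np.array([[0.8, 0.2], [0.3, 0.7]])
c_next = np.array([2.0, 1.0])
print(stage_value_fixed_lambda(r_bar, C_s, d_s, P, c_next, lam=0.5))
```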
63 2 Discounted reward infinite horizon case In this section we address the RP tradeoff for infinite horizon MDPs with a discounted reward criterion. [sent-140, score-1.387]
64 For a fixed λ, the problem is equivalent to a zero-sum game, with the decision maker trying to maximize the weighted sum and Nature trying to minimize it by selecting an adversarial reward realization. [sent-141, score-0.807]
65 A well known result in discounted zero-sum stochastic games states that, even if non-stationary policies are admissible, a Nash equilibrium in which both players choose a stationary policy exists; see Proposition 7. [sent-142, score-0.234]
66 Since it suffices to consider a stationary policy for Nature, the tradeoff problem becomes: Maximize $\lambda \sum_{s \in S}\sum_{a \in A_s} \bar r(s,a)\, x(s,a) + (1-\lambda) \inf_{r \in R} \sum_{s \in S}\sum_{a \in A_s} r(s,a)\, x(s,a)$ Subject to x ∈ X. [sent-145, score-0.385]
67 Maximize $\lambda \sum_{s \in S}\sum_{a \in A_s} \bar r(s,a)\, x(s,a) + (1-\lambda) \sum_{s \in S} d_s^\top y_s$ Subject to $\sum_{a \in A_{s'}} x(s',a) - \sum_{s \in S}\sum_{a \in A_s} \gamma\, p(s'|s,a)\, x(s,a) = \alpha(s')$ for all $s'$, $x(s,a) \ge 0$ for all $s, a$, $C_s^\top y_s = x_s$ for all $s$, and $y_s \ge 0$ for all $s$. (11) [sent-147, score-0.496]
68 As before, there exists an optimal maximin stationary policy. [sent-150, score-0.183]
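A direct way to solve this for one λ is to hand LP (11) to a generic solver; the sketch below does that with cvxpy. The encoding and names are our own, it handles a single λ rather than the full parametric sweep, and it assumes the reward-uncertainty polytopes R_s are nonempty and bounded so the dualization behind (11) is valid.

```python
import numpy as np
import cvxpy as cp

def discounted_rp_lp(r_bar, p, C, d, alpha, gamma, lam):
    """Solve LP (11) for a fixed tradeoff weight lam.

    r_bar[s]   : (A_s,) nominal rewards;  p[s][a] : (S,) next-state distribution
    C[s], d[s] : polytope R_s = {r_s : C[s] r_s >= d[s]}
    alpha      : (S,) initial-state distribution;  gamma : discount factor
    Returns the occupation measure x(s, a) of an optimal stationary policy.
    """
    S = len(r_bar)
    x = [cp.Variable(len(r_bar[s]), nonneg=True) for s in range(S)]   # x(s, a)
    y = [cp.Variable(C[s].shape[0], nonneg=True) for s in range(S)]   # dual vars per state
    perf = sum(r_bar[s] @ x[s] for s in range(S))                     # nominal performance
    robust = sum(d[s] @ y[s] for s in range(S))                       # dualized worst case
    cons = []
    for s_next in range(S):                                           # occupation-measure balance
        inflow = sum(gamma * p[s][a][s_next] * x[s][a]
                     for s in range(S) for a in range(len(r_bar[s])))
        cons.append(cp.sum(x[s_next]) - inflow == alpha[s_next])
    for s in range(S):                                                # ties y_s to x_s
        cons.append(C[s].T @ y[s] == x[s])
    prob = cp.Problem(cp.Maximize(lam * perf + (1 - lam) * robust), cons)
    prob.solve()
    return [xi.value for xi in x], prob.value
```

An optimal stationary policy can then be read off from the returned occupation measure in the usual way, by normalizing per state: π(a|s) ∝ x(s, a).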
69 4 The RP tradeoff in MDPs with uncertain transition probabilities In this section we provide a counterexample which demonstrates that the weighted sum criterion in the most general case, i.e., [sent-152, score-0.294]
70 the uncertain transition probability case, may lead to non-Markovian optimal policies. [sent-154, score-0.327]
71 In the finite horizon MDP shown in Figure 1, S = {s1, s2, s3, s4, s5, t1, t2, t3, t4}; A_{s1} = {a(1, 1)}; A_{s2} = {a(2, 1)}; A_{s3} = {a(3, 1)}; A_{s4} = {a(4, 1)}; and A_{s5} = {a(5, 1), a(5, 2)}. [sent-155, score-0.18]
72 The nominal transition probabilities are p (s2|s1, a(1, 1)) = 0. [sent-157, score-0.482]
73 Observe that the worst parameter realization is p(s4|s2, a(2, 1)) = p(t3|s5, a(5, 2)) = 0. [sent-161, score-0.161]
74 We look for the strategy that maximizes the sum of the nominal reward and the worst-case reward (i.e., the weighted-sum criterion with equal weights). [sent-162, score-0.755]
75 Since multiple actions exist only in state s5, a strategy is determined by the action chosen at s5. [sent-166, score-0.238]
76 In this case, with the nominal transition probability, this trajectory will reach t1 with a reward of 10, regardless of the choice of p. [sent-169, score-0.774]
77 The worst transition is that action a(2, 1) leads to s5 and action a(5, 2) leads to t4, hence the expected reward is 5p + 4(1 − p). [sent-170, score-0.613]
78 In this case, the nominal reward is 5p + 8(1 − p), and the worst case reward is 5p + 4(1 − p). [sent-179, score-1.081]
79 The unique optimal strategy for this example is thus non-Markovian. [sent-183, score-0.127]
80 The optimal strategy is non-Markovian because we are taking expectations over two different probability measures, so the smoothing property of conditional expectation cannot be used in finding the optimal strategy. [sent-185, score-0.179]
81 In state h, the decision maker can choose either to replace the machine, which leads to state 1 deterministically, or to continue running, which with probability p leads to state h + 1. [sent-188, score-0.635]
82 If the machine is in state n, then the decision maker has to replace it. [sent-189, score-0.445]
83 The replacement cost is known exactly to be c_r, and the nominal running cost in state h is c_h. [sent-190, score-0.619]
84 We assume that the realization of the running cost lies in the interval [c_h − δ_h, c_h + δ_h]. [sent-191, score-0.255]
85 For each solution found, we sample the reward 300 times according to a uniform distribution. [sent-195, score-0.373]
86 We normalize the costs, i.e., we divide the cost by the smallest expected nominal cost. [sent-198, score-0.443]
87 Denoting the normalized cost of the ith simulation for strategy j as $s_i(j)$, we use the following function to compare the solutions: $v_j(\alpha) = \frac{1}{300}\sum_{i=1}^{300} |s_i(j)|^{\alpha}$. [sent-199, score-0.278]
88 Note that α = 1 gives the mean of the simulation costs, whereas a larger α puts a higher penalty on deviations, representing a risk-averse decision maker. [sent-200, score-0.205]
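A small sketch of this evaluation step (our own code; the uniform sampling model and the strategy costs are placeholders, not the paper's maintenance MDP simulation) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def compare_strategies(expected_costs, alpha, n_sims=300, noise=0.1):
    """Monte-Carlo comparison in the spirit of the experiment described above.

    expected_costs[j] : expected cost of strategy j (placeholder model: we simply
                        perturb it uniformly to mimic sampled reward realizations).
    Returns v_j(alpha) = mean of |normalized cost|**alpha over the simulations.
    """
    costs = np.array([
        rng.uniform(c * (1 - noise), c * (1 + noise), size=n_sims)
        for c in expected_costs
    ])                                               # shape: (num_strategies, n_sims)
    normalized = costs / np.min(expected_costs)      # divide by smallest expected nominal cost
    return (np.abs(normalized) ** alpha).mean(axis=1)

# alpha = 1 ranks by mean cost; larger alpha penalizes costly deviations more heavily.
print(compare_strategies([10.0, 11.0, 12.5], alpha=1.0))
print(compare_strategies([10.0, 11.0, 12.5], alpha=4.0))
```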
89 Figure 2(b) shows that the solutions that focus on nominal parameters (i.e., λ close to 1) do well when α is small. [sent-201, score-0.426]
90 That is, if the decision maker is risk neutral, then the solutions based on nominal parameters are good. [sent-204, score-0.837]
91 However, these solutions are not robust and are not good choices for risk-averse decision makers. [sent-205, score-0.304]
92 Note that, in this example, the nominal cost is the expected cost for each stage. [sent-206, score-0.472]
93 Even in such a case, we see that risk-averse decision makers can benefit from considering the RP tradeoff. [sent-209, score-0.167]
94 6 Concluding remarks In this paper we proposed a method that directly addresses the robustness versus performance tradeoff by treating the robustness as an optimization objective. [sent-210, score-0.657]
95 When only the rewards are uncertain, we presented an efficient algorithm that computes the whole set of optimal RP tradeoffs for MDPs with finite horizon, infinite horizon discounted reward, and limiting average reward (unichain). [sent-240, score-0.915]
96 For MDPs with uncertain transition probabilities, we showed an example where the solution may be non-Markovian and hence may in general be intractable. [sent-241, score-0.298]
97 The main advantage of the presented approach is that it addresses robustness directly. [sent-242, score-0.151]
98 This frees the decision maker from the need to make probabilistic assumptions on the problem's parameters. [sent-243, score-0.378]
99 It also allows the decision maker to determine the desired robustness-performance tradeoff based on observing the whole curve of possible tradeoffs rather than guessing a single value. [sent-244, score-0.839]
100 Robust control of Markov decision processes with uncertain transition matrices. [sent-278, score-0.446]
wordName wordTfidf (topN-words)
[('nominal', 0.358), ('reward', 0.322), ('tradeoff', 0.294), ('plp', 0.243), ('maker', 0.211), ('rp', 0.19), ('horizon', 0.18), ('decision', 0.167), ('hr', 0.162), ('uncertain', 0.153), ('robustness', 0.151), ('rs', 0.142), ('pareto', 0.141), ('tradeoffs', 0.128), ('ys', 0.128), ('uncertainty', 0.104), ('transition', 0.094), ('unichain', 0.094), ('mdps', 0.092), ('simplex', 0.089), ('discounted', 0.089), ('sj', 0.086), ('lp', 0.085), ('realization', 0.082), ('feasible', 0.082), ('ds', 0.081), ('ch', 0.08), ('worst', 0.079), ('cs', 0.079), ('strategy', 0.075), ('ax', 0.074), ('robust', 0.069), ('piecewise', 0.068), ('solutions', 0.068), ('subject', 0.067), ('state', 0.067), ('stage', 0.066), ('polytope', 0.064), ('st', 0.061), ('si', 0.059), ('mj', 0.059), ('mdp', 0.059), ('action', 0.059), ('ai', 0.058), ('cost', 0.057), ('duality', 0.056), ('programming', 0.056), ('rewards', 0.056), ('bi', 0.056), ('policy', 0.054), ('policies', 0.054), ('maximin', 0.054), ('optimal', 0.052), ('solution', 0.051), ('limiting', 0.049), ('ith', 0.049), ('sup', 0.048), ('worstcase', 0.047), ('maintenance', 0.047), ('bertsimas', 0.047), ('criterion', 0.047), ('parametric', 0.043), ('shie', 0.043), ('nite', 0.042), ('minimize', 0.042), ('overly', 0.042), ('backward', 0.042), ('exists', 0.04), ('admissible', 0.04), ('reinforcement', 0.04), ('equals', 0.039), ('whole', 0.039), ('simulation', 0.038), ('satisfactory', 0.038), ('stationary', 0.037), ('objective', 0.037), ('rn', 0.037), ('actions', 0.037), ('rm', 0.037), ('interval', 0.036), ('entering', 0.036), ('maximize', 0.036), ('history', 0.035), ('risk', 0.033), ('athena', 0.033), ('markov', 0.032), ('solved', 0.032), ('conservative', 0.032), ('august', 0.032), ('cj', 0.032), ('performance', 0.031), ('xs', 0.031), ('concluding', 0.031), ('probabilities', 0.03), ('guess', 0.03), ('remarks', 0.03), ('weighted', 0.029), ('cn', 0.029), ('divide', 0.028), ('lead', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
2 0.19781451 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
3 0.14687039 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
4 0.11520316 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner
Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically significant differences between different genetic strains and stress conditions. 1
5 0.10456688 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
6 0.10353222 154 nips-2006-Optimal Change-Detection and Spiking Neurons
7 0.088656314 124 nips-2006-Linearly-solvable Markov decision problems
8 0.087548532 50 nips-2006-Chained Boosting
9 0.085254416 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
10 0.077382669 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
11 0.074038848 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
12 0.066723578 44 nips-2006-Bayesian Policy Gradient Algorithms
13 0.066704385 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
14 0.06258323 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
15 0.061000451 26 nips-2006-An Approach to Bounded Rationality
16 0.06030973 146 nips-2006-No-regret Algorithms for Online Convex Programs
17 0.060106173 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
18 0.059707239 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
19 0.057431024 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
20 0.056819245 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
topicId topicWeight
[(0, -0.177), (1, -0.008), (2, -0.208), (3, -0.14), (4, 0.005), (5, -0.064), (6, -0.006), (7, 0.017), (8, 0.182), (9, 0.038), (10, 0.018), (11, -0.025), (12, -0.032), (13, 0.032), (14, 0.042), (15, -0.002), (16, 0.044), (17, -0.007), (18, -0.002), (19, -0.099), (20, 0.068), (21, 0.103), (22, 0.144), (23, -0.052), (24, -0.003), (25, 0.05), (26, -0.159), (27, 0.064), (28, -0.01), (29, 0.007), (30, -0.041), (31, -0.085), (32, -0.078), (33, -0.112), (34, -0.04), (35, 0.133), (36, 0.098), (37, 0.042), (38, 0.081), (39, 0.039), (40, 0.023), (41, -0.0), (42, 0.009), (43, 0.102), (44, -0.093), (45, -0.012), (46, 0.087), (47, 0.112), (48, 0.123), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96110874 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
2 0.6895749 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
3 0.66086262 124 nips-2006-Linearly-solvable Markov decision problems
Author: Emanuel Todorov
Abstract: We introduce a class of MPDs which greatly simplify Reinforcement Learning. They have discrete state spaces and continuous control spaces. The controls have the effect of rescaling the transition probabilities of an underlying Markov chain. A control cost penalizing KL divergence between controlled and uncontrolled transition probabilities makes the minimization problem convex, and allows analytical computation of the optimal controls given the optimal value function. An exponential transformation of the optimal value function makes the minimized Bellman equation linear. Apart from their theoretical signi cance, the new MDPs enable ef cient approximations to traditional MDPs. Shortest path problems are approximated to arbitrary precision with largest eigenvalue problems, yielding an O (n) algorithm. Accurate approximations to generic MDPs are obtained via continuous embedding reminiscent of LP relaxation in integer programming. Offpolicy learning of the optimal value function is possible without need for stateaction values; the new algorithm (Z-learning) outperforms Q-learning. This work was supported by NSF grant ECS–0524761. 1
4 0.57662451 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
5 0.5571087 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng
Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1
6 0.53013718 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
7 0.49856004 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
8 0.45258757 154 nips-2006-Optimal Change-Detection and Spiking Neurons
9 0.44808936 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
10 0.42171964 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
11 0.41062632 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
12 0.39396775 50 nips-2006-Chained Boosting
13 0.38813028 192 nips-2006-Theory and Dynamics of Perceptual Bistability
14 0.3586154 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
15 0.35759863 75 nips-2006-Efficient sparse coding algorithms
16 0.35344931 146 nips-2006-No-regret Algorithms for Online Convex Programs
17 0.32899529 44 nips-2006-Bayesian Policy Gradient Algorithms
18 0.32412484 33 nips-2006-Analysis of Representations for Domain Adaptation
19 0.30479774 179 nips-2006-Sparse Representation for Signal Classification
20 0.30139452 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
topicId topicWeight
[(1, 0.096), (2, 0.038), (3, 0.018), (7, 0.057), (9, 0.046), (22, 0.067), (44, 0.059), (57, 0.068), (65, 0.044), (69, 0.017), (71, 0.399)]
simIndex simValue paperId paperTitle
1 0.97500026 145 nips-2006-Neurophysiological Evidence of Cooperative Mechanisms for Stereo Computation
Author: Jason M. Samonds, Brian R. Potetz, Tai S. Lee
Abstract: Although there has been substantial progress in understanding the neurophysiological mechanisms of stereopsis, how neurons interact in a network during stereo computation remains unclear. Computational models on stereopsis suggest local competition and long-range cooperation are important for resolving ambiguity during stereo matching. To test these predictions, we simultaneously recorded from multiple neurons in V1 of awake, behaving macaques while presenting surfaces of different depths rendered in dynamic random dot stereograms. We found that the interaction between pairs of neurons was a function of similarity in receptive fields, as well as of the input stimulus. Neurons coding the same depth experienced common inhibition early in their responses for stimuli presented at their nonpreferred disparities. They experienced mutual facilitation later in their responses for stimulation at their preferred disparity. These findings are consistent with a local competition mechanism that first removes gross mismatches, and a global cooperative mechanism that further refines depth estimates. 1 In trod u ction The human visual system is able to extract three-dimensional (3D) structures in random noise stereograms even when such images evoke no perceptible patterns when viewed monocularly [1]. Bela Julesz proposed that this is accomplished by a stereopsis mechanism that detects correlated shifts in 2D noise patterns between the two eyes. He also suggested that this mechanism likely involves cooperative neural processing early in the visual system. Marr and Poggio formalized the computational constraints for solving stereo matching (Fig. 1a) and devised an algorithm that can discover the underlying 3D structures in a variety of random dot stereogram patterns [2]. Their algorithm was based on two rules: (1) each element or feature is unique (i.e., can be assigned only one disparity) and (2) surfaces of objects are cohesive (i.e., depth changes gradually across space). To describe their algorithm in neurophysiological terms, we can consider neurons in primary visual cortex as simple element or feature detectors. The first rule is implemented by introducing competitive interactions (mutual inhibition) among neurons of different disparity tuning at each location (Fig. 1b, blue solid horizontal or vertical lines), allowing only one disparity to be detected at each location. The second rule is implemented by introducing cooperative interactions (mutual facilitation) among neurons tuned to the same depth (image disparity) across different spatial locations (Fig. 1b, along the red dashed diagonal lines). In other words, a disparity estimate at one location is more likely to be correct if neighboring locations have similar disparity estimates. A dynamic system under such constraints can relax to a stable global disparity map. Here, we present neurophysiological evidence of interactions between disparity-tuned neurons in the primary visual cortex that is consistent with this general approach. We sampled from a variety of spatially distributed disparity tuned neurons (see electrodes Fig. 1b) while displaying DRDS stimuli defined at various disparities (see stimulus Fig.1b). We then measured the dynamics of interactions by assessing the temporal evolution of correlation in neural responses. a Left Image b Right Image Electrodes Disparity Left Image ? Stimulus Right Image Figure 1: (a) Left and right images of random dot stereogram (right image has been shifted to the right). 
(b) 1D graphical depiction of competition (blue solid lines) and cooperation (red dashed lines) among disparity-tuned neurons with respect to space as defined by Marr and Poggio’s stereo algorithm [2]. 2 2.1 Methods Recording and stimulation a Posterior - Anterior Recordings were made in V1 of two awake, behaving macaques. We simultaneously recorded from 4-8 electrodes providing data from up to 10 neurons in a single recording session (some electrodes recorded from as many as 3 neurons). We collected data from 112 neurons that provided 224 pairs for cross-correlation analysis. For stimuli, we used 12 Hz dynamic random dot stereograms (DRDS; 25% density black and white pixels on a mean luminance background) presented in a 3.5-degree aperture. Liquid crystal shutter goggles were used to present random dot patterns to each eye separately. Eleven horizontal disparities between the two eyes, ranging from ±0.9 degrees, were tested. Seventy-four neurons (66%) had significant disparity tuning and 99 pairs (44%) were comprised of neurons that both had significant disparity tuning (1-way ANOVA, p<0.05). b 5mm Medial - Lateral 100µV 0.2ms 1° Figure 2: (a) Example recording session from five electrodes in V1. (b) Receptive field (white box—arrow represents direction preference) and random dot stereogram locations for same recording session (small red square is the fixation spot). 2.2 Data analysis Interaction between neurons was described as
2 0.94728512 36 nips-2006-Attentional Processing on a Spike-Based VLSI Neural Network
Author: Yingxue Wang, Rodney J. Douglas, Shih-Chii Liu
Abstract: The neurons of the neocortex communicate by asynchronous events called action potentials (or ’spikes’). However, for simplicity of simulation, most models of processing by cortical neural networks have assumed that the activations of their neurons can be approximated by event rates rather than taking account of individual spikes. The obstacle to exploring the more detailed spike processing of these networks has been reduced considerably in recent years by the development of hybrid analog-digital Very-Large Scale Integrated (hVLSI) neural networks composed of spiking neurons that are able to operate in real-time. In this paper we describe such a hVLSI neural network that performs an interesting task of selective attentional processing that was previously described for a simulated ’pointer-map’ rate model by Hahnloser and colleagues. We found that most of the computational features of their rate model can be reproduced in the spiking implementation; but, that spike-based processing requires a modification of the original network architecture in order to memorize a previously attended target. 1
3 0.88070011 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
Author: Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray
Abstract: Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
same-paper 4 0.82182205 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
5 0.68499839 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
Author: Thomas Voegtlin
Abstract: In biological neurons, the timing of a spike depends on the timing of synaptic currents, in a way that is classically described by the Phase Response Curve. This has implications for temporal coding: an action potential that arrives on a synapse has an implicit meaning, that depends on the position of the postsynaptic neuron on the firing cycle. Here we show that this implicit code can be used to perform computations. Using theta neurons, we derive a spike-timing dependent learning rule from an error criterion. We demonstrate how to train an auto-encoder neural network using this rule. 1
6 0.67753845 59 nips-2006-Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons
7 0.60571861 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
8 0.60113603 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron
9 0.56997895 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex
10 0.55881834 18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons
11 0.55220664 154 nips-2006-Optimal Change-Detection and Spiking Neurons
12 0.5428409 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
13 0.50332892 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
14 0.49257982 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
15 0.4882099 16 nips-2006-A Theory of Retinal Population Coding
16 0.47317797 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding
17 0.46135828 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
18 0.45360842 124 nips-2006-Linearly-solvable Markov decision problems
19 0.4467935 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
20 0.44581586 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency