nips nips2007 nips2007-176 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
Reference: text
sentIndex sentText sentNum sentScore
1 We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. [sent-7, score-0.647]
2 We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution. [sent-8, score-0.813]
3 Using a combination of probabilistic and dynamic programming tools, it has been shown that when the decision horizon is infinite (i.e., no deadline), the optimal policy is to accumulate sensory evidence for one alternative versus the other until a fixed threshold, and report the corresponding hypothesis [1]. [sent-10, score-0.113] [sent-12, score-0.212]
5 Moreover, there is variability associated with that deadline either due to external variability associated with the deadline imposition itself, or due to internal timing uncertainty about how much total time is allowed and how much time has already elapsed. [sent-16, score-1.186]
6 In either case, with respect to the observer’s internal timer, the deadline can be viewed as a stochastic quantity. [sent-17, score-0.581]
7 We show through analytical and numerical analysis that the optimal policy is a monotonically declining decision threshold over time. [sent-19, score-0.278]
8 We then use numerical simulations to examine the optimal policy in some specific examples (Sec. [sent-25, score-0.197]
9 Define x^t = (x_1, . . . , x_t) to be the vector of observations made by time t. [sent-39, score-0.137]
10 Defining p^t ≜ P{θ = 1 | x^t}, we observe that p^{t+1} may be obtained iteratively from p^t via Bayes' rule: p^{t+1} = P{θ = 1 | x^{t+1}} = p^t f_1(x_{t+1}) / (p^t f_1(x_{t+1}) + (1 − p^t) f_0(x_{t+1})). (1) [sent-41, score-1.597]
11 Let D be a deadline drawn from a known distribution that is independent of the observations x_t. [sent-42, score-2.167]
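As a concrete illustration of the recursion in Eq. (1), a minimal Python sketch of the posterior update is given below; the function names and the Bernoulli observation model in the usage lines are our own illustrative choices, not notation from the paper.

```python
def bayes_update(p, x, f1, f0):
    """One step of Eq. (1): p^{t+1} from p^t and a new observation x.

    f1 and f0 are the single-observation likelihoods under theta = 1 and theta = 0.
    """
    num = p * f1(x)
    return num / (num + (1.0 - p) * f0(x))

# Illustrative Bernoulli model: P{x = 1 | theta = 1} = 0.6, P{x = 1 | theta = 0} = 0.4.
f1 = lambda x: 0.6 if x == 1 else 0.4
f0 = lambda x: 0.4 if x == 1 else 0.6

p = 0.5                            # flat prior P{theta = 1}
for x in [1, 1, 0, 1]:             # a short observation sequence
    p = bayes_update(p, x, f1, f0) # p tracks P{theta = 1 | x^t}
```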
12 We will assume that the deadline D is observed immediately and effectively terminates the trial. [sent-43, score-0.548]
13 Let c > 0 be the cost associated with each unit time of decision delay, and d ≥ .5 be the cost associated with exceeding the deadline, where both c and d are normalized against the (unit) cost of making an incorrect decision. [sent-44, score-0.13] [sent-45, score-0.13]
15 A decision-policy π is a sequence of mappings, one for each time t, from the observations so far to the set of possible actions: stop and choose θ = 0; stop and choose θ = 1; or continue sampling. [sent-49, score-0.186]
16 We define τπ to be the time when the decision is made to stop sampling under decision-policy π, and δπ to be the hypothesis chosen at this time – both are random variables dependent on the sequence of observations. [sent-50, score-0.16]
17 We may also define σ_π ≜ inf{t ∈ N : π_t(x^t) ∈ {0, 1}} to be the time when the policy would choose to stop sampling if the deadline were to fail to occur. [sent-55, score-0.738]
18 1 Dynamic Programming: A decision policy is characterized by how τ and δ are generated as a function of the data observed so far. [sent-59, score-0.157]
19 The optimal policy decides whether to stop based on whether p^t lies inside a set C^t ⊆ [0, 1]. [sent-61, score-0.945]
20 That is, the optimal policy is to iteratively compute pt based on incoming data, and to decide for the respective hypothesis as soon as it hits either a high (δ = 1) or low (δ = 0) threshold. [sent-63, score-0.967]
21 The red line denotes the cost of stopping at time t as a function of the current belief pt = p. [sent-69, score-0.937]
22 The blue line denotes the cost of continuing at least one more time step, as a function of pt . [sent-70, score-0.966]
23 The black line denotes the cost of continuing at least two more time steps, as a function of pt . [sent-71, score-0.947]
24 Because the cost of continuing is concave in pt (Lemma 1), and larger than stopping for pt ∈ {0, 1} (Lemma 4), the continuation region is an interval delimited by where the costs of continuing and stopping intersect (blue dashed lines). [sent-72, score-2.187]
25 Moreover, because the cost of continuing two more timesteps is always larger than that of continuing one more for a given amount of belief (Lemmas 2 and 3), that “window” of continuation narrows over time (Main Theorem). [sent-73, score-0.426]
26 The value function V : N × [0, 1] → R+ specifies the minimal cost (incurred by the optimal policy) at time t, given that the deadline has not yet occurred, that x^t have been observed, and that the current cumulative evidence for θ = 1 is p^t: V(t, p^t) ≜ inf_{τ≥t, δ} E_{θ,D,x}[ l(τ, δ; θ, D) | D > t, p^t ]. [sent-76, score-3.09]
27 The cost associated with continuing at time t, known as the Q-factor for continuing and denoted by Q, takes the form Q(t, p^t) ≜ inf_{τ≥t+1, δ} E_{θ,D,x}[ l(τ, δ; θ, D) | D > t, p^t ]. [sent-77, score-1.841]
28 Figure 1: Comparison of the cost Q̄(t, p) of stopping at time t (red); the cost Q(t, p) of continuing at time t (blue solid line); and Q(t + 1, p) − c (black solid line), which is the cost of continuing at time t + 1 minus the adjustment Q̄(t + 1, p) − Q̄(t, p) = c. [sent-85, score-1.324]
29 The continuation region C t is the interval between the intersections of the solid blue and red lines, marked by the blue dotted lines, and the continuation region C t+1 is the interval between the intersections of the solid black and red lines, marked by the black dotted lines. [sent-86, score-0.59]
30 Note that, in general, both V (t, pt ) and Q(t, pt ) may be difficult to compute due to the need to optimize over infinitely many decision policies. [sent-88, score-1.547]
31 Conversely, the cost associated with stopping at time t, known as the Q-factor for stopping and denoted by Q̄, is easily computed as Q̄(t, p^t) = inf_{δ=0,1} E_{θ,D,x}[ l(t, δ; θ, D) | D > t, p^t ] = min{p^t, 1 − p^t} + ct, (4) where the infimum is obtained by choosing δ = 0 if p^t ≤ .5 and δ = 1 otherwise. [sent-89, score-3.283]
32 An optimal stopping rule is to stop the first time the expected cost of continuing exceeds that of stopping, and to choose δ = 0 or δ = 1 to minimize the probability of error given the accumulated evidence (see [10]). [sent-91, score-0.36]
33 That is, τ* = inf{t ≥ 0 : Q̄(t, p^t) ≤ Q(t, p^t)} and δ* = 1{p^{τ*} ≥ 1/2}. [sent-92, score-1.502]
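A small sketch of the stopping side of this rule, in the same spirit: the stopping cost follows Eq. (4), while the continuing cost Q is assumed to be supplied by some other routine (for instance the backward induction sketched in the simulations section below); the helper names are ours.

```python
def stopping_cost(p, t, c):
    # Qbar(t, p) = min(p, 1 - p) + c * t: expected cost of stopping now (Eq. 4).
    return min(p, 1.0 - p) + c * t

def should_stop(p, t, c, continuing_cost):
    # tau* is the first t at which the stopping cost no longer exceeds the continuing cost.
    return stopping_cost(p, t, c) <= continuing_cost(t, p)

def decide(p):
    # delta* = 1{p >= 1/2}: report the currently more probable hypothesis.
    return 1 if p >= 0.5 else 0
```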
34 We define the continuation region at time t by C^t ≜ {p^t ∈ [0, 1] : Q̄(t, p^t) > Q(t, p^t)}, so that τ* = inf{t ≥ 0 : p^t ∉ C^t}. [sent-93, score-2.415]
35 Although we have obtained an expression for the optimal policy in terms of Q̄(t, p) and Q(t, p), computing Q(t, p) is difficult in general. [sent-94, score-0.149]
36 The function p → Q(t, p) is concave with respect to p for each t ∈ N. [sent-96, score-1.547]
37 First, the expectation is conditioned on pt , which contains all the information about θ available in the past observations xt , and makes it unnecessary for the optimal policy to depend on xt except through pt . [sent-104, score-1.875]
38 Second, dependence on pt in the optimal policy may be made implicit by allowing the infimum to be attained by different τ and δ for different values of pt but removing explicit dependence on pt from the individual policies over which the infimum is taken. [sent-105, score-2.424]
39 With τ and δ chosen from this restricted set of policies, we note that the distribution of the future observations x^{t+1} is entirely determined by θ, and so we have E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ, p^t ] = E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ ]. [sent-106, score-0.773]
40 Summing over the possible values of θ, we may then write: E_{θ,D,x^{t+1}}[ l(τ, δ; θ, D) | p^t ] = Σ_{k∈{0,1}} E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = k ] P{θ = k | p^t} = E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 0 ] (1 − p^t) + E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 1 ] p^t. [sent-107, score-3.004]
41 Eq. (3) can then be rewritten as: Q(t, p^t) = inf_{τ≥t+1, δ} { E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 0 ] (1 − p^t) + E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 1 ] p^t }, where this infimum is again understood to be taken over the set of policies depending only upon observations after time t. [sent-109, score-2.375]
42 Since neither E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 0 ] nor E_{D,x^{t+1}}[ l(τ, δ; θ, D) | θ = 1 ] depends on p^t, this is the infimum of a collection of linear functions in p^t, and hence is concave in p^t [9]. [sent-110, score-2.298]
43 We now need a lemma describing how expected cost depends on the distribution of the deadline. [sent-111, score-0.12]
44 Let D′ be a deadline whose distribution is different than that of D. [sent-112, score-0.548]
45 Let π ∗ be the policy that is optimal given that the deadline has distribution D, and denote σπ∗ by σ ∗ . [sent-113, score-0.697]
46 Then define V′(t, p^t) ≜ E[ min(p^{σ*}, 1 − p^{σ*}) 1{σ* < D′} + d 1{σ* ≥ D′} + c min(σ*, D′) | D′ > t, p^t ], so that V′ gives the expected cost of taking the stopping time σ*, which is optimal for deadline D, and applying it to the situation with deadline D′. [sent-114, score-2.048]
47 Similarly, let Q′(t, p^t) and Q̄′(t, p^t) denote the corresponding expected costs under σ* and D′ given that we continue or stop, respectively, at time t, given p^t and D′ > t. [sent-115, score-2.322]
48 Note that Q̄′(t, p^t) = Q̄(t, p^t) = min(p^t, 1 − p^t) + ct. [sent-116, score-2.253]
49 These definitions are the basis for the following lemma, which essentially shows that replacing the deadline D with a less urgent deadline D′ lowers the cost. [sent-117, score-1.096]
50 Now consider a finite horizon version of the problem where σ ∗ is only optimal among stopping times bounded above by a finite integer T . [sent-126, score-0.134]
51 We will show the lemma for this case, and the lemma for the infinite horizon version of the problem follows by taking the limit as T → ∞. [sent-127, score-0.139]
52 If σ* chooses to stop at t when p^t = p, then V(t, p) = Q̄(t, p) = Q̄′(t, p) = V′(t, p). [sent-132, score-0.796]
53 If this requirement is not met and p^t is such that d < min(p^t, 1 − p^t), then we may prefer to get timed out rather than choose δ = 0 or δ = 1 and suffer the expected penalty of min(p^t, 1 − p^t) for choosing incorrectly. [sent-135, score-2.291]
54 In this situation, since the conditional probability P{D = t + 1 | D > t} that we will time out in the next time period grows as time moves forward, the continuation region may expand with time rather than contract. [sent-136, score-0.222]
55 Under most circumstances, however, it seems reasonable to assume the deadline cost to be at least as large as that of making an error. [sent-137, score-0.613]
56 We now state Lemma 3, which shows that the cost of delaying by one time period is at least as large as the continuation cost c, but may be larger because the delay causes the deadline to approach more rapidly. [sent-138, score-0.844]
57 For each t ∈ N and p ∈ (0, 1), Q(t − 1, p^{t−1} = p) ≤ Q(t, p^t = p) − c. [sent-140, score-0.751]
58 Let D′ = D + 1; then Q′(t, p) = E[ min(p^{σ*}, 1 − p^{σ*}) 1{σ* < D′} + d 1{σ* ≥ D′} + c min(σ*, D′) | D′ > t, p^t = p ], so Q(t − 1, p) ≤ Q′(t, p) − c. [sent-146, score-0.751]
59 On the event pt = 0, we have that P{θ = 0} = 1 and the policy attaining the infimum in (3) is τ ∗ = t+1, δ ∗ = 0. [sent-151, score-0.879]
60 Thus, Q(t, 0) becomes Q(t, 0) = E_{D,x^{t+1}}[ l(τ*, δ*; θ, D) | D > t, p^t = 0 ] = E_{D,x^{t+1}}[ l(τ*, δ*; θ, D) | D > t, θ = 0 ] = E_{D,x^{t+1}}[ d 1{t+1 ≥ D} + c(t + 1) | D > t, θ = 0 ] = c(t+1) + d P{D = t+1 | D > t}. [sent-152, score-0.751]
61 Similarly, on the event pt = 1, we have that P{θ = 1} = 1 and the policy attaining the infimum in (3) is τ ∗ = t+1, δ ∗ = 1. [sent-153, score-0.879]
62 As noted, the continuation region C^t is the set of p such that Q(t, p) ≤ Q̄(t, p). To show that C^t is either empty or an interval, we note that Q(t, p) is a concave function of p (Lemma 1) whose values at the endpoints p = 0, 1 are greater than the corresponding values of Q̄(t, p) (Lemma 4). [sent-158, score-0.208]
63 At each time t ∈ N, the optimal continuation region C t is either empty or a closed interval, and C t+1 ⊆ C t . [sent-165, score-0.22]
64 Finally, C^t ⊆ [a_t, b_t] ⊆ [a_t, 1/2] ∪ [1/2, b_t] ⊆ C^t, and we must have C^t = [a_t, b_t]. [sent-191, score-0.135]
65 We also include the following proposition, which shows that if D is finite with probability 1 then the continuation region must eventually narrow to nothing. [sent-192, score-0.142]
66 By neglecting the error probability and including only continuation and deadline costs, we obtain Q(t, p^t) ≥ d P{D = t+1 | D > t} + c(t+1). [sent-200, score-1.413]
67 Bounding the error probability by 1/2, we obtain Q̄(t, p^t) ≤ ct + 1/2. [sent-201, score-0.792]
68 Thus, Q(t, p^t) − Q̄(t, p^t) ≥ c + d P{D = t + 1 | D > t} − 1/2. [sent-202, score-1.502]
69 This implies that, for t ≥ T and p^t ∈ [0, 1], Q(t, p^t) − Q̄(t, p^t) > 0 and C^t = ∅. [sent-204, score-2.253]
70 3 Computational simulations: We conducted a series of simulations in which we computed the continuation region and the distributions of response time and accuracy under the optimal policy, for several choices of the parameters c and d and of the distribution of the deadline D. [sent-205, score-0.941]
71 We computed optimal policies for two different forms of deadline distribution: first for a deterministic deadline fixed to some known constant; and second for a gamma distributed deadline. [sent-213, score-1.223]
72 The gamma distribution with parameters k > 0 and β > 0 has density (β^k/Γ(k)) x^{k−1} e^{−βx} for x > 0, where Γ(·) is the gamma function. [sent-214, score-0.15]
73 A fixed deadline T may actually be seen as a limiting case of a gamma-distributed deadline by taking both k and β to infinity such that k/β = T is fixed. [sent-216, score-1.096]
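Since the computations below only need the deadline through its hazard P{D = t+1 | D > t}, one possible discretization of a gamma (or fixed) deadline to unit time steps is sketched here; the unit-step discretization and the function names are our own assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import gamma

def gamma_deadline_hazard(k, beta, t_max):
    """Return h[t] = P{t < D <= t+1 | D > t} for a gamma(k, rate=beta) deadline,
    discretized to unit time steps, for t = 0, ..., t_max - 1."""
    t = np.arange(t_max + 1)
    cdf = gamma.cdf(t, a=k, scale=1.0 / beta)     # P{D <= t}
    surv = np.maximum(1.0 - cdf, 1e-12)           # P{D > t}
    return (cdf[1:] - cdf[:-1]) / surv[:-1]

def fixed_deadline_hazard(T, t_max):
    # A fixed integer deadline T >= 1: the hazard is 0 before T and 1 at the step reaching T.
    h = np.zeros(t_max)
    if T - 1 < t_max:
        h[T - 1] = 1.0                            # P{D = T | D > T - 1} = 1
    return h
```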
74 Then we calculated value functions and Q-factors for previous times recursively according to Bellman's equation: Q(t, p) = E_{p^{t+1}}[ V(t + 1, p^{t+1}) | p^t = p ]; V(t, p) = min(Q(t, p), Q̄(t, p)). [sent-228, score-0.751]
75 Then we note that P{x_{t+1} = 1 | p^t} = P{x_{t+1} = 1 | θ = 1} p^t + P{x_{t+1} = 1 | θ = 0}(1 − p^t) = p^t q_1 + (1 − p^t) q_0, and similarly P{x_{t+1} = 0 | p^t} = p^t (1 − q_1) + (1 − p^t)(1 − q_0). [sent-232, score-5.257]
76 Then Q(t, p^t) = (c(t+1) + d) P{D ≤ t+1 | D > t} + P{D > t+1 | D > t} [ V(t+1, g(p^t, 1)) (p^t q_1 + (1 − p^t) q_0) + V(t+1, g(p^t, 0)) (p^t (1 − q_1) + (1 − p^t)(1 − q_0)) ], where g(p^t, x_{t+1}) denotes the Bayes update of Eq. (1). [sent-233, score-3.755]
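The following sketch (ours, not the authors' implementation) carries out this backward induction on a grid of belief values, assuming the Bernoulli observation model with q_0 = P{x = 1 | θ = 0}, q_1 = P{x = 1 | θ = 1}, a deadline hazard array as above, and a horizon T at which stopping is forced.

```python
import numpy as np

def continuation_regions(q0, q1, c, d, hazard, T, n_grid=1001):
    """Backward induction for V, Q and the continuation regions C^t (a sketch)."""
    p = np.linspace(0.0, 1.0, n_grid)

    def g(pv, x):
        # Bayes update of Eq. (1) for a Bernoulli observation x in {0, 1}.
        l1 = q1 if x == 1 else 1.0 - q1
        l0 = q0 if x == 1 else 1.0 - q0
        return pv * l1 / (pv * l1 + (1.0 - pv) * l0)

    V = np.minimum(p, 1.0 - p) + c * T            # force a decision at t = T
    regions = [None] * T
    for t in range(T - 1, -1, -1):
        Qbar = np.minimum(p, 1.0 - p) + c * t     # cost of stopping now
        p1 = p * q1 + (1.0 - p) * q0              # P{x_{t+1} = 1 | p^t}
        Vnext1 = np.interp(g(p, 1), p, V)         # V(t+1, g(p^t, 1))
        Vnext0 = np.interp(g(p, 0), p, V)         # V(t+1, g(p^t, 0))
        Q = (c * (t + 1) + d) * hazard[t] + (1.0 - hazard[t]) * (
            Vnext1 * p1 + Vnext0 * (1.0 - p1))
        keep = Q < Qbar                           # beliefs for which continuing is cheaper
        regions[t] = (p[keep].min(), p[keep].max()) if keep.any() else None
        V = np.minimum(Q, Qbar)                   # V(t, p)
    return p, regions
```

With a gamma hazard from the helper above and, say, q_0 = 0.45 and q_1 = 0.55, the recovered intervals should narrow over time, in line with the Main Theorem.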
77 We computed continuation regions C^t from these Q-factors, and then used Monte Carlo simulation with 10^6 samples for each problem setting to estimate P{δ = θ | τ = t} and P{τ = t} as functions of t. [sent-234, score-0.114]
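A corresponding Monte Carlo sketch is given below; the naming, and the simplification of ignoring trials terminated by the deadline itself, are ours.

```python
import numpy as np

def simulate_policy(regions, q0, q1, n_trials=1_000_000, seed=0):
    """Estimate the stopping-time distribution and accuracy of the computed policy.

    Deadline occurrences are ignored here (a simplification): we only simulate the
    observation-decision process and record when the belief leaves C^t."""
    rng = np.random.default_rng(seed)
    T = len(regions)
    taus = np.empty(n_trials, dtype=np.int64)
    correct = np.empty(n_trials, dtype=bool)
    for i in range(n_trials):
        theta = rng.integers(2)                  # true hypothesis
        q = q1 if theta == 1 else q0             # P{x = 1 | theta}
        p = 0.5
        t = 0
        while t < T and regions[t] is not None and regions[t][0] <= p <= regions[t][1]:
            x = rng.random() < q                 # x_{t+1} ~ Bernoulli(q)
            l1, l0 = (q1, q0) if x else (1.0 - q1, 1.0 - q0)
            p = p * l1 / (p * l1 + (1.0 - p) * l0)
            t += 1
        taus[i] = t
        correct[i] = (p >= 0.5) == (theta == 1)
    return taus, correct
```

P{τ = t} can then be estimated as np.bincount(taus) / n_trials, and P{δ = θ | τ = t} as the mean of correct within each value of taus.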
78 As shown in Fig. 3A, the decision boundaries for a fixed deadline (solid blue) narrow smoothly toward the midline. [sent-237, score-0.631]
79 Clearly, at the last opportunity for responding before the deadline, the optimal policy would always generate a response (and therefore the thresholds merge), since we assumed that the cost of the deadline penalty is d ≥ .5, which is at least the expected cost of an error (the optimal policy chooses the hypothesis with posterior probability ≥ .5, so the error probability is at most .5). [sent-238, score-0.329] [sent-261, score-0.192]
80 Figure 2 (panels A: varying std(D); B: varying mean(D); axes: Time, 0–40, vs. Probability, 0–1): Plots of the continuation region C^t (blue) and the probability of a correct response P{δ = θ | τ = t} (red). [sent-254, score-0.158]
82 At the time step before, the optimal policy would only continue if one more data point is going to improve the belief state enough to offset the extra time cost c. [sent-264, score-0.295]
83 Therefore, the optimal policy only continues for a small “window” around .5. [sent-265, score-0.162]
84 When uncertainty about the deadline increases (larger std(D); shown in dashed and dash-dotted blue lines), the optimal thresholds are squeezed toward each other and to the left, the intuition being that the threat of encountering the deadline spreads earlier and earlier into the trial. [sent-268, score-1.282]
85 The red lines denote the average accuracy for different stopping times obtained from a million Monte Carlo simulations of the observation-decision process. [sent-269, score-0.138]
86 They closely follow the decision thresholds (since the threshold is on the posterior probability pτ ), but are slightly larger, because pτ must exceed the threshold, and pt moves in discrete increments due to the discrete Bernoulli process. [sent-270, score-0.883]
87 The effect of decreasing the mean deadline is to shift the decision boundaries left-ward, as shown in Fig. [sent-271, score-0.609]
88 The effect of increasing the cost of time c is to squeeze the boundaries toward the midline (Fig. [sent-273, score-0.166]
89 4 Discussion: In this work, we formalized the problem of sequential hypothesis testing (of two alternatives) under the pressure of a stochastically sampled deadline, and characterized the optimal policy. [sent-277, score-0.142]
90 For a large class of deadline distributions (including gamma, normal, exponential, delta), we showed that the optimal policy is to report a hypothesis as soon as the posterior belief hits one of a pair of monotonically declining thresholds (toward the midline). [sent-278, score-0.877]
91 This generalizes the classical infinite horizon case in the limit when the deadline goes to infinity, and the optimal policy reverts to a pair of fixed thresholds as in the sequential probability ratio test [1]. [sent-279, score-0.85]
92 We showed that the decision policy becomes more conservative (thresholds pushed outward and to the right) when there’s less uncertainty about the deadline, when the mean of the deadline is larger, when the linear temporal cost is larger, and when the deadline cost is smaller. [sent-280, score-1.397]
93 This assumption implies that, if the deadline has not occurred already, then the likelihood that it will happen soon grows larger and larger, as time passes. [sent-282, score-0.608]
94 The assumption is violated by multi-modal distributions, for which there is a large probability the deadline will occur at some early point in time, but if the deadline does not occur by that point in time then will not occur until some much later time. [sent-283, score-1.116]
95 This assumption is met by a fixed deadline (std(D)→ 0), and also includes the classical infinite-horizon case (D → ∞) as a special case (and the optimal policy reverts to the sequential probability ratio test). [sent-284, score-0.79]
96 We used gamma distributions for the deadline in the numerical simulations. [sent-288, score-0.631]
97 There are several empirical properties about timing uncertainty in humans and animals that make the gamma distribution particularly suitable. [sent-289, score-0.117]
98 First, realizations from the gamma distribution are always non-negative, which is consistent with the assumption that a subject never thinks a deadline has passed before the experiment has started. [sent-290, score-0.616]
99 Second, if we fix the rate parameter β and vary the shape k, then we obtain a collection of deadline distributions with different means whose variance and mean are in a fixed ratio, which is consistent with experimental observations [12]. [sent-291, score-0.57]
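A quick numerical check of this scaling property, with arbitrary illustrative parameter values:

```python
from scipy.stats import gamma

beta = 2.0                          # fixed rate parameter
for k in (4, 8, 16):                # varying shape gives varying mean deadline
    D = gamma(a=k, scale=1.0 / beta)
    # mean = k / beta grows, but var / mean = 1 / beta stays constant
    print(k, D.mean(), D.var(), D.var() / D.mean())
```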
100 Subjects who have more internal uncertainty, and therefore larger variance in their perceived deadline stochasticity, should respond to stimuli earlier and with lower accuracy. [sent-296, score-0.598]
wordName wordTfidf (topN-words)
[('pt', 0.751), ('deadline', 0.548), ('continuation', 0.114), ('policy', 0.112), ('continuing', 0.098), ('xt', 0.095), ('mum', 0.071), ('gamma', 0.068), ('stopping', 0.068), ('cost', 0.065), ('inf', 0.058), ('lemma', 0.055), ('thresholds', 0.051), ('stop', 0.045), ('decision', 0.045), ('bt', 0.045), ('concave', 0.045), ('ct', 0.041), ('timer', 0.041), ('optimal', 0.037), ('interval', 0.036), ('declining', 0.035), ('sequential', 0.035), ('min', 0.034), ('simulations', 0.033), ('std', 0.033), ('ps', 0.033), ('dp', 0.032), ('blue', 0.032), ('hypothesis', 0.03), ('horizon', 0.029), ('region', 0.028), ('pressure', 0.028), ('continue', 0.028), ('intersect', 0.026), ('princeton', 0.026), ('deadlines', 0.023), ('midline', 0.023), ('powell', 0.023), ('reverts', 0.023), ('soon', 0.022), ('toward', 0.022), ('policies', 0.022), ('observations', 0.022), ('empty', 0.021), ('costs', 0.021), ('squeeze', 0.02), ('intersections', 0.02), ('time', 0.02), ('red', 0.02), ('threshold', 0.02), ('met', 0.02), ('fix', 0.02), ('dynamic', 0.02), ('programming', 0.019), ('solid', 0.019), ('timing', 0.019), ('opportunity', 0.019), ('accumulate', 0.019), ('larger', 0.018), ('internal', 0.017), ('chose', 0.017), ('lines', 0.017), ('response', 0.016), ('attaining', 0.016), ('adjustment', 0.016), ('boundaries', 0.016), ('stochastic', 0.016), ('exceed', 0.016), ('animals', 0.016), ('responding', 0.016), ('continuity', 0.016), ('al', 0.015), ('ratio', 0.015), ('earlier', 0.015), ('numerical', 0.015), ('hits', 0.015), ('dynamics', 0.014), ('delay', 0.014), ('window', 0.014), ('evidence', 0.014), ('uncertainty', 0.014), ('density', 0.014), ('monotonically', 0.014), ('behavioral', 0.013), ('choose', 0.013), ('penalty', 0.013), ('black', 0.013), ('varying', 0.013), ('continues', 0.013), ('trivially', 0.013), ('marked', 0.013), ('belief', 0.013), ('lemmas', 0.013), ('nite', 0.012), ('tools', 0.012), ('conditioned', 0.012), ('prefer', 0.012), ('testing', 0.012), ('situation', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
2 0.23194358 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full informa√ tion case, the upper bound on the regret is O∗ ( nT ), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm √ which achieves O∗ (n3/2 T ) regret — all previous (nontrivial) bounds here were O(poly(n)T 2/3 ) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is √ √ exponential ( T K√ vs. T log K). We also present lower bounds showing that this gap is at least n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.14711873 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
4 0.13189687 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
Author: Pierre Dangauthier, Ralf Herbrich, Tom Minka, Thore Graepel
Abstract: We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players’ lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player’s ability to force a draw provides significantly better predictive power. 1
5 0.09768267 195 nips-2007-The Generalized FITC Approximation
Author: Andrew Naish-guzman, Sean Holden
Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(N M 2 ) training complexity—asymptotically the same as related sparse methods such as the informative vector machine [2], but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
6 0.07863424 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
7 0.071913838 214 nips-2007-Variational inference for Markov jump processes
8 0.064501427 162 nips-2007-Random Sampling of States in Dynamic Programming
9 0.063406058 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
10 0.063336805 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
11 0.061113264 102 nips-2007-Incremental Natural Actor-Critic Algorithms
12 0.060612947 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
13 0.060176842 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
14 0.056392476 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
15 0.051153898 21 nips-2007-Adaptive Online Gradient Descent
16 0.048442923 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
17 0.045032736 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
18 0.043855723 146 nips-2007-On higher-order perceptron algorithms
19 0.041423883 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
20 0.039542504 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
topicId topicWeight
[(0, -0.123), (1, -0.161), (2, 0.058), (3, 0.021), (4, 0.087), (5, -0.007), (6, -0.038), (7, -0.046), (8, -0.023), (9, -0.051), (10, 0.025), (11, 0.059), (12, -0.046), (13, 0.027), (14, -0.033), (15, 0.013), (16, 0.034), (17, -0.016), (18, -0.017), (19, -0.01), (20, -0.099), (21, 0.026), (22, 0.021), (23, -0.117), (24, 0.094), (25, -0.131), (26, -0.111), (27, -0.079), (28, -0.086), (29, 0.104), (30, -0.111), (31, -0.117), (32, 0.134), (33, -0.017), (34, 0.191), (35, -0.102), (36, 0.092), (37, -0.146), (38, 0.098), (39, 0.117), (40, 0.055), (41, -0.165), (42, -0.145), (43, -0.129), (44, -0.094), (45, -0.068), (46, 0.016), (47, 0.037), (48, -0.136), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.97381181 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
2 0.59242213 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full informa√ tion case, the upper bound on the regret is O∗ ( nT ), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm √ which achieves O∗ (n3/2 T ) regret — all previous (nontrivial) bounds here were O(poly(n)T 2/3 ) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is √ √ exponential ( T K√ vs. T log K). We also present lower bounds showing that this gap is at least n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.5211795 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
4 0.4425737 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
Author: Pierre Dangauthier, Ralf Herbrich, Tom Minka, Thore Graepel
Abstract: We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players’ lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player’s ability to force a draw provides significantly better predictive power. 1
5 0.3470763 214 nips-2007-Variational inference for Markov jump processes
Author: Manfred Opper, Guido Sanguinetti
Abstract: Markov jump processes play an important role in a large number of application domains. However, realistic systems are analytically intractable and they have traditionally been analysed using simulation based techniques, which do not provide a framework for statistical inference. We propose a mean field approximation to perform posterior inference and parameter estimation. The approximation allows a practical solution to the inference problem, while still retaining a good degree of accuracy. We illustrate our approach on two biologically motivated systems.
6 0.33968812 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
7 0.32307911 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
8 0.28969833 195 nips-2007-The Generalized FITC Approximation
9 0.28520685 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
10 0.25053966 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
11 0.24414144 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
12 0.22414564 162 nips-2007-Random Sampling of States in Dynamic Programming
13 0.22391668 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
14 0.22181574 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
15 0.21698838 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
16 0.21630377 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
17 0.20563643 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
18 0.20356812 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
19 0.2035372 53 nips-2007-Compressed Regression
20 0.19640233 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
topicId topicWeight
[(5, 0.035), (9, 0.012), (13, 0.033), (16, 0.036), (17, 0.257), (18, 0.031), (19, 0.016), (21, 0.072), (31, 0.016), (34, 0.035), (35, 0.037), (47, 0.101), (49, 0.024), (83, 0.103), (85, 0.016), (87, 0.013), (90, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.76473153 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
2 0.56646305 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
Author: Michael Ross, Andrew Cohen
Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1
3 0.5657481 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
4 0.56382102 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
5 0.56266689 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray
Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1
6 0.56066287 86 nips-2007-Exponential Family Predictive Representations of State
7 0.55992472 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
8 0.55939323 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
9 0.55871767 158 nips-2007-Probabilistic Matrix Factorization
10 0.55848706 96 nips-2007-Heterogeneous Component Analysis
11 0.55756885 63 nips-2007-Convex Relaxations of Latent Variable Training
12 0.5574916 100 nips-2007-Hippocampal Contributions to Control: The Third Way
13 0.55749017 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
14 0.55598438 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
15 0.55577087 156 nips-2007-Predictive Matrix-Variate t Models
16 0.55404913 174 nips-2007-Selecting Observations against Adversarial Objectives
17 0.55398864 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
18 0.55378848 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
19 0.55320472 43 nips-2007-Catching Change-points with Lasso
20 0.55316675 24 nips-2007-An Analysis of Inference with the Universum