nips nips2011 nips2011-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
Reference: text
sentIndex sentText sentNum sentScore
1 As such, prior work on BO, both batch and sequential, completely ignores the problem of how to schedule experiments under fixed experimental budget and time constraints. [sent-21, score-0.412]
2 First, we have l available labs (which may correspond to experimental stations at one location or to physically distinct laboratories), allowing up to l concurrent experiments. [sent-38, score-0.356]
3 First, a scheduler should ensure that all n experiments complete within the horizon h, which encourages high concurrency. [sent-43, score-0.321]
4 Second, we wish to select new experiments given as many previously completed experiments as possible to make more intelligent experiment selections, which encourages low concurrency. [sent-44, score-0.316]
5 Assuming a known density function pd for the experiment durations, the inputs to our problem include the total number of available labs l, the total number of experiments n, and the time horizon h by which we must finish. [sent-54, score-0.552]
6 The goal is to design a policy π for selecting when to start experiments and which ones to start to optimize f . [sent-55, score-0.415]
7 Specifically, the inputs to π are the set of completed experiments and their outcomes, the set of currently running experiments with their elapsed running time, the number of free labs, and the remaining time till the horizon. [sent-56, score-0.34]
8 Any run of the policy ends when either n experiments are completed or the time horizon is reached, resulting in a set X of n or fewer completed experiments. [sent-58, score-0.651]
9 The objective is to obtain a policy with small regret, which is the expected difference between the optimal value of f and the value of f for the predicted best experiment in X. [sent-59, score-0.295]
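To make the objective concrete, here is a minimal sketch (not the paper's code) of how the regret of a completed run would be measured, assuming a known optimum f_opt and a hypothetical `predict_best` routine standing in for whatever surrogate model the policy uses to pick its predicted-best input from X:

```python
import numpy as np

def simple_regret(f, f_opt, X, predict_best):
    """Regret = f_opt - f(x_hat), where x_hat is the input the policy
    predicts to be best among its completed experiments X."""
    x_hat = predict_best(X)          # predicted-best completed experiment
    return f_opt - f(x_hat)

# Toy usage on a 1-D function, with a "predictor" that just returns the
# completed input with the best observed value.
f = lambda x: -(x - 0.3) ** 2
X = np.array([0.1, 0.25, 0.7])
print(simple_regret(f, 0.0, X, lambda xs: xs[np.argmax(f(xs))]))
```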
10 3 Overview of General Approach: A policy for our problem must make two types of decisions: 1) scheduling when to start new experiments, and 2) selecting the specific experiments to start. [sent-64, score-0.575]
11 We assume a black box function SelectBatch for intelligently selecting the k ≥ 1 experiments based on both completed and currently running experiments. [sent-66, score-0.27]
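Since SelectBatch is treated as a black box, the schedulers only rely on its interface. A minimal sketch of a plausible signature (hypothetical, not the authors' API) is:

```python
from typing import Protocol, Sequence, Tuple

class SelectBatchFn(Protocol):
    def __call__(
        self,
        completed: Sequence[Tuple[Sequence[float], float]],  # (input, observed outcome) pairs
        running: Sequence[Sequence[float]],                   # inputs of currently running experiments
        k: int,                                               # number of new experiments requested
    ) -> Sequence[Sequence[float]]:
        """Return k new experiment inputs, chosen using both completed and running experiments."""
        ...
```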
12 (Figure: regret for 30 different schedulers on two BO benchmarks.) A poor scheduler, for example, is one that starts all n experiments at the same time, so SelectBatch must choose them all without any completed results to learn from. [sent-96, score-0.328]
13 Similarly, as the number of experiments started together (the batch size) increases, we might also expect diminishing returns since SelectBatch must choose the experiments based on the same prior experiments. [sent-101, score-0.258]
14 Note that while the schedules are offline, the overall BO policy has online characteristics, since the exact experiments to run are only specified when they need to be started by SelectBatch, based on the most recent information. [sent-111, score-0.611]
15 This offline scheduling approach is often convenient in real experimental domains where it is useful to plan out a static equipment/personnel schedule for the duration of a project. [sent-112, score-0.533]
16 Staged Schedules: A staged schedule defines a consecutive sequence of N experimental stages, denoted by a sequence of tuples (ni, di), i = 1..N, where 0 < ni ≤ l, Σi di ≤ h, and Σi ni ≤ n. [sent-116, score-0.705]
17 Stage i begins by starting up ni new experiments selected by SelectBatch using the most recent information, and ends after a duration of di, upon which stage i + 1 starts. [sent-117, score-0.412]
18 In some applications, staged schedules are preferable as they allow project planning to focus on a relatively small number of time points (the beginning of each stage). [sent-118, score-0.302]
19 If, because of this, at the beginning of stage i there are not ni free labs, the experiments will wait till labs free up. [sent-120, score-0.581]
20 We say that an execution E of a staged schedule S is safe if each experiment is completed within its specified duration in S. [sent-121, score-0.744]
21 We say that a staged schedule S is p-safe if with probability at least p an execution of S is safe which provides a probabilistic guarantee that all n experiments complete within the horizon h. [sent-122, score-0.677]
22 Further, it ensures with probability p that the maximum number of concurrent experiments when executing S is maxi ni (since experiments from two stages will not overlap with probability p). [sent-123, score-0.409]
23 As such, we are interested in finding staged schedules that are p-safe for a user specified p, e. [sent-124, score-0.302]
24 The CPE of any safe execution of S (slightly abusing notation) is: CPE(S) = Σ_{i=2}^{N} ni Σ_{j=1}^{i−1} nj. [sent-128, score-0.298]
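As a sanity check of the definition, a small helper (a sketch, not the authors' code) that evaluates CPE(S) from the stage sizes alone:

```python
def staged_cpe(stage_sizes):
    """CPE of a safe execution of a staged schedule: each experiment in stage i
    'sees' all experiments from stages 1..i-1 as prior information."""
    cpe, prior = 0, 0
    for n_i in stage_sizes:
        cpe += n_i * prior
        prior += n_i
    return cpe

# Example with n = 10 experiments and l = 5 labs.
print(staged_cpe([5, 5]))           # two big stages  -> CPE = 25
print(staged_cpe([2, 2, 2, 2, 2]))  # five uniform stages -> CPE = 40
```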
25 It turns out that for any fixed number of stages N , the schedules that maximize CPE(S) must be uniform. [sent-131, score-0.313]
26 A staged schedule is defined to be uniform if ∀i, j, |ni − nj| ≤ 1, i.e., all stage sizes differ by at most one. [sent-132, score-0.424]
27 For any number of experiments n and labs l, let SN be the set of corresponding N stage schedules, where N ≥ ⌈n/l⌉. [sent-136, score-0.426]
28 It is easy to verify that for a given n and l, an N stage uniform schedule achieves a strictly higher CPE than any N − 1 stage schedule. [sent-138, score-0.463]
29 This implies that we should prefer uniform schedules with maximum number of stages allowed by the p-safeness restriction. [sent-139, score-0.326]
30 This motivates us to solve the following problem: Find a p-safe uniform schedule with maximum number of stages. [sent-140, score-0.295]
31 Algorithm 1 Algorithm for computing a p-safe uniform schedule with maximum number of stages. [sent-141, score-0.295]
32 For each value of N , the call to MaxProbUniform computes a uniform schedule S with the highest probability of a safe execution, among all N stage uniform schedules. [sent-143, score-0.514]
33 If the resulting schedule is p-safe then we consider N + 1 stages. [sent-144, score-0.254]
34 Otherwise, there is no uniform N stage schedule that is p-safe and we return a uniform N − 1 stage schedule, which was computed in the previous iteration. [sent-145, score-0.504]
35 It remains to describe the MaxProbUniform function, which computes a uniform N stage schedule S = (ni, di), i = 1..N, that maximizes the probability of a safe execution. [sent-146, score-0.523]
36 First, any N stage uniform schedule must have N′ = (n mod N) stages with n′ = ⌊n/N⌋ + 1 experiments and N − N′ stages with n′ − 1 experiments. [sent-147, score-0.662]
37 The MaxProbUniform problem is now reduced to computing the durations di of S that maximize the probability of safeness for each given ni . [sent-152, score-0.319]
38 For any duration distribution pd that is log-concave, if an N stage schedule S = (ni, di), i = 1..N, is p-safe, then there is a p-safe N stage schedule S′ = (ni, di′), i = 1..N, such that if ni = nj then di′ = dj′. [sent-155, score-1.121]
39 i=1 This lemma suggests that any stages with equal ni ’s should have equal di ’s to maximize the probability of safe execution. [sent-156, score-0.374]
40 Thus we only need to consider schedules with two durations, d′ for stages with ni = n′ and d′′ for stages with ni = n′ − 1. [sent-158, score-0.596]
41 Based on this, for any value of d′ (with d′′ = (h − N′d′)/(N − N′) fixed by using the full horizon), the probability of a safe execution of the uniform schedule using durations d′ and d′′ is as follows, where Pd is the CDF of pd. [sent-160, score-0.515]
42 For any log-concave pd, if MaxProbUniform is computed by maximizing Equation 1 over d′ and a p-safe uniform schedule exists, Algorithm 1 returns a maximum-stage p-safe uniform schedule. [sent-164, score-0.446]
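A compact sketch of MaxProbUniform and Algorithm 1 follows. Assumptions beyond the text: durations are taken to be i.i.d. Exponential(1) (a log-concave stand-in for pd), Equation 1 is reconstructed as the product of per-experiment completion probabilities, and the maximization over d′ is done by a simple grid search rather than whatever method the paper uses.

```python
import numpy as np

# Assumption: i.i.d. Exponential(1) durations (log-concave), standing in for p_d.
duration_cdf = lambda d: 1.0 - np.exp(-d)

def max_prob_uniform(N, n, h, cdf, grid=2000):
    """Among uniform N-stage schedules, search over d' (duration of the larger
    stages) for the split maximizing the probability of a safe execution,
    taken here as Pd(d')**(N'*n') * Pd(d'')**((N-N')*(n'-1))
    with the two durations constrained by N'*d' + (N-N')*d'' = h."""
    n_hi = n // N + 1
    N_hi = n % N                       # stages holding n' = n_hi experiments
    N_lo = N - N_hi                    # stages holding n' - 1 experiments
    if N_hi == 0:                      # perfectly even split: every stage gets n // N and h / N
        d = h / N
        return cdf(d) ** n, [(n // N, d)] * N
    best_p, best_s = -1.0, None
    for d_hi in np.linspace(h / (2 * grid), h / N_hi, grid, endpoint=False):
        d_lo = (h - N_hi * d_hi) / N_lo
        p = cdf(d_hi) ** (N_hi * n_hi) * cdf(d_lo) ** (N_lo * (n_hi - 1))
        if p > best_p:
            best_p, best_s = p, [(n_hi, d_hi)] * N_hi + [(n_hi - 1, d_lo)] * N_lo
    return best_p, best_s

def algorithm1(n, l, h, cdf, p=0.95):
    """Sketch of Algorithm 1: keep increasing the number of stages N while a
    p-safe uniform N-stage schedule still exists; return the last one found."""
    N, best = int(np.ceil(n / l)), None
    while N <= n:
        prob, sched = max_prob_uniform(N, n, h, cdf)
        if prob < p:
            break
        best, N = sched, N + 1
    return best

print(algorithm1(n=10, l=5, h=20.0, cdf=duration_cdf, p=0.9))
```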
43 This class allows the start times of different labs to be decoupled, desirable in settings where labs are run by independent experimenters. [sent-167, score-0.6]
44 Further, our online scheduling approach is based on repeatedly calling an offline scheduler, which requires the flexibility to make schedules for labs in different stages of execution. [sent-168, score-0.802]
45 An independent lab (IL) schedule S specifies a number of labs k ≤ l and, for each lab i, a number of experiments mi such that Σi mi = n. [sent-169, score-0.82]
46 The execution of S runs each lab independently, by having each lab start up experiments whenever it moves to the next stage. [sent-174, score-0.336]
47 We say that an execution of an IL schedule is safe if all experiments finish within their specified durations, which also yields a notion of p-safeness. [sent-179, score-0.496]
48 Intuitively, CPE will be maximized if the amount of concurrency during an execution is minimized, suggesting the use of as few labs as possible. [sent-181, score-0.373]
49 This motivates the problem of finding a p-safe IL schedule that uses the minimum number of labs. [sent-182, score-0.254]
50 Starting with k = 1, we compute a k-lab IL schedule with the goal of maximizing the probability of safe execution. [sent-185, score-0.618]
51 If this probability is less than p, we increment k, and otherwise output the schedule for k labs. [sent-186, score-0.254]
52 To compute a schedule for each value of k, we first allocate the number of experiments mi across k labs as uniformly as possible. [sent-187, score-0.633]
53 In particular, (n mod k) labs will have ⌊n/k⌋ + 1 experiments and k − (n mod k) labs will have ⌊n/k⌋ experiments. [sent-188, score-0.662]
54 This choice is motivated by the intuition that the best way to maximize the probability of a safe execution is to distribute the work across labs as uniformly as possible. [sent-189, score-0.468]
55 Given mi for each lab, we assign all durations of lab i to be h/mi , which can be shown to be optimal for log-concave pd . [sent-190, score-0.332]
56 In this way, for each value of k the schedule we compute has just two possible values of mi and labs with the same mi have the same stage durations. [sent-191, score-0.682]
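A sketch of the offline IL construction described above. The safety model (each of lab i's mi experiments must finish within its slot of length h/mi, independently across labs) and the Exponential(1) duration CDF are assumptions for illustration.

```python
import numpy as np

def min_lab_il_schedule(n, l, h, cdf, p=0.95):
    """Sketch of the offline IL scheduler: find the smallest k <= l such that
    spreading the n experiments as evenly as possible over k labs, with lab i
    giving each of its m_i experiments a slot of length h / m_i, is safe
    (every experiment finishes inside its slot) with probability at least p."""
    for k in range(1, l + 1):
        m_hi, rem = n // k + 1, n % k
        loads = [m_hi] * rem + [m_hi - 1] * (k - rem)   # per-lab experiment counts
        p_safe = float(np.prod([cdf(h / m) ** m for m in loads if m > 0]))
        if p_safe >= p:
            return k, loads
    return None          # even with all l labs, no p-safe IL schedule exists

cdf = lambda d: 1.0 - np.exp(-d)    # assumed Exponential(1) durations
print(min_lab_il_schedule(n=10, l=5, h=20.0, cdf=cdf))   # -> (3, [4, 3, 3])
```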
57 The flexibility of the online approaches offers the potential to outperform offline schedules by adapting to specific stochastic outcomes observed during experimental runs. [sent-193, score-0.264]
58 Below we first describe two baseline online approaches, followed by our main approach, policy switching, which aims to directly optimize CPE. [sent-194, score-0.293]
59 This baseline policy simply tries to finish all of the n experiments as quickly as possible. [sent-196, score-0.316]
60 As such, it keeps all l labs busy as long as there are experiments left to run. [sent-197, score-0.342]
61 Specifically, whenever a lab (or labs) becomes free, the policy immediately uses SelectBatch with the latest information to select new experiments to start right away. [sent-198, score-0.449]
62 The OnMEL policy simply restricts OnFCP to use only k labs, where k is the minimum number of labs required to guarantee with probability at least p that all n experiments complete within the horizon. [sent-202, score-0.586]
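A minimal event-loop sketch of the OnFCP baseline (all names hypothetical; `select_batch` and `sample_duration` stand in for the black-box selector and the duration model):

```python
import heapq, random

def run_onfcp(n, l, h, select_batch, sample_duration):
    """OnFCP: whenever labs free up and experiments remain, immediately start
    new ones; returns the (input, finish-time) pairs completed within horizon h."""
    completed, running, started = [], [], 0          # running: heap of (finish_time, x)
    t = 0.0
    while started < n or running:
        free = l - len(running)
        if started < n and free > 0:
            k = min(free, n - started)
            for x in select_batch(completed, [x for _, x in running], k):
                heapq.heappush(running, (t + sample_duration(), x))
            started += k
        t, x = heapq.heappop(running)                # advance to the next completion
        if t > h:
            break                                    # horizon reached
        completed.append((x, t))
    return completed

# Toy usage: "experiments" are random points in [0, 1] with exponential durations.
select_batch = lambda done, running, k: [random.random() for _ in range(k)]
print(len(run_onfcp(n=10, l=3, h=20.0, select_batch=select_batch,
                    sample_duration=lambda: random.expovariate(1.0))))
```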
63 Our policy switching approach decides the number of new experiments to start at each decision epoch. [sent-205, score-0.471]
64 The motivation behind policy switching is to exploit the availability of a policy generator that can produce multiple policies at any decision epoch, where at least one of them is expected to be good. [sent-207, score-0.765]
65 Given such a generator, the goal is to define a new (switching) policy that performs as well or better than the best of the generated policies in any state. [sent-208, score-0.325]
66 This is motivated by prior work on policy switching [6] over a fixed policy library, and generalizes that work to handle arbitrary policy generators instead of static policy libraries. [sent-210, score-1.067]
67 Below we describe the general approach and then the specific policy generator that we use. [sent-211, score-0.323]
68 We use s to denote the experimental state of the scheduling problem, which encodes the number of completed experiments and ongoing experiments with their elapsed running time. [sent-213, score-0.542]
69 We assume access to a policy generator Π(s, t) which returns a set of base scheduling policies (possibly nonstationary) given inputs s and t. [sent-214, score-0.65]
70 Prior work on policy switching [6] corresponds to the case where Π(s, t) returns a fixed set of policies regardless of s and t. [sent-215, score-0.438]
71 Given Π(s, t), π̄(s, t, π) denotes the resulting switching policy based on s, t, and the base policy π selected in the previous epoch. [sent-216, score-0.605]
72 The decision returned by π̄ is computed by first conducting N simulations of each policy returned by Π(s, t), along with π, to estimate their CPEs. [sent-217, score-0.33]
73 The base policy with the highest estimated CPE is then selected and its decision is returned by π̄. [sent-218, score-0.315]
74 The need to compare to the previous policy π is due to the use of a dynamic policy generator, rather than a fixed library. [sent-219, score-0.488]
75 The base policy passed into policy switching for the first decision epoch can be arbitrary. [sent-220, score-0.652]
76 In particular, the CPE of the switching policy will not be much worse than the best of the policies produced by our generator given accurate simulations. [sent-222, score-0.495]
77 We say that a CPE estimator is ε-accurate if it can estimate the CPE C_t^π(s) of any base policy π for any s and t within an accuracy bound of ε. [sent-223, score-0.27]
78 Let Π(s, t) be a policy generator and π̄ be the switching policy computed with ε-accurate estimates. [sent-226, score-0.658]
79 For any state s, stages-to-go t, and base policy π, C_t^π̄(s, π) ≥ max_{π′ ∈ Π(s,t) ∪ {π}} C_t^{π′}(s) − 2tε. [sent-227, score-0.27]
80 We use a simple policy generator Π(s, t) that makes multiple calls to the offline IL scheduler described earlier. [sent-228, score-0.501]
81 Specifically, rather than follow the fixed offline schedule we may choose to use fewer labs and hence improve CPE. [sent-231, score-0.544]
82 Note that the definition of these policies depends on s and t and hence they cannot be viewed as a fixed set of static policies as used by traditional policy switching. [sent-247, score-0.406]
83 In the initial state s0, the policy π0 corresponds to the offline IL schedule, and hence the above theorem guarantees that we will not perform much worse than the offline IL, with the expectation of performing much better. [sent-248, score-0.254]
84 Whenever policy switching selects a πi with i > 0, no new experiments will be started and we wait for the next decision epoch. [sent-249, score-0.515]
85 For i = 0, it will apply the offline IL scheduler to return a p-safe schedule to start immediately, which may require starting new labs to ensure high probability of completing n experiments. [sent-250, score-0.759]
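A schematic of the switching decision at one epoch (not the authors' implementation): `simulate_cpe` is assumed to run Monte Carlo rollouts of a candidate policy from the current state and return its average CPE, and candidate policies are assumed to expose a `decision` method.

```python
def switching_decision(state, t, prev_policy, policy_generator, simulate_cpe, n_sims=50):
    """Policy switching: at each decision epoch, estimate the CPE of every policy
    produced by the generator (plus the previously selected one) via simulation,
    then return the decision of the highest-scoring policy."""
    candidates = list(policy_generator(state, t)) + [prev_policy]
    scores = [simulate_cpe(pi, state, t, n_sims) for pi in candidates]
    best = candidates[max(range(len(candidates)), key=scores.__getitem__)]
    return best.decision(state, t), best   # keep `best` to pass in at the next epoch
```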
86 Given the set of completed experiments O and on-going experiments A, SelectBatch selects k new experiments. [sent-252, score-0.27]
87 We evaluate our scheduling policies using 6 well-known synthetic benchmark functions (shown in Tab. [sent-259, score-0.306]
88 The Hydrogen data is produced by a study on biosolar hydrogen production [5], where the goal was to maximize the hydrogen production of a particular bacterium by optimizing pH and nitrogen levels. [sent-261, score-0.264]
89 Given l, n and h, to evaluate policy π using function f (with a set of initial observed experiments), we execute π and get a set X of n or fewer completed experiments. [sent-271, score-0.365]
90 We also compare against a fully sequential policy (i.e., selecting one experiment at a time) using SelectBatch, which can be viewed as an effective upper bound on the optimal performance of any constrained scheduler because it ignores the time horizon (h = ∞). [sent-277, score-0.323]
91 To handle this, the scheduler first considers using already-executing labs, taking into account how long they have been running. [sent-280, score-0.448]
92 If more labs are required to ensure p-safeness, new ones are added. [sent-281, score-0.27]
93 This suggests that there is limited benefit in these scenarios to using the more flexible IL schedules, which were primarily introduced for use in the online scheduling context. [sent-396, score-0.266]
94 However, the offline schedules purposefully wait for experiments to complete before starting up new experiments, which tends to improve the CPE values. [sent-399, score-0.308]
95 Finally, policy switching consistently outperforms other policies (excluding h = ∞) on the medium horizon setting and performs similarly in the other settings. [sent-402, score-0.507]
96 For short horizons, there is less opportunity for scheduling choices and for longer horizons the scheduling problem is easier and hence the offline approaches are more competitive. [sent-404, score-0.451]
97 Further examination of the schedules produced by PS indicates that although it begins with the same number of labs as OfIL, PS often selects fewer labs in later steps if early experiments are completed sooner than expected, which leads to higher CPE and consequently better performance. [sent-406, score-0.95]
98 We considered offline and online approaches for scheduling experiments in this setting, relying on a black box function to intelligently select specific experiments at their scheduled start times. [sent-409, score-0.497]
99 Our offline scheduling approaches significantly outperformed some natural baselines and our online approach of policy switching was the best overall performer. [sent-411, score-0.602]
100 For example, future extensions could take into account varying costs and duration distributions across labs and experiments. [sent-414, score-0.328]
wordName wordTfidf (topN-words)
[('cpe', 0.576), ('labs', 0.27), ('schedule', 0.254), ('policy', 0.244), ('scheduling', 0.198), ('schedules', 0.192), ('selectbatch', 0.192), ('scheduler', 0.178), ('ine', 0.176), ('durations', 0.132), ('bo', 0.127), ('staged', 0.11), ('ni', 0.109), ('il', 0.101), ('completed', 0.101), ('onmel', 0.096), ('safe', 0.094), ('stages', 0.093), ('switching', 0.091), ('pd', 0.088), ('stage', 0.084), ('maxprobuniform', 0.082), ('onfcp', 0.082), ('policies', 0.081), ('generator', 0.079), ('execution', 0.076), ('lab', 0.075), ('hydrogen', 0.072), ('experiments', 0.072), ('horizon', 0.071), ('ofil', 0.069), ('ps', 0.068), ('concurrent', 0.063), ('duration', 0.058), ('nish', 0.056), ('azimi', 0.055), ('horizons', 0.055), ('regret', 0.051), ('experiment', 0.051), ('di', 0.05), ('online', 0.049), ('intelligently', 0.048), ('anode', 0.041), ('fuelcell', 0.041), ('ofstaged', 0.041), ('uniform', 0.041), ('start', 0.038), ('fern', 0.038), ('batch', 0.037), ('mi', 0.037), ('production', 0.036), ('cosines', 0.036), ('started', 0.032), ('requesting', 0.031), ('ongoing', 0.028), ('maximize', 0.028), ('concurrency', 0.027), ('cpes', 0.027), ('fuel', 0.027), ('hart', 0.027), ('mfcs', 0.027), ('nitrogen', 0.027), ('priore', 0.027), ('schedulers', 0.027), ('shekel', 0.027), ('benchmark', 0.027), ('base', 0.026), ('running', 0.026), ('decision', 0.026), ('budget', 0.026), ('wait', 0.025), ('mod', 0.025), ('selects', 0.025), ('exibility', 0.025), ('microbial', 0.024), ('ct', 0.023), ('selecting', 0.023), ('diminishing', 0.023), ('experimental', 0.023), ('returns', 0.022), ('elapsed', 0.022), ('conducting', 0.022), ('electricity', 0.022), ('ph', 0.022), ('run', 0.022), ('till', 0.021), ('epoch', 0.021), ('dj', 0.021), ('select', 0.02), ('optimizing', 0.02), ('fewer', 0.02), ('medium', 0.02), ('budgeted', 0.02), ('ends', 0.02), ('baselines', 0.02), ('nj', 0.019), ('returned', 0.019), ('limited', 0.019), ('costly', 0.019), ('starting', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
2 0.19224146 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.11706511 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
4 0.11593754 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T 2/3 where T is the horizon time. 1
5 0.10791271 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.10374233 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
7 0.090929106 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
8 0.082176939 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
9 0.074264728 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
10 0.067643546 283 nips-2011-The Fixed Points of Off-Policy TD
11 0.062208209 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
12 0.060312606 271 nips-2011-Statistical Tests for Optimization Efficiency
13 0.059945326 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
14 0.057360608 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
15 0.054497272 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
16 0.052303229 56 nips-2011-Committing Bandits
17 0.049252886 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
18 0.048153568 256 nips-2011-Solving Decision Problems with Limited Information
19 0.046561852 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
20 0.04543208 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
topicId topicWeight
[(0, 0.116), (1, -0.129), (2, 0.043), (3, 0.129), (4, -0.118), (5, 0.033), (6, -0.01), (7, 0.04), (8, -0.014), (9, -0.007), (10, -0.013), (11, 0.037), (12, -0.034), (13, -0.067), (14, -0.061), (15, -0.069), (16, 0.021), (17, 0.085), (18, 0.009), (19, -0.003), (20, -0.002), (21, -0.061), (22, 0.025), (23, -0.046), (24, -0.025), (25, 0.006), (26, 0.01), (27, 0.033), (28, 0.053), (29, 0.026), (30, 0.084), (31, -0.026), (32, -0.0), (33, -0.008), (34, -0.01), (35, 0.004), (36, 0.034), (37, 0.088), (38, -0.049), (39, 0.046), (40, 0.038), (41, 0.048), (42, 0.023), (43, 0.026), (44, -0.049), (45, 0.059), (46, -0.148), (47, -0.063), (48, 0.048), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.94112831 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
2 0.77576017 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.73813009 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama
Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1
4 0.73739475 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
5 0.71276063 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.68127692 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
7 0.66787976 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
8 0.59822857 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
9 0.5483008 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
10 0.52411461 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
11 0.50405335 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
12 0.45846036 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
13 0.41799137 256 nips-2011-Solving Decision Problems with Limited Information
14 0.40798211 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
15 0.39472428 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
16 0.38111502 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
17 0.38040185 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
18 0.3586483 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
19 0.35375834 32 nips-2011-An Empirical Evaluation of Thompson Sampling
20 0.33885667 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
topicId topicWeight
[(0, 0.024), (4, 0.028), (20, 0.021), (26, 0.022), (31, 0.063), (33, 0.012), (43, 0.038), (45, 0.079), (57, 0.029), (74, 0.032), (83, 0.025), (99, 0.538)]
simIndex simValue paperId paperTitle
1 0.94536179 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1
same-paper 2 0.92448848 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
3 0.91257983 54 nips-2011-Co-regularized Multi-view Spectral Clustering
Author: Abhishek Kumar, Piyush Rai, Hal Daume
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
4 0.87125653 8 nips-2011-A Model for Temporal Dependencies in Event Streams
Author: Asela Gunawardana, Christopher Meek, Puyang Xu
Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1
5 0.83561361 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima
Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.
6 0.79857171 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
7 0.53964525 186 nips-2011-Noise Thresholds for Spectral Clustering
8 0.53870285 102 nips-2011-Generalised Coupled Tensor Factorisation
9 0.53073657 263 nips-2011-Sparse Manifold Clustering and Embedding
10 0.52998394 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
11 0.5148856 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
12 0.51365685 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
13 0.50821429 29 nips-2011-Algorithms and hardness results for parallel large margin learning
14 0.4949387 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
15 0.48821014 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
16 0.48150724 72 nips-2011-Distributed Delayed Stochastic Optimization
17 0.47918662 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
18 0.47702116 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
19 0.47490102 198 nips-2011-On U-processes and clustering performance
20 0.47371125 66 nips-2011-Crowdclustering