nips nips2013 nips2013-241 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler† (Department of Computer Science, University of Colorado, Boulder; †Department of Psychology, University of California, San Diego). Abstract: Psychologists are interested in developing instructional policies that boost student learning. [sent-4, score-0.55]
2 An instructional policy specifies the manner and content of instruction. [sent-5, score-0.71]
3 For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. [sent-6, score-0.671]
4 Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). [sent-9, score-1.176]
5 We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. [sent-10, score-0.255]
6 For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. [sent-11, score-0.546]
7 We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. [sent-12, score-0.543]
8 Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. [sent-14, score-0.86]
9 A critical factor is their instructional policy, which specifies the manner and content of instruction. [sent-17, score-0.275]
10 Electronic tutoring systems have been constructed that implement domain-specific instructional policies. [sent-18, score-0.479]
11 As a complement to these efforts, we are interested in addressing fundamental questions in the design of instructional policies that pertain to basic cognitive skills. [sent-27, score-0.517]
12 Empirical evaluation of blocking and interleaving policies involves training a set of human subjects with a fixed-length sequence of exemplars drawn from one policy or the other. [sent-32, score-1.228]
13 During training, exemplars are presented one at a time, and typically subjects are asked to guess the category label associated with the exemplar, after which they are told the correct label. [sent-33, score-0.48]
14 Such an experiment yields an intrinsically noisy evaluation of the two policies, limited by the number of subjects and inter-individual variability. [sent-35, score-0.263]
15 Consequently, the goal of a typical psychological experiment is to find a statistically reliable difference between the training conditions, allowing the experimenter to conclude that one policy is superior. [sent-36, score-0.617]
16 Blocking and interleaving are but two points in a space of policies that could be parameterized by the probability, ρ, that the exemplar presented on trial t + 1 is drawn from the same category as the exemplar on trial t. [sent-37, score-0.714]
17 (There are many more interesting ways of constructing a policy space that includes blocking and interleaving. [sent-39, score-0.566]
18 For example, ρ might vary with t or with a student’s running-average classification accuracy, but we will use the simple fixed-ρ policy space for illustration.) [sent-41, score-0.435]
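As an illustrative sketch of this fixed-ρ policy space (not code from the paper), the Python snippet below generates a category-label sequence in which the label on trial t+1 repeats the label on trial t with probability ρ; ρ near 1 approximates blocking and ρ near 0 approximates interleaving. All function and parameter names are our own.

```python
import numpy as np

def generate_sequence(rho, num_trials, num_categories=2, rng=None):
    """Generate category labels where each label repeats the previous one
    with probability rho (blocking ~ rho near 1, interleaving ~ rho near 0)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = [int(rng.integers(num_categories))]
    for _ in range(num_trials - 1):
        if rng.random() < rho:
            labels.append(labels[-1])            # repeat the previous category
        else:
            # switch to a different category, chosen uniformly at random
            choices = [c for c in range(num_categories) if c != labels[-1]]
            labels.append(int(rng.choice(choices)))
    return np.array(labels)

# Example: a mostly blocked sequence versus a mostly interleaved one
print(generate_sequence(rho=0.9, num_trials=20))
print(generate_sequence(rho=0.1, num_trials=20))
```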
19 2 Optimizing an instructional policy. Our goal is to discover the optimum in policy space—the policy that maximizes mean accuracy or another measure of performance over a population of students. [sent-43, score-1.661]
20 Evaluating a single policy with a degree of certainty requires testing many subjects to reduce sampling variance due to individual differences and factors outside of experimental control. [sent-46, score-0.678]
21 To avoid exceeding a subject budget, the flexibility of the POMDP framework demands additional bias, imposed via restrictions on the class of candidate policies and strong assumptions about the learner. [sent-54, score-0.273]
22 The approach we will propose likewise requires specification of a constrained policy space, but does not make assumptions about the internal state of the learner or the temporal dynamics of learning. [sent-55, score-0.435]
23 In contrast to POMDP approaches, the cognitive agnosticism of our approach allows it to be readily applied to arbitrary policy optimization problems. [sent-56, score-0.473]
24 Neither POMDP nor direct-optimization approaches model the policy space explicitly. [sent-60, score-0.435]
25 From a function-approximation perspective, the goal is to determine the shape and optimum of the function that maps policies to performance—call this the policy performance function or PPF. [sent-62, score-0.685]
26 Traditional experimental design—which aims to show a statistically reliable difference between two alternative policies—requires testing many subjects for each policy. [sent-64, score-0.277]
27 Figure 1: A hypothetical 1D instructional policy space. [sent-69, score-0.71]
28 The solid black line represents an (unknown) policy performance function. [sent-70, score-0.435]
29 The grey disks indicate the noisy outcome of single-subject experiments conducted at specified points in policy space. [sent-71, score-0.509]
30 Our method evaluates a large number of points in policy space each with few subjects instead of a small number of points each with many subjects. [sent-74, score-0.641]
31 Our vision is a completely automated system that selects points in policy space to evaluate, runs an experiment—an evaluation of some policy with one or a small number of subjects—and repeats until a budget for data collection is exhausted. [sent-76, score-0.934]
32 Like other Bayesian models, GPR makes efficient use of limited data, which is particularly critical to us because our budget is expressed in terms of the number of subjects required. [sent-83, score-0.27]
33 The primary constraint imposed by the Gaussian Process prior—that of function smoothness—can readily be ensured with the appropriate design of policy spaces. [sent-85, score-0.435]
34 To illustrate GPR in surrogate-based optimization, Figure 1 depicts a hypothetical 1D instructional policy space, along with the true PPF and the GPR posterior conditioned on the outcome of a set of single-subject experiments at various points in policy space. [sent-86, score-1.197]
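To make the surrogate-modeling idea concrete, here is a minimal numpy sketch of Gaussian process regression over a 1D policy space fit to noisy single-subject outcomes, in the spirit of Figure 1. The squared-exponential kernel, its hyperparameters, and the synthetic PPF are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.15, signal_var=1.0):
    """Squared-exponential covariance between 1D policy values a and b."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.3):
    """Posterior mean and std of the latent PPF at x_test given noisy observations."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Synthetic example: one noisy single-subject outcome at each of 30 policy values
rng = np.random.default_rng(0)
true_ppf = lambda x: 0.6 + 0.2 * np.exp(-((x - 0.4) / 0.2) ** 2)   # hypothetical PPF
x_obs = rng.random(30)
y_obs = true_ppf(x_obs) + rng.normal(0.0, 0.15, size=30)
x_grid = np.linspace(0.0, 1.0, 101)
mu, sigma = gp_posterior(x_obs, y_obs, x_grid)
print("estimated optimum policy:", x_grid[np.argmax(mu)])
```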
35 2 Generative model of student performance. Each instructional policy is presumed to have an inherent effectiveness for a population of individuals. [sent-88, score-0.816]
36 To determine the most effective policy from noisy observations, we must specify a generative model of student performance which relates the inherent effectiveness of instruction to observed performance. [sent-90, score-0.506]
37 Formally, each subject s is trained under a policy xs and then tested to evaluate their performance. [sent-91, score-0.504]
38 We posit that each training policy x has a latent population-wide effectiveness fx ∈ R and that how well a subject performs on the test is a noisy function of fxs . [sent-92, score-0.729]
39 We are interested in predicting the effectiveness of a policy x across a population of students given the observed test scores of S subjects trained under the policies x1:S . [sent-93, score-0.88]
40 Conceptually, this involves first inferring the effectiveness f of policies x1:S from the noisy test data, then interpolating from f to fx . [sent-94, score-0.271]
41 Having specified a prior over policy effectiveness, we turn to specifying a distribution over observable measures of subject learning conditioned on effectiveness. [sent-98, score-0.504]
42 This is consistent with the observation model µ_s | f_{x_s}, o, γ ∼ Beta(γ, γ e^{−(o + f_{x_s})}) and c_s | µ_s ∼ Binomial(g + (1 − g) µ_s ; n_s)  (1), where γ controls inter-subject variability in µ_s and g is the probability of answering a question correctly by random guessing. [sent-101, score-0.208]
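A minimal sketch of forward sampling from the observation model in Equation 1; the parameter values are arbitrary and the function name is ours.

```python
import numpy as np

def sample_test_score(f_x, o, gamma, g, n, rng=None):
    """Draw one subject's number-correct c out of n test trials under Eq. 1:
    mu ~ Beta(gamma, gamma*exp(-(o+f_x))), then c ~ Binomial(n, g + (1-g)*mu)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.beta(gamma, gamma * np.exp(-(o + f_x)))    # subject-level latent accuracy
    return rng.binomial(n, g + (1.0 - g) * mu)

# Example with arbitrary parameter values
print(sample_test_score(f_x=0.5, o=0.0, gamma=2.0, g=0.5, n=24))
```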
43 Marginalizing over µ_s with g = 0.5 gives P(c_s | f_{x_s}, o, γ, g = 0.5) = 2^{−n_s} C(n_s, c_s) Σ_{i=0}^{c_s} C(c_s, i) B(γ + i, n_s − c_s + γ e^{−(o+f_{x_s})}) / B(γ, γ e^{−(o+f_{x_s})})  (2), where C(n, k) denotes the binomial coefficient and B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function. [sent-105, score-0.326]
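The marginal likelihood in Equation 2 (guessing rate g = 0.5) can be sketched as follows; the helper names are ours and the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_marginal_likelihood(c, n, f_x, o, gamma):
    """log P(c | f_x, o, gamma, g=0.5): c correct out of n test trials, with the
    latent mu ~ Beta(gamma, gamma*exp(-(o+f_x))) integrated out (Eq. 2)."""
    b = gamma * np.exp(-(o + f_x))          # second Beta shape parameter
    i = np.arange(c + 1)
    terms = (log_binom(c, i)
             + betaln(gamma + i, n - c + b)
             - betaln(gamma, b))
    return -n * np.log(2.0) + log_binom(n, c) + np.logaddexp.reduce(terms)

# Example: likelihood of 18 correct out of 24 under two candidate effectiveness values
for f in (-1.0, 1.0):
    print(f, np.exp(log_marginal_likelihood(c=18, n=24, f_x=f, o=0.0, gamma=2.0)))
```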
44 The effectiveness of a policy x is estimated via p(f_x | c) ≈ (1/M) Σ_{m=1}^{M} p(f_x | f^{(m)}, θ^{(m)}), where p(f_x | f^{(m)}, θ^{(m)}) is Gaussian with mean and variance determined by the mth sample from the posterior p(f, θ | c). [sent-107, score-0.487]
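The mixture approximation of p(f_x | c) can be sketched by averaging the Gaussian conditionals over posterior samples; the RBF kernel and the two-element hyperparameter tuples below are assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between 1D policy values a and b."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def predictive_moments(x_new, x_train, f_samples, theta_samples):
    """Approximate the mean/variance of p(f_x | c) by averaging the Gaussian
    conditionals p(f_x | f^(m), theta^(m)) over M posterior samples (a sketch)."""
    means, variances = [], []
    for f_m, (length_scale, signal_var) in zip(f_samples, theta_samples):
        K = rbf_kernel(x_train, x_train, length_scale, signal_var) + 1e-8 * np.eye(len(x_train))
        k_s = rbf_kernel(x_train, np.array([x_new]), length_scale, signal_var)[:, 0]
        means.append(k_s @ np.linalg.solve(K, f_m))
        variances.append(signal_var - k_s @ np.linalg.solve(K, k_s))
    means, variances = np.array(means), np.array(variances)
    mixture_mean = means.mean()                              # mean of the Gaussian mixture
    mixture_var = (variances + means ** 2).mean() - mixture_mean ** 2
    return mixture_mean, mixture_var

# Toy example with two fake posterior samples over three training policies
x_train = np.array([0.1, 0.5, 0.9])
f_samples = [np.array([0.2, 0.8, 0.3]), np.array([0.1, 0.9, 0.4])]
theta_samples = [(0.2, 1.0), (0.3, 1.2)]
print(predictive_moments(0.6, x_train, f_samples, theta_samples))
```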
45 With regard to the latter, subjects may undergo multiple randomly ordered blocks of trials, where in each block b a subject s is trained under a policy x_b and then tested. [sent-110, score-0.806]
46 (We refer to this as a ‘strategy’ instead of as a ‘policy’ to avoid confusion with instructional policies.) [sent-115, score-0.208]
47 A focus on exploration rapidly exhausts the subject budget. [sent-121, score-0.237]
48 We adapt the UCB strategy by transforming the UCB based on the GPR to an expression based on the population accuracy (proportion correct) via x_t = argmax_x P(c_s/n_s > ν_t | f_x), where ν_t is an accuracy level determining the exploration/exploitation trade-off. [sent-127, score-0.302]
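A rough sketch of this accuracy-scale selection rule: draw f_x from the GP posterior at each candidate policy, map the draws to expected accuracy using the Beta mean implied by Equation 1 with g = 0.5, and choose the policy maximizing the probability of exceeding ν_t. The grid, the Monte Carlo estimator, and all names are illustrative assumptions.

```python
import numpy as np

def expected_accuracy(f_x, o, g=0.5):
    """Map latent effectiveness to expected proportion correct, using the Beta mean
    from Eq. 1, E[mu] = sigmoid(o + f_x), mixed with the guessing rate g."""
    return g + (1.0 - g) / (1.0 + np.exp(-(o + f_x)))

def select_next_policy(x_grid, mu, sigma, nu_t, o=0.0, num_samples=2000, rng=None):
    """Pick x_t = argmax_x P(accuracy > nu_t | f_x), with f_x ~ N(mu(x), sigma(x)^2)
    from the GP posterior; the probability is estimated by Monte Carlo."""
    rng = np.random.default_rng() if rng is None else rng
    f_samples = rng.normal(mu[None, :], sigma[None, :], size=(num_samples, len(x_grid)))
    acc_samples = expected_accuracy(f_samples, o)
    prob_exceed = (acc_samples > nu_t).mean(axis=0)
    return x_grid[np.argmax(prob_exceed)], prob_exceed

# Example (reusing mu, sigma, x_grid from the GP sketch above):
# x_next, p = select_next_policy(x_grid, mu, sigma, nu_t=0.75)
```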
49 Figure 2: (a) Experiment 1 training display; (b) selected Experiment 2 stimuli and their graspability ratings. [sent-130, score-0.607]
50 Light grey squares with error bars indicate the results of a traditional comparison; the red line depicts the GP posterior mean µ(x) for policy x, and the pink shading is ±2σ(x), where σ(x) is the GP posterior standard deviation. [sent-131, score-0.632]
51 (We refer to this as a ‘strategy’ instead of a ‘policy’ to avoid confusion with instructional policies.) [sent-134, score-0.321]
52 We applied a fine uniform grid search over policy space to perform the selection; exploration focuses on regions of the function with the maximum uncertainty, while exploitation involves concentrating on regions where performance is likely to be high. [sent-141, score-0.581]
53 During training, each pair was viewed; viewing was divided into 16/d trials, each with a duration of d sec, where d ranged from 1 sec (viewing the pair 16 times) to 16 sec (viewing the pair once). The UCB selection rule is x_t = argmax_x µ_{t−1}(x) + η_t σ_{t−1}(x). [sent-149, score-0.3]
54 Large η_t focuses the search on regions with the greatest uncertainty, but as η_t → 0 the focus shifts to exploitation; annealing η_t over trials trades off exploration and exploitation. We explored a variant of this experiment in which subjects were asked to learn the favorite sporting team of each of six individuals; during training, each individual’s face was paired with a team. [sent-153, score-0.402]
55 The training policy specifies the duration d of each face-team pair. [sent-157, score-0.576]
56 Immediately following training, subjects were tested on each of the six faces in random order and asked which team each individual was on. [sent-160, score-0.519]
57 Figure 2a shows several objects and their ratings on a 1–5 scale, where 1 means not graspable and 5 means highly graspable. [sent-171, score-0.206]
58 The policy space was defined to be the logarithm of the duration. [sent-172, score-0.435]
59 After each subject t completed the experiment, the PPF posterior was reestimated, and the upper-confidence bound strategy was used to select the policy for the following subject t + 1, x_{t+1}. [sent-186, score-0.672]
60 Figure 3b uses the PPF mean to estimate the optimum duration, and this duration is plotted against the number of subjects. [sent-195, score-0.375]
61 Figure captions: some objects and their graspability ratings (1 means not graspable, 5 graspable); the category of training exemplars chosen over a training sequence. [sent-196, score-0.835]
62 Examples of fading policies drawn from the 1D fading policy space are used in our experiments (Figure 4). [sent-197, score-1.085]
63 Figure 4: Expt. 2, trial-dependent fading and repetition policies (left and right, respectively). [sent-198, score-0.552]
64 However, obtaining ground truth requires a massive data collection effort. [sent-202, score-0.487]
65 We ran 100 additional subjects in a standard experimental design involving evaluation of a few alternative policies. [sent-206, score-0.518]
66 The mean score for each policy is plotted in Figure 3a as light grey squares with bars indicating ±2 standard errors. [sent-212, score-1.001]
67 The budget of 100 subjects places a limitation on the interpretability of the PPF posterior. [sent-215, score-0.359]
68 We restricted the policy space to the region 1 sec ≤ d ≤ 5 sec. [sent-222, score-0.487]
69 During the training phase, objects are presented one at a time and subjects must classify each object as glopnor or not-glopnor. [sent-230, score-0.47]
70 On each trial, the object from the previous trial is shown in the corner of the display along with its correct classification, to facilitate comparison between successive exemplars. [sent-232, score-0.559]
71 Following 25 training trials, 24 test trials are administered. We obtained rating norms for a set of 320 objects in terms of their graspability, i.e., how easy it is to grasp and use the object with one hand. [sent-234, score-0.266]
72 The training and test trials are roughly balanced in number of positive and negative examples. We polled 57 individuals, each of whom rated each of the objects multiple times. [sent-237, score-0.254]
73 Our goal was to teach the concept of glopnor; ratings were made on a 1–5 scale, where 1 means not graspable and 5 means highly graspable. [sent-249, score-0.317]
74 We defined an instructional policy space characterized by two dimensions: fading and blocking. [sent-259, score-0.933]
75 Fading refers to the notion from the animal learning literature that learning is facilitated by presenting exemplars far from the category boundary initially, and gradually transitioning toward more difficult exemplars over time. [sent-260, score-0.449]
76 Exemplars far from the boundary may help individuals to attend to the dimension of interest; exemplars near the boundary may help individuals determine where the boundary lies (Pashler & Mozer, in press). [sent-261, score-0.385]
77 Theorists have also made computational arguments for the benefit of fading (Bengio, Louradour, Collobert, & Weston, 2009; Khan et al.). [sent-262, score-0.223]
78 Blocking refers to the issue discussed in the Introduction concerning the sequence of category labels: Should training exemplars be blocked or interleaved? [sent-264, score-0.332]
79 That is, should the category label on one trial tend to be the same as or different than the label on the previous trial? [sent-265, score-0.23]
80 For fading, we considered a family of trial-dependent functions that specify the distance of the chosen exemplar to the category boundary (left panel of Figure 4). [sent-266, score-0.266]
81 This family is parameterized by a single policy variable x2, 0 ≤ x2 ≤ 1, that relates to the distance of an exemplar from the category boundary, d, as follows: d(t, x2) = min(1, 2x2) − (1 − |2x2 − 1|) (t − 1)/(T − 1), where T is the total number of training trials and t is the current trial. [sent-267, score-0.772]
82 For blocking, we also considered a family of trial-dependent functions that vary the probability of a category label repetition over trials (right panel of Figure 4). [sent-268, score-0.259]
83 This family is parameterized by the policy variable x1, 0 ≤ x1 ≤ 1, that relates to the probability of repeating the category label of the previous trial, r, as follows: r(t, x1) = x1 + (1 − 2x1) (t − 1)/(T − 1). [sent-269, score-0.585]
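The two schedule families can be transcribed directly from these definitions; the sketch below (variable names ours) also samples a 25-trial training sequence for a point (x1, x2) in the 2D policy space, for illustration only.

```python
import numpy as np

def fading_distance(t, x2, T):
    """Distance of the trial-t exemplar from the category boundary:
    d(t, x2) = min(1, 2*x2) - (1 - |2*x2 - 1|) * (t - 1) / (T - 1)."""
    return min(1.0, 2.0 * x2) - (1.0 - abs(2.0 * x2 - 1.0)) * (t - 1) / (T - 1)

def repetition_probability(t, x1, T):
    """Probability that the trial-t category label repeats the previous label:
    r(t, x1) = x1 + (1 - 2*x1) * (t - 1) / (T - 1)."""
    return x1 + (1.0 - 2.0 * x1) * (t - 1) / (T - 1)

def sample_training_sequence(x1, x2, T=25, rng=None):
    """Sample (category label, boundary distance) pairs for a 2D policy (x1, x2)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = [int(rng.integers(2))]
    distances = [fading_distance(1, x2, T)]
    for t in range(2, T + 1):
        repeat = rng.random() < repetition_probability(t, x1, T)
        labels.append(labels[-1] if repeat else 1 - labels[-1])
        distances.append(fading_distance(t, x2, T))
    return list(zip(labels, distances))

# Example: an easy-to-hard fading schedule (x2 = 0.5) with early blocking (x1 near 1)
print(sample_training_sequence(x1=0.9, x2=0.5))
```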
84 Figure 5a provides a visualization of sample training trial sequences for different points in the 2D policy space. [sent-270, score-0.573]
85 The abscissa of each graph is an index over the 25 training trials; the ordinate represents the category label and its distance from the category boundary. [sent-272, score-0.358]
86 Policies in the top and bottom rows show sequences of all-easy and all-hard examples, respectively; intermediate rows achieve fading in various forms. [sent-273, score-0.223]
87 Policies in the leftmost column begin training with many repetitions and end training with many alternations; policies in the rightmost column begin with alternations and end with repetitions; policies in the middle column have a time-invariant repetition probability of 0.5. [sent-274, score-0.601]
88 The test objects spanned the spectrum of distances from the category boundary. [sent-277, score-0.243]
89 We seeded the optimization process by running 10 subjects in each of four corners of policy space as well as in the center point of the space. [sent-279, score-0.641]
90 Figure 5 shows the PPF posterior mean over the 2D policy space, along with the policies in policy space selected for the 200 subjects. [sent-281, score-0.922]
91 Contour map colors indicate the expected accuracy of the corresponding policy (in contrast to the earlier colored graphs in which the coloring indicates the cdf). [sent-282, score-0.435]
92 To validate the outcome of this exploration, we ran 50 subjects at x∗ as well as policies in the upper corners and the center of Figure 5. [sent-285, score-0.41]
93 However, post-hoc power computation revealed that with 50 subjects and the variability inherent in the data, the odds of observing a reliable 2% difference in the mean is only . [sent-295, score-0.24]
94 Thus, although we did not observe a statistically significant improvement at the inferred optimum compared to sensible alternative policies, the results are consistent with our inferred optimum being an improvement over the type of policies one might have proposed a priori. [sent-299, score-0.296]
95 5 Discussion. The traditional experimental paradigm in psychology involves comparing a few alternative conditions by testing a large number of subjects in each condition. [sent-300, score-0.294]
96 The present work did not address individual differences or high-dimensional policy spaces, but our framework can readily be extended. [sent-314, score-0.435]
97 For example, one might adopt a fading policy in which the rate of fading depends in a parametric manner on a running average of performance. [sent-318, score-0.881]
98 The challenge of high-dimensional spaces comes primarily from computational overhead in selecting the next policy to evaluate. [sent-320, score-0.435]
99 However, this computational burden can be greatly relaxed by switching from a global optimization perspective to a local perspective: instead of considering candidate policies in the entire space, active selection might consider only policies in the neighborhood of previously explored policies. [sent-321, score-0.408]
100 Stimulus similarity relations modulate benefits for blocking versus interleaving during category learning. [sent-365, score-0.351]
wordName wordTfidf (topN-words)
[('policy', 0.435), ('instructional', 0.275), ('ppf', 0.226), ('fading', 0.223), ('subjects', 0.206), ('policies', 0.204), ('ectiveness', 0.162), ('category', 0.15), ('forrester', 0.145), ('di', 0.137), ('blocking', 0.131), ('keane', 0.128), ('gpr', 0.128), ('exemplars', 0.124), ('glopnor', 0.113), ('graspability', 0.113), ('graspable', 0.113), ('fxs', 0.1), ('ratings', 0.097), ('objects', 0.093), ('duration', 0.083), ('trial', 0.08), ('exploitation', 0.071), ('student', 0.071), ('interleaving', 0.07), ('subject', 0.069), ('fx', 0.067), ('pashler', 0.066), ('exemplar', 0.065), ('filliter', 0.065), ('goldstone', 0.065), ('mcmullen', 0.065), ('salmon', 0.065), ('trials', 0.064), ('budget', 0.064), ('gp', 0.059), ('training', 0.058), ('ucb', 0.057), ('experiment', 0.057), ('vanlehn', 0.057), ('corbett', 0.057), ('teach', 0.057), ('pomdp', 0.056), ('exploration', 0.056), ('cs', 0.055), ('individuals', 0.054), ('concept', 0.054), ('ns', 0.053), ('sec', 0.052), ('posterior', 0.052), ('trade', 0.051), ('traditional', 0.051), ('boundary', 0.051), ('rating', 0.051), ('argmaxx', 0.049), ('ective', 0.048), ('exhausts', 0.048), ('polled', 0.048), ('strategy', 0.047), ('erent', 0.047), ('teaching', 0.047), ('garnett', 0.047), ('osborne', 0.047), ('optimum', 0.046), ('shifting', 0.046), ('roberts', 0.045), ('repetition', 0.045), ('srinivas', 0.045), ('mozer', 0.043), ('erences', 0.043), ('anderson', 0.042), ('grey', 0.042), ('grasp', 0.039), ('concentrating', 0.039), ('faces', 0.038), ('cognitive', 0.038), ('experimental', 0.037), ('regions', 0.036), ('erence', 0.035), ('khan', 0.035), ('population', 0.035), ('presentation', 0.035), ('reliable', 0.034), ('krause', 0.034), ('psychological', 0.033), ('trust', 0.032), ('alternations', 0.032), ('bjork', 0.032), ('boeck', 0.032), ('conrad', 0.032), ('disks', 0.032), ('examplars', 0.032), ('favorite', 0.032), ('ferris', 0.032), ('fxb', 0.032), ('gri', 0.032), ('jonge', 0.032), ('koedinger', 0.032), ('kornell', 0.032), ('minear', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 241 nips-2013-Optimizing Instructional Policies
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
2 0.36592856 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
3 0.33795196 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.28625843 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
5 0.20907198 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
Author: Nguyen Viet Cuong, Wee Sun Lee, Nan Ye, Kian Ming A. Chai, Hai Leong Chieu
Abstract: We introduce a new objective function for pool-based Bayesian active learning with probabilistic hypotheses. This objective function, called the policy Gibbs error, is the expected error rate of a random classifier drawn from the prior distribution on the examples adaptively selected by the active learning policy. Exact maximization of the policy Gibbs error is hard, so we propose a greedy strategy that maximizes the Gibbs error at each iteration, where the Gibbs error on an instance is the expected error of a random classifier selected from the posterior label distribution on that instance. We apply this maximum Gibbs error criterion to three active learning scenarios: non-adaptive, adaptive, and batch active learning. In each scenario, we prove that the criterion achieves near-maximal policy Gibbs error when constrained to a fixed budget. For practical implementations, we provide approximations to the maximum Gibbs error criterion for Bayesian conditional random fields and transductive Naive Bayes. Our experimental results on a named entity recognition task and a text classification task show that the maximum Gibbs error criterion is an effective active learning criterion for noisy models. 1
6 0.19974381 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
7 0.19161221 257 nips-2013-Projected Natural Actor-Critic
8 0.18667574 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
9 0.17237061 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
11 0.15474983 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
12 0.14960781 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
13 0.14930746 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
14 0.14187266 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.14067467 347 nips-2013-Variational Planning for Graph-based MDPs
16 0.13744472 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
17 0.13738705 69 nips-2013-Context-sensitive active sensing in humans
18 0.13189921 54 nips-2013-Bayesian optimization explains human active search
19 0.13066457 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
20 0.12331489 165 nips-2013-Learning from Limited Demonstrations
topicId topicWeight
[(0, 0.236), (1, -0.312), (2, -0.221), (3, 0.154), (4, -0.003), (5, 0.029), (6, -0.086), (7, 0.067), (8, -0.022), (9, 0.045), (10, -0.045), (11, -0.026), (12, -0.098), (13, 0.024), (14, 0.035), (15, -0.079), (16, -0.059), (17, -0.037), (18, -0.066), (19, 0.018), (20, 0.047), (21, -0.035), (22, -0.058), (23, 0.026), (24, -0.002), (25, 0.014), (26, 0.005), (27, 0.05), (28, -0.081), (29, -0.008), (30, -0.074), (31, -0.02), (32, 0.065), (33, -0.087), (34, 0.089), (35, 0.054), (36, 0.109), (37, -0.046), (38, -0.007), (39, 0.059), (40, -0.042), (41, 0.001), (42, 0.009), (43, 0.023), (44, -0.043), (45, 0.039), (46, 0.079), (47, -0.026), (48, -0.027), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.97267258 241 nips-2013-Optimizing Instructional Policies
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
2 0.85529733 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
3 0.84586585 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.84293014 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1
5 0.7910111 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a na¨ve approach ı to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
6 0.74990875 348 nips-2013-Variational Policy Search via Trajectory Optimization
7 0.72215974 257 nips-2013-Projected Natural Actor-Critic
8 0.69828272 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
9 0.66594768 165 nips-2013-Learning from Limited Demonstrations
10 0.66588998 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
11 0.60363197 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
12 0.59301186 347 nips-2013-Variational Planning for Graph-based MDPs
13 0.57023948 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
14 0.56670779 69 nips-2013-Context-sensitive active sensing in humans
15 0.56225812 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
16 0.53270632 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.52168518 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
18 0.50519389 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
19 0.50270861 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
20 0.50010151 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
topicId topicWeight
[(2, 0.015), (16, 0.038), (33, 0.13), (34, 0.127), (41, 0.021), (49, 0.036), (56, 0.09), (64, 0.299), (70, 0.039), (85, 0.033), (89, 0.037), (93, 0.04), (95, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.79322577 241 nips-2013-Optimizing Instructional Policies
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
2 0.73395944 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
Author: Xiaojin Zhu
Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1
3 0.67399722 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Author: Adrien Todeschini, François Caron, Marie Chavent
Abstract: We propose a novel class of algorithms for low rank matrix completion. Our approach builds on novel penalty functions on the singular values of the low rank matrix. By exploiting a mixture model representation of this penalty, we show that a suitably chosen set of latent variables enables to derive an ExpectationMaximization algorithm to obtain a Maximum A Posteriori estimate of the completed low rank matrix. The resulting algorithm is an iterative soft-thresholded algorithm which iteratively adapts the shrinkage coefficients associated to the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives showing the interest of the proposed approach for low rank matrix completion. 1
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
5 0.58667147 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.58282799 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
7 0.58238816 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
8 0.58213323 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
9 0.58174032 201 nips-2013-Multi-Task Bayesian Optimization
10 0.58139414 5 nips-2013-A Deep Architecture for Matching Short Texts
11 0.57903737 350 nips-2013-Wavelets on Graphs via Deep Learning
12 0.57851928 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.57846534 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
14 0.57844567 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
15 0.57844532 294 nips-2013-Similarity Component Analysis
16 0.57796502 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
17 0.57778955 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions
18 0.57774383 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
19 0.57714736 86 nips-2013-Demixing odors - fast inference in olfaction
20 0.57694542 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models