nips nips2013 nips2013-181 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiaojin Zhu
Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an optimal teaching framework aimed at learners who employ Bayesian models. [sent-4, score-0.984]
2 Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. [sent-5, score-1.252]
3 In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. [sent-7, score-1.209]
4 Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. [sent-8, score-0.978]
5 If the learner receives iid noiseless training examples where x ∼ uniform[0, 1], then with large probability |θ̂ − θ| = O(1/n). [sent-16, score-0.330]
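A quick empirical check of this rate (a sketch; the threshold value and the midpoint estimator below are illustrative choices of ours, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.42  # assumed true threshold, for illustration only
for n in [10, 100, 1000, 10000]:
    errs = []
    for _ in range(2000):
        x = rng.uniform(0, 1, n)
        left = x[x < theta].max(initial=0.0)    # largest negative example
        right = x[x >= theta].min(initial=1.0)  # smallest positive example
        errs.append(abs((left + right) / 2 - theta))  # midpoint estimator
    print(n, np.mean(errs))  # the mean error shrinks roughly like 1/n
```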
6 The key difference is that an active learner still needs to explore the boundary, while a teacher can guide. [sent-20, score-0.391]
7 Understanding the optimal teaching strategies is important for both machine learning and education: (i) When the learner is a human student (as modeled in cognitive psychology), optimal teaching theory can design the best lessons for education. [sent-23, score-2.086]
8 Optimal teaching quantifies the power and limits of such adversaries. [sent-25, score-0.882]
9 (iii) Optimal teaching informs robots as to the best ways to utilize human teaching, and vice versa. [sent-26, score-0.882]
10 The first thread is the teaching dimension theory by Goldman and Kearns [10] and its extensions in computer science (e.g., …). [sent-28, score-0.899]
11 Our framework allows for probabilistic, noisy learners with infinite hypothesis space, arbitrary loss functions, and the notion of teaching effort. [sent-31, score-0.983]
12 In section 3.2 we will show that the original teaching dimension is a special case of our framework. [sent-33, score-0.882]
13 We defer the discussion on this variant, known as “collusion” in computational teaching theory, and its connection to information theory to section 5. [sent-42, score-0.872]
14 In addition, our optimal teaching framework may shed light on the optimality of different methods of teaching humans [9, 13, 17, 18]. [sent-43, score-1.791]
15 Interactive systems have been built which employ or study teaching heuristics [4, 6]. [sent-45, score-0.872]
16 Our framework provides a unifying optimization view that balances the future loss of the learner and the effort of the teacher. [sent-46, score-0.369]
17 2 Optimal Teaching for General Learners. We start with a general framework for teaching and gradually specialize the framework in later sections. [sent-47, score-0.894]
18 Future test items for the learner will be drawn iid from this model. [sent-50, score-0.34]
19 Without loss of generality let θ∗ ∈ Θ, the hypothesis space of the learner (if not, we can always admit approximation error and define θ∗ to be the distribution in Θ closest to the world distribution). [sent-53, score-0.32]
20 However, it can only teach the learner by providing teaching (or, from the learner’s perspective, training) examples. [sent-58, score-1.205]
21 The teacher’s goal is to design a teaching set D so that the learner learns θ∗ as accurately and effortlessly as possible. [sent-59, score-1.134]
22 In this paper, we consider batch teaching where the teacher presents D to the learner all at once, and the teacher can use any item in the example domain. [sent-60, score-1.391]
23 The quantity fD represents the state of the learner after seeing the teaching set D. [sent-65, score-1.149]
24 The function effort() measures the difficulty the teacher experiences when teaching with D. [sent-66, score-0.989]
25 Despite its appearance, the optimal teaching problem (1) is completely different from regularized parameter estimation in machine learning. [sent-67, score-0.896]
26 The optimization is instead over the teaching set D. [sent-69, score-0.883]
27 The optimal teaching problem (1) so far is rather abstract. [sent-72, score-0.896]
28 However, we note that our framework can be adapted to other types of learners, as long as we know how they react to the teaching set D. [sent-74, score-0.898]
29 It can be relaxed in future work, where the teacher has to estimate the state of the learner by “probing” it with tests. [sent-76, score-0.379]
30 3 Optimal Teaching for Bayesian Learners. We focus on Bayesian learners because they are widely used in both machine learning and cognitive science [7, 23, 24] and because of their predictability: they react to any teaching examples in D by performing Bayesian updates. [sent-77, score-1.022]
31 3.1 The KL Loss and Various Effort Functions, with Examples. The choice of loss() and effort() is problem-specific and depends on the teaching goal. [sent-83, score-0.872]
32 With the KL loss, it is easy to verify that the optimal teaching problem (1) can be equivalently written as min_D − log p(θ∗ | D) + effort(D). [sent-86, score-0.914]
33 Instead, the intuition is to find a good teaching set D to make θ∗ “stand out” in the posterior distribution. [sent-88, score-0.897]
34 The effort() function reflects resource constraints on the teacher and the learner: how hard is it to create the teaching examples, to deliver them to the learner, and to have the learner absorb them? [sent-89, score-1.251]
35 For most of the paper we use the cardinality of the teaching set effort(D) = c|D| where c is a positive per-item cost. [sent-90, score-0.907]
36 This assumes that the teaching effort is proportional to the number of teaching items, which is reasonable in many problems. [sent-91, score-1.791]
37 The Teaching Impedance (TI) of a teaching set D is the objective value − log p(θ∗ | D) + effort(D). [sent-97, score-0.89]
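A minimal sketch of this quantity in code, assuming the caller supplies the learner's log-posterior and using the cardinality effort c|D| from above (the callable and the constant are assumptions for illustration):

```python
def teaching_impedance(log_posterior_at_target, D, c=1.0):
    """TI(D) = -log p(theta* | D) + effort(D), here with effort(D) = c|D|;
    lower TI means a better teaching set."""
    return -log_posterior_at_target(D) + c * len(D)
```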
38 We now give examples to illustrate our optimal teaching framework for Bayesian learners. [sent-99, score-0.942]
39 The teacher wants the learner to arrive at a posterior p(θ | D) peaked at θ∗ by designing a small D. [sent-106, score-0.414]
40 The optimal teaching problem becomes min_{n, x1, y1, …, xn, yn} (…). [sent-112, score-0.896]
41 One solution is the limiting case with a teaching set of size two D = {(θ∗ − ε/2, −1), (θ∗ + ε/2, +1)} as ε → 0, since the Teaching Impedance TI = log(ε) + 2c approaches −∞. [sent-116, score-0.872]
42 Bayesian learners typically assume that the training data is iid; optimal teaching intentionally violates this assumption because the designed teaching examples in D will typically be non-iid. [sent-122, score-1.884]
43 That is, the teaching examples require more effort to learn if any two items xi, xj are too close. [sent-126, score-0.968]
44 With two teaching examples as in Example 1, TI = log(ε) + c/ε. [sent-127, score-0.898]
45 The optimal teaching set is D = {(θ∗ − c/2, −1), (θ∗ + c/2, 1)}. [sent-129, score-0.896]
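A one-line check of where this set comes from, under the pairwise-closeness effort just described:

```latex
\frac{d}{d\epsilon}\left[\log\epsilon + \frac{c}{\epsilon}\right]
= \frac{1}{\epsilon} - \frac{c}{\epsilon^{2}} = 0
\;\Longrightarrow\; \epsilon = c,
\qquad\text{so } D = \{(\theta^{*}-c/2,\,-1),\,(\theta^{*}+c/2,\,+1)\},
\quad TI = \log c + 1.
```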
46 The learner has Θ = {θA, θB}, and we want to teach it the fact that the world is using θ∗ = θA. [sent-132, score-0.368]
47 For this example, let us suppose that teaching with extreme item values is undesirable (note xi → −∞ minimizes the KL loss). [sent-140, score-0.919]
48 In other words, the teaching items must be in some interval [−d, d]. [sent-142, score-0.921]
49 This leads to the optimal teaching problem min_{n, x1, …, xn} (…). [sent-143, score-0.896]
50 (e.g., when c ≥ d): the effort of teaching outweighs the benefit. [sent-155, score-0.919]
51 3.2 Teaching Dimension is a Special Case. In this section we provide a comparison to one of the most influential teaching models, namely the original teaching dimension theory [10]. [sent-158, score-1.754]
52 It may seem that our optimal teaching setting (2) is more restrictive than theirs, since we make strong assumptions about the learner (that it is Bayesian, and the form of the prior and likelihood). [sent-159, score-1.181]
53 Their query learning setting in fact makes equally strong assumptions, in that the learner updates its version space to be consistent with all teaching items. [sent-160, score-1.134]
54 The learner’s posterior after teaching with D is P(θc | D) = 1/(number of concepts in C consistent with D) if c is consistent with D, and 0 otherwise (3). Teaching dimension TD(c∗) is the minimum cardinality of D that uniquely identifies the target concept. [sent-166, score-0.986]
55 We can formulate this using our optimal teaching framework: min_D − log P(θc∗ | D) + γ|D| (4), where we used the cardinality effort() function (and renamed the cost γ for clarity). [sent-167, score-0.960]
56 But since TD(c∗) is unknown beforehand, we can set γ ≤ 1/|C| since |C| ≥ TD(c∗) (one can at least eliminate one concept from the version space with each well-designed teaching item). [sent-169, score-0.885]
57 The solution D to (4) is then a minimum teaching set for the target concept c∗ , and |D| = T D(c∗ ). [sent-170, score-0.929]
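As a concrete illustration, a brute-force search for this minimum teaching set on a toy concept class (the threshold class below is our own example, not one from the paper):

```python
from itertools import combinations

def teaching_dimension(concepts, target, items):
    """Brute-force version of (4) with a tiny gamma: the smallest D such
    that `target` is the only concept consistent with its labels on D."""
    for size in range(len(items) + 1):
        for D in combinations(items, size):
            consistent = [c for c in concepts
                          if all(c(x) == target(x) for x in D)]
            if consistent == [target]:
                return size, D
    return None

items = [0, 1, 2, 3]
concepts = [lambda x, t=t: x >= t for t in range(5)]  # threshold concepts
print(teaching_dimension(concepts, concepts[2], items))  # -> (2, (1, 2))
```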
58 4 Optimal Teaching for Bayesian Learners in the Exponential Family. While we have proposed an optimization-based framework for teaching any Bayesian learner and provided three examples, it is not clear if there is a unified approach to solve the optimization problem (2). [sent-171, score-1.156]
59 For this subset of Bayesian learners, finding the optimal teaching set D naturally decomposes into two steps: In the first step one solves a convex optimization problem to find the optimal aggregate sufficient statistics for D. [sent-173, score-1.04]
60 In the second step one “unpacks” the aggregate sufficient statistics into actual teaching examples. [sent-174, score-0.972]
61 If we further assume that effort(D) can be expressed in n and s, then we can write our optimal teaching problem (2) as min_{n,s} −θ∗⊤(λ1 + s) + A(θ∗)(λ2 + n) + A0(λ1 + s, λ2 + n) + effort(n, s) (6), where n ∈ Z≥0 and s ∈ {t ∈ R^D | ∃{xi}i∈I such that t = Σi∈I T(xi)}. [sent-187, score-0.896]
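A minimal sketch of solving (6); the callables follow the paper's notation but are supplied by the user, and relaxing s to R^d while scanning over n are simplifications of ours:

```python
import numpy as np
from scipy.optimize import minimize

def solve_aggregate_stats(theta_star, A, A0, lam1, lam2, effort,
                          s_dim, n_max=50):
    """Step 1 of Algorithm 1, sketched: for each candidate n, minimize the
    objective of (6) over the aggregate statistics s (relaxed to R^d) and
    keep the best pair (n, s)."""
    theta_star, lam1 = np.asarray(theta_star), np.asarray(lam1)
    best = None
    for n in range(n_max + 1):
        def obj(s, n=n):
            return (-theta_star @ (lam1 + s) + A(theta_star) * (lam2 + n)
                    + A0(lam1 + s, lam2 + n) + effort(n, s))
        res = minimize(obj, x0=np.zeros(s_dim))
        if best is None or res.fun < best[0]:
            best = (res.fun, n, res.x)
    return best  # (objective value, n, s)
```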
62 We then need to find n teaching examples whose aggregate sufficient statistics equal s. [sent-194, score-0.998]
63 Given n and s we can easily unpack the teaching set D = {x1, …, xn}. [sent-198, score-0.917]
64 We initialize the n teaching examples iid by xi ∼ p(x | θ∗), i = 1, …, n. [sent-213, score-0.951]
65 As we will see later, such iid samples from the target distribution are not great teaching examples for two main reasons: (i) We really should compensate for the learner’s prior by aiming not at the target distribution but overshooting a bit in the opposite direction of the prior. [sent-226, score-1.041]
66 Algorithm 1 (Approximately optimal teaching for Bayesian learners in the exponential family)
input: target θ∗; learner information T(), A(), A0(), λ1, λ2; effort()
Step 1: Solve for the aggregate sufficient statistics n, s by convex optimization (6)
Step 2: Unpacking: n ← max(0, [n]); find x1, …, xn whose aggregate sufficient statistics approximate s
67 To find a good teaching set D, in step 1 we first find its optimal cardinality n and aggregate sufficient statistics s = Σi∈D xi using (6). [sent-243, score-1.055]
68 Note an interesting fact here (equation (9)): the average of the teaching examples, s/n, is not the target µ∗, but should compensate for the learner’s initial belief µ0. [sent-247, score-0.943]
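A quick derivation of this compensation for the known-variance Gaussian with conjugate prior N(µ0, σ0²) (our own re-derivation, consistent with the paper's equation (9)):

```latex
\mathbb{E}[\mu \mid D] \;=\; \frac{\sigma^{2}\mu_{0} + \sigma_{0}^{2}\, s}{\sigma^{2} + n\,\sigma_{0}^{2}} \;=\; \mu^{*}
\quad\Longrightarrow\quad
\frac{s}{n} \;=\; \mu^{*} + \frac{\sigma^{2}}{n\,\sigma_{0}^{2}}\,(\mu^{*} - \mu_{0}),
```

i.e., the sample average overshoots the target in the direction away from the prior mean µ0.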
69 Another interesting fact is that the optimal teaching strategy may be to not teach at all (n = 0). [sent-257, score-0.967]
70 Intuitively, the learner is too stubborn to change its prior belief by much, and such a minuscule change does not justify the teaching effort. [sent-259, score-1.157]
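A numeric sketch of when n = 0 wins, using the closed form above; all constants here are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def TI(n, mu_star=0.0, mu0=0.1, sigma=1.0, sigma0=0.3, c=2.0):
    """TI for teaching a known-variance Gaussian mean with n examples,
    plugging in the prior-compensating aggregate s derived above."""
    if n == 0:
        post_mean, post_var = mu0, sigma0**2
    else:
        s = n * mu_star + (sigma**2 / sigma0**2) * (mu_star - mu0)
        post_var = 1.0 / (1 / sigma0**2 + n / sigma**2)
        post_mean = post_var * (mu0 / sigma0**2 + s / sigma**2)
    return -norm.logpdf(mu_star, post_mean, np.sqrt(post_var)) + c * n

# the prior is already near the target, so the per-item cost dominates:
print(min(range(10), key=TI))  # -> 0, i.e., do not teach at all
```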
71 Yet another interesting fact is that such an alarming teaching set (with n identical examples) is likely to contradict the world’s model variance σ 2 , but the discrepancy does not affect teaching because the learner fixes σ 2 . [sent-265, score-2.006]
72 The teacher needs to decide the total number of teaching items n and the split s = (s1, …, sK). [sent-276, score-1.038]
73 Let the teaching target be π∗ = (1/10, 3/10, 6/10). [sent-291, score-0.905]
74 If we say that teaching requires no effort by setting effort(s) = 0, then the optimal aggregate statistics found by Algorithm 1 (implemented with Matlab fmincon) are s = (317, 965, 1933). [sent-293, score-1.815]
75 But if we say teaching is costly by setting a positive per-item cost in effort(s): [sent-301, score-0.872]
76 Our optimal teaching set (0, 2, 8) has Teaching Impedance TI = 2. [sent-304, score-0.896]
77 We can also attempt to sample teaching sets of size ten from multinomial(10, π ∗ ). [sent-309, score-0.883]
78 In 100,000 simulations with such random teaching sets the average TI = 4. [sent-310, score-0.872]
79 In summary, our optimal teaching set (0, 2, 8) is very good. [sent-315, score-0.896]
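To reproduce the flavor of this comparison, here is a sketch; the Dirichlet prior β and the per-item cost c are assumptions of ours (the paper's exact values are truncated above), so the numbers will differ:

```python
import numpy as np
from scipy.stats import dirichlet, multinomial

pi_star = np.array([0.1, 0.3, 0.6])
beta = np.ones(3)  # assumed Dirichlet prior
c = 0.3            # assumed per-item cost

def TI(s):
    """-log Dir(pi* | beta + s) + c * (number of items)."""
    return -dirichlet.logpdf(pi_star, beta + np.asarray(s)) + c * np.sum(s)

print(TI([0, 2, 8]))                               # TI of the designed set
draws = multinomial.rvs(10, pi_star, size=10_000)  # random size-10 sets
print(np.mean([TI(s) for s in draws]))             # average TI of random sets
```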
80 For instance, with the machinery in Example 5 one can teach the learner a full generative model for a Naïve Bayes classifier. [sent-317, score-0.333]
81 The optimal choices of these n’s and m’s can be solved for separately as in Example 5, as long as effort() can be separated. [sent-340, score-0.896]
82 The unpacking step is easy: we know we need nk teaching documents with label k. [sent-341, score-1.01]
83 In the end, each teaching document with label k will have the bag-of-words mk1, …, mkV. [sent-344, score-0.884]
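One simple way to perform this unpacking, sketched below; the even split is our choice, since any split with the same aggregate counts yields the same learner posterior:

```python
def unpack_documents(n_k, m_k):
    """Unpack aggregate word counts m_k = (m_k1, ..., m_kV) into n_k
    teaching documents of label k, giving each document a near-equal
    share of every word's count."""
    return [[cnt // n_k + (i < cnt % n_k) for cnt in m_k]
            for i in range(n_k)]

print(unpack_documents(2, [5, 3, 0]))  # -> [[3, 2, 0], [2, 1, 0]]
```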
84 Now we consider the general case of teaching both the mean and the covariance of a multivariate Gaussian. [sent-349, score-0.872]
85 Again, such iid samples are typically not good teaching examples. [sent-371, score-0.901]
86 Algorithm 1 decides to use n = 4 teaching examples with s = 4. [sent-380, score-0.898]
87 These teaching examples form a tetrahedron (edges added for clarity). [sent-389, score-0.898]
88 This is sensible: in fact, one can show that the minimum teaching set for a D-dimensional Gaussian is the D + 1 points at the vertices of a D-dimensional tetrahedron. [sent-390, score-0.883]
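A sketch of this construction: it produces n = D + 1 points whose sample mean and (1/n)-normalized sample scatter match a target exactly. We match µ∗ and Σ∗ directly here, ignoring the prior compensation that Algorithm 1 would add:

```python
import numpy as np

def simplex_teaching_set(mu, Sigma, seed=0):
    """n = D + 1 points with sample mean exactly mu and sample scatter
    (X - mu)^T (X - mu) / n exactly Sigma."""
    D, n = len(mu), len(mu) + 1
    rng = np.random.default_rng(seed)
    # D orthonormal columns in R^n, each orthogonal to the all-ones vector
    M = np.column_stack([np.ones(n), rng.standard_normal((n, D))])
    Q, _ = np.linalg.qr(M)
    H = Q[:, 1:]                      # n x D:  H^T 1 = 0,  H^T H = I
    L = np.linalg.cholesky(Sigma)
    return mu + np.sqrt(n) * H @ L.T  # rows are the D + 1 teaching examples

X = simplex_teaching_set(np.zeros(3), np.eye(3))
print(X.mean(axis=0))                                # ~ [0, 0, 0]
print((X - X.mean(0)).T @ (X - X.mean(0)) / len(X))  # ~ identity
```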
89 In contrast, teaching sets with four iid random samples from the target N (µ∗ , Σ∗ ) have worse TI. [sent-395, score-0.934]
90 In 100,000 simulations such random teaching sets have average TI = 9. [sent-396, score-0.872]
91 For example, the task in Figure 1 may only require a single teaching example D = {x1 = θ∗ }, and the learner can figure out that this x1 encodes the decision boundary. [sent-408, score-1.145]
92 In fact, this is known as “collusion” in computational teaching theory (see, e.g., …). [sent-410, score-0.872]
93 In one extreme of collusion, the teacher and the learner agree upon an information-theoretical coding scheme beforehand. [sent-413, score-0.379]
94 Then, the teaching set D is not used in a traditional machine learning training set sense, but rather as source coding. [sent-414, score-0.885]
95 We introduced an optimal teaching framework that balances teaching loss and effort. [sent-417, score-1.817]
96 We hope this paper provides a “stepping stone” for follow-up work, such as 0-1 loss() for classification, non-Bayesian learners, uncertainty in the learner’s state, and teaching materials beyond training items. [sent-418, score-0.885]
97 Generalized teaching dimensions and the query complexity of learning. [sent-488, score-0.872]
98 How do humans teach: On curriculum learning and teaching dimension. [sent-494, score-0.908]
99 Complexity of teaching by a restricted number of examples. [sent-500, score-0.872]
100 Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. [sent-524, score-0.872]
wordName wordTfidf (topN-words)
[('teaching', 0.872), ('ort', 0.287), ('learner', 0.262), ('teacher', 0.117), ('unpacking', 0.09), ('aggregate', 0.088), ('learners', 0.077), ('teach', 0.071), ('items', 0.049), ('sk', 0.047), ('effort', 0.047), ('unpack', 0.045), ('world', 0.035), ('cardinality', 0.035), ('target', 0.033), ('impedance', 0.032), ('cognitive', 0.032), ('iid', 0.029), ('bayesian', 0.028), ('fd', 0.027), ('collusion', 0.027), ('niw', 0.027), ('examples', 0.026), ('posterior', 0.025), ('optimal', 0.024), ('curriculum', 0.024), ('xi', 0.024), ('item', 0.023), ('prior', 0.023), ('loss', 0.023), ('xn', 0.022), ('grif', 0.022), ('nk', 0.022), ('kl', 0.021), ('family', 0.02), ('log', 0.018), ('mki', 0.018), ('mkv', 0.018), ('rafferty', 0.018), ('taught', 0.018), ('unpacks', 0.018), ('exponential', 0.017), ('thread', 0.017), ('suf', 0.017), ('rd', 0.016), ('shafto', 0.016), ('tenenbaum', 0.016), ('multinomial', 0.015), ('cn', 0.015), ('seeing', 0.015), ('goldman', 0.015), ('react', 0.015), ('stem', 0.015), ('ths', 0.015), ('balances', 0.015), ('conjugate', 0.014), ('mle', 0.014), ('documents', 0.014), ('guides', 0.014), ('sii', 0.014), ('concept', 0.013), ('training', 0.013), ('exp', 0.013), ('nonnegativity', 0.013), ('colt', 0.013), ('opposite', 0.013), ('likelihood', 0.013), ('compensate', 0.012), ('label', 0.012), ('dot', 0.012), ('visualized', 0.012), ('active', 0.012), ('clarity', 0.012), ('word', 0.012), ('humans', 0.012), ('statistics', 0.012), ('mind', 0.011), ('minimum', 0.011), ('na', 0.011), ('decision', 0.011), ('relax', 0.011), ('framework', 0.011), ('optimization', 0.011), ('ten', 0.011), ('passive', 0.011), ('univariate', 0.01), ('wants', 0.01), ('limits', 0.01), ('black', 0.01), ('robots', 0.01), ('white', 0.01), ('gaussian', 0.01), ('sensible', 0.01), ('integers', 0.01), ('dimension', 0.01), ('uniform', 0.009), ('mini', 0.009), ('sure', 0.009), ('convex', 0.009), ('illustrate', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
Author: Xiaojin Zhu
Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1
2 0.10567392 230 nips-2013-Online Learning with Costly Features and Labels
Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari
Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1
3 0.066412441 241 nips-2013-Optimizing Instructional Policies
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
4 0.063279465 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
Author: Leonidas Lefakis, François Fleuret
Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1
5 0.058046449 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
Author: Alexander Zimin, Gergely Neu
Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1
6 0.039124165 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?
7 0.037973955 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
8 0.033623267 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
9 0.032393027 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
10 0.0308783 325 nips-2013-The Pareto Regret Frontier
11 0.030187221 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
12 0.030104615 309 nips-2013-Statistical Active Learning Algorithms
13 0.028577425 171 nips-2013-Learning with Noisy Labels
14 0.026981965 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
15 0.026765265 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
16 0.025835356 211 nips-2013-Non-Linear Domain Adaptation with Boosting
17 0.023311276 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
18 0.02291631 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.022485917 252 nips-2013-Predictive PAC Learning and Process Decompositions
20 0.022433169 149 nips-2013-Latent Structured Active Learning
topicId topicWeight
[(0, 0.073), (1, -0.012), (2, 0.007), (3, -0.027), (4, 0.021), (5, 0.016), (6, 0.006), (7, -0.008), (8, 0.004), (9, 0.01), (10, -0.013), (11, -0.011), (12, -0.022), (13, 0.003), (14, 0.017), (15, -0.044), (16, -0.005), (17, 0.03), (18, 0.019), (19, -0.008), (20, -0.043), (21, 0.012), (22, -0.031), (23, 0.015), (24, -0.01), (25, 0.03), (26, -0.023), (27, 0.033), (28, -0.013), (29, 0.016), (30, -0.017), (31, 0.075), (32, -0.01), (33, -0.024), (34, -0.026), (35, -0.007), (36, -0.048), (37, -0.06), (38, 0.025), (39, 0.036), (40, -0.088), (41, 0.021), (42, -0.022), (43, -0.045), (44, 0.026), (45, 0.031), (46, -0.097), (47, -0.036), (48, -0.056), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.87785387 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
Author: Xiaojin Zhu
Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1
2 0.63111842 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
Author: Leonidas Lefakis, François Fleuret
Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1
3 0.52174705 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
Author: Alexander Zimin, Gergely Neu
Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1
4 0.51429129 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari
Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement. 1
5 0.47132301 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
Author: Amit Daniely, Nati Linial, Shai Shalev-Shwartz
Abstract: The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a computational resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a natural supervised learning problem — we consider agnostic PAC learning of halfspaces over 3-sparse vectors in {−1, 1, 0}^n. This class is inefficiently learnable using O(n/ε²) examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random 3CNF formulas is hard, it is impossible to efficiently learn this class using only O(n/ε²) examples. We further show that under stronger hardness assumptions, even O(n^1.499/ε²) examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently using Ω̃(n²/ε²) examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem. 1
6 0.45992652 255 nips-2013-Probabilistic Movement Primitives
7 0.4582549 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
9 0.4507063 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?
10 0.44967264 134 nips-2013-Graphical Models for Inference with Missing Data
11 0.44914302 230 nips-2013-Online Learning with Costly Features and Labels
12 0.42327383 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
13 0.41241109 340 nips-2013-Understanding variable importances in forests of randomized trees
14 0.40473038 171 nips-2013-Learning with Noisy Labels
15 0.39956895 332 nips-2013-Tracking Time-varying Graphical Structure
16 0.38724962 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
17 0.38455626 252 nips-2013-Predictive PAC Learning and Process Decompositions
18 0.37437838 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
19 0.37278807 164 nips-2013-Learning and using language via recursive pragmatic reasoning about other agents
20 0.37082651 225 nips-2013-One-shot learning and big data with n=2
topicId topicWeight
[(2, 0.022), (16, 0.029), (33, 0.123), (34, 0.067), (41, 0.04), (49, 0.018), (56, 0.129), (64, 0.281), (70, 0.028), (85, 0.036), (89, 0.022), (93, 0.045), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.77032328 241 nips-2013-Optimizing Instructional Policies
Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler
Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1
same-paper 2 0.7567181 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
Author: Xiaojin Zhu
Abstract: What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework. 1
3 0.67053312 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Author: Adrien Todeschini, François Caron, Marie Chavent
Abstract: We propose a novel class of algorithms for low rank matrix completion. Our approach builds on novel penalty functions on the singular values of the low rank matrix. By exploiting a mixture model representation of this penalty, we show that a suitably chosen set of latent variables enables us to derive an Expectation-Maximization algorithm to obtain a Maximum A Posteriori estimate of the completed low rank matrix. The resulting algorithm is an iterative soft-thresholded algorithm which iteratively adapts the shrinkage coefficients associated to the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives showing the interest of the proposed approach for low rank matrix completion. 1
4 0.63106394 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be doomed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
5 0.58612108 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
7 0.58318484 318 nips-2013-Structured Learning via Logistic Regression
8 0.58131325 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
9 0.58130074 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
10 0.58085138 355 nips-2013-Which Space Partitioning Tree to Use for Search?
11 0.58042127 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
12 0.58010596 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
13 0.57926309 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
14 0.57910013 11 nips-2013-A New Convex Relaxation for Tensor Completion
15 0.57829535 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
16 0.57813179 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
17 0.57781661 252 nips-2013-Predictive PAC Learning and Process Decompositions
18 0.57774061 233 nips-2013-Online Robust PCA via Stochastic Optimization
19 0.57730281 149 nips-2013-Latent Structured Active Learning
20 0.57706362 230 nips-2013-Online Learning with Costly Features and Labels