nips nips2010 nips2010-14 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. [sent-6, score-0.415]
2 We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. [sent-7, score-0.431]
3 We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). [sent-9, score-1.084]
4 Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. [sent-10, score-0.556]
5 This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. [sent-11, score-0.519]
6 This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. [sent-12, score-0.425]
7 1 Introduction Apprenticeship learning is a variant of reinforcement learning, first introduced by Abbeel & Ng [1] (see also [2, 3, 4, 5, 6]), designed to address the difficulty of correctly specifying the reward function in many reinforcement learning problems. [sent-13, score-0.313]
8 The basic idea underlying apprenticeship learning is that a learning agent, called the apprentice, is able to observe another agent, called the expert, behaving in a Markov Decision Process (MDP). [sent-14, score-0.243]
9 The goal of the apprentice is to learn a policy that is at least as good as the expert’s policy, relative to an unknown reward function. [sent-15, score-1.051]
10 This is a weaker requirement than the usual goal in reinforcement learning, which is to find a policy that maximizes reward. [sent-16, score-0.659]
11 The development of the apprenticeship learning framework was prompted by the observation that, although reward functions are often difficult to specify, demonstrations of good behavior by an expert are usually available. [sent-17, score-0.755]
12 Existing apprenticeship learning algorithms have a number of limitations. [sent-19, score-0.228]
13 However, there may be cases where the apprentice is unwilling or unable to assume that the rewards have this structure. [sent-21, score-0.424]
14 Additionally, most formulations of apprenticeship learning are actually harder than reinforcement learning; apprenticeship learning algorithms typically invoke reinforcement learning algorithms as subroutines, and their performance guarantees depend strongly on the quality of these subroutines. [sent-22, score-0.663]
15 Consequently, these apprenticeship learning algorithms suffer from the same challenges of large state spaces, exploration vs. exploitation, and so on. [sent-23, score-0.289]
16 This fact is somewhat contrary to the intuition that demonstrations from an expert — especially a good expert — should make the problem easier, not harder. [sent-27, score-0.75]
17 Another approach to using expert demonstrations that has received attention primarily in the empirical literature is to passively imitate the expert using a classification algorithm (see [7, Section 4] for a comprehensive survey). [sent-28, score-0.857]
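To make the imitation-by-classification idea concrete, here is a minimal Python sketch, assuming scikit-learn is available and using hypothetical helper names; it pools all of the expert's observed state-action pairs and fits a single multiclass classifier whose predicted labels are then played as the apprentice's actions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_imitation_policy(expert_states, expert_actions):
        # expert_states:  (num_samples, state_dim) array of states visited by the expert
        # expert_actions: (num_samples,) array of integer action labels the expert chose
        clf = LogisticRegression(max_iter=1000)
        clf.fit(expert_states, expert_actions)
        return clf

    def apprentice_action(clf, state):
        # The apprentice acts deterministically, playing the classifier's predicted label.
        return int(clf.predict(np.asarray(state).reshape(1, -1))[0])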
18 Put differently, we show that apprenticeship learning can be reduced to classification. [sent-32, score-0.249]
19 First, we show that the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ), where ǫ ∈ (0, 1] is the error of the learned classifier. [sent-35, score-1.056]
20 Secondly, and perhaps more interestingly, we extend our first result to prove that the difference in policy values is only O(ǫ) when the expert’s policy is close to optimal. [sent-36, score-1.084]
21 Of course, if one could perfectly imitate the expert, then naturally a near-optimal expert policy is preferred. [sent-37, score-0.944]
22 If one is certain a priori that the expert is demonstrating good behavior, then our result implies that many fewer demonstrations need to be collected than if this were not the case. [sent-40, score-0.452]
23 This can yield substantial savings when expert demonstrations are expensive or difficult to obtain. [sent-41, score-0.425]
24 2 Related Work Several authors have reduced reinforcement learning to simpler problems. [sent-42, score-0.132]
25 Bagnell et al. [10] described an algorithm for constructing a good nonstationary policy from a sequence of good “one-step” policies. [sent-43, score-0.779]
26 These policies are only concerned with maximizing reward collected in a single time step, and are learned with the help of observations from an expert. [sent-44, score-0.147]
27 Langford & Zadrozny [11] reduced reinforcement learning to a sequence of classification problems (see also Blatt & Hero [12]), but these problems have an unusual structure, and the authors are only able to provide a small amount of guidance as to how data for these problems can be collected. [sent-45, score-0.132]
28 Kakade & Langford [13] reduced reinforcement learning to regression, but required additional assumptions about how easily a learning algorithm can access the entire state space. [sent-46, score-0.193]
29 Importantly, all this work makes the standard reinforcement learning assumptions that the true rewards are known, and that a learning algorithm is able to interact directly with the environment. [sent-47, score-0.153]
30 In this paper we are interested in settings where the reward function is not known, and where the learning algorithm is limited to passively observing an expert. [sent-48, score-0.146]
31 They also assume that the expert follows a deterministic policy, an assumption we do not make. [sent-51, score-0.399]
32 We will allow the state space S to be infinite, but assume that the action space A is finite. [sent-53, score-0.109]
33 Let α be the initial state distribution, and θ the transition function, where θ(s, a, ·) specifies the next-state distribution from state s ∈ S under action a ∈ A. [sent-54, score-0.155]
34 The only assumption we make about the unknown reward function R is that 0 ≤ R(s) ≤ Rmax for all states s ∈ S, where Rmax is a finite upper bound on the reward of any state. [sent-55, score-0.243]
35 A policy π is stationary if it is a mapping from states to distributions over actions. [sent-59, score-0.589]
36 A policy π is nonstationary if it belongs to the set Π^H = Π × · · · × Π (H times), where Π is the set of stationary policies and H is the horizon. [sent-62, score-0.709]
37 In this case, πt (s, a) denotes the probability of taking action a in state s at time t. [sent-63, score-0.129]
38 Also, if π is nonstationary, then πt refers to the stationary policy that is equal to the tth component of π. [sent-64, score-0.58]
39 A (stationary or nonstationary) policy π is deterministic if each one of its action distributions is concentrated on a single action. [sent-65, score-0.647]
40 If a deterministic policy π is stationary, then π(s) is the action taken in state s, and if π is nonstationary, then πt (s) is the action taken in state s at time t. [sent-66, score-0.822]
41 We define the value function V_t^π(s) for a nonstationary policy π at time t in the usual manner: V_t^π(s) ≜ E[ Σ_{t′=t}^{H} R(s_{t′}) | s_t = s, a_{t′} ∼ π_{t′}(s_{t′}, ·), s_{t′+1} ∼ θ(s_{t′}, a_{t′}, ·) ]. [sent-67, score-0.795]
42 So V_t^π(s) is the expected cumulative reward for following policy π when starting at state s and time step t. [sent-68, score-0.685]
43 Note that there are several value functions per nonstationary policy, one for each time step t. [sent-69, score-0.201]
44 The value of a policy is defined to be V(π) ≜ E[ V_1^π(s) | s ∼ α(·) ], and an optimal policy π* is one that satisfies π* ∈ arg max_π V(π). [sent-70, score-1.056]
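These definitions can be read operationally as a finite-horizon rollout. The sketch below estimates V(π) by Monte Carlo, where policy, alpha, theta, and R are hypothetical callables standing in for π, α, θ, and the reward function (which the snippet assumes is known, purely to illustrate the definition; in the apprenticeship setting it is not).

    import random

    def estimate_value(policy, alpha, theta, R, H, num_rollouts=1000, seed=0):
        # Monte Carlo estimate of V(pi) = E[ sum_{t=1}^H R(s_t) ] with s_1 ~ alpha.
        # policy[t](s) returns a dict {action: probability}; alpha() samples an initial
        # state; theta(s, a) samples a next state; R(s) is the (bounded) reward.
        rng = random.Random(seed)
        total = 0.0
        for _ in range(num_rollouts):
            s = alpha()
            ret = 0.0
            for t in range(H):
                ret += R(s)
                actions, probs = zip(*policy[t](s).items())
                a = rng.choices(actions, weights=probs, k=1)[0]
                s = theta(s, a)
            total += ret
        return total / num_rollouts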
45 We write π^E to denote the (possibly nonstationary) expert policy, and V_t^E(s) as an abbreviation for V_t^{π^E}(s). [sent-71, score-0.363]
46 Our goal is to find a nonstationary apprentice policy π A such that V (π A ) ≥ V (π E ). [sent-72, score-1.106]
47 Note that the values of these policies are with respect to the unknown reward function. [sent-73, score-0.127]
48 Let D_t^π be the distribution on state-action pairs at time t under policy π. [sent-74, score-0.548]
49 In other words, a sample (s, a) is drawn from D_t^π by first drawing s_1 ∼ α(·), then following policy π for time steps 1 through t, which generates a trajectory (s_1, a_1, …, s_t, a_t), and finally setting (s, a) = (s_t, a_t). [sent-75, score-0.645]
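The sampling procedure just described translates directly into a truncated rollout; this sketch reuses the same hypothetical policy/alpha/theta callables as in the value-estimation sketch above, together with an rng such as random.Random().

    def sample_from_D_t(policy, alpha, theta, t, rng):
        # Draw one (s, a) pair from D_t^pi: roll the policy out for t steps and
        # return the state-action pair reached at time t (1-indexed, as in the text).
        s = alpha()
        for step in range(1, t + 1):
            actions, probs = zip(*policy[step - 1](s).items())
            a = rng.choices(actions, weights=probs, k=1)[0]
            if step == t:
                return s, a
            s = theta(s, a)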
50 4 Details and Justification of the Reduction Our goal is to reduce apprenticeship learning to classification, so let us describe exactly how this reduction is defined, and also justify the utility of such a reduction. [sent-81, score-0.246]
51 The existence of PAC-learnable hypothesis classes is the reason that reducing apprenticeship learning to classification is a sensible endeavor. [sent-92, score-0.306]
52 Suppose that the apprentice observes m independent trajectories from the expert’s policy π^E, where the ith trajectory is a sequence of state-action pairs (s_1^i, a_1^i, …, s_H^i, a_H^i). [sent-93, score-1.071]
53 Now consider a PAC-learnable hypothesis class H, where H contains a set of functions mapping the state space S to the finite action space A. [sent-98, score-0.152]
54 If m = poly(H/δ, 1/ǫ), then for each time step t, the apprentice can use a PAC learning algorithm for H to learn a hypothesis ĥ_t ∈ H such that, with probability at least 1 − δ/H, we have Pr_{(s,a)∼D_t^E}( ĥ_t(s) ≠ a ) ≤ ǫ* + ǫ, where ǫ* is the error of the best hypothesis in H. [sent-99, score-0.509]
55 A natural choice for the apprentice’s policy π^A is to set π_t^A = ĥ_t for each time step t; this policy uses the learned classifiers to imitate the behavior of the expert. [sent-102, score-1.158]
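In contrast to the pooled classifier sketched earlier, the reduction described here trains one classifier per time step, each on the expert examples observed at that time step. A minimal sketch, again assuming scikit-learn and hypothetical array shapes:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_per_timestep_classifiers(states, actions):
        # states:  (m, H, state_dim) array -- m expert trajectories of length H
        # actions: (m, H) integer array   -- the expert's actions along each trajectory
        # Returns a list of H classifiers, so the apprentice policy is pi_t^A(s) = h_t(s).
        m, H, _ = states.shape
        classifiers = []
        for t in range(H):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(states[:, t, :], actions[:, t])   # train only on the time-t examples
            classifiers.append(clf)
        return classifiers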
56 The apprentice policy π^A is a deterministic policy that satisfies Pr_{(s,a)∼D_t^E}( π_t^A(s) ≠ a ) ≤ ǫ for some ǫ > 0 and all time steps t; this is the condition referred to below as Assumption 1. [sent-105, score-1.564]
57 As we have shown, an apprentice policy satisfying Assumption 1 with small ǫ can be found with high probability, provided that the expert’s policy is well-approximated by a PAC-learnable hypothesis class and that the apprentice is given enough trajectories from the expert. [sent-106, score-1.937]
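In practice, how small an ǫ the learned policy plausibly satisfies Assumption 1 with can be gauged by measuring disagreement on held-out expert trajectories; the sketch below uses hypothetical inputs and estimates the error under the empirical expert distribution rather than D_t^E itself.

    import numpy as np

    def estimate_assumption1_error(apprentice, states, actions):
        # apprentice[t](s) returns the deterministic action pi_t^A(s).
        # states:  (m, H, state_dim) held-out expert states; actions: (m, H) expert actions.
        # Returns the largest per-time-step empirical disagreement rate, an estimate of
        # the epsilon appearing in Assumption 1.
        m, H, _ = states.shape
        worst = 0.0
        for t in range(H):
            preds = np.array([apprentice[t](states[i, t]) for i in range(m)])
            worst = max(worst, float(np.mean(preds != actions[:, t])))
        return worst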
58 A reasonable intuition is that the value of the policy π A in Assumption 1 is nearly as high as the value of the policy π E ; the remainder of this paper is devoted to confirming this intuition. [sent-107, score-1.056]
59 5 Guarantee for Any Expert If the error rate ǫ in Assumption 1 is small, then the apprentice’s policy π A closely imitates the expert’s policy π E , and we might hope that this implies that V (π A ) is not much less than V (π E ). [sent-108, score-1.089]
60 The main challenge in proving Theorem 1 is that this assumption does not hold for the classification problems to which we have reduced the apprenticeship learning problem. [sent-113, score-0.3]
61 So our strategy for proving Theorem 1 will be to show that these differences do not cause the value of the apprentice policy to degrade too much relative to the value of the expert’s policy. [sent-115, score-0.947]
62 Say that a state s is good if π_t^E(s, π_t^A(s)) ≥ 1 − ǫ_1, and that s is bad otherwise. [sent-121, score-0.188]
63 In the proofs of these lemmas, we write sa to denote a trajectory, where sa = (s_1, a_1, …, s_H, a_H). [sent-125, score-0.67]
64 Also, let dP_π denote the probability measure induced on trajectories by following policy π, and let R(sa) = Σ_{t=1}^{H} R(s_t) denote the sum of the rewards of the states in trajectory sa. [sent-129, score-0.693]
65 The next lemma proves that if a deterministic policy “almost” agrees with the expert’s policy π^E in every state and time step, then its value is not much worse than the value of π^E. [sent-131, score-1.618]
66 Say a trajectory sa is good if it is “consistent” with π̂ — that is, π̂_t(s_t) = a_t for all time steps t — and that sa is bad otherwise. [sent-136, score-0.913]
67 The second inequality holds because good trajectories are assigned at least as much measure by P_π̂ as by P_{π^E}, because π̂ is deterministic. [sent-138, score-0.157]
68 The next lemma proves a slightly different statement than Lemma 2: If a policy exactly agrees with the expert’s policy π^E in “almost” every state and time step, then its value is not much worse than the value of π^E. [sent-139, score-1.235]
69 Say a trajectory sa is good if π̂_t(s_t, ·) = π_t^E(s_t, ·) for all time steps t, and that sa is bad otherwise. [sent-144, score-0.913]
70 We have
    V(π̂) = ∫ R(sa) dP_π̂
          = ∫_{sa good} R(sa) dP_π̂ + ∫_{sa bad} R(sa) dP_π̂
          = ∫_{sa good} R(sa) dP_{π^E} + ∫_{sa bad} R(sa) dP_π̂
          = ∫ R(sa) dP_{π^E} − ∫_{sa bad} R(sa) dP_{π^E} + ∫_{sa bad} R(sa) dP_π̂
          ≥ V(π^E) − ǫH²R_max + ∫_{sa bad} R(sa) dP_π̂
          ≥ V(π^E) − ǫH²R_max. [sent-145, score-3.548]
71 The first inequality holds because, by the union bound, P_{π^E} assigns at most an ǫH fraction of its measure to bad trajectories, and the maximum reward of a trajectory is HR_max. [sent-146, score-0.356]
72 The second inequality holds by our assumption that all rewards are nonnegative. [sent-147, score-0.117]
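For a sense of scale, here is a quick worked example of the ǫH²R_max term with assumed numbers (not taken from the paper):

    # Illustrative numbers only: horizon H, reward bound R_max, and per-step
    # disagreement probability eps as in the lemma above.
    H, R_max, eps = 20, 1.0, 0.01
    max_return = H * R_max                 # largest possible value of any policy
    value_gap_bound = eps * H**2 * R_max   # the eps * H^2 * R_max term in the bound
    print(value_gap_bound, max_return)     # 4.0 out of a maximum of 20.0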
73 Since the apprentice’s policy π^A satisfies Assumption 1, by Lemma 1 we can choose any ǫ_1 ∈ (0, 1] and have Pr_{s∼D_t^E}( π_t^E(s, π_t^A(s)) ≥ 1 − ǫ_1 ) ≥ 1 − ǫ/ǫ_1. [sent-150, score-0.528]
74 Now construct a “dummy” policy π̂ as follows: For all time steps t, let π̂_t(s, ·) = π_t^E(s, ·) for any state s where π_t^E(s, π_t^A(s)) ≥ 1 − ǫ_1. [sent-151, score-0.629]
75 However, in many cases it may be reasonable to assume that the expert is following a near-optimal policy (indeed, if she is not, then we should question the decision to select her as an expert). [sent-157, score-0.842]
76 The next theorem shows that the dependence of V (π A ) on the classification error ǫ is significantly better when the expert is following a near-optimal policy. [sent-158, score-0.366]
77 If Assumption 1 holds, then V(π^A) ≥ V(π^E) − 4ǫH³R_max − ∆_{π^E}, where ∆_{π^E} ≜ V(π^*) − V(π^E) is the suboptimality of the expert’s policy π^E. [sent-160, score-0.528]
78 We can interpret this bound as follows: If our goal is to learn an apprentice policy whose value is within ∆_{π^E} of the expert policy’s value, we can double our progress towards that goal by halving the classification error rate. [sent-162, score-1.239]
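A rough way to see the practical gap between the O(√ǫ) and O(ǫ) rates (heavily simplified: all constants and the dependence on H, R_max and ∆_{π^E} are dropped):

    # To guarantee a normalized value gap of at most delta, the O(sqrt(eps)) bound
    # needs a classification error of roughly delta**2, while the O(eps) bound
    # needs only roughly delta.
    for delta in (0.1, 0.01):
        eps_general = delta ** 2       # Theorem 1 regime (arbitrary expert)
        eps_near_optimal = delta       # Theorem 2 regime (near-optimal expert)
        print(delta, eps_general, eps_near_optimal)
    # Since the number of expert demonstrations grows polynomially in 1/eps, the
    # near-optimal-expert case can need far fewer demonstrations for the same gap.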
79 To see why a near-optimal expert policy should yield a weaker dependence on ǫ, consider an expert policy π^E that is an optimal policy, but in every state s ∈ S selects one of two actions a_1^s and a_2^s uniformly at random. [sent-164, score-1.791]
80 A deterministic apprentice policy π^A that closely imitates the expert will either set π^A(s) = a_1^s or π^A(s) = a_2^s, but in either case the classification error will not be less than 1/2. [sent-165, score-1.328]
81 However, since π^E is optimal, both actions a_1^s and a_2^s must be optimal actions for state s, and so the apprentice policy π^A will be optimal as well. [sent-166, score-1.017]
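The example in the preceding sentences can be checked with a tiny script (a toy construction for illustration, not taken from the paper):

    import random

    # A one-state toy MDP: both actions give reward 1 at every step, so every policy
    # is optimal, yet a deterministic imitator of a uniformly-randomizing expert has
    # classification error 1/2.
    H = 5
    rng = random.Random(0)
    n = 10000
    expert_actions = [rng.choice((0, 1)) for _ in range(n)]   # expert: uniform over two optimal actions
    apprentice_action = 0                                     # apprentice: always the first action
    error = sum(a != apprentice_action for a in expert_actions) / n
    expert_value = apprentice_value = H * 1.0                 # both collect reward 1 per step
    print(round(error, 2), expert_value, apprentice_value)    # ~0.5, 5.0, 5.0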
82 The next several definitions will be with respect to an arbitrary nonstationary base policy π B ; in the proof of Theorem 2, we will make a particular choice for the base policy. [sent-169, score-0.769]
83 Fix a deterministic nonstationary policy π^{B,ǫ} that satisfies π_t^B(s, π_t^{B,ǫ}(s)) ≥ 1 − ǫ for some ǫ ∈ (0, 1] and all states s and time steps t. [sent-170, score-0.852]
84 Such a policy always exists by letting ǫ = 1, but if ǫ is close to zero, then π B,ǫ is a deterministic policy that “almost” agrees with π B in every state and time step. [sent-171, score-1.224]
85 Of course, depending on the choice of π B , a policy π B,ǫ may not exist for small ǫ, but let us set aside that concern for the moment; in the proof of Theorem 2, the base policy π B will be chosen so that ǫ can be as small as we like. [sent-172, score-1.094]
86 In other words, in each state s and time step t, the distribution is obtained by proportionally redistributing the probability assigned to action π_t^{B,ǫ}(s) by the distribution π_t^B(s, ·) to all other actions. [sent-174, score-0.146]
87 Let π^{B+} be a deterministic policy defined by π_t^{B+}(s) = arg max_a E[ V_{t+1}^{π^B}(s′) | s′ ∼ θ(s, a, ·) ] for all states s ∈ S and time steps t. [sent-176, score-0.671]
88 In other words, π_t^{B+}(s) is the best action in state s at time t, assuming that the policy π^B is followed thereafter. [sent-177, score-0.657]
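For a finite state space this definition can be computed directly; here is a sketch assuming a hypothetical tabular representation of the transition function θ and of V_{t+1}^{π^B}.

    def greedy_one_step(theta, V_next, actions):
        # theta[s][a] is a dict {next_state: probability}; V_next[s2] is V_{t+1}^{pi^B}(s2).
        # Returns pi_t^{B+} as a dict mapping each state to its one-step-lookahead action.
        policy = {}
        for s in theta:
            def lookahead(a):
                return sum(p * V_next[s2] for s2, p in theta[s][a].items())
            policy[s] = max(actions, key=lookahead)
        return policy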
89 A mixed policy consists of a finite set of deterministic nonstationary policies, along with a distribution over those policies; the mixed policy is followed by drawing a single policy according to the distribution in the initial time step, and following that policy exclusively thereafter. [sent-179, score-2.443]
90 More formally, a mixed policy is defined by a set of ordered pairs {(π^i, λ(i))}_{i=1}^N for some finite N, where each component policy π^i is a deterministic nonstationary policy, Σ_{i=1}^N λ(i) = 1, and λ(i) ≥ 0 for all i ∈ [N]. [sent-180, score-1.353]
91 We define a mixed policy π̃^{B,ǫ,+} as follows: For each component policy π^i and each time step t, either π_t^i = π_t^{B,ǫ} or π_t^i = π_t^{B+}. [sent-181, score-1.136]
92 There is one component policy for each possible choice; this yields N = 2^H component policies. [sent-182, score-0.574]
93 And the probability λ(i) assigned to each component policy π^i is λ(i) = (1 − ǫ)^{k(i)} ǫ^{H−k(i)}, where k(i) is the number of time steps t for which π_t^i = π_t^{B,ǫ}. [sent-183, score-0.603]
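The weights λ(i) are exactly those induced by an independent biased coin flip at each time step, which gives a simple way to sample from the mixed policy without enumerating its 2^H components; a sketch with hypothetical per-time-step policy lists:

    import random

    def sample_component_policy(pi_B_eps, pi_B_plus, eps, H, rng=None):
        # pi_B_eps[t] and pi_B_plus[t] are the two per-time-step deterministic policies.
        # Choosing pi^{B,eps}_t independently with probability (1 - eps) at each time step
        # induces exactly the weights lambda(i) = (1 - eps)^{k(i)} * eps^{H - k(i)} over
        # the 2^H component policies.
        rng = rng or random.Random()
        return [pi_B_eps[t] if rng.random() < 1.0 - eps else pi_B_plus[t] for t in range(H)]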
94 Having established these definitions, we are now ready to prove several lemmas that will help us prove Theorem 2. [sent-184, score-0.134]
95 Clearly V_H^π(s) = V_H^{π′}(s) for all states s and any two policies π and π′, since the value function V_H of any policy depends only on the reward function R. [sent-189, score-0.651]
96 Since π̃^{B,ǫ,+} is a mixed policy, by the linearity of expectation we have V(π̃^{B,ǫ,+}) = Σ_{i=1}^N λ(i) V(π^i), where each π^i is a component policy of π̃^{B,ǫ,+} and λ(i) is its associated probability. [sent-197, score-0.588]
97 Here we used the fact that probability (1 − ǫ)^H ≥ 1 − ǫH is assigned to a component policy that is identical to π^{B,ǫ}, and the value of any component policy is at most V(π^*). [sent-199, score-1.119]
98 Since the apprentice’s policy π^A satisfies Assumption 1, by Lemma 1 we can choose any ǫ_1 ∈ (0, 1/H) and have Pr_{s∼D_t^E}( π_t^E(s, π_t^A(s)) ≥ 1 − ǫ_1 ) ≥ 1 − ǫ/ǫ_1. [sent-208, score-0.528]
99 As in the proof of Theorem 1, let us construct a “dummy” policy π̂ as follows: For all time steps t, let π̂_t(s, ·) = π_t^E(s, ·) for any state s where π_t^E(s, π_t^A(s)) ≥ 1 − ǫ_1. [sent-209, score-0.645]
100 ∆_π̂ ≤ ∆_{π^E} + (ǫ/ǫ_1)H²R_max. (2) Now observe that, if we set the base policy π^B = π̂, then by definition π^A is a valid choice for π^{B,ǫ_1}.
wordName wordTfidf (topN-words)
[('policy', 0.528), ('apprentice', 0.397), ('sa', 0.327), ('expert', 0.314), ('apprenticeship', 0.213), ('prs', 0.182), ('nonstationary', 0.181), ('vt', 0.162), ('rmax', 0.158), ('dt', 0.141), ('dp', 0.122), ('bad', 0.107), ('imitate', 0.102), ('reinforcement', 0.096), ('reward', 0.091), ('demonstrations', 0.087), ('hv', 0.071), ('lemma', 0.064), ('action', 0.063), ('trajectory', 0.062), ('pr', 0.061), ('deterministic', 0.056), ('poly', 0.053), ('lemmas', 0.053), ('st', 0.051), ('bagnell', 0.049), ('state', 0.046), ('langford', 0.044), ('trajectories', 0.044), ('zadrozny', 0.044), ('hypothesis', 0.043), ('classi', 0.042), ('passively', 0.04), ('umar', 0.04), ('vh', 0.037), ('mixed', 0.037), ('policies', 0.036), ('abbeel', 0.035), ('steps', 0.035), ('good', 0.035), ('theorem', 0.034), ('syed', 0.034), ('inequality', 0.033), ('abbreviation', 0.033), ('bianca', 0.033), ('hrmax', 0.033), ('imitates', 0.033), ('states', 0.032), ('andrew', 0.031), ('ross', 0.031), ('assumption', 0.029), ('blatt', 0.029), ('stationary', 0.029), ('prove', 0.028), ('holds', 0.028), ('rewards', 0.027), ('agrees', 0.026), ('ready', 0.025), ('dummy', 0.025), ('kakade', 0.025), ('mdp', 0.024), ('savings', 0.024), ('actions', 0.023), ('princeton', 0.023), ('proves', 0.023), ('component', 0.023), ('imitation', 0.023), ('pieter', 0.023), ('proving', 0.022), ('base', 0.022), ('reduced', 0.021), ('si', 0.021), ('sham', 0.021), ('robert', 0.021), ('nitions', 0.021), ('assigns', 0.02), ('weaker', 0.02), ('letting', 0.02), ('time', 0.02), ('sensible', 0.019), ('ai', 0.019), ('schapire', 0.019), ('cation', 0.019), ('reduction', 0.018), ('dependence', 0.018), ('assigned', 0.017), ('easier', 0.017), ('horizon', 0.017), ('ht', 0.017), ('survey', 0.016), ('write', 0.016), ('reducing', 0.016), ('proof', 0.016), ('fix', 0.016), ('fewer', 0.016), ('learning', 0.015), ('exploration', 0.015), ('purposes', 0.015), ('union', 0.015), ('usual', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
2 0.42054644 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
3 0.37368959 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
4 0.35109085 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
5 0.35060742 43 nips-2010-Bootstrapping Apprenticeship Learning
Author: Abdeslam Boularias, Brahim Chaib-draa
Abstract: We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert’s policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i), the expert is (near-)optimal, and (ii), the dynamics of the system is known. Empirical results on gridworlds and car racing problems show that our approach is able to learn good policies from a small number of demonstrations. 1
6 0.33079317 208 nips-2010-Policy gradients in linearly-solvable MDPs
7 0.29133648 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
8 0.27997839 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
9 0.24778543 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
10 0.24051146 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
11 0.21363448 134 nips-2010-LSTD with Random Projections
12 0.19701092 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
13 0.16087356 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.15817936 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
15 0.1372159 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
16 0.13654737 64 nips-2010-Distributionally Robust Markov Decision Processes
17 0.1350245 269 nips-2010-Throttling Poisson Processes
18 0.13104481 4 nips-2010-A Computational Decision Theory for Interactive Assistants
19 0.12727557 229 nips-2010-Reward Design via Online Gradient Ascent
20 0.12075565 212 nips-2010-Predictive State Temporal Difference Learning
topicId topicWeight
[(0, 0.225), (1, -0.557), (2, -0.072), (3, -0.069), (4, 0.021), (5, -0.034), (6, 0.054), (7, -0.014), (8, 0.042), (9, -0.123), (10, -0.011), (11, 0.003), (12, -0.013), (13, 0.033), (14, -0.009), (15, -0.001), (16, 0.072), (17, -0.004), (18, -0.02), (19, -0.046), (20, 0.011), (21, 0.001), (22, 0.042), (23, 0.033), (24, -0.002), (25, 0.029), (26, -0.003), (27, 0.045), (28, -0.034), (29, 0.005), (30, 0.028), (31, 0.026), (32, 0.026), (33, -0.028), (34, 0.002), (35, -0.014), (36, 0.007), (37, 0.057), (38, 0.061), (39, 0.032), (40, -0.045), (41, 0.027), (42, 0.056), (43, 0.054), (44, -0.07), (45, -0.031), (46, 0.051), (47, 0.062), (48, 0.044), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.97735542 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
2 0.8926816 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
3 0.87699836 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
4 0.87476844 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
5 0.84952939 43 nips-2010-Bootstrapping Apprenticeship Learning
Author: Abdeslam Boularias, Brahim Chaib-draa
Abstract: We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert’s policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i), the expert is (near-)optimal, and (ii), the dynamics of the system is known. Empirical results on gridworlds and car racing problems show that our approach is able to learn good policies from a small number of demonstrations. 1
6 0.84917253 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.84128571 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
8 0.80707121 208 nips-2010-Policy gradients in linearly-solvable MDPs
9 0.77375662 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
10 0.69264108 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
11 0.63357192 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
12 0.61130959 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
13 0.60891956 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
14 0.58244741 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
15 0.56161886 134 nips-2010-LSTD with Random Projections
16 0.55467862 4 nips-2010-A Computational Decision Theory for Interactive Assistants
17 0.55202979 212 nips-2010-Predictive State Temporal Difference Learning
18 0.53038204 64 nips-2010-Distributionally Robust Markov Decision Processes
19 0.51415902 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
20 0.46776545 229 nips-2010-Reward Design via Online Gradient Ascent
topicId topicWeight
[(0, 0.352), (13, 0.029), (17, 0.013), (27, 0.041), (30, 0.064), (45, 0.194), (50, 0.028), (52, 0.02), (60, 0.048), (77, 0.046), (78, 0.017), (90, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.78428292 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
2 0.68382871 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1
3 0.661358 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
4 0.65815765 5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1 Introduction: Motivation and Setup High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here, 1 multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient. Block-sparse Multiple Regression. A common multiple task learning setting, and which is the focus of this paper, is that of multiple regression, where we have r > 1 response variables, and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse; and there is a large overlap of these relevant features across the different regression problems. 
Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . 
The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M ) (k) and its support by Supp(M ). Also, for any matrix M , let M 1,1 := j,k |Mj |, i.e. the sums of absolute values of the elements, and M 1,∞ := j 2 Mj ∞ where, Mj ∞ (k) := maxk |Mj |. 2 Problem Set-up and Our Method Multiple regression. We consider the following standard multiple linear regression model: ¯ y (k) = X (k) θ(k) + w(k) , k = 1, . . . , r, where y (k) ∈ Rn is the response for the k-th task, regressed on the design matrix X (k) ∈ Rn×p (possibly different across tasks), while w(k) ∈ Rn is the noise vector. We assume each w(k) is drawn independently from N (0, σ 2 ). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, ¯ we collate these quantities into matrices Y ∈ Rn×r for the responses, Θ ∈ Rp×r for the regression n×r parameters and W ∈ R for the noise. ¯ Dirty Model. In this paper we are interested in estimating the true parameter Θ from data by lever¯ aging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ would have many non-zero entries, corresponding to features shared by several tasks (“shared” rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all (“non-shared rows”), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees: Support recovery: We say an estimator Θ successfully recovers the true signed support if ¯ sign(Supp(Θ)) = sign(Supp(Θ)). We are interested in deriving sufficient conditions under which ¯ the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ, which is union of its supports for the different tasks. In particular, denoting Uk for the support of the ¯ k-th column of Θ, and U = k Uk . Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ, ¯ Θ−Θ 2.1 ∞ = max max j=1,...,p k=1,...,r (k) Θj (k) ¯ − Θj . Our Method Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding “clean” models would either just use blocksparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing to have both block-sparse and elementwise sparse component, we are ¯ able to outperform both classes of these “clean models”, for all regimes Θ. Algorithm 1 Dirty Block Sparse Solve the following convex optimization problem: (S, B) ∈ arg min S,B 1 2n r k=1 y (k) − X (k) S (k) + B (k) 2 2 + λs S 1,1 + λb B 1,∞ . (1) Then output Θ = B + S. 3 Main Results and Their Consequences We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1 /ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. 
In Theorem 1, we consider the case of deterministic design matrices X (k) , and provide sufficient conditions guaranteeing signed support recovery, and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the 3 rows of the design matrices are random from a general zero mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1 /ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two-tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α. Comparisons. Negahban and Wainwright [8] show that there is actually a “phase transition” in the scaling of the probability of successful signed support-recovery with the number of observations. n Denote a particular rescaling of the sample-size θLasso (n, p, α) = s log(p−s) . Then as Wainwright [18] show, when the rescaled number of samples scales as θLasso > 2 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θLasso < 2−δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1 /ℓ∞ -reguralized multiple linear regression, define a similar rescaled sample size n θ1,∞ (n, p, α) = s log(p−(2−α)s) . Then as Negahban and Wainwright [8] show there is again a transition in probability of success from near zero to near one, at the rescaled sample size of θ1,∞ = (4 − 3α). Thus, for α < 2/3 (“less sharing”) Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 (“more sharing”) the ℓ1 /ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ1,∞ = (2 − α), which is strictly before either the Lasso or the ℓ1 /ℓ∞ regularized method except for the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and for α = 1, i.e. full sharing, where we match ℓ1 /ℓ∞ . Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details in the experiments Section 4). 3.1 Sufficient Conditions for Deterministic Designs We first consider the case where the design matrices X (k) for k = 1, · · ·, r are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X (k) ’s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10]. (k) A0 Column Normalization Xj 2 ≤ √ 2n for all j = 1, . . . , p, k = 1, . . . , r. ¯ Let Uk denote the support of the k-th column of Θ, and U = supports for each task. 
Then we require that k r A1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Xj , XUk (k) (k) XUk , XUk Uk denote the union of −1 c We will also find it useful to define γs := 1−max1≤k≤r maxj∈Uk (k) > 0. 1 k=1 (k) Xj , XUk Note that by the incoherence condition A1, we have γs > 0. A2 Eigenvalue Condition Cmin := min λmin 1≤k≤r A3 Boundedness Condition Dmax := max 1≤k≤r 1 (k) (k) XUk , XUk n 1 (k) (k) XUk , XUk n (k) (k) XUk , XUk −1 . 1 > 0. −1 ∞,1 < ∞. Further, we require the regularization penalties be set as λs > 2(2 − γs )σ log(pr) √ γs n and 4 λb > 2(2 − γb )σ log(pr) √ . γb n (2) 1 0.9 0.8 0.8 Dirty Model L1/Linf Reguralizer Probability of Success Probability of Success 1 0.9 0.7 0.6 0.5 0.4 LASSO 0.3 0.2 0 0.5 1 1.5 1.7 2 2.5 Control Parameter θ 3 3.1 3.5 0.6 0.5 0.4 L1/Linf Reguralizer 0.3 LASSO 0.2 p=128 p=256 p=512 0.1 Dirty Model 0.7 p=128 p=256 p=512 0.1 0 0.5 4 1 1.333 (a) α = 0.3 1.5 2 Control Parameter θ (b) α = 2.5 3 2 3 1 0.9 Dirty Model Probability of Success 0.8 0.7 L1/Linf Reguralizer 0.6 0.5 LASSO 0.4 0.3 0.2 p=128 p=256 p=512 0.1 0 0.5 1 1.2 1.5 1.6 2 Control Parameter θ 2.5 (c) α = 0.8 Figure 1: Probability of success in recovering the true signed support using dirty model, Lasso and ℓ1 /ℓ∞ regularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction α is plotted. As we can see in the regimes that Lasso is better than, as good as and worse than ℓ1 /ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires less number of observations for successful recovery of the true signed support compared to Lasso and ℓ1 /ℓ∞ regularizer. Here p s = ⌊ 10 ⌋ always. Theorem 1. Suppose A0-A3 hold, and that we obtain estimate Θ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c1 exp(−c2 n) → 1, we are guaranteed that the convex program (1) has a unique optimum and (a) The estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ 4σ 2 log (pr) + λs Dmax . n Cmin ≤ bmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that min ¯ (j,k)∈Supp(Θ) ¯(k) θj > bmin . Here the positive constants c1 , c2 depend only on γs , γb , λs , λb and σ, but are otherwise independent of n, p, r, the problem dimensions of interest. Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If in addition, we require that it have no false exclusions and that recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise. 3.2 General Gaussian Designs Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task (k) k = 1, . . . , r the design matrix X (k) ∈ Rn×p is such that each row Xi ∈ Rp is a zero-mean Gaussian random vector with covariance matrix Σ(k) ∈ Rp×p , and is independent of every other (k) row. Let ΣV,U ∈ R|V|×|U | be the submatrix of Σ(k) with rows corresponding to V and columns to U . We require these covariance matrices to satisfy the following conditions: r C1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Σj,Uk , ΣUk ,Uk k=1 5 −1 >0 1 C2 Eigenvalue Condition Cmin := min λmin Σ(k),Uk Uk > 0 so that the minimum eigenvalue 1≤k≤r is bounded away from zero. C3 Boundedness Condition Dmax := (k) ΣUk ,Uk −1 ∞,1 < ∞. 
3.2 General Gaussian Designs

Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task k = 1, . . . , r the design matrix X^(k) ∈ R^{n×p} is such that each row X_i^(k) ∈ R^p is a zero-mean Gaussian random vector with covariance matrix Σ^(k) ∈ R^{p×p}, and is independent of every other row. Let Σ_{V,U}^(k) ∈ R^{|V|×|U|} be the submatrix of Σ^(k) with rows corresponding to V and columns to U. We require these covariance matrices to satisfy the following conditions:

C1 Incoherence Condition: γ_b := 1 − max_{j∈U^c} Σ_{k=1}^r ‖Σ_{j,U_k}^(k) (Σ_{U_k,U_k}^(k))^{−1}‖_1 > 0.

C2 Eigenvalue Condition: C_min := min_{1≤k≤r} λ_min(Σ_{U_k,U_k}^(k)) > 0, so that the minimum eigenvalue is bounded away from zero.

C3 Boundedness Condition: D_max := max_{1≤k≤r} ‖(Σ_{U_k,U_k}^(k))^{−1}‖_{∞,1} < ∞.

These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrices of the (randomly generated) rows of the design matrices. Further, defining s := max_k |U_k|, we require the regularization penalties be set as

λ_s > √(4σ² C_min log(pr)) / (γ_s √(n C_min) − √(2s log(pr)))   and   λ_b > √(4σ² C_min r(r log 2 + log p)) / (γ_b √(n C_min) − √(2sr(r log 2 + log p))).   (3)

Theorem 2. Suppose assumptions C1–C3 hold, and that the number of samples scales as n > max{ 2s log(pr) / (C_min γ_s²), 2sr(r log 2 + log p) / (C_min γ_b²) }. Suppose we obtain the estimate Θ̂ from our algorithm with regularization parameters chosen according to (3). Then, with probability at least 1 − c_1 exp(−c_2(r log 2 + log p)) − c_3 exp(−c_4 log(rs)) → 1 for some positive constants c_1–c_4, we are guaranteed that the estimate Θ̂ is unique and satisfies the following conditions:

(a) The estimate Θ̂ has no false inclusions, and has bounded ℓ∞ norm error, so that
Supp(Θ̂) ⊆ Supp(Θ̄),   and   ‖Θ̂ − Θ̄‖_{∞,∞} ≤ √( 50σ² log(rs) / (n C_min) ) + λ_s ( √(4s) / (C_min √n) + D_max ) =: g_min.

(b) sign(Supp(Θ̂)) = sign(Supp(Θ̄)) provided that min_{(j,k)∈Supp(Θ̄)} |θ̄_j^(k)| > g_min.

3.3 Sharp Transition for 2-Task Gaussian Designs

This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish the precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrices have rows generated from the standard Gaussian distribution N(0, I_{p×p}), so that C1–C3 hold with C_min = D_max = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1/ℓ∞ block-regularization in all cases, except at the extreme endpoints of no support sharing (where it matches Lasso) and full support sharing (where it matches ℓ1/ℓ∞). We now present our analytical results; the empirical comparisons are presented in Section 4.

The results will be in terms of a particular rescaling of the sample size n,
θ(n, p, s, α) := n / ((2 − α) s log(p − (2 − α)s)).

We will also require the assumptions that

F1: λ_s > √( 4σ²(1 − s/n)(log r + log(p − (2 − α)s)) ) / ( √n − √s − √((2 − α) s (log r + log(p − (2 − α)s))) )

F2: λ_b > √( 4σ²(1 − s/n) r (r log 2 + log(p − (2 − α)s)) ) / ( √n − √s − √((1 − α/2) s r (r log 2 + log(p − (2 − α)s))) ).

Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N(0, I_{p×p}). Suppose max_{j∈B*} |θ*_j^(1) − θ*_j^(2)| = o(λ_s), where B* is the submatrix of Θ* consisting of the rows where both entries are non-zero. Then the estimate Θ̂ of problem (1) satisfies the following:

(Success) Suppose the regularization coefficients satisfy F1–F2, and assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c_1 exp(−c_2 n) for some positive constants c_1 and c_2, we are guaranteed that Θ̂ satisfies the support-recovery and ℓ∞ error bound conditions (a)–(b) in Theorem 2.

(Failure) If θ(n, p, s, α) < 1, there is no solution (B̂, Ŝ), for any choice of λ_s and λ_b, such that sign(Supp(Θ̂)) = sign(Supp(Θ̄)).

We note that we require the gap |θ*_j^(1) − θ*_j^(2)| to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even when the gap is large, the dependence of the sample scaling on the gap is quite weak.
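A minimal sketch of the rescaling θ and of the lower bounds in F1–F2, as reconstructed above, is given below; the choices σ = 1, r = 2 and the specific (p, s, α) values are illustrative assumptions.

```python
import math

def theta(n, p, s, alpha):
    # Rescaled sample size of Section 3.3: n / ((2 - alpha) s log(p - (2 - alpha) s)).
    return n / ((2 - alpha) * s * math.log(p - (2 - alpha) * s))

def n_critical(p, s, alpha):
    # Sample size at which theta = 1, i.e. the predicted location of the phase transition.
    return (2 - alpha) * s * math.log(p - (2 - alpha) * s)

def lambda_lower_bounds(n, p, s, alpha, sigma=1.0, r=2):
    """Lower bounds on (lambda_s, lambda_b) from F1-F2 as written above.
    Returns None when a denominator is non-positive, i.e. n is too small."""
    log_s = math.log(r) + math.log(p - (2 - alpha) * s)
    log_b = r * math.log(2) + math.log(p - (2 - alpha) * s)
    den_s = math.sqrt(n) - math.sqrt(s) - math.sqrt((2 - alpha) * s * log_s)
    den_b = math.sqrt(n) - math.sqrt(s) - math.sqrt((1 - alpha / 2) * s * r * log_b)
    if den_s <= 0 or den_b <= 0:
        return None
    lam_s = math.sqrt(4 * sigma**2 * (1 - s / n) * log_s) / den_s
    lam_b = math.sqrt(4 * sigma**2 * (1 - s / n) * r * log_b) / den_b
    return lam_s, lam_b

if __name__ == "__main__":
    p, alpha = 256, 0.3
    s = p // 10
    n = math.ceil(4 * n_critical(p, s, alpha))  # well above the transition, so F1-F2 are satisfiable
    print(f"theta = {theta(n, p, s, alpha):.2f}, lambda bounds = {lambda_lower_bounds(n, p, s, alpha)}")
```

The failure half of Theorem 3 says that whenever θ(n, p, s, α) < 1 no choice of the two penalties can recover the signed support, so the interesting regime for these bounds is θ > 1.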
4 Empirical Results

In this section, we investigate the performance of our dirty block-sparse estimator on synthetic and real-world data. The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1/ℓ∞ regularizer; we see that Theorem 3 is very accurate indeed. Next, we apply our method to a real-world dataset containing handwritten digits for classification, and again compare against LASSO and the ℓ1/ℓ∞ regularizer. On this real-world dataset, we show that the dirty model outperforms both LASSO and ℓ1/ℓ∞ in practice. For each method, the parameters are chosen via cross-validation; see the supplemental material for more details.

4.1 Synthetic Data Simulation

We consider a r = 2-task regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). The design matrices X have entries that are i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of (n, s, p, α), we generate 100 instances of the problem. In each instance, given p, s, α, the locations of the non-zero entries of the true Θ̄ are chosen at random; each non-zero entry is then drawn i.i.d. Gaussian with mean 0 and variance 1, and n samples are generated from this model. We then attempt to estimate Θ̄ using three methods: our dirty model, the ℓ1/ℓ∞ regularizer and LASSO. In each case, and for each instance, the regularization coefficients are chosen by cross-validation. After solving the three problems, we compare the signed support of each solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe this process in more detail below.

Performance Analysis: We ran the algorithm for three different values of the overlap ratio α ∈ {0.3, 2/3, 0.8} and three different numbers of features p ∈ {128, 256, 512}. For any instance of the problem (n, p, s, α), if the recovered matrix Θ̂ has the same signed support as the true Θ̄, we count it as a success, and otherwise as a failure (even if a single element has a different sign, we count it as a failure). As Theorem 3 predicts and Fig 1 shows, the right scaling for the number of observations is n / (s log(p − (2 − α)s)), where all curves stack on top of each other at 2 − α. Also, the number of observations required by the dirty model for true signed support recovery is always less than for both LASSO and the ℓ1/ℓ∞ regularizer. Fig 1(a) shows the probability of success for the case α = 0.3 (where LASSO is better than the ℓ1/ℓ∞ regularizer), and the dirty model outperforms both methods. When α = 2/3 (see Fig 1(b)), LASSO and the ℓ1/ℓ∞ regularizer perform the same, but the dirty model requires almost 33% fewer observations for the same performance. As α grows toward 1, e.g. α = 0.8 as shown in Fig 1(c), ℓ1/ℓ∞ performs better than LASSO. Still, the dirty model performs better than both methods in this case as well.
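A minimal sketch of this experimental protocol is given below: instance generation with overlap fraction α, the exact signed-support success criterion, and the 50%-crossing interpolation used in the scaling verification that follows. The three solvers themselves are assumed to be available separately, and the thresholding tolerance is our own choice.

```python
import numpy as np

def make_instance(n, p, s, alpha, sigma=1.0, rng=None):
    """One 2-task instance: s non-zeros per task, a fraction alpha of them on shared rows.
    The noise level sigma is an assumption (Theorem 3 treats sigma as given)."""
    rng = np.random.default_rng() if rng is None else rng
    n_shared = int(round(alpha * s))
    rows = rng.choice(p, size=2 * s - n_shared, replace=False)
    shared, own1, own2 = rows[:n_shared], rows[n_shared:s], rows[s:]
    theta_true = np.zeros((p, 2))
    theta_true[np.concatenate([shared, own1]), 0] = rng.standard_normal(s)
    theta_true[np.concatenate([shared, own2]), 1] = rng.standard_normal(s)
    X = rng.standard_normal((2, n, p))
    Y = np.stack([X[k] @ theta_true[:, k] + sigma * rng.standard_normal(n)
                  for k in range(2)], axis=1)
    return X, Y, theta_true

def signed_support_recovered(theta_hat, theta_true, tol=1e-3):
    # Success only if every entry's sign matches exactly (zeros included).
    thresholded = np.where(np.abs(theta_hat) > tol, theta_hat, 0.0)
    return np.array_equal(np.sign(thresholded), np.sign(theta_true))

def transition_threshold(ns, success_rates):
    """Sample size at which the empirical success rate crosses 50%, by linear
    interpolation between the two closest grid points."""
    ns = np.asarray(ns, float)
    rates = np.asarray(success_rates, float)
    idx = np.nonzero(rates >= 0.5)[0]
    if len(idx) == 0:
        return float("nan")        # never reached 50% on this grid
    i = idx[0]
    if i == 0:
        return float(ns[0])        # already above 50% at the smallest n
    f = (0.5 - rates[i - 1]) / (rates[i] - rates[i - 1])
    return float(ns[i - 1] + f * (ns[i] - ns[i - 1]))
```

In the experiments, each (n, p, s, α) cell would average signed_support_recovered over 100 instances for each of the three estimators, and transition_threshold would then be applied to the resulting success curves.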
[Figure 2: phase transition threshold versus the shared-support parameter α for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer, for p ∈ {128, 256, 512}.]

Figure 2: Verification of the result of Theorem 3 on the behavior of the phase transition threshold as the parameter α is varied in a 2-task (n, p, s, α) problem, for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. The y-axis is n / (s log(p − (2 − α)s)), where n is the number of samples at which the threshold was observed. Here s = ⌊p/10⌋. Our dirty model shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated.

n = 10                              Our Model            ℓ1/ℓ∞    LASSO
  Average Classification Error     8.6%                 9.9%     10.8%
  Variance of Error                0.53%                0.64%    0.51%
  Average Row Support Size         B: 165; B+S: 171     170      123
  Average Support Size             S: 18; B+S: 1651     1700     539

n = 20                              Our Model            ℓ1/ℓ∞    LASSO
  Average Classification Error     3.0%                 3.5%     4.1%
  Variance of Error                0.56%                0.62%    0.68%
  Average Row Support Size         B: 211; B+S: 226     217      173
  Average Support Size             S: 34; B+S: 2118     2165     821

n = 40                              Our Model            ℓ1/ℓ∞    LASSO
  Average Classification Error     2.2%                 3.2%     2.8%
  Variance of Error                0.57%                0.68%    0.85%
  Average Row Support Size         B: 270; B+S: 299     368      354
  Average Support Size             S: 67; B+S: 2761     3669     2053

Table 1: Handwriting classification results for our model, ℓ1/ℓ∞ and LASSO.

Scaling Verification: To verify that the phase transition threshold changes linearly with α, as predicted by Theorem 3, we plot the phase transition threshold versus α. For five different values of α ∈ {0.05, 0.3, 2/3, 0.8, 0.95} and three different values of p ∈ {128, 256, 512}, we find the phase transition threshold for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. We take as the phase transition threshold the point where the probability of success in recovering the signed support exceeds 50%, and find this point by interpolation on the closest two points. Fig 2 shows that the phase transition threshold for the dirty model is always lower than the phase transition thresholds for LASSO and the ℓ1/ℓ∞ regularizer.

4.2 Handwritten Digits Dataset

We use the handwritten digit dataset [1], containing features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable benchmark for handwriting recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes n of the training set. We measure the classification error for each digit to get the 10-vector of errors; we then report the average and the variance of this error vector to show how the error is distributed over all tasks. We compare our method with the ℓ1/ℓ∞-regularized method and LASSO. Again, in all methods, the parameters are chosen via cross-validation. For our method, we separate out the B and S matrices that it finds, so as to illustrate how many features it identifies as “shared” and how many as “non-shared”. For the other methods we simply report the overall row-support and support sizes, since they do not make such a separation.
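A minimal sketch of how the quantities reported in Table 1 can be computed from the fitted matrices is given below; the thresholding tolerance and the exact reading of the support statistics are our own assumptions.

```python
import numpy as np

def support_stats(B_hat, S_hat, tol=1e-6):
    """Support statistics in the style of Table 1, for p x r estimates B_hat (shared)
    and S_hat (non-shared); entries below tol are treated as zero."""
    B_nz = np.abs(B_hat) > tol
    S_nz = np.abs(S_hat) > tol
    T_nz = np.abs(B_hat + S_hat) > tol
    return {
        "row support (B)": int(B_nz.any(axis=1).sum()),
        "row support (B+S)": int(T_nz.any(axis=1).sum()),
        "support (S)": int(S_nz.sum()),
        "support (B+S)": int(T_nz.sum()),
    }

def error_summary(per_task_errors):
    # Average and variance of the r-vector of per-task classification errors.
    e = np.asarray(per_task_errors, dtype=float)
    return float(e.mean()), float(e.var())
```

For LASSO and the ℓ1/ℓ∞ regularizer, which return a single coefficient matrix, only the overall row-support and support counts apply.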
Acknowledgements

We acknowledge support from NSF grant IIS-101842 and the NSF CAREER program, grant 0954059.

References

[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, special issue on “Sparse approximations in signal and image processing”, 86:572–602, 2006.
[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.
[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
5 0.53493989 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HR_n² + √(HL*) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L*RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
6 0.53358936 63 nips-2010-Distributed Dual Averaging In Networks
7 0.53224218 282 nips-2010-Variable margin losses for classifier design
8 0.53184372 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
9 0.53029311 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
10 0.53017336 290 nips-2010-t-logistic regression
11 0.5293566 27 nips-2010-Agnostic Active Learning Without Constraints
12 0.52932978 155 nips-2010-Learning the context of a category
13 0.52913332 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
14 0.52834946 1 nips-2010-(RF)^2 -- Random Forest Random Field
15 0.52783716 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
16 0.52776372 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
17 0.52773213 287 nips-2010-Worst-Case Linear Discriminant Analysis
18 0.52772224 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
19 0.5276202 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
20 0.52733254 280 nips-2010-Unsupervised Kernel Dimension Reduction