nips nips2007 nips2007-15 knowledge-graph by maker-knowledge-mining

15 nips-2007-A general agnostic active learning algorithm


Source: pdf

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A general agnostic active learning algorithm Sanjoy Dasgupta UC San Diego dasgupta@cs. [sent-1, score-0.48]

2 Abstract We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. [sent-7, score-0.614]

3 Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. [sent-8, score-0.206]

4 Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. [sent-9, score-0.596]

5 We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. [sent-10, score-0.596]

6 Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. [sent-11, score-0.356]

7 An active learner is given unlabeled data and must pay to view any label. [sent-16, score-0.283]

8 The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. [sent-17, score-0.135]

9 In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning—one due to Cohn, Atlas, and Ladner [1]. [sent-19, score-0.382]

10 The scheme of [1] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. [sent-20, score-0.376]

11 For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. [sent-21, score-0.152]

12 The learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the ⊕s and ⊖s on opposite sides has the seventh point on the same side as the ⊖s. [sent-23, score-0.434]

13 Put another way, the point is not in the region of uncertainty [1], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data. [sent-24, score-0.271]

14 Although very elegant and intuitive, this approach to active learning faces two problems: 1. [sent-25, score-0.164]

15 We provide a simple generalization of the selective sampling scheme of [1] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error. [sent-30, score-0.932]

16 In the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually still uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. [sent-31, score-0.875]

17 On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown below, chances are its label is not needed. [sent-33, score-0.219]

18 Now, somewhat counter-intuitively, the labels in S are completely reliable, whereas the labels in T could be inconsistent with the best separator. [sent-35, score-0.19]

19 To decide if we are uncertain about the label of a new point x, we reduce to a supervised learning task: for each possible label ŷ ∈ {±1}, we learn a hypothesis hŷ consistent with the labels in S ∪ {(x, ŷ)} and with minimal empirical error on T. [sent-36, score-0.754]

20 If, say, the error of the hypothesis h+1 is much larger than that of h−1 , we can safely infer that the best separator must also label x with −1 without requesting a label; if the error difference is only modest, we explicitly request a label. [sent-37, score-0.515]

21 Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample T might not be i.i.d. [sent-46, score-0.254]

22 A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less). [sent-54, score-0.616]

23 An important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. [sent-55, score-0.329]

24 Our algorithm magnifies this small polynomial difference in the bound into an exponential difference in label complexity, so it is crucial for us to use a good bound. [sent-59, score-0.215]

25 We use a normalized bound that takes into account the empirical error (computed on S∪T ) of the hypothesis in question. [sent-60, score-0.217]

26 In this paper, we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. [sent-61, score-0.614]

27 It extends the selective sampling scheme of Cohn et al. [sent-62, score-0.139]

28 [1] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. [sent-63, score-0.372]

29 For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. [sent-64, score-0.476]

30 1 Related work Our algorithm extends the selective sampling scheme of Cohn et al. [sent-67, score-0.16]

31 Most previous work on active learning either makes strong distributional assumptions (e. [sent-69, score-0.186]

32 A natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most ν + ε (where ν is the error of the best hypothesis in the specified class) using as few labels as possible. [sent-73, score-0.908]

33 A basic constraint on the label complexity was pointed out by Kääriäinen [11], who showed that for any ν ∈ (0, 1/2), there are data distributions that force any active learner that achieves error at most ν + ε to request Ω((ν/ε)²) labels. [sent-74, score-0.609]

34 The first rigorously-analyzed agnostic active learning algorithm, called A2 , was developed recently by Balcan, Beygelzimer, and Langford [9]. [sent-75, score-0.459]

35 Like Cohn-Atlas-Ladner [1], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and A2 ends up explicitly maintaining an ε-net of the hypothesis space. [sent-76, score-0.155]

36 Subsequently, Hanneke [12] characterized the label complexity of the A2 algorithm in terms of a parameter called the disagreement coefficient. [sent-77, score-0.362]

37 Our algorithm overcomes their complications by employing reductions to supervised learning. [sent-79, score-0.131]

38 We bound the label complexity of our method in terms of the same parameter as used for A2 [12], and get a somewhat better dependence (linear rather than quadratic). [sent-80, score-0.276]

39 In our active learning model, the learner receives unlabeled data sampled from DX ; for any sampled point x, it can optionally request the label y sampled from the conditional distribution at x. [sent-84, score-0.558]

40 This process can be viewed as sampling (x, y) from D and revealing only x to the learner, keeping the label y hidden unless the learner explicitly requests it. [sent-85, score-0.386]

41 The error of a hypothesis h under D is errD(h) = Pr(x,y)∼D[h(x) ≠ y], and on a finite sample Z ⊂ X × {±1}, the empirical error of h is err(h, Z) = (1/|Z|) Σ(x,y)∈Z 1[h(x) ≠ y], where 1[·] is the 0-1 indicator function. [sent-86, score-0.224]

42 We assume for simplicity that the minimal error ν = inf{errD(h) : h ∈ H} is achieved by a hypothesis h∗ ∈ H. [sent-87, score-0.148]

43 Disagreement coefficient We will bound the label complexity of our algorithm in terms of (a slight variation of) the disagreement coefficient θ introduced in [12] for analyzing the label complexity of A2 . [sent-99, score-0.607]

44 The quantity θ bounds the rate at which the disagreement mass of the ball B(h∗ , r) – the probability mass of points on which hypotheses in B(h∗ , r) disagree with h∗ – grows as a function of the radius r. [sent-105, score-0.232]

45 Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for A2 [10]. [sent-108, score-0.387]

46 Else request yn ; Sn ← Sn−1 and Tn ← Tn−1 ∪ {(xn , yn )}. [sent-124, score-0.19]

47 3 Agnostic selective sampling Here we state and analyze our general algorithm for agnostic active learning. [sent-130, score-0.568]

48 The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors. [sent-131, score-0.242]

49 1 A general algorithm for agnostic active learning Figure 1 states our algorithm in full generality. [sent-133, score-0.501]

50 For S, T ⊂ X × {±1}, let LEARNH(S, T) denote a supervised learner that returns a hypothesis h ∈ H consistent with S, and with minimum error on T. [sent-137, score-0.348]

51 Upon receiving xn, it learns two hypotheses, hŷ = LEARNH(S ∪ {(xn, ŷ)}, T) for ŷ ∈ {±1}, and then compares their empirical errors on S ∪ T. [sent-139, score-0.169]

52 If the difference is large enough, it is possible to infer how h∗ labels xn (as we show in Lemma 3). [sent-140, score-0.139]

53 Otherwise, the algorithm requests the label yn and adds (xn , yn ) to T . [sent-142, score-0.396]

54 Thus, S contains examples with inferred labels consistent with h∗ , and T contains examples with their requested labels. [sent-143, score-0.315]

55 Because h∗ might err on some examples in T , we just insist that LEARNH find a hypothesis with minimal error on T . [sent-144, score-0.378]

56 Meanwhile, by construction, h∗ is consistent with S, so we require LEARNH to only consider hypotheses consistent with S. [sent-145, score-0.125]

57 2 Bounds for error differences We still need to specify ∆n , the threshold value for error differences that determines whether the algorithm requests a label or not. [sent-147, score-0.38]

58 Let S′n be the set of labeled examples identical to those in Sn, except with the true hidden labels swapped in. [sent-153, score-0.15]

59 Define err′n(h) = err(h, S′n ∪ Tn) and errn(h) = err(h, Sn ∪ Tn). [sent-161, score-0.633]

60 If LEARNH cannot find a hypothesis consistent with S ∪ {(xn , y)} for some y, then it is clear that h∗ (x) = −y. [sent-163, score-0.154]

61 Such bounds apply to err′n(h), but we cannot use them algorithmically: we do not request the true labels for points in Sn and thus cannot reliably compute err′n(h). [sent-169, score-0.256]

62 Then, with probability at least 1 − δ, for all n ≥ 1 and all (h, h′) ∈ H × H consistent with Sn, we have errn(h) − errn(h′) ≤ errD(h) − errD(h′) + βn² + βn(√errn(h) + √errn(h′)). [sent-188, score-2.593]

63 This follows because errn(h) − errn(h′) = err′n(h) − err′n(h′); and because g⁺h,h′(x, y) ≤ 1[h(x) ≠ y] and g⁻h,h′(x, y) ≤ 1[h′(x) ≠ y] for (h, h′) consistent with Sn, so ESn∪Tn[g⁺h,h′] ≤ errn(h) and ESn∪Tn[g⁻h,h′] ≤ errn(h′). [sent-195, score-3.204]

64 Corollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on Sn ∪ Tn, even though Sn ∪ Tn is not an i.i.d. sample. [sent-198, score-0.163]

65 Set ∆n = βn² + βn(√errn(h+1) + √errn(h−1)) (1), where βn = √((4/n) ln(8(n² + n)S(H, 2n)²/δ)) = Õ(√(d log n/n)), as per Corollary 1. [sent-203, score-1.285]

66 With probability at least 1 − δ, the hypothesis h∗ = arg inf h∈H errD (h) is consistent with Sn for all n ≥ 0 in Algorithm 1. [sent-206, score-0.225]

67 Suppose upon receiving xn+1 , we discover errn (h+1 ) − errn (h−1 ) > ∆n . [sent-211, score-1.284]

68 We know that errn(h∗) ≥ errn(h+1) (by the inductive hypothesis) and errn(h+1) − errn(h−1) > βn² + βn(√errn(h+1) + √errn(h−1)). [sent-214, score-3.798]

69 Therefore, errn(h∗) − errn(h−1) = (errn(h∗) − errn(h+1)) + (errn(h+1) − errn(h−1)) > (errn(h∗) − errn(h+1)) + βn² + βn(√errn(h+1) + √errn(h−1)) ≥ βn(√errn(h∗) − √errn(h+1)) + βn² + βn(√errn(h+1) + √errn(h−1)) = βn² + βn(√errn(h∗) + √errn(h−1)). [sent-216, score-8.862]

70 If Algorithm 1 is given a stream of m unlabeled examples, then with probability at least 1 − δ, the algorithm returns a hypothesis with error at most ν + c · ((1/m)(d log m + log(1/δ)) + √((ν/m)(d log m + log(1/δ)))). [sent-221, score-0.355]

71 Using the same bounds from Corollary 1 (already applied in Lemma 3) on h∗ and hf, together with the fact that errm(hf) ≤ errm(h∗), we have errD(hf) ≤ ν + βm² + βm√ν + βm√errD(hf), which in turn implies errD(hf) ≤ ν + 3βm² + 2βm√ν. [sent-224, score-0.217]

72 So, Algorithm 1 returns a hypothesis with error at most ν + ε when m = Õ((d/ε)(1 + ν/ε)); this is (asymptotically) the usual sample complexity of supervised learning. [sent-225, score-0.314]

73 Since the algorithm requests at most m labels, its label complexity is always at most Õ((d/ε)(1 + ν/ε)). [sent-226, score-0.365]

74 4 Label complexity analysis We can also bound the label complexity of our algorithm in terms of the disagreement coefficient θ. [sent-228, score-0.44]

75 The key to deriving our label complexity bounds based on θ is noting that the probability of requesting the (n + 1)th label is intimately related to θ and ∆n (see [10] for the complete proof). [sent-230, score-0.474]

76 Then, the probability that Algorithm 1 requests the label yn+1 is Prxn+1∼DX[Request yn+1] ≤ c · θ · (ν + βn), where θ = θ(D, H, 3βm² + 2βm√ν) is the disagreement coefficient, ν = errD(h∗), and βn = Õ(√(d log n/n)) is as defined in Corollary 1. [sent-234, score-0.435]

77 Now we give our main label complexity bound for agnostic active learning. [sent-235, score-0.704]

78 If ν ≤ (c2 − 1)βm, Algorithm 1 returns a hypothesis with error as bounded in Theorem 1, and the expected number of labels requested is at most 1 + c1 c2 θ · (d log² m + log(1/δ) log m). [sent-239, score-0.462]

79 Else, the same holds except the expected number of labels requested is at most 1 + c1 θ · (νm + d log² m + log(1/δ) log m). [sent-241, score-0.272]

80 Furthermore, if L is the expected number of labels requested as per above, then with probability at least 1 − δ′, the algorithm requests no more than L + √(3L log(1/δ′)) labels. [sent-242, score-0.403]

81 For an illustrative consequence of this, suppose DX is the uniform distribution on the sphere. (Footnote 4: It may be possible to reduce A2's quadratic dependence to a linear dependence by using normalized bounds, as we do here.) [sent-248, score-0.131]

82 The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. [sent-250, score-0.234]

83 The top histogram shows the locations of the first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. [sent-269, score-0.46]

84 The first 200 requests occurred at the ×s, the next 200 at the ▽s, and the final 109 at the s. [sent-273, score-0.126]

85 Then the label complexity of A2 depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is d^(3/2). [sent-275, score-0.292]

86 4 Experiments We implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. [sent-276, score-0.218]

87 For this hypothesis class, we used two different noise models, each of which ensured inf h∈H errD (h) = errD (h∗ ) = ν for a prespecified ν ∈ [0, 1]. [sent-281, score-0.189]

88 The results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen (m) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2 (a & b); we verified the polynomial vs. [sent-288, score-0.284]

89 In the case of intervals, we observe an initial phase (of duration roughly ∝ 1/p+ ) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [4,12]. [sent-290, score-0.167]

90 These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen. [sent-291, score-0.341]

91 As a sanity check, we examined the locations of data for which Algorithm 1 requested a label. [sent-292, score-0.148]

92 Figure 2 (c & d) shows that, early on, labels were requested everywhere. [sent-299, score-0.234]

93 But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis. [sent-300, score-0.314]

94 7 5 Conclusion and future work We have presented a simple and natural approach to agnostic active learning. [sent-301, score-0.459]

95 We prescribe a very simple and natural choice for ∆n – a normalized generalization bound from supervised learning – but one could hope for a more clever or aggressive choice, akin to those in [6] for linear separators. [sent-307, score-0.145]

96 In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [16]). [sent-309, score-0.164]

97 On the other hand, agnostic learning suffers from severe computational intractability for many hypothesis classes (e. [sent-310, score-0.41]

98 [17]), and of course, agnostic active learning is at least as hard in the worst case. [sent-312, score-0.481]

99 A bound on the label complexity of agnostic active learning. [sent-385, score-0.704]

100 Learning with online constraints: shifting concepts and active learning. [sent-389, score-0.164]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('errn', 0.633), ('errd', 0.309), ('agnostic', 0.295), ('err', 0.209), ('sn', 0.182), ('label', 0.167), ('active', 0.164), ('tn', 0.158), ('requested', 0.148), ('requests', 0.126), ('disagreement', 0.123), ('hypothesis', 0.115), ('request', 0.108), ('learnh', 0.108), ('hf', 0.093), ('dx', 0.087), ('labels', 0.086), ('ez', 0.082), ('coe', 0.079), ('supervised', 0.071), ('learner', 0.067), ('di', 0.066), ('cohn', 0.065), ('bounds', 0.062), ('selective', 0.062), ('corollary', 0.057), ('lemma', 0.054), ('hy', 0.054), ('xn', 0.053), ('unlabeled', 0.052), ('stream', 0.051), ('erences', 0.051), ('complexity', 0.051), ('inf', 0.049), ('hypotheses', 0.047), ('seventh', 0.046), ('vcdim', 0.046), ('erence', 0.043), ('misclassi', 0.043), ('labeled', 0.043), ('yn', 0.041), ('prx', 0.04), ('reductions', 0.039), ('dasgupta', 0.039), ('consistent', 0.039), ('separators', 0.037), ('atlas', 0.034), ('vc', 0.034), ('intervals', 0.034), ('error', 0.033), ('separator', 0.032), ('scheme', 0.032), ('dependence', 0.031), ('errm', 0.031), ('esn', 0.031), ('ladner', 0.031), ('balcan', 0.031), ('diego', 0.031), ('subtle', 0.03), ('colt', 0.027), ('generalization', 0.027), ('castro', 0.027), ('ectively', 0.027), ('requesting', 0.027), ('bound', 0.027), ('uniform', 0.026), ('sampling', 0.026), ('noise', 0.025), ('beygelzimer', 0.025), ('returns', 0.023), ('sphere', 0.023), ('improvements', 0.023), ('empirical', 0.022), ('errors', 0.022), ('distributional', 0.022), ('hsu', 0.022), ('least', 0.022), ('algorithm', 0.021), ('uc', 0.021), ('sample', 0.021), ('examples', 0.021), ('ln', 0.021), ('san', 0.02), ('else', 0.02), ('normalized', 0.02), ('dotted', 0.019), ('log', 0.019), ('hardness', 0.019), ('sm', 0.019), ('aa', 0.019), ('separability', 0.019), ('bounded', 0.019), ('uncertainty', 0.019), ('extends', 0.019), ('correctness', 0.018), ('erent', 0.018), ('inconsistent', 0.018), ('receiving', 0.018), ('vapnik', 0.018), ('independently', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

2 0.10898753 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

3 0.073901296 136 nips-2007-Multiple-Instance Active Learning

Author: Burr Settles, Mark Craven, Soumya Ray

Abstract: We present a framework for active learning in the multiple-instance (MI) setting. In an MI learning problem, instances are naturally organized into bags and it is the bags, instead of individual instances, that are labeled for training. MI learners assume that every instance in a bag labeled negative is actually negative, whereas at least one instance in a bag labeled positive is actually positive. We consider the particular case in which an MI learner is allowed to selectively query unlabeled instances from positive bags. This approach is well motivated in domains in which it is inexpensive to acquire bag labels and possible, but expensive, to acquire instance labels. We describe a method for learning from labels at mixed levels of granularity, and introduce two active query selection strategies motivated by the MI setting. Our experiments show that learning from instance labels can significantly improve performance of a basic MI learning algorithm in two multiple-instance domains: content-based image retrieval and text classification. 1

4 0.072799392 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

Author: Venkat Chandrasekaran, Alan S. Willsky, Jason K. Johnson

Abstract: We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models. 1

5 0.072021589 59 nips-2007-Continuous Time Particle Filtering for fMRI

Author: Lawrence Murray, Amos J. Storkey

Abstract: We construct a biologically motivated stochastic differential model of the neural and hemodynamic activity underlying the observed Blood Oxygen Level Dependent (BOLD) signal in Functional Magnetic Resonance Imaging (fMRI). The model poses a difficult parameter estimation problem, both theoretically due to the nonlinearity and divergence of the differential system, and computationally due to its time and space complexity. We adapt a particle filter and smoother to the task, and discuss some of the practical approaches used to tackle the difficulties, including use of sparse matrices and parallelisation. Results demonstrate the tractability of the approach in its application to an effective connectivity study. 1

6 0.063838892 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

7 0.063554265 110 nips-2007-Learning Bounds for Domain Adaptation

8 0.060447186 32 nips-2007-Bayesian Co-Training

9 0.05827501 195 nips-2007-The Generalized FITC Approximation

10 0.056006305 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

11 0.053466298 166 nips-2007-Regularized Boost for Semi-Supervised Learning

12 0.053350791 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

13 0.052512079 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

14 0.051369067 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

15 0.048438173 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

16 0.04428466 39 nips-2007-Boosting the Area under the ROC Curve

17 0.044126172 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

18 0.043116406 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

19 0.043109655 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

20 0.042796031 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, 0.01), (2, -0.052), (3, 0.069), (4, 0.043), (5, 0.061), (6, 0.031), (7, -0.064), (8, 0.008), (9, -0.046), (10, 0.031), (11, 0.043), (12, -0.082), (13, -0.06), (14, 0.042), (15, 0.008), (16, 0.039), (17, 0.015), (18, 0.037), (19, -0.065), (20, 0.102), (21, 0.003), (22, 0.125), (23, -0.134), (24, 0.066), (25, 0.017), (26, 0.026), (27, -0.015), (28, 0.025), (29, -0.146), (30, -0.104), (31, 0.027), (32, 0.058), (33, 0.108), (34, -0.072), (35, 0.02), (36, 0.09), (37, -0.02), (38, -0.026), (39, 0.075), (40, 0.061), (41, -0.05), (42, 0.066), (43, 0.005), (44, -0.037), (45, 0.028), (46, 0.082), (47, -0.008), (48, 0.157), (49, 0.214)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93888557 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

2 0.57729119 136 nips-2007-Multiple-Instance Active Learning

Author: Burr Settles, Mark Craven, Soumya Ray

Abstract: We present a framework for active learning in the multiple-instance (MI) setting. In an MI learning problem, instances are naturally organized into bags and it is the bags, instead of individual instances, that are labeled for training. MI learners assume that every instance in a bag labeled negative is actually negative, whereas at least one instance in a bag labeled positive is actually positive. We consider the particular case in which an MI learner is allowed to selectively query unlabeled instances from positive bags. This approach is well motivated in domains in which it is inexpensive to acquire bag labels and possible, but expensive, to acquire instance labels. We describe a method for learning from labels at mixed levels of granularity, and introduce two active query selection strategies motivated by the MI setting. Our experiments show that learning from instance labels can significantly improve performance of a basic MI learning algorithm in two multiple-instance domains: content-based image retrieval and text classification. 1

3 0.55500388 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

4 0.5034492 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

Author: Venkat Chandrasekaran, Alan S. Willsky, Jason K. Johnson

Abstract: We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models. 1

5 0.44246408 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

6 0.43373156 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

7 0.42404324 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

8 0.40843377 159 nips-2007-Progressive mixture rules are deviation suboptimal

9 0.34294429 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

10 0.34230509 77 nips-2007-Efficient Inference for Distributions on Permutations

11 0.33205494 59 nips-2007-Continuous Time Particle Filtering for fMRI

12 0.33176762 110 nips-2007-Learning Bounds for Domain Adaptation

13 0.32901597 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

14 0.3227331 19 nips-2007-Active Preference Learning with Discrete Choice Data

15 0.3043707 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

16 0.3035675 195 nips-2007-The Generalized FITC Approximation

17 0.27729648 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

18 0.27549219 26 nips-2007-An online Hebbian learning rule that performs Independent Component Analysis

19 0.27300912 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

20 0.2679989 18 nips-2007-A probabilistic model for generating realistic lip movements from speech


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (13, 0.021), (16, 0.022), (18, 0.013), (21, 0.096), (28, 0.326), (34, 0.025), (35, 0.04), (46, 0.02), (47, 0.101), (83, 0.127), (85, 0.012), (90, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71536952 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

2 0.71033639 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

3 0.6401819 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1
For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. 
The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. 
∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. 
It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). 
This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. 
3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding, since they are under the control of the user of the algorithm. However, the choice of these function sets greatly influences the performance of the algorithm, as we shall see from the bounds.

The first assumption concerns the class F:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that every element of F is uniformly Lipschitz in its action argument, in the sense that |Q(x, a) − Q(x, a')| ≤ L_A ‖a − a'‖_1 holds for any x ∈ X, a, a' ∈ A, and Q ∈ F.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension (readers not familiar with it may consult a book such as the one by Anthony and Bartlett [14]). Here we use the pseudo-dimension of function sets, which builds upon the concept of VC-dimension:

Definition 3.1 (Pseudo-dimension) The pseudo-dimension V_{F+} of F is defined as the VC-dimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F).

Since A is multidimensional, we define V_{Π+} to be the sum of the pseudo-dimensions of the coordinate projection spaces Π_k of Π:

  V_{Π+} := Σ_{k=1}^{d_A} V_{Π_k^+},   Π_k := { π_k : X → R : π = (π_1, ..., π_k, ..., π_{d_A}) ∈ Π }.

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Q_max) for some Q_max > 0 and V_{F+} < +∞. Also, A ⊂ [−A_∞, A_∞]^{d_A} and V_{Π+} < +∞.

Besides their capacity, one must also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce

  e*(F, Π) := sup_{Q∈F} inf_{π∈Π} ν(E Q − E^π Q).

Note that inf_{π∈Π} ν(E Q − E^π Q) measures the quality of approximating ν E Q by ν E^π Q. Hence, e*(F, Π) measures the worst-case approximation error of ν E Q as Q is changed within F. This can be made small by choosing Π large.

Another related quantity is the one-step Bellman error of F w.r.t. Π. For a fixed policy π, the one-step Bellman error of F w.r.t. T^π is defined as

  E_1(F; π) := sup_{Q∈F} inf_{Q'∈F} ‖Q' − T^π Q‖_ν.

Taking again a pessimistic approach, the one-step Bellman error of F w.r.t. Π is defined as

  E_1(F, Π) := sup_{π∈Π} E_1(F; π).

Typically, by increasing F, E_1(F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger increases their capacity (pseudo-dimensions), which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just as in supervised learning.

3.3 The main result

Theorem 3.2. Let π_K be a greedy policy w.r.t. Q_K, i.e., π_K(x) ∈ argmax_{a∈A} Q_K(x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have, with probability at least 1 − δ: given Assumption A3 (respectively A4), ‖V* − V^{π_K}‖_∞ (resp. ‖V* − V^{π_K}‖_{1,ρ}) is bounded by

  C [ E_1(F, Π) + e*(F, Π) + ( (log N + log(K/δ)) / N^{1/4} )^{ (1/(d_A+1)) · ((κ+1)/(4κ)) } ] + γ^K ,

where C depends on d_A, V_{F+}, (V_{Π_k^+})_{k=1}^{d_A}, γ, κ, b, β, C_ν (resp. C_{ρ,ν}), Γ_ν, L_A, L_P, L_r, α, λ(A), π_0, Q_max, R_max, R̂_max, and A_∞. In particular, C scales with V^{(κ+1)/(4κ(d_A+1))}, where V = 2V_{F+} + V_{Π+} plays the role of the "combined effective" dimension of F and Π.
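The approximation-power term e*(F, Π) appearing in the bound is straightforward to estimate numerically once F and Π are small finite collections. The sketch below is an editorial illustration, not from the paper: it assumes the standard reading (E Q)(x) = sup_a Q(x, a) and (E^π Q)(x) = Q(x, π(x)), and the candidate functions, policies, and sampling distribution are made up, so only the order of the computation is meaningful.

```python
import numpy as np

# Monte Carlo estimate of e*(F, Pi) = sup_{Q in F} inf_{pi in Pi} nu(EQ - E^pi Q)
# for tiny finite stand-ins of F and Pi; the sup and inf are exact enumerations
# and only nu(.) is approximated by sampling states from nu.
rng = np.random.default_rng(1)
xs = rng.uniform(-1, 1, size=5000)            # sample from the reference distribution nu
acts = np.linspace(-1, 1, 201)                # fine grid standing in for the action set A

F  = [lambda x, a, c=c: -(a - c * x) ** 2 for c in (-1.0, -0.5, 0.0, 0.5, 1.0)]
Pi = [lambda x, t=t: np.clip(t * x, -1.0, 1.0) for t in (-1.0, 0.0, 0.5, 1.0)]

def nu_gap(Q, pi):
    """Monte Carlo estimate of nu(EQ - E^pi Q) over the sampled states."""
    EQ   = np.max(Q(xs[:, None], acts[None, :]), axis=1)   # sup over the action grid
    EpiQ = Q(xs, pi(xs))
    return float(np.mean(EQ - EpiQ))

e_star = max(min(nu_gap(Q, pi) for pi in Pi) for Q in F)
print("estimated e*(F, Pi) ~", round(e_star, 4))
```

Enlarging Pi drives this quantity down, which is exactly the trade-off against V_{Π+} discussed before the theorem.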
4 Discussion

We have presented what we believe are the first finite-time bounds for continuous-state and action-space RL that use value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and, at the same time, to point at possible problem sources. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., by assuming uniform Lipschitzness w.r.t. the state variables), or a policy-search step is needed, just as in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound for a Bellman-residual minimization algorithm [2]. It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, such as the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári greatly acknowledges the support received from the Alberta Ingenuity Fund, NSERC, and the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330–337, 2007.
[4] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.
[8] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995.
[9] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.
[10] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[13] P. L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.

4 0.5183326 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration.

5 0.51773232 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity.

6 0.51455736 156 nips-2007-Predictive Matrix-Variate t Models

7 0.51447809 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

8 0.51372582 158 nips-2007-Probabilistic Matrix Factorization

9 0.51189899 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

10 0.5112012 86 nips-2007-Exponential Family Predictive Representations of State

11 0.5111953 69 nips-2007-Discriminative Batch Mode Active Learning

12 0.51104897 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

13 0.50929296 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

14 0.50854343 63 nips-2007-Convex Relaxations of Latent Variable Training

15 0.50833118 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

16 0.50822473 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

17 0.50707251 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

18 0.50649595 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

19 0.50645518 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

20 0.50575006 175 nips-2007-Semi-Supervised Multitask Learning