jmlr jmlr2009 jmlr2009-9 knowledge-graph by maker-knowledge-mining

9 jmlr-2009-Analysis of Perceptron-Based Active Learning


Source: pdf

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. [sent-7, score-0.232]

2 This distinction is not captured in standard models of supervised learning, and has motivated the field of active learning, in which the labels of data points are initially hidden, and the learner must pay for each label it wishes revealed. [sent-12, score-0.263]

3 The paper showed that if the data is drawn uniformly from the surface of the unit sphere in Rd , and the hidden labels correspond … [sent-19, score-0.263]

4 (An Ω(d log(1/ε)) label complexity can be seen to be optimal by counting the number of spherical caps of radius ε that can be packed onto the surface of the unit sphere in Rd .) [sent-23, score-0.312]

5 In this paper, we show how a simple modification of the perceptron update can be used to achieve the same sample complexity bounds (within Õ factors), under the same streaming model and the same uniform input distribution. [sent-25, score-0.267]

6 If label yt is requested: Update step: Set vt+1 based on vt , xt , yt . [sent-37, score-1.183]

7 The regular perceptron update, whose convergence behavior was first analyzed by Rosenblatt (1958), consists of the following simple rule: if (xt , yt ) is misclassified, then vt+1 = vt + yt xt . [sent-41, score-1.307]
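As a concrete illustration, Rosenblatt's rule above can be sketched as follows (a minimal sketch; the function name is ours, not the paper's):

```python
import numpy as np

def perceptron_update(v, x, y):
    # Rosenblatt's rule: change the hypothesis only on a mistake.
    if y * np.dot(v, x) <= 0:      # (x, y) is misclassified by v
        v = v + y * x              # v_{t+1} = v_t + y_t x_t
    return v

# A point on the wrong side of v pulls the hypothesis toward itself:
x, y = np.array([0.0, 1.0]), 1
v = perceptron_update(np.array([1.0, 0.0]), x, y)
```

After the update, y(v · x) > 0 for this point, so it is now classified correctly.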

8 It turns out that this update cannot yield an error rate better than Ω(1/√lt ), where lt is the number of labels queried up to time t, no matter what filtering scheme is used. [sent-42, score-0.22]

9 Let kt be the number of updates performed up to time t, let vt be the hypothesis at time t, and let θt be the angle between u and vt . [sent-48, score-1.788]

10 When the points are distributed uniformly over the unit sphere, θt ≥ sin θt (for θt ≤ π/2) is proportional to the error rate of vt . [sent-51, score-1.115]

11 In this paper, the Õ notation is used to suppress multiplicative terms in log d, log log(1/ε) and log(1/δ). [sent-53, score-0.288]

12 As we will shortly see, the reason for this slow rate is that the magnitude of the perceptron update is too large for points near the decision boundary of the current hypothesis. [sent-57, score-0.235]

13 So instead we use a variant of the update rule, originally due to Motzkin and Schoenberg (1954): if (xt , yt ) is misclassified, then vt+1 = vt − 2(vt · xt )xt (where xt is assumed normalized to unit length). [sent-58, score-1.413]

14 Note that the update can also be written as vt+1 = vt + 2yt |vt · xt |xt , since updates are only made on mistakes, in which case yt = −SGN(vt · xt ), by definition. [sent-59, score-1.412]
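The equivalence of the two forms, and the norm preservation discussed below, can be checked numerically with a small sketch of our own (assuming unit-length v and x and a misclassified point):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5); v /= np.linalg.norm(v)
x = rng.normal(size=5); x /= np.linalg.norm(x)
y = -np.sign(np.dot(v, x))                  # on a mistake, y_t = -SGN(v_t . x_t)

reflected = v - 2.0 * np.dot(v, x) * x      # v_{t+1} = v_t - 2(v_t . x_t) x_t
additive = v + 2.0 * y * abs(np.dot(v, x)) * x

assert np.allclose(reflected, additive)            # the two forms agree
assert np.isclose(np.linalg.norm(reflected), 1.0)  # ||v_{t+1}|| stays 1
```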

15 Thus we are scaling the standard perceptron’s additive update by a factor of 2|vt · xt | to avoid oscillations caused by points close to the hyperplane represented by the current hypothesis. [sent-60, score-0.286]

16 Consider a stream of data points xt drawn uniformly at random from the surface of the unit sphere in Rd , and corresponding labels yt that are consistent with some linear separator. [sent-67, score-0.569]

17 Given the limited information the algorithm keeps, a natural filtering rule is to query points xt when |vt · xt | is less than some threshold st . [sent-74, score-0.608]

18 On the other hand, if st is too small, then the waiting time for a query might be prohibitive, and when an update is actually made, the magnitude of this update might be tiny. [sent-77, score-0.319]

19 Consider a stream of data points xt drawn uniformly at random from the surface of the unit sphere in Rd , and corresponding labels yt that are consistent with some linear separator. [sent-81, score-0.569]

20 With probability 1 − δ, if the active modified Perceptron algorithm (Figure 4) is given a stream of Õ((d/ε) log(1/ε)) such unlabeled points, it will request Õ(d log(1/ε)) labels, make Õ(d log(1/ε)) errors (on all points, labeled or not), and have final error ≤ ε. [sent-82, score-0.453]

21 Related Work Much of the early theory work on active learning was in the query learning model, in which the learner has the ability to synthesize arbitrary data points and request their labels. [sent-85, score-0.229]

22 Dasgupta, Hsu, and Monteleoni (2007) recently gave an active learning algorithm for general concept classes in the non-Bayesian, agnostic setting (a generalization of the original selective sampling algorithm of Cohn et al. [sent-119, score-0.229]

23 (2004) analyzed an algorithm which conforms to roughly the same template as ours but differs in both the update and filtering rule—it uses the regular perceptron update and it queries points xt according to a fixed, randomized rule which favors small |vt · xt |. [sent-123, score-0.724]

24 For instance, in our setting, in which data is distributed uniformly over the unit sphere, Dasgupta (2004) showed that if the target linear separator is allowed to be nonhomogeneous, then the number of labels required to reach error ε is Ω(1/ε), no matter what active learning scheme is used. [sent-128, score-0.311]

25 Preliminaries In our model, all data xt lie on the surface of the unit ball in Rd , which we denote by S: S = {x ∈ Rd : ‖x‖ = 1}. [sent-132, score-0.272]

26 Their labels yt are either −1 or +1, and the target function is a half-space u · x ≥ 0 represented by a unit vector u ∈ Rd which classifies all points perfectly, that is, yt (u · xt ) > 0 for all t, with probability one. [sent-133, score-0.433]

27 Our lower bound (Theorem 1) is distribution-free; thereafter we will assume that the data points xt are drawn independently from the uniform distribution over S. [sent-135, score-0.257]

28 Figure 1: The projection of the error region ξt onto the plane defined by u and vt . [sent-138, score-0.91]

29 For a hypothesis vt , we will denote the angle between u and vt by θt , and we will define the error region of vt as ξt = {x ∈ S : SGN(vt · x) ≠ SGN(u · x)}. [sent-139, score-2.608]

30 Figure 1 provides a schematic of the projection of the error region onto the plane defined by u and vt . [sent-140, score-0.91]

31 vt+1 = vt + yx On any update, vt+1 · u = vt · u + y(x · u). [sent-155, score-1.694]

32 (3) Thus, if we assume for simplicity that v0 · u ≥ 0 (we can always just start count when this first occurs) then vt · u ≥ 0 always, and θt , the angle between u and vt is always acute. [sent-156, score-1.694]

Since ‖u‖ = 1, the following holds: ‖vt ‖ cos θt = vt · u. [sent-157, score-1.743]

The update rule also implies ‖vt+1 ‖² = ‖vt ‖² + 1 + 2y(vt · x). [sent-158, score-0.935]

35 Proof Figure 1 shows the unit circle in the plane defined by u and vt . [sent-165, score-0.921]

36 The dot product of any point x ∈ Rd with either u or vt depends only upon the projection of x into this plane. [sent-166, score-0.847]

For such points, y(u · x) is at most sin θt (point (i)) and y(vt · x) is at least −‖vt ‖ sin θt (point (ii)). [sent-168, score-1.131]

Combining this with Equations (3) and (4), we get vt+1 · u ≤ vt · u + sin θt and ‖vt+1 ‖² ≥ ‖vt ‖² + 1 − 2‖vt ‖ sin θt . [sent-169, score-2.825]

We will suppose that θt+1 ≤ θt and sin θt ≤ 1/(5‖vt ‖), and derive a contradiction; thus θt+1 ≤ θt in fact forces sin θt ≥ 1/(5‖vt ‖). Now θt+1 ≤ θt implies cos² θt ≤ cos² θt+1 = (u · vt+1 )² / ‖vt+1 ‖² ≤ (u · vt + sin θt )² / (‖vt ‖² + 1 − 2‖vt ‖ sin θt ). The final denominator is positive since sin θt ≤ 1/(5‖vt ‖). [sent-171, score-3.814]

Rearranging, (‖vt ‖² + 1 − 2‖vt ‖ sin θt ) cos² θt ≤ (u · vt )² + sin² θt + 2(u · vt ) sin θt , and using ‖vt ‖ cos θt = (u · vt ): (1 − 2‖vt ‖ sin θt ) cos² θt ≤ sin² θt + 2‖vt ‖ sin θt cos θt . [sent-173, score-6.595]

For t = 1 to M: Let (xt , yt ) be the next example with yt (xt · vt ) < 0. [sent-176, score-0.902]

42 vt+1 = vt − 2(vt · xt )xt Figure 2: The (non-active) modified Perceptron algorithm. [sent-177, score-1.039]

43 The standard Perceptron update, vt+1 = vt + yt xt , is in the same direction (note yt = −SGN(vt · xt )) but different magnitude (scaled by a factor of 2|vt · xt |). [sent-178, score-1.533]

Again, since sin θt ≤ 1/(5‖vt ‖), it follows that (1 − 2‖vt ‖ sin θt ) ≥ 3/5 and that 2‖vt ‖ sin θt cos θt ≤ 2/5. [sent-179, score-2.169]

(e.g., Figure 1), when θt is tiny, the update will cause vt+1 to overshoot the mark and swing too far to the other side of u, unless ‖vt ‖ is very large: to be precise, we need ‖vt ‖ = Ω(1/ sin θt ). [sent-184, score-1.906]

If sin θt is proportional to the error of vt , as in the case of data distributed uniformly over the unit sphere, this means that the perceptron update cannot stably maintain an error rate ≤ ε until t = Ω(1/ε²). [sent-186, score-1.319]

47 Unlike the standard Perceptron, it ensures that vt · u is increasing, that is, the error of vt is monotonically decreasing. [sent-189, score-1.711]

Another difference from the standard update (and other versions) is that the magnitude of the current hypothesis, ‖vt ‖, is always 1, which is convenient for the analysis. [sent-190, score-0.917]

Note that ‖v1 ‖ = 1 and ‖vt+1 ‖² = ‖vt ‖² + 4(vt · xt )²‖xt ‖² − 4(vt · xt )² = 1 by induction. [sent-193, score-1.423]

50 In contrast, for the standard perceptron update, the magnitude of vt increases steadily. [sent-194, score-0.988]

With the modified update, the error can only decrease, because vt · u only increases: vt+1 · u = vt · u − 2(vt · xt )(xt · u) = vt · u + 2|vt · xt ||xt · u|. (5) [sent-195, score-2.942]

The second equality follows from the fact that vt misclassified xt . [sent-196, score-1.039]

53 Thus vt · u is increasing, and the increase can be bounded from below by showing that |vt · xt ||xt · u| is large. [sent-197, score-1.039]

54 These names are likely due to the following geometric property of this update: xt · vt+1 = xt · vt − 2(xt · vt )(xt · xt ) = −(xt · vt ). [sent-200, score-3.117]
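The reflection property is easy to confirm numerically (a small check of our own, assuming ‖x‖ = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=4)
x = rng.normal(size=4); x /= np.linalg.norm(x)

v_new = v - 2.0 * np.dot(v, x) * x
# The component of v along x flips sign; the orthogonal part is unchanged.
assert np.isclose(np.dot(x, v_new), -np.dot(x, v))
```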

55 In general, one can consider modified updates of the form vt+1 = vt − α(vt · xt )xt , which corresponds to the “Relaxation” method of solving linear inequalities (Agmon, 1954; Motzkin and Schoenberg, 1954). [sent-201, score-1.095]

When α ≠ 2, the vectors vt no longer remain of fixed length; however, one can verify that their corresponding unit vectors v̂t satisfy v̂t+1 · u = (v̂t · u + α|v̂t · xt ||xt · u|)/√(1 − α(2 − α)(v̂t · xt )²), and thus any choice of α ∈ [0, 2] guarantees non-increasing error. [sent-202, score-2.135]
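The recurrence for the unit vectors, and the claim that any α ∈ [0, 2] gives non-increasing error, can be verified empirically; the sketch below is ours (random data, arbitrary dimension and α grid):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
u = rng.normal(size=d); u /= np.linalg.norm(u)

for alpha in (0.5, 1.0, 1.5, 2.0):
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    if np.dot(v, u) < 0:
        v = -v                                  # start with an acute angle
    for _ in range(300):
        x = rng.normal(size=d); x /= np.linalg.norm(x)
        if np.sign(np.dot(v, x)) == np.sign(np.dot(u, x)):
            continue                            # updates happen only on mistakes
        vhat = v / np.linalg.norm(v)
        # Closed-form recurrence for the unit vector's agreement with u:
        pred = (np.dot(vhat, u) + alpha * abs(np.dot(vhat, x)) * abs(np.dot(x, u))) / \
               np.sqrt(1.0 - alpha * (2.0 - alpha) * np.dot(vhat, x) ** 2)
        v = v - alpha * np.dot(v, x) * x        # relaxation update
        after = np.dot(v / np.linalg.norm(v), u)
        assert np.isclose(after, pred)
        assert after >= np.dot(vhat, u) - 1e-12  # error never increases
```

Note that for α ≠ 2 the raw vector's length drifts, which is why the check is phrased in terms of the normalized hypothesis.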

Hampson and Kibler (1996) used α = 1 to guarantee progress in the denominator (their analysis did not rely on progress in the numerator) as long as vt · u and (vt · xt )² were bounded away from 0. [sent-204, score-1.039]

In our sequential framework, we can bound |vt · xt ||xt · u| away from 0 in expectation, under the uniform distribution, and hence the choice of α = 2 is most convenient, but α = 1 would work as well. [sent-206, score-0.233]

As we shall see, in d dimensions, one expects each of these terms to be on the order of d^(−1/2) sin θt , where sin θt = √(1 − (vt · u)²). [sent-210, score-0.284]

Lemma 6 For any vt , with probability at least 1/3, 1 − vt+1 · u ≤ (1 − vt · u)(1 − 1/(50d)); moreover, there exists a constant c > 0 such that, with probability at least 63/64, 1 − vt+1 · u ≤ (1 − vt · u)(1 − c/d). [sent-213, score-2.541]

61 We will argue that each of |vt · xt |,|u · xt | is “small” with probability at most 1/3. [sent-217, score-0.384]

62 The error rate of vt is θt /π, where cos θt = vt · u. [sent-219, score-1.76]
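The θt /π error rate is simple to confirm by Monte Carlo (our own check; the angle 0.7 and dimension 3 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.7
u = np.array([1.0, 0.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta), 0.0])   # angle(u, v) = theta

xs = rng.normal(size=(200000, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)     # uniform on the sphere
disagreement = np.mean(np.sign(xs @ u) != np.sign(xs @ v))
assert abs(disagreement - theta / np.pi) < 0.01     # error rate ~ theta/pi
```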

In this case, |vt · xt ||u · xt | ≥ (1 − (vt · u)²)/(100d), since θt² ≥ sin² θt = 1 − (vt · u)². Hence 1 − vt+1 · u ≤ 1 − vt · u − 2|vt · xt ||u · xt | ≤ 1 − vt · u − (1 − (vt · u)²)/(50d) = (1 − vt · u)(1 − (1 + vt · u)/(50d)), where the first inequality is by application of (5). [sent-226, score-3.772]

Theorem 7 With probability 1 − δ with respect to the uniform distribution on the unit sphere, in the supervised, realizable setting, after M = O(d(log(1/ε) + log(1/δ))) mistakes, the generalization error of the modified Perceptron algorithm is at most ε. [sent-228, score-0.227]

Proof By the above lemma, we can conclude that, for any vector vt , E[1 − vt+1 · u] ≤ (1 − vt · u)(1 − 1/(150d)), since the guaranteed-improvement event of Lemma 6 has probability at least 1/3 and the error never increases. [sent-229, score-1.694]

The additional factor of 1/ε in the bound on unlabeled samples (Õ((d/ε) log(1/ε))) follows by upper bounding the number of unlabeled samples until an update: when the hypothesis has error rate ε, the waiting time (in samples) until an update is 1/ε, in expectation. [sent-235, score-0.278]

Figure 3: The active learning rule is to query for labels on points x in L, which is defined by the threshold st on |vt · x|. [sent-236, score-0.394]

68 An Active Modified Perceptron The ideal objective in designing an active learning rule that minimizes label complexity would be to query for labels only on points in the error region, ξt . [sent-238, score-0.333]

69 The intuition behind our active learning rule is to approximate the error region, given the information the algorithm does have: vt . [sent-240, score-1.002]

70 As shown in Figure 3, the labeling region L is simply formed by thresholding the margin of a candidate example with respect to vt . [sent-241, score-0.876]

71 For its filtering rule, we maintain a threshold st and we only ask for labels of examples with |vt · x| ≤ st . [sent-244, score-0.316]

The idea of the analysis is as follows: Definition 7 We say the tth update is “good” if 1 − vt+1 · u ≤ (1 − vt · u)(1 − c/d). [sent-248, score-0.932]

73 (Lemma 8) First, we argue that st is not too small (we do not decrease st too quickly). [sent-251, score-0.226]

s1 = 1/√d. For t = 1 to L: Wait for the next example x with |x · vt | ≤ st and query its label. [sent-257, score-1.009]

If (xt · vt )yt < 0, then: vt+1 = vt − 2(vt · xt )xt and st+1 = st ; else: vt+1 = vt . If predictions were correct on R consecutive labeled examples (i.e., no update was made on any of them), then set st+1 = st /2. [sent-259, score-2.846]
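Putting the filtering rule, the modified update, and the threshold-halving step together, the loop can be sketched as follows (our own sketch; the label budget, R, stream cap, and initialization are illustrative knobs, not constants from the paper):

```python
import numpy as np

def active_modified_perceptron(points, u, label_budget=100, R=50, max_stream=200000):
    """Query only low-margin points, update on mistakes, and halve the
    threshold after R consecutive correct queries."""
    d = len(u)
    s = 1.0 / np.sqrt(d)                  # s_1 = 1/sqrt(d)
    v, run, queried, seen = None, 0, 0, 0
    for x in points:
        seen += 1
        if queried >= label_budget or seen > max_stream:
            break
        if v is None:                     # query the first point to initialize
            v = np.sign(np.dot(u, x)) * x
            queried += 1
            continue
        if abs(np.dot(x, v)) > s:
            continue                      # filtering rule: skip confident points
        y = np.sign(np.dot(u, x))         # query the label
        queried += 1
        if y * np.dot(x, v) < 0:          # mistake: modified Perceptron update
            v = v - 2.0 * np.dot(v, x) * x
            run = 0
        else:
            run += 1
            if run == R:                  # R correct in a row: halve threshold
                s /= 2.0
                run = 0
    return v

rng = np.random.default_rng(4)
d = 5
u = rng.normal(size=d); u /= np.linalg.norm(u)

def sphere_stream():
    while True:
        x = rng.normal(size=d)
        yield x / np.linalg.norm(x)

v = active_modified_perceptron(sphere_stream(), u)
```

Since v · u can only increase across updates, the returned hypothesis satisfies v · u ≥ 0 and ‖v‖ = 1 regardless of the random stream.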

76 We first lower-bound st with respect to our error, showing that, with high probability, the threshold st is never too small. [sent-271, score-0.246]

Lemma 8 With probability at least 1 − L(3/4)^R , we have st ≥ √((1 − (u · vt )²)/(16d)) for t = 1, 2, . . . [sent-272, score-0.96]

78 As before, let us define ξt = {x ∈ S|(x · vt )(x · u) < 0}. [sent-277, score-0.847]

Lemma 9 For any γ ∈ [0, √((1 − (u · vt )²)/(4d))], Pxt ∈S [ xt ∈ ξt | |xt · vt | < γ ] ≥ 1/4. [sent-278, score-1.039]

Proof Let x be a random example from S such that |x · vt | < γ and, without loss of generality, suppose that 0 ≤ x · vt ≤ γ. [sent-279, score-1.694]

We can decompose x = x′ + (x · vt )vt where x′ = x − (x · vt )vt is the component of x orthogonal to vt , that is, x′ · vt = 0. [sent-281, score-3.388]

82 Hence, u · x = (u′ + (u · vt )vt ) · (x′ + (x · vt )vt ) = u′ · x′ + (u · vt )(x · vt ). [sent-283, score-3.388]

83 In other words, we err iff u′ · x′ < −(u · vt )(x · vt ). [sent-284, score-1.724]

Using u · vt ∈ [0, 1] and since x · vt ∈ [0, √((1 − (u · vt )²)/(4d))], we conclude that if u′ · x′ < −√((1 − (u · vt )²)/(4d)) (7) then we must err. [sent-285, score-0.847]

Also, let x̂′ = x′/‖x′ ‖ be the unit vector in the direction of x′ ; it is easy to check that ‖x′ ‖ = √(1 − (x · vt )²). [sent-286, score-0.847] [sent-287, score-0.847]

Substituting these into (7), and writing û′ = u′/‖u′ ‖, we must err if û′ · x̂′ < −1/√(4d(1 − (x · vt )²)), and since 1 − (x · vt )² ≥ 1 − 1/(4d), it suffices to show that Px∈S [ û′ · x̂′ < −1/√(4d(1 − 1/(4d))) | 0 ≤ x · vt ≤ γ ] ≥ 1/4. [sent-289, score-0.847]

Well, one way to pick x ∈ S would be to first pick x · vt and then to pick x̂′ uniformly at random from the set S′ = {x̂′ ∈ S | x̂′ · vt = 0}, which is a unit sphere in one fewer dimensions. [sent-291, score-1.947]
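This decomposition is also how one samples such points in practice; a small NumPy sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
v = np.zeros(d); v[0] = 1.0

x = rng.normal(size=d); x /= np.linalg.norm(x)   # uniform draw from S

xp = x - np.dot(x, v) * v          # x' = x - (x.v)v, orthogonal to v
assert abs(np.dot(xp, v)) < 1e-12
# ||x'|| = sqrt(1 - (x.v)^2), as used in the proof of Lemma 9:
assert np.isclose(np.linalg.norm(xp), np.sqrt(1.0 - np.dot(x, v) ** 2))
xhat = xp / np.linalg.norm(xp)     # uniform on S' = {x' in S : x'.v = 0}
```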

Moreover, since t is the smallest such number, and u · vt is increasing, it must be the case that st = st−1 /2; that is, we just saw a run of R labeled examples (xi , yi ), for i = t − R, . . . , t − 1, with no mistakes, vi = vt , and si = 2st < √((1 − (u · vt )²)/(4d)) = √((1 − (u · vi )²)/(4d)). [sent-298, score-0.96] [sent-301, score-1.694]

In particular, Lemma 10 Given that st ≥ √((1 − (u · vt )²)/(16d)), upon the tth update, each erroneous example is queried with probability at least 1/32, that is, Px∈S [ |x · vt | ≤ st | x ∈ ξt ] ≥ 1/32. [sent-306, score-1.984]

Proof Using Lemmas 9 and 4, we have Px∈S [x ∈ ξt ∧ |x · vt | ≤ st ] ≥ Px∈S [x ∈ ξt ∧ |x · vt | ≤ √((1 − (u · vt )²)/(16d))] ≥ (1/4) Px∈S [|x · vt | ≤ √((1 − (u · vt )²)/(16d))] ≥ (1/64)√(1 − (u · vt )²) = (1/64) sin θt ≥ θt /(32π). [sent-307, score-5.337]

However, Px∈S [x ∈ ξt ] = θt /π, so we are querying an error x ∈ ξt with probability at least 1/32; that is, the above inequality implies Px∈S [ |x · vt | ≤ st | x ∈ ξt ] = Px∈S [x ∈ ξt ∧ |x · vt | ≤ st ] / Px∈S [x ∈ ξt ] ≥ (θt /(32π)) / (θt /π) = 1/32. [sent-309, score-1.963]

Lemma 11 Assuming that st ≥ √((1 − (u · vt )²)/(16d)), a random update is good with probability at least 1/2, that is, Pxt ∈S [ (1 − vt+1 · u) ≤ (1 − vt · u)(1 − c/d) | |xt · vt | ≤ st ∧ xt ∈ ξt ] ≥ 1/2. [sent-311, score-3.029]

We know, by Lemma 8, that with probability 1 − L(3/4)^R , st ≥ sin θt /(4√d) ≥ θt /(2π√d) (9) for all t. [sent-319, score-0.255]

E[1 − u · vt+1 | vt ] ≤ (1 − u · vt )(1 − c/(2d)). Hence, after U updates, using Markov’s inequality, P[ 1 − u · vL ≥ (4/δ)(1 − c/(2d))^U ] ≤ δ/4 + L(3/4)^R . [sent-322, score-0.847]

We choose R = Θ(log(2L/δ)) and U = 10 d log(1/(εδ)) = O(d log(1/(εδ))), so that θL /π ≤ ε, as desired. [sent-329, score-0.288]

98 Discussion and Conclusions In the evolving theory of active learning, the most concrete, nontrivial scenario in which active learning has been shown to give an exponential improvement in sample complexity is that of learning a linear separator for data distributed uniformly over the unit sphere. [sent-336, score-0.385]

…5 log(1/ε)) — Yes
QBC: Õ((d/ε) log(1/ε)) unlabeled samples, Õ(d log(1/ε)) labels, Õ(d log(1/ε)) errors — No
BBZ’07: Õ((d/ε) log(1/ε)) unlabeled samples, Õ(d log(1/ε)) labels, Õ(d log(1/ε)) errors — Yes
Our algorithm: Õ((d/ε) log(1/ε)) unlabeled samples, Õ(d log(1/ε)) labels, Õ(d log(1/ε)) errors — Unknown
Table 1: Results in context, for learning half-spaces through the origin. [sent-351, score-0.72]

100 A bound on the label complexity of agnostic active learning. [sent-474, score-0.24]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vt', 0.847), ('xt', 0.192), ('sin', 0.142), ('perceptron', 0.141), ('dasgupta', 0.138), ('active', 0.12), ('st', 0.113), ('sphere', 0.105), ('px', 0.099), ('alai', 0.091), ('erceptron', 0.091), ('onteleoni', 0.091), ('separators', 0.077), ('log', 0.072), ('update', 0.07), ('mistakes', 0.062), ('realizable', 0.061), ('monteleoni', 0.061), ('unit', 0.057), ('updates', 0.056), ('yt', 0.055), ('balcan', 0.052), ('motzkin', 0.05), ('labels', 0.05), ('nalysis', 0.05), ('queried', 0.049), ('query', 0.049), ('selective', 0.049), ('cos', 0.049), ('agnostic', 0.044), ('rd', 0.043), ('qbc', 0.04), ('separator', 0.039), ('sgn', 0.039), ('stream', 0.035), ('label', 0.034), ('ari', 0.031), ('schoenberg', 0.031), ('err', 0.03), ('hampson', 0.03), ('unlabeled', 0.03), ('region', 0.029), ('uniformly', 0.028), ('lemma', 0.026), ('querying', 0.026), ('points', 0.024), ('ad', 0.024), ('ltering', 0.023), ('earning', 0.023), ('cohn', 0.023), ('seung', 0.023), ('surface', 0.023), ('committee', 0.021), ('misclassi', 0.021), ('sl', 0.021), ('complexity', 0.021), ('hypothesis', 0.021), ('bound', 0.021), ('pick', 0.021), ('exion', 0.02), ('kibler', 0.02), ('pxt', 0.02), ('tauman', 0.02), ('ask', 0.02), ('uniform', 0.02), ('request', 0.02), ('threshold', 0.02), ('supervised', 0.019), ('modi', 0.019), ('distributional', 0.018), ('freund', 0.018), ('homogeneous', 0.018), ('rule', 0.018), ('kt', 0.017), ('adam', 0.017), ('lt', 0.017), ('plane', 0.017), ('error', 0.017), ('atlas', 0.017), ('canadian', 0.017), ('claire', 0.017), ('gale', 0.017), ('inen', 0.017), ('sanjoy', 0.017), ('technological', 0.017), ('toyota', 0.017), ('waiting', 0.017), ('batch', 0.017), ('vm', 0.017), ('adaptively', 0.017), ('analyzed', 0.017), ('sampling', 0.016), ('blum', 0.016), ('learner', 0.016), ('beygelzimer', 0.015), ('tth', 0.015), ('vl', 0.015), ('bounds', 0.015), ('errors', 0.015), ('lewis', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

2 0.33519277 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

3 0.18371046 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

4 0.14164476 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon

Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness

5 0.12397009 13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set

6 0.12235885 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

7 0.11900026 90 jmlr-2009-Structure Spaces

8 0.088987313 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

9 0.073492594 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

10 0.071941338 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

11 0.05796656 50 jmlr-2009-Learning When Concepts Abound

12 0.043201879 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

13 0.041483905 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

14 0.040871557 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

15 0.039623663 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

16 0.036185496 49 jmlr-2009-Learning Permutations with Exponential Weights

17 0.034560248 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

18 0.032638334 46 jmlr-2009-Learning Halfspaces with Malicious Noise

19 0.03241219 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

20 0.031372346 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.272), (1, 0.285), (2, -0.181), (3, 0.081), (4, 0.05), (5, -0.167), (6, -0.004), (7, 0.029), (8, -0.157), (9, -0.198), (10, 0.228), (11, 0.102), (12, 0.024), (13, 0.03), (14, -0.086), (15, -0.144), (16, 0.288), (17, 0.069), (18, -0.025), (19, -0.017), (20, 0.197), (21, -0.027), (22, 0.164), (23, -0.076), (24, 0.093), (25, 0.081), (26, 0.014), (27, -0.093), (28, -0.049), (29, -0.003), (30, -0.03), (31, 0.107), (32, -0.02), (33, -0.017), (34, -0.091), (35, -0.074), (36, -0.053), (37, 0.132), (38, -0.034), (39, 0.106), (40, 0.044), (41, -0.037), (42, 0.077), (43, -0.018), (44, -0.053), (45, 0.178), (46, -0.026), (47, 0.011), (48, 0.075), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98774731 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

2 0.70879638 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

3 0.51997036 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

4 0.46587676 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon

Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
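One of the divergences the abstract refers to can be written down directly; a minimal sketch of the Burg (LogDet) matrix divergence between positive definite matrices, which is finite only when ranges are compatible and is one reason Bregman updates with it preserve rank and positive semidefiniteness:

```python
import numpy as np

def logdet_divergence(X, Y):
    """Burg (LogDet) matrix divergence for positive definite X, Y:
        D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n.
    It is the Bregman divergence generated by phi(X) = -log det(X)."""
    M = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(M)      # stable log-determinant
    return np.trace(M) - logdet - X.shape[0]

X = np.array([[2.0, 0.3], [0.3, 1.0]])
Y = np.eye(2)
```

As with any Bregman divergence, D(X, X) = 0 and D(X, Y) > 0 for X ≠ Y, though the divergence is not symmetric in its arguments.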

5 0.32781544 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

Author: Barnabás Póczos, András Lőrincz

Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors, our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, D-optimality, infomax control, optimal design
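The D-optimality principle the abstract invokes is easy to illustrate in the simplest linear-Gaussian case (a hypothetical simplification of the paper's RNN setting; `d_optimal_stimulus` is an illustrative name): observing a stimulus x adds x xᵀ to the posterior precision, so the D-optimal choice maximizes the resulting log-determinant.

```python
import numpy as np

def d_optimal_stimulus(candidates, precision):
    """Pick the candidate stimulus x maximizing
        log det(precision + x x^T),
    i.e., the one that most increases posterior information (D-optimality)
    in a linear-Gaussian model."""
    scores = [np.linalg.slogdet(precision + np.outer(x, x))[1]
              for x in candidates]
    return int(np.argmax(scores))

# Posterior is already precise along the first axis, uncertain along the
# second; probing the uncertain direction is D-optimal.
precision = np.diag([10.0, 0.1])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
choice = d_optimal_stimulus(candidates, precision)
```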

6 0.31923643 13 jmlr-2009-Bounded Kernel-Based Online Learning

7 0.27526304 90 jmlr-2009-Structure Spaces

8 0.22669865 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

9 0.21720622 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

10 0.20286801 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

11 0.18951938 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

12 0.18385133 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

13 0.15900724 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

14 0.14908907 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

15 0.14479284 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

16 0.14066193 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages (Special Topic on Mining and Learning with Graphs and Relations)

17 0.13926835 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

18 0.1383736 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

19 0.13374999 46 jmlr-2009-Learning Halfspaces with Malicious Noise

20 0.12933759 50 jmlr-2009-Learning When Concepts Abound


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.306), (8, 0.016), (11, 0.051), (19, 0.029), (38, 0.034), (47, 0.01), (52, 0.052), (55, 0.093), (58, 0.033), (66, 0.124), (68, 0.047), (90, 0.046), (93, 0.015), (96, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72560138 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
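The two ingredients the abstract names can be sketched together (an illustrative simplification in the spirit of the paper, not its exact algorithm; the threshold schedule and `patience` parameter here are assumptions): query a label only when the margin |w · x| falls below a threshold s, and shrink s when queried points stop producing updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_perceptron(stream, w0, s0=1.0, patience=3):
    """Margin-based selective sampling sketch. Mistakes on queried points
    trigger the modified update w <- w - 2(w . x)x, which for unit-norm
    inputs keeps ||w|| = 1; `patience` consecutive update-free queries
    halve the query threshold (adaptive filtering)."""
    w = w0 / np.linalg.norm(w0)
    s, quiet, queries = s0, 0, 0
    for x, label in stream:
        if abs(w @ x) >= s:
            continue                      # confident prediction: skip label
        queries += 1
        if label * (w @ x) <= 0:          # mistake on a queried point
            w = w - 2 * (w @ x) * x       # reflection-style update
            quiet = 0
        else:
            quiet += 1
            if quiet >= patience:
                s, quiet = s / 2, 0       # tighten the query region
    return w, queries

# Unit-sphere data in 2D labelled by a hidden direction u:
u = np.array([1.0, 0.0])
points = rng.normal(size=(2000, 2))
points /= np.linalg.norm(points, axis=1, keepdims=True)
stream = [(x, 1 if u @ x >= 0 else -1) for x in points]
w, queries = active_perceptron(stream, np.array([0.0, 1.0]))
```

The reflection update never decreases the alignment u · w on a mistake, since the mistake condition makes the correction term (u · x)(w · x) non-positive.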

2 0.48132887 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

Author: Junning Li, Z. Jane Wang

Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
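The FDR-control primitive behind such procedures can be sketched with the standard Benjamini–Hochberg step-up rule (how the paper interleaves FDR control with the PC algorithm's conditional-independence tests is not shown here):

```python
def benjamini_hochberg(pvals, q):
    """Step-up FDR procedure: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= k * q / m. Controls the FDR at level q
    for independent tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])

# Five tested edges, two with strong evidence: only those are claimed.
rejected = benjamini_hochberg([0.001, 0.008, 0.4, 0.6, 0.9], q=0.05)
```

Unlike a fixed per-test significance level, the rejection threshold here adapts to how many small p-values are observed, which is what keeps the expected ratio of false to claimed connections below q.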

3 0.47992325 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
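For the ℓ1 case, the two phases the abstract describes have a well-known closed form (a minimal sketch; `fobos_l1_step` is an illustrative name): a gradient step followed by coordinate-wise soft-thresholding, which is what produces exact zeros in the solution.

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One forward-backward splitting step with l1 regularization.
    Forward phase: unconstrained gradient step v = w - eta * grad.
    Backward phase: argmin_u 0.5||u - v||^2 + eta*lam*||u||_1, whose
    closed form is coordinate-wise soft-thresholding of v."""
    v = w - eta * grad
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

w_new = fobos_l1_step(np.array([0.5, -0.2, 0.05]),
                      np.array([0.1, 0.0, 0.0]), eta=0.5, lam=0.2)
```

Note how the third coordinate, already small, is truncated exactly to zero by the backward phase rather than merely shrunk, which is why the method yields sparse iterates online.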

4 0.47811872 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

5 0.47586215 13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
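The projection test at the heart of this idea reduces to simple linear algebra (a sketch of the primitive only; the full Projectron update and its budget parameter are not reproduced here):

```python
import numpy as np

def projection_residual(K, k_t, k_tt):
    """Squared RKHS distance from a new instance x_t to the span of the
    stored support set:
        delta^2 = k(x_t, x_t) - k_t^T K^{-1} k_t,
    where K is the kernel matrix of the stored instances and k_t their
    kernel values against x_t. A small delta means the update can be
    folded into the existing coefficients via d = K^{-1} k_t instead of
    growing the support set."""
    d = np.linalg.solve(K, k_t)
    return k_tt - k_t @ d, d

# Linear kernel, support set {e1, e2} in R^3:
S = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
K = S @ S.T
x_in = np.array([0.6, 0.8, 0.0])        # lies in span(S): zero residual
x_out = np.array([0.0, 0.0, 1.0])       # orthogonal to span(S)
r_in, _ = projection_residual(K, S @ x_in, x_in @ x_in)
r_out, _ = projection_residual(K, S @ x_out, x_out @ x_out)
```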

6 0.47191438 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

7 0.47025925 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

8 0.46333534 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

9 0.46284258 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

10 0.46202999 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

11 0.46154037 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

12 0.46072248 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

13 0.46023402 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

14 0.45929548 38 jmlr-2009-Hash Kernels for Structured Data

15 0.45858207 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

16 0.45585698 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

17 0.45563504 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

18 0.4539842 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

19 0.45047322 18 jmlr-2009-Consistency and Localizability

20 0.45047256 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence