nips nips2011 nips2011-162 knowledge-graph by maker-knowledge-mining

162 nips-2011-Lower Bounds for Passive and Active Learning


Source: pdf

Author: Maxim Raginsky, Alexander Rakhlin

Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” [sent-3, score-0.225]

2 For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. [sent-4, score-0.546]

3 For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient. [sent-5, score-0.775]

4 This was observed by Massart and Nédélec [24], who showed that, when it comes to binary classification rates on a sample of size n under a margin condition, some classes admit rates of the order 1/n while others only (log n)/n. [sent-7, score-0.233]

5 As noted by Giné and Koltchinskii [15], the fine complexity notion that defines this “richness” is in fact embodied in Alexander’s capacity function. [sent-9, score-0.228]

6 Somewhat surprisingly, the supremum of this function (called the disagreement coefficient by Hanneke [19]) plays a key role in risk bounds for active learning. [sent-10, score-0.562]

7 First, we prove lower bounds for passive learning based on Alexander’s capacity function, matching the upper bounds of [15] up to constants. [sent-12, score-0.774]

8 Second, we prove lower bounds for the number of label requests in active learning in terms of the capacity function. [sent-13, score-0.661]

9 Our proof techniques are information-theoretic in nature and provide a unified tool to study active and passive learning within the same framework. [sent-14, score-0.48]

10 In this framework, the learner is passive and has no control over how this sample is chosen. [sent-23, score-0.337]

11 The classical setting is well studied, and the following question has recently received attention: do we gain anything if data are obtained sequentially, and the learner is allowed to modify the design distribution Π of the predictor variable before receiving the next pair (Xi , Yi )? [sent-24, score-0.104]

12 That is, can the learner actively use the information obtained so far to facilitate faster learning? [sent-25, score-0.079]

13 Two paradigms often appear in the literature: (i) the design distribution is a Dirac delta function at some xi that depends on (xi−1 , Y i−1 ), or (ii) the design distribution is a restriction of the original distribution to some measurable set. [sent-26, score-0.211]

14 To be precise, the capacity function depends on the underlying probability distribution. [sent-31, score-0.262]

15 The setting (ii) is often called selective sampling [9, 13, 8], although the term active learning is also used. [sent-33, score-0.246]

16 In recent years, several interesting algorithms for active learning and selective sampling have appeared in the literature, most notably: the A2 algorithm of Balcan et al. [sent-39, score-0.246]

17 [11], which maintains the set Di implicitly through synthetic and real examples; and the importance-weighted active learning algorithm of Beygelzimer et al. [sent-41, score-0.194]

18 An insightful analysis has been carried out by Hanneke [20, 19], who distilled the role of the so-called disagreement coefficient in governing the performance of several of these active learning algorithms. [sent-43, score-0.377]

19 Finally, Koltchinskii [23] analyzed active learning procedures using localized Rademacher complexities and Alexander’s capacity function, which we discuss next. [sent-44, score-0.461]

20 We define the excess risk of a classifier f by E_P(f) ≜ R_P(f) − R_P(f∗), so that E_P(f) ≥ 0, with equality if and only if f = f∗ Π-a.e. [sent-52, score-0.107]

21 Alexander’s capacity function [15] is defined as τ(ε) ≜ Π(Dε(f∗))/ε, (1) that is, τ(ε) measures the relative size (in terms of Π) of the disagreement region Dε(f∗) compared to ε. [sent-57, score-0.411]
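As an illustration (ours, not from the paper): for threshold classifiers f_t(x) = 1{x ≥ t} on [0, 1] with Π uniform and Bayes threshold t∗ = 1/2, a hypothesis f_t lies in the radius-ε ball around f∗ iff |t − 1/2| ≤ ε, and the disagreement region of that ball is the interval [1/2 − ε, 1/2 + ε] of Π-mass 2ε, so τ(ε) ≈ 2 for every ε. A Monte Carlo sketch (the helper name is ours):

```python
import random

def tau_threshold(eps, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Alexander's capacity tau(eps) = Pi(D_eps(f*))/eps
    for threshold classifiers on [0, 1] with Pi uniform and t* = 1/2.

    A point x is in the disagreement region of the radius-eps ball iff some
    threshold t with |t - 1/2| <= eps classifies x differently from t* = 1/2,
    i.e. iff |x - 1/2| <= eps.  Hence tau(eps) should come out close to 2.
    """
    rng = random.Random(seed)
    hits = sum(abs(rng.random() - 0.5) <= eps for _ in range(n_samples))
    return (hits / n_samples) / eps

print(tau_threshold(0.1))  # close to 2.0
```

The supremum of τ over ε is then the disagreement coefficient, which for this class is the constant 2.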

22 The function τ was originally introduced by Alexander [1, 2] in the context of exponential inequalities for empirical processes indexed by VC classes of functions, and Giné and Koltchinskii [15] generalized Alexander’s results. [sent-59, score-0.086]

23 The upper bound (2) suggests the importance of Alexander’s capacity function for passive learning, leaving open the question of necessity. [sent-62, score-0.563]

24 Our first contribution is a lower bound which matches the upper bound (2) up to constant, showing that, in fact, dependence on the capacity is unavoidable. [sent-63, score-0.406]

25 Recently, Koltchinskii [23] made an important connection between Hanneke’s disagreement coefficient and Alexander’s capacity function. [sent-64, score-0.411]

26 Similar bounds based on the disagreement coefficient have appeared in [19, 20, 11]. [sent-66, score-0.273]

27 The second contribution of this paper is a lower bound on the expected number of queries based on Alexander’s capacity τ (ε). [sent-67, score-0.392]

28 For passive learning, Massart and Nédélec [24] proved two lower bounds which, in fact, correspond to τ(ε) = 1/ε and τ(ε) = τ0, the two endpoints on the complexity scale for the capacity function. [sent-69, score-0.829]

29 Without the capacity function at hand, the authors emphasize that “rich” VC classes yield a larger lower bound. [sent-70, score-0.339]

30 In the PAC framework, the lower bound Ω(d/ε + (1/ε) log(1/δ)) goes back to [12]. [sent-72, score-0.129]

31 It follows from our results that in the noisy version of the problem (h = 1), the lower bound is in fact Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)) for classes with τ (ε) = Ω(1/ε). [sent-73, score-0.16]

32 For active learning, Castro and Nowak [7] derived lower bounds, but without the disagreement coefficient and under a Tsybakov-type noise condition. [sent-74, score-0.482]

33 Hanneke [19] proved a lower bound on the number of label requests specifically for the A2 algorithm in terms of the disagreement coefficient. [sent-76, score-0.422]

34 In contrast, lower bounds of Theorem 2 are valid for any algorithm and are in terms of Alexander’s capacity function. [sent-77, score-0.398]

35 Finally, a result by Kääriäinen [22] (strengthened by [5]) gives a lower bound of Ω(ν²/ε²), where ν = inf_{f∈F} E_P(f). [sent-78, score-0.129]

36 A closer look at the construction of the lower bound reveals that it is achieved by considering a specific margin h = ε/ν. [sent-79, score-0.177]

37 This point of view is put forth by Massart and Nédélec [24, p. [sent-81, score-0.104]

38 We now introduce Alexander’s capacity function (1) into the picture. [sent-89, score-0.228]

39 We also denote by T the set of all admissible capacity functions τ : (0, 1] → R+ , i. [sent-91, score-0.262]

40 An n-step learning scheme S consists of the following objects: n conditional probability distributions Π^(t)_{X_t|X^{t−1},Y^{t−1}}, t = 1, . [sent-101, score-0.111]

41 After the n samples {(X_t, Y_t)}_{t=1}^n are collected, the learner computes the candidate classifier fn = ψ(X^n, Y^n). [sent-110, score-0.163]

42 Specifically, given some P = Π ⊗ P_{Y|X} ∈ P(Π, F, h), define the following probability measure on X^n × {0, 1}^n: P_S(x^n, y^n) = ∏_{t=1}^n P_{Y|X}(y_t|x_t) Π^(t)_{X_t|X^{t−1},Y^{t−1}}(x_t|x^{t−1}, y^{t−1}). [sent-112, score-0.086]

43 With these preliminaries out of the way, we can state the main results of this paper: Theorem 1 (Lower bounds for passive learning). [sent-120, score-0.376]

44 Given any τ ∈ T , any sufficiently large d ∈ N and any ε ∈ (0, 1], there exist a probability measure Π ∈ P(X ) and a VC class F with VC-dim(F) = d with the following properties: (1) Fix any K > 1 and δ ∈ (0, 1/2). [sent-121, score-0.086]

45 If there exists an n-step passive learning scheme that (ε/2, δ)-learns P(Π, F, h, τ, ε) for some h ∈ (0, 1 − K⁻¹], then n = Ω( ((1 − δ) d log τ(ε)) / (Kεh²) + log(1/δ) / (Kεh²) ). [sent-122, score-0.658]

46 (5) (2) If there exists an n-step passive learning scheme that (ε/2, δ)-learns P(Π, F, 1, τ, ε), then n = Ω((1 − δ)d/ε). [sent-123, score-0.342]

47 (6) Theorem 2 (Lower bounds for active learning). [sent-124, score-0.284]

48 Given any τ ∈ T , any sufficiently large d ∈ N and any ε ∈ (0, 1], there exist a probability measure Π ∈ P(X ) and a VC class F with VC-dim(F) = d with the following property: Fix any K > 1 and any δ ∈ (0, 1/2). [sent-125, score-0.086]

49 If there exists an n-step active learning scheme that (ε/2, δ)-learns P(Π, F, h, τ, ε) for some h ∈ (0, 1 − K⁻¹], then n = Ω( ((1 − δ) d log τ(ε)) / (Kh²) + (τ(ε)/(Kh²)) log(1/δ) ). [sent-126, score-0.566]

50 The lower bound in (6) is well-known and goes back to [12]. [sent-128, score-0.129]

51 In fact, there is a smooth transition between (5) and (6), with the extra log τ (ε) factor disappearing as h approaches 1. [sent-130, score-0.158]

52 As for the active learning lower bound, we conjecture that d log τ (ε) is, in fact, optimal, and the extra factor of τ0 in dτ0 log τ0 log(1/ε) in (3) arises from the use of a passive learning algorithm as a black box. [sent-131, score-0.876]

53 3 Information-theoretic framework Let P and Q be two probability distributions on a common measurable space W. [sent-134, score-0.108]

54 (9) Two particular choices of φ are of interest: φ(u) = u log u, which gives the ordinary Kullback–Leibler (KL) divergence D(P‖Q), and φ(u) = − log u, which gives the reverse KL divergence D(Q‖P), which we will denote by Dre(P‖Q). [sent-140, score-0.464]
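These two choices are easy to check numerically. A minimal sketch (helper names are ours) that evaluates the φ-divergence on a toy discrete pair and confirms that the reverse KL divergence Dre(P‖Q) coincides with D(Q‖P):

```python
import math

def phi_divergence(p, q, phi):
    """D_phi(P||Q) = sum_w q(w) * phi(p(w)/q(w)), for strictly positive
    finite distributions p, q given as equal-length probability lists."""
    return sum(qw * phi(pw / qw) for pw, qw in zip(p, q))

def kl(p, q):          # phi(u) = u log u  ->  ordinary KL divergence D(P||Q)
    return phi_divergence(p, q, lambda u: u * math.log(u))

def reverse_kl(p, q):  # phi(u) = -log u   ->  D_re(P||Q) = D(Q||P)
    return phi_divergence(p, q, lambda u: -math.log(u))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(kl(p, q), reverse_kl(p, q), kl(q, p))
```

The last two printed values agree, since summing q·(−log(p/q)) is exactly summing q·log(q/p).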

55 , fN} ⊂ F and assume that to each m ∈ [N] we can associate a probability measure P^m = Π ⊗ P^m_{Y|X} ∈ P(Π, F, h) with the Bayes classifier f∗_{P^m} = fm. [sent-149, score-0.236]

56 For each m ∈ [N], let us define the induced measure P_{S,m}(x^n, y^n) ≜ ∏_{t=1}^n P^m_{Y|X}(y_t|x_t) Π^(t)_{X_t|X^{t−1},Y^{t−1}}(x_t|x^{t−1}, y^{t−1}). (11) [sent-150, score-0.08]

57 Moreover, given any probability distribution π over [N], let P_{S,π}(m, x^n, y^n) ≜ π(m) P_{S,m}(x^n, y^n). [sent-151, score-0.186]

58 Now consider M̂ ≡ M̂(X^n, Y^n) ≜ arg min_{1≤m≤N} ‖fn − fm‖_{L1(Π)}. [sent-162, score-0.237]

59 (12) Then the following lemma is easily proved using the triangle inequality: Lemma 1. [sent-163, score-0.139]

60 The second ingredient of our approach is an application of the data processing inequality (10) with a judicious choice of φ. [sent-165, score-0.075]

61 Let W ≜ (M, X^n, Y^n), let M be uniformly distributed over [N], π(m) = 1/N for all m ∈ [N], and let P be the induced measure P_{S,π}. [sent-166, score-0.108]

62 Then we have the following lemma (see also [17, 16]): Lemma 2. [sent-167, score-0.098]

63 Consider any probability measure Q for W , under which M is distributed according to π and independent of (X n , Y n ). [sent-168, score-0.086]

64 On the other hand, since Q can be factored as Q(m, x^n, y^n) = (1/N) Q_{X^n,Y^n}(x^n, y^n), we have Q(Z = 1) = Σ_{m=1}^N Q(M = m, M̂ ≠ m) = (1/N) Σ_{m=1}^N Σ_{x^n,y^n} Q_{X^n,Y^n}(x^n, y^n) 1{M̂(x^n,y^n) ≠ m} = 1 − 1/N. [sent-174, score-0.194]

65 Inspection of the right-hand side of (13) suggests that the usual Ω(log N ) lower bounds [14, 27, 24] can be obtained if φ(u) behaves like u log u for large u. [sent-179, score-0.354]

66 On the other hand, if φ(u) behaves like − log u for small u, then the lower bounds will be of the form Ω(log(1/δ)). [sent-180, score-0.512]

67 These observations naturally lead to the respective choices φ(u) = u log u and φ(u) = − log u, corresponding to the KL divergence D(P‖Q) and the reverse KL divergence Dre(P‖Q) = D(Q‖P). [sent-181, score-0.464]

68 One obvious choice of Q satisfying the conditions of the lemma is the product of the marginals P_M ≡ π and P_{X^n,Y^n} ≡ N⁻¹ Σ_{m=1}^N P_{S,m}: Q = P_M ⊗ P_{X^n,Y^n}. [sent-183, score-0.098]

69 With this Q and φ(u) = u log u, the left-hand side of (13) is given by D(P‖Q) = D(P_{M,X^n,Y^n} ‖ P_M ⊗ P_{X^n,Y^n}) = I(M; X^n, Y^n), (14) where I(M; X^n, Y^n) is the mutual information between M and (X^n, Y^n) with joint distribution P. [sent-184, score-0.185]

70 On the other hand, it is not hard to show that the right-hand side of (13) can be lower-bounded by (1 − δ) log N − log 2. [sent-185, score-0.316]

71 Combining with (14), we get I(M ; X n , Y n ) ≥ (1 − δ) log N − log 2, which is (a commonly used variant of) the well-known Fano’s inequality [14, Lemma 4. [sent-186, score-0.361]
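This Fano-type inequality can be verified numerically on a toy channel (the channel and decoder below are our own illustration, not the paper's construction):

```python
import math

def fano_check(N=4, flip=0.2):
    """M uniform on {0,...,N-1}; Y equals M with prob 1 - flip, else is
    uniform over the other N - 1 symbols.  The MAP decoder is the identity
    map here, so its error probability is delta = flip.  We compare
    I(M; Y) against the Fano-type bound (1 - delta) log N - log 2 (nats)."""
    pm = 1.0 / N
    joint = [[pm * ((1 - flip) if y == m else flip / (N - 1))
              for y in range(N)] for m in range(N)]
    py = [sum(joint[m][y] for m in range(N)) for y in range(N)]
    mi = sum(joint[m][y] * math.log(joint[m][y] / (pm * py[y]))
             for m in range(N) for y in range(N))
    bound = (1 - flip) * math.log(N) - math.log(2)
    return mi, bound

mi, bound = fano_check()
print(mi, bound)  # mutual information dominates the bound
```

Any estimator achieving error probability δ must have mutual information at least the printed bound; the symmetric channel above satisfies the inequality with room to spare.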

72 Given a learning scheme S, define the probability measure Q_S(x^n, y^n) ≜ ∏_{t=1}^n Q_{Y|X}(y_t|x_t) Π^(t)_{X_t|X^{t−1},Y^{t−1}}(x_t|x^{t−1}, y^{t−1}), (15) and let Q(m, x^n, y^n) = (1/N) Q_S(x^n, y^n) for all m ∈ [N]. [sent-193, score-0.267]

73 For each x^n ∈ X^n and y ∈ X, let N(y|x^n) ≜ |{1 ≤ t ≤ n : x_t = y}|. [sent-195, score-0.192]

74 Then, for d sufficiently large, there exists a set M_{k,d} ⊂ {0, 1}^k with the following properties: (i) log |M_{k,d}| ≥ (d/4) log(k/(6d)); (ii) dH(β, β′) > d/2 for any two distinct β, β′ ∈ M_{k,d}; (iii) for any j ∈ [k], d/(2k) ≤ (1/|M_{k,d}|) Σ_{β∈M_{k,d}} β_j ≤ 3d/(2k). (19) Proof of Theorem 1. [sent-210, score-0.316]
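A set with this flavor of properties can be exhibited greedily for small parameters (a Gilbert–Varshamov-style sketch of our own; it illustrates the object, not the lemma's actual construction or constants):

```python
import itertools

def hamming(a, b):
    """Hamming distance between two equal-length binary tuples."""
    return sum(x != y for x, y in zip(a, b))

def greedy_packing(k, d, min_dist):
    """Greedily collect weight-d binary strings in {0,1}^k whose pairwise
    Hamming distance exceeds min_dist.  Purely illustrative: the lemma
    guarantees a packing of a specific size; this pass just exhibits one
    feasible packing for small k and d."""
    chosen = []
    for support in itertools.combinations(range(k), d):
        beta = tuple(int(j in support) for j in range(k))
        if all(hamming(beta, c) > min_dist for c in chosen):
            chosen.append(beta)
    return chosen

M = greedy_packing(k=10, d=4, min_dist=2)
print(len(M), M[0])
```

Every member has weight d and the pairwise-distance constraint holds by construction.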

75 Let k = dτ (ε) (we increase ε if necessary to ensure that k ∈ N), and consider the probability measure Π that puts mass ε/d on each x = 1 through x = k and the remaining mass 1 − ετ (ε) on x = k + 1. [sent-212, score-0.086]

76 For p ∈ [0, 1], let νp denote the probability distribution of a Bernoulli(p) random variable. [sent-218, score-0.089]

77 Now, to each fβ ∈ F let us associate the following conditional probability measure P^β_{Y|X}: P^β_{Y|X}(y|x) = [ν_{(1+h)/2}(y) β_x + ν_{(1−h)/2}(y)(1 − β_x)] 1{x∈[k]} + 1{y=0} 1{x∉[k]}. It is easy to see that each P^β_{Y|X} belongs to C(F, h). [sent-219, score-0.169]
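A hedged sketch of this construction (symbols as we read them from the garbled display, so treat details as a reconstruction): each β ∈ {0,1}^k sets the Bernoulli mean of Y to (1+h)/2 or (1−h)/2 on the atoms x ∈ [k] and forces Y = 0 elsewhere, so f_β is the Bayes classifier and the margin |2η(x) − 1| equals h on [k]:

```python
def p_y_given_x(y, x, beta, h, k):
    """Reconstructed conditional P^beta_{Y|X}(y|x): on atoms x in {1..k},
    Y is Bernoulli((1+h)/2) when beta_x = 1 and Bernoulli((1-h)/2) when
    beta_x = 0; off [k], Y = 0 deterministically."""
    if 1 <= x <= k:
        p1 = (1 + h) / 2 if beta[x - 1] else (1 - h) / 2
        return p1 if y == 1 else 1.0 - p1
    return 1.0 if y == 0 else 0.0

def bayes_classifier(x, beta, h, k):
    """Predict 1 iff eta(x) = P(Y = 1 | x) > 1/2."""
    return int(p_y_given_x(1, x, beta, h, k) > 0.5)

beta, h, k = (1, 0, 1, 1), 0.3, 4
margins = [abs(2 * p_y_given_x(1, x, beta, h, k) - 1) for x in range(1, k + 1)]
print(margins)  # each margin equals h on the atoms of [k]
```

On each atom the Bayes classifier recovers β_x, which is the sense in which f_β is the Bayes classifier for P^β.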

78 We have thus established that, for each β ∈ {0, 1}^k, the probability measure P^β = Π ⊗ P^β_{Y|X} is an element of P(Π, F, h, τ, ε). [sent-223, score-0.086]

79 For each m ∈ [N], let us denote by P^m_{Y|X} the conditional probability measure P^{β(m)}_{Y|X}, by P^m the measure Π ⊗ P^m_{Y|X} on X × {0, 1}, and by fm ∈ G the corresponding Bayes classifier. [sent-232, score-0.321]

80 Now consider any n-step passive learning scheme that (ε/2, δ)-learns P(Π, F, h, τ, ε), and define the probability measure P on [N] × X^n × {0, 1}^n by P(m, x^n, y^n) = (1/N) P_{S,m}(x^n, y^n), where P_{S,m} is constructed according to (11). [sent-233, score-0.525]

81 In addition, for every γ ∈ (0, 1) define the auxiliary measure Qγ on [N] × X^n × {0, 1}^n by Qγ(m, x^n, y^n) = (1/N) Q^S_γ(x^n, y^n), where Q^S_γ is constructed according to (15) with Q_{γ,Y|X}(y|x) ≜ νγ(y) 1{x∈[k]} + 1{y=0} 1{x∉[k]}. [sent-234, score-0.182]

82 Applying Lemma 2 with φ(u) = u log u, we can write D(P‖Qγ) ≥ (1 − δ) log N − log 2 ≥ ((1 − δ)d/4) log(k/(6d)) − log 2. (20) Next we apply Lemma 3. [sent-235, score-0.474]

83 Defining η ≜ (1 + h)/2 and using the easily proved fact that D(P^m_{Y|X}(·|x) ‖ Q_{γ,Y|X}(·|x)) = [d(η‖γ) − d(1 − η‖γ)] fm(x) + d(1 − η‖γ) 1{x∈[k]}, we get D(P‖Qγ) = nε [d(η‖γ) + (τ(ε) − 1) d(1 − η‖γ)]. (21) [sent-236, score-0.482]

84 Combining (20) and (21) and using the fact that k = dτ(ε), we obtain n ≥ ((1 − δ)d log(τ(ε)/6) − log 16) / (4ε [d(η‖γ) + (τ(ε) − 1) d(1 − η‖γ)]) for all γ ∈ (0, 1). (22) This bound is valid for all h ∈ (0, 1], and the optimal choice of γ for a given h can be calculated in closed form: γ∗(h) = (1 − h)/2 + h/τ(ε). [sent-238, score-0.365]
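To get a feel for the bound, one can evaluate the right-hand side of (22) at the closed-form optimizer γ∗(h). The sketch below follows our reconstruction of the constants (log(τ(ε)/6), log 16, the factor 4ε), so treat the exact numbers as indicative only:

```python
import math

def d_bin(a, b):
    """Binary KL divergence d(a||b) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def passive_bound(tau_eps, d, eps, h, delta):
    """Right-hand side of (22), as reconstructed above, evaluated at
    gamma*(h) = (1 - h)/2 + h/tau(eps).  Constants are a reconstruction
    from a garbled extraction; treat the value as indicative."""
    eta = (1 + h) / 2
    gamma = (1 - h) / 2 + h / tau_eps
    num = (1 - delta) * d * math.log(tau_eps / 6) - math.log(16)
    den = 4 * eps * (d_bin(eta, gamma) + (tau_eps - 1) * d_bin(1 - eta, gamma))
    return num / den

print(passive_bound(tau_eps=100, d=50, eps=0.01, h=0.1, delta=0.05))
```

The d log τ(ε) numerator is the source of the extra logarithmic factor relative to the capacity-free bound (6).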

85 Lemma 2 gives Dre(P‖Q_{1−η}) ≥ (1/2) log(1/δ) − log 2. [sent-241, score-0.158]

86 Using (18), we can write Dre(P‖Q_{1−η}) = nε · d(η‖1−η) = nε · h log((1+h)/(1−h)). (24) [sent-243, score-0.158]

87 We conclude that n ≥ ((1/2) log(1/δ) − log 2) / (εh log((1+h)/(1−h))). [sent-244, score-0.474]

88 (1) For a fixed K > 1, it follows from the inequality log u ≤ u − 1 that h log((1+h)/(1−h)) ≤ Kh² for all h ∈ (0, 1 − K⁻¹]. [sent-247, score-0.361]

89 Then, by convexity, D(P‖Q) ≤ (1/N²) Σ_{m,m′=1}^N Σ_{t=1}^n log E_P[ P^m_{Y|X}(Y_t|X_t) / P^{m′}_{Y|X}(Y_t|X_t) ], which is upper bounded by nh log((1+h)/(1−h)). [sent-256, score-0.37]

90 We obtain n ≥ ((1 − δ)d log(k/(6d)) − log 16) / (4h log((1+h)/(1−h))), with η = (1 + h)/2 as before. [sent-257, score-0.316]

91 Then Dre(P‖Q_{1−η}) can be expanded as (1/N) Σ_{m=1}^N Σ_{x=1}^k d(η‖1−η) fm(x) E_{Q_{1−η}}[N(x|X^n)] and bounded through a chain of steps (a)–(e) culminating in (26). [sent-258, score-0.125]

92 Applying Lemma 2 with φ(u) = − log u, we get n ≥ ((τ(ε)/8) log(1/δ) − log 4) / (3h log((1+h)/(1−h))). (27) Combining (26) and (27) and using the bound h log((1+h)/(1−h)) ≤ Kh² for h ∈ (0, 1 − K⁻¹], we get (7). [sent-260, score-0.839]

93 A general class of coefficients of divergence of one distribution from another. [sent-276, score-0.081]

94 Linear classification and selective sampling under low noise conditions. [sent-318, score-0.077]

95 A general lower bound on the number of examples needed for learning. [sent-344, score-0.129]

96 Improved lower bounds for learning from noisy examples: an informationtheoretic approach. [sent-358, score-0.212]

97 Lower bounds for the minimax risk using f -divergences, and applications. [sent-371, score-0.203]

98 On Fano’s lemma and similar inequalities for the minimax risk. [sent-378, score-0.175]

99 A bound on the label complexity of agnostic active learning. [sent-390, score-0.315]

100 Rademacher complexities and bounding the excess risk of active learning. [sent-410, score-0.34]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('py', 0.418), ('dre', 0.339), ('passive', 0.286), ('massart', 0.261), ('capacity', 0.228), ('active', 0.194), ('disagreement', 0.183), ('alexander', 0.168), ('xt', 0.164), ('log', 0.158), ('fm', 0.125), ('hanneke', 0.112), ('fn', 0.112), ('qy', 0.105), ('lec', 0.104), ('lemma', 0.098), ('xn', 0.097), ('ps', 0.096), ('ep', 0.094), ('pz', 0.092), ('bounds', 0.09), ('koltchinskii', 0.086), ('yt', 0.082), ('lower', 0.08), ('qx', 0.079), ('gin', 0.079), ('vc', 0.075), ('dh', 0.072), ('qz', 0.069), ('risk', 0.064), ('fano', 0.063), ('kl', 0.059), ('px', 0.059), ('pm', 0.058), ('scheme', 0.056), ('nh', 0.054), ('divergence', 0.054), ('coef', 0.052), ('lautum', 0.052), ('verd', 0.052), ('measure', 0.052), ('beygelzimer', 0.052), ('fp', 0.052), ('selective', 0.052), ('learner', 0.051), ('bound', 0.049), ('measurable', 0.049), ('minimax', 0.049), ('margin', 0.048), ('di', 0.047), ('castro', 0.046), ('inequality', 0.045), ('classi', 0.045), ('er', 0.044), ('excess', 0.043), ('informationtheoretic', 0.042), ('proved', 0.041), ('dasgupta', 0.041), ('reverse', 0.04), ('ari', 0.04), ('bayes', 0.039), ('agnostic', 0.039), ('complexities', 0.039), ('hamming', 0.038), ('requests', 0.036), ('paradigm', 0.035), ('queries', 0.035), ('admissible', 0.034), ('balcan', 0.034), ('erm', 0.034), ('sequentially', 0.034), ('fix', 0.034), ('probability', 0.034), ('auxiliary', 0.033), ('label', 0.033), ('classes', 0.031), ('supremum', 0.031), ('ingredient', 0.03), ('rp', 0.03), ('bernoulli', 0.03), ('conditional', 0.03), ('xi', 0.029), ('eq', 0.029), ('inequalities', 0.028), ('let', 0.028), ('actively', 0.028), ('yi', 0.028), ('uni', 0.027), ('distribution', 0.027), ('indexed', 0.027), ('rademacher', 0.026), ('theorem', 0.026), ('design', 0.026), ('rich', 0.026), ('behaves', 0.026), ('rates', 0.025), ('distributions', 0.025), ('noise', 0.025), ('mention', 0.025), ('associate', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 162 nips-2011-Lower Bounds for Passive and Active Learning

Author: Maxim Raginsky, Alexander Rakhlin

Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient.

2 0.15039597 28 nips-2011-Agnostic Selective Classification

Author: Yair Wiener, Ran El-Yaniv

Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(√(1/m))^β), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary.

3 0.14690927 21 nips-2011-Active Learning with a Drifting Distribution

Author: Liu Yang

Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1

4 0.12744661 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1

5 0.12736042 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

Author: Dan Garber, Elad Hazan

Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1

6 0.11951917 220 nips-2011-Prediction strategies without loss

7 0.11679679 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

8 0.11520724 178 nips-2011-Multiclass Boosting: Theory and Algorithms

9 0.11481651 145 nips-2011-Learning Eigenvectors for Free

10 0.10852741 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

11 0.099878117 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

12 0.098658599 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

13 0.09294983 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

14 0.091026455 80 nips-2011-Efficient Online Learning via Randomized Rounding

15 0.088275142 198 nips-2011-On U-processes and clustering performance

16 0.087226868 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

17 0.086558446 256 nips-2011-Solving Decision Problems with Limited Information

18 0.085986726 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

19 0.080561191 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

20 0.072989948 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.2), (1, -0.119), (2, -0.057), (3, -0.109), (4, 0.08), (5, 0.028), (6, 0.049), (7, -0.15), (8, -0.074), (9, -0.023), (10, 0.011), (11, 0.054), (12, 0.159), (13, 0.045), (14, 0.015), (15, -0.052), (16, 0.13), (17, -0.027), (18, 0.016), (19, 0.107), (20, 0.041), (21, -0.056), (22, 0.054), (23, 0.002), (24, -0.024), (25, -0.011), (26, 0.071), (27, 0.006), (28, 0.001), (29, 0.004), (30, 0.237), (31, 0.028), (32, 0.066), (33, 0.008), (34, -0.021), (35, -0.117), (36, -0.067), (37, 0.05), (38, -0.066), (39, 0.037), (40, -0.014), (41, 0.044), (42, -0.008), (43, -0.008), (44, 0.107), (45, -0.101), (46, 0.038), (47, 0.02), (48, -0.021), (49, 0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94577986 162 nips-2011-Lower Bounds for Passive and Active Learning

Author: Maxim Raginsky, Alexander Rakhlin

Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient.

2 0.78773618 21 nips-2011-Active Learning with a Drifting Distribution

Author: Liu Yang

Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1

3 0.59305835 28 nips-2011-Agnostic Selective Classification

Author: Yair Wiener, Ran El-Yaniv

Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(√(1/m))^β), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary.

4 0.58780032 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1

5 0.57951194 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n poly(log n, ε−1 )) preference labels for a regret of ε times the optimal loss. This is strictly better, and often significantly better than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? 1

6 0.52849942 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

7 0.51925248 80 nips-2011-Efficient Online Learning via Randomized Rounding

8 0.49660555 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

9 0.49135995 69 nips-2011-Differentially Private M-Estimators

10 0.48786476 145 nips-2011-Learning Eigenvectors for Free

11 0.48478019 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

12 0.47891104 198 nips-2011-On U-processes and clustering performance

13 0.47226465 220 nips-2011-Prediction strategies without loss

14 0.47223845 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

15 0.45179722 59 nips-2011-Composite Multiclass Losses

16 0.44143006 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

17 0.44051108 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

18 0.43858588 231 nips-2011-Randomized Algorithms for Comparison-based Search

19 0.43689579 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension

20 0.43370953 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (4, 0.04), (20, 0.043), (26, 0.414), (31, 0.061), (33, 0.013), (43, 0.089), (45, 0.114), (57, 0.017), (74, 0.038), (83, 0.032), (99, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95565331 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n poly(log n, ε−1 )) preference labels for a regret of ε times the optimal loss. This is strictly better, and often significantly better than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels?
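The abstract above hinges on adaptive querying: asking only for the preference labels the ordering procedure actually needs, rather than sampling all n(n−1)/2 pairs up front. As a minimal illustration of that idea only (not the paper's algorithm, which also tolerates non-transitive labels; this sketch assumes a transitive oracle, and `active_rank` is a hypothetical name), a comparison sort already queries adaptively:

```python
import random

def active_rank(items, prefer):
    """Order `items` with a merge sort, querying the pairwise preference
    oracle `prefer(a, b)` only for the pairs the sort actually needs
    (adaptive querying), not all n*(n-1)/2 pairs. Returns the ordering
    and the number of oracle queries issued."""
    queries = 0

    def merge(left, right):
        nonlocal queries
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            queries += 1                      # one preference label per comparison
            if prefer(left[i], right[j]):
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    def sort(xs):
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        return merge(sort(xs[:mid]), sort(xs[mid:]))

    return sort(items), queries

random.seed(0)
items = list(range(64))
random.shuffle(items)
order, q = active_rank(items, lambda a, b: a < b)
print(order == list(range(64)), q)  # True, with q well below the 2016 possible pairs
```

With n = 64 the sort issues at most n⌈log₂ n⌉ = 384 queries, versus 2016 for exhaustive non-adaptive labeling; the paper's contribution is achieving a comparable adaptive saving while controlling regret under noisy, possibly non-transitive labels.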

same-paper 2 0.93515307 162 nips-2011-Lower Bounds for Passive and Active Learning

Author: Maxim Raginsky, Alexander Rakhlin

Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient.
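The abstract does not spell out the information-theoretic machinery, but minimax lower bounds of this kind typically rest on Fano-type inequalities: with M well-separated hypotheses and per-sample KL divergence at most KL, no learner using n samples can err with probability below 1 − (n·KL + log 2)/log M. A minimal numeric sketch of that standard bound (the textbook Fano inequality, not the paper's capacity-function argument; the function name is hypothetical):

```python
import math

def fano_lower_bound(n, kl_per_sample, num_hypotheses):
    """Fano-type minimax lower bound on the probability of error.

    With `num_hypotheses` candidate distributions whose pairwise
    per-sample KL divergence is at most `kl_per_sample`, any procedure
    observing n i.i.d. samples misidentifies the truth with probability
    at least 1 - (n*KL + log 2) / log M (clipped to [0, 1])."""
    m = num_hypotheses
    bound = 1.0 - (n * kl_per_sample + math.log(2)) / math.log(m)
    return min(1.0, max(0.0, bound))

# With 100 hypotheses and KL = 0.01 per sample, 10 samples leave the
# error probability above ~0.83; the bound decays as n grows.
print(fano_lower_bound(10, 0.01, 100))   # ≈ 0.83
print(fano_lower_bound(1000, 0.01, 100)) # 0.0 once n*KL dominates log M
```

Reading the bound backwards gives a sample-complexity lower bound: error below ε is impossible unless n ≳ (1 − ε)·log M / KL, which is the shape shared by the passive-learning rates (1/n vs. (log n)/n) that the paper refines via Alexander's capacity function.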

3 0.89682102 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter

Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods.

4 0.8675403 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach

Author: Qibin Zhao, Cesar F. Caiafa, Danilo P. Mandic, Liqing Zhang, Tonio Ball, Andreas Schulze-bonhage, Andrzej S. Cichocki

Abstract: A multilinear subspace regression model based on a so-called latent variable decomposition is introduced. Unlike standard regression methods which typically employ matrix (2D) data representations followed by vector subspace transformations, the proposed approach uses tensor subspace transformations to model common latent variables across both the independent and dependent data. The proposed approach aims to maximize the correlation between the so-derived latent variables and is shown to be suitable for the prediction of multidimensional dependent data from multidimensional independent data, where for the estimation of the latent variables we introduce an algorithm based on Multilinear Singular Value Decomposition (MSVD) on a specially defined cross-covariance tensor. It is next shown that in this way we are also able to unify the existing Partial Least Squares (PLS) and N-way PLS regression algorithms within the same framework. Simulations on benchmark synthetic data confirm the advantages of the proposed approach, in terms of its predictive ability and robustness, especially for small sample sizes. The potential of the proposed technique is further illustrated on a real world task of the decoding of human intracranial electrocorticogram (ECoG) from a simultaneously recorded scalp electroencephalograph (EEG).

5 0.86622542 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

Author: Edouard Grave, Guillaume R. Obozinski, Francis R. Bach

Abstract: Using the ℓ1-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.
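The trace Lasso penalty described above can be written as Ω(w) = ‖X Diag(w)‖∗, the trace (nuclear) norm of the design matrix with its columns rescaled by the coefficients. A minimal NumPy sketch (the helper name is hypothetical; the penalty form is as stated in the paper) that also checks its two limiting cases — it reduces to the ℓ1 norm for orthonormal designs and to the ℓ2 norm when all columns are identical:

```python
import numpy as np

def trace_lasso_penalty(X, w):
    """Trace-Lasso penalty Omega(w) = || X diag(w) ||_* : the nuclear
    norm of the design matrix with column j scaled by w[j]."""
    return np.linalg.norm(X @ np.diag(w), ord="nuc")

rng = np.random.default_rng(0)
w = rng.standard_normal(5)

# Orthonormal columns: singular values of X diag(w) are |w_j|,
# so the penalty equals ||w||_1 (Lasso behavior).
X_orth = np.eye(8)[:, :5]
print(np.isclose(trace_lasso_penalty(X_orth, w), np.abs(w).sum()))

# Identical unit-norm columns: X diag(w) = x w^T is rank one with
# singular value ||w||_2, so the penalty equals ||w||_2 (ridge-like
# grouping of perfectly correlated covariates).
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
X_same = np.tile(x[:, None], (1, 5))
print(np.isclose(trace_lasso_penalty(X_same, w), np.linalg.norm(w)))
```

This interpolation between ℓ1 and ℓ2 depending on the correlation of the design is exactly the stabilizing behavior the abstract contrasts with the elastic net.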

6 0.70735854 22 nips-2011-Active Ranking using Pairwise Comparisons

7 0.64146411 21 nips-2011-Active Learning with a Drifting Distribution

8 0.61106223 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

9 0.60269517 29 nips-2011-Algorithms and hardness results for parallel large margin learning

10 0.60040456 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

11 0.59652442 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements

12 0.58781117 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

13 0.58011156 231 nips-2011-Randomized Algorithms for Comparison-based Search

14 0.58000839 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

15 0.57301354 226 nips-2011-Projection onto A Nonnegative Max-Heap

16 0.5727849 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

17 0.56773531 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

18 0.55645925 153 nips-2011-Learning large-margin halfspaces with more malicious noise

19 0.55554664 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

20 0.55255556 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning