nips2011-21 — Active Learning with a Drifting Distribution (NIPS 2011)
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.
1 Introduction
Most existing analyses of active learning are based on an i.i.d. assumption on the data. In this work, we assume the data are independent, but we allow the distribution from which the data are drawn to shift over time, while the target concept remains fixed. We consider this problem in a stream-based selective sampling model, and are interested in two quantities: the number of mistakes the algorithm makes on the first T examples in the stream, and the number of label requests among the first T examples in the stream. In particular, we study scenarios in which the distribution may drift within a fixed totally bounded family of distributions.
Unlike previous models of distribution drift [Bar92, CMEDV10], the minimax number of mistakes (or excess number of mistakes, in the noisy case) can be sublinear in the number of samples. We specifically study the classic CAL active learning strategy [CAL94] in this context, and bound the number of mistakes and label requests the algorithm makes in the realizable case, under conditions on the concept space and the family of possible distributions. We also exhibit lower bounds on these quantities that match our upper bounds in certain cases. We further study a noise-robust variant of CAL, and analyze its number of mistakes and number of label requests in noisy scenarios where the noise distribution remains fixed over time but the marginal distribution on X may shift. In particular, we upper bound these quantities under Tsybakov's noise conditions [MT99]. We also prove minimax lower bounds under these same conditions, though there is a gap between our upper and lower bounds.
2 Definitions and Notation
As in the usual statistical learning problem, there is a standard Borel space X, called the instance space, and a set C of measurable classifiers h : X → {−1, +1}, called the concept space. We additionally have a space D of distributions on X, called the distribution space. In the learning problem, there is an unobservable sequence of distributions D_1, D_2, … . Based on these quantities, we let Z = {(X_t, Y_t)}_{t=1}^∞ denote an infinite sequence of independent random variables, such that ∀t, X_t ∼ D_t, and the conditional distribution of Y_t given X_t satisfies ∀x ∈ X, P(Y_t = +1 | X_t = x) = η(x). Thus, the joint distribution of (X_t, Y_t) is specified by the pair (D_t, η), and the distribution of Z is specified by the collection {D_t}_{t=1}^∞ along with η. Note that the conditional distribution η is time-independent, since we restrict ourselves to drifting marginal distributions on X, rather than drifting concepts. Concept drift is an important and interesting topic, but is beyond the scope of our present discussion.
We are primarily interested in two quantities: the cumulative number of mistakes up to time T, M̂_T = Σ_{t=1}^T I[Ŷ_t ≠ Y_t], and the number of label requests Q̂_T among the first T examples. We are particularly interested in the asymptotic behavior of their expectations. We refer to Q̄_T = E[Q̂_T] as the expected number of label requests, and to M̄_T − M̄*_T as the expected excess number of mistakes.
For any distribution P on X, we define er_P(h) = E_{X∼P}[η(X)·I[h(X) = −1] + (1 − η(X))·I[h(X) = +1]], the probability of h making a mistake for X ∼ P and Y with conditional probability of being +1 equal to η(X). Note that, abbreviating er_t(h) = er_{D_t}(h) = P(h(X_t) ≠ Y_t), we have M̄*_T = inf_{h∈C} Σ_{t=1}^T er_t(h).
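As a quick illustration of this definition, er_P(h) can be estimated by Monte Carlo. The threshold classifier, the particular η, and P = Uniform[0, 1] below are illustrative assumptions of this sketch, not part of the paper's development.

```python
import random

def eta(x):
    # Hypothetical conditional label probability P(Y = +1 | X = x).
    return 0.9 if x > 0.5 else 0.1

def h(x):
    # A threshold classifier: predicts +1 iff x > 0.5.
    return +1 if x > 0.5 else -1

def er_monte_carlo(h, eta, sample_p, n=100_000, seed=0):
    # er_P(h) = E_{X~P}[ eta(X) I[h(X) = -1] + (1 - eta(X)) I[h(X) = +1] ].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_p(rng)
        total += eta(x) if h(x) == -1 else (1.0 - eta(x))
    return total / n

uniform = lambda rng: rng.random()  # P = Uniform[0, 1]
print(round(er_monte_carlo(h, eta, uniform), 3))  # 0.1: the integrand is constant here
```

For this particular (h, η) the integrand equals 0.1 everywhere, so the estimate is exact; in general the Monte Carlo estimate converges at the usual n^{−1/2} rate.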
Sublinear rates of growth for these quantities are considered desirable, as these represent cases in which we do "learn" the proper way to predict labels, while asymptotically using far fewer labels than passive learning. For V ⊆ C, let diam_t(V) = sup_{h,g∈V} D_t({x : h(x) ≠ g(x)}).
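The diameter diam_t(V) can be estimated empirically for a finite set of classifiers. The threshold class and D_t = Uniform[0, 1] below are illustrative assumptions of this sketch.

```python
import random

def diam_empirical(V, sample_d, n=50_000, seed=1):
    # Empirical version of diam_t(V) = sup_{h,g in V} D_t({x : h(x) != g(x)}),
    # estimated from n i.i.d. draws X ~ D_t.
    rng = random.Random(seed)
    xs = [sample_d(rng) for _ in range(n)]
    best = 0.0
    for i, h in enumerate(V):
        for g in V[i + 1:]:
            best = max(best, sum(h(x) != g(x) for x in xs) / n)
    return best

def threshold(a):
    return lambda x: +1 if x > a else -1

V = [threshold(a) for a in (0.3, 0.4, 0.6)]  # a small, finite "version space"
uniform = lambda rng: rng.random()           # D_t = Uniform[0, 1]
print(diam_empirical(V, uniform))            # close to 0.3 (thresholds 0.3 vs 0.6)
```

Here the pair of thresholds at 0.3 and 0.6 disagree exactly on (0.3, 0.6], a set of D_t-measure 0.3, which the sample estimate recovers up to sampling error.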
2.1 Assumptions
In addition to the assumption of independence of the X_t variables and that the VC dimension d of C is finite, each result below is stated under various additional assumptions. The weakest such assumption is that D is totally bounded, in the following sense: for every ε > 0, D has a finite subset D_ε such that every D ∈ D is within total variation distance ε of some element of D_ε (Assumption 1).
3 Related Work
We discuss active learning under distribution drift with a fixed target concept. There are several branches of the literature that are highly relevant to this, including domain adaptation [MMR09, MMR08], online learning [Lit88], learning with concept drift, and empirical processes for independent but not identically distributed data [vdG00]. In the fixed-distribution setting under Tsybakov noise, with Q̄_T = O(d^{2/(α+2)} T^{α/(α+2)}) queries it is possible to achieve an expected excess number of mistakes M̄_T − M̄*_T = Õ(d^{1/(α+2)} · T^{(α+1)/(α+2)}). At this time, we know of no work studying the number of mistakes and queries achievable by active learning in a stream-based setting where the distribution may change over time.
Stream-based Passive Learning with a Drifting Distribution. There has been work on learning with a drifting distribution and fixed target in the context of passive learning. These works consider learning problems in which a changing environment is modeled by a slowly changing distribution on the product space. The allowable drift is restricted by ensuring that consecutive probability distributions are close in total variation distance.
4 Active Learning in the Realizable Case
Throughout this section, suppose C is a fixed concept space and h* ∈ C is a fixed target function: that is, er_t(h*) = 0. The family of scenarios in which this is true is often collectively referred to as the realizable case. We begin our analysis with this realizable case because it greatly simplifies the analysis, laying bare the core ideas in plain form. We will discuss more general scenarios, in which er_t(h*) ≥ 0, in later sections, where we find that essentially the same principles apply there as in this initial realizable-case analysis. We will be particularly interested in the performance of the following simple algorithm, due to [CAL94], typically referred to as CAL after its discoverers. The version presented here is specified in terms of a passive learning subroutine A (mapping any sequence of labeled examples to a classifier).
7. Request Y_t, and let Q_t = Q_{t−1} ∪ {(X_t, Y_t)};
   Else let Y′_t = argmin_{y∈{−1,+1}} min_{h∈C} er(h; Q_{t−1} ∪ {(X_t, y)}), and let Q_t ← Q_{t−1} ∪ {(X_t, Y′_t)}
8. Let ĥ_t = A(Q_t)
Below, we let A1IG denote the one-inclusion graph prediction strategy of [HLW94].
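To make the control flow concrete, here is a minimal CAL-style loop for one-dimensional thresholds in the realizable case. The interval representation of the version space, the target, and the uniform stream are assumptions of this sketch rather than part of the algorithm's general statement; the passive subroutine A is replaced by direct label inference, and the sketch uses the "query before predicting" variant discussed in Section 6, so it makes zero mistakes in the realizable case.

```python
import random

def sign_label(x, target):
    # True label under the target threshold: +1 iff x > target.
    return +1 if x > target else -1

def cal_thresholds(xs, target=0.37):
    # CAL-style loop for the class {h_a : h_a(x) = +1 iff x > a} on [0, 1].
    # Version space: thresholds a in [lo, hi] consistent with all labels so far.
    lo, hi = 0.0, 1.0
    mistakes = queries = 0
    for x in xs:
        if lo < x <= hi:
            # x lies in the disagreement region: request its label.
            queries += 1
            y = sign_label(x, target)
            if y == +1:
                hi = min(hi, x)   # consistent thresholds satisfy a < x
            else:
                lo = max(lo, x)   # consistent thresholds satisfy a >= x
            pred = y              # predict after seeing the requested label
        else:
            # All consistent classifiers agree on x: infer the label.
            pred = +1 if x > hi else -1
        mistakes += (pred != sign_label(x, target))
    return mistakes, queries

rng = random.Random(0)
m, q = cal_thresholds([rng.random() for _ in range(5000)])
print(m, q)  # mistakes stay 0; queries grow only logarithmically with T
```

Since the target always remains in [lo, hi], every inferred label is correct, and only points that land in the shrinking interval [lo, hi] trigger a request.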
Specifically, the passive learning algorithm A1IG is specified as follows. For a sequence of data points U ∈ X^{t+1}, the one-inclusion graph is a graph in which each vertex represents a distinct labeling of U that can be realized by some classifier in C, and two vertices are adjacent if and only if their corresponding labelings of U differ on exactly one point. Given a set of labeled examples {(x_1, y_1), …, (x_t, y_t)} and one test point x_{t+1} we are asked to predict a label for, we first construct the one-inclusion graph on U = {x_1, …, x_{t+1}}.
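For intuition, the one-inclusion graph of a small finite class can be constructed by brute force; the particular thresholds and points below are arbitrary choices for illustration, and the enumeration over C is only feasible for tiny examples.

```python
from itertools import combinations

def one_inclusion_graph(C, U):
    # Vertices: distinct labelings of U realizable by some h in C.
    # Edges: pairs of realizable labelings differing on exactly one point of U.
    vertices = sorted({tuple(h(x) for x in U) for h in C})
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if sum(a != b for a, b in zip(u, v)) == 1]
    return vertices, edges

def threshold(a):
    return lambda x: +1 if x > a else -1

C = [threshold(a) for a in (0.0, 0.25, 0.45, 0.65, 0.85)]
U = [0.2, 0.4, 0.6]
verts, edges = one_inclusion_graph(C, U)
print(verts)  # the 4 monotone labelings of the three points
print(edges)  # 3 edges: consecutive labelings differ on exactly one point
```

For thresholds, the realizable labelings of a sorted U are exactly the monotone sign patterns, so the one-inclusion graph is a path.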
[Han07, Han11] Define the disagreement coefficient of h* under a distribution P as θ_P(ε) = sup_{r>ε} P(DIS(B_P(h*, r)))/r.
For any distribution P on X, if D = {P}, then running CAL with A = A1IG achieves expected mistake bound M̄_T = O(d log(T)) and expected query bound Q̄_T = O(θ_P(ε_T) d log²(T)), for ε_T = d log(T)/T.
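The disagreement coefficient defined above can be computed numerically in a simple case. This is a sketch under the assumption that the P-distance between thresholds a and a′ on Uniform[0, 1] is |a − a′|, so the radius-r ball around h* is the set of thresholds within r of the target and its disagreement region is an interval of width about 2r; the grid over r is an artifact of the sketch.

```python
def theta_thresholds(target, eps, grid=1000):
    # Numerically evaluate theta_P(eps) = sup_{r > eps} P(DIS(B_P(h*, r))) / r
    # for the threshold class under P = Uniform[0, 1].
    # B_P(h*, r) = thresholds a with |a - target| <= r, whose disagreement
    # region is the interval [max(0, target - r), min(1, target + r)].
    best = 0.0
    for i in range(1, grid + 1):
        r = eps + i * (1.0 - eps) / grid
        lo = max(0.0, target - r)
        hi = min(1.0, target + r)
        best = max(best, (hi - lo) / r)
    return best

print(theta_thresholds(0.5, 0.01))  # close to 2.0: theta is bounded for this class
```

A bounded coefficient like this is the favorable regime in the theorem above: the query bound then scales only polylogarithmically with T.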
4.2 Learning with a Drifting Distribution
We now generalize the above results to any sequence of distributions from a totally bounded space D. First, we prove a basic result stating that CAL can achieve a sublinear number of mistakes, and, under conditions on the disagreement coefficient, also a sublinear number of queries.
Theorem 2. If D is totally bounded (Assumption 1), then CAL (with A any empirical risk minimization algorithm) achieves an expected mistake bound M̄_T = o(T), and if θ_D(ε) = o(1/ε), then CAL makes an expected number of queries Q̄_T = o(T).
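Theorem 2's qualitative message, sublinear queries under bounded drift, can be eyeballed in a toy simulation. Everything below (the threshold class, the drifting uniform window, the sinusoidal drift schedule) is an illustrative assumption of this sketch, not the paper's construction; in the realizable case the CAL-style learner here makes no mistakes, so only queries are tracked.

```python
import math
import random

def run_cal_drift(T, target=0.5, seed=0):
    # CAL-style querying for thresholds h_a(x) = +1 iff x > a (realizable),
    # with a drifting marginal: X_t ~ Uniform[c_t, c_t + 0.5], where the
    # window center c_t moves slowly over the stream.
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0   # version space of thresholds consistent so far
    queries = 0
    for t in range(1, T + 1):
        c = 0.25 + 0.2 * math.sin(2 * math.pi * t / T)  # slow drift of D_t
        x = c + 0.5 * rng.random()
        if lo < x <= hi:
            # Disagreement region: request the label, shrink the version space.
            queries += 1
            if x > target:
                hi = min(hi, x)
            else:
                lo = max(lo, x)
    return queries

for T in (1_000, 10_000, 100_000):
    print(T, run_cal_drift(T))  # the ratio queries/T shrinks as T grows
```

Because the target stays inside every window's support, the version-space interval keeps shrinking and the fraction of rounds triggering a request vanishes, in line with Q̄_T = o(T).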
We have that ∀t > T_ε, diam_t(V_ε) ≤ diam_{k(t)}(V_ε) + ε. Combining the above arguments,

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ T_ε + Σ_{t=T_ε+1}^T E[diam_t(V_ε)]
  ≤ T_ε + εT + Σ_{t=T_ε+1}^T E[diam_{k(t)}(V_ε)]
  ≤ T_ε + εT + L(ε)εT + Σ_{t=T_ε+1}^T E[diam_{k(t)}(V_{k(t)})]
  ≤ T_ε + εT + L(ε)εT + √ε·T.

Let ε_T be any nonincreasing sequence in (0, 1) such that 1 ≪ T_{ε_T} ≪ T. Thus, noting that lim_{ε→0} L(ε)ε = 0, we have

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ T_{ε_T} + ε_T·T + L(ε_T)ε_T·T + √(ε_T)·T ≪ T.   (1)

The result on M̄_T now follows by noting that any ĥ_{t−1} ∈ C[Z_{t−1}] has er_t(ĥ_{t−1}) ≤ diam_t(C[Z_{t−1}]), so

M̄_T = E[Σ_{t=1}^T er_t(ĥ_{t−1})] ≤ E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≪ T.

Letting r_T = T^{−1}·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])], we see that r_T → 0 by (1), and since θ_D(ε) = o(1/ε), we also have θ_D(r_T)·r_T → 0, so that θ_D(r_T)·r_T·T ≪ T. Therefore,

Q̄_T = Σ_{t=1}^T P(request Y_t) ≤ θ_D(r_T)·r_T·T + θ_D(r_T)·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] = 2θ_D(r_T)·r_T·T ≪ T.
Theorem 3. If Assumption 2 is satisfied, then CAL (with A any empirical risk minimization algorithm) achieves an expected mistake bound M̄_T and expected number of queries Q̄_T such that M̄_T = O(T^{m/(m+1)} d^{1/(m+1)} log²T) and Q̄_T = O(θ_D(ε_T) T^{m/(m+1)} d^{1/(m+1)} log²T), where ε_T = (d/T)^{1/(m+1)}.
Then

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ E[Σ_{t=1}^T diam_t(C[Z′_{t−1}])] + Σ_{t=1}^T ‖D_t − P_{k(t)}‖
  ≤ E[Σ_{t=1}^T diam_t(C[Z′_{t−1}])] + εT ≤ Σ_{t=1}^T E[diam_{P_{k(t)}}(C[Z′_{t−1}])] + 2εT.

The classic convergence rate results from PAC learning [AB99, Vap82] imply

Σ_{t=1}^T E[diam_{P_{k(t)}}(C[Z′_{t−1}])] = Σ_{t=1}^T O( d log t / |{i ≤ t : k(i) = k(t)}| )
  ≤ O(d log T) · Σ_{t=1}^T 1/|{i ≤ t : k(i) = k(t)}|
  ≤ O(d log T) · |D_ε| · Σ_{u=1}^{⌈T/|D_ε|⌉} 1/u
  ≤ O(d·|D_ε|·log²(T)).

We therefore have

M̄_T ≤ E[Σ_{t=1}^T sup_{h∈C[Z_{t−1}]} er_t(h)] ≤ E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ O(d^{1/(m+1)} · T^{m/(m+1)} log²(T)).

Similarly, letting ε_T = (d/T)^{1/(m+1)}, Q̄_T is at most

E[Σ_{t=1}^T D_t(DIS(C[Z_{t−1}]))] ≤ E[Σ_{t=1}^T D_t(DIS(B_{D_t}(h*, max{diam_t(C[Z_{t−1}]), ε_T})))]
  ≤ E[Σ_{t=1}^T θ_D(ε_T) · max{diam_t(C[Z_{t−1}]), ε_T}]
  ≤ θ_D(ε_T)·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] + θ_D(ε_T)·T·ε_T
  ≤ O(θ_D(ε_T) · d^{1/(m+1)} · T^{m/(m+1)} log²(T)).
We can additionally construct a lower bound for this scenario, as follows. For instance, this is the case for linear separators (and most other natural "geometric" concept spaces). For any such C, for any active learning algorithm, there exist a set D satisfying Assumption 2, a target function h* ∈ C, and a sequence of distributions {D_t}_{t=1}^T in D such that the achieved M̄_T and Q̄_T satisfy M̄_T = Ω(T^{m/(m+1)}), and M̄_T = O(T^{m/(m+1)}) ⟹ Q̄_T = Ω(T^{m/(m+1)}).
5 Learning with Noise
In this section, we extend the above analysis to allow for various types of noise conditions commonly studied in the literature. We prove upper bounds achieved by ACAL, as well as (non-matching) minimax lower bounds. A particularly interesting special case of Assumption 3 is given by Tsybakov's noise conditions, which essentially control how common it is to have η values close to 1/2. In the setting of shifting distributions, we will be interested in conditions under which the above assumptions are satisfied simultaneously for all distributions in D.
0. t ← 0, L_t ← ∅, Q_t ← ∅; let ĥ_t be any element of C
Let δ_i be a nonincreasing sequence of values in (0, 1).
For any strictly benign (P, η), if 2^{−2^i} ≪ δ_i ≪ 2^{−i}/i, ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = o(T), and if θ_P(ε) = o(1/ε), then ACAL makes an expected number of queries Q̄_T = o(T).
For any (P, η) satisfying Assumption 4, if D = {P}, ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = Õ( d^{1/(α+2)} · T^{(α+1)/(α+2)} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ) and an expected number of queries Q̄_T = Õ( θ_P(ε_T) · d^{2/(α+2)} · T^{α/(α+2)} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ).
For any (P, η) satisfying Assumption 4, if D = {P} and δ_i = 2^{−i} in ACAL, the algorithm achieves an expected number of mistakes M̄_T and expected number of queries Q̄_T such that, for ε_T = T^{−α/(α+2)}, M̄_T − M̄*_T = Õ(d^{1/(α+2)} · T^{(α+1)/(α+2)}) and Q̄_T = Õ(θ_P(ε_T) · d^{2/(α+2)} · T^{α/(α+2)}).
5.4 Learning with a Drifting Distribution
We can now state our results concerning ACAL, which are analogous to Theorems 2 and 3 proved earlier for CAL in the realizable case.
Theorem 7. If D is totally bounded (Assumption 1) and η satisfies Assumption 3, then ACAL with δ_i = 2^{−i} achieves an expected excess mistake bound M̄_T − M̄*_T = o(T), and if additionally θ_D(ε) = o(1/ε), then ACAL makes an expected number of queries Q̄_T = o(T). The proof of Theorem 7 essentially follows from a combination of the reasoning for Theorem 2 and Theorem 8 below.
Theorem 8. If Assumptions 2 and 5 are satisfied, then ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = Õ( T^{((α+2)m+1)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), and an expected number of queries Q̄_T = Õ( θ_D(ε_T) T^{((α+2)(m+1)−α)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), where ε_T = T^{−α/((α+2)(m+1))}. The proof of this result is in many ways similar to that given above for the realizable case, and is included in the supplemental materials.
With δ_i = 2^{−i} in ACAL, the algorithm achieves an expected number of mistakes M̄_T and an expected number of queries Q̄_T such that, for ε_T = T^{−α/((α+2)(m+1))}, M̄_T − M̄*_T = Õ( T^{((α+2)m+1)/((α+2)(m+1))} ) and Q̄_T = Õ( θ_D(ε_T) · T^{((α+2)(m+1)−α)/((α+2)(m+1))} ).
Just as in the realizable case, we can also state a minimax lower bound for this noisy setting.
6 Discussion
Querying before Predicting: One interesting alternative to the above framework is to allow the learner to make a label request before making its label prediction. From a theoretical perspective, analysis of this alternative framework essentially separates the mistakes due to over-confidence from the mistakes due to recognized uncertainty. Specifically, the natural modification of CAL produces a method that (in the realizable case) makes the same number of label requests as before, except that now it makes zero mistakes, since CAL requests a label whenever there is any uncertainty about it. In particular, because the version space is only guaranteed to contain the best classifier with high confidence, there is still a small probability of making a prediction that disagrees with the best classifier h* on each round in which we do not request a label. So controlling the number of mistakes in this setting comes down to controlling the probability of removing h* from the version space. However, this confidence parameter appears in the analysis of the number of queries, so that we have a natural trade-off between the number of mistakes and the number of label requests.
In particular, under Assumptions 2 and 5, this procedure achieves an expected excess number of mistakes M̄_T − M̄*_T ≤ Σ_{i=1}^{⌊log(T)⌋} δ_i 2^i, and an expected number of queries Q̄_T = Õ( θ_D(ε_T) · T^{((α+2)(m+1)−α)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), where ε_T = T^{−α/((α+2)(m+1))}. In particular, given any nondecreasing sequence M_T, we can set the δ_i sequence to maintain M̄_T − M̄*_T ≤ M_T for all T.
Open Problems: What is not implied by the results above is any sort of trade-off between the number of mistakes and the number of queries. In the batch setting, the analogous question is the trade-off between the number of label requests and the number of unlabeled examples needed. In the realizable case, that trade-off is tightly characterized by Dasgupta's splitting index analysis [Das05]. In the batch setting, in which unlabeled examples are considered free and performance is measured only as a function of the number of label requests, [BHV10] have found that there is an important distinction between the verifiable label complexity and the unverifiable label complexity. In particular, while the former is sometimes no better than passive learning, the latter can always provide improvements for VC classes. Is there a method for every VC class that achieves O(log(T)) mistakes and o(T) queries in the realizable case?
wordName wordTfidf (topN-words)
[('mt', 0.434), ('qt', 0.368), ('diamt', 0.348), ('zt', 0.249), ('acal', 0.238), ('cal', 0.222), ('mistakes', 0.209), ('yt', 0.199), ('xt', 0.155), ('realizable', 0.145), ('diamk', 0.128), ('drifting', 0.118), ('requests', 0.113), ('ert', 0.113), ('queries', 0.106), ('request', 0.105), ('lt', 0.104), ('active', 0.097), ('rt', 0.087), ('dt', 0.085), ('label', 0.085), ('excess', 0.081), ('passive', 0.075), ('mistake', 0.072), ('dis', 0.07), ('ht', 0.069), ('er', 0.066), ('totally', 0.066), ('drift', 0.058), ('earn', 0.05), ('tsybakov', 0.05), ('assumption', 0.05), ('pk', 0.049), ('expected', 0.048), ('log', 0.047), ('concept', 0.045), ('minimax', 0.043), ('sublinear', 0.039), ('agnostic', 0.037), ('bdt', 0.037), ('diampk', 0.037), ('erqt', 0.037), ('scenarios', 0.036), ('vk', 0.036), ('ti', 0.035), ('achieves', 0.035), ('fixed', 0.034), ('xm', 0.034), ('sequence', 0.033), ('unveri', 0.032), ('distribution', 0.03), ('graph', 0.03), ('wortman', 0.03), ('unobservable', 0.03), ('quantities', 0.029), ('mansour', 0.029), ('dasgupta', 0.029), ('theorem', 0.029), ('essentially', 0.029), ('classic', 0.027), ('vertex', 0.027), ('supplemental', 0.027), ('satisfying', 0.027), ('separators', 0.026), ('vc', 0.026), ('bounds', 0.026), ('distributions', 0.025), ('letting', 0.025), ('nonincreasing', 0.025), ('referred', 0.025), ('vertices', 0.024), ('bound', 0.024), ('benign', 0.023), ('disagreement', 0.023), ('subroutine', 0.023), ('slowly', 0.023), ('predict', 0.023), ('satis', 0.023), ('ut', 0.023), ('ln', 0.022), ('target', 0.022), ('interested', 0.022), ('additionally', 0.022), ('enumerate', 0.022), ('mohri', 0.022), ('conditions', 0.022), ('minimal', 0.021), ('let', 0.021), ('inf', 0.021), ('adaptation', 0.021), ('noise', 0.021), ('pac', 0.021), ('ties', 0.02), ('bp', 0.019), ('tightly', 0.019), ('corollary', 0.019), ('colt', 0.019), ('labeling', 0.019), ('achievable', 0.018), ('sup', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1
2 0.18521444 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.16648848 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
4 0.15596387 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
5 0.14690927 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
6 0.10062665 220 nips-2011-Prediction strategies without loss
7 0.096871249 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
8 0.094500072 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.085478686 28 nips-2011-Agnostic Selective Classification
10 0.08221522 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
11 0.080979511 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
12 0.077522449 145 nips-2011-Learning Eigenvectors for Free
13 0.075208575 22 nips-2011-Active Ranking using Pairwise Comparisons
14 0.07272727 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
15 0.069286279 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
16 0.06753087 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
17 0.060439434 61 nips-2011-Contextual Gaussian Process Bandit Optimization
18 0.057033785 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
19 0.053982291 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
20 0.053299893 153 nips-2011-Learning large-margin halfspaces with more malicious noise
topicId topicWeight
[(0, 0.163), (1, -0.139), (2, -0.014), (3, -0.066), (4, 0.115), (5, -0.005), (6, 0.034), (7, -0.123), (8, -0.022), (9, -0.085), (10, 0.129), (11, 0.022), (12, 0.163), (13, 0.09), (14, 0.021), (15, -0.021), (16, 0.078), (17, 0.035), (18, 0.064), (19, -0.019), (20, -0.004), (21, 0.027), (22, 0.056), (23, -0.044), (24, 0.034), (25, 0.057), (26, 0.065), (27, 0.056), (28, 0.003), (29, -0.043), (30, 0.063), (31, 0.045), (32, 0.024), (33, 0.021), (34, -0.009), (35, -0.074), (36, -0.106), (37, 0.095), (38, 0.033), (39, 0.023), (40, 0.058), (41, 0.095), (42, -0.011), (43, 0.051), (44, 0.026), (45, -0.089), (46, 0.077), (47, -0.009), (48, -0.145), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.94269139 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1
2 0.75677729 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.7003544 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
4 0.68821818 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
5 0.67807281 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
Author: Zuoguan Wang, Gerwin Schalk, Qiang Ji
Abstract: Brain-computer interfaces (BCIs) use brain signals to convey a user’s intent. Some BCI approaches begin by decoding kinematic parameters of movements from brain signals, and then proceed to using these signals, in absence of movements, to allow a user to control an output. Recent results have shown that electrocorticographic (ECoG) recordings from the surface of the brain in humans can give information about kinematic parameters (e.g., hand velocity or finger flexion). The decoding approaches in these demonstrations usually employed classical classification/regression algorithms that derive a linear mapping between brain signals and outputs. However, they typically only incorporate little prior information about the target kinematic parameter. In this paper, we show that different types of anatomical constraints that govern finger flexion can be exploited in this context. Specifically, we incorporate these constraints in the construction, structure, and the probabilistic functions of a switched non-parametric dynamic system (SNDS) model. We then apply the resulting SNDS decoder to infer the flexion of individual fingers from the same ECoG dataset used in a recent study. Our results show that the application of the proposed model, which incorporates anatomical constraints, improves decoding performance compared to the results in the previous work. Thus, the results presented in this paper may ultimately lead to neurally controlled hand prostheses with full fine-grained finger articulation. 1
6 0.63512111 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
7 0.57645959 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
8 0.53690809 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
9 0.50905716 145 nips-2011-Learning Eigenvectors for Free
10 0.47096124 220 nips-2011-Prediction strategies without loss
11 0.4673183 225 nips-2011-Probabilistic amplitude and frequency demodulation
12 0.41826513 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
13 0.40130141 28 nips-2011-Agnostic Selective Classification
14 0.38101217 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
15 0.38003993 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
16 0.35792598 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
17 0.34551963 153 nips-2011-Learning large-margin halfspaces with more malicious noise
18 0.34101909 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
19 0.33598423 231 nips-2011-Randomized Algorithms for Comparison-based Search
20 0.33310705 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
topicId topicWeight
[(0, 0.034), (4, 0.066), (20, 0.035), (26, 0.066), (31, 0.113), (33, 0.025), (43, 0.053), (45, 0.13), (57, 0.023), (68, 0.234), (74, 0.056), (83, 0.033), (99, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.80834275 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.
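The disagreement-based strategy the paper analyzes (CAL) can be illustrated with a toy realizable-case sketch: threshold classifiers on [0, 1], a stream of points, and a label request only when classifiers still consistent with past labels disagree on the new point. This is an illustrative sketch of the CAL mechanism, not the paper's drifting-distribution analysis; the uniform stream and target threshold 0.5 are assumptions for the example.

```python
import random

def cal_stream(points, target=0.5):
    """CAL on threshold classifiers h_t(x) = 1[x >= t] over [0, 1].

    The version space of consistent thresholds is an interval (lo, hi].
    A label is requested only when the point falls strictly inside
    (lo, hi), i.e. when consistent classifiers disagree on it.
    """
    lo, hi = 0.0, 1.0          # thresholds consistent with labels so far
    mistakes, requests = 0, 0
    for x in points:
        true_label = 1 if x >= target else 0
        if lo < x < hi:        # disagreement region: request the label
            requests += 1
            pred = 1 if x >= (lo + hi) / 2 else 0  # any h in version space
            if pred != true_label:
                mistakes += 1
            if true_label == 1:
                hi = min(hi, x)   # target threshold must be <= x
            else:
                lo = max(lo, x)   # target threshold must be > x
        # else: all consistent classifiers agree; predict without a label
    return mistakes, requests

random.seed(1)
stream = [random.random() for _ in range(2000)]
mistakes, requests = cal_stream(stream)
```

The number of label requests grows far more slowly than the stream length, since each request shrinks the disagreement region — the kind of quantity the paper bounds, there under a drifting rather than fixed distribution.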
2 0.69560719 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound for Multiple Kernel Learning (MKL) under a general class of regularizations. Our main target is dense-type regularizations, including ℓp-MKL, which imposes ℓp-mixed-norm regularization instead of ℓ1-mixed-norm regularization. According to recent numerical experiments, sparse regularization does not necessarily perform well compared with dense-type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate for ℓp-MKL that is the tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of the candidate reproducing kernel Hilbert spaces are inhomogeneous, dense-type regularization achieves a better learning rate than sparse ℓ1 regularization.
3 0.67336047 231 nips-2011-Randomized Algorithms for Comparison-based Search
Author: Dominique Tschopp, Suhas Diggavi, Payam Delgosha, Soheil Mohajer
Abstract: This paper addresses the problem of finding the nearest neighbor (or one of the R nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object more similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query while minimizing the number of questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: the combinatorial disorder D, which defines approximate triangle inequalities on ranks. We present a lower bound of Ω(D log(n/D) + D²) on the average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst-case behavior. We develop a randomized scheme for NN retrieval in O(D³ log² n + D³ log² n log log n^{D³}) questions. The learning phase requires asking O(nD³ log² n + D³ log² n log log n^{D³}) questions and O(n log² n / log(2D)) bits of storage.
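The comparison-oracle model itself is easy to make concrete. The sketch below shows the naive baseline the paper improves on: a tournament that finds the NN with exactly n − 1 oracle calls, never seeing a distance directly. This is only an illustration of the access model on made-up 1-D data, not the paper's sublinear scheme (which exploits small combinatorial disorder D).

```python
def make_oracle(q):
    """Comparison oracle for 1-D objects: given references a and b,
    return the one closer to the query q. Only the oracle sees
    actual distances; the search algorithm never does."""
    calls = {"n": 0}
    def oracle(a, b):
        calls["n"] += 1
        return a if abs(a - q) <= abs(b - q) else b
    return oracle, calls

def nn_tournament(db, oracle):
    """Naive NN search: a linear tournament over the database using
    n - 1 oracle calls. The paper's randomized scheme answers the same
    query in far fewer questions when the disorder constant D is small."""
    best = db[0]
    for x in db[1:]:
        best = oracle(best, x)
    return best

db = [0.0, 0.2, 0.35, 0.6, 0.9]
oracle, calls = make_oracle(q=0.33)
nn = nn_tournament(db, oracle)  # -> 0.35, after len(db) - 1 calls
```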
4 0.66811109 22 nips-2011-Active Ranking using Pairwise Comparisons
Author: Kevin G. Jamieson, Robert Nowak
Abstract: This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of n objects can be identified by standard sorting methods using n log₂ n pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a d-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in R^d. We show that under this assumption the number of possible rankings grows like n^{2d} and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than d log n adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.
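The n log₂ n baseline in this abstract — recovering the full ranking by a standard sort driven only by pairwise comparisons — can be sketched by wrapping a counting comparison oracle around Python's sort. This illustrates only the baseline on toy 1-D data; the reference point, objects, and counting wrapper are assumptions for the example, and the paper's contribution is the adaptive scheme needing roughly d log n comparisons instead.

```python
import functools

def rank_with_oracle(objects, ref):
    """Rank objects by distance to a hidden reference point `ref`,
    using only pairwise comparisons ('which of a, b is closer?')
    and counting how many comparisons the sort issues."""
    count = {"n": 0}
    def cmp(a, b):
        count["n"] += 1
        da, db = abs(a - ref), abs(b - ref)
        return -1 if da < db else (1 if da > db else 0)
    ranking = sorted(objects, key=functools.cmp_to_key(cmp))
    return ranking, count["n"]

objs = [5, 1, 9, 3, 7, 2, 8]
ranking, comparisons = rank_with_oracle(objs, ref=4)
# ranking orders objects by distance to 4; ties keep input order
# (Python's sort is stable), so: [5, 3, 2, 1, 7, 8, 9]
```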
5 0.664419 229 nips-2011-Query-Aware MCMC
Author: Michael L. Wick, Andrew McCallum
Abstract: Traditional approaches to probabilistic inference, such as loopy belief propagation and Gibbs sampling, typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly, there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models.
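The core idea — spend sampling effort near the query variable rather than uniformly over the model — can be sketched on a tiny binary chain. This is an illustrative toy, not the paper's algorithm: the paper weights variables by generalized mutual information, whereas the sketch below simply restricts post-burn-in Gibbs sweeps to the query variable and its Markov blanket; the chain model, coupling strength, and schedule are all assumptions.

```python
import math, random

def gibbs_query_marginal(n_vars=10, query=9, theta=0.5, iters=20000, seed=0):
    """Gibbs sampling on a binary chain x_0 - x_1 - ... - x_{n-1} with
    attractive pairwise potentials exp(theta * 1[x_i == x_j]).

    Query-aware twist (illustrative): after a short burn-in of full
    sweeps, sweeps touch only the query variable and its Markov
    blanket, since only the query marginal P(x_query = 1) is reported.
    """
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n_vars)]

    def resample(i):
        # conditional of x_i given its chain neighbors
        score = [0.0, 0.0]
        for j in (i - 1, i + 1):
            if 0 <= j < n_vars:
                for v in (0, 1):
                    score[v] += theta * (1 if v == x[j] else 0)
        p1 = math.exp(score[1]) / (math.exp(score[0]) + math.exp(score[1]))
        x[i] = 1 if rng.random() < p1 else 0

    focus = [j for j in (query - 1, query, query + 1) if 0 <= j < n_vars]
    burn, ones = iters // 10, 0
    for t in range(iters):
        if t < burn:
            for i in range(n_vars):   # burn-in: full sweeps
                resample(i)
        else:
            for i in focus:           # query-aware: blanket-only sweeps
                resample(i)
            ones += x[query]
    return ones / (iters - burn)

p_hat = gibbs_query_marginal()
```

The flip-symmetric potentials put the true query marginal near 0.5, and the focused sampler estimates it while touching only 3 of the 10 variables per step after burn-in — the efficiency gain the abstract motivates.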
6 0.65904897 283 nips-2011-The Fixed Points of Off-Policy TD
7 0.65895683 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.65740168 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
9 0.65699613 197 nips-2011-On Tracking The Partition Function
10 0.65615934 180 nips-2011-Multiple Instance Filtering
11 0.65447873 64 nips-2011-Convergent Bounds on the Euclidean Distance
12 0.65342474 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
13 0.65333974 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
14 0.6522795 303 nips-2011-Video Annotation and Tracking with Active Learning
15 0.65204668 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
16 0.65104145 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
17 0.64942652 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
18 0.64920473 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
19 0.64768428 156 nips-2011-Learning to Learn with Compound HD Models
20 0.64750028 258 nips-2011-Sparse Bayesian Multi-Task Learning