nips2011-21 — Active Learning with a Drifting Distribution (NIPS 2011)
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.
1 Introduction
Most existing analyses of active learning are based on an i.i.d. assumption on the data. In this work, we assume the data are independent, but we allow the distribution from which the data are drawn to shift over time, while the target concept remains fixed. We consider this problem in a stream-based selective sampling model, and are interested in two quantities: the number of mistakes the algorithm makes on the first T examples in the stream, and the number of label requests among the first T examples in the stream. In particular, we study scenarios in which the distribution may drift within a fixed totally bounded family of distributions.
Unlike previous models of distribution drift [Bar92, CMEDV10], the minimax number of mistakes (or excess number of mistakes, in the noisy case) can be sublinear in the number of samples. We specifically study the classic CAL active learning strategy [CAL94] in this context, and bound the number of mistakes and label requests the algorithm makes in the realizable case, under conditions on the concept space and the family of possible distributions. We also exhibit lower bounds on these quantities that match our upper bounds in certain cases. We further study a noise-robust variant of CAL, and analyze its number of mistakes and number of label requests in noisy scenarios where the noise distribution remains fixed over time but the marginal distribution on X may shift. In particular, we upper bound these quantities under Tsybakov's noise conditions [MT99]. We also prove minimax lower bounds under these same conditions, though there is a gap between our upper and lower bounds.
2 Definitions and Notation
As in the usual statistical learning problem, there is a standard Borel space X, called the instance space, and a set C of measurable classifiers h : X → {−1, +1}, called the concept space. We additionally have a space D of distributions on X, called the distribution space. In the learning problem, there is an unobservable sequence of distributions D_1, D_2, … . Based on these quantities, we let Z = {(X_t, Y_t)}_{t=1}^∞ denote an infinite sequence of independent random variables, such that ∀t, X_t ∼ D_t, and the conditional distribution of Y_t given X_t satisfies ∀x ∈ X, P(Y_t = +1 | X_t = x) = η(x). Thus, the joint distribution of (X_t, Y_t) is specified by the pair (D_t, η), and the distribution of Z is specified by the collection {D_t}_{t=1}^∞ along with η. Note that the conditional distribution η is time-independent, since we restrict ourselves to drifting marginal distributions on X, rather than drifting concepts. Concept drift is an important and interesting topic, but is beyond the scope of our present discussion.
We are primarily interested in two quantities: the cumulative number of mistakes up to time T, M̂_T = Σ_{t=1}^T I[Ŷ_t ≠ Y_t], and the number of label requests Q̂_T among the first T examples. We are particularly interested in the asymptotic behavior of their expectations. We refer to Q̄_T = E[Q̂_T] as the expected number of label requests, and to M̄_T − M̄*_T as the expected excess number of mistakes.
For any distribution P on X, we define er_P(h) = E_{X∼P}[η(X)·I[h(X) = −1] + (1 − η(X))·I[h(X) = +1]], the probability of h making a mistake for X ∼ P and Y with conditional probability of being +1 equal to η(X). Note that, abbreviating er_t(h) = er_{D_t}(h) = P(h(X_t) ≠ Y_t), we have M̄*_T = inf_{h∈C} Σ_{t=1}^T er_t(h).
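As a quick illustration of this definition, er_P(h) can be estimated by Monte Carlo. The threshold classifier, the particular η, and P = Uniform[0, 1] below are illustrative assumptions of this sketch, not part of the paper's development.

```python
import random

def eta(x):
    # Hypothetical conditional label probability P(Y = +1 | X = x).
    return 0.9 if x > 0.5 else 0.1

def h(x):
    # A threshold classifier: predicts +1 iff x > 0.5.
    return +1 if x > 0.5 else -1

def er_monte_carlo(h, eta, sample_p, n=100_000, seed=0):
    # er_P(h) = E_{X~P}[ eta(X) I[h(X) = -1] + (1 - eta(X)) I[h(X) = +1] ].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_p(rng)
        total += eta(x) if h(x) == -1 else (1.0 - eta(x))
    return total / n

uniform = lambda rng: rng.random()  # P = Uniform[0, 1]
print(round(er_monte_carlo(h, eta, uniform), 3))  # 0.1: the integrand is constant here
```

For this particular (h, η) the integrand equals 0.1 everywhere, so the estimate is exact; in general the Monte Carlo estimate converges at the usual n^{−1/2} rate.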
Sublinear rates of growth for these quantities are considered desirable, as these represent cases in which we do "learn" the proper way to predict labels, while asymptotically using far fewer labels than passive learning. For V ⊆ C, let diam_t(V) = sup_{h,g∈V} D_t({x : h(x) ≠ g(x)}).
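The diameter diam_t(V) can be estimated empirically for a finite set of classifiers. The threshold class and D_t = Uniform[0, 1] below are illustrative assumptions of this sketch.

```python
import random

def diam_empirical(V, sample_d, n=50_000, seed=1):
    # Empirical version of diam_t(V) = sup_{h,g in V} D_t({x : h(x) != g(x)}),
    # estimated from n i.i.d. draws X ~ D_t.
    rng = random.Random(seed)
    xs = [sample_d(rng) for _ in range(n)]
    best = 0.0
    for i, h in enumerate(V):
        for g in V[i + 1:]:
            best = max(best, sum(h(x) != g(x) for x in xs) / n)
    return best

def threshold(a):
    return lambda x: +1 if x > a else -1

V = [threshold(a) for a in (0.3, 0.4, 0.6)]  # a small, finite "version space"
uniform = lambda rng: rng.random()           # D_t = Uniform[0, 1]
print(diam_empirical(V, uniform))            # close to 0.3 (thresholds 0.3 vs 0.6)
```

Here the pair of thresholds at 0.3 and 0.6 disagree exactly on (0.3, 0.6], a set of D_t-measure 0.3, which the sample estimate recovers up to sampling error.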
2.1 Assumptions
In addition to the assumption of independence of the X_t variables and that the VC dimension d of C is finite, each result below is stated under various additional assumptions. The weakest such assumption is that D is totally bounded, in the following sense: for every ε > 0, D has a finite subset D_ε such that every D ∈ D is within total variation distance ε of some element of D_ε (Assumption 1).
3 Related Work
We discuss active learning under distribution drift with a fixed target concept. There are several branches of the literature that are highly relevant to this, including domain adaptation [MMR09, MMR08], online learning [Lit88], learning with concept drift, and empirical processes for independent but not identically distributed data [vdG00]. In the fixed-distribution setting under Tsybakov noise, with Q̄_T = O(d^{2/(α+2)} T^{α/(α+2)}) queries it is possible to achieve an expected excess number of mistakes M̄_T − M̄*_T = Õ(d^{1/(α+2)} · T^{(α+1)/(α+2)}). At this time, we know of no work studying the number of mistakes and queries achievable by active learning in a stream-based setting where the distribution may change over time.
Stream-based Passive Learning with a Drifting Distribution. There has been work on learning with a drifting distribution and fixed target in the context of passive learning. These works consider learning problems in which a changing environment is modeled by a slowly changing distribution on the product space. The allowable drift is restricted by ensuring that consecutive probability distributions are close in total variation distance.
4 Active Learning in the Realizable Case
Throughout this section, suppose C is a fixed concept space and h* ∈ C is a fixed target function: that is, er_t(h*) = 0. The family of scenarios in which this is true is often collectively referred to as the realizable case. We begin our analysis with this realizable case because it greatly simplifies the analysis, laying bare the core ideas in plain form. We will discuss more general scenarios, in which er_t(h*) ≥ 0, in later sections, where we find that essentially the same principles apply there as in this initial realizable-case analysis. We will be particularly interested in the performance of the following simple algorithm, due to [CAL94], typically referred to as CAL after its discoverers. The version presented here is specified in terms of a passive learning subroutine A (mapping any sequence of labeled examples to a classifier).
7. Request Y_t, and let Q_t = Q_{t−1} ∪ {(X_t, Y_t)};
   Else let Y′_t = argmin_{y∈{−1,+1}} min_{h∈C} er(h; Q_{t−1} ∪ {(X_t, y)}), and let Q_t ← Q_{t−1} ∪ {(X_t, Y′_t)}
8. Let ĥ_t = A(Q_t)
Below, we let A1IG denote the one-inclusion graph prediction strategy of [HLW94].
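To make the control flow concrete, here is a minimal CAL-style loop for one-dimensional thresholds in the realizable case. The interval representation of the version space, the target, and the uniform stream are assumptions of this sketch rather than part of the algorithm's general statement; the passive subroutine A is replaced by direct label inference, and the sketch uses the "query before predicting" variant discussed in Section 6, so it makes zero mistakes in the realizable case.

```python
import random

def sign_label(x, target):
    # True label under the target threshold: +1 iff x > target.
    return +1 if x > target else -1

def cal_thresholds(xs, target=0.37):
    # CAL-style loop for the class {h_a : h_a(x) = +1 iff x > a} on [0, 1].
    # Version space: thresholds a in [lo, hi] consistent with all labels so far.
    lo, hi = 0.0, 1.0
    mistakes = queries = 0
    for x in xs:
        if lo < x <= hi:
            # x lies in the disagreement region: request its label.
            queries += 1
            y = sign_label(x, target)
            if y == +1:
                hi = min(hi, x)   # consistent thresholds satisfy a < x
            else:
                lo = max(lo, x)   # consistent thresholds satisfy a >= x
            pred = y              # predict after seeing the requested label
        else:
            # All consistent classifiers agree on x: infer the label.
            pred = +1 if x > hi else -1
        mistakes += (pred != sign_label(x, target))
    return mistakes, queries

rng = random.Random(0)
m, q = cal_thresholds([rng.random() for _ in range(5000)])
print(m, q)  # mistakes stay 0; queries grow only logarithmically with T
```

Since the target always remains in [lo, hi], every inferred label is correct, and only points that land in the shrinking interval [lo, hi] trigger a request.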
Specifically, the passive learning algorithm A1IG is specified as follows. For a sequence of data points U ∈ X^{t+1}, the one-inclusion graph is a graph in which each vertex represents a distinct labeling of U that can be realized by some classifier in C, and two vertices are adjacent if and only if their corresponding labelings of U differ on exactly one point. Given a set of labeled examples {(x_1, y_1), …, (x_t, y_t)} and one test point x_{t+1} we are asked to predict a label for, we first construct the one-inclusion graph on U = {x_1, …, x_{t+1}}.
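For intuition, the one-inclusion graph of a small finite class can be constructed by brute force; the particular thresholds and points below are arbitrary choices for illustration, and the enumeration over C is only feasible for tiny examples.

```python
from itertools import combinations

def one_inclusion_graph(C, U):
    # Vertices: distinct labelings of U realizable by some h in C.
    # Edges: pairs of realizable labelings differing on exactly one point of U.
    vertices = sorted({tuple(h(x) for x in U) for h in C})
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if sum(a != b for a, b in zip(u, v)) == 1]
    return vertices, edges

def threshold(a):
    return lambda x: +1 if x > a else -1

C = [threshold(a) for a in (0.0, 0.25, 0.45, 0.65, 0.85)]
U = [0.2, 0.4, 0.6]
verts, edges = one_inclusion_graph(C, U)
print(verts)  # the 4 monotone labelings of the three points
print(edges)  # 3 edges: consecutive labelings differ on exactly one point
```

For thresholds, the realizable labelings of a sorted U are exactly the monotone sign patterns, so the one-inclusion graph is a path.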
[Han07, Han11] Define the disagreement coefficient of h* under a distribution P as θ_P(ε) = sup_{r>ε} P(DIS(B_P(h*, r)))/r.
For any distribution P on X, if D = {P}, then running CAL with A = A1IG achieves expected mistake bound M̄_T = O(d log(T)) and expected query bound Q̄_T = O(θ_P(ε_T) d log²(T)), for ε_T = d log(T)/T.
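The disagreement coefficient defined above can be computed numerically in a simple case. This is a sketch under the assumption that the P-distance between thresholds a and a′ on Uniform[0, 1] is |a − a′|, so the radius-r ball around h* is the set of thresholds within r of the target and its disagreement region is an interval of width about 2r; the grid over r is an artifact of the sketch.

```python
def theta_thresholds(target, eps, grid=1000):
    # Numerically evaluate theta_P(eps) = sup_{r > eps} P(DIS(B_P(h*, r))) / r
    # for the threshold class under P = Uniform[0, 1].
    # B_P(h*, r) = thresholds a with |a - target| <= r, whose disagreement
    # region is the interval [max(0, target - r), min(1, target + r)].
    best = 0.0
    for i in range(1, grid + 1):
        r = eps + i * (1.0 - eps) / grid
        lo = max(0.0, target - r)
        hi = min(1.0, target + r)
        best = max(best, (hi - lo) / r)
    return best

print(theta_thresholds(0.5, 0.01))  # close to 2.0: theta is bounded for this class
```

A bounded coefficient like this is the favorable regime in the theorem above: the query bound then scales only polylogarithmically with T.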
4.2 Learning with a Drifting Distribution
We now generalize the above results to any sequence of distributions from a totally bounded space D. First, we prove a basic result stating that CAL can achieve a sublinear number of mistakes, and, under conditions on the disagreement coefficient, also a sublinear number of queries.
Theorem 2. If D is totally bounded (Assumption 1), then CAL (with A any empirical risk minimization algorithm) achieves an expected mistake bound M̄_T = o(T), and if θ_D(ε) = o(1/ε), then CAL makes an expected number of queries Q̄_T = o(T).
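Theorem 2's qualitative message, sublinear queries under bounded drift, can be eyeballed in a toy simulation. Everything below (the threshold class, the drifting uniform window, the sinusoidal drift schedule) is an illustrative assumption of this sketch, not the paper's construction; in the realizable case the CAL-style learner here makes no mistakes, so only queries are tracked.

```python
import math
import random

def run_cal_drift(T, target=0.5, seed=0):
    # CAL-style querying for thresholds h_a(x) = +1 iff x > a (realizable),
    # with a drifting marginal: X_t ~ Uniform[c_t, c_t + 0.5], where the
    # window center c_t moves slowly over the stream.
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0   # version space of thresholds consistent so far
    queries = 0
    for t in range(1, T + 1):
        c = 0.25 + 0.2 * math.sin(2 * math.pi * t / T)  # slow drift of D_t
        x = c + 0.5 * rng.random()
        if lo < x <= hi:
            # Disagreement region: request the label, shrink the version space.
            queries += 1
            if x > target:
                hi = min(hi, x)
            else:
                lo = max(lo, x)
    return queries

for T in (1_000, 10_000, 100_000):
    print(T, run_cal_drift(T))  # the ratio queries/T shrinks as T grows
```

Because the target stays inside every window's support, the version-space interval keeps shrinking and the fraction of rounds triggering a request vanishes, in line with Q̄_T = o(T).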
We have that ∀t > T_ε, diam_t(V_ε) ≤ diam_{k(t)}(V_ε) + ε. Combining the above arguments,

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ T_ε + Σ_{t=T_ε+1}^T E[diam_t(V_ε)]
  ≤ T_ε + εT + Σ_{t=T_ε+1}^T E[diam_{k(t)}(V_ε)]
  ≤ T_ε + εT + L(ε)εT + Σ_{t=T_ε+1}^T E[diam_{k(t)}(V_{k(t)})]
  ≤ T_ε + εT + L(ε)εT + √ε·T.

Let ε_T be any nonincreasing sequence in (0, 1) such that 1 ≪ T_{ε_T} ≪ T. Thus, noting that lim_{ε→0} L(ε)ε = 0, we have

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ T_{ε_T} + ε_T·T + L(ε_T)ε_T·T + √(ε_T)·T ≪ T.   (1)

The result on M̄_T now follows by noting that any ĥ_{t−1} ∈ C[Z_{t−1}] has er_t(ĥ_{t−1}) ≤ diam_t(C[Z_{t−1}]), so

M̄_T = E[Σ_{t=1}^T er_t(ĥ_{t−1})] ≤ E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≪ T.

Letting r_T = T^{−1}·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])], we see that r_T → 0 by (1), and since θ_D(ε) = o(1/ε), we also have θ_D(r_T)·r_T → 0, so that θ_D(r_T)·r_T·T ≪ T. Therefore,

Q̄_T = Σ_{t=1}^T P(request Y_t) ≤ θ_D(r_T)·r_T·T + θ_D(r_T)·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] = 2θ_D(r_T)·r_T·T ≪ T.
Theorem 3. If Assumption 2 is satisfied, then CAL (with A any empirical risk minimization algorithm) achieves an expected mistake bound M̄_T and expected number of queries Q̄_T such that M̄_T = O(T^{m/(m+1)} d^{1/(m+1)} log²T) and Q̄_T = O(θ_D(ε_T) T^{m/(m+1)} d^{1/(m+1)} log²T), where ε_T = (d/T)^{1/(m+1)}.
Then

E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ E[Σ_{t=1}^T diam_t(C[Z′_{t−1}])] + Σ_{t=1}^T ‖D_t − P_{k(t)}‖
  ≤ E[Σ_{t=1}^T diam_t(C[Z′_{t−1}])] + εT ≤ Σ_{t=1}^T E[diam_{P_{k(t)}}(C[Z′_{t−1}])] + 2εT.

The classic convergence rate results from PAC learning [AB99, Vap82] imply

Σ_{t=1}^T E[diam_{P_{k(t)}}(C[Z′_{t−1}])] = Σ_{t=1}^T O( d log t / |{i ≤ t : k(i) = k(t)}| )
  ≤ O(d log T) · Σ_{t=1}^T 1/|{i ≤ t : k(i) = k(t)}|
  ≤ O(d log T) · |D_ε| · Σ_{u=1}^{⌈T/|D_ε|⌉} 1/u
  ≤ O(d·|D_ε|·log²(T)).

We therefore have

M̄_T ≤ E[Σ_{t=1}^T sup_{h∈C[Z_{t−1}]} er_t(h)] ≤ E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] ≤ O(d^{1/(m+1)} · T^{m/(m+1)} log²(T)).

Similarly, letting ε_T = (d/T)^{1/(m+1)}, Q̄_T is at most

E[Σ_{t=1}^T D_t(DIS(C[Z_{t−1}]))] ≤ E[Σ_{t=1}^T D_t(DIS(B_{D_t}(h*, max{diam_t(C[Z_{t−1}]), ε_T})))]
  ≤ E[Σ_{t=1}^T θ_D(ε_T) · max{diam_t(C[Z_{t−1}]), ε_T}]
  ≤ θ_D(ε_T)·E[Σ_{t=1}^T diam_t(C[Z_{t−1}])] + θ_D(ε_T)·T·ε_T
  ≤ O(θ_D(ε_T) · d^{1/(m+1)} · T^{m/(m+1)} log²(T)).
We can additionally construct a lower bound for this scenario, as follows. For instance, this is the case for linear separators (and most other natural "geometric" concept spaces). For any such C, for any active learning algorithm, there exist a set D satisfying Assumption 2, a target function h* ∈ C, and a sequence of distributions {D_t}_{t=1}^T in D such that the achieved M̄_T and Q̄_T satisfy M̄_T = Ω(T^{m/(m+1)}), and M̄_T = O(T^{m/(m+1)}) ⟹ Q̄_T = Ω(T^{m/(m+1)}).
5 Learning with Noise
In this section, we extend the above analysis to allow for various types of noise conditions commonly studied in the literature. We prove upper bounds achieved by ACAL, as well as (non-matching) minimax lower bounds. A particularly interesting special case of Assumption 3 is given by Tsybakov's noise conditions, which essentially control how common it is to have η values close to 1/2. In the setting of shifting distributions, we will be interested in conditions under which the above assumptions are satisfied simultaneously for all distributions in D.
0. t ← 0, L_t ← ∅, Q_t ← ∅; let ĥ_t be any element of C
Let δ_i be a nonincreasing sequence of values in (0, 1).
For any strictly benign (P, η), if 2^{−2^i} ≪ δ_i ≪ 2^{−i}/i, ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = o(T), and if θ_P(ε) = o(1/ε), then ACAL makes an expected number of queries Q̄_T = o(T).
For any (P, η) satisfying Assumption 4, if D = {P}, ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = Õ( d^{1/(α+2)} · T^{(α+1)/(α+2)} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ) and an expected number of queries Q̄_T = Õ( θ_P(ε_T) · d^{2/(α+2)} · T^{α/(α+2)} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ).
For any (P, η) satisfying Assumption 4, if D = {P} and δ_i = 2^{−i} in ACAL, the algorithm achieves an expected number of mistakes M̄_T and expected number of queries Q̄_T such that, for ε_T = T^{−α/(α+2)}, M̄_T − M̄*_T = Õ(d^{1/(α+2)} · T^{(α+1)/(α+2)}) and Q̄_T = Õ(θ_P(ε_T) · d^{2/(α+2)} · T^{α/(α+2)}).
5.4 Learning with a Drifting Distribution
We can now state our results concerning ACAL, which are analogous to Theorems 2 and 3 proved earlier for CAL in the realizable case.
Theorem 7. If D is totally bounded (Assumption 1) and η satisfies Assumption 3, then ACAL with δ_i = 2^{−i} achieves an expected excess mistake bound M̄_T − M̄*_T = o(T), and if additionally θ_D(ε) = o(1/ε), then ACAL makes an expected number of queries Q̄_T = o(T). The proof of Theorem 7 essentially follows from a combination of the reasoning for Theorem 2 and Theorem 8 below.
Theorem 8. If Assumptions 2 and 5 are satisfied, then ACAL achieves an expected excess number of mistakes M̄_T − M̄*_T = Õ( T^{((α+2)m+1)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), and an expected number of queries Q̄_T = Õ( θ_D(ε_T) T^{((α+2)(m+1)−α)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), where ε_T = T^{−α/((α+2)(m+1))}. The proof of this result is in many ways similar to that given above for the realizable case, and is included in the supplemental materials.
With δ_i = 2^{−i} in ACAL, the algorithm achieves an expected number of mistakes M̄_T and an expected number of queries Q̄_T such that, for ε_T = T^{−α/((α+2)(m+1))}, M̄_T − M̄*_T = Õ( T^{((α+2)m+1)/((α+2)(m+1))} ) and Q̄_T = Õ( θ_D(ε_T) · T^{((α+2)(m+1)−α)/((α+2)(m+1))} ).
Just as in the realizable case, we can also state a minimax lower bound for this noisy setting.
6 Discussion
Querying before Predicting: One interesting alternative to the above framework is to allow the learner to make a label request before making its label prediction. From a theoretical perspective, analysis of this alternative framework essentially separates the mistakes due to over-confidence from the mistakes due to recognized uncertainty. Specifically, the natural modification of CAL produces a method that (in the realizable case) makes the same number of label requests as before, except that now it makes zero mistakes, since CAL requests a label whenever there is any uncertainty about it. In particular, because the version space is only guaranteed to contain the best classifier with high confidence, there is still a small probability of making a prediction that disagrees with the best classifier h* on each round in which we do not request a label. So controlling the number of mistakes in this setting comes down to controlling the probability of removing h* from the version space. However, this confidence parameter appears in the analysis of the number of queries, so that we have a natural trade-off between the number of mistakes and the number of label requests.
In particular, under Assumptions 2 and 5, this procedure achieves an expected excess number of mistakes M̄_T − M̄*_T ≤ Σ_{i=1}^{⌊log(T)⌋} δ_i 2^i, and an expected number of queries Q̄_T = Õ( θ_D(ε_T) · T^{((α+2)(m+1)−α)/((α+2)(m+1))} ( log(1/δ_{⌊log(T)⌋}) + Σ_{i=0}^{⌊log(T)⌋} δ_i 2^i ) ), where ε_T = T^{−α/((α+2)(m+1))}. In particular, given any nondecreasing sequence M_T, we can set the δ_i sequence to maintain M̄_T − M̄*_T ≤ M_T for all T.
Open Problems: What is not implied by the results above is any sort of trade-off between the number of mistakes and the number of queries. In the batch setting, the analogous question is the trade-off between the number of label requests and the number of unlabeled examples needed. In the realizable case, that trade-off is tightly characterized by Dasgupta's splitting index analysis [Das05]. In the batch setting, in which unlabeled examples are considered free and performance is measured only as a function of the number of label requests, [BHV10] have found that there is an important distinction between the verifiable label complexity and the unverifiable label complexity. In particular, while the former is sometimes no better than passive learning, the latter can always provide improvements for VC classes. Is there a method for every VC class that achieves O(log(T)) mistakes and o(T) queries in the realizable case?
wordName wordTfidf (topN-words)
[('mt', 0.434), ('qt', 0.368), ('diamt', 0.348), ('zt', 0.249), ('acal', 0.238), ('cal', 0.222), ('mistakes', 0.209), ('yt', 0.199), ('xt', 0.155), ('realizable', 0.145), ('diamk', 0.128), ('drifting', 0.118), ('requests', 0.113), ('ert', 0.113), ('queries', 0.106), ('request', 0.105), ('lt', 0.104), ('active', 0.097), ('rt', 0.087), ('dt', 0.085), ('label', 0.085), ('excess', 0.081), ('passive', 0.075), ('mistake', 0.072), ('dis', 0.07), ('ht', 0.069), ('er', 0.066), ('totally', 0.066), ('drift', 0.058), ('earn', 0.05), ('tsybakov', 0.05), ('assumption', 0.05), ('pk', 0.049), ('expected', 0.048), ('log', 0.047), ('concept', 0.045), ('minimax', 0.043), ('sublinear', 0.039), ('agnostic', 0.037), ('bdt', 0.037), ('diampk', 0.037), ('erqt', 0.037), ('scenarios', 0.036), ('vk', 0.036), ('ti', 0.035), ('achieves', 0.035), ('fixed', 0.034), ('xm', 0.034), ('sequence', 0.033), ('unveri', 0.032), ('distribution', 0.03), ('graph', 0.03), ('wortman', 0.03), ('unobservable', 0.03), ('quantities', 0.029), ('mansour', 0.029), ('dasgupta', 0.029), ('theorem', 0.029), ('essentially', 0.029), ('classic', 0.027), ('vertex', 0.027), ('supplemental', 0.027), ('satisfying', 0.027), ('separators', 0.026), ('vc', 0.026), ('bounds', 0.026), ('distributions', 0.025), ('letting', 0.025), ('nonincreasing', 0.025), ('referred', 0.025), ('vertices', 0.024), ('bound', 0.024), ('benign', 0.023), ('disagreement', 0.023), ('subroutine', 0.023), ('slowly', 0.023), ('predict', 0.023), ('satis', 0.023), ('ut', 0.023), ('ln', 0.022), ('target', 0.022), ('interested', 0.022), ('additionally', 0.022), ('enumerate', 0.022), ('mohri', 0.022), ('conditions', 0.022), ('minimal', 0.021), ('let', 0.021), ('inf', 0.021), ('adaptation', 0.021), ('noise', 0.021), ('pac', 0.021), ('ties', 0.02), ('bp', 0.019), ('tightly', 0.019), ('corollary', 0.019), ('colt', 0.019), ('labeling', 0.019), ('achievable', 0.018), ('sup', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1
2 0.18521444 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.16648848 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
4 0.15596387 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
5 0.14690927 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
6 0.10062665 220 nips-2011-Prediction strategies without loss
7 0.096871249 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
8 0.094500072 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.085478686 28 nips-2011-Agnostic Selective Classification
10 0.08221522 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
11 0.080979511 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
12 0.077522449 145 nips-2011-Learning Eigenvectors for Free
13 0.075208575 22 nips-2011-Active Ranking using Pairwise Comparisons
14 0.07272727 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
15 0.069286279 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
16 0.06753087 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
17 0.060439434 61 nips-2011-Contextual Gaussian Process Bandit Optimization
18 0.057033785 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
19 0.053982291 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
20 0.053299893 153 nips-2011-Learning large-margin halfspaces with more malicious noise
topicId topicWeight
[(0, 0.163), (1, -0.139), (2, -0.014), (3, -0.066), (4, 0.115), (5, -0.005), (6, 0.034), (7, -0.123), (8, -0.022), (9, -0.085), (10, 0.129), (11, 0.022), (12, 0.163), (13, 0.09), (14, 0.021), (15, -0.021), (16, 0.078), (17, 0.035), (18, 0.064), (19, -0.019), (20, -0.004), (21, 0.027), (22, 0.056), (23, -0.044), (24, 0.034), (25, 0.057), (26, 0.065), (27, 0.056), (28, 0.003), (29, -0.043), (30, 0.063), (31, 0.045), (32, 0.024), (33, 0.021), (34, -0.009), (35, -0.074), (36, -0.106), (37, 0.095), (38, 0.033), (39, 0.023), (40, 0.058), (41, 0.095), (42, -0.011), (43, 0.051), (44, 0.026), (45, -0.089), (46, 0.077), (47, -0.009), (48, -0.145), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.94269139 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem. 1
2 0.75677729 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.7003544 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
4 0.68821818 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
5 0.67807281 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
Author: Zuoguan Wang, Gerwin Schalk, Qiang Ji
Abstract: Brain-computer interfaces (BCIs) use brain signals to convey a user’s intent. Some BCI approaches begin by decoding kinematic parameters of movements from brain signals, and then proceed to using these signals, in absence of movements, to allow a user to control an output. Recent results have shown that electrocorticographic (ECoG) recordings from the surface of the brain in humans can give information about kinematic parameters (e.g., hand velocity or finger flexion). The decoding approaches in these demonstrations usually employed classical classification/regression algorithms that derive a linear mapping between brain signals and outputs. However, they typically only incorporate little prior information about the target kinematic parameter. In this paper, we show that different types of anatomical constraints that govern finger flexion can be exploited in this context. Specifically, we incorporate these constraints in the construction, structure, and the probabilistic functions of a switched non-parametric dynamic system (SNDS) model. We then apply the resulting SNDS decoder to infer the flexion of individual fingers from the same ECoG dataset used in a recent study. Our results show that the application of the proposed model, which incorporates anatomical constraints, improves decoding performance compared to the results in the previous work. Thus, the results presented in this paper may ultimately lead to neurally controlled hand prostheses with full fine-grained finger articulation. 1
6 0.63512111 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
7 0.57645959 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
8 0.53690809 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
9 0.50905716 145 nips-2011-Learning Eigenvectors for Free
10 0.47096124 220 nips-2011-Prediction strategies without loss
11 0.4673183 225 nips-2011-Probabilistic amplitude and frequency demodulation
12 0.41826513 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
13 0.40130141 28 nips-2011-Agnostic Selective Classification
14 0.38101217 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
15 0.38003993 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
16 0.35792598 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
17 0.34551963 153 nips-2011-Learning large-margin halfspaces with more malicious noise
18 0.34101909 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
19 0.33598423 231 nips-2011-Randomized Algorithms for Comparison-based Search
20 0.33310705 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
topicId topicWeight
[(0, 0.034), (4, 0.066), (20, 0.035), (26, 0.066), (31, 0.113), (33, 0.025), (43, 0.053), (45, 0.13), (57, 0.023), (68, 0.234), (74, 0.056), (83, 0.033), (99, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.80834275 21 nips-2011-Active Learning with a Drifting Distribution
Author: Liu Yang
Abstract: We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.
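The disagreement-based strategy the paper analyzes (CAL) can be illustrated with a toy realizable-case sketch: threshold classifiers on [0, 1], a stream of points, and a label request only when classifiers still consistent with past labels disagree on the new point. This is an illustrative sketch of the CAL mechanism, not the paper's drifting-distribution analysis; the uniform stream and target threshold 0.5 are assumptions for the example.

```python
import random

def cal_stream(points, target=0.5):
    """CAL on threshold classifiers h_t(x) = 1[x >= t] over [0, 1].

    The version space of consistent thresholds is an interval (lo, hi].
    A label is requested only when the point falls strictly inside
    (lo, hi), i.e. when consistent classifiers disagree on it.
    """
    lo, hi = 0.0, 1.0          # thresholds consistent with labels so far
    mistakes, requests = 0, 0
    for x in points:
        true_label = 1 if x >= target else 0
        if lo < x < hi:        # disagreement region: request the label
            requests += 1
            pred = 1 if x >= (lo + hi) / 2 else 0  # any h in version space
            if pred != true_label:
                mistakes += 1
            if true_label == 1:
                hi = min(hi, x)   # target threshold must be <= x
            else:
                lo = max(lo, x)   # target threshold must be > x
        # else: all consistent classifiers agree; predict without a label
    return mistakes, requests

random.seed(1)
stream = [random.random() for _ in range(2000)]
mistakes, requests = cal_stream(stream)
```

The number of label requests grows far more slowly than the stream length, since each request shrinks the disagreement region — the kind of quantity the paper bounds, there under a drifting rather than fixed distribution.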
2 0.69560719 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
Author: Taiji Suzuki
Abstract: In this paper, we give a new generalization error bound for Multiple Kernel Learning (MKL) under a general class of regularizations. Our main target is dense-type regularizations, including ℓp-MKL, which imposes ℓp-mixed-norm regularization instead of ℓ1-mixed-norm regularization. According to recent numerical experiments, sparse regularization does not necessarily perform well compared with dense-type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. As a by-product of our general result, we show a fast learning rate for ℓp-MKL that is the tightest among existing bounds. We also show that our general learning rate achieves the minimax lower bound. Finally, we show that, when the complexities of the candidate reproducing kernel Hilbert spaces are inhomogeneous, dense-type regularization achieves a better learning rate than sparse ℓ1 regularization.
3 0.67336047 231 nips-2011-Randomized Algorithms for Comparison-based Search
Author: Dominique Tschopp, Suhas Diggavi, Payam Delgosha, Soheil Mohajer
Abstract: This paper addresses the problem of finding the nearest neighbor (or one of the R nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object more similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query while minimizing the number of questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: the combinatorial disorder D, which defines approximate triangle inequalities on ranks. We present a lower bound of Ω(D log(n/D) + D²) on the average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst-case behavior. We develop a randomized scheme for NN retrieval in O(D³ log² n + D³ log² n log log n^{D³}) questions. The learning phase requires asking O(nD³ log² n + D³ log² n log log n^{D³}) questions and O(n log² n / log(2D)) bits of storage.
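The comparison-oracle model itself is easy to make concrete. The sketch below shows the naive baseline the paper improves on: a tournament that finds the NN with exactly n − 1 oracle calls, never seeing a distance directly. This is only an illustration of the access model on made-up 1-D data, not the paper's sublinear scheme (which exploits small combinatorial disorder D).

```python
def make_oracle(q):
    """Comparison oracle for 1-D objects: given references a and b,
    return the one closer to the query q. Only the oracle sees
    actual distances; the search algorithm never does."""
    calls = {"n": 0}
    def oracle(a, b):
        calls["n"] += 1
        return a if abs(a - q) <= abs(b - q) else b
    return oracle, calls

def nn_tournament(db, oracle):
    """Naive NN search: a linear tournament over the database using
    n - 1 oracle calls. The paper's randomized scheme answers the same
    query in far fewer questions when the disorder constant D is small."""
    best = db[0]
    for x in db[1:]:
        best = oracle(best, x)
    return best

db = [0.0, 0.2, 0.35, 0.6, 0.9]
oracle, calls = make_oracle(q=0.33)
nn = nn_tournament(db, oracle)  # -> 0.35, after len(db) - 1 calls
```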
4 0.66811109 22 nips-2011-Active Ranking using Pairwise Comparisons
Author: Kevin G. Jamieson, Robert Nowak
Abstract: This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of n objects can be identified by standard sorting methods using n log₂ n pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a d-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in R^d. We show that under this assumption the number of possible rankings grows like n^{2d} and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than d log n adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.
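The n log₂ n baseline in this abstract — recovering the full ranking by a standard sort driven only by pairwise comparisons — can be sketched by wrapping a counting comparison oracle around Python's sort. This illustrates only the baseline on toy 1-D data; the reference point, objects, and counting wrapper are assumptions for the example, and the paper's contribution is the adaptive scheme needing roughly d log n comparisons instead.

```python
import functools

def rank_with_oracle(objects, ref):
    """Rank objects by distance to a hidden reference point `ref`,
    using only pairwise comparisons ('which of a, b is closer?')
    and counting how many comparisons the sort issues."""
    count = {"n": 0}
    def cmp(a, b):
        count["n"] += 1
        da, db = abs(a - ref), abs(b - ref)
        return -1 if da < db else (1 if da > db else 0)
    ranking = sorted(objects, key=functools.cmp_to_key(cmp))
    return ranking, count["n"]

objs = [5, 1, 9, 3, 7, 2, 8]
ranking, comparisons = rank_with_oracle(objs, ref=4)
# ranking orders objects by distance to 4; ties keep input order
# (Python's sort is stable), so: [5, 3, 2, 1, 7, 8, 9]
```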
5 0.664419 229 nips-2011-Query-Aware MCMC
Author: Michael L. Wick, Andrew McCallum
Abstract: Traditional approaches to probabilistic inference, such as loopy belief propagation and Gibbs sampling, typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly, there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models.
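The core idea — spend sampling effort near the query variable rather than uniformly over the model — can be sketched on a tiny binary chain. This is an illustrative toy, not the paper's algorithm: the paper weights variables by generalized mutual information, whereas the sketch below simply restricts post-burn-in Gibbs sweeps to the query variable and its Markov blanket; the chain model, coupling strength, and schedule are all assumptions.

```python
import math, random

def gibbs_query_marginal(n_vars=10, query=9, theta=0.5, iters=20000, seed=0):
    """Gibbs sampling on a binary chain x_0 - x_1 - ... - x_{n-1} with
    attractive pairwise potentials exp(theta * 1[x_i == x_j]).

    Query-aware twist (illustrative): after a short burn-in of full
    sweeps, sweeps touch only the query variable and its Markov
    blanket, since only the query marginal P(x_query = 1) is reported.
    """
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n_vars)]

    def resample(i):
        # conditional of x_i given its chain neighbors
        score = [0.0, 0.0]
        for j in (i - 1, i + 1):
            if 0 <= j < n_vars:
                for v in (0, 1):
                    score[v] += theta * (1 if v == x[j] else 0)
        p1 = math.exp(score[1]) / (math.exp(score[0]) + math.exp(score[1]))
        x[i] = 1 if rng.random() < p1 else 0

    focus = [j for j in (query - 1, query, query + 1) if 0 <= j < n_vars]
    burn, ones = iters // 10, 0
    for t in range(iters):
        if t < burn:
            for i in range(n_vars):   # burn-in: full sweeps
                resample(i)
        else:
            for i in focus:           # query-aware: blanket-only sweeps
                resample(i)
            ones += x[query]
    return ones / (iters - burn)

p_hat = gibbs_query_marginal()
```

The flip-symmetric potentials put the true query marginal near 0.5, and the focused sampler estimates it while touching only 3 of the 10 variables per step after burn-in — the efficiency gain the abstract motivates.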
6 0.65904897 283 nips-2011-The Fixed Points of Off-Policy TD
7 0.65895683 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.65740168 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
9 0.65699613 197 nips-2011-On Tracking The Partition Function
10 0.65615934 180 nips-2011-Multiple Instance Filtering
11 0.65447873 64 nips-2011-Convergent Bounds on the Euclidean Distance
12 0.65342474 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
13 0.65333974 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
14 0.6522795 303 nips-2011-Video Annotation and Tracking with Active Learning
15 0.65204668 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
16 0.65104145 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
17 0.64942652 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
18 0.64920473 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
19 0.64768428 156 nips-2011-Learning to Learn with Compound HD Models
20 0.64750028 258 nips-2011-Sparse Bayesian Multi-Task Learning