jmlr jmlr2012 jmlr2012-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
Reference: text
sentIndex sentText sentNum sentScore
1 We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. [sent-10, score-0.647]
2 On round t of the game, the adversary presents the learner with an instance x_t ∈ R^d and the learner responds by predicting a binary label ŷ_t ∈ {−1, +1}. [sent-18, score-0.672]
3 The learner must now decide whether or not to pay a unit cost and query the teacher for the correct binary label y_t ∈ {−1, +1}. [sent-20, score-0.789]
4 However, we can also convert our algorithm into an efficient statistical active learning algorithm, which receives a sample of instances from some unknown distribution, queries the teacher for a subset of the labels, and outputs a hypothesis with a small risk. [sent-45, score-0.615]
5 On round t, some of the teachers may be experts on xt while others may not be. [sent-51, score-0.828]
6 A teacher who is an expert on xt is likely to provide the correct label, while a teacher who isn’t may give the wrong label. [sent-52, score-1.097]
7 To make this setting as realistic as possible, we assume that the areas of expertise and the overall competence levels of the different teachers are unknown to the learner, and any characterization of a teacher must be inferred from the observed labels. [sent-53, score-0.995]
8 On round t, the learner receives the instance x_t from the adversary and makes the binary prediction ŷ_t. [sent-54, score-0.558]
9 Then, the learner has the option to query any subset of teachers: each teacher charges a unit cost per query and provides a binary label, without any indication of his confidence and without the option of abstaining. [sent-55, score-0.738]
10 The labels received from the queried teachers may disagree, and the learner has no a-priori way of knowing which teacher to trust. [sent-56, score-1.094]
11 Some teachers may be better than others across the board and some teachers may be experts on specific topics, such as sports or politics. [sent-62, score-1.071]
12 Some teachers may know the right answer, while others may think they know the right answers but in fact do not—for this reason we do not rely on the teachers themselves to reveal their expertise regions. [sent-63, score-1.112]
13 The learner has no a-priori knowledge of which teacher to query for the label; yet, in our analysis we would like to compare the learner’s prediction to the label given by the expert teacher. [sent-67, score-0.681]
14 Our first multiple-teacher algorithm has the property that, on each round, it either queries all of the teachers or does not query any teacher. [sent-72, score-0.751]
15 Our second algorithm is more sophisticated and queries only those teachers it believes to be experts on xt . [sent-73, score-0.911]
16 We can also distinguish between algorithms that rely on repeated labeling (where multiple teachers label each example), versus techniques that assume that each example is labeled only once. [sent-138, score-0.588]
17 At the opposite end of the spectrum, Dekel and Shamir (2009a) identify low-quality teachers and labels without any repeated labeling. [sent-146, score-0.565]
18 The technique presented in this paper falls in the latter category, since we actively determine which subset of teachers to query on each online round. [sent-147, score-0.669]
19 Still, we do compare to some majority vote of teachers in both our analysis and our experiments. [sent-149, score-0.526]
20 Next, we distinguish papers that consider the overall quality score of each teacher (over the entire input space) from papers that assume that each teacher has a specific area of expertise. [sent-150, score-0.858]
21 The proactive learning setting (Donmez, 2010; Yang and Carbonell, 2009a,b) assumes that the learner has access to teachers of different global quality, with associated costs per label. [sent-163, score-0.648]
22 Preliminaries and Notation: As mentioned above, on round t of the online selective sampling game, the learner receives an instance x_t ∈ R^d, predicts a binary label ŷ_t ∈ {−1, +1}, and chooses whether or not to query the correct label y_t ∈ {−1, +1}. [sent-187, score-0.99]
23 The only assumption we make on the process that generates x_t is that ‖x_t‖ ≤ 1; for all we know, instances may be generated by an adaptive adversary (an adversary that reacts to our previous actions). [sent-189, score-0.631]
24 The algorithm maintains a sequence of weight vectors, where each w_t ∈ R^d, and predicts ŷ_t = sign(Δ̂_t), where Δ̂_t = w_{t-1}^⊤ x_t. [sent-201, score-0.568]
25 After receiving x_t and predicting ŷ_t = sign(Δ̂_t), the algorithm computes an adaptive data-dependent threshold θ_t, defined as θ_t² = x_t^⊤ A_{t-1}^{-1} x_t (1 + 4 ∑_{i=1}^{t-1} Z_i r_i + 36 log(t/δ)), where r_i = x_i^⊤ A_i^{-1} x_i. [sent-229, score-0.722]
26 The quantity x_t^⊤ A_{t-1}^{-1} x_t depends on the relationship between the current instance x_t and the previous instances on rounds where a query was issued. [sent-252, score-0.696]
27 On the other hand, if the adversary chooses x_t outside of the subspace spanned by the previous examples, then the term x_t^⊤ A_{t-1}^{-1} x_t causes θ_t to be large, and the algorithm becomes more likely to issue a query. [sent-255, score-0.567]
28 Recall that a small value of |Δ_t| implies that the teacher guesses the label y_t almost at random. [sent-258, score-0.581]
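To make the single-teacher protocol concrete, here is a minimal Python sketch of one round. It is an illustration rather than the paper's exact pseudocode: the query rule |Δ̂_t| ≤ θ_t and the RLS-style update A_t w_t = A_{t-1} w_{t-1} + y_t x_t are inferred from the surrounding text and the proof excerpts later in this section, and `get_label` is a hypothetical oracle callback.

```python
import numpy as np

def selective_sampling_round(A, w, past_r, t, x, get_label, delta=0.05):
    """One round of a second-order selective sampler (sketch).

    A      : correlation matrix A_{t-1} of previously queried instances (starts at I)
    w      : current weight vector w_{t-1}
    past_r : list of r_i values from rounds where a query was issued (the Z_i r_i terms)
    """
    A_inv = np.linalg.inv(A)                 # explicit inverse for clarity; rank-one updates are cheaper
    margin = float(w @ x)                    # hat-Delta_t = w_{t-1}^T x_t
    y_hat = 1 if margin >= 0 else -1         # predicted label sign(hat-Delta_t)
    q = float(x @ A_inv @ x)                 # x_t^T A_{t-1}^{-1} x_t
    theta_sq = q * (1 + 4 * sum(past_r) + 36 * np.log(t / delta))
    if margin ** 2 <= theta_sq:              # assumed query rule: |hat-Delta_t| <= theta_t
        y = get_label(x)                     # pay a unit cost and observe y_t in {-1, +1}
        A_new = A + np.outer(x, x)           # A_t = A_{t-1} + x_t x_t^T
        w = np.linalg.solve(A_new, A @ w + y * x)           # assumed update A_t w_t = A_{t-1} w_{t-1} + y_t x_t
        past_r.append(float(x @ np.linalg.inv(A_new) @ x))  # r_t = x_t^T A_t^{-1} x_t
        A = A_new
    return A, w, y_hat
```

A driver would call this once per round with t starting at 1 and `past_r` shared across rounds, so that only queried rounds contribute to A, w, and the ∑ Z_i r_i term, in line with the selective protocol.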
29 the learner receives an input x_t ∈ R^d, with ‖x_t‖ ≤ 1, and outputs a binary prediction ŷ_t. [sent-438, score-0.756]
30 The expertise area of each teacher is unknown to the algorithm, and can only be inferred indirectly from the binary labels provided by that teacher and by other teachers. [sent-440, score-0.917]
31 If xt falls within the expertise region of teacher j, then that teacher can provide an accurate label. [sent-441, score-1.14]
32 If teacher j is queried on round t, he stochastically generates the binary label y_{j,t} according to P_t(y_{j,t} = 1 | x_t) = (1 + Δ_{j,t})/2, where Δ_{j,t} = u_j^⊤ x_t and, as in Section 2, x_t can be chosen adversarially depending on previous x's and y_j's. [sent-453, score-1.034]
33 We consider |Δ_{j,t}| to be the (hidden) confidence of teacher j in his label for x_t. [sent-454, score-0.718]
34 If x_t is almost orthogonal to u_j, then teacher j has a very low confidence in his label, and we say that x_t lies outside the expertise region of teacher j. [sent-456, score-1.402]
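A short simulation of this label model can make the expertise notion concrete. This is a toy sketch under the stated assumptions ‖u_j‖ ≤ 1 and ‖x_t‖ ≤ 1 (the clipping is only a numerical safeguard); it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_label(u_j, x_t):
    """Sample y_{j,t} with P(y_{j,t} = 1 | x_t) = (1 + u_j^T x_t) / 2;
    |u_j^T x_t| plays the role of teacher j's hidden confidence."""
    delta_jt = float(np.clip(u_j @ x_t, -1.0, 1.0))
    return 1 if rng.random() < (1.0 + delta_jt) / 2.0 else -1

# a teacher whose expertise is concentrated along the first coordinate
u = np.array([1.0, 0.0])
print(teacher_label(u, np.array([0.9, 0.1])))  # inside the expertise region: mostly +1
print(teacher_label(u, np.array([0.0, 1.0])))  # orthogonal to u: essentially a coin flip
```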
35 It is no longer clear how we should evaluate the performance of the learner, since the K teachers will often give inconsistent labels on the given xt , and we do not have a well-defined ground-truth to compare against. [sent-457, score-0.827]
36 Intuitively, we would like the learner to predict the label of xt as accurately as the teachers who are experts on xt . [sent-458, score-1.203]
37 To formalize this intuition, define the average margin of a generic subset of teachers C ⊆ [K] as Δ_{C,t} = (1/|C|) ∑_{i∈C} Δ_{i,t}. [sent-459, score-0.54]
38 In words, j_t^⋆ is the most confident teacher at time t, and C_t is the set of confident teachers at time t. [sent-462, score-0.935]
39 Note that an equivalent way of generating y_t is to pick a confident teacher j uniformly at random from C_t and to set y_t = y_{j,t}. [sent-473, score-0.659]
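A small helper shows how the set of confident teachers and their average margin might be computed. The rule used here, keeping every teacher whose confidence is within a factor (1 − τ) of the most confident one, is an assumption: the excerpt only describes C_t in words, and this choice is inferred from the later discussion of τ (τ near 0 keeps only the top teacher, τ near 1 keeps everyone).

```python
import numpy as np

def confident_set(margins, tau):
    """Return (j_star, C_t, Delta_{C_t,t}) from per-teacher margins Delta_{j,t}."""
    m = np.asarray(margins, dtype=float)
    j_star = int(np.argmax(np.abs(m)))                    # most confident teacher j_t^*
    keep = np.abs(m) >= (1.0 - tau) * np.abs(m[j_star])   # assumed membership rule for C_t
    C_t = np.flatnonzero(keep).tolist()
    return j_star, C_t, float(m[keep].mean())             # average margin of C_t

print(confident_set([0.8, 0.75, 0.1, -0.05], tau=0.2))    # -> (0, [0, 1], 0.775)
```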
40 a label-efficient expert algorithm by pretending that the missing ground-truth is provided by some function of the teachers we query. [sent-488, score-0.543]
41 In the first version, the algorithm queries either all of the teachers or none of them. [sent-493, score-0.63]
42 The second version is more refined in that the algorithm may query a different subset of teachers on each round. [sent-494, score-0.647]
43 The learner attempts to mimic the process of generating yt by choosing its own set of confident teachers on each round. [sent-499, score-0.738]
44 We present a simple criterion that either sets Zt = 1 and queries all of the teachers or sets Zt = 0 and queries none of them. [sent-519, score-0.734]
45 In words, Ĥ_t is the set of teachers with especially high confidence, while B̂_t is the set of teachers with borderline confidence. [sent-522, score-1.077]
46 Intuitively, the learner is unsure whether the teachers in B̂_t should or should not be included in Ĉ_t. [sent-523, score-0.613]
47 The first case is when there exists a subset of borderline teachers S ⊆ B̂_t that causes the predicted label to flip, namely, Δ̂_t Δ̂_{Ĥ_t∪S,t} < 0. [sent-525, score-0.598]
48 The second case is when there exists a subset of borderline teachers S ⊆ B̂_t that drives the aggregate confidence below the threshold, namely |Δ̂_{Ĥ_t∪S,t}| ≤ θ_t. [sent-526, score-0.551]
49 Algorithm 2 (Multiple Teacher Selective Sampler, first version): input a confidence level δ ∈ (0, 1] and a tolerance parameter τ ≥ 0; initialize A_0 = I and w_{j,0} = 0 for all j ∈ [K]; on each round t = 1, 2, ... receive x_t ∈ R^d with ‖x_t‖ ≤ 1, set θ_t² = x_t^⊤ A_{t-1}^{-1} x_t (1 + 4 ∑_{i=1}^{t-1} Z_i r_i + 36 log(Kt/δ)), compute Δ̂_{j,t} = w_{j,t-1}^⊤ x_t for all j ∈ [K] and ĵ_t = argmax_j |Δ̂_{j,t}|, predict ŷ_t = sgn(Δ̂_t) ∈ {−1, +1}, and set Z_t = 1 if ∃S ⊆ B̂_t such that Δ̂_t Δ̂_{S∪Ĥ_t,t} < 0 or |Δ̂_{S∪Ĥ_t,t}| ≤ θ_t, and Z_t = 0 otherwise; if Z_t = 1, query y_{1,t}, ..., y_{K,t}. [sent-529, score-1.173]
50 If Z_t = 1, also update A_t = A_{t-1} + x_t x_t^⊤ and r_t = x_t^⊤ A_t^{-1} x_t, and update each w_{j,t} for j = 1, ..., K. [sent-532, score-0.607]
51 If a query is issued, each weight vector w j,t is updated as in the single teacher case. [sent-537, score-0.53]
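The existential test over borderline subsets in the reconstructed pseudocode above can be checked by brute force when B̂_t is small. The sketch below only illustrates that test: it assumes the index sets Ĥ_t and B̂_t and the aggregate margin Δ̂_t are already available, since their exact definitions are not reproduced in this excerpt, and it makes no claim about how the paper evaluates the condition efficiently.

```python
import itertools
import numpy as np

def query_flag(margins_hat, H_t, B_t, delta_hat, theta_t):
    """Z_t for the first-version sampler: 1 if some subset S of the borderline set B_t
    either flips the prediction (hat-Delta_t * hat-Delta_{S u H_t,t} < 0) or drives the
    aggregate margin to |hat-Delta_{S u H_t,t}| <= theta_t; otherwise 0."""
    m = np.asarray(margins_hat, dtype=float)
    for k in range(len(B_t) + 1):
        for S in itertools.combinations(B_t, k):
            idx = list(H_t) + list(S)
            if not idx:
                continue
            agg = float(m[idx].mean())            # hat-Delta_{S u H_t, t}
            if delta_hat * agg < 0 or abs(agg) <= theta_t:
                return 1                           # query all K teachers this round
    return 0                                       # query none of them

# hypothetical sets: one borderline dissenter can pull the aggregate margin under theta_t
print(query_flag([0.7, 0.65, 0.1, -0.9], H_t=[0, 1], B_t=[3], delta_hat=0.7, theta_t=0.2))  # -> 1
```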
52 2 Analysis, First Version Our learning algorithm relies on labels it receives from a set of teachers, and therefore our bounds should naturally depend on the ability of those teachers to provide accurate labels for the sequence x1 , . [sent-544, score-0.642]
53 For example, if an input xt lies outside the expertise regions of all teachers, we cannot hope to learn anything from the labels provided by the teachers for this input. [sent-548, score-0.887]
54 Similarly, there is nothing we can do on rounds where the set of confident teachers is split between two equally confident but conflicting opinions. [sent-549, score-0.556]
55 However, it is interesting to note that even in a case where most teachers have low confidence in their prediction on any given round, T_ε can still be small provided that the experts in the field have a confident opinion. [sent-552, score-0.545]
56 A more subtle difficulty presents itself when the collective opinion expressed by the set of confident teachers changes qualitatively with a small perturbation of the input xt or one of the weight vectors u j . [sent-553, score-0.788]
57 In contrast, the set Bε,t is the set of teachers with borderline confidence: either teachers in Ct that would be excluded if their margin were smaller by ε, or teachers that are not in Ct that would be included if their margin were larger by ε. [sent-556, score-1.631]
58 We say that the average margin of the confident teachers is unstable with respect to τ and ε if |Δ_t| > ε but we can find a subset S ⊆ B_{ε,t} such that either Δ_t Δ_{S∪H_{ε,t},t} < 0 or |Δ_{S∪H_{ε,t},t}| < ε. [sent-557, score-0.54]
59 A straightforward stratification of Lemma 7 in Section 2 over the K teachers verifies that this condition holds with high probability. [sent-583, score-0.539]
60 Again, as done previously, a straightforward union bound over Lemma 7 in Section 2 applied to each of the K teachers verifies that this condition holds with high probability, which in turn concludes the proof of Theorem 17. [sent-617, score-0.555]
61 The goal of our experiments is to validate the theory and to quantify the effectiveness of our multiple teacher query selection mechanism in different multiple-teacher scenarios. [sent-624, score-0.545]
62 four different multiple-teacher scenarios, distinguished by the number of teachers involved (“few” or “many”) and the amount of overlap between expertise regions (“nonoverlapping” vs. “overlapping” teachers). [sent-644, score-0.586]
63 We simulated multiple teachers as follows: we grouped queries together in various ways (see below) and trained a linear classifier using half of the URLs associated with each query in the group. [sent-647, score-0.785]
64 We generated 5 teachers by partitioning the 2000 queries into 5 sets (the first teacher is defined by (half of) the first 400 queries, the second teacher by (half of) the second 400 queries, and so on). [sent-655, score-1.448]
65 Hence each teacher has acquired expertise in the subset of 400 queries seen during training. [sent-656, score-0.573]
66 We generated 5 teachers by defining 5 overlapping sets of queries. [sent-658, score-0.557]
67 We generated 100 teachers by partitioning the queries into 100 disjoint sets, each containing 20 queries. [sent-662, score-0.63]
68 The resulting teachers turned out to be quite unreliable; some of them gave labels that were not far from random guessing at test time. [sent-663, score-0.565]
69 All teachers share the first 100 queries, and the remaining 1900 queries are partitioned equally. [sent-666, score-0.63]
70 So, in a sense, the set of teachers we generated in all scenarios collectively encode the same amount of information. [sent-671, score-0.547]
71 That is, the data used for training the teachers are of the same size and query mixture across scenarios. [sent-672, score-0.647]
72 As expected, the best and worst teachers are farther apart in the nonoverlapping scenarios (correspondingly, the “stddev” figures are larger), with larger variability in the many-teacher settings. [sent-697, score-1.012]
73 After simulating the teachers, we implemented a simplified version of our second-version multiple-teacher algorithm (Algorithm 3), where the thresholds θ²_{j,t} are simplified to θ²_{j,t} = α x_t^⊤ A_{j,t-1}^{-1} x_t log(1 + t), (12) and α > 0 is a tunable parameter (independent of j and t). [sent-700, score-0.686]
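For reference, the simplified threshold in (12) is a one-liner. The sketch assumes each teacher j keeps its own correlation matrix A_{j,t-1}, updated only on the rounds where that teacher was queried, which is how we read the description of the second-version algorithm.

```python
import numpy as np

def simplified_threshold_sq(A_j_inv, x_t, t, alpha):
    """theta_{j,t}^2 = alpha * x_t^T A_{j,t-1}^{-1} x_t * log(1 + t)."""
    return float(alpha * (x_t @ A_j_inv @ x_t) * np.log(1.0 + t))

# example: at t = 100 with A_{j,t-1} = I and ||x_t|| = 1, the threshold is alpha * log(101)
print(simplified_threshold_sq(np.eye(3), np.array([1.0, 0.0, 0.0]), t=100, alpha=1.0))
```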
74 , Weighted Majority—see Littlestone and Warmuth, 1994; Cesa-Bianchi and Lugosi, 2006), where teachers are experts, and the algorithm has at its disposal both the true labels of the test set and the prediction of all teachers. [sent-710, score-0.565]
75 The associated number of queries made to the teachers is the largest possible, that is, the size of the test set (119507) times the number of teachers (119507×5 in the “few teacher” scenarios, and 119507×100 in the “many teacher” scenarios). [sent-713, score-1.156]
76 Like the BEST TEACHER baseline, this algorithm queries all of the teachers all the time. [sent-733, score-0.63]
77 Since θ²_{j,t} = ∞ implies Z_{j,t} = 1 and Ĉ_t = [K] for all t (thereby making τ immaterial), this algorithm predicts by aggregating all teachers via a margin-based majority and, as before, queries all labels from all teachers. [sent-737, score-0.565]
78 Notice, however, that this comparison unfairly penalizes our algorithm, in that the baselines achieve their results by asking all of the teachers all of the time. [sent-753, score-0.554]
79 [Figure: average test error rate (%) versus average per-teacher query rate (%) for our algorithm, shown in four panels: few nonoverlapping, few overlapping, many nonoverlapping, and many overlapping teacher scenarios.] [sent-774, score-0.935]
81 In any given scenario, the per-teacher query rate of our algorithm is the average fraction of labels requested from the teachers out of the total number of labels available in that scenario. [sent-803, score-0.725]
82 For instance, an average per-teacher query rate of 10% achieved in a few teacher scenario means that, averaged over the 10 repetitions, the total no. [sent-804, score-0.545]
83 of queries made to the five teachers was 119507×5×10% = 59753. [sent-805, score-0.63]
84 Notably, on our data set, many unreliable (but nonoverlapping) teachers, queried all the time and aggregated by a flat average (aka FLAT MAJORITY), are about as good in terms of accuracy as more sophisticated weighted averages. [sent-817, score-0.573]
85 Still, our experiments show that there is no need to query all of the teachers in order to achieve this accuracy. [sent-818, score-0.647]
86 • Aggregating opinions of teachers with a good amount of overlapping expertise (as in the two “overlapping teacher” settings) might be detrimental, as evinced by comparing the first row in Table 2 to the second one, and the third to the fourth one. [sent-819, score-0.617]
87 The value of τ does, however, play an important role in the “degree of aggregation” of teachers: when Z_t = 1, setting τ close to 0 makes the algorithm query only the (estimated) most confident teacher at time t, whereas setting τ close to 1 causes the algorithm to query all teachers. [sent-824, score-0.651]
88 1, the most queried teacher receives 5440 queries (out of 119507), and the least queried teacher receives 3319. [sent-827, score-1.028]
89 0 and the same value of α, the most queried teacher receives 10849 queries while the least queried teacher gets only 1050. [sent-829, score-1.008]
90 We then lifted the above to the more involved setting where multiple unreliable teachers are available. [sent-835, score-0.555]
91 On the rounds we do query, A_t = A_{t-1} + x_t x_t^⊤, and so by the matrix inversion formula A_t^{-1} = A_{t-1}^{-1} − (A_{t-1}^{-1} x_t x_t^⊤ A_{t-1}^{-1}) / (1 + x_t^⊤ A_{t-1}^{-1} x_t) we see that r_t = x_t^⊤ A_{t-1}^{-1} x_t − (x_t^⊤ A_{t-1}^{-1} x_t)² / (1 + x_t^⊤ A_{t-1}^{-1} x_t). [sent-904, score-1.423]
92 This automatically gives us that r_t ≤ x_t^⊤ A_{t-1}^{-1} x_t. [sent-905, score-0.607]
93 Further, since x_t^⊤ A_{t-1}^{-1} x_t ≤ 1, we can conclude that r_t ≥ (1/2) x_t^⊤ A_{t-1}^{-1} x_t. [sent-906, score-0.607]
94 Now to prove (ii), note that since whenever we query A_t = A_{t-1} + x_t x_t^⊤, using the identity x_t^⊤ (A_{t-1} + x_t x_t^⊤)^{-1} x_t = 1 − |A_{t-1}|/|A_t| and the fact that 1 − x ≤ log(1/x), we see that r_t ≤ log(|A_t|/|A_{t-1}|). [sent-908, score-0.902]
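These three bounds on r_t are easy to check numerically. The snippet below is only a sanity check of the reconstructed identities (it assumes A_{t-1} ⪰ I and ‖x_t‖ ≤ 1, as in the protocol), not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x = rng.normal(size=d)
x /= np.linalg.norm(x) + 1.0                      # ensure ||x|| <= 1
M = rng.normal(size=(8, d))
A_prev = np.eye(d) + M.T @ M                      # A_{t-1} >= I
A = A_prev + np.outer(x, x)                       # A_t after a queried round

q = float(x @ np.linalg.inv(A_prev) @ x)          # x_t^T A_{t-1}^{-1} x_t
r = float(x @ np.linalg.inv(A) @ x)               # r_t = x_t^T A_t^{-1} x_t

assert q <= 1.0                                   # holds since A_{t-1} >= I and ||x|| <= 1
assert np.isclose(r, q - q ** 2 / (1 + q))        # Sherman-Morrison form of r_t
assert r <= q and r >= 0.5 * q                    # r_t <= x^T A_{t-1}^{-1} x and r_t >= half of it
assert r <= np.log(np.linalg.det(A) / np.linalg.det(A_prev))  # r_t <= log(|A_t| / |A_{t-1}|)
print("r_t identities and bounds verified on a random instance")
```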
95 , we have (Z_t/2) [(y_t − w'^⊤_{t-1} x_t)² − (y_t − u^⊤ x_t)²] ≤ Z_t [d_{t-1}(u, w'_{t-1}) − d_t(u, w'_t)] + (Z_t/2) log(|A_t|/|A_{t-1}|). [sent-913, score-0.608]
96 Using the recursive definition A_t = A_{t-1} + x_t x_t^⊤ and rearranging terms in the right-hand side above, we get α_t = (1/2) [(w'_{t-1})^⊤ x_t x_t^⊤ w'_{t-1} − u^⊤ x_t x_t^⊤ u] + (u − w'_{t-1})^⊤ (A_t w'_t − A_{t-1} w'_{t-1}) = (1/2) [(w'^⊤_{t-1} x_t)² − (u^⊤ x_t)²] + (u − w'_{t-1})^⊤ (A_t w'_t − A_{t-1} w'_{t-1}). [sent-916, score-1.672]
97 By definition, A_t w'_t = A_{t-1} w'_{t-1} + y_t x_t. [sent-917, score-0.568]
98 Plugging this equality into the equation above gives α_t = (1/2) [(w'^⊤_{t-1} x_t)² − (u^⊤ x_t)²] + y_t (u − w'_{t-1})^⊤ x_t = (1/2) [(y_t − w'^⊤_{t-1} x_t)² − (y_t − u^⊤ x_t)²], thereby proving (i). [sent-918, score-1.173]
99 To prove (ii), we rewrite d_t(w'_{t-1}, w'_t) as (1/2) (w'_{t-1} − w'_t)^⊤ A_t (w'_{t-1} − w'_t) = (1/2) (A_t w'_{t-1} − A_t w'_t)^⊤ A_t^{-1} (A_t w'_{t-1} − A_t w'_t). [sent-919, score-0.956]
100 Using A_t w'_t = A_{t-1} w'_{t-1} + y_t x_t, the above becomes d_t(w'_{t-1}, w'_t) = (1/2) ((A_t − A_{t-1}) w'_{t-1} − y_t x_t)^⊤ A_t^{-1} ((A_t − A_{t-1}) w'_{t-1} − y_t x_t). [sent-920, score-1.806]
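The two algebraic identities reconstructed above, (i) and the rewriting of d_t(w'_{t-1}, w'_t), can be sanity-checked numerically. The update A_t w'_t = A_{t-1} w'_{t-1} + y_t x_t used below is itself part of the reconstruction, so this sketch should be read as a consistency check of the reconstructed formulas rather than of the paper's original derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
A_prev = np.eye(d) + M @ M.T                      # generic PD matrix standing in for A_{t-1}
x_t = rng.normal(size=d)
y_t = -1.0
w_prev = rng.normal(size=d)                       # stands in for w'_{t-1}
u = rng.normal(size=d)                            # comparison vector

A = A_prev + np.outer(x_t, x_t)
w_next = np.linalg.solve(A, A_prev @ w_prev + y_t * x_t)   # A_t w'_t = A_{t-1} w'_{t-1} + y_t x_t

# identity (i): alpha_t equals half the squared-loss difference
lhs = 0.5 * ((w_prev @ x_t) ** 2 - (u @ x_t) ** 2) + y_t * (u - w_prev) @ x_t
rhs = 0.5 * ((y_t - w_prev @ x_t) ** 2 - (y_t - u @ x_t) ** 2)
assert np.isclose(lhs, rhs)

# identity (ii): the rewriting of d_t(w'_{t-1}, w'_t)
d_t = 0.5 * (w_prev - w_next) @ A @ (w_prev - w_next)
v = (A - A_prev) @ w_prev - y_t * x_t
assert np.isclose(d_t, 0.5 * v @ np.linalg.inv(A) @ v)
print("identities (i) and (ii) hold on a random instance")
```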
wordName wordTfidf (topN-words)
[('teachers', 0.526), ('teacher', 0.409), ('zt', 0.271), ('ct', 0.267), ('xt', 0.262), ('ht', 0.203), ('wt', 0.181), ('bt', 0.135), ('yt', 0.125), ('query', 0.121), ('jt', 0.107), ('eachers', 0.106), ('entile', 0.106), ('ridharan', 0.106), ('queries', 0.104), ('selective', 0.094), ('elective', 0.091), ('ingle', 0.091), ('ekel', 0.091), ('learner', 0.087), ('rt', 0.083), ('pt', 0.082), ('nt', 0.075), ('qt', 0.075), ('ultiple', 0.07), ('ampling', 0.07), ('kt', 0.063), ('active', 0.061), ('expertise', 0.06), ('dent', 0.057), ('regret', 0.057), ('nonoverlapping', 0.056), ('dt', 0.051), ('label', 0.047), ('adversary', 0.043), ('labels', 0.039), ('zi', 0.039), ('dence', 0.037), ('proactive', 0.035), ('strehl', 0.035), ('earning', 0.035), ('ut', 0.034), ('queried', 0.033), ('log', 0.033), ('overlapping', 0.031), ('logt', 0.03), ('rounds', 0.03), ('sampler', 0.03), ('issued', 0.029), ('baselines', 0.028), ('con', 0.028), ('lemma', 0.028), ('balcan', 0.027), ('orabona', 0.027), ('donmez', 0.026), ('azoury', 0.025), ('borderline', 0.025), ('lat', 0.025), ('gentile', 0.025), ('littman', 0.025), ('hanneke', 0.023), ('cavallanti', 0.023), ('online', 0.022), ('sgn', 0.022), ('info', 0.022), ('beygelzimer', 0.021), ('carbonell', 0.021), ('instances', 0.021), ('scenarios', 0.021), ('round', 0.021), ('papers', 0.02), ('dasgupta', 0.02), ('ri', 0.02), ('ajority', 0.02), ('receives', 0.02), ('urls', 0.019), ('sampling', 0.019), ('experts', 0.019), ('dekel', 0.018), ('bounds', 0.018), ('category', 0.018), ('expert', 0.017), ('pseudocode', 0.017), ('inf', 0.017), ('lemmas', 0.016), ('bound', 0.016), ('flat', 0.016), ('ull', 0.016), ('st', 0.015), ('eacher', 0.015), ('kqt', 0.015), ('telescoping', 0.015), ('multiple', 0.015), ('scenario', 0.015), ('unreliable', 0.014), ('icting', 0.014), ('engine', 0.014), ('margin', 0.014), ('argmax', 0.014), ('holds', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000017 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
2 0.20258798 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.14359429 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
4 0.13053641 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
5 0.12392063 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
6 0.11389235 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
7 0.10585445 84 jmlr-2012-Online Submodular Minimization
8 0.10345326 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
9 0.10286146 13 jmlr-2012-Active Learning via Perfect Selective Classification
10 0.077407889 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
11 0.074457951 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
12 0.072590157 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
13 0.069210134 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
14 0.068244167 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
15 0.067196116 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
16 0.065113582 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
17 0.064188249 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
18 0.062220734 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
19 0.055071395 12 jmlr-2012-Active Clustering of Biological Sequences
20 0.052776765 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
topicId topicWeight
[(0, -0.241), (1, -0.367), (2, -0.101), (3, -0.05), (4, 0.013), (5, 0.039), (6, 0.174), (7, 0.05), (8, 0.035), (9, 0.069), (10, -0.061), (11, 0.07), (12, -0.037), (13, -0.088), (14, -0.031), (15, 0.131), (16, -0.119), (17, 0.037), (18, 0.065), (19, 0.004), (20, -0.018), (21, -0.126), (22, 0.054), (23, -0.039), (24, -0.082), (25, 0.042), (26, -0.031), (27, 0.021), (28, 0.073), (29, 0.15), (30, -0.041), (31, -0.042), (32, 0.055), (33, -0.002), (34, -0.062), (35, -0.011), (36, 0.125), (37, 0.046), (38, 0.078), (39, 0.003), (40, -0.007), (41, -0.092), (42, -0.058), (43, -0.091), (44, 0.062), (45, 0.092), (46, 0.049), (47, -0.067), (48, 0.033), (49, -0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.95902675 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
2 0.56896889 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
3 0.53979391 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.46436715 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
5 0.45465705 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER
6 0.43282676 84 jmlr-2012-Online Submodular Minimization
7 0.42740318 13 jmlr-2012-Active Learning via Perfect Selective Classification
8 0.39414939 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
9 0.37799463 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
10 0.37606794 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
11 0.346255 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
12 0.34584898 97 jmlr-2012-Regularization Techniques for Learning with Matrices
13 0.34259009 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
14 0.31786591 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
15 0.29182994 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
16 0.28019524 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
17 0.27892026 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
18 0.27760524 12 jmlr-2012-Active Clustering of Biological Sequences
19 0.26938331 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
20 0.26721382 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
topicId topicWeight
[(7, 0.029), (21, 0.035), (26, 0.041), (27, 0.01), (29, 0.034), (35, 0.015), (49, 0.014), (56, 0.017), (57, 0.017), (64, 0.023), (69, 0.012), (75, 0.163), (77, 0.019), (79, 0.05), (92, 0.064), (95, 0.279), (96, 0.092)]
simIndex simValue paperId paperTitle
1 0.80820078 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
Author: Vikas C. Raykar, Shipeng Yu
Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, defined as annotators who assign labels randomly without actually looking at the instance. Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the final consensus labels. In this paper we propose an empirical Bayesian algorithm called SpEM that iteratively eliminates the spammers and estimates the consensus labels based only on the good annotators. The algorithm is motivated by defining a spammer score that can be used to rank the annotators. Experiments on simulated and real data show that the proposed approach is better than (or as good as) the earlier approaches in terms of the accuracy and uses a significantly smaller number of annotators. Keywords: crowdsourcing, multiple annotators, ranking annotators, spammers
same-paper 2 0.73632807 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan
Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing
3 0.59478426 44 jmlr-2012-Feature Selection via Dependence Maximization
Author: Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt
Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice. Keywords: kernel methods, feature selection, independence measure, Hilbert-Schmidt independence criterion, Hilbert space embedding of distribution
4 0.58671349 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
Author: Michael Brückner, Christian Kanzow, Tobias Scheffer
Abstract: The standard assumption of identically distributed training and test data is violated when the test data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash logistic regression and the Nash support vector machine, and empirically explore their properties in a case study on email spam filtering. Keywords: static prediction games, adversarial classification, Nash equilibrium
5 0.58273774 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
6 0.57895976 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
7 0.54621953 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
8 0.53015631 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
9 0.51545 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
10 0.50764418 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
11 0.50714153 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
12 0.50532532 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
13 0.50460142 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
14 0.50172442 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
15 0.50170785 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
16 0.50049412 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
17 0.49629548 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
18 0.49500772 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
19 0.49351472 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
20 0.49351454 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting