nips nips2010 nips2010-182 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework for online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows us to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing us to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second-order information. We analyze the properties of the algorithm and illustrate its performance using a synthetic dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 This framework allows us to design and prove relative mistake bounds for any generic loss function. [sent-7, score-0.279]
2 The mistake bounds can be specialized for the hinge loss, allowing us to recover and improve the bounds of known online classification algorithms. [sent-8, score-0.407]
3 By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second-order information. [sent-9, score-0.198]
4 1 Introduction Linear discriminative online algorithms have been shown to perform very well on binary and multiclass labeling problems [10, 6, 14, 3]. [sent-11, score-0.168]
5 Often, such an update takes place only on rounds where the online algorithm makes a prediction mistake or when the confidence in the prediction is not sufficient. [sent-14, score-0.404]
6 The aim of the classifier is to minimize the cumulative loss it suffers due to its prediction, such as the total number of mistakes. [sent-15, score-0.153]
7 Recently [1, 8, 4, 12, 5, 9], researchers proposed to improve online learning algorithms by incorporating second order information. [sent-17, score-0.147]
8 For carefully designed algorithms, it is possible to bound the cumulative loss on any sequence of samples, even adversarially chosen [2]. [sent-28, score-0.234]
9 In particular, many of the recent analyses are based on the online convex optimization framework, that focuses on minimizing the sum of convex functions. [sent-29, score-0.188]
10 Two common viewpoints for online convex optimization are regularization [15] and primal-dual progress [16, 17, 13]. [sent-30, score-0.166]
11 We also show how the use of widely used classification losses, such as the hinge loss, allows us to derive new powerful mistake bounds superior to existing bounds. [sent-34, score-0.239]
12 Moreover, the introduced framework supports the design of aggressive algorithms, i.e., algorithms that update their hypothesis even on rounds where the prediction was correct but the loss is nonzero. [sent-35, score-0.163]
13 All these algorithms maintain the cumulative second-moment of the input features, and its inverse, qualitatively speaking, is used as a learning rate. [sent-39, score-0.137]
14 We derive a bound for this algorithm and illustrate its properties using synthetic data simulations. [sent-46, score-0.14]
15 2 Online Learning for Classification We work in the online binary classification scenario where learning algorithms work in rounds. [sent-47, score-0.147]
16 At each round t, an instance xt ∈ Rd is presented to the algorithm, which then predicts a label ŷt ∈ {−1, +1}. [sent-48, score-0.414]
17 Then, the correct label yt is revealed, and the algorithm may modify its hypothesis. [sent-49, score-0.206]
18 The aim of the online learning algorithm is to make as few mistakes as possible on any sequence of samples/labels {(xt, yt)}_{t=1}^T. [sent-50, score-0.591]
19 In this paper we focus on linear prediction functions of the form ŷt = sign(wt⊤xt). [sent-51, score-0.4]
20 We strive to design online learning algorithms for which it is possible to prove a relative mistake bound or a loss bound. [sent-52, score-0.575]
21 A typical analysis of this kind bounds the cumulative loss the algorithm suffers, Σ_{t=1}^T ℓ(wt, xt, yt), by the cumulative loss of any fixed classifier u plus an additional penalty called the regret: R(u) + Σ_{t=1}^T ℓ(u, xt, yt). [sent-53, score-1.12]
22 Given that we focus on classification, we are more interested in a relative mistake bound, which bounds the number of mistakes of the learner by R(u) + Σ_{t=1}^T ℓ(u, xt, yt). [sent-54, score-0.991]
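The online protocol and mistake counting just described can be sketched in a few lines. The following is a minimal sketch using the conservative Perceptron update as the concrete instance; the toy data stream is illustrative and not from the paper.

```python
# Minimal sketch of the online classification protocol: at each round,
# predict sign(w_t^T x_t), receive the true label, and (here) update only
# on mistake rounds, as a conservative algorithm does.

def perceptron(stream):
    """Run the conservative Perceptron; return (mistakes, final weights)."""
    w = None
    mistakes = 0
    for x, y in stream:                      # round t: receive x_t
        if w is None:
            w = [0.0] * len(x)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if margin >= 0 else -1     # predict sign(w_t^T x_t)
        if y_hat != y:                       # mistake round: update
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes, w

# Illustrative separable stream: the label is the sign of the first coordinate.
stream = [([1.0, 0.2], 1), ([-1.0, 0.1], -1), ([0.8, -0.3], 1),
          ([-0.9, -0.2], -1)] * 5
M, w = perceptron(stream)
print(M)
```

On this separable stream the learner makes a single mistake and then predicts every repeated example correctly, which is the behavior the relative mistake bounds above quantify.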
23 Similarly, we denote by U the set of margin-error rounds, that is, rounds in which the algorithm updates its hypothesis and the prediction is correct, but the loss ℓt(wt) is different from zero. [sent-61, score-0.236]
24 Formally, M = {t : sign(wt⊤xt) ≠ yt ∧ wt ≠ wt+1}, and U = {t : sign(wt⊤xt) = yt ∧ wt ≠ wt+1}. [sent-63, score-1.18]
25 An algorithm that updates its hypothesis only on mistake rounds is called conservative (e. [sent-64, score-0.317]
26 Following previous naming conventions [3], we call aggressive an algorithm that updates its rule on rounds for which the loss ℓt(wt) is different from zero, even if its prediction was correct. [sent-67, score-0.323]
27 Strong convexity turns out to be a key property in the design of online learning algorithms. [sent-75, score-0.152]
28 3 General Algorithm and Analysis We now introduce a general framework for designing online learning algorithms, together with a general lemma which serves as a tool to prove their relative regret bounds. [sent-76, score-0.28]
29 Our algorithm builds on previous algorithms for online convex programming, with one significant difference. [sent-77, score-0.222]
30 Instead of using a fixed link function, as first-order algorithms do, we allow a sequence of link functions ft(·), one for each time t. [sent-78, score-0.531]
31 Given a new example, it uses the current link function ft to compute the prediction weight vector wt. [sent-80, score-0.791]
32 After the target label is received, it sets the new weight θt+1 to θt minus the scaled gradient of the loss at wt. [sent-81, score-0.305]
33 The following lemma is a generalization of Corollary 7 in [13] and Corollary 3 in [9], for online learning. [sent-84, score-0.16]
34 Then, for any u ∈ S and any λ > 0, we have
(1/λ) Σ_{t=1}^T ηt zt⊤(wt − λu) ≤ (1/λ) fT(λu) + Σ_{t=1}^T ( ηt² ‖zt‖_{ft*}² / (2λβt) + (1/λ)(ft*(θt) − f_{t-1}*(θt)) ). [sent-100, score-1.021]
35 This lemma can appear difficult to interpret, but we now show that it is straightforward to use it to recover known bounds for different online learning algorithms. [sent-101, score-0.233]
36 In particular we can state the following Corollary that holds for any convex loss that upper bounds the 0/1 loss. [sent-102, score-0.133]
37 Algorithm 1:
3: for t = 1, ..., T do
4:   Receive xt
5:   Set wt = ∇ft*(θt)
6:   Predict ŷt = sign(wt⊤xt)
7:   Receive yt
8:   if ℓt(wt) > 0 then
9:     zt = ∂ℓt(wt)
10:    θt+1 = θt − ηt zt
11:  else
12:    θt+1 = θt
13:  end if
14: end for
Corollary 1. [sent-110, score-1.765]
38 Under the hypothesis of Lemma 1, if ℓ is a convex loss upper bounding the 0/1 loss and ηt = η, Algorithm 1 has the following bound on the maximum number of mistakes M:
M ≤ Σ_{t=1}^T ℓt(u) + (fT(u) + B)/η + η Σ_{t=1}^T ‖zt‖_{ft*}² / (2βt),  (1)
where B = Σ_{t=1}^T (ft*(θt) − f_{t-1}*(θt)). [sent-113, score-0.502]
39 Moreover, if ft(x) ≤ ft+1(x), ∀x ∈ S, t = 0, ..., T − 1, then B ≤ 0. [sent-114, score-0.483]
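Algorithm 1 above can be sketched with the link function supplied as a callable returning wt = ∇ft*(θt), so that first- and second-order variants plug into the same loop. Here, as an illustrative assumption, the link is fixed to the identity (ft(x) = ‖x‖²/2 for all t), the loss is the hinge loss, and ηt = 1, giving an aggressive Perceptron-style instance.

```python
# Sketch of Algorithm 1 with a pluggable time-varying link function.

def algorithm1(stream, grad_f_star, eta=1.0):
    d = len(stream[0][0])
    theta = [0.0] * d
    mistakes = 0
    for t, (x, y) in enumerate(stream, start=1):
        w = grad_f_star(t, theta)                    # w_t = grad f_t^*(theta_t)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if margin >= 0 else -1) != y:          # predicted sign(w_t^T x_t)
            mistakes += 1
        if max(1.0 - y * margin, 0.0) > 0.0:         # hinge loss l_t(w_t) > 0
            z = [-y * xi for xi in x]                # z_t = subgradient of l_t at w_t
            theta = [th - eta * zi for th, zi in zip(theta, z)]
    return mistakes

identity_link = lambda t, theta: list(theta)         # f_t(x) = ||x||^2 / 2 for all t
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1)] * 5
print(algorithm1(stream, identity_link))
```

Swapping `identity_link` for a function that rescales θt by accumulated second-order statistics yields the second-order instances discussed later.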
40 We could use adaptive regularization methods, such as the ones proposed in [16, 18], but in this way we would lose the possibility of proving mistake bounds for second-order algorithms, like the ones in [1, 5]. [sent-123, score-0.249]
41 4.1 Better bounds for linear losses. The hinge loss, ℓ(u, xt, yt) = max(1 − yt u⊤xt, 0), is a very popular evaluation metric in classification. [sent-126, score-0.862]
42 It has been used, for example, in Support Vector Machines [7] as well as in many online learning algorithms [3]. [sent-127, score-0.147]
43 Often mistake bounds are expressed in terms of the hinge loss. [sent-129, score-0.22]
44 One reason is that it is a tighter upper bound on the 0/1 loss compared to other losses, such as the squared hinge loss. [sent-130, score-0.158]
45 However, this loss is particularly interesting for us, because it allows an automatic tuning of the bound in (1). [sent-131, score-0.139]
46 In particular, it is easy to verify that it satisfies the following condition:
ℓ(u, xt, yt) ≥ 1 + u⊤ ∂ℓt(wt), ∀u ∈ S, wt : ℓt(wt) > 0.  (2) [sent-132, score-0.608]
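Condition (2) can be checked numerically. For the hinge loss with ℓ(w) > 0, the subgradient is z = −y·x, and ℓ(u) = max(1 − y·u⊤x, 0) ≥ 1 − y·u⊤x = 1 + u⊤z for every u. The random data below is purely illustrative.

```python
# Numerical check of condition (2) for the hinge loss on random samples.

import random

def hinge(v, x, y):
    return max(1.0 - y * sum(vi * xi for vi, xi in zip(v, x)), 0.0)

random.seed(0)
for _ in range(1000):
    d = 3
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = random.choice([-1, 1])
    w = [random.uniform(-1, 1) for _ in range(d)]
    u = [random.uniform(-2, 2) for _ in range(d)]
    if hinge(w, x, y) > 0:                      # condition only required here
        z = [-y * xi for xi in x]               # subgradient of the hinge at w
        lhs = hinge(u, x, y)
        rhs = 1.0 + sum(ui * zi for ui, zi in zip(u, z))
        assert lhs >= rhs - 1e-12
print("condition (2) holds on all sampled points")
```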
47 Corollary 2. Under the hypothesis of Lemma 1, if fT(λu) ≤ λ² fT(u) and ℓ satisfies (2), then for any u ∈ S and any λ > 0 we have
Σ_{t∈M∪U} ηt ≤ L + λ fT(u) + (1/λ) ( B + Σ_{t∈M∪U} ( ηt² ‖zt‖_{ft*}² / (2βt) − ηt wt⊤zt ) ),
where L = Σ_{t∈M∪U} ηt ℓt(u) and B = Σ_{t=1}^T (ft*(θt) − f_{t-1}*(θt)). [sent-135, score-0.886]
48 In particular, choosing the optimal λ, we obtain
Σ_{t∈M∪U} ηt ≤ L + √( 2 fT(u) ( 2B + Σ_{t∈M∪U} ( ηt² ‖zt‖_{ft*}² / βt − 2 ηt wt⊤zt ) ) ).  (3) [sent-136, score-1.181]
49 In other words, wt and αwt (with α > 0) make exactly the same predictions, because only the sign of the prediction matters. [sent-138, score-0.276]
50 Exactly this independence from a scale factor allows us to improve the mistake bound (1) to the bound in (3). [sent-139, score-0.259]
51 Finally, note that when the hinge loss is used, the vector θt is updated as in an aggressive version of the Perceptron algorithm, with a possibly variable learning rate. [sent-141, score-0.226]
52 We can now use Corollary 2 and Algorithm 1 to obtain an aggressive version of the p-norm algorithm [11]. [sent-145, score-0.156]
53 Set ft(u) = ‖u‖q² / (2(q−1)), which is 1-strongly convex w.r.t. ‖·‖q. [sent-146, score-0.52]
54 Moreover, set ηt = 1 on mistake rounds; then, using the second bound of Corollary 2 and defining R such that ‖xt‖p² ≤ R², we have
M ≤ L + √( (‖u‖q²/(q−1)) Σ_{t∈M∪U} ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) ) − Σ_{t∈U} ηt
  ≤ L + √( (‖u‖q²/(q−1)) ( M R² + Σ_{t∈U} ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) ) ) − Σ_{t∈U} ηt. [sent-151, score-2.007]
55 Solving for M, we have
M ≤ L + ‖u‖q² R² / (2(q−1)) + ( R ‖u‖q / √(q−1) ) √( ‖u‖q² R² / (4(q−1)) + L + D ) − Σ_{t∈U} ηt,  (4)
where L = Σ_{t∈M∪U} ηt ℓt(u), and D = Σ_{t∈U} ( ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) / R² − ηt ). [sent-152, score-0.59]
56 If we set ηt = 0 on margin-error rounds, Algorithm 1 becomes the p-norm algorithm and we recover its best bound [11]. [sent-155, score-0.543]
57 However, if 0 ≤ ηt ≤ min( (R² − 2 yt wt⊤xt) / ‖xt‖p², 1 ), we have that D is negative, and L ≤ Σ_{t∈M∪U} ℓt(u). [sent-156, score-0.211]
58 Hence the aggressive updates give us a better bound, thanks to the last term, which is subtracted from the bound. [sent-157, score-0.166]
59 In particular, the minimum of D under the constraint ηt ≤ 1 can be found by setting ηt = min( (R²/2 − yt wt⊤xt) / ‖xt‖p², 1 ). [sent-159, score-0.429]
60 If R is equal to √2, we recover the PA-I update rule with C = 1. [sent-160, score-0.261]
61 However note that the mistake bound in (4) is better than the one proved for PA-I in [3] and the ones in [16]. [sent-161, score-0.194]
62 Hence the bound (4) provides the first theoretical justification for the good performance of PA-I, and it can be seen as general evidence supporting aggressive updates over conservative ones. [sent-162, score-0.26]
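The aggressive learning rate just discussed can be sketched as a small helper. A minimal sketch, assuming the hinge loss and clamping the rate at 0 on rounds with nonpositive loss (the clamp and the example values are assumptions for illustration):

```python
# eta_t = min((R^2/2 - y_t w_t^T x_t) / ||x_t||^2, 1), the minimizer of D
# under eta_t <= 1; with R = sqrt(2) it matches the PA-I rate
# min(hinge_loss / ||x||^2, 1) with C = 1.

import math

def pa_style_eta(w, x, y, R):
    norm2 = sum(xi * xi for xi in x)
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, min((R * R / 2.0 - margin) / norm2, 1.0))

# Check the R = sqrt(2) correspondence on one illustrative example.
w, x, y = [0.2, -0.1], [1.0, 1.0], 1
eta = pa_style_eta(w, x, y, math.sqrt(2.0))
hinge = max(1.0 - y * sum(wi * xi for wi, xi in zip(w, x)), 0.0)
print(abs(eta - min(hinge / sum(xi * xi for xi in x), 1.0)) < 1e-9)
```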
63 Set ft(x) = ½ x⊤ At x, where At = A_{t-1} + xt xt⊤ / r, r > 0, and A0 = I. [sent-165, score-0.483]
64 The dual functions of ft(x), ft*(x), are equal to ½ x⊤ At^{-1} x, while ‖x‖_{ft*}² is x⊤ At^{-1} x. [sent-170, score-0.508]
65 Denote by χt = xt⊤ A_{t-1}^{-1} xt and mt = yt xt⊤ A_{t-1}^{-1} θt. [sent-171, score-0.94]
66 The prediction of Algorithm 1 specialized to this case is yt wt⊤xt = r mt / (r + χt); on the other hand, AROW predicts with mt. [sent-175, score-0.882]
67 The sign of the predictions is the same, but here the aggressive version updates when r mt / (r + χt) ≤ 1, while AROW updates if mt ≤ 1. [sent-176, score-0.493]
68 To derive the bound, observe that using the Woodbury matrix identity we have
ft*(θt) − f_{t-1}*(θt) = −( xt⊤ A_{t-1}^{-1} θt )² / ( 2 (r + xt⊤ A_{t-1}^{-1} xt) ) = −mt² / ( 2 (r + χt) ). [sent-179, score-0.23]
69 Using the second bound in Corollary 2, and setting ηt = 1, we have
M + U ≤ L + √( u⊤ AT u · Σ_{t∈M∪U} ( xt⊤ At^{-1} xt + 2 yt wt⊤xt − mt²/(r + χt) ) )
  ≤ L + √( ( ‖u‖² + (1/r) Σ_{t∈M∪U} (u⊤xt)² ) ( r log(det(AT)) + Σ_{t∈M∪U} mt (2r − mt) / (r + χt) ) )
  = L + √( ( r ‖u‖² + Σ_{t∈M∪U} (u⊤xt)² ) ( log(det(AT)) + Σ_{t∈M∪U} mt (2r − mt) / ( r (r + χt) ) ) ). [sent-180, score-2.059]
70 This bound recovers SOP's bound in the conservative case, and slightly improves that of AROW in the aggressive case. [sent-181, score-0.222]
71 Here we prove a mistake bound for the diagonal version of AROW, using Corollary 2. [sent-187, score-0.251]
72 We denote Dt = diag{At}, where At is defined as in SOP and AROW, and ft(x) = ½ x⊤ Dt x. [sent-188, score-0.483]
73 The presence of a mistake bound allows us to theoretically analyze the cases where this algorithm could be advantageous with respect to a simple Perceptron. [sent-190, score-0.217]
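The diagonal second-order update just described can be sketched as follows. This is a minimal sketch: it maintains only diag(At), predicts with wt = Dt^{-1} θt elementwise, and updates on every round with positive hinge loss; the data, r, and fixed ηt = 1 are illustrative assumptions, not the exact recipe from the paper.

```python
# Diagonal second-order sketch: D_t = diag(A_t) with A_t = A_{t-1} + x_t x_t^T / r.

def diag_second_order(stream, r=1.0):
    d = len(stream[0][0])
    theta, D = [0.0] * d, [1.0] * d
    mistakes = 0
    for x, y in stream:
        w = [th / Di for th, Di in zip(theta, D)]      # w_t = D_t^{-1} theta_t
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if margin >= 0 else -1) != y:
            mistakes += 1
        if y * margin < 1:                             # hinge loss l_t(w_t) > 0
            theta = [th + y * xi for th, xi in zip(theta, x)]
            D = [Di + xi * xi / r for Di, xi in zip(D, x)]  # diagonal of A_t update
    return mistakes

stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1)] * 8
print(diag_second_order(stream))
```

Frequently seen coordinates accumulate large diagonal entries, so their effective learning rate shrinks, which is the qualitative behavior described for the second-order algorithms above.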
74 We set ft(x) = ½ x⊤ At x, where At = A_{t-1} + xt xt⊤ / rt, and A0 = I. [sent-206, score-0.483]
75 This is similar to the regularization used in AROW and SOP, but here we have rt > 0 changing over time. [sent-207, score-0.143]
76 Again, denote χt = xt⊤ A_{t-1}^{-1} xt, and set ηt = 1. [sent-208, score-0.422]
77 With these choices, we obtain the bound
M + U ≤ Σ_{t∈M∪U} ℓt(u) + λ ‖u‖² / 2 + Σ_{t∈M∪U} ( λ (u⊤xt)² / (2 rt) + ( χt rt + mt (2rt − mt) ) / ( 2λ (rt + χt) ) ),
which holds for any λ > 0 and any choice of rt > 0. [sent-209, score-0.824]
78 We would like to choose rt at each step to minimize the bound, in particular to obtain a small value of the sum λ (u⊤xt)² / rt + χt rt / ( λ (rt + χt) ). [sent-210, score-0.256]
79 Although we do not know the values of (u⊤xt)² and λ, we can still obtain a good trade-off by setting rt = χt / (b χt − 1) when b χt > 1, and rt = +∞ otherwise. [sent-211, score-0.467]
80 With this choice we have that χt rt / (rt + χt) = 1/b, and (u⊤xt)² / rt = (u⊤xt)² (b χt − 1) / χt ≤ b (u⊤xt)², when b χt > 1. [sent-213, score-0.934]
81 Tuning λ, we have
M + U ≤ Σ_{t∈M∪U} ℓt(u) + ‖u‖ R √( ( 1 + log(det(AT)) / (b R²) ) ( Σ_{t∈M∪U} min(1, b χt) + Σ_{t∈M∪U} b mt (2rt − mt) / (rt + χt) ) ).
This algorithm interpolates between a second-order algorithm with adaptive second-order information, like AROW, and one with fixed second-order information. [sent-215, score-0.395]
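The NAROW trade-off rule for rt described above can be sketched as a small helper. This is reconstructed from the garbled text and should be treated as a sketch: rt = χt / (b·χt − 1) when b·χt > 1, else rt = +∞ (no adaptation on that round), which caps the per-round term χt·rt/(rt + χt) at min(1/b, χt).

```python
# NAROW-style r_t rule (reconstructed; treat as a sketch).

def narow_rt(chi, b):
    if b * chi > 1.0:
        return chi / (b * chi - 1.0)
    return float("inf")

def capped_term(chi, b):
    """chi_t * r_t / (r_t + chi_t) for the chosen r_t; equals min(1/b, chi_t)."""
    r = narow_rt(chi, b)
    if r == float("inf"):
        return chi                    # limit of chi*r/(r+chi) as r -> infinity
    return chi * r / (r + chi)

print(capped_term(2.0, 1.0))          # b*chi > 1: the term is capped at 1/b
print(capped_term(0.5, 1.0))          # b*chi <= 1: the term stays at chi
```

The cap is what makes the algorithm interpolate: rounds with large χt behave like a fixed second-order update, while rounds with small χt keep the fully adaptive behavior.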
82 Middle: cumulative number of mistakes of four algorithms on data with no label noise. [sent-222, score-0.414]
83 Fig. 3 summarizes the cumulative number of mistakes averaged over 100 repetitions, and the third row shows the cumulative number of mistakes where 10% artificial label noise was used. [sent-237, score-0.799]
84 All algorithms make few mistakes when receiving the first half of the data - the easy examples. [sent-240, score-0.328]
85 Then all algorithms start to make more mistakes: PA-I the most, then AdaGrad, closely followed by NAROW, and AROW the least. [sent-241, score-0.31]
86 All algorithms follow a general trend of making mistakes at a linear rate and then ceasing to make mistakes once the data becomes easy and there are many possible classifiers that can predict correctly. [sent-244, score-0.668]
87 Clearly, AROW and NAROW stop making mistakes first, then AdaGrad, and PA-I last. [sent-245, score-0.307]
88 In both cases, none of the algorithms makes many mistakes in the beginning; then, at some point close to the middle of the input sequence, they start making many mistakes for a while, and then they converge. [sent-248, score-0.602]
89 However, NAROW starts to make many mistakes before the other algorithms and takes more "examples" to converge until it stops making mistakes. [sent-250, score-0.325]
90 7 Conclusion We presented a framework for online convex classification, specializing it for particular losses, such as the hinge loss. [sent-254, score-0.212]
91 This general tool allows us to design theoretically motivated online classification algorithms and to prove their relative mistake bounds. [sent-255, score-0.313]
92 We have shown its utility by proving better bounds for known online algorithms and by proposing a new algorithm, called NAROW. [sent-258, score-0.161]
93 Define by ft* the Fenchel dual of ft, and ∆t = ft*(θt+1) − f_{t-1}*(θt). [sent-268, score-0.508]
94 Moreover, we have that
∆t = ft*(θt+1) − ft*(θt) + ft*(θt) − f_{t-1}*(θt) ≤ ft*(θt) − f_{t-1}*(θt) − ηt zt⊤wt + ηt² ‖zt‖_{ft*}² / (2βt),
where we used Theorem 6 in [13]. [sent-270, score-1.769]
95 Moreover, using the Fenchel-Young inequality, we have that
(1/λ) Σ_{t=1}^T ∆t = (1/λ) fT*(θT+1) ≥ u⊤θT+1 − (1/λ) fT(λu) = −Σ_{t=1}^T ηt u⊤zt − (1/λ) fT(λu). [sent-271, score-0.16]
96 Hence, putting everything together, we have
−Σ_{t=1}^T ηt u⊤zt − (1/λ) fT(λu) ≤ (1/λ) Σ_{t=1}^T ∆t ≤ (1/λ) Σ_{t=1}^T ( ft*(θt) − f_{t-1}*(θt) − ηt wt⊤zt + ηt² ‖zt‖_{ft*}² / (2βt) ),
where we used the definition of wt in Algorithm 1. [sent-272, score-1.239]
97 By convexity, ℓ(wt, xt, yt) − ℓ(u, xt, yt) ≤ zt⊤(wt − u), so setting λ = 1 in Lemma 1 we have the stated bound. [sent-274, score-0.904]
98 For the additional statement, using Lemma 12 in [16] and ft(x) ≤ ft+1(x), we have that ft*(x) ≥ f_{t+1}*(x), so B ≤ 0. [sent-275, score-0.483]
99 Using it, we have that ft(x) ≤ ft+1(x) implies ft*(x) ≥ f_{t+1}*(x), so we have that B ≤ 0. [sent-277, score-0.483]
100 Lemma 1, the condition on the loss (2), and the hypothesis on fT give us
Σ_{t=1}^T ηt (1 − ℓt(u)) ≤ −Σ_{t=1}^T ηt u⊤zt ≤ λ fT(u) + (1/λ) ( Σ_{t=1}^T ηt² ‖zt‖_{ft*}² / (2βt) + B − Σ_{t=1}^T ηt zt⊤wt ).
Note that λ is free, so choosing its optimal value we get the second bound. [sent-279, score-1.255]
simIndex simValue paperId paperTitle
same-paper 1 1.0 182 nips-2010-New Adaptive Algorithms for Online Classification
2 0.30141094 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.266491 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
4 0.21562022 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
5 0.14456269 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.14305504 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.12347706 162 nips-2010-Link Discovery using Graph Feature Tracking
8 0.11661258 15 nips-2010-A Theory of Multiclass Boosting
9 0.11069065 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
10 0.095041499 61 nips-2010-Direct Loss Minimization for Structured Prediction
11 0.088630497 150 nips-2010-Learning concept graphs from text with stick-breaking priors
12 0.087201945 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
13 0.083761603 138 nips-2010-Large Margin Multi-Task Metric Learning
14 0.080158696 203 nips-2010-Parametric Bandits: The Generalized Linear Case
15 0.076715514 188 nips-2010-On Herding and the Perceptron Cycling Theorem
16 0.074552596 265 nips-2010-The LASSO risk: asymptotic results and real world examples
17 0.069814786 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
18 0.069108412 212 nips-2010-Predictive State Temporal Difference Learning
19 0.068357013 243 nips-2010-Smoothness, Low Noise and Fast Rates
20 0.068125226 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
simIndex simValue paperId paperTitle
same-paper 1 0.9464311 182 nips-2010-New Adaptive Algorithms for Online Classification
2 0.67715943 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
3 0.65438485 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
4 0.6006853 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
5 0.54964322 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
6 0.46843323 61 nips-2010-Direct Loss Minimization for Structured Prediction
7 0.42175597 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
8 0.40868038 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.39119124 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.36879095 15 nips-2010-A Theory of Multiclass Boosting
11 0.33727401 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
12 0.31338856 226 nips-2010-Repeated Games against Budgeted Adversaries
13 0.30347657 2 nips-2010-A Bayesian Approach to Concept Drift
14 0.29101905 19 nips-2010-A rational decision making framework for inhibitory control
15 0.28852159 212 nips-2010-Predictive State Temporal Difference Learning
16 0.27948985 138 nips-2010-Large Margin Multi-Task Metric Learning
17 0.27924103 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
18 0.27695784 202 nips-2010-Parallelized Stochastic Gradient Descent
19 0.26851308 192 nips-2010-Online Classification with Specificity Constraints
20 0.26832205 203 nips-2010-Parametric Bandits: The Generalized Linear Case
simIndex simValue paperId paperTitle
1 0.73148167 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
same-paper 2 0.72189695 182 nips-2010-New Adaptive Algorithms for Online Classification
3 0.72150749 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
Author: Ulrike V. Luxburg, Agnes Radl, Matthias Hein
Abstract: The commute distance between two vertices in a graph is the expected time it takes a random walk to travel from the first to the second vertex and back. We study the behavior of the commute distance as the size of the underlying graph increases. We prove that the commute distance converges to an expression that does not take into account the structure of the graph at all and that is completely meaningless as a distance function on the graph. Consequently, the use of the raw commute distance for machine learning purposes is strongly discouraged for large graphs and in high dimensions. As an alternative we introduce the amplified commute distance that corrects for the undesired large sample effects. 1
4 0.7048344 172 nips-2010-Multi-Stage Dantzig Selector
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ R^{n×m} (m ≫ n) and a noisy observation vector y ∈ R^n satisfying y = Xβ∗ + ε, where ε is the noise vector following a Gaussian distribution N(0, σ²I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β∗, measured in terms of the lp norm (p ≥ 1), is bounded as ‖β̂ − β∗‖_p ≤ (C(s − N)^{1/p} √(log m) + ∆)σ, where C is a constant, s is the number of nonzero entries in β∗, ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from Cs^{1/p}√(log m)σ to C(s − N)^{1/p}√(log m)σ, where the value N depends on the number of large entries in β∗. When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.
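The multi-stage refinement is the paper's contribution; the standard (single-stage) Dantzig selector it builds on, min ‖b‖₁ subject to ‖Xᵀ(y − Xb)‖∞ ≤ λ, is itself a linear program. A hedged sketch of that baseline using scipy (the splitting b = u − v and the test problem are illustrative, not the paper's method):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Standard Dantzig selector: min ||b||_1  s.t.  ||X^T (y - X b)||_inf <= lam.

    Cast as an LP by splitting b = u - v with u, v >= 0, so that
    ||b||_1 = sum(u) + sum(v) at the optimum.
    """
    n, m = X.shape
    G = X.T @ X                      # m x m Gram matrix
    c0 = X.T @ y
    # Constraint -lam <= c0 - G (u - v) <= lam, written as two one-sided blocks.
    A_ub = np.block([[G, -G],
                     [-G, G]])
    b_ub = np.concatenate([c0 + lam, lam - c0])
    cost = np.ones(2 * m)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * m), method="highs")
    if res.status != 0:
        raise RuntimeError(f"LP failed: {res.message}")
    u, v = res.x[:m], res.x[m:]
    return u - v

# Sanity check on a noiseless sparse problem with a tiny lambda,
# where the solution should essentially recover the true signal.
rng = np.random.default_rng(0)
n, m = 60, 20
X = rng.standard_normal((n, m))
beta = np.zeros(m)
beta[0], beta[3] = 3.0, -2.0
y = X @ beta
b_hat = dantzig_selector(X, y, lam=1e-4)
print(np.round(b_hat, 3))
```

With noise, λ is instead chosen on the order of σ√(log m), matching the threshold in the bound above.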
5 0.69217306 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fat-shattering dimension from statistical learning theory. Relationships among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting.
6 0.57667512 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.5746578 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.57220966 117 nips-2010-Identifying graph-structured activation patterns in networks
9 0.57106358 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
10 0.57060874 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.5701912 63 nips-2010-Distributed Dual Averaging In Networks
12 0.56953001 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
13 0.56820965 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
14 0.56804711 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.56685001 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
16 0.56666034 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
17 0.56568265 158 nips-2010-Learning via Gaussian Herding
18 0.565382 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
19 0.56516284 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
20 0.56413662 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone