nips nips2010 nips2010-182 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework for online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows us to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing us to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second-order information. We analyze the properties of the algorithm and illustrate its performance using a synthetic dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 This framework allows us to design and prove relative mistake bounds for any generic loss function. [sent-7, score-0.279]
2 The mistake bounds can be specialized for the hinge loss, allowing us to recover and improve the bounds of known online classification algorithms. [sent-8, score-0.407]
3 By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second-order information. [sent-9, score-0.198]
4 1 Introduction Linear discriminative online algorithms have been shown to perform very well on binary and multiclass labeling problems [10, 6, 14, 3]. [sent-11, score-0.168]
5 Often, such an update takes place only on rounds where the online algorithm makes a prediction mistake or when the confidence in the prediction is not sufficient. [sent-14, score-0.404]
6 The aim of the classifier is to minimize the cumulative loss it suffers due to its prediction, such as the total number of mistakes. [sent-15, score-0.153]
7 Recently [1, 8, 4, 12, 5, 9], researchers proposed to improve online learning algorithms by incorporating second order information. [sent-17, score-0.147]
8 For carefully designed algorithms, it is possible to bound the cumulative loss on any sequence of samples, even adversarially chosen [2]. [sent-28, score-0.234]
9 In particular, many of the recent analyses are based on the online convex optimization framework, that focuses on minimizing the sum of convex functions. [sent-29, score-0.188]
10 Two common viewpoints for online convex optimization are regularization [15] and primal-dual progress [16, 17, 13]. [sent-30, score-0.166]
11 We also show how the use of widely used classification losses, such as the hinge loss, allows us to derive new powerful mistake bounds superior to existing bounds. [sent-34, score-0.239]
12 Moreover, the introduced framework supports the design of aggressive algorithms, i.e., algorithms that update their hypothesis even on rounds where the prediction was correct but the loss is nonzero. [sent-35, score-0.163]
13 All these algorithms maintain the cumulative second-moment of the input features, and its inverse, qualitatively speaking, is used as a learning rate. [sent-39, score-0.137]
14 We derive a bound for this algorithm and illustrate its properties using synthetic data simulations. [sent-46, score-0.14]
15 2 Online Learning for Classification We work in the online binary classification scenario where learning algorithms work in rounds. [sent-47, score-0.147]
16 At each round t, an instance xt ∈ Rd is presented to the algorithm, which then predicts a label ŷt ∈ {−1, +1}. [sent-48, score-0.414]
17 Then, the correct label yt is revealed, and the algorithm may modify its hypothesis. [sent-49, score-0.206]
18 The aim of the online learning algorithm is to make as few mistakes as possible on any sequence of samples/labels {(xt, yt)}_{t=1}^T. [sent-50, score-0.591]
19 In this paper we focus on linear prediction functions of the form ŷt = sign(wt⊤xt). [sent-51, score-0.4]
20 We strive to design online learning algorithms for which it is possible to prove a relative mistake bound or a loss bound. [sent-52, score-0.575]
21 A typical analysis of this kind bounds the cumulative loss the algorithm suffers, Σ_{t=1}^T ℓ(wt, xt, yt), by the cumulative loss of any fixed classifier u plus an additional penalty called the regret: R(u) + Σ_{t=1}^T ℓ(u, xt, yt). [sent-53, score-1.12]
22 Given that we focus on classification, we are more interested in a relative mistake bound, which bounds the number of mistakes of the learner by R(u) + Σ_{t=1}^T ℓ(u, xt, yt). [sent-54, score-0.991]
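The online protocol and mistake counting just described can be sketched in a few lines. The following is a minimal sketch using the conservative Perceptron update as the concrete instance; the toy data stream is illustrative and not from the paper.

```python
# Minimal sketch of the online classification protocol: at each round,
# predict sign(w_t^T x_t), receive the true label, and (here) update only
# on mistake rounds, as a conservative algorithm does.

def perceptron(stream):
    """Run the conservative Perceptron; return (mistakes, final weights)."""
    w = None
    mistakes = 0
    for x, y in stream:                      # round t: receive x_t
        if w is None:
            w = [0.0] * len(x)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if margin >= 0 else -1     # predict sign(w_t^T x_t)
        if y_hat != y:                       # mistake round: update
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes, w

# Illustrative separable stream: the label is the sign of the first coordinate.
stream = [([1.0, 0.2], 1), ([-1.0, 0.1], -1), ([0.8, -0.3], 1),
          ([-0.9, -0.2], -1)] * 5
M, w = perceptron(stream)
print(M)
```

On this separable stream the learner makes a single mistake and then predicts every repeated example correctly, which is the behavior the relative mistake bounds above quantify.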
23 Similarly, we denote by U the set of margin-error rounds, that is, rounds in which the algorithm updates its hypothesis and the prediction is correct, but the loss ℓt(wt) is different from zero. [sent-61, score-0.236]
24 Formally, M = {t : sign(wt⊤xt) ≠ yt ∧ wt ≠ wt+1}, and U = {t : sign(wt⊤xt) = yt ∧ wt ≠ wt+1}. [sent-63, score-1.18]
25 An algorithm that updates its hypothesis only on mistake rounds is called conservative (e. [sent-64, score-0.317]
26 Following previous naming conventions [3], we call aggressive an algorithm that updates its rule on rounds for which the loss ℓt(wt) is different from zero, even if its prediction was correct. [sent-67, score-0.323]
27 Strong convexity turns out to be a key property in the design of online learning algorithms. [sent-75, score-0.152]
28 3 General Algorithm and Analysis We now introduce a general framework for designing online learning algorithms, together with a general lemma which serves as a tool to prove their relative regret bounds. [sent-76, score-0.28]
29 Our algorithm builds on previous algorithms for online convex programming, with one significant difference. [sent-77, score-0.222]
30 Instead of using a fixed link function, as first-order algorithms do, we allow a sequence of link functions ft(·), one for each time t. [sent-78, score-0.531]
31 Given a new example, it uses the current link function ft to compute the prediction weight vector wt. [sent-80, score-0.791]
32 After the target label is received, it sets the new weight θt+1 to θt minus the scaled gradient of the loss at wt. [sent-81, score-0.305]
33 The following lemma is a generalization of Corollary 7 in [13] and Corollary 3 in [9], for online learning. [sent-84, score-0.16]
34 Then, for any u ∈ S and any λ > 0, we have
(1/λ) Σ_{t=1}^T ηt zt⊤(wt − λu) ≤ (1/λ) fT(λu) + Σ_{t=1}^T ( ηt² ‖zt‖_{ft*}² / (2λβt) + (1/λ)(ft*(θt) − f_{t-1}*(θt)) ). [sent-100, score-1.021]
35 This lemma can appear difficult to interpret, but we now show that it is straightforward to use it to recover known bounds for different online learning algorithms. [sent-101, score-0.233]
36 In particular we can state the following Corollary that holds for any convex loss that upper bounds the 0/1 loss. [sent-102, score-0.133]
37 Algorithm 1:
3: for t = 1, ..., T do
4:   Receive xt
5:   Set wt = ∇ft*(θt)
6:   Predict ŷt = sign(wt⊤xt)
7:   Receive yt
8:   if ℓt(wt) > 0 then
9:     zt = ∂ℓt(wt)
10:    θt+1 = θt − ηt zt
11:  else
12:    θt+1 = θt
13:  end if
14: end for
Corollary 1. [sent-110, score-1.765]
38 Under the hypothesis of Lemma 1, if ℓ is a convex loss upper bounding the 0/1 loss and ηt = η, Algorithm 1 has the following bound on the maximum number of mistakes M:
M ≤ Σ_{t=1}^T ℓt(u) + (fT(u) + B)/η + η Σ_{t=1}^T ‖zt‖_{ft*}² / (2βt),  (1)
where B = Σ_{t=1}^T (ft*(θt) − f_{t-1}*(θt)). [sent-113, score-0.502]
39 Moreover, if ft(x) ≤ ft+1(x), ∀x ∈ S, t = 0, ..., T − 1, then B ≤ 0. [sent-114, score-0.483]
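Algorithm 1 above can be sketched with the link function supplied as a callable returning wt = ∇ft*(θt), so that first- and second-order variants plug into the same loop. Here, as an illustrative assumption, the link is fixed to the identity (ft(x) = ‖x‖²/2 for all t), the loss is the hinge loss, and ηt = 1, giving an aggressive Perceptron-style instance.

```python
# Sketch of Algorithm 1 with a pluggable time-varying link function.

def algorithm1(stream, grad_f_star, eta=1.0):
    d = len(stream[0][0])
    theta = [0.0] * d
    mistakes = 0
    for t, (x, y) in enumerate(stream, start=1):
        w = grad_f_star(t, theta)                    # w_t = grad f_t^*(theta_t)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if margin >= 0 else -1) != y:          # predicted sign(w_t^T x_t)
            mistakes += 1
        if max(1.0 - y * margin, 0.0) > 0.0:         # hinge loss l_t(w_t) > 0
            z = [-y * xi for xi in x]                # z_t = subgradient of l_t at w_t
            theta = [th - eta * zi for th, zi in zip(theta, z)]
    return mistakes

identity_link = lambda t, theta: list(theta)         # f_t(x) = ||x||^2 / 2 for all t
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1)] * 5
print(algorithm1(stream, identity_link))
```

Swapping `identity_link` for a function that rescales θt by accumulated second-order statistics yields the second-order instances discussed later.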
40 We could use adaptive regularization methods, such as the ones proposed in [16, 18], but in this way we would lose the possibility of proving mistake bounds for second-order algorithms, like the ones in [1, 5]. [sent-123, score-0.249]
41 4.1 Better bounds for linear losses. The hinge loss, ℓ(u, xt, yt) = max(1 − yt u⊤xt, 0), is a very popular evaluation metric in classification. [sent-126, score-0.862]
42 It has been used, for example, in Support Vector Machines [7] as well as in many online learning algorithms [3]. [sent-127, score-0.147]
43 Often mistake bounds are expressed in terms of the hinge loss. [sent-129, score-0.22]
44 One reason is that it is a tighter upper bound on the 0/1 loss compared to other losses, such as the squared hinge loss. [sent-130, score-0.158]
45 However, this loss is particularly interesting for us, because it allows an automatic tuning of the bound in (1). [sent-131, score-0.139]
46 In particular, it is easy to verify that it satisfies the following condition:
ℓ(u, xt, yt) ≥ 1 + u⊤ ∂ℓt(wt), ∀u ∈ S, wt : ℓt(wt) > 0.  (2) [sent-132, score-0.608]
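Condition (2) can be checked numerically. For the hinge loss with ℓ(w) > 0, the subgradient is z = −y·x, and ℓ(u) = max(1 − y·u⊤x, 0) ≥ 1 − y·u⊤x = 1 + u⊤z for every u. The random data below is purely illustrative.

```python
# Numerical check of condition (2) for the hinge loss on random samples.

import random

def hinge(v, x, y):
    return max(1.0 - y * sum(vi * xi for vi, xi in zip(v, x)), 0.0)

random.seed(0)
for _ in range(1000):
    d = 3
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = random.choice([-1, 1])
    w = [random.uniform(-1, 1) for _ in range(d)]
    u = [random.uniform(-2, 2) for _ in range(d)]
    if hinge(w, x, y) > 0:                      # condition only required here
        z = [-y * xi for xi in x]               # subgradient of the hinge at w
        lhs = hinge(u, x, y)
        rhs = 1.0 + sum(ui * zi for ui, zi in zip(u, z))
        assert lhs >= rhs - 1e-12
print("condition (2) holds on all sampled points")
```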
47 Corollary 2. Under the hypothesis of Lemma 1, if fT(λu) ≤ λ² fT(u) and ℓ satisfies (2), then for any u ∈ S and any λ > 0 we have
Σ_{t∈M∪U} ηt ≤ L + λ fT(u) + (1/λ) ( B + Σ_{t∈M∪U} ( ηt² ‖zt‖_{ft*}² / (2βt) − ηt wt⊤zt ) ),
where L = Σ_{t∈M∪U} ηt ℓt(u) and B = Σ_{t=1}^T (ft*(θt) − f_{t-1}*(θt)). [sent-135, score-0.886]
48 In particular, choosing the optimal λ, we obtain
Σ_{t∈M∪U} ηt ≤ L + √( 2 fT(u) ( 2B + Σ_{t∈M∪U} ( ηt² ‖zt‖_{ft*}² / βt − 2 ηt wt⊤zt ) ) ).  (3) [sent-136, score-1.181]
49 In other words, wt and αwt (with α > 0) make exactly the same predictions, because only the sign of the prediction matters. [sent-138, score-0.276]
50 Exactly this independence from a scale factor allows us to improve the mistake bound (1) to the bound in (3). [sent-139, score-0.259]
51 Finally, note that when the hinge loss is used, the vector θt is updated as in an aggressive version of the Perceptron algorithm, with a possibly variable learning rate. [sent-141, score-0.226]
52 We can now use Corollary 2 and Algorithm 1 to obtain an aggressive version of the p-norm algorithm [11]. [sent-145, score-0.156]
53 Set ft(u) = ‖u‖q² / (2(q−1)), which is 1-strongly convex w.r.t. ‖·‖q. [sent-146, score-0.52]
54 Moreover, set ηt = 1 on mistake rounds; then, using the second bound of Corollary 2 and defining R such that ‖xt‖p² ≤ R², we have
M ≤ L + √( (‖u‖q²/(q−1)) Σ_{t∈M∪U} ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) ) − Σ_{t∈U} ηt
  ≤ L + √( (‖u‖q²/(q−1)) ( M R² + Σ_{t∈U} ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) ) ) − Σ_{t∈U} ηt. [sent-151, score-2.007]
55 Solving for M, we have
M ≤ L + ‖u‖q² R² / (2(q−1)) + ( R ‖u‖q / √(q−1) ) √( ‖u‖q² R² / (4(q−1)) + L + D ) − Σ_{t∈U} ηt,  (4)
where L = Σ_{t∈M∪U} ηt ℓt(u), and D = Σ_{t∈U} ( ( ηt² ‖xt‖p² + 2 ηt yt wt⊤xt ) / R² − ηt ). [sent-152, score-0.59]
56 If we set ηt = 0 on margin-error rounds, Algorithm 1 becomes the p-norm algorithm and we recover its best bound [11]. [sent-155, score-0.543]
57 However, if 0 ≤ ηt ≤ min( (R² − 2 yt wt⊤xt) / ‖xt‖p², 1 ), we have that D is negative, and L ≤ Σ_{t∈M∪U} ℓt(u). [sent-156, score-0.211]
58 Hence the aggressive updates give us a better bound, thanks to the last term, which is subtracted from the bound. [sent-157, score-0.166]
59 In particular, the minimum of D under the constraint ηt ≤ 1 can be found by setting ηt = min( (R²/2 − yt wt⊤xt) / ‖xt‖p², 1 ). [sent-159, score-0.429]
60 If R is equal to √2, we recover the PA-I update rule with C = 1. [sent-160, score-0.261]
61 However note that the mistake bound in (4) is better than the one proved for PA-I in [3] and the ones in [16]. [sent-161, score-0.194]
62 Hence the bound (4) provides the first theoretical justification for the good performance of PA-I, and it can be seen as general evidence supporting aggressive updates over conservative ones. [sent-162, score-0.26]
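The aggressive learning rate just discussed can be sketched as a small helper. A minimal sketch, assuming the hinge loss and clamping the rate at 0 on rounds with nonpositive loss (the clamp and the example values are assumptions for illustration):

```python
# eta_t = min((R^2/2 - y_t w_t^T x_t) / ||x_t||^2, 1), the minimizer of D
# under eta_t <= 1; with R = sqrt(2) it matches the PA-I rate
# min(hinge_loss / ||x||^2, 1) with C = 1.

import math

def pa_style_eta(w, x, y, R):
    norm2 = sum(xi * xi for xi in x)
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, min((R * R / 2.0 - margin) / norm2, 1.0))

# Check the R = sqrt(2) correspondence on one illustrative example.
w, x, y = [0.2, -0.1], [1.0, 1.0], 1
eta = pa_style_eta(w, x, y, math.sqrt(2.0))
hinge = max(1.0 - y * sum(wi * xi for wi, xi in zip(w, x)), 0.0)
print(abs(eta - min(hinge / sum(xi * xi for xi in x), 1.0)) < 1e-9)
```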
63 Set ft(x) = ½ x⊤ At x, where At = A_{t-1} + xt xt⊤ / r, r > 0, and A0 = I. [sent-165, score-0.483]
64 The dual functions of ft(x), ft*(x), are equal to ½ x⊤ At^{-1} x, while ‖x‖_{ft*}² is x⊤ At^{-1} x. [sent-170, score-0.508]
65 Denote by χt = xt⊤ A_{t-1}^{-1} xt and mt = yt xt⊤ A_{t-1}^{-1} θt. [sent-171, score-0.94]
66 The prediction of Algorithm 1 specialized to this case is yt wt⊤xt = r mt / (r + χt); on the other hand, AROW predicts with mt. [sent-175, score-0.882]
67 The sign of the predictions is the same, but here the aggressive version updates when r mt / (r + χt) ≤ 1, while AROW updates if mt ≤ 1. [sent-176, score-0.493]
68 To derive the bound, observe that using the Woodbury matrix identity we have
ft*(θt) − f_{t-1}*(θt) = −( xt⊤ A_{t-1}^{-1} θt )² / ( 2 (r + xt⊤ A_{t-1}^{-1} xt) ) = −mt² / ( 2 (r + χt) ). [sent-179, score-0.23]
69 Using the second bound in Corollary 2, and setting ηt = 1, we have
M + U ≤ L + √( u⊤ AT u · Σ_{t∈M∪U} ( xt⊤ At^{-1} xt + 2 yt wt⊤xt − mt²/(r + χt) ) )
  ≤ L + √( ( ‖u‖² + (1/r) Σ_{t∈M∪U} (u⊤xt)² ) ( r log(det(AT)) + Σ_{t∈M∪U} mt (2r − mt) / (r + χt) ) )
  = L + √( ( r ‖u‖² + Σ_{t∈M∪U} (u⊤xt)² ) ( log(det(AT)) + Σ_{t∈M∪U} mt (2r − mt) / ( r (r + χt) ) ) ). [sent-180, score-2.059]
70 This bound recovers SOP's bound in the conservative case, and slightly improves that of AROW in the aggressive case. [sent-181, score-0.222]
71 Here we prove a mistake bound for the diagonal version of AROW, using Corollary 2. [sent-187, score-0.251]
72 We denote Dt = diag{At}, where At is defined as in SOP and AROW, and ft(x) = ½ x⊤ Dt x. [sent-188, score-0.483]
73 The presence of a mistake bound allows us to theoretically analyze the cases where this algorithm could be advantageous with respect to a simple Perceptron. [sent-190, score-0.217]
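The diagonal second-order update just described can be sketched as follows. This is a minimal sketch: it maintains only diag(At), predicts with wt = Dt^{-1} θt elementwise, and updates on every round with positive hinge loss; the data, r, and fixed ηt = 1 are illustrative assumptions, not the exact recipe from the paper.

```python
# Diagonal second-order sketch: D_t = diag(A_t) with A_t = A_{t-1} + x_t x_t^T / r.

def diag_second_order(stream, r=1.0):
    d = len(stream[0][0])
    theta, D = [0.0] * d, [1.0] * d
    mistakes = 0
    for x, y in stream:
        w = [th / Di for th, Di in zip(theta, D)]      # w_t = D_t^{-1} theta_t
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if margin >= 0 else -1) != y:
            mistakes += 1
        if y * margin < 1:                             # hinge loss l_t(w_t) > 0
            theta = [th + y * xi for th, xi in zip(theta, x)]
            D = [Di + xi * xi / r for Di, xi in zip(D, x)]  # diagonal of A_t update
    return mistakes

stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1)] * 8
print(diag_second_order(stream))
```

Frequently seen coordinates accumulate large diagonal entries, so their effective learning rate shrinks, which is the qualitative behavior described for the second-order algorithms above.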
74 We set ft(x) = ½ x⊤ At x, where At = A_{t-1} + xt xt⊤ / rt, and A0 = I. [sent-206, score-0.483]
75 This is similar to the regularization used in AROW and SOP, but here we have rt > 0 changing over time. [sent-207, score-0.143]
76 Again, denote χt = xt⊤ A_{t-1}^{-1} xt, and set ηt = 1. [sent-208, score-0.422]
77 With these choices, we obtain the bound
M + U ≤ Σ_{t∈M∪U} ℓt(u) + λ ‖u‖² / 2 + Σ_{t∈M∪U} ( λ (u⊤xt)² / (2 rt) + ( χt rt + mt (2rt − mt) ) / ( 2λ (rt + χt) ) ),
which holds for any λ > 0 and any choice of rt > 0. [sent-209, score-0.824]
78 We would like to choose rt at each step to minimize the bound, in particular to obtain a small value of the sum λ (u⊤xt)² / rt + χt rt / ( λ (rt + χt) ). [sent-210, score-0.256]
79 Although we do not know the values of (u⊤xt)² and λ, we can still obtain a good trade-off by setting rt = χt / (b χt − 1) when b χt > 1, and rt = +∞ otherwise. [sent-211, score-0.467]
80 With this choice we have that χt rt / (rt + χt) = 1/b, and (u⊤xt)² / rt = (u⊤xt)² (b χt − 1) / χt ≤ b (u⊤xt)², when b χt > 1. [sent-213, score-0.934]
81 Tuning λ, we have
M + U ≤ Σ_{t∈M∪U} ℓt(u) + ‖u‖ R √( ( 1 + log(det(AT)) / (b R²) ) ( Σ_{t∈M∪U} min(1, b χt) + Σ_{t∈M∪U} b mt (2rt − mt) / (rt + χt) ) ).
This algorithm interpolates between a second-order algorithm with adaptive second-order information, like AROW, and one with fixed second-order information. [sent-215, score-0.395]
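The NAROW trade-off rule for rt described above can be sketched as a small helper. This is reconstructed from the garbled text and should be treated as a sketch: rt = χt / (b·χt − 1) when b·χt > 1, else rt = +∞ (no adaptation on that round), which caps the per-round term χt·rt/(rt + χt) at min(1/b, χt).

```python
# NAROW-style r_t rule (reconstructed; treat as a sketch).

def narow_rt(chi, b):
    if b * chi > 1.0:
        return chi / (b * chi - 1.0)
    return float("inf")

def capped_term(chi, b):
    """chi_t * r_t / (r_t + chi_t) for the chosen r_t; equals min(1/b, chi_t)."""
    r = narow_rt(chi, b)
    if r == float("inf"):
        return chi                    # limit of chi*r/(r+chi) as r -> infinity
    return chi * r / (r + chi)

print(capped_term(2.0, 1.0))          # b*chi > 1: the term is capped at 1/b
print(capped_term(0.5, 1.0))          # b*chi <= 1: the term stays at chi
```

The cap is what makes the algorithm interpolate: rounds with large χt behave like a fixed second-order update, while rounds with small χt keep the fully adaptive behavior.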
82 Middle: cumulative number of mistakes of four algorithms on data with no label noise. [sent-222, score-0.414]
83 Fig. 3 summarizes the cumulative number of mistakes averaged over 100 repetitions, and the third row shows the cumulative number of mistakes where 10% artificial label noise was used. [sent-237, score-0.799]
84 All algorithms make few mistakes when receiving the first half of the data - the easy examples. [sent-240, score-0.328]
85 Then all algorithms start to make more mistakes: PA-I the most, then AdaGrad, closely followed by NAROW, and AROW the least. [sent-241, score-0.31]
86 All algorithms follow a general trend of making mistakes at a linear rate and then ceasing to make mistakes once the data becomes easy and there are many possible classifiers that can predict correctly. [sent-244, score-0.668]
87 Clearly, AROW and NAROW stop making mistakes first, then AdaGrad, and PA-I last. [sent-245, score-0.307]
88 In both cases, none of the algorithms makes many mistakes in the beginning; then, at some point close to the middle of the input sequence, they start making many mistakes for a while, and then they converge. [sent-248, score-0.602]
89 However, NAROW starts to make many mistakes before the other algorithms and takes more "examples" to converge until it stops making mistakes. [sent-250, score-0.325]
90 7 Conclusion We presented a framework for online convex classification, specializing it for particular losses, such as the hinge loss. [sent-254, score-0.212]
91 This general tool allows us to design theoretically motivated online classification algorithms and to prove their relative mistake bounds. [sent-255, score-0.313]
92 We have shown its utility by proving better bounds for known online algorithms and by proposing a new algorithm, called NAROW. [sent-258, score-0.161]
93 Define by ft* the Fenchel dual of ft, and ∆t = ft*(θt+1) − f_{t-1}*(θt). [sent-268, score-0.508]
94 Moreover, we have that
∆t = ft*(θt+1) − ft*(θt) + ft*(θt) − f_{t-1}*(θt) ≤ ft*(θt) − f_{t-1}*(θt) − ηt zt⊤wt + ηt² ‖zt‖_{ft*}² / (2βt),
where we used Theorem 6 in [13]. [sent-270, score-1.769]
95 Moreover, using the Fenchel-Young inequality, we have that
(1/λ) Σ_{t=1}^T ∆t = (1/λ) fT*(θT+1) ≥ u⊤θT+1 − (1/λ) fT(λu) = −Σ_{t=1}^T ηt u⊤zt − (1/λ) fT(λu). [sent-271, score-0.16]
96 Hence, putting everything together, we have
−Σ_{t=1}^T ηt u⊤zt − (1/λ) fT(λu) ≤ (1/λ) Σ_{t=1}^T ∆t ≤ (1/λ) Σ_{t=1}^T ( ft*(θt) − f_{t-1}*(θt) − ηt wt⊤zt + ηt² ‖zt‖_{ft*}² / (2βt) ),
where we used the definition of wt in Algorithm 1. [sent-272, score-1.239]
97 By convexity, ℓ(wt, xt, yt) − ℓ(u, xt, yt) ≤ zt⊤(wt − u), so setting λ = 1 in Lemma 1 we have the stated bound. [sent-274, score-0.904]
98 For the additional statement, using Lemma 12 in [16] and ft(x) ≤ ft+1(x), we have that ft*(x) ≥ f_{t+1}*(x), so B ≤ 0. [sent-275, score-0.483]
99 Using it, we have that ft(x) ≤ ft+1(x) implies ft*(x) ≥ f_{t+1}*(x), so we have that B ≤ 0. [sent-277, score-0.483]
100 Lemma 1, the condition on the loss (2), and the hypothesis on fT give us
Σ_{t=1}^T ηt (1 − ℓt(u)) ≤ −Σ_{t=1}^T ηt u⊤zt ≤ λ fT(u) + (1/λ) ( Σ_{t=1}^T ηt² ‖zt‖_{ft*}² / (2βt) + B − Σ_{t=1}^T ηt zt⊤wt ).
Note that λ is free, so choosing its optimal value we get the second bound. [sent-279, score-1.255]
simIndex simValue paperId paperTitle
same-paper 1 1.0 182 nips-2010-New Adaptive Algorithms for Online Classification
2 0.30141094 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.266491 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
4 0.21562022 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
5 0.14456269 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.14305504 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.12347706 162 nips-2010-Link Discovery using Graph Feature Tracking
8 0.11661258 15 nips-2010-A Theory of Multiclass Boosting
9 0.11069065 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
10 0.095041499 61 nips-2010-Direct Loss Minimization for Structured Prediction
11 0.088630497 150 nips-2010-Learning concept graphs from text with stick-breaking priors
12 0.087201945 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
13 0.083761603 138 nips-2010-Large Margin Multi-Task Metric Learning
14 0.080158696 203 nips-2010-Parametric Bandits: The Generalized Linear Case
15 0.076715514 188 nips-2010-On Herding and the Perceptron Cycling Theorem
16 0.074552596 265 nips-2010-The LASSO risk: asymptotic results and real world examples
17 0.069814786 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
18 0.069108412 212 nips-2010-Predictive State Temporal Difference Learning
19 0.068357013 243 nips-2010-Smoothness, Low Noise and Fast Rates
20 0.068125226 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
simIndex simValue paperId paperTitle
same-paper 1 0.9464311 182 nips-2010-New Adaptive Algorithms for Online Classification
2 0.67715943 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
3 0.65438485 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
4 0.6006853 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
5 0.54964322 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
6 0.46843323 61 nips-2010-Direct Loss Minimization for Structured Prediction
7 0.42175597 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
8 0.40868038 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.39119124 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.36879095 15 nips-2010-A Theory of Multiclass Boosting
11 0.33727401 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
12 0.31338856 226 nips-2010-Repeated Games against Budgeted Adversaries
13 0.30347657 2 nips-2010-A Bayesian Approach to Concept Drift
14 0.29101905 19 nips-2010-A rational decision making framework for inhibitory control
15 0.28852159 212 nips-2010-Predictive State Temporal Difference Learning
16 0.27948985 138 nips-2010-Large Margin Multi-Task Metric Learning
17 0.27924103 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
18 0.27695784 202 nips-2010-Parallelized Stochastic Gradient Descent
19 0.26851308 192 nips-2010-Online Classification with Specificity Constraints
20 0.26832205 203 nips-2010-Parametric Bandits: The Generalized Linear Case
simIndex simValue paperId paperTitle
1 0.73148167 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
same-paper 2 0.72189695 182 nips-2010-New Adaptive Algorithms for Online Classification
3 0.72150749 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
Author: Ulrike V. Luxburg, Agnes Radl, Matthias Hein
Abstract: The commute distance between two vertices in a graph is the expected time it takes a random walk to travel from the first to the second vertex and back. We study the behavior of the commute distance as the size of the underlying graph increases. We prove that the commute distance converges to an expression that does not take into account the structure of the graph at all and that is completely meaningless as a distance function on the graph. Consequently, the use of the raw commute distance for machine learning purposes is strongly discouraged for large graphs and in high dimensions. As an alternative we introduce the amplified commute distance that corrects for the undesired large sample effects. 1
4 0.7048344 172 nips-2010-Multi-Stage Dantzig Selector
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ R^{n×m} (m ≫ n) and a noisy observation vector y ∈ R^n satisfying y = Xβ∗ + ε, where ε is the noise vector following a Gaussian distribution N(0, σ²I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β∗, measured in terms of the lp norm (p ≥ 1), is bounded as ‖β̂ − β∗‖_p ≤ (C(s − N)^{1/p} √(log m) + ∆)σ, where C is a constant, s is the number of nonzero entries in β∗, ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from Cs^{1/p}√(log m)σ to C(s − N)^{1/p}√(log m)σ, where the value N depends on the number of large entries in β∗. When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.
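The multi-stage refinement is the paper's contribution; the standard (single-stage) Dantzig selector it builds on, min ‖b‖₁ subject to ‖Xᵀ(y − Xb)‖∞ ≤ λ, is itself a linear program. A hedged sketch of that baseline using scipy (the splitting b = u − v and the test problem are illustrative, not the paper's method):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Standard Dantzig selector: min ||b||_1  s.t.  ||X^T (y - X b)||_inf <= lam.

    Cast as an LP by splitting b = u - v with u, v >= 0, so that
    ||b||_1 = sum(u) + sum(v) at the optimum.
    """
    n, m = X.shape
    G = X.T @ X                      # m x m Gram matrix
    c0 = X.T @ y
    # Constraint -lam <= c0 - G (u - v) <= lam, written as two one-sided blocks.
    A_ub = np.block([[G, -G],
                     [-G, G]])
    b_ub = np.concatenate([c0 + lam, lam - c0])
    cost = np.ones(2 * m)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * m), method="highs")
    if res.status != 0:
        raise RuntimeError(f"LP failed: {res.message}")
    u, v = res.x[:m], res.x[m:]
    return u - v

# Sanity check on a noiseless sparse problem with a tiny lambda,
# where the solution should essentially recover the true signal.
rng = np.random.default_rng(0)
n, m = 60, 20
X = rng.standard_normal((n, m))
beta = np.zeros(m)
beta[0], beta[3] = 3.0, -2.0
y = X @ beta
b_hat = dantzig_selector(X, y, lam=1e-4)
print(np.round(b_hat, 3))
```

With noise, λ is instead chosen on the order of σ√(log m), matching the threshold in the bound above.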
5 0.69217306 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fat-shattering dimension from statistical learning theory. Relationships among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting.
6 0.57667512 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
7 0.5746578 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.57220966 117 nips-2010-Identifying graph-structured activation patterns in networks
9 0.57106358 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
10 0.57060874 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.5701912 63 nips-2010-Distributed Dual Averaging In Networks
12 0.56953001 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
13 0.56820965 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
14 0.56804711 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.56685001 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
16 0.56666034 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
17 0.56568265 158 nips-2010-Learning via Gaussian Herding
18 0.565382 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
19 0.56516284 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
20 0.56413662 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone